DIT Resources for Data Management Plans - IT Service Desk

Table of contents

Purpose
DMP Tool
Data Management Plan sections
DIT and UMD libraries tools and services
- Persistent identifiers (Provided by UMD libraries)
- Cybersecurity practices and tools
- Cloud infrastructure
Data Management Plan elements
- NSF
- NIH
Additional links

Purpose

NSF and NIH require a Data Management Plan (DMP) with funding submissions. Other funders or data providers may also require a DMP as part of the proposal process. DIT and the UMD Libraries developed this article together to help researchers write Data Management Plans for research proposals. It is intended to be used alongside the other resources available on the Data Management Planning webpage.

If you would like to suggest any changes or additions to this article, please email it-research-consult@umd.edu.

Regular DMP workshops are available through the Libraries Research Commons, and additional workshops are available upon request.

DMP Tool

The DMP Tool is a convenient way to draft your Data Management Plan, and provides digital templates that help you organize and write your plan.

Data Management Plan sections

The elements of the Data Management Plan will vary by sponsor and may have additional requirements depending on the directorate or sub-group of the sponsor. However, there are some key areas that the plans tend to have in common:

Data, Access, Security, and Reproducibility of data.
Standards related to data formats and metadata.
Reuse and Redistribution of data.
Preservation or Archiving.

DIT and UMD libraries tools and services

The following services are available to campus and are organized by their primary function and features. The example statements are not intended to be boilerplate language; they are a guide to how you might describe these tools in your plan. Your plan should reflect the unique characteristics of your research project, your processes, and any unique technology infrastructure you might use.

Data collection and collaboration

UMD Survey Tool Qualtrics

Data will be collected using Qualtrics, a survey and data collection software managed by the UMD Division of IT. Responses will be transmitted over HTTPS and stored in the Qualtrics platform servers for analysis by the research project team. Access to the data requires authentication with campus Directory ID, password, and multi-factor authentication. Access will be limited to research project team members using the built-in Qualtrics access control lists. Qualtrics provides data analysis features and data export capabilities for analysis in additional systems.

UMD Box

UMD Box will be used to store data and documents, and to allow multiple research team members to work concurrently on the same file. UMD Box is a cloud-based storage platform. All data stored in Box is encrypted in transit and at rest, files are monitored with threat detection software, and all data is stored in the US. Access to the data will require authentication with a campus Directory ID, password, and multi-factor authentication, and access is controlled using built-in Box roles that are assigned by the file/folder owner.

UMD Google Drive

UMD Google Drive will be used to store data and documents, and to allow multiple research team members to work concurrently on the same file. UMD Google Drive is a cloud-based storage platform. All data stored in Drive is encrypted in transit and at rest, and all data is stored in the US. Access to the data will require authentication with a campus Directory ID, password, and multi-factor authentication, and access is controlled using built-in Drive roles that are assigned by the file/folder owner.

Data security - Confidentiality

Secure Share

In lieu of email, Secure Share, the campus tool for sending and receiving confidential files, will be used to receive or send sensitive data. Secure Share requires campus authentication to access, and encrypts and stores data temporarily until it can be accessed by the recipient, or until the specified access time period expires. If sharing with outside collaborators, they will use guest accounts in the system to access files.

CUI Environment

All data will be stored and analyzed within the CUI Environment (CUIE), a secure, on-premises campus computing and data enclave that adheres to all 110 NIST SP 800-171 controls. The system is logically and physically separated from other computing environments, outbound network access is blocked by default, and all data uploaded to the CUIE is encrypted using end-user specific private keys. Once inside the system, access to the data is limited to collaborators who have been specifically authorized by the data owner. Data is analyzed on virtual machines (VMs) and encrypted storage within the enclave. Connection to the system uses a specially designed application and is secured via a TLS tunnel from the end-user machine to the VM. The tunnel uses public keys and temporary private keys to create a secure channel key that cannot be snooped by any other party, including admins.
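The statement above mentions public keys and temporary private keys being combined into a channel key. As a rough illustration of that general idea (ephemeral key agreement), the following is a minimal Python sketch of a toy Diffie-Hellman exchange. It is not the CUI Environment's actual protocol, which this article does not specify. The prime `P`, the generator `G`, and the function names are assumptions chosen for readability, and the parameters are far too small for real use.

```python
import hashlib
import secrets

# Toy parameters (assumptions for illustration only, NOT secure):
# a 127-bit Mersenne prime and a small generator.
P = 2**127 - 1
G = 3


def ephemeral_keypair() -> tuple[int, int]:
    """Create a temporary private key and the public key derived from it."""
    private_key = secrets.randbelow(P - 3) + 2  # random value in [2, P - 2]
    public_key = pow(G, private_key, P)
    return private_key, public_key


def channel_key(own_private: int, peer_public: int) -> bytes:
    """Derive a 32-byte channel key from our private key and the peer's public key."""
    shared_secret = pow(peer_public, own_private, P)
    # shared_secret is below 2**127, so it always fits in 16 bytes.
    return hashlib.sha256(shared_secret.to_bytes(16, "big")).digest()


# Each side generates its own temporary key pair and sends only the public half.
client_private, client_public = ephemeral_keypair()
vm_private, vm_public = ephemeral_keypair()

# Both sides arrive at the same key without the key itself ever being transmitted.
client_key = channel_key(client_private, vm_public)
vm_key = channel_key(vm_private, client_public)
```

Because only the public values cross the network and the private values are discarded after the session, an observer of the traffic does not see the channel key. Production systems rely on vetted TLS libraries rather than hand-written exchanges like this one.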

UMD Box

All data related to the project will be stored in UMD Box, a cloud-based storage solution that requires campus Directory ID, password, and multi-factor authentication to access. All data in Box is encrypted, and all data is stored within the US. Additional capabilities that are in place include automatic detection of viruses, alerts to admins based on atypical end-user behavior, and policy based controls. Access is limited using built in Box roles, and access can be shared with external end-users who have Box accounts. Box will also be used to request files, in lieu of transfers over email.

Data security - Backups

Code42

Any workstations used to store and/or analyze data will be backed up using the campus workstation backup service, Code42. Backups and versioning are iterative and are created every 15 minutes. Data is encrypted during transit and at rest in Code42’s cloud. The service is approved for data categorized as high risk. Code42 costs are included in the budget of this proposal.

Spectrum Protect

This project will include the use of dedicated computing servers located in our data center. The servers used to store and analyze data in this project will be backed up on a regular basis using the campus data backup tool IBM Spectrum Protect. The backup process will encrypt data in transit and at rest using AES 256-bit encryption. The costs for this service are included in the budget of this proposal.

Data stored in commercial cloud (AWS, Google Cloud, Azure)

Data stored in commercial cloud offerings will be backed up appropriately using service-adjacent backup offerings configurable per service (for example, database replication for database services). When backups are not a built-in feature of the service, Division of IT will be consulted to implement appropriate backup mechanisms to protect against data deletion, corruption, and malicious intent.

Data analysis

Software available through TERPware

Data analysis will be completed using Stata/MP, purchased through the campus licensing agreement and installed on research team members' computing devices.

Data analysis - Virtual Desktop Computing

UMD Virtual Workspace

This project will use the campus-provided UMD Virtual Workspace to perform certain aspects of data analysis. The computers follow enterprise security protocols: they are patched regularly, access is controlled using campus authentication, anti-virus software is installed, and security logs are reviewed by the IT Security team. Data will not be permanently stored in the UMD Virtual Workspace; its storage is considered to be ephemeral. Instead, data and reports will be copied and stored in UMD Box.

Data Analysis - High Performance Computing

Zaratan

Computational analysis will be performed using campus high performance computing resources. Data will be transferred to the system temporarily for analysis before being moved back to the project data storage solution.

Technical specifications: Zaratan is UMD's flagship cluster, intended for large, parallel jobs. It is housed off campus and maintained by the Division of Information Technology. It consists of over 380 nodes with dual-socket AMD Milan processors (128 cores per node). Twenty nodes also each contain four NVIDIA A100 GPUs. All nodes have at least 512 GB of RAM, with six large-memory nodes having 2 TB of RAM. All nodes have HDR-100 InfiniBand (100 Gb/s) interconnects, and there is 2 PB of fast BeeGFS scratch storage.
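If a proposal needs aggregate figures rather than per-node figures, they can be derived from the numbers quoted above. The following Python sketch shows one way to do the arithmetic. It assumes exactly 380 nodes (the text says "over 380"), assumes the six large-memory nodes are counted among those 380, and takes 1 TB as 1024 GB, so the results are lower-bound estimates rather than official specifications.

```python
# Per-node figures quoted in the technical specifications above.
NODES = 380                # "over 380 nodes", treated here as exactly 380
CORES_PER_NODE = 128
GPU_NODES = 20
GPUS_PER_GPU_NODE = 4
LARGE_MEMORY_NODES = 6     # assumed to be part of the 380 nodes
STANDARD_RAM_GB = 512      # "at least 512 GB" per node
LARGE_RAM_GB = 2 * 1024    # 2 TB, assuming 1 TB = 1024 GB


def cluster_totals() -> dict[str, int]:
    """Return lower-bound totals for CPU cores, GPUs, and RAM (in GB)."""
    standard_nodes = NODES - LARGE_MEMORY_NODES
    return {
        "cpu_cores": NODES * CORES_PER_NODE,
        "gpus": GPU_NODES * GPUS_PER_GPU_NODE,
        "ram_gb": standard_nodes * STANDARD_RAM_GB
        + LARGE_MEMORY_NODES * LARGE_RAM_GB,
    }


totals = cluster_totals()
print(totals)  # {'cpu_cores': 48640, 'gpus': 80, 'ram_gb': 203776}
```

Under these assumptions the cluster offers at least 48,640 CPU cores, 80 GPUs, and roughly 199 TB of RAM in total.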

Project management, open data, sharing and preprints

Open Science Framework (OSF) (Provided by UMD Libraries and VPR)

This project will use the campus open source cloud-based project management platform (The Open Science Framework or OSF) to manage project resources and share data. The project space will be connected to campus storage (UMD Box) to store data, and used to control project and resource access. OSF will be used to provide version control, persistent URLs, and DOI registration. Research outputs will be shared using OSF as open access articles and preprints. The project team will ensure research reproducibility in part by accompanying data with relevant software repositories for use in analysis.

Data repositories

DRUM (Provided by UMD Libraries)

If used for data sharing and access:

Research products from this project will be archived at the Digital Repository at the University of Maryland (DRUM) unless a more appropriate facility can be identified. DRUM is a long-term, open-access repository managed and maintained by the University of Maryland Libraries. Researchers and the general public can download data and code files, associated metadata and documentation, and any guidelines for re-use. All records in DRUM are assigned a persistent DOI to support consistent discovery and citation. The project description will be automatically indexed in Google and Google Scholar to support global discovery. Whenever possible, digital curation specialists in the University Libraries work with researchers to document and format materials for long-term access.

If used for long-term preservation:

The research products archived in DRUM will be available indefinitely. The University of Maryland Libraries’ DRUM repository is built on DSpace software, a widely used, reliable digital repository platform. DRUM performs nightly bit-level integrity tests on all files, and all contents are regularly copied to back-up storage. DRUM conforms to the digital preservation principles outlined in the University of Maryland Libraries’ Digital Preservation Policy.
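Researchers who want a comparable check on their own working copies before deposit can record checksums and re-verify them later. The following Python sketch illustrates the general idea of a bit-level integrity (fixity) check using SHA-256. It is not DRUM's implementation; the function names and the manifest layout are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large data files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file under root (by relative path) to its SHA-256 checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return the relative paths that are missing or whose contents changed."""
    problems = []
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            problems.append(rel_path)
    return problems
```

A typical use would be to call `build_manifest` once when a dataset is finalized, store the result alongside the data, and run `verify_manifest` after each transfer or on a schedule. An empty list means every recorded file is still bit-for-bit identical.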

See also Data Repositories - Research Data Services | UMD Libraries.

Persistent identifiers (Provided by UMD libraries)

The UMD Libraries can work with researchers to obtain a DOI (Digital Object Identifier). This project will use ORCiD to create a persistent identifier that will be used to share research outputs across platforms.

Cybersecurity practices and tools

This project will follow appropriate security controls to protect the confidentiality of sensitive and restricted data by utilizing systems that conform to university policies and IT standards. These systems include UMD Box and the UMD Virtual Workspace. All research team members have completed annual campus cybersecurity training and insider threat training. Researcher computing devices will be managed and patched by the university, protected with FireEye endpoint protection, and backed up daily using Code42. Research lab servers are joined to and managed through the campus Active Directory, protected by FireEye endpoint protection, backed up using Spectrum Protect, and encrypted using FIPS 140-2 compliant cryptographic modules. Logs from the computing devices and servers are forwarded to the campus SIEM for event analysis, response, and investigation.

Cloud infrastructure

Commercial cloud options provide scalability, flexibility, novel computing solutions, and access to quantum resources. Cloud resources can be configured to meet various elements of a DMP, such as analysis, storage, reproducibility, or data archiving. In most cases, cloud is most useful for novel approaches to analysis and collaboration.

Google Cloud Platform

Data storage, pipeline, and analysis cyberinfrastructure will be built on the Google Cloud Platform. Google Cloud Storage, BigQuery, and AutoML will be used to store data, perform SQL queries, and create and train machine learning models. BigQuery will also be used to share data set segments with collaborators for additional analysis. Using Google Cloud Platform will allow efficient use of computing resources and remove unnecessary costs and overhead related to physical hardware procurement, setup, and management.

Amazon Web Services

AWS EC2 instances with attached graphics processing units (GPUs) will be used to provide short-term access to the powerful computing resources necessary for training machine learning models. Data will be stored in AWS S3 cloud storage buckets. Using cloud-based resources will assist with developing accurate estimates for the physical hardware resources necessary to operationalize machine learning models in future phases of work.

Microsoft Azure

A Microsoft Azure subscription will be used in this project to submit jobs written in Q# to run on quantum simulators and ultimately on quantum computers.

Data Management Plan elements

Data Management Plan sections or elements will differ depending on sponsor, directorate, and discipline, so be sure to review the official guidance for your specific use case. Links to full descriptions from NSF and NIH may be found in the Additional links section at the end of this article. Below are high-level descriptions of DMP sections and relevant technology considerations.

NSF

Products of research

Relevant services: TERPware, UMD Box, Google Drive, OSF Institutions.

What file formats will your data use?
What software will be used to organize, manage and analyze data?
What technology output might result from the project? Software, scripts, images?

Data format standards

Relevant services: OSF Institutions.

How will data be stored such that access is stable and formats are non-proprietary?
How will metadata files be generated for the data sets that are part of the project? Where and how will they be stored?
What technology will be required for reproducibility?
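One lightweight answer to the metadata question above is a machine-readable "sidecar" file stored next to each data file. The following Python sketch writes such a file as JSON. The field names, the `.metadata.json` suffix, and the function names are illustrative assumptions, not a standard required by NSF or provided by any campus service; a real project would normally follow the metadata schema used in its discipline.

```python
import json
import mimetypes
from datetime import datetime, timezone
from pathlib import Path


def describe_file(path: Path, title: str, creator: str) -> dict:
    """Collect basic descriptive and technical metadata for one data file."""
    stat = path.stat()
    media_type, _ = mimetypes.guess_type(path.name)
    modified = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
    return {
        "title": title,
        "creator": creator,
        "file_name": path.name,
        # Fall back to a generic type when the extension is not recognized.
        "media_type": media_type or "application/octet-stream",
        "size_bytes": stat.st_size,
        "modified_utc": modified.isoformat(),
    }


def write_sidecar(path: Path, title: str, creator: str) -> Path:
    """Write the metadata as JSON next to the data file and return its path."""
    sidecar = path.with_name(path.name + ".metadata.json")
    record = describe_file(path, title, creator)
    sidecar.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return sidecar
```

Keeping the sidecar in the same folder as the data means it travels with the file when the folder is copied to project storage or deposited in a repository.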

Access and sharing

Relevant services: UMD Box, CUI Environment, Secure Share, UMD Virtual Workspace, Network Storage, UMD Google Drive, OSF Institutions.

How will privacy and confidentiality be protected as they pertain to digital copies of data?
What technical security controls will be used? ID and Password, Multi-Factor Authentication, Least Privilege, Encryption, Backups?
How will access be logged and monitored?
Who will be responsible for administering technical security controls?

Policies and provisions (re-use and redistribution policies)

Relevant services: OSF Institutions, Persistent Identifiers, UMD Box, Network Storage.

How and when will data be made available to people outside the research team?
Will access to the data need to be controlled or restricted? What technology will enable this process?
What steps will need to be completed to access the data?

Archive of data

Relevant services: DRUM, Persistent Identifiers.

Will data be submitted to the Digital Repository at the University of Maryland (DRUM) or to an alternative data repository? Where will non-published data be stored?
Have you established a persistent identifier for the research output?
How will the reproducibility of results be made possible?

NIH

Data type

Relevant services: UMD Box, Secure Share, CUI Environment, UMD Virtual Workspace, Network Storage.

What file formats will be used, and how have decisions been made about handling, sharing, and preserving data?
What is the modality or level of aggregation of the data, and how will it be processed? What metadata will be associated with files?
What safeguards will be used to protect the privacy and confidentiality of sensitive data?

Related tools, software, code

Relevant services: TERPware, OSF Institutions.

Are any specialized tools or software needed to access or manipulate shared data?
How will technology be used to ensure reproducibility? (For example, using containerization to easily share data analysis environments).
Will tools and technology be open source and freely available, or require special permission/access?
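Alongside containerization, a simple way to support reproducibility is to record the exact software environment used for an analysis and share that record with the data. The following Python sketch captures the interpreter version and the installed package versions using only the standard library. The function name and the output layout are assumptions for illustration; it complements, rather than replaces, a container image or lock file.

```python
import json
import platform
from importlib import metadata


def capture_environment() -> dict:
    """Record the Python version, platform, and installed package versions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata lacks a name
            packages[name] = dist.version
    return {
        "python_version": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items(), key=lambda kv: kv[0].lower())),
    }


# Serialize the snapshot so it can be stored with the project outputs.
environment_json = json.dumps(capture_environment(), indent=2)
```

Saving `environment_json` with each set of results lets collaborators see which software versions produced them, even if they later rebuild the environment by other means.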

Standards

Relevant services: UMD Box, OSF Institutions.

What standards will be applied to the scientific data and associated metadata?
How will metadata be generated and associated with data?
Will the project use common data elements?

Data preservation, access, and timeline

Relevant services: DRUM, Persistent Identifiers, OSF Institutions, UMD Box, Secure Share, Network Storage, Cloud Services.

Where will scientific data be archived for long term preservation?
How will data be findable, and will data access be unrestricted?
If data access will be restricted, what technology and process will be used?

Access, distribution, and reuse

Relevant services: Persistent Identifiers, OSF Institutions, UMD Box.

What limitations on sharing or reuse might apply to the project data?
Are planned sharing limitations in line with community expectations?

Oversight

Relevant services: OSF Institutions, UMD Box, DRUM.

Who is responsible for executing the various plan elements?
What expertise will be required?

Additional links