An Overview and Practical Use
February 9, 2022
Dir, Research Technology Operations, Research Computing Services
Research Data Program Manager, Baker Knowledge and Library Services
Assistant Director, DRFD Research Administration
Melissa Velez, PhD, Dir, Data Services and Project Operations, Research Computing Services
Rachel Wise, MLIS, HBS Archivist, Baker Knowledge and Library Services
Agenda
- Brief introductions
- Why research data management
- Introduction to our narrative
- RDM lifecycle:
  - Planning
  - Data Acquisition
  - Storage & Analysis
  - Data Sharing & Archiving
- Closing remarks
Goals
Become familiar with the RDM lifecycle
Understand the details and requirements at each stage
Know what resources and services are available and recommended at HBS, Harvard, and beyond
Engage in utilizing best practices throughout your work at Harvard and in your future careers
Intro: Discussion Questions
Who can describe to me what Research Data Management is all about?
So what? Why should I care? Why should you care?
What’s in it for the faculty member? For the University?
Who might be partners in this endeavor?
Is anyone legally obligated?
What phases are part of the RDM lifecycle?
Data come in many ways…
“The active and ongoing management of data __through__ its lifecycle of interest and usefulness to scholarship, science, and education.”
— The University of Illinois’ Graduate School of Library and Information Science
- Benefits for yourself, researchers, and science:
- Your future self will thank you!
- Facilitate and ensure seamless team transitions
- Conduct analysis effectively when collaborating with others
- Check and verify research results
- Support FAIR principles
- Findable, Accessible, Interoperable, & Reusable
Compliance, to minimize risks:
Be compliant with University and School policies
Be __compliant with research funding organizations__ that require a data management plan and data accessibility
Be __compliant with journals__ that require you to submit the data accompanying your article
Responsibility as a proxy for faculty researchers
Dual role of minding administrative responsibilities and conducting the faculty research program
Provides a reference framework for how to conduct research in line with best practices
Practices aligned with career- and established research professionals (such as RCS personnel)
You may have to exercise some, many, or all of the following recommendations…
Harvard Research Data Lifecycle Preview
[Figure: Harvard research data lifecycle. Phases: Planning, Active, Dissemination & Preservation. Stages: Plan & Design; Collect & Create; Analyze & Collaborate; Evaluate & Archive; Share & Disseminate; Access & Reuse; with Store & Manage at the center.]
Data can be cyclical in use
Storage & management, including security, is central to all points in the cycle
Source: Harvard Research Support Website prototype
Current RDM Support website, with refresh in Fall 2021*
Source: https://researchdatamanagement.harvard.edu/
*Coming soon: new Harvard Research Support website
RDM is a Collaborative Effort
Your part in this is essential!
But you have partners that will help you during each step!
You have just come on board with a research group. Its lead, Professor Smith, would like your help expanding on a previous research study she worked on five years ago with a different research associate. In this study, Professor Smith surveyed CEOs of fast-casual restaurants (e.g., Chipotle; Shake Shack) to learn about their behaviors. The data and code for this study were saved on the department’s research computing environment, and you have been granted access to the files.
Recently, Professor Smith received a grant from the federal government to build upon this research by exploring what CEO-reported behaviors from five years ago are related to current company financial records. To explore this, she will ask you to obtain company financial data compiled by a firm called FinanceCorp. These data include confidential company financial data, and are therefore considered extremely sensitive. The FinanceCorp data will be merged with the data Professor Smith collected five years ago from company CEOs.
In addition to you, Professor Smith’s team also includes another professor from the University of Southern California. As a result, all of the files you receive and create for the project will need to be easily accessible to individuals outside of Harvard.
Finally, note that as a condition of receiving the grant to explore this topic area, Professor Smith has agreed to de-identify the data and make it available for public use at the conclusion of the study.
How can I best manage my data throughout the lifecycle of my research to save time and money in the future?
Goals:
Learn about Planning for Data Management
Utilize DMP Checklist for work
We talk about a Data Management Checklist. What is this in lieu of?
What is the purpose of this checklist?
What research artifacts is it targeted towards?
Who should be using this? And for how long?
- The checklist will help you define:
- How the data will be created
- How it will be documented
- Who will be able to access it
- Where it will be stored
- Who will back it up, and how & when
- Whether and how it will be shared & preserved
- Planning is not simply naming files and folders so that only your research team understands their content. Instead…
- Putting standards and guidelines into action
- Documenting detailed metadata for better data discovery and illumination
- Ensuring the value and accessibility of your research long after your project is complete.
- The checklist can inform a Data Management Plan (DMP), which is often required by funding agencies or philanthropic funders
- E.g. NSF, NASA, Gates Foundation, Sloan Foundation
Any data, code, or documentation used throughout the research lifecycle:
Quantitative and qualitative data
Primary and secondary data
Notebooks
Codebooks
Records and notes
Code or software used to run analysis
Workflows or pipelines
Metadata or documentation describing the data (’data dictionaries’)
Record and retain sufficient information to enable others to understand and reproduce your work (aka winning the lottery scenario)
- Talk to your department’s library and research computing staff early in the planning process
- Use the Data Management Checklist to plan for your research
- NB! Some funders may require a Data Management Plan
- Manage all notes, code, data, etc. to enable others to understand and reproduce your work
- References:
- Whyte, A., Tedds, J. (2011). ‘Making the Case for Research Data Management’. DCC Briefing Papers. Edinburgh: Digital Curation Centre
- Briney, K. (2015). Data Management for Researchers: Organize, maintain and share your data for research success. Pelagic Publishing Ltd.
- 'Everyone Needs a Data Management Plan', Nature 555, 286 (2018); doi: 10.1038/d41586-018-03065-z
- http://sites.nationalacademies.org/sites/reproducibility-in-science/index.htm
How can I acquire data in an efficient, ethical, and secure way, and how can I ensure that my data is used appropriately?
Goals:
Know what services are available
Understand DUAs, NDAs, and IRB
Plan for data security at all stages
Name a few partners and their role in data acquisition.
Why use data templates?
What tools might one use for data collection? Why?
Tell me about Data Security…
Give me an example of L3 and L4 data
Highlight some differences between L3 and L4 data
How are a DUA, an NDA, an IRB submission, and a Data Safety plan related? Or not?
Data generated by investigator:
Data acquired from others:
Does the data you need already exist? Do you know how & where to find it?
Is it already licensed by Harvard, or does it need to be acquired? Are appropriate funds available if needed?
Does it require a Data Use Agreement (DUA) or IRB submission?
Experiment | A scientific procedure undertaken to make a discovery, test a hypothesis, or demonstrate a known fact |
---|---|
Observation | The action or process of observing something or someone carefully or in order to gain information |
Simulations | The production of a computer model of something, especially for the purpose of study |
Derived / compiled | Base data on a logical extension, modification, or collection of items |
HBS Services can help faculty and their teams acquire data:
For persons from other schools, please contact your local library's data service professionals, or see https://hlrdm.library.harvard.edu/network.
Baker Library Subscriptions | Wide range of data available |
---|---|
Baker Research Services | Custom discovery and delivery of data |
Baker Faculty Data Licensing Service | Negotiation of licenses/DUAs with vendors (for faculty acquisition or purchase) |
Behavioral Research Services | Supports the data collection needs of HBS faculty and doctoral students conducting a broad range of experimental and behavioral research |
DRFD Research Administration | Supports DUAs and IRBs |
Research Computing Services (RCS) | Data collection via web scraping; wrangling via cleaning, matching, merging, etc. |
Consider using tools, templates, & data dictionaries when collecting data
Increases accuracy & efficiency
Promotes collection & preservation of metadata (source, year, …)
Promotes consistency & reliability (where, how, what, …)
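One way to put a data dictionary to work during collection is to check each incoming record against it. The sketch below is a minimal illustration only, not an HBS tool: the field names (`company`, `survey_year`, `source`) and the `validate_record` helper are hypothetical.

```python
# Hypothetical data dictionary: field name -> expected type and description.
DATA_DICTIONARY = {
    "company": {"type": str, "description": "Company name as reported"},
    "survey_year": {"type": int, "description": "Year the survey was fielded"},
    "source": {"type": str, "description": "Where the record came from"},
}


def validate_record(record, dictionary=DATA_DICTIONARY):
    """Return a list of problems found in one record (an empty list means valid)."""
    problems = []
    # Every documented field must be present and have the documented type.
    for field, spec in dictionary.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], spec["type"]):
            problems.append(
                f"wrong type for {field}: expected {spec['type'].__name__}"
            )
    # Every field in the record must be documented in the dictionary.
    for field in record:
        if field not in dictionary:
            problems.append(f"undocumented field: {field}")
    return problems


if __name__ == "__main__":
    good = {"company": "Example Co", "survey_year": 2017, "source": "survey"}
    bad = {"company": "Example Co", "survey_year": "2017", "notes": "n/a"}
    print(validate_record(good))
    print(validate_record(bad))
```

Running a check like this at collection time catches missing metadata and undocumented columns before they reach the analysis stage.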
- For collecting data, use electronic notebooks
- OneNote/O365
- Documents in HBS SharePoint/O365
- Evernote ($$)
- FileMaker Pro ($$)
- Open Science Framework (OSF)
- RSpace, as a possible Harvard-wide tool
- For surveys:
- HBS Qualtrics (Data at <= L3)
- HMS Redcap (Data at <= L4)
Again, keep appropriate data security in mind with external or 'synchronized' services
More at http://bit.ly/2RCosb4
- The need for data security touches upon all steps of the data lifecycle!
- A thorough understanding of the data, metadata, and its custodianship will drive the RDM narrative
- All persons should understand and comply with the HBS IT and HU data security requirements
- Your understanding of the data and the data security requirements will inform:
  - Your options for acquiring / transferring data
  - Your options for storing the data
  - Your options for analyzing the data
- Example: PII / Human Subjects data can be stored on L4 research storage as part of the HBS RC environment, but not on Windows & Mac desktops & laptops
- Data security is important to consider both on- and off-campus:
  - Email at home?
  - Using your mobile phone or tablet?
  - What about while traveling?
  - Even more important in our remote-work/pandemic status
- Security is more than where you store it – it's how you approach the care, handling, and movement of data
- This will vary depending on the data sensitivity level
- May be determined by Data Safety plan.
- See IT Security handout for appropriate considerations
- And these other helpful websites:
Data Security via Data Safety Portal
- Submit data security plans at the Harvard Data Safety Portal.
- A Data Safety plan will be required for all DUAs & IRB submissions deemed to include sensitive data*
- Helps faculty research groups plan and execute good RDM practices, including:
  - What resources should be used
  - What persons should be involved in the data acquisition, analysis, and sharing
  - What restrictions may apply based on the data content, stewardship, or geographic location/source of the data
- Will dictate compliance with Harvard L2, L3, or L4 data security protocols
- You might be involved in helping to prepare the Data Safety plan.
- Your local IT Security, RC Center, Research Administration, or Library Data group can:
  - Help you create a data security plan compliant with data and university requirements
  - Discuss what may be the best options for short- and long-term projects.
- Tip! Use the Data Safety Plan User Guide for examples & guidance
*This will be covered in just a few slides
Data Protection Regulations & Policies
- There are a number of regulations and policies already in play:
- HIPAA (18+ identifiers – alone or in combination datasets) Informed Consent
- FERPA (education information and special protections)
- MA data protection law (security requirements to handle private data from state residents)
- Stem Cell data and Genomics data must be published in an approved repository, but must also be de-identified.
- GDPR (General Data Protection Regulation in Europe)
- PIPL for data coming from China
- California Data Privacy (CCPA + CPRA)
- Harvard Data retention (7 years)
- This is a rapidly-changing landscape!
- China's policy effective November 2021
- GDPR regulations have changed 2x in the past several years
- California's laws have been changed/amended 2x, expanding the scope
- DRFD Research Administration & OVPR are here to help
See PDF at https://security.harvard.edu/handout-research-data-security-levels-examples
These are legally binding documents that should be signed only by authorized representatives of the School or University
- DUAs are almost always required when there is a transfer of data
  - July 15, 2021: Harvard Research Data Security Policy went live (HU OVPR)
  - Balances risks & challenges & considers regulatory and contractual constraints
  - Some exceptions are permitted; consult your Research Admin office if unsure
- Several groups at HBS can help
  - Assist with the process of DUA preparation, review, & signing:
  - Done in coordination with Harvard's Office of Sponsored Programs
  - No-cost DUAs: Alain Bonacossa (DRFD) or Katherine McNeill (Baker)
  - Else, contact the Data Licensing Service: Katherine McNeill (Baker)
- These govern access to and treatment of data:
  - May be required by a data provider with Harvard for use in your (local or school-level) research, or
  - Provided by Harvard to an outside organization for use in its research.
- Can be referred to as:
  - License agreement
  - Confidentiality agreement
  - Non-disclosure agreement
  - Memorandum of Understanding
  - Memorandum of Agreement
- …but these are all distinct and separate types of agreements with different purposes
-
__IRB approval is required when conducting human subjects research__
Research = systematic investigation, including development, testing, and evaluation, designed to develop or contribute to generalizable knowledge.
Human subject = living individual about whom an investigator conducting research obtains (1) data or biospecimens through intervention or interaction with the individual; or (2) identifiable private information or identifiable biospecimens.
Human Subjects Research = the systematic collection of information about people designed to develop or contribute to generalizable knowledge.
__Note:__ Not all research is human research. You may be conducting a systematic investigation that involves people, but it may not be generalizable. Or it may be generalizable, but it is not about people.
Primary data collection | Secondary data collection |
---|---|
Experiments (field, online, lab) | Analysis of individual-level identifiable data |
Surveys, interviews, observations | Scraping data from (non-public) websites |
 | Merging data from multiple sources |
Best resources for information and to contact:
Harvard University Area IRB for main campus and Allston:
Committee on the Use of Human Subjects (CUHS)
Longwood Area IRB for Medical School, Dental School, and T.H. Chan School of Public Health:
Office of Human Research Administration (OHRA)
HBS:
Alma Castro is available to advise you on federal & state regulations and university policies that apply to research with human subjects.
The DRFD Research Administration team also reviews IRB applications on behalf of Harvard’s Committee on the Use of Human Subjects (CUHS).
- DRFD Compliance: https://inside.hbs.edu/Departments/drfd/Pages/research-compliance.aspx
- CUHS (HUA/IRB): https://cuhs.harvard.edu/
- Two broad groups of non-public data
- Confidential/Proprietary: Data that is licensed, provided by DUA, NDA, etc
- IRB-related: Sensitive and Non-Sensitive (but confidential)
- Sensitive data usually include Personally Identifiable Information (PII), health data, financial data, etc
- De-identification can be a tedious yet important step that should be done cautiously & thoroughly
- Highly recommend this be done by the data provider before receipt of the data
- Re-identification by grouping secondary data (or indirect identifiers) is very possible
- Consider multiple approaches to permit data granularity and fidelity while preventing re-identification
- E.g. with the first three digits of the ZIP code + year of birth, 0.04% of individuals can be re-identified, vs. 87% with ZIP + birthday + sex (Sweeney et al.; 2000)
- Just as important to promote preservation and re-use of sensitive data
- Don’t promise to destroy your data
- Don’t promise not to share your data
- Do get consent to retain and share data
- Do incorporate data-retention and -sharing clauses into IRB templates
- Many evolving techniques to safeguard privacy yet promote reuse
- (HBS) Contact RCS, KLS RDP, or DRFD Research Admin if you have any questions. (Others) Please contact your Research Admin office.
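The re-identification point above can be explored on your own data by counting how many records share each combination of indirect identifiers. The sketch below is a rough illustration, not a formal disclosure-risk assessment: the records are invented, and `share_unique` and `coarsen` are hypothetical helper names.

```python
from collections import Counter


def group_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def share_unique(records, quasi_identifiers):
    """Fraction of records whose quasi-identifier combination appears exactly once."""
    sizes = group_sizes(records, quasi_identifiers)
    unique = sum(
        1 for r in records
        if sizes[tuple(r[q] for q in quasi_identifiers)] == 1
    )
    return unique / len(records)


def coarsen(record):
    """Keep only the first three ZIP digits and the year of birth."""
    return {"zip3": record["zip"][:3], "birth_year": record["birth_date"][:4]}


# Toy records invented for illustration only.
RECORDS = [
    {"zip": "02138", "birth_date": "1980-01-15", "sex": "F"},
    {"zip": "02139", "birth_date": "1980-06-02", "sex": "F"},
    {"zip": "02163", "birth_date": "1980-06-02", "sex": "M"},
    {"zip": "02163", "birth_date": "1975-03-09", "sex": "M"},
]

if __name__ == "__main__":
    # Fine-grained identifiers: every toy record is unique.
    print(share_unique(RECORDS, ["zip", "birth_date", "sex"]))
    # Coarsened identifiers: most toy records now share a group.
    coarse = [coarsen(r) for r in RECORDS]
    print(share_unique(coarse, ["zip3", "birth_year"]))
```

A record that is unique on its indirect identifiers is the easiest to re-identify, so a falling share of unique records after coarsening is one simple signal that the data are safer to share. It does not replace review by your Research Admin office.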
- Professor Smith has asked you to obtain company financial data compiled by a firm called FinanceCorp. These data include confidential company financial data, and are therefore considered extremely sensitive. The FinanceCorp data will be merged with the data Professor Smith collected five years ago from company CEOs.
- How will you plan for this study?
- Who could help you determine:
- If the IRB should be involved?
- If a DUA is needed?
- If the data are affected by GDPR?
- What Harvard security level the data might be?
- Who could help you store and transfer the data?
Services are available to help with data acquisition, no matter if acquired by the investigator (primary) or from others (secondary).
Use software tools to aid in efficient & accurate collection
In whatever manner the data are acquired, be mindful of requirements related to IRB, DUAs, and Data Security levels
Confidential / sensitive data require special precautions at all stages
References
https://inside.hbs.edu/Departments/it/security/Pages/default.aspx
Electronic (Lab) Notebooks: http://bit.ly/2RCosb4
https://grid.rcs.hbs.org/transferring-data
https://inside.hbs.edu/Departments/it/security/Documents/InfoSecQuickGuide20200414-HBS.pdf
https://vpr.harvard.edu/files/ovpr-test/files/dua_policy_statement_final.pdf
https://ras.fss.harvard.edu/files/ras/files/safety_submission_guide.pdf
https://researchdatamanagement.harvard.edu/human-subjects-research
https://huit.harvard.edu/remote
https://www.harvard.edu/coronavirus/work-remotely
https://inside.hbs.edu/Departments/it/howto/Pages/work-remote.aspx
What are my options for effectively organizing, storing, securing, computing, and analyzing my research data?
Goals:
Know what resources are available
Understand best practices
Know where to get help
What speaks to you about Data Security?
What storage resources are available at HBS?
How might you collaborate with others when using data?
What compute resources do you have to use?
- This is the most difficult & time-consuming RDM stage
- Likely need to perform, rinse, & repeat
- So..
- Should be effortless if one has planned well…
- 5Ps: Proper Planning Prevents Poor Performance
- …and if done well 1st time around
- Security is just as important during these steps!
- This will vary based on the data sensitivity level, as indicated by the DUA, IRB, or Data Security plan
- See IT Security handout for appropriate considerations
- May often be directed by a faculty or RC Center member
- Consult local research computing center / environment
  - Research storage associated with a compute cluster
  - Database server
  - School and HU collaboration tools (e.g. SharePoint, OneNote)
- HBS:
  - IT-issued desktop / laptop storage (usually SSD)
  - Collaboration or project folders on research storage for group work
    - Associated with HBSGrid cluster
    - \\hbsfiles storage
- Other schools:
  - Lab folders offer equivalent functionality
https://researchdatamanagement.harvard.edu/storage-analysis-computation
- FASRC's compute environment*
- IQSS' compute environment
- Cloud providers: Mass OpenCloud, AWS, Azure, GCP, etc*
- 3rd party-licensed providers
- Qualtrics, Zotero, etc
- DropBox, OneDrive, Box, etc
- See websites for data transfer options:
*Some costs may be associated with use. Please contact RCS first
The University has determined that the Zoom cloud does not have the appropriate controls to protect Level 4 data. This means that it cannot be used to record research interviews, as recordings include conversations which could cause social harm to the participants should they be obtained by individuals with ill intent; such recordings are considered Level 4 data even if the full scope of the video is not intended to be used. Unfortunately, at this time, the University has not approved any cloud-based solutions for video recording research interviews.
What does "consumer" mean? A "consumer" account is a service which you have signed up for on your own. Even if it is being paid for with a Harvard credit card, it is considered a consumer account unless it is protected by a Harvard contract. Consumer versions of cloud software are not recommended for University business.
https://security.harvard.edu/collaboration-tools-matrix
- Local computing environment:
  - HBS-issued desktops / laptops (for data-intensive work, please talk to RCS/RSS)
  - Home computer, with appropriate security measures
- Remote environments:
  - HBSGrid compute cluster, FASRC Cannon* cluster, IQSS' RCE, HMS O2
  - Be thoughtful and strategic about use and efficiency
  - Offload long-running work to the compute cluster
  - If something isn't running as expected, troubleshoot or ask for help
- Cloud commercial vendors*:
  - Amazon Web Services (AWS), Google Cloud, Microsoft Azure
  - Please sign up under the Harvard contract (tenant)
  - They provide support for secure storage & compute, BUT ensure they meet your security requirements (storage location, sufficient security)
- Open-source cloud systems (not vetted):
  - OpenStack, OpenNebula, Mass. OpenCloud
- National Supercomputing Centers:
  - XSEDE umbrella of compute resources
*some costs may be associated with use
https://researchdatamanagement.harvard.edu/storage-analysis-computation
How might one use Version Control?
Describe an example of good project organization?
Why are workflow tools important?
We organize our recommendations into the following topics (Box 1):
Data management: saving both raw and intermediate forms, documenting all steps, creating tidy data amenable to analysis.
Software: writing, organizing, and sharing scripts and programs used in an analysis.
Collaboration: making it easy for existing and new collaborators to understand and contribute to a project.
Project organization: organizing the digital artifacts of a project to ease discovery and understanding.
Tracking changes: recording how various components of your project change over time.
Manuscripts: writing manuscripts in a way that leaves an audit trail and minimizes manual merging of conflicts.
https://doi.org/10.1371/journal.pcbi.1005510
https://drivendata.github.io/cookiecutter-data-science/
Put each project in its own directory, which is named after the project |
---|
Create folders that will separate your code and data |
In your data folder, ensure that your raw data are separated from any data you have processed (i.e., your clean datasets) |
Create additional folders as needed for the project, e.g., a report folder for output and a references folder for reference material such as the survey instrument |
Create a "README" file that outlines basic information about the project and the folder/file structure |
Name files in a way that their content or function can be easily identified |
Use relative addressing to make the project portable |
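The project-organization recommendations above can be scripted so that every new project starts from the same layout. This is a minimal sketch under assumed names: the folder names in `SUBFOLDERS`, the README wording, and the `create_project` and `raw_data_path` helpers are all hypothetical, not an HBS standard.

```python
from pathlib import Path
import tempfile

# Hypothetical folder names; adjust to your project's needs.
SUBFOLDERS = ["code", "data/raw", "data/clean", "report", "references"]

README_TEXT = """Project: {name}

code/        scripts used to clean and analyze the data
data/raw/    original data, never edited by hand
data/clean/  processed datasets produced by the code
report/      output such as tables and figures
references/  reference material, e.g. the survey instrument
"""


def create_project(parent, name):
    """Create a project directory with separate code, raw data, and clean data folders."""
    root = Path(parent) / name
    for sub in SUBFOLDERS:
        (root / sub).mkdir(parents=True, exist_ok=True)
    readme = root / "README.txt"
    if not readme.exists():  # never overwrite an existing README
        readme.write_text(README_TEXT.format(name=name), encoding="utf-8")
    return root


def raw_data_path(project_root, filename):
    """Build a path relative to the project root so the project stays portable."""
    return Path(project_root) / "data" / "raw" / filename


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = create_project(tmp, "ceo_study")
        for path in sorted(root.rglob("*")):
            print(path.relative_to(root))
```

Building every path from the project root, rather than hard-coding an absolute location, is what lets a collaborator copy the folder elsewhere and rerun the code unchanged.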
- Programming languages for processing and analyzing data in research:
- Most used: Python (Spyder as editor) and R (RStudio as editor)
- Others: Scala, Java, Julia
- Statistical packages:
- Stata, R, SAS, & SPSS
- Big Data tools:
- Spark (Hadoop), Kubernetes/containers
- Data Visualization tools:
- ggplot2, Tableau, D3, Shiny, Plotly, Pandas, WorldMap (from HU's CGA)
*This list is not meant to be exhaustive!
- Whatever tool you use, document all steps in your analysis and data transformations. Some tools to help with that:
- RMarkdown and RMarkdown Notebooks (used with R)
- Jupyter Notebooks: support for most languages (Python, R, Stata, MATLAB)
- Dyndoc (notebook for Stata)
- Templates with OneNote (or EverNote)
- Workflow/Pipeline Tools
- These help document and track process order:
- Consider Drake (R); SnakeMake, doit or py-Make (Python); and make for other systems
- See https://github.com/pditommaso/awesome-pipeline for a full list of options
- Be sure to update your data dictionary/codebook as you make changes to your data
- Your future self will thank you!
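To show what workflow tools track, the sketch below applies the basic rule used by `make`: rebuild an output only when it is missing or older than one of its inputs. It is a plain-Python stand-in written for illustration, not how Drake or SnakeMake are implemented, and the `is_stale` and `run_pipeline` names and file names are hypothetical.

```python
import os
import tempfile


def is_stale(target, sources):
    """True if target is missing or older than any of its source files."""
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    return any(os.path.getmtime(src) > target_time for src in sources)


def run_pipeline(steps):
    """Run (target, sources, action) steps in order, skipping up-to-date targets.

    Returns the list of targets that were rebuilt.
    """
    rebuilt = []
    for target, sources, action in steps:
        if is_stale(target, sources):
            action(target, sources)
            rebuilt.append(target)
    return rebuilt


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        raw = os.path.join(tmp, "raw.csv")
        clean = os.path.join(tmp, "clean.csv")
        with open(raw, "w", encoding="utf-8") as f:
            f.write("company\nexample co\n")

        def clean_step(target, sources):
            # Toy cleaning step: upper-case the raw file.
            with open(sources[0], encoding="utf-8") as src:
                text = src.read().upper()
            with open(target, "w", encoding="utf-8") as dst:
                dst.write(text)

        steps = [(clean, [raw], clean_step)]
        print(run_pipeline(steps))  # first run rebuilds clean.csv
        print(run_pipeline(steps))  # second run finds nothing to do
```

Listing the steps in one place also documents the order of processing, which is the main benefit named above. Dedicated tools add dependency graphs, parallel execution, and cluster support on top of this idea.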
- Incredibly important given the duration and lifespan of projects
- What may start out as a small test idea may grow organically into a multi-person, multi-site research project
- Two approaches given, for small to large…
- Manual versioning
- In most cases, data will be organized in files under directories:
- Use phase title, unique identifiers, and descriptive filenames
- Prefix by date (yyyy, yyyy.mm.dd, yyyy_mmdd, yymmdd)
- Reserve / display 3-letter file extension for file format, such as .txt, .pdf, or .csv.
- Note all changes in a ReadMe.txt or Changes.txt document
- Use a version control system via Git or Github.com
- Use your judgement, and talk to your faculty advisor; BUT their non-use does not prevent your use
- Utilize Github.com web interface for external (non-HU) collaboration, and code.harvard.edu for internal-only use
- Use command-line (terminal/shell) Git or a Git GUI such as GitKraken
- NB! HBS/IQSS Version Control Class offered each semester
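For the manual-versioning approach, a small helper can keep date prefixes and change notes consistent. The sketch below assumes the `yyyy.mm.dd` prefix style from the list above; the `dated_name` and `log_change` helpers and the example file names are hypothetical.

```python
from datetime import date
from pathlib import Path
import tempfile


def dated_name(description, extension, on=None):
    """Return a filename such as 2022.02.09_ceo_survey_clean.csv."""
    on = on or date.today()
    # Lower-case the description and join its words with underscores.
    slug = "_".join(description.lower().split())
    return f"{on:%Y.%m.%d}_{slug}.{extension}"


def log_change(folder, filename, note, on=None):
    """Append one tab-separated line describing the change to Changes.txt."""
    on = on or date.today()
    with open(Path(folder) / "Changes.txt", "a", encoding="utf-8") as log:
        log.write(f"{on.isoformat()}\t{filename}\t{note}\n")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        name = dated_name("CEO survey clean", "csv")
        log_change(tmp, name, "Dropped incomplete responses")
        print(name)
        print((Path(tmp) / "Changes.txt").read_text(encoding="utf-8"))
```

Once a project involves several people or many revisions, a version control system such as Git records the same information automatically.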
Source: PHD Comics. 2012. __Piled Higher and Deeper.__ http://phdcomics.com/comics/archive.php?comicid=1531
- Store and use data in a database for easy, fast queries!
- Useful when data contain complex relationships or must be related across multiple sets of files
- "Structured" data: SQL databases (MySQL, PostgreSQL, MariaDB)
- Textual/"unstructured" data: NoSQL databases (MongoDB, Cassandra)
- Benefit: data is read-only, unless explicitly changed
- Consider versioning the data / databases also
- Update your data dictionary describing data, types, use, etc.
  - Consider storing it, as well as change information, as a part of the database
- HBS RCS provides MariaDB as a part of the RC environment:
  - Provision and guidance on data modeling and DB development
  - Advise on best practices and performance tuning
- FASRC & HMS offer equivalent resources
- Cloud vendors have similar offerings
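Storing the data dictionary alongside the data can look like the sketch below. It uses SQLite from Python's standard library purely as a stand-in so the example runs anywhere; the same idea applies to MariaDB or another SQL server. The table names, column names, and sample rows are invented for illustration.

```python
import sqlite3


def build_database(path=":memory:"):
    """Create a toy survey table plus a data_dictionary table describing it."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE ceo_survey (company TEXT, survey_year INTEGER, score REAL)"
    )
    conn.execute(
        "CREATE TABLE data_dictionary ("
        "table_name TEXT, column_name TEXT, data_type TEXT, description TEXT)"
    )
    conn.executemany(
        "INSERT INTO data_dictionary VALUES (?, ?, ?, ?)",
        [
            ("ceo_survey", "company", "TEXT", "Company name as reported"),
            ("ceo_survey", "survey_year", "INTEGER", "Year the survey was fielded"),
            ("ceo_survey", "score", "REAL", "Hypothetical behavior score"),
        ],
    )
    conn.executemany(
        "INSERT INTO ceo_survey VALUES (?, ?, ?)",
        [("Example Co", 2017, 3.5), ("Sample Inc", 2017, 4.0)],
    )
    conn.commit()
    return conn


def undocumented_columns(conn, table):
    """List columns of `table` that have no entry in data_dictionary."""
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    # The table name is interpolated directly, so pass only trusted names.
    actual = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    documented = {
        row[0]
        for row in conn.execute(
            "SELECT column_name FROM data_dictionary WHERE table_name = ?",
            (table,),
        )
    }
    return [col for col in actual if col not in documented]


if __name__ == "__main__":
    conn = build_database()
    print(undocumented_columns(conn, "ceo_survey"))
    # Adding a column without documenting it is flagged by the check.
    conn.execute("ALTER TABLE ceo_survey ADD COLUMN revenue REAL")
    print(undocumented_columns(conn, "ceo_survey"))
```

A check like `undocumented_columns` is one way to notice when the data have changed but the dictionary has not.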
- RC Centers & HU Libraries offer tool & analysis environment training, both on-campus and remote
- E.g. Intro to R / Python, Automating Work, Version Control, etc.
- Data collections & analytical methods
- Web scraping, causal inference, natural language processing
- Offered fall and spring, and announced through newsletters, websites, and Harvard Training Portal (Category: Research Computing)
You have transferred the data and are storing it on the HBSGrid along with the older data. You notice that the data and code from five years ago are a bit...disorganized.
Recall that project materials will need to be easily accessible to a collaborator from USC. Who could you contact to ensure that a collaborator can access the data and code?
How would you go about organizing and documenting the previous files?
How will you organize the new code and files?
How will you document your processes?
What type of version control system will you employ?
Many resources are available for storage and computation (desktop, laptop, HBS grid, cloud), but the storage must be appropriate for the security level.
Important to organize code/files and document processes
RCS offers consultations and trainings on storage and analysis.
References:
Baker Library Research Data Program: https://www.library.hbs.edu/Services/Research-Data-Program
HU Libraries Data Networks: https://hlrdm.library.harvard.edu/network
HU Working Remotely: https://www.harvard.edu/coronavirus/work-remotely
http://security.harvard.edu/dct
https://github.com/pditommaso/awesome-pipeline
Github.com
Github Enterprise @ Harvard: http://code.harvard.edu
Cookie Cutter Data Science: https://bit.ly/2NXTVGI
Wilson G, Bryan J, Cranston K, Kitzes J, Nederbragt L, Teal TK (2017) Good enough practices in scientific computing. PLoS Comput Biol 13(6): e1005510. https://doi.org/10.1371/journal.pcbi.1005510
https://researchdatamanagement.harvard.edu/storage-analysis-computation
Harvard Training Portal: Research Computing
What help can I get with managing my data at the end of a project, so that it can be safeguarded for the long-term?
Why is it worthwhile to share my data?
Goals, to Understand:
Options for managing data long-term
Resources available
Data Acquisition & Collection
Storage, Security, & Analysis
Dissemination & Preservation
What data have you ever used that was originally collected by someone else?
Where did you find and access this data?
What is one thing (in research or in life) that you put somewhere for safekeeping but then couldn’t find later?
What have you heard before about reasons and requirements for preserving and sharing data?
- Not sure what to do with your data at the end of a project?
- Retention requirements
- HBS Faculty Papers Program
- Interested in sharing, but not now?
- Embargos
- Ready to share now
- Data repositories
- Baker Library Data Deposit Service for HBS Faculty
- Harvard Library Data Curation Services
- Consultation service to help you decide may be available from your School
If you're not yet ready to think about data sharing, but still want to secure your data, the Library can help
- How long do I need to keep research records?
- Harvard Office of the Vice Provost for Research
- "Research Records should be retained, generally, for a period of no fewer than seven (7) years after the end of a research project or activity"
- What exactly is a "research record"?
- Transcripts of interviews
- Photographs
- Video/audio files
- Data (qualitative and quantitative)
- Agreements with research subjects
- Project proposals
- For HBS Faculty, Baker Library offers secure storage through the Faculty Papers Program.
- If keeping locally, follow security protocols (for example, a locked cabinet for paper records; secure/restricted network space which is routinely backed up)
- The HBS Archives include faculty research papers that trace the innovations in business education pioneered at the Business School.
- For the HBS Archives, we collect research data with long-term historical value or importance to the School.
- We can consult with HBS faculty members and help add their research to the HBS Archives collection. We can place an "embargo" on access to meet privacy or other concerns, while still ensuring the data is available for future researchers.
- Other Archives
- Inter-university Consortium for Political and Social Research (ICPSR)
Comparison of Output with Hours of Sleep, ca. 1930. Western Electric Company Hawthorne Studies Collection.
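The retention guideline quoted earlier ("no fewer than seven (7) years after the end of a research project or activity") comes down to simple date arithmetic. The sketch below is illustrative only: the function name `earliest_disposal_date` and the constant `RETENTION_YEARS` are hypothetical, and the result is a minimum. Funders, data use agreements, or School policy may require a longer period, so treat the output as a floor, not a disposal date.

```python
from datetime import date

# Minimum retention period taken from the policy text quoted in this session.
# Assumption: a specific project may be subject to a longer requirement.
RETENTION_YEARS = 7


def earliest_disposal_date(project_end, years=RETENTION_YEARS):
    """Return the first date on which the minimum retention period has elapsed."""
    try:
        return project_end.replace(year=project_end.year + years)
    except ValueError:
        # project_end is Feb 29 and the target year is not a leap year,
        # so roll forward to Mar 1 rather than shortening the period.
        return project_end.replace(year=project_end.year + years, month=3, day=1)


if __name__ == "__main__":
    print(earliest_disposal_date(date(2022, 2, 9)))
```

Rolling a Feb 29 end date forward to Mar 1 keeps the computed period at or above the stated minimum.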
- Increases your reputation and the visibility of your research
- Increases your impact and informs new research
- Because your data can be cited:
- Informs you about how your data is being used
- Allows you to measure your greater impact
- Maximizes transparency, accountability and scrutiny of research findings
- To ensure data can be re-used, work to make it FAIR:
- Findable
- Accessible
- Interoperable
- Reusable
Wherever HBS faculty want to share their data, Baker Library staff will handle the deposit. They will:
Help researchers select the best repository for their needs
Help researchers consider what data can be shared
Advise on preparing data and documentation
Liaise with the data repository
Deposit the files
Create repository metadata
https://www.library.hbs.edu/Services/Data-Deposit-Service
HBS Dataverse
Repository run by Harvard for sharing your data
Enables immediate sharing of data and associated documentation
Widely discoverable and citable with a DOI
Online analysis features for selected formats
Other data repositories, e.g.:
ICPSR: full-service data archive; expert in managing and providing restricted access to sensitive data
Journals may specify place of publication
Mission: Connect members of the Harvard community to services and resources that span the research data lifecycle, to help ensure that Harvard’s multi-disciplinary research data is findable, accessible, interoperable, and reusable (FAIR).
GUIDING PRINCIPLES
Connect a distributed network of services, resources, stakeholders & participants
Resource sharing: reduce duplication
Openness: communicate options
Scalability: assess needs, scale services
Ease-of-access: minimize administrative barriers for users
PROGRAM OBJECTIVES
Serve the Harvard community across the research data lifecycle in 4 key areas
Services & Resources
Partnerships & Collaborations
Communications & Outreach
Communities of Practice
Collaborative services offered by IQSS Dataverse and Harvard Library (HL-RDM & Metadata Services)
Anticipated launch: Late-fall 2019
HARVARD DATAVERSE REPOSITORY
- FAIR data
- Free data deposits
- Self-curation
- DOIs
- Data citations
CONSULTATION
- Free consultation & assessment
- Fee-based extended consultation services
DATA CURATION
- Dataverse setup & file ingest
- Ongoing dataverse administration
- Custom curation services
Organize & share data in repository
Dataverse data repository & DASH
Data curation services*
Consultations, referrals & best practices
Harvard Dataverse data repository
Data curation services
Consultations, referrals & best practices
Finally, note that as a condition of receiving the grant to explore this topic area, Professor Smith has agreed to de-identify the data and make it available for public use at the conclusion of the study.
As Professor Smith's new staff member, take 5 minutes and discuss with your breakout group:
How might you figure out which of the project's data, from the different sources, can be shared publicly?
Given what else you have learned in this session, how do you think you should organize and document your final data files so that they can be used by other researchers?
What type of features do you think Professor Smith would want in a data sharing repository?
Think early and often about how you can keep your data organized and well-documented _while you're doing your research_, to save a ton of time at the end.
You don't have to share your data to properly keep it.
When you're thinking about what your faculty can do with their data long-term, think of Baker Library.
We can help your faculty find a solution that meets their unique needs. Contact us for a consultation anytime as you're doing your research.
More info:
https://www.library.hbs.edu/Services/Safeguard-Your-Data-Long-term
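One lightweight way to act on the advice above about keeping data organized and well-documented is to generate a file manifest alongside your data. The following is a minimal sketch, not a Baker Library or Harvard requirement: the file name `MANIFEST.tsv`, the function names, and the choice of SHA-256 checksums are all assumptions made for illustration. A manifest like this lets a future collaborator or repository curator confirm that no files are missing or altered.

```python
import csv
import hashlib
from pathlib import Path

# Hypothetical file name chosen for this example; not an HBS convention.
MANIFEST_NAME = "MANIFEST.tsv"


def build_manifest(data_dir):
    """Return one record per file: relative path, size in bytes, SHA-256 checksum."""
    data_dir = Path(data_dir)
    records = []
    for path in sorted(data_dir.rglob("*")):
        # Skip directories and any manifest left over from an earlier run.
        if not path.is_file() or path.name == MANIFEST_NAME:
            continue
        records.append({
            "path": path.relative_to(data_dir).as_posix(),
            "bytes": path.stat().st_size,
            # read_bytes() loads the whole file; fine for a sketch, but very
            # large files would be better hashed in chunks.
            "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
        })
    return records


def write_manifest(data_dir):
    """Write the manifest as a tab-separated file inside data_dir and return its path."""
    out_path = Path(data_dir) / MANIFEST_NAME
    with out_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=["path", "bytes", "sha256"], delimiter="\t"
        )
        writer.writeheader()
        writer.writerows(build_manifest(data_dir))
    return out_path
```

Calling `write_manifest("final_data")` on a hypothetical `final_data` folder would list every file beneath it. Re-running it after a change and comparing the two manifests shows which files were added, removed, or modified.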
Plan your trajectory using the DMP Checklist
Think carefully about data collection, including security & legal documents
There are many options for storage, security, and analysis – choose thoughtfully & wisely
Data dissemination and preservation are important considerations throughout all phases of your work
Please reach out to your department's or school's library and research computing groups with your questions, and reach out sooner rather than later.
We wish you success in your research
Please fill out our class survey: http://bit.ly/rcs_class_eval