Suggested Information for Data and Code Hosting
Journals and institutions have assessed a number of trusted repositories:
- CoreTrustSeal has a certification process
- re3data.org lists research data repositories
- Nature, F1000Research, and PLOS (and soon the AEA) have lists of trusted repositories.
- These generally include at least the following:
- Many universities have formal document repositories that may be able to assume such a role; talk to your (data) librarian
- Restricted-access and secure data centers also assume the role of trusted repositories:
- Note that some data archives can also handle restrictions on data dissemination for various reasons
- ICPSR can handle certain types of confidential data
- Zenodo can handle license-based restricted access
- We note that acceptable access restrictions are limited to concerns of confidentiality or third-party licensing. We do not accept (permanent) access restrictions where
- the author is the sole arbiter of access
- sharing is not allowed because of personal interests (future publications, patents, etc.)
A variety of (unfortunately) commonly used web-accessible locations are not acceptable as data repositories for the purpose of an article’s supplementary materials:
- GitHub, GitLab, etc., because a project’s owner can delete a git repository at any time (but see this page on how to leverage Zenodo to enable proper archiving of code and software) (see also questions in the FAQ);
- Google pages, university and personal faculty web pages: they can all be deleted by the owner or by the employer (the university) without regard to the archival characteristics of their contents (but talk to your university library, which may have a way to facilitate archiving of web pages, and investigate the Wayback Machine for a similar purpose);
- Dropbox, Box.com, and similar cloud-based data and file sharing services - again, they can all be deleted at short notice, or when payment stops
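As a rough aid, the list above can be turned into a quick check on the links in a README or data availability statement. The sketch below is illustrative only: the function name `is_non_archival` and the exact host names in `NON_ARCHIVAL_HOSTS` are assumptions made for this example, not part of any official tool, and personal or university web pages cannot be detected this way.

```python
from urllib.parse import urlparse

# Hosts corresponding to the services named in the list above.
# The set is an assumption of this sketch and is not exhaustive.
NON_ARCHIVAL_HOSTS = {
    "github.com",
    "gitlab.com",
    "sites.google.com",
    "dropbox.com",
    "box.com",
}


def is_non_archival(url: str) -> bool:
    """Return True if the URL points at one of the listed hosts or a subdomain."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == h or host.endswith("." + h) for h in NON_ARCHIVAL_HOSTS)


if __name__ == "__main__":
    for link in (
        "https://github.com/user/repo",
        "https://doi.org/10.7910/DVN/DJHVHB",
    ):
        print(link, "->", "not acceptable" if is_non_archival(link) else "no flag")
```

A flagged link is only a prompt to deposit the material in a trusted repository; an unflagged link is not proof that the location is archival.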
“Immigration Restrictions as Active Labor Market Policy: Evidence from the Mexican Bracero Exclusion, Replication files and raw data” (Michael Clemens)
- Hosted on Harvard Dataverse at https://dataverse.harvard.edu/dataverse/bracero
- Contains two datasets:
- Clemens, Michael, 2017, “Raw scanned PDFs of primary sources for workers, wages, and crops”, https://doi.org/10.7910/DVN/DJHVHB, Harvard Dataverse, V1
- Clemens, Michael, 2018, “Replication Data for: Immigration Restrictions as Active Labor Market Policy: Evidence from the Mexican Bracero Exclusion”, https://doi.org/10.7910/DVN/17M4ZP, Harvard Dataverse, V1
“United States Newspaper Panel, 1869-2004” (Gentzkow, Shapiro, Sinkinson)
- Hosted on ICPSR at https://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/30261
- Gentzkow, Matthew, Shapiro, Jesse M., and Sinkinson, Michael. United States Newspaper Panel, 1869-2004. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2014-12-10. https://doi.org/10.3886/ICPSR30261.v6
“Socioeconomic High-resolution Rural-Urban Geographic Dataset for India (SHRUG)” (Asher and Novosad)
Challenges in Hosting of Data and Code at Restricted-Access Data Centers
Users of restricted-access data centers (RADC, such as FSRDCs, CASD, etc.) face certain challenges in the handling of data and code as described in this document:
- researchers (end-users) may not be able to provide DOI or similar persistent identifiers for some data
- researchers may not be able to discern the preservation policy for certain data sets
- researchers may not be able to remove all code from the center, or such removal is subject to restrictions
- data citation guidance may be lacking, or may not be obvious (see Data Citation Guidance for general guidance)
A few guidelines
- Request as much code as the RADC will allow the researcher to remove. Subsequently handle it equivalently to the general code guidance, but make special note (placeholders, explanatory text) of any redacted information.
- In addition, some RADCs may provide the ability to deposit code internally and confidentially. Use such internal repositories, and make a note of their location in the publicly deposited code or in supplementary documents.
If a RADC has at least an archival or backup policy of sufficient length (e.g., 10 or more years), but does not offer a formal repository, then the following procedure allows users to find and request code and data:
- As before, request as much code as is feasible, and deposit it in a public repository (e.g., openICPSR, Dataverse, Zenodo). Don’t publish it yet.
- If possible at such repositories, pre-register a DOI:
- At Zenodo: click the appropriation request button, and a DOI will be assigned, e.g., 10.5281/zenodo.
- At openICPSR: projects are called
DOI is derived from the project number as 10.3886/E
- If you already have a DOI assigned to your manuscript or (published) paper, you can alternatively use that (see 10.1093/restud/rdw057 for an excellent example).
- In the RADC, create a two-level directory with the name of the DOI
- Move both data (following guidelines outlined here) and all code (not just the confidential part) to subdirectories. The resulting directory structure will look something like this:
- Confirm with the RADC’s administrative staff how long project files are kept as archives or in backup (often 5-10 years)
- Add a statement to the public README.md (and to article materials). See Sample RADC Statement 1 and Sample RADC Statement 2.
- Fort (2016) 10.1093/restud/rdw057, in the supplementary materials (local copy)
- Groen, Kutzbach, and Polivka (2019, forthcoming) (link to be supplied upon publication)
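The directory setup described in the procedure above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the prescribed layout: it assumes the two levels are the DOI prefix and the DOI suffix, that the subdirectories are called `data` and `code`, and that any further slashes in the suffix are replaced by underscores to keep exactly two levels. The function name `make_radc_layout` and the DOI `10.5281/zenodo.0000000` are hypothetical placeholders.

```python
from pathlib import Path


def make_radc_layout(root, doi, subdirs=("data", "code")):
    """Create <root>/<DOI prefix>/<DOI suffix>/<subdir> for each subdir.

    Assumption of this sketch: the two directory levels are the DOI prefix
    and suffix, and the subdirectory names are "data" and "code".
    """
    prefix, sep, suffix = doi.partition("/")
    if not sep or not prefix or not suffix:
        raise ValueError(f"not a DOI of the form prefix/suffix: {doi!r}")
    # Some DOI suffixes contain "/" themselves; flatten them so the layout
    # stays two levels deep (a choice made for this sketch).
    suffix = suffix.replace("/", "_")
    base = Path(root) / prefix / suffix
    for name in subdirs:
        (base / name).mkdir(parents=True, exist_ok=True)
    return base


if __name__ == "__main__":
    # Hypothetical placeholder DOI and project root, for illustration only.
    print(make_radc_layout("project_root", "10.5281/zenodo.0000000"))
```

Naming the directory after the pre-registered DOI means that the public deposit and the files kept inside the RADC share one identifier, so a later request to the RADC can cite it directly.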