Requested Information for Data
Overview
We request the following information for all articles:
- Data Description: detailed description of all data, including input data:
- name of data set
- (original and current) location of data set - this is also called the “provenance” of the data
- characteristics of data set: At a minimum, all variables that are used in the paper should be well-described (variable/column labels, value labels, summary statistics).
- Data Access: complete description of how the data can be accessed
- Data Persistence: how long the data will continue to exist in the form in which the author used it
Details follow.
Generic Diagram
graph LR;
subgraph "Data Provider";
DB[(Data source)] ;
end;
DB -.- E>Extract] -.-> A;
subgraph Researcher;
A((Input data)) --> B{Further analysis};
end;
Data description
- Data citation:
- This must include the official name of the dataset (might be “Data source” or “Input data” in the figure above), as well as the name of the creator (“Data Provider” in the figure above)
- including if the article’s author is the creator
- It must include the location of the data - download site, physical archive, etc.
- If the data is original to the article, then this will be an integral part of the replication archive
- If the data was acquired elsewhere, this should be the ORIGINAL location of the data (“Data Source” in the figure above)
- It must include the date the data was created or downloaded
- If possible, provide a Digital Object Identifier (DOI).
- Statistical characteristics of the data
- This description of data is often called “metadata”
- At a minimum, all variables that are used in the paper should be well-described (variable/column labels, value labels, summary statistics) through a codebook.
- If this information is already provided in the article or an online appendix, the README should point to it (“For summary statistics of variables, see Table 1 in the paper”). Note however that the human-readable variable name (e.g. “Earnings”) still needs to be mapped to the name in the dataset, if different (e.g., “var31a”)
- If confidentiality protection is a concern,
- talk to your data provider.
- You may be able to apply simple count and fraction rounding rules (see code in this GitHub repository), or remove some summary statistics (e.g., max/min or specific quantiles).
- For additional guidance, see
- Where multiple versions of the data exist, describe the version of the data used by the author.
- This information can be generated by the paper author, or provided by pointing to a URL of the data provider (for instance an online codebook).
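As an illustration of the count and fraction rounding rules mentioned above, here is a minimal Python sketch. It is not the code from the repository referenced above. The function names, the rounding base of 10, the two decimal digits, and the convention that share variables start with `share` are all assumptions made for this example; your data provider's disclosure rules take precedence.

```python
from decimal import Decimal, ROUND_HALF_UP


def round_count(count: int, base: int = 10) -> int:
    """Round a non-negative cell count to the nearest multiple of `base`.

    The base of 10 is an assumption; ask your data provider for the real rule.
    """
    if count < 0:
        raise ValueError("counts must be non-negative")
    return (count + base // 2) // base * base


def round_fraction(fraction: float, digits: int = 2) -> float:
    """Round a fraction to `digits` decimal places (halves round up)."""
    quantum = Decimal(1).scaleb(-digits)  # e.g. Decimal('0.01') for digits=2
    return float(Decimal(str(fraction)).quantize(quantum, rounding=ROUND_HALF_UP))


def protect_summary(summary, base=10, digits=2, drop=("min", "max")):
    """Apply the rounding rules to a dict of summary statistics.

    Hypothetical conventions: the key "n" holds a count, keys starting with
    "share" hold fractions, and statistics listed in `drop` are removed.
    """
    protected = {}
    for name, value in summary.items():
        if name in drop:
            continue  # remove statistics that may reveal individual records
        if name == "n":
            protected[name] = round_count(value, base)
        elif name.startswith("share"):
            protected[name] = round_fraction(value, digits)
        else:
            protected[name] = value
    return protected


raw = {"n": 1234, "share_female": 0.4873, "mean": 51.2, "min": 18, "max": 99}
print(protect_summary(raw))
```

The rounded summary can then be reported in the codebook in place of the exact values.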
Practical guidance: Data description
It doesn’t need to be complicated, but should be complete. For more extensive descriptions, best practices, and possible support, see
Other services may be available at your own institution. Some services may charge a fee, or only be available to researchers affiliated with that institution.
We provide a few suggestions for how to do this in Stata, R, and SAS.
However, if you have the ability to do a more robust data description, you should. See a self-citing example.
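For authors who do not work in Stata, R, or SAS, the following is a minimal Python sketch of a codebook generator, using only the standard library. The function name `build_codebook`, the input layout (a list of dictionaries, one per record), and the sample values are assumptions made for illustration. The variable name `var31a` and its label "Earnings" follow the example used earlier on this page.

```python
import statistics


def build_codebook(rows, labels):
    """Describe each variable in `labels`: name, label, counts, and statistics.

    rows   -- list of dicts, one per record (None marks a missing value)
    labels -- maps the name in the dataset to a human-readable label
    """
    codebook = []
    for name, label in labels.items():
        values = [row[name] for row in rows if row.get(name) is not None]
        entry = {
            "variable": name,
            "label": label,
            "n": len(values),
            "missing": len(rows) - len(values),
        }
        if values and all(isinstance(v, (int, float)) for v in values):
            # Numeric variable: report basic summary statistics.
            entry.update(
                mean=statistics.mean(values), min=min(values), max=max(values)
            )
        else:
            # Categorical variable: list the distinct values observed.
            entry["values"] = sorted(set(values))
        codebook.append(entry)
    return codebook


# Made-up sample records, for illustration only.
sample_rows = [
    {"var31a": 100, "sex": "F"},
    {"var31a": 300, "sex": "M"},
    {"var31a": None, "sex": "F"},
]
sample_labels = {"var31a": "Earnings", "sex": "Sex of respondent"}

for entry in build_codebook(sample_rows, sample_labels):
    print(entry)
```

Each entry maps the name in the dataset to the human-readable name used in the article, which is the mapping requested above.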
Data access description
The description of data access should provide enough information so that an uninformed user could theoretically access the data. Data access descriptions are often referred to as “Data Availability Statements”.
In the flow diagram above, it answers the question: who can access “Data Source” at “Data Provider” and create an extract?
The access should be persistent, i.e., not rely on a transitory website or the presence of a particular person who might change jobs at any time.
- This can be as simple as a download URL.
- It might be a pointer to a directory in the replication archive.
- This might also be the URL for a description of the application procedure (dataprovider.com/application_procedure.html). Real examples include
- NCHS, https://www.cdc.gov/rdc/leftbrch/userestricdt.htm
- PSID, https://simba.isr.umich.edu/restricted/ProcessReq.aspx
- The access description should include an estimate of the monetary and time cost of the application process.
- If you provided a Data Management Plan (DMP) as part of a grant proposal, much of the information may already be present in the DMP.
- Provide the date on which the data access description was valid. For long-running projects, this may be the access information as of the time when YOU accessed the data, or it may be the access information for somebody who wants to access the data TODAY.
- Provide information on the access conditions:
- This is often defined within a license.
- Open-access data is often under a Creative Commons - Attribution (CC-BY) license.
- If any license is posted or stated, link to it.
- Terms of use and license are not exactly the same thing.
- Does a downloader need to agree to simple terms of use? (Example: IPUMS-USA https://usa.ipums.org/usa/cite.shtml)
- Does the researcher need to provide information (project proposal, computer security plan, etc.)? (Example: PSID https://simba.isr.umich.edu/U/CondUse.aspx)
- Is there perhaps a residency or citizenship requirement?
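One way to check that an access description covers the items listed above is to record them in a small structure and flag what is still missing. The following Python sketch does that. The class name, the field names, and the sample values are hypothetical and are not a required format; the URL is the placeholder used above.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class DataAccessDescription:
    """Items of a data access description (field names are hypothetical)."""

    dataset: str
    provider: str
    access_url: str  # download URL or description of the application procedure
    valid_as_of: str  # date on which this description was valid
    license_url: Optional[str] = None
    terms_of_use_url: Optional[str] = None
    monetary_cost: Optional[str] = None
    time_cost: Optional[str] = None
    application_requirements: Optional[str] = None
    residency_or_citizenship: Optional[str] = None

    def missing_items(self) -> List[str]:
        """Return the names of the items that have not been filled in."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]

    def render(self) -> str:
        """Render the filled-in items as plain text for a README."""
        lines = [
            f"{self.dataset} ({self.provider})",
            f"Access: {self.access_url} (valid as of {self.valid_as_of})",
        ]
        optional = [
            ("License", self.license_url),
            ("Terms of use", self.terms_of_use_url),
            ("Monetary cost", self.monetary_cost),
            ("Time cost", self.time_cost),
            ("Application requirements", self.application_requirements),
            ("Residency or citizenship", self.residency_or_citizenship),
        ]
        lines += [f"{label}: {value}" for label, value in optional if value is not None]
        return "\n".join(lines)


# Made-up example values, for illustration only.
description = DataAccessDescription(
    dataset="Example Survey",
    provider="Example Provider",
    access_url="dataprovider.com/application_procedure.html",
    valid_as_of="2020-01-15",
    monetary_cost="none",
    time_cost="about 6 weeks",
)
print(description.render())
print("Still to document:", description.missing_items())
```

An item that genuinely does not apply (for instance, no residency requirement) is better recorded explicitly as "none" than left empty, so that readers can tell it was considered.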
Practical guidance: data access
- While non-academic data providers may not always consider reproducibility when you sign a contract with them, we have found numerous such providers that are open to at least the idea of “reproducibility checking” or replication. Examples of agreements that allow third parties to access confidential data for the purpose of replication,
- For additional guidance on licensing, see Licensing guidance.
- For more information on Data Access Descriptions, see Data Access Descriptions.
For additional sample “Data Access Descriptions”, see
Data persistence
Data should remain available for a sufficiently long time.
- Depositing in the journal-based data (and code) repositories ensures that the data (and code) will remain available indefinitely.
- This is also true if the data is in various other repositories
- This may also be true if the data cannot be shared (restricted access data).
- Again, this may already have been defined in your DMP.
- A good minimum benchmark is 10 years, but this may not always be feasible with data the author does not control.
You should describe the data persistence, or point to a data archive’s policy in that matter.
See Requested_information_hosting for more details on data repositories.
What is a data provider
A “data provider” in this sense can be
- a public repository where the data can be found (ICPSR),
- a website that provided the data (IPUMS),
- a statistical agency or private company that granted access to the data (U.S. Census Bureau, Twitter, Acme Inc.).
The author may also be the data provider, for instance, because the author conducted the survey used in the article. However, in many cases, the data provider may not be a data archive (see the page on Requested information about hosting).
Practical guidance: data provider and data archives
If the data provider is not an archive (i.e., the data persistence is insufficient, and data might go away), you should investigate depositing the data at a data archive.
Planning Ahead
Many of the items above can be planned ahead of time. In fact, funding agencies require data management plans, and these are core elements of data management plans.
Consider from the start what is needed to share the data.
If you are collecting data, consider early on which restrictions are actually needed, and impose no more than those.
Consider breaking your data archives into manageable pieces. For instance, raw data might be in a different location/repository/archive than the analysis data, even if you are providing access to both for the purpose of replication. Examples:
Clemens, Michael, 2017, “Raw scanned PDFs of primary sources for workers, wages, and crops”, https://doi.org/10.7910/DVN/DJHVHB, Harvard Dataverse, V1