One of the most vexing issues is how to cite data. This document goes through a few common scenarios not covered elsewhere.
Many authors initially neglect to add data citations, or do not know how to add a data citation. Often, we see authors cite papers with supplementary data, but not databases or other data:
We use data acquired from the NHL, dates of power outages collected by Tremblay et al (2018), augmented with information on the language and grammar skills of hockey players provided by the Ethnologue database.
Note the absence of citations for the NHL and Ethnologue data: three datasets are used, but only one is cited in any fashion.
The above example can be improved as follows:
We use data acquired from the NHL (NHL, 2018), dates of power outages collected by Tremblay et al (2018, 2019), augmented with information on the language and grammar skills of hockey players provided by the Ethnologue database (Eberhard et al, 2019).
with the reference list having the following entries:
The Data Citation Principles note that (emphasis added):
Sound, reproducible scholarship rests upon a foundation of robust, accessible data. For this to be so in practice as well as theory, data must be accorded due importance in the practice of scholarship and in the enduring scholarly record. In other words, data should be considered legitimate, citable products of research. Data citation, like the citation of other evidence and sources, is good research practice and is part of the scholarly ecosystem supporting data reuse.
Electronic content presented without formal ties to a publisher or sponsoring body has the authority equivalent to that of unpublished or self-published material in other media.
They also note that
Authors should note that anything posted on the internet is “published” in the sense of copyright and must be treated as such for the purposes of complete citation and clearance of permissions, if relevant.
Several standard citation options may be relevant for data citations:
When citing information from websites, including data downloaded from websites, use the general website citation style for data:
Note that this does NOT apply when the data have a permanent URL, a DOI, or a suggested citation!
CMOS has a recommendation for online databases:
which would be cited in the text as
NASA/IPAC Extragalactic Database.
The CMOS provides examples of how to cite supplementary materials that are attached to a specific article:
The AEA guidance used to provide an example, in which the citation links to the article landing page:
Note, however, that modern data citation guidance suggests that both the article and the data used by the article should be cited, and this can lead to confusion. With the 2019 move of the AEA to a data archive, the correct citation for the above supplement would be:
with the article also cited as:
The key to data citations is that the creator, the name, the location, and the date last accessed for a data source should be clear. This applies to online data, offline data, and physical data, whether the data sit in boxes, on tapes, or in a corporate database behind a firewall.
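The four elements named above can be checked mechanically before a citation is written. The following is a minimal sketch, not a prescribed tool: the `DataSource` class, its field names, and the placeholder values are all assumptions made for illustration.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class DataSource:
    """Hypothetical record holding the four elements a data citation needs."""
    creator: Optional[str] = None        # who produced the data
    name: Optional[str] = None           # title of the dataset
    location: Optional[str] = None       # URL, DOI, archive box, "unpublished data", ...
    last_accessed: Optional[str] = None  # date the data were last accessed

    def missing_elements(self) -> List[str]:
        """Return the names of elements that are absent or blank."""
        return [f.name for f in fields(self)
                if not (getattr(self, f.name) or "").strip()]


# Placeholder values only; this is not a real dataset.
source = DataSource(creator="Example Statistical Agency",
                    name="Example Register of Firms",
                    location="")
print(source.missing_elements())  # ['location', 'last_accessed']
```

A check of this kind works the same way for offline or confidential data, since the location is free text and does not need to be a URL.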
ICPSR notes that a citation should include the following items:
Note that all but the URN would apply also for an offline database. Consider the citation of objects in archives:
Often, the creator of a dataset is an organization. Just as an organization can be cited as the author of a work:
an organization can be cited as the creator of a dataset:
In many cases, the data are not distributed by the creator. The distributor then takes on the role of a publisher, as it would for a book. So if using Compustat through the Wharton Research Data Services, one might cite it as
If using the S&P 500 data, there may be multiple providers:
with hopefully the same content. Note that such data are often subject to copyright and redistribution restrictions (see the page at FRED on SP500).
In some cases, it isn’t clear when the dataset was published, though it may be clear what time period the dataset covers. One way to address this may be by using the “n.d.” abbreviation for the date of publication:
A related issue may arise when the dataset comprises multiple years, each of which has its own DOI. This happens, for instance, when accessing multiple years of American Community Survey data on ICPSR:
| Title | Year | ICPSR study | Date |
|---|---|---|---|
| American Community Survey (ACS): Public Use Microdata Sample (PUMS) | 2002 | ICPSR 3893 | |
| American Community Survey (ACS): Public Use Microdata Sample (PUMS) | 1998 | ICPSR 3888 | 2008-05-21 |
| American Community Survey (ACS): Public Use Microdata Sample (PUMS) | 1997 | ICPSR 3886 | 2008-05-21 |
| American Community Survey (ACS): Public Use Microdata Sample (PUMS) | 2003 | ICPSR 4117 | 2009-12-01 |
| American Community Survey (ACS): Public Use Microdata Sample (PUMS) | 2004 | ICPSR 4370 | 2008-10-14 |
| American Community Survey (ACS): Public Use Microdata Sample (PUMS) | 2009 | ICPSR 33802 | 2013-04-04 |
| American Community Survey (ACS): Public Use Microdata Sample (PUMS) | 2008 | ICPSR 29263 | 2011-11-08 |
| American Community Survey (ACS): Public Use Microdata Sample (PUMS) | 1996 | ICPSR 3885 | 2008-05-21 |
One approach to this is to create a composite citation, with additional information available in an online data appendix or a Data Availability Statement:
(and listing of exact DOIs in an appendix table).
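As a rough illustration of the composite approach, the years and ICPSR study numbers from the listing above can be collapsed into a compact year description for the single reference entry, while the per-year identifiers go into an appendix table. This is a sketch under stated assumptions: the function names and the output layout are hypothetical, and the exact DOIs are not reproduced here and would still need to be looked up for each study.

```python
from typing import Dict, Iterable, List

# Year -> ICPSR study number, transcribed from the listing above.
ACS_PUMS_STUDIES: Dict[int, int] = {
    1996: 3885, 1997: 3886, 1998: 3888, 2002: 3893,
    2003: 4117, 2004: 4370, 2008: 29263, 2009: 33802,
}


def collapse_years(years: Iterable[int]) -> str:
    """Collapse years into runs, so gaps in coverage stay visible."""
    ordered = sorted(set(years))
    if not ordered:
        return ""
    runs = []
    start = prev = ordered[0]
    for year in ordered[1:]:
        if year != prev + 1:          # a gap closes the current run
            runs.append((start, prev))
            start = year
        prev = year
    runs.append((start, prev))
    return ", ".join(str(a) if a == b else f"{a}-{b}" for a, b in runs)


def appendix_rows(studies: Dict[int, int]) -> List[str]:
    """One row per year for the appendix table (hypothetical layout)."""
    return [f"{year}\tICPSR {number}" for year, number in sorted(studies.items())]


print(collapse_years(ACS_PUMS_STUDIES))  # 1996-1998, 2002-2004, 2008-2009
print("\n".join(appendix_rows(ACS_PUMS_STUDIES)))
```

Collapsing into runs rather than a single first-to-last span avoids implying that years not actually used (here 1999 to 2001 and 2005 to 2007) were part of the analysis.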
Many datasets are available only under license, memorandum, contract, etc., and do not have a formal online presence. This is quite similar to traditional offline archives, for instance manuscript collections. For such collections, CMOS suggests:
and usage in the text as
Alvin Johnson, in a memorandum prepared sometime in 1937 (Kallen Papers, file 36), observed that …
Similar citations can be constructed for offline databases:
Similar forms may be used for confidential databases when no DOI exists:
where the data, in this case, were accessed via the “Department of Treasury,” acting as a secure distributor (of access, not downloads). If the same data had been accessed via a secure research data center, the reference should have instead noted that access mechanism:
If multiple databases within the same secure confines are used and combined, they should be cited separately, within reason. A useful test: can and do researchers combine various extracts in different ways? For instance, do some combine the IRS 1040 database with death records, while others merge elements from the IRS 1040 database with information returns? If so, the information returns and the 1040 file should be cited separately.
In some cases, governments maintain lists of their (named) registers. For instance, Statistics Denmark provides the full list of registers at http://www.dst.dk/extranet/forskningvariabellister/Oversigt%20over%20registre.html. These can be used to craft data citations, for instance
where the “author” is Statistics Denmark, but the “[publisher]” is the research service of Statistics Denmark. You should note the version (for instance, the current register goes through 2019, but you may have had access to an earlier version, so you should adjust accordingly). In the manuscript, you would then cite “Statistics Denmark (2020)”. If available, the README can point to the codebook for each register, e.g., https://www.dst.dk/extranet/ForskningVariabellister/DOD%20-%20D%C3%B8de%20i%20Danmark.html for the aforementioned “DOD” register. An example can be found in Fadlon and Nielsen (forthcoming as of June 2020).
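The roles described above (agency as author, research service as publisher, an explicit version) can be kept apart in a small record. The sketch below is illustrative only: the `RegisterCitation` class, its field names, and the layout of the reference string are assumptions, not a style this guidance or the CMOS prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegisterCitation:
    """Hypothetical record for citing a named government register."""
    author: str     # the agency that creates the register
    year: int       # year used in the in-text citation
    title: str      # name of the register
    version: str    # coverage of the extract actually accessed
    publisher: str  # the unit that provides access

    def in_text(self) -> str:
        """Author-date form for use in the manuscript."""
        return f"{self.author} ({self.year})"

    def reference(self) -> str:
        """Reference-list entry; this layout is an assumption."""
        return (f'{self.author}. {self.year}. "{self.title}" '
                f"[{self.version}]. {self.publisher} [publisher].")


dod = RegisterCitation(
    author="Statistics Denmark",
    year=2020,
    title="DOD register",
    version="through 2019",  # adjust to the version you actually had access to
    publisher="Research service of Statistics Denmark",
)
print(dod.in_text())  # Statistics Denmark (2020)
print(dod.reference())
```

Holding the version as its own field makes it harder to silently cite the current register when an earlier extract was used.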
If a DOI exists, the formal citation generated from that DOI should be used:
Not infrequently, data are accessed through informal means. The CMOS allows for citation of such information without inclusion in the references.
We deviate from that suggestion: we ask for inclusion in the reference list, and suggest using "unpublished data" as the locator, similar to a URN:
In some cases, the data provider (often a firm) must remain anonymous. This does not prevent citation, and the provider should be mentioned in much the same way as when there is no formal access mechanism: