
Data and Code Guidance by Data Editors

Guidance for authors wishing to create data and code supplements, and for replicators.

Frequently Asked Questions


… although some are not frequently asked, they might nevertheless be useful. The questions and answers below are in no particular order.

Do I need to release my entire code history?

We already work on GitHub. Is there any way to make just a “release” public on GitHub without making the entire code history and issue tickets public?

It depends on whether you are comfortable with that, and on whether this is one paper out of several stemming from the same project.

If not, then what I suggest is to do the following

(careful with subdirs)

My repo is complex - I only want to share a portion

Scenario: Author keeps several papers related to an ongoing or long-running project in a single repo, and wants to isolate code for tidy submission alongside an “offshoot paper”.

In this case, it has been suggested to use the git filter-branch approach, which is essentially a way to split off a subset of a repo (e.g. a subdirectory paper-1) into a new repo. The new repo will conveniently inherit the commit history as well (but see above if that is not desired). See the guide at https://help.github.com/en/articles/splitting-a-subfolder-out-into-a-new-repository.
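As a rough illustration of the approach, the sketch below wraps the `git filter-branch` call in Python. The function names, the subdirectory `paper-1`, and the branch name `main` are assumptions made for this example. Because the command rewrites history in place, it is assumed to run inside a fresh, throwaway clone, not in your working repo.

```python
import subprocess
from pathlib import Path


def split_subfolder_command(subfolder: str, branch: str) -> list[str]:
    """Build the git command that makes `subfolder` the new repository root.

    Only commits touching `subfolder` are kept on `branch`.
    """
    return [
        "git", "filter-branch",
        "--prune-empty",                      # drop commits left empty by the filter
        "--subdirectory-filter", subfolder,   # keep only this subdirectory
        branch,
    ]


def split_subfolder(clone_dir: Path, subfolder: str, branch: str) -> None:
    """Run the rewrite inside `clone_dir` (assumed to be a fresh clone)."""
    subprocess.run(
        split_subfolder_command(subfolder, branch),
        cwd=clone_dir,
        check=True,  # raise if git reports an error
    )


if __name__ == "__main__":
    # Hypothetical values: print the command without running it.
    print(" ".join(split_subfolder_command("paper-1", "main")))
```

After the rewrite, the clone can be pushed to a new, empty remote repository, as described in the linked guide.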

(Thanks to @MichaelChirico for suggesting)

Do you have any examples of data citation for proprietary data that does not have an online link? All the examples are much more formal than the random spreadsheets [agency 1] sent over! [agency 2] has a more formal naming system, but again, there’s no persistent indicator for the datasets because they are not accessible.

A known problem with no clear solution. Here’s how I try to approach the problem:

Note that Chicago-style does not actually require the locator information - see the examples for “E-book” Kindle.

“Electronic content presented without formal ties to a publisher or sponsoring body has the authority equivalent to that of unpublished or self-published material in other media.”

Thus, one can construct the citation as:
- Author: the provider of the data (the agency)
- Title: whatever the provider would recognize and re-furnish to a third party
- Publisher: the provider of the data (or its higher-level institution, e.g., the government department or ministry)
- Locators (optional): maybe something along the lines of “Multiple electronic files” with possible additional information “, provided to the author under DUA (date or any contract number)”

An example might thus be:

Agency 1. 2007. Name of DataSet. Boston: Commonwealth of Massachusetts. Multiple electronic files.
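To apply the same pattern consistently across several such datasets, the elements above can be assembled programmatically. The sketch below is only an illustration: the function name and its parameters are hypothetical, and the year and place of publication are taken from the example, not from the list of elements.

```python
from typing import Optional


def format_data_citation(
    author: str,
    year: int,
    title: str,
    place: str,
    publisher: str,
    locator: Optional[str] = None,
) -> str:
    """Assemble a citation for data that has no online link."""
    citation = f"{author}. {year}. {title}. {place}: {publisher}."
    if locator:
        # The locator is optional, e.g. "Multiple electronic files".
        citation += f" {locator}."
    return citation


if __name__ == "__main__":
    print(
        format_data_citation(
            author="Agency 1",
            year=2007,
            title="Name of DataSet",
            place="Boston",
            publisher="Commonwealth of Massachusetts",
            locator="Multiple electronic files",
        )
    )
```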

For more information, see the AEA Data Editor’s guidance on Data Citations.

The URL works for me. Why are you complaining that it is not robust/persistent/permanent?

This may happen for both data files and documents by non-standard publishers. We explain what this means, and various ways to deal with it.

A URL is simply a locator for a file on the internet. However, URLs are not all created equal. In particular, files on sites from anything other than robust institutions (archives, journals, newspapers) should be considered to be transitory: here today, gone tomorrow. This often applies even to big companies that are not in the business of publishing with the goal of long-term preservation. It also includes URLs that obviously point to storage providers: Dropbox.com, Google Drive, AWS, Github.io, etc.

For instance, the file https://s3.amazonaws.com/aws.upl/nwica.org/unitedstates2014.pdf is hosted on Amazon AWS, a commercial provider, presumably by the website of NWICA (nwica.org). If tomorrow NWICA decides to change suppliers for their web services, and migrates their website nwica.org to the site of myhoster.com, that URL will change. What is less likely to change is the original “landing page” from which “unitedstates2014.pdf” could be downloaded, though that can also change a few years from now. For preservation, the PDF could be deposited with an institution whose business it is to preserve files, such as Zenodo or Archive.org, where it becomes “permanently” available (or at least available for much longer).
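A quick way to screen a reference list for URLs of this kind is to check the host name against known storage providers. The sketch below is a heuristic only: the function name is hypothetical, and the host list is an assumption based on the providers named above (with `drive.google.com` assumed for Google Drive). A URL that passes the check is not thereby persistent.

```python
from urllib.parse import urlparse

# Hosts assumed to indicate plain file storage, not a preservation archive.
STORAGE_HOST_SUFFIXES = (
    "dropbox.com",
    "drive.google.com",
    "amazonaws.com",
    "github.io",
)


def looks_like_storage_url(url: str) -> bool:
    """Return True if the URL's host is, or is a subdomain of, a listed storage host."""
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == suffix or host.endswith("." + suffix)
        for suffix in STORAGE_HOST_SUFFIXES
    )


if __name__ == "__main__":
    for url in (
        "https://s3.amazonaws.com/aws.upl/nwica.org/unitedstates2014.pdf",
        "https://nwica.org/documents/unitedstates2014/",
    ):
        print(url, looks_like_storage_url(url))
```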

Solutions

NWICA. 2015. “How WIC Impacts the People of the United States of America.” Accessed at https://nwica.org/documents/unitedstates2014/ on August 2, 2017.

and provide a copy of the file, if copyright and license permit.

NWICA. 2015. “How WIC Impacts the People of the United States of America.” Accessed via Archive.org at https://web.archive.org/web/20200205043504/https://s3.amazonaws.com/aws.upl/nwica.org/unitedstates2014.pdf on February 4, 2020.
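The Archive.org URL in the citation above follows a regular pattern: a 14-digit capture timestamp followed by the original URL. The sketch below composes such a URL from its parts. The function name is hypothetical and the timestamp is assumed to be in UTC. The code only builds the string; it does not create or verify a snapshot.

```python
from datetime import datetime


def wayback_url(original_url: str, captured_at: datetime) -> str:
    """Compose a Wayback Machine URL for a snapshot taken at `captured_at`."""
    timestamp = captured_at.strftime("%Y%m%d%H%M%S")  # e.g. 20200205043504
    return f"https://web.archive.org/web/{timestamp}/{original_url}"


if __name__ == "__main__":
    print(
        wayback_url(
            "https://s3.amazonaws.com/aws.upl/nwica.org/unitedstates2014.pdf",
            datetime(2020, 2, 5, 4, 35, 4),
        )
    )
```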

Many databases with an online interface make it hard to find an easy-to-cite URL. Nevertheless, a few clicks can often show a referenceable URL. Here’s an example:

OECD Statistics

The top-level URL is https://stats.oecd.org/, and (almost) never changes. However, the User Guide, page 34, shows how to share a URL for a particular query. The usual citation rules for URLs can then be applied.

Some econometrics papers might be accompanied by (for example) an R or Stata package (perhaps published on CRAN or SSC). What about surfacing references to associated packages more prominently?

First, packages on CRAN and in the Statistical Software Components (SSC) can be cited. AEA citation guidance is currently silent on software components, but it is not wrong to cite them, and other disciplines do it regularly. CRAN in fact has elements of a “proper archive” (SSC does NOT). All R packages can generate a (BibTeX) citation.

Second, it is possible to submit such packages to various journals, where they are reviewed and published with a DOI:

I have been told by the Data Editor to remove PSID data from my submitted materials. What do I do?

Per the PSID website, authors are not allowed to post extracts of their data online. The reason is that any user needs to agree to the PSID terms of use before being given access to the data. However, the PSID has provided authors with the ability to deposit their data extracts and/or their derived data in a repository, precisely for the purpose of allowing for sharing in compliance with their Terms of use.

In order to comply with the PSID Terms of use, you should do the following:

I use confidential data. I am allowed to provide the data to the Data Editor for the purpose of replication, but you are not allowed to publish the data. How do I proceed?

First, all sharing - whether privately with us, or publicly through the data publication process - should be in compliance with all IRB rules, data use agreements, etc. We will never ask you to share data that you do not have the right to share with us or anybody else.

Second, there is a difference between sharing with us, and publishing the data. We can accept private data sharing for the purpose of replication, conduct our reproducibility checks, and delete the data provided. You are in control of the publication of any data (though it has happened that we have had to point out to authors that they do not, in fact, have the rights to publish data that they were going to publish).

Third, the inability to publish the data does not absolve you from creating an archive of the data as it was used for the article. This archive, for private/confidential/proprietary data, should remain private - on your own systems, or appropriate university archives. But it must exist, so that you can reliably answer queries from authors in future years.

How should you proceed?

The best way to think of this is as a set of layers. Your working directory WD, from which you derived the tables and figures in the paper, is composed of confidential data CD, non-confidential data NCD, and programs/code P (and possibly temporary files TF). So WD = CD + NCD + P + TF. For the purpose of replication archives, you should create two archives:
- A.zip, containing the non-confidential data and the programs (NCD + P)
- B.zip, containing the confidential data (CD)

You should then test: create an empty directory, unpack the two archives, and verify that they are sufficient:

(unzip A.zip) + (unzip B.zip) == NCD + P + CD == WD - TF
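This test can be partly automated. The sketch below compares the file names in the two archives with the files in the working directory. All function names are hypothetical, and it assumes that temporary files can be recognized by their suffix (`.tmp` and `.log` here). It compares names only, not contents, so it complements an actual re-run of the code in an empty directory and does not replace it.

```python
import zipfile
from pathlib import Path


def archive_members(zip_path):
    """Return the set of file paths stored in a zip archive (directories excluded)."""
    with zipfile.ZipFile(zip_path) as zf:
        return {info.filename for info in zf.infolist() if not info.is_dir()}


def working_dir_files(wd, temp_suffixes=(".tmp", ".log")):
    """Return relative paths of all files in WD, minus temporary files (WD - TF)."""
    wd = Path(wd)
    return {
        p.relative_to(wd).as_posix()
        for p in wd.rglob("*")
        if p.is_file() and p.suffix not in temp_suffixes
    }


def compare_archives_to_wd(zip_a, zip_b, wd, temp_suffixes=(".tmp", ".log")):
    """Check that (A.zip) + (B.zip) == WD - TF, by file name."""
    a = archive_members(zip_a)
    b = archive_members(zip_b)
    expected = working_dir_files(wd, temp_suffixes)
    return {
        "missing": sorted(expected - (a | b)),     # in WD, but in neither archive
        "unexpected": sorted((a | b) - expected),  # in an archive, but not in WD - TF
        "in_both": sorted(a & b),                  # the same path in A.zip and B.zip
    }
```

An empty list under each of the three keys suggests that the two archives together cover the working directory. Any path under `in_both` deserves a closer look, since a file that appears in both archives may be confidential data that has slipped into the public one.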

You should then import A.zip into the openICPSR archive, and ensure that B.zip is properly and securely archived, in compliance with all rules that you are subject to.

You can provide B.zip to us for the purpose of replication, but B.zip would not be published.

How can I ensure that the confidential data is preserved?

The most credible promise is a commitment from the provider of the confidential data to maintain the data for a number of years.

A second-best solution requires that you inspect your data use agreement (DUA). Can you archive the data in a robust fashion (using whatever tools your university has)? Some DUAs allow for that, others don’t. The promise then becomes that you or your university will guarantee persistence of the data for X years.

A better variant of that is when the DUA allows you to share the data with others who have also demonstrably signed a DUA with the provider. The data provider controls the access; you (or your archive) provide the data.