Guidance on Synthetic and Fake Data
Some journals require that fake or synthetic data be provided in order to verify basic functionality of the code. This page provides some guidance on how to do this.
- One option is to create simulated data that simply demonstrates that the estimation code runs. This is basic statistics: if you assume your data are normally distributed, generate normal draws with the same mean and variance (or whatever moments are relevant) and run your algorithm on them. This is easy when all you have is a simple OLS regression, and harder for more complex models.
- The other option is to start from the actual data and perturb or otherwise synthesize it. This is primarily a data-privacy question and can be complex. Combined with sampling, it may be sufficient, but never attempt this without first checking with your data provider whether the result is considered sufficient and privacy-preserving.
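A minimal sketch of the first approach, assuming a single continuous regressor and a linear model. The sample size, moments, and coefficients below are placeholders; in practice you would match the moments of your real data and run your actual estimation code on the simulated draws:

```python
import numpy as np

rng = np.random.default_rng(12345)  # fixed seed so the package is reproducible

n = 1_000
# Placeholder moments: in practice, match the mean/variance of your real regressor.
x = rng.normal(loc=2.0, scale=1.5, size=n)

# Generate the outcome from an assumed linear model with normal errors.
beta0, beta1, sigma = 1.0, 0.5, 1.0
y = beta0 + beta1 * x + rng.normal(scale=sigma, size=n)

# Run the same OLS routine you apply to the real data.
X = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # estimates should be close to (1.0, 0.5)
```

Because the data-generating process is known, the check is simply that the estimation code runs end to end and recovers the parameters you put in.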
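The second approach can be sketched as follows, using a hypothetical two-column dataset (the column names and noise levels are illustrative only). Note that subsampling, noising, and shuffling on their own are NOT a privacy guarantee; this is exactly the point at which you must consult your data provider:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Stand-in for the confidential data (hypothetical columns).
real = pd.DataFrame({
    "wage": rng.lognormal(mean=3.0, sigma=0.5, size=500),
    "age": rng.integers(18, 65, size=500),
})

fake = real.copy()
# 1. Subsample, so no released row maps one-to-one to a real observation.
fake = fake.sample(frac=0.5, random_state=0).reset_index(drop=True)
# 2. Add multiplicative noise to continuous variables.
fake["wage"] *= rng.normal(loc=1.0, scale=0.05, size=len(fake))
# 3. Shuffle a column independently, breaking the joint distribution.
fake["age"] = rng.permutation(fake["age"].to_numpy())

print(fake.describe())
```

The resulting file preserves the structure (columns, types, rough marginals) of the original data, which is what reproducibility checks need, while deliberately destroying row-level and joint information.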
From the Econometric Society Data Editor’s FAQ
Extracted from https://www.econometricsociety.org/publications/es-data-editor-website/FAQs
If my restricted-access data provider has a public use testing sample available, can I provide this sample instead of a simulated/synthetic dataset?
If a public use testing sample is available, it is generally preferred over a simulated/synthetic dataset (but less preferred than providing temporary access to the original data) as long as the testing sample can be published with your replication package. If the sample cannot be published, it is advisable to provide a simulated/synthetic dataset that can be included in the package.
What is the procedure followed by the Econometric Society if I supply simulated/synthetic datasets?
In the case of simulated or synthetic datasets, they will be published along with the replication package. Although these datasets may not represent the actual data, their structure is designed to closely mimic the original dataset, providing readers with a better understanding of the data used. You must ensure that the manipulation process used to create the synthetic/simulated datasets is clearly described in the README file.
Why am I requested to supply simulated/synthetic data?
When reproducibility checks cannot be performed on real data, running these checks on simulated/synthetic datasets still offers advantages. They help verify the completeness, self-contained nature, and error-free execution of the code. For future users of the package, it also allows them to run the codes and learn from them.
My article estimates a non-linear model, and the algorithm does not converge with randomly generated data. What should I do?
In such cases, it is strongly recommended to simulate data using your model as the data generating process. If that is not feasible, it is important to contact the Data Editor and provide a detailed explanation of why this is the case. The Data Editor will assist you in finding a suitable solution and may propose alternative approaches to handle the situation to your managing co-editor.
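As a sketch of "simulate data using your model as the data generating process," consider a logit model as an illustrative stand-in for a nonlinear estimator (the parameter values are placeholders; in practice you would use your own estimated coefficients). Data drawn from the model itself give the optimizer a well-behaved problem, which is why convergence is much more likely than with arbitrary random data:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)

# Simulate covariates and outcomes from the logit model itself,
# using placeholder "estimated" parameters.
n = 2_000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
beta_true = np.array([-0.5, 1.2])
p = 1.0 / (1.0 + np.exp(-X @ beta_true))
y = rng.binomial(1, p)

def neg_loglik(beta):
    eta = X @ beta
    # Logit log-likelihood: sum of y*eta - log(1 + exp(eta)),
    # written with logaddexp for numerical stability.
    return -np.sum(y * eta - np.logaddexp(0.0, eta))

# Re-estimate on the simulated data; should recover beta_true approximately.
res = minimize(neg_loglik, x0=np.zeros(2), method="BFGS")
print(res.x)
```

Publishing the simulation script alongside the estimation code also documents, in the README's spirit, exactly how the synthetic dataset was produced.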
How do I decide whether to produce a simulated or a synthetic dataset?
To generate a dataset that closely resembles the original data, the synthetic option may be easier to implement. There are various open-source routines available that can assist in this process. However, consider two main disadvantages: (i) one needs to ensure proper anonymization of the data when using a scrambling/perturbation algorithm, and (ii) non-linear estimation routines may face convergence challenges when applied to synthetic data, whereas artificial datasets generated by the model being estimated are more likely to converge.
Useful Links
(None of these links have been tested by us.)
Synthpop
- https://doi.org/10.18637/jss.v074.i11
- https://doi.org/10.29012/jpc.v7i3.407 (Journal of Privacy and Confidentiality)