Testing and validating applications or data products requires data that is not always available to us. As an alternative to real data, we can generate fake data or synthetic data with a format similar to that of the real data.
With R we can generate data vectors that follow specific distributions, using functions such as rnorm, rexp, rpois, runif, rmultinom, sample...
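As a rough illustration of the base R functions mentioned above, here is a minimal sketch that combines several of them into a small fake data frame. The column names, distribution parameters, and the `fake_customers` object are all hypothetical, chosen only for this example; in practice you would pick them to mimic the shape of your real data.

```r
# Minimal sketch: build a fake "customers" table with base R only.
# All names and parameters below are illustrative assumptions.
set.seed(42)  # fix the seed so the fake data are reproducible

n <- 100  # number of fake records

fake_customers <- data.frame(
  id        = seq_len(n),                              # sequential identifier
  age       = round(runif(n, min = 18, max = 80)),     # uniform ages, rounded
  income    = rnorm(n, mean = 30000, sd = 8000),       # normally distributed income
  wait_days = rexp(n, rate = 1 / 5),                   # exponential waiting time, mean 5
  n_orders  = rpois(n, lambda = 3),                    # Poisson counts
  segment   = sample(c("A", "B", "C"), size = n,       # categorical variable
                     replace = TRUE, prob = c(0.5, 0.3, 0.2))
)

# rmultinom() returns counts per category instead of one label per record:
# a single draw that splits the n records across three categories.
segment_counts <- rmultinom(1, size = n, prob = c(0.5, 0.3, 0.2))

str(fake_customers)
```

Note that these columns are generated independently of each other. When correlations between variables matter, a dedicated package is usually the better option.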
But we can also turn to packages developed to generate fake or synthetic data:
bindata - Generation of correlated artificial binary data.
MultiOrd - A method for multivariate ordinal data generation given marginal distributions and correlation matrix based on the methodology proposed by Demirtas (2006).
PoisBinOrdNonNor - Generation of a chosen number of count, binary, ordinal, and continuous random variables, with specified correlations and marginal properties.
simstudy - a collection of functions that allow users to generate simulated data sets in order to explore modeling techniques or better understand data generating processes. The user defines the distributions of individual variables, specifies relationships between covariates and outcomes, and generates data based on these specifications.
wakefield - designed to quickly generate random data sets.
rcorpora - a collection of datasets
charlatan - makes fake data, inspired from and borrowing some code from Python's faker
fakir - The goal of {fakir} is to provide fake datasets that can be used to teach R.
fabricatr - helps researchers imagine what data will look like before they collect it.
GenOrd - Simulation of Discrete Random Variables with Given Correlation Matrix and Marginal Distributions
SimMultiCorrData - Simulation of Correlated Data with Multiple Variable Types
synthesis - Generate Synthetic Data from Statistical Models
conjurer - A Parametric Method for Generating Synthetic Data
sdglinkage - generate synthetic dataset using different approaches
sim.survdata (a function in the coxed package) - Simulating duration data for the Cox proportional hazards model. Generating survival data.
survsim - Simulation of simple and complex survival data including recurrent and multiple events and competing risks
synthpop - generating synthetic versions of sensitive microdata for statistical disclosure control
datasynthR - Functions to procedurally generate synthetic data in R for testing and collaboration.
fakeR - Simulates Data from a Data Frame of Different Variable Types
SynthTools - Tools and Tests for Experiments with Partially Synthetic Data Sets
OpenSDPsynthR - A project to generate realistic synthetic unit-level longitudinal education data to empower collaboration in education analytics.
sgr - Sample Generation by Replacement
humanleague - Synthetic Population Generator
Some links worth a look: R-Vogg-Blog; UNT; R Views; R-bloggers; Data from GANs