Sunday, 21 February 2021

Fake data with R

Testing and validating applications or data products requires data that is not always made available to us. As an alternative to real data, we can generate fake or synthetic data in a format similar to the real data.

With R we can generate data vectors that follow specific distributions, using functions such as rnorm, rexp, rpois, runif, rmultinom and sample.
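As a minimal sketch of how these base R functions can be combined into a small fake dataset: the column names, parameter values, category labels and the seed below are illustrative assumptions, not taken from any real data.

```r
# Fix the seed so the fake data is reproducible between runs.
set.seed(2021)

n <- 100  # number of fake records (arbitrary choice)

# Column names and distribution parameters are hypothetical.
fake <- data.frame(
  id     = seq_len(n),
  height = rnorm(n, mean = 170, sd = 10),  # normal distribution
  wait   = rexp(n, rate = 1 / 5),          # exponential, mean of 5
  visits = rpois(n, lambda = 3),           # Poisson counts
  score  = runif(n, min = 0, max = 1),     # uniform between 0 and 1
  group  = sample(c("A", "B", "C"), n, replace = TRUE,
                  prob = c(0.5, 0.3, 0.2)) # categorical, unequal weights
)

# rmultinom returns a matrix with one column per draw; this is a single
# draw that splits n items across three categories.
counts <- rmultinom(1, size = n, prob = c(0.5, 0.3, 0.2))

str(fake)
head(fake)
```

Each column here is drawn independently; when the variables need to be correlated, the packages listed next are the better fit.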

But we can also turn to packages developed specifically to generate fake or synthetic data:

bindata - Generation of correlated artificial binary data.

MultiOrd - A method for multivariate ordinal data generation given marginal distributions and correlation matrix based on the methodology proposed by Demirtas (2006).

PoisBinOrdNonNor - Generation of a chosen number of count, binary, ordinal, and continuous random variables, with specified correlations and marginal properties.

simstudy - a collection of functions that allow users to generate simulated data sets in order to explore modeling techniques or better understand data generating processes. The user defines the distributions of individual variables, specifies relationships between covariates and outcomes, and generates data based on these specifications.

wakefield - designed to quickly generate random data sets.

rcorpora - a collection of datasets

charlatan - makes fake data, inspired by and borrowing some code from Python's faker

fakir - The goal of {fakir} is to provide fake datasets that can be used to teach R.

fabricatr - helps researchers imagine what data will look like before they collect it.

GenOrd - Simulation of Discrete Random Variables with Given Correlation Matrix and Marginal Distributions

SimMultiCorrData - Simulation of Correlated Data with Multiple Variable Types

synthesis - Generate Synthetic Data from Statistical Models

conjurer - A Parametric Method for Generating Synthetic Data

sdglinkage - generate synthetic datasets using different approaches

sim.survdata - Simulating duration data for the Cox proportional hazards model. Generating survival data.

survsim - Simulation of simple and complex survival data including recurrent and multiple events and competing risks

synthpop - generating synthetic versions of sensitive microdata for statistical disclosure control

datasynthR - Functions to procedurally generate synthetic data in R for testing and collaboration.

fakeR - Simulates Data from a Data Frame of Different Variable Types

SynthTools - Tools and Tests for Experiments with Partially Synthetic Data Sets

OpenSDPsynthR - A project to generate realistic synthetic unit-level longitudinal education data to empower collaboration in education analytics.

sgr - Sample Generation by Replacement

humanleague - Synthetic Population Generator




Some links worth a look: R-Vogg-Blog; UNT; R Views; R-bloggers; Data from GANs