Center for Education Policy Research

Have Your Cake & Synthesize it Too

Billy Buchanan, Ph.D.

Senior Research Scientist, SAG Corporation
https://wbuchanan.github.io/sdpConvening2023

Get Ready for a Game

  • Get a blank index card and something to draw with
  • Draw your best example of the object listed on the index card in the middle of the table (or in your group)
  • Once everyone has created their example, collect all the examples and put the topic index card on top

Key Objectives

  1. Distinguish between user- and model-based synthesis
  2. Gain familiarity with some current synthesis methods
  3. Understand the applications of synthesis methods

But first, some questions about the problems we face...

What are the biggest barriers you face in sharing data with external stakeholders?

How do you currently analyze data for your smallest demographic groups?

Synthetic Data: The Origin Stories

In the beginning, there was imputation...

  • Statistical agencies want to share data broadly while protecting privacy
  • First references to synthetic data come from Statistical Disclosure Control
  • Initially, synthetic data are a form of multiply imputed data
  • Brought to you by the makers of Multiple Imputation: Donald Rubin

and then computer scientists came along...

  • Computer vision researchers find they need more image data
  • Early methods are based on lossy compression techniques
  • In 2014, Goodfellow and his colleagues develop Generative Adversarial Networks (GAN)
  • Now in addition to images, GANs are synthesizing tabular data

What is synthetic data?

  • Data created by a computer
  • Can be created by:
    • User instructions/specifications
    • Models fitted to observed data
  • It is not the synthetic control method used in causal inference
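
The two routes listed above can be contrasted with a minimal sketch. Everything here is an assumption made for illustration: the "observed" scores are simulated stand-ins rather than real records, and a normal distribution is used only because it keeps the example short.

```python
import random
from statistics import mean, stdev

rng = random.Random(42)

# Hypothetical "observed" test scores standing in for real records.
observed = [rng.gauss(250, 20) for _ in range(500)]

# User-defined synthesis: the analyst writes down the distribution directly,
# without looking at the observed data.
user_defined = [rng.gauss(240, 25) for _ in range(500)]

# Model-based synthesis: estimate parameters from the observed data,
# then sample new values from the fitted model.
mu, sigma = mean(observed), stdev(observed)
model_based = [rng.gauss(mu, sigma) for _ in range(500)]

print(round(mean(observed), 1), round(mean(user_defined), 1), round(mean(model_based), 1))
```

The user-defined values reflect only what the analyst specified, while the model-based values track the observed data as closely as the fitted model allows.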

Model-Defined Approaches

Shallow Learning

  • Most common/popular - Classification & Regression Trees (CART)
  • First variable is sampled from the observed data; each remaining variable is predicted, in sequence, from a model fitted on the variables before it
  • CART is prone to overfitting
  • As the number of variables increases, performance can degrade
  • Requires reshaping longitudinal data to wide format
[Figure: how the shallow learning methods generate synthetic data]
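
The sequential idea described above can be sketched in plain Python. This is an illustrative toy, not the code from the talk and not any package's implementation: the tree fitter is a bare-bones regression tree, the column names (grade, score) and the simulated data are hypothetical, and drawing a donor value from the matching leaf is assumed here as one common way CART synthesizers generate values.

```python
import random
from statistics import mean

def sse(ys):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not ys:
        return 0.0
    m = mean(ys)
    return sum((y - m) ** 2 for y in ys)

def fit_tree(rows, y, min_leaf=5):
    """Fit a small regression tree; each leaf keeps its observed y values."""
    best = None
    for j in range(len(rows[0])):
        for t in sorted({r[j] for r in rows}):
            left = [i for i, r in enumerate(rows) if r[j] <= t]
            right = [i for i, r in enumerate(rows) if r[j] > t]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            cost = sse([y[i] for i in left]) + sse([y[i] for i in right])
            if best is None or cost < best[0]:
                best = (cost, j, t, left, right)
    # Stop when no admissible split reduces the error.
    if best is None or best[0] >= sse(y):
        return {"leaf": list(y)}
    _, j, t, left, right = best
    return {
        "col": j,
        "thr": t,
        "lo": fit_tree([rows[i] for i in left], [y[i] for i in left], min_leaf),
        "hi": fit_tree([rows[i] for i in right], [y[i] for i in right], min_leaf),
    }

def draw(tree, row, rng):
    """Walk to the leaf for `row`, then sample one observed value from it."""
    while "leaf" not in tree:
        tree = tree["lo"] if row[tree["col"]] <= tree["thr"] else tree["hi"]
    return rng.choice(tree["leaf"])

def synthesize(data, n, seed=0, min_leaf=5):
    """Sequential synthesis: column 0 is resampled, the rest come from trees."""
    rng = random.Random(seed)
    # First variable: sample with replacement from the observed values.
    synth = [[rng.choice(data)[0]] for _ in range(n)]
    for k in range(1, len(data[0])):
        # Fit column k on the observed columns that precede it.
        tree = fit_tree([tuple(r[:k]) for r in data], [r[k] for r in data], min_leaf)
        # Predict column k from the already-synthesized columns.
        for s in synth:
            s.append(draw(tree, tuple(s), rng))
    return synth

# Hypothetical observed data: (grade, score) pairs.
data_rng = random.Random(1)
observed = []
for _ in range(200):
    grade = data_rng.randint(3, 8)
    observed.append((grade, 200 + 10 * grade + data_rng.gauss(0, 5)))

synthetic = synthesize(observed, n=50)
```

Because each leaf stores observed values, every synthetic value in this sketch is a copy of some real value, which is one reason the privacy and utility trade-off needs to be assessed rather than assumed.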

Deep Learning

Considerations

  • Privacy Protection vs Data Utility
  • Cross-Sectional vs Longitudinal Data
  • Time Sensitivity/Computational Resources
  • Validation Server

Who is using synthetic data?

Agency | Synthetic Product
Census Bureau | SIPP Synthetic Beta
Census Bureau | Synthetic Longitudinal Business Database
CMS | Medicare Claims SynPUF
IRS | Low-Income Info Returns SynPUF
IRS | Individual Tax Payer SynPUF
Veterans' Affairs | PseudoVet
Veterans' Health Administration | MDClone
Maryland SLDS | See Bonnéry et al. (2019)

Want to play a game?

The phases of the game

  1. Draw phase
  2. Evaluate phase
  3. Telephone phase

How GANs work

[Figure: process of data flow for Generative Adversarial Networks]
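
For reference, the adversarial setup can be summarized by the minimax objective from the 2014 paper by Goodfellow and colleagues. The notation below follows that paper rather than these slides: G is the generator, D is the discriminator, x is a real record, and z is random noise fed to the generator.

```latex
\min_{G} \max_{D} \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_{z}(z)}\left[\log\left(1 - D(G(z))\right)\right]
```

The discriminator tries to make V large by telling real records from generated ones, while the generator tries to make V small by producing records the discriminator cannot distinguish from real data.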

Code Examples

CART-Based Synthesis

CTGAN-Based Synthesis

CPAR-Based Synthesis

You've Got Questions,

I've (hopefully) Got The Answers

Thank You


wbuchanan@sagcorp.com

Slides & Materials Available at:
github.com/wbuchanan/sdpConvening2023
