Longitudinal Synthesis Example¶

This example shows how to use the Conditional Probabilistic Auto-Regressive (CPAR) model to synthesize longitudinal data. To make sure everyone can work with the same data, we'll use publicly available school-level data from the Urban Institute's Education Data Portal.

The Sample¶

The Civil Rights Data Collection (CRDC) data in the Urban Institute's Education Data Portal is organized by year and topic. I've found an endpoint that will help create a relatively short panel (e.g., 3 waves) spanning four years. While it is possible to use other methods to generate short panels, it becomes challenging if there are missing data that need to be addressed or if the panel is unbalanced (e.g., 3 periods of observation for some schools and fewer for others). A major advantage of CPAR is that it is fully capable of creating synthetic data for both balanced and unbalanced panels without any additional effort on your part.
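
To make the distinction concrete, here is a minimal sketch (using a made-up toy panel, not the CRDC data) of how you might check whether a panel is balanced by counting the waves each unit contributes:

In [ ]:
import pandas as pd

# Toy panel: school 'A' appears in all three waves, school 'B' in only two (unbalanced)
toy = pd.DataFrame({'ncessch': ['A', 'A', 'A', 'B', 'B'],
                    'year': [2013, 2015, 2017, 2013, 2015]})

# Count the number of waves observed per school
waves = toy.groupby('ncessch')['year'].nunique()

# The panel is balanced only if every school contributes the same number of waves
print(waves)
print('Balanced panel = {}'.format(waves.nunique() == 1))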

In [1]:
from urllib.request import urlopen
from json import loads
import pandas as pd

# API Endpoint for the CRDC Restraint and Seclusion Instances Data
url = "https://educationdata.urban.org/api/v1/schools/crdc/restraint-and-seclusion/"

# API Endpoint for the CCD Directory Data
demog = "https://educationdata.urban.org/api/v1/schools/ccd/directory/"

# Define the value labels for the categorical variables from the CCD directory we will keep
valueLabels = { 'school_type': {1: 'Regular school', 2: 'Special education school', 3: 'Vocational school', 4: 'Other/alternative school', 5: 'Reportable program', -1: 'Missing/not reported', -2: 'Not applicable', -3: 'Suppressed data'}, 
               'elem_cedp': {0: 'No', 1: 'Yes', -1: 'Missing/not reported', -2: 'Not applicable', -3: 'Suppressed data'}, 
               'middle_cedp': {0: 'No', 1: 'Yes', -1: 'Missing/not reported', -2: 'Not applicable', -3: 'Suppressed data'}, 
               'high_cedp': {0: 'No', 1: 'Yes', -1: 'Missing/not reported', -2: 'Not applicable', -3: 'Suppressed data'}, 
               'title_i_eligible': {0: 'No', 1: 'Yes', -1: 'Missing/not reported', -2: 'Not applicable', -3: 'Suppressed data'}, 
               'title_i_schoolwide': {0: 'No', 1: 'Yes', -1: 'Missing/not reported', -2: 'Not applicable', -3: 'Suppressed data'}, 
               'charter': {0: 'No', 1: 'Yes', -1: 'Missing/not reported', -2: 'Not applicable', -3: 'Suppressed data'}, 
               'magnet': {0: 'No', 1: 'Yes', -1: 'Missing/not reported', -2: 'Not applicable', -3: 'Suppressed data'}, 
               'lunch_program': {0: 'No', 1: 'Yes participating without using any Provision or the CEP', 2: 'Yes under the Community Eligibility Provision (CEP)', 3: 'Yes under Provision 1', 4: 'Yes under Provision 2', 5: 'Yes under Provision 3', -1: 'Missing/not reported'}}

# Mapping to recode missing values for the CRDC data, missing and N/A values will be treated as 0, while suppressed 
# data will be coded as missing; this should allow the model to learn that suppressed data are different from the other
# two classes of data.
crdcMap = { 'instances_mech_restraint': { -1: 0, -2: 0, -3: None}, 'instances_phys_restraint': { -1: 0, -2: 0, -3: None}, 'instances_seclusion': { -1: 0, -2: 0, -3: None}}

# The list of variables from the CCD directory to retain
ccdvars = ['year', 'ncessch', 'leaid', 'fips', 'school_type', 'elem_cedp', 'middle_cedp', 'high_cedp', 'title_i_eligible', 'title_i_schoolwide', 'charter', 'magnet', 'teachers_fte', 'lunch_program', 'direct_certification', 'enrollment']

# Defines a helper function that retrieves, cleans, and merges a single year of data
def getyear(year):
    # Only retrieves counts for all students with disabilities (disability code 99)
    discipline = urlopen(url + str(year) + '/instances/?disability=99')
    # Decodes the byte stream and parses the json into a Python object
    dis = loads(discipline.read())
    # Retrieves the school directory data
    schinfo = urlopen(demog + str(year) + '/')
    # Decodes the byte stream and parses the json into a Python object
    demo = loads(schinfo.read())
    # Converts the Python object into a deduplicated Pandas data frame and recodes the variables named in crdcMap
    disp = pd.DataFrame(dis['results']).drop_duplicates(subset = ['year', 'ncessch', 'leaid', 'fips'], keep = 'first', ignore_index = True).replace(crdcMap)
    # Converts the Python object into a deduplicated Pandas data frame and recodes the variables named in valueLabels
    demographics = pd.DataFrame(demo['results'])[ccdvars].drop_duplicates(subset = ['year', 'ncessch', 'leaid', 'fips'], keep = 'first', ignore_index = True).replace(valueLabels)
    # Keeps only school records that have some enrollment
    demographics = demographics.loc[demographics['enrollment'] > 0]
    # Joins the data using the NCES School Identifier, year, LEA ID, and FIPS number
    return disp.merge(demographics, on = ['ncessch', 'year', 'leaid', 'fips'], how = 'left')

# Combine all years of data into a single data frame
df = pd.concat([getyear(i) for i in [2013, 2015, 2017]], ignore_index = True).drop(columns = 'crdc_id')
In [2]:
# Show some of the data
df.head(10)
Out[2]:
ncessch year fips leaid disability instances_mech_restraint instances_phys_restraint instances_seclusion school_type elem_cedp middle_cedp high_cedp title_i_eligible title_i_schoolwide charter magnet teachers_fte lunch_program direct_certification enrollment
0 010000201705 2013 1 0100002 99 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 010000201706 2013 1 0100002 99 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2013 1 99 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 010000500870 2013 1 0100005 99 0.0 0.0 0.0 Regular school No Yes No Yes Yes Not applicable No 33.0 NaN NaN 632.0
4 010000500871 2013 1 0100005 99 0.0 0.0 0.0 Regular school No No Yes No Not applicable Not applicable No 64.0 NaN NaN 1117.0
5 010000500879 2013 1 0100005 99 0.0 0.0 0.0 Regular school Yes No No Yes Yes Not applicable No 31.0 NaN NaN 679.0
6 010000500889 2013 1 0100005 99 0.0 0.0 0.0 Regular school Yes No No Yes Yes Not applicable No 43.0 NaN NaN 780.0
7 010000501616 2013 1 0100005 99 0.0 0.0 0.0 Regular school Yes No No Yes Yes Not applicable No 27.0 NaN NaN 476.0
8 010000502150 2013 1 0100005 99 0.0 0.0 0.0 Regular school Yes No No Yes Yes Not applicable No 56.0 NaN NaN 1029.0
9 010000600193 2013 1 0100006 99 0.0 0.0 0.0 Regular school No Yes No Yes Yes Not applicable No 23.0 NaN NaN 439.0
In [3]:
# Load the torch library
import torch 

# Check whether a GPU is available so you can use it later
gpu = torch.cuda.is_available()

# Check how many GPUs are available
ngpus = torch.cuda.device_count()

# For GPU setups; device_count() returns an int, so there is no need to check for None
if ngpus >= 1:
    # Get the device properties for each GPU available
    for i in range(ngpus):
        print(torch.cuda.get_device_properties(i))

# Print the result so you can make sure the GPU is identified in case something goes wrong with CUDA
print('GPU Available = {0}\n# of GPUS = {1}'.format(gpu, ngpus))
_CudaDeviceProperties(name='Quadro RTX 5000', major=7, minor=5, total_memory=16117MB, multi_processor_count=48)
GPU Available = True
# of GPUS = 1
In [4]:
# If you have multiple GPUs here is where you can set which GPU to use based on the index
# starting from 0
torch.cuda.set_device(0)

Slightly Older SDV Library¶

To avoid potential version issues, I am presenting the synthesis example using the version of the SDV library I have loaded on my machine. However, I am also providing some example syntax for the current API at the end of this notebook to illustrate some of the differences.
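
If you are unsure which API your installation exposes, a quick check of the installed version can save some confusion (a small sketch; nothing here pins the exact release where the API changed):

In [ ]:
# Print the installed SDV version; older releases expose sdv.timeseries.PAR,
# while the current API exposes sdv.sequential.PARSynthesizer
import sdv
print(sdv.__version__)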

In [5]:
# This is how it worked previously:
# Entity columns identified the unit of observation
entity_columns = ['ncessch']

# Context columns identified time-invariant variables
context_columns = ['leaid']

# The sequence index identified the time variable, the same thing it identifies in the metadata object now
sequence_index = 'year'
In [6]:
# Imports the module for CPAR
from sdv.timeseries import PAR

# Imports the time module so we can time things
import time

# Define the number of epochs to train the model
eps = 1

# Creates an instance of the CPAR model
cpar = PAR(cuda = True,
           context_columns = context_columns,
           entity_columns = entity_columns,
           sequence_index = sequence_index,
           verbose = True,
           epochs = eps)

# Start the clock
stime = time.time()

# Train the model for eps epochs
cpar.fit(df)

# Stop the timer
etime = time.time()

# Show how long it took to train for eps epochs
print('Took {} minutes to complete {} epochs'.format((etime - stime)/60, eps))
Epoch 1 | Loss 0.0006023218156769872: 100%|████████████████████████████████| 1/1 [01:31<00:00, 91.35s/it]
Took 2.0508145729700726 minutes to complete 1 epochs
In [7]:
# Generate synthetic data for 250 schools (entities) from the model
synthdf = cpar.sample(250)

# Save the synthetic data
synthdf.to_csv('cparSynthetic.csv')
100%|██████████████████████████████████████████████████████████████████| 250/250 [00:07<00:00, 35.25it/s]
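
As a quick sanity check (a sketch, not part of the original output, assuming the sampled frame carries the ncessch entity column as the older API does), you can compare the wave structure of the synthetic panel to the real one; CPAR should reproduce the mix of balanced and unbalanced sequences:

In [ ]:
# Number of waves per school in the observed panel
print(df.groupby('ncessch').size().value_counts())

# Number of waves per synthetic school; the mix of balanced and unbalanced sequences should look broadly similar
print(synthdf.groupby('ncessch').size().value_counts())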

Updates to the SDV Library¶

In the current release of the SDV, metadata must be provided when defining the model. The metadata identifies which variables are time-invariant and which are time-varying, how time is indexed, and how the units of observation are identified. However, there are a few time-saving options you can use to define the metadata, which I'll show below. Note: there will likely be errors when running the example below.

In [ ]:
# Loads the metadata module
from sdv.metadata import SingleTableMetadata

# Instantiates a SingleTableMetadata class object
metadata = SingleTableMetadata()

# Quick way to start:
metadata.detect_from_dataframe(data = df)

# To look at the metadata
metadata.to_dict()

# Validate the metadata
metadata.validate()

# Updating metadata
metadata.update_column(column_name = 'ncessch', sdtype = 'id', regex_format = '[0-9]{12}')
metadata.update_column(column_name = 'charter', sdtype = 'categorical')

# If you're not sure about the numeric type you can always default to 'Float', but if you know how much storage
# the data would need you can specify that type to hopefully optimize memory consumption a bit.
metadata.update_column(column_name = 'enrollment', sdtype = 'numerical', computer_representation = 'Int32')

# IMPORTANT FOR LONGITUDINAL SYNTHESIS!!!!!! This identifies your observations.
metadata.set_sequence_key(column_name = 'ncessch')

# IMPORTANT FOR LONGITUDINAL SYNTHESIS!!!!!! This identifies your measure of time.
metadata.set_sequence_index(column_name = 'year')
In [ ]:
# This is an example of how you would do this with the newest version of the library

# Import CPAR from the Synthetic Data Vault library
from sdv.sequential import PARSynthesizer

# Create the synthesizer; context_columns plays the same role it did in the older API
cpar = PARSynthesizer(metadata, context_columns = context_columns)

# Train the model
cpar.fit(df)

# Synthesize data for 2,000 schools; in the current API, sample() takes a number of sequences (entities) rather than rows
synthdf = cpar.sample(num_sequences = 2000)
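
Assuming the fit and sampling complete without errors, you can inspect and save the synthetic data the same way as before (a short sketch mirroring the earlier cell):

In [ ]:
# Save the synthetic data, as in the earlier example
synthdf.to_csv('cparSynthetic.csv')

# Inspect the first few synthetic records
synthdf.head()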