Synthesis Example using Conditional Tabular GAN (CTGAN)¶

Unlike the example in R, this example uses a Python library that implements a few different generative models, including Generative Adversarial Networks (GANs): specifically, the Conditional Tabular GAN (CTGAN) and the Conditional Probabilistic Auto-Regressive (CPAR) model. This example focuses on CTGAN, which can be used to synthesize data in general and can also synthesize records for specific groups in a manner similar to SMOTE.

Differences in approaches¶

While the CART-based approach synthesizes a single variable at a time, some GAN-based methods synthesize the entire record (a vector of values) at once, and CTGAN is one of those methods. However, while it might seem like this would speed things up substantially, the computational cost of the underlying neural network architecture means it may not always play out that way in practice, especially if you are training on a CPU only.
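If you want to see this trade-off on your own hardware, a rough check is to train the same small model for a single epoch with and without the GPU and compare the wall-clock times. This is only a sketch: it assumes the sdv.tabular CTGAN API used later in this notebook and reuses the df DataFrame that is built in the first code cell below, so run it after that cell.

# Rough CPU-vs-GPU timing comparison (illustrative only, not part of the main workflow)
import time
from sdv.tabular import CTGAN

def time_one_epoch(data, use_gpu):
    # Default architecture, single epoch: enough to compare hardware, not to train a usable model
    model = CTGAN(cuda = use_gpu, epochs = 1)
    start = time.time()
    model.fit(data)
    return time.time() - start

# Uncomment once df is defined:
# cpu_secs = time_one_epoch(df.sample(n = 5000, random_state = 7779311), False)
# gpu_secs = time_one_epoch(df.sample(n = 5000, random_state = 7779311), True)
# print('CPU: {:.1f} seconds per epoch, GPU: {:.1f} seconds per epoch'.format(cpu_secs, gpu_secs))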

In [1]:
# Loads library to allow setting the pseudorandom number generator seed
import random

# Sets the pseudorandom number generator seed
random.seed(7779311)

# Load library to read the data
import pandas as pd 

# Columns from the Faketucky file that we will keep and their new names
cols = { 'sid': 'stdid', 'first_dist_code': 'distid', 'first_hs_code': 'schcd', 
        'first_hs_alt': 'altsch', 'first_hs_urbanicity': 'urbanicity', 
        'chrt_ninth': 'cohort', 'male': 'male', 'race_ethnicity': 'race', 
        'frpl_ever_in_hs': 'frleverhs', 'sped_ever_in_hs': 'swdeverhs', 
        'lep_ever_in_hs': 'eleverhs', 'gifted_ever_in_hs': 'tageverhs', 
        'ever_alt_sch_in_hs': 'alteverhs', 'scale_score_6_math': 'mthss6', 
        'scale_score_6_read': 'rlass6', 'scale_score_8_math': 'mthss8', 
        'scale_score_8_read': 'rlass8', 'pct_absent_in_hs': 'pctabshs', 
        'pct_excused_in_hs': 'pctexcusedhs', 'avg_gpa_hs': 'hsgpa', 
        'scale_score_11_eng': 'acteng11', 'scale_score_11_math': 'actmth11', 
        'scale_score_11_read': 'actrla11', 'scale_score_11_comp': 'actcmp11', 
        'collegeready_ever_in_hs': 'evercollrdyhs', 'careerready_ever_in_hs': 'evercarrdyhs', 
        'ap_ever_take_class': 'aptakenever', 'last_acadyr_observed': 'lastobsyr', 
        'transferout': 'transfer', 'dropout': 'dropout', 'still_enrolled': 'stillenrolled', 
        'ontime_grad': 'gradontime', 'chrt_grad': 'gradcohort', 'hs_diploma': 'diploma', 
        'enroll_yr1_any': 'yr1psenrany', 'enroll_yr1_2yr': 'yr1psenr2yr', 
        'enroll_yr1_4yr': 'yr1psenr4yr', 'enroll_yr2_any': 'yr2psenrany' }

# Load the data and rename the columns to shorter names
df1 = pd.read_stata('https://github.com/OpenSDP/faketucky/raw/master/faketucky.dta', columns = cols.keys()).rename(columns = cols)

# Create the combined school/district ID
df1['schid'] = df1['distid'].astype(str) + df1['schcd'].astype(str)

# Get a sample of school IDs
schids = df1['schid'].drop_duplicates().sample(n = 60, random_state = 7779311)

# Inner Join to select the subset of cases with the sampled school IDs
df = df1.merge(schids, how = 'inner', on = 'schid')

# Remove the school and district codes that created schid
df.drop(columns = ['distid', 'schcd'], inplace = True)

# Make sure a couple columns/variables are correctly typed for the synthesis software
df['urbanicity'] = df['urbanicity'].astype(str)
df['schid'] = df['schid'].astype(str)

# Show some of the data
df.head(20)

df.dtypes
Out[1]:
stdid              int32
altsch              int8
urbanicity        object
cohort             int16
male             float64
race              object
frleverhs        float64
swdeverhs           int8
eleverhs            int8
tageverhs           int8
alteverhs           int8
mthss6           float64
rlass6           float64
mthss8           float64
rlass8           float64
pctabshs         float64
pctexcusedhs     float64
hsgpa            float64
acteng11         float64
actmth11         float64
actrla11         float64
actcmp11         float64
evercollrdyhs       int8
evercarrdyhs        int8
aptakenever         int8
lastobsyr          int16
transfer            int8
dropout             int8
stillenrolled       int8
gradontime          int8
gradcohort       float64
diploma             int8
yr1psenrany      float64
yr1psenr2yr      float64
yr1psenr4yr      float64
yr2psenrany      float64
schid             object
dtype: object
In [2]:
# Load the torch library
import torch 

# Check to see if a GPU is available so you can use it later
gpu = torch.cuda.is_available()

# Check how many GPUs are available
ngpus = torch.cuda.device_count()

# For GPU setups
if ngpus >= 1:
    # Get the device properties for each GPU available
    for i in range(ngpus):
        print(torch.cuda.get_device_properties(i))

# Print the result so you can make sure the GPU is ID'd in case something goes wrong with CUDA
print('GPU Available = {0}\n# of GPUS = {1}'.format(gpu, ngpus))
_CudaDeviceProperties(name='Quadro RTX 5000', major=7, minor=5, total_memory=16117MB, multi_processor_count=48)
GPU Available = True
# of GPUS = 1
In [3]:
# If you have multiple GPUs here is where you can set which GPU to use based on the index
# starting from 0
torch.cuda.set_device(0)
In [4]:
# Import CTGAN from the Synthetic Data Vault library
from sdv.tabular import CTGAN

# Use this to set the number of epochs (passes over your data) for training
eps = 1

# You can import this module if you want to see the timing
import time

# This is where you can define/tune your specific architecture for your CTGAN
# See: https://sdv.dev/SDV/api_reference/tabular/api/sdv.tabular.ctgan.CTGAN.html#sdv.tabular.ctgan.CTGAN 
# for a description of all the parameters that are available.
# cuda = whether to use a GPU or not
# discriminator_steps = how many times the discriminator is updated before each generator update (typically 5 or 10)
# pac = # of samples grouped together when training the discriminator
# batch_size = How many observations from your dataset should be processed at a time
# verbose = whether to print updates during training indicating some type of progress
# embedding_dim = # of nodes to use for the embedding used for conditioning
# generator_dim = sizes of the generator's hidden layers, e.g. (#, #); each # is the number of nodes in that layer
# discriminator_dim = sizes of the discriminator's hidden layers, e.g. (#, #); each # is the number of nodes in that layer
# epochs = The # of passes over your dataset to train the model
# You define the architecture without any data. Once defined, you pass data to it using the fit() method.
ctmod = CTGAN(cuda = gpu,
              discriminator_steps = 5,
              pac = 50,
              batch_size = 4000,
              verbose = True,
              embedding_dim = 256,
              generator_dim = (512, 512, 256, 256, 128, 128),
              discriminator_dim = (1024, 512, 256, 128), 
              epochs = eps)

# Leave this uncommented if you want to time the training
start = time.time()

# This trains the model
ctmod.fit(df)

# Leave this uncommented if you want to time the training
end = time.time()

# Leave this uncommented if the timing stuff is uncommented to see how long it took
print('{} epochs took {} minutes to complete'.format(eps, (end - start)/60))
/home/billy/anaconda3/lib/python3.9/site-packages/rdt/transformers/numerical.py:100: UserWarning: No rounding scheme detected for column 'pctabshs'. Data will not be rounded.
  warnings.warn(
/home/billy/anaconda3/lib/python3.9/site-packages/rdt/transformers/numerical.py:100: UserWarning: No rounding scheme detected for column 'pctexcusedhs'. Data will not be rounded.
  warnings.warn(
/home/billy/anaconda3/lib/python3.9/site-packages/rdt/transformers/numerical.py:100: UserWarning: No rounding scheme detected for column 'hsgpa'. Data will not be rounded.
  warnings.warn(
Epoch 1, Loss G:  9.1588,Loss D: -15.8004
1 epochs took 0.5858280380566915 minutes to complete

Synthesizing Data¶

Once your model has been trained to your liking, which will definitely take more than a single epoch, you can begin synthesizing data from it. The code block below shows not only how to synthesize a whole sample, but also how you can use CTGAN to synthesize data for specific groups, like SMOTE but probably with greater fidelity.
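Because training a model to your liking can take many epochs, it is also worth saving the fitted model so you do not have to retrain it every session. This is a minimal sketch assuming the save() and load() methods that sdv.tabular models provide; the file name here is just an example.

# Save the trained model to disk so it can be reused later (file name is arbitrary)
ctmod.save('ctgan_faketucky.pkl')

# In a later session, reload it without retraining
# ctmod = CTGAN.load('ctgan_faketucky.pkl')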

Notes on Synthesizing Specific Groups¶

Like any data synthesis, you need to be careful and aware of what your protected data looks like with regard to the representation of your population of interest. Data synthesis isn't magic, nor can it infer what our intentions are. The process builds a mathematical model to best describe the distribution of the data we provide so it can sample from it. So, if you believe the data you have for some of your smaller groups of interest is a good approximation of what the data for that group would look like in repeated samples, you should be fine synthesizing from that group. If, however, you believe that the data you have for the group you are interested in is not a good representation of that group, I would recommend against synthesizing data for the group. That said, also keep in mind that the more data you use to train the model, the better the results will be. Even if your goal is only to synthesize data for a single group, include data from all of the other groups you have available, since there is additional information in the relationships between the variables that can improve the quality of the synthetic records for the group you are interested in.

In [5]:
# To generate synthetic data use the sample method
synthtest = ctmod.sample(330)

# Then you can save it to a file
synthtest.to_csv('synthtestCTGAN.csv')

# And delete the object to reduce memory consumption
del synthtest

# To sample a specific group, define the conditions
from sdv.sampling import Condition

# Define the groups you need synthetic records for
remoteaa = Condition(num_rows = 100, column_values = {'urbanicity': 'Rural: Remote', 'race': 'African-American'})
remotelx = Condition(num_rows = 100, column_values = {'urbanicity': 'Rural: Remote', 'race': 'Hispanic'})
fringeaa = Condition(num_rows = 100, column_values = {'urbanicity': 'Rural: Fringe', 'race': 'African-American'})
fringelx = Condition(num_rows = 100, column_values = {'urbanicity': 'Rural: Fringe', 'race': 'Hispanic'})

# Then use the sample_conditions() method to generate samples with the characteristics you need
condtest = ctmod.sample_conditions(conditions = [remoteaa, remotelx, fringeaa, fringelx])

# And you can save those data to a file as well:
condtest.to_csv('condtestCTGAN.csv')
Sampling conditions: 100%|█████████████████████████████████████████████| 400/400 [00:10<00:00, 39.43it/s]
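As a quick sanity check (not part of the original output), you can tabulate the conditional sample to confirm that each requested urbanicity/race combination came back with the number of rows you asked for.

# Count the synthesized rows for each requested urbanicity/race combination
print(condtest.groupby(['urbanicity', 'race']).size())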