reshape wide can reduce the size of a long, sparse dataset. If the sparsity is due to values that can be inferred, reshaping long and dropping those records saves even more:
reshape long ..., i(idvariables)
drop if condition
save optimalSizeFile.dta, replace
Use the Python API to store data in compressed formats.
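For example, the optimized file can be round-tripped through pandas into a compressed, columnar format. A minimal sketch, assuming Stata 16+ with an attached Python that has pandas and pyarrow installed (file names follow the pseudocode above):
// Start the Python interpreter
python:
# Load pandas for reading/writing the data
import pandas as pd
# Read the optimized Stata file into a DataFrame
df = pd.read_stata("optimalSizeFile.dta")
# Write a gzip-compressed Parquet copy of the data
df.to_parquet("optimalSizeFile.parquet", compression="gzip")
# Exit the Python interpreter
end
Parquet stores the data column by column and compresses each column, which suits repeated IDs and small integer types.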
// Clear data from memory
clear
// Set the pseudorandom number seed
set seed 7779311
// Set a local for the number of observations
loc obs 1000000
// Set a local for the number of score variables
loc nscores 500
// Set the average number of scores observed for the observations
loc muscores 10
// Set the number of observations in the dataset
set obs `obs'
// Create an ID variable
g long id = _n
// Create a variable for the number of scores observed for each observation
g byte nscores = rpoisson(`muscores')
// Expand each id to one record per potential score variable; note that
// `nscores' here is the local (500), not the variable created above
expandcl `nscores', cl(id) gen(scidx)
// Create index within id
bys id: g int scoreidx = _n
// Create a random uniform to shuffle the scores within individuals
bys id: g double shuffle = runiform()
// Sort the data within id by the shuffle variable
sort id shuffle
// Assign a score to the first nscores (the variable) records within each individual
by id: g byte value = round(runiformint(0, 100), 10) if _n <= nscores
// Create an indicator for whether the individual has a specific score
g byte has = !mi(value)
// Drop the helper variables created for the expansion and the shuffle
drop scidx shuffle
// Save the version of the file that is long, but not fully optimized
qui: save sparseLong.dta, replace
// Size 5245.21M
// Memory 6208M
// Reshape the data into a common form that is extremely sparse
greshape wide value has, i(id) j(scoreidx) nochecks
// Save in this format
qui: save sparseWide.dta, replace
// Size 958.44M
// Memory 1248M
// Make long again
greshape long value has, i(id) j(scoreidx) nochecks
// Size 5245.21M
// Memory 6208M
// Keep only records that have data
keep if !mi(has, value)
// Size 104.91M
// Memory 192M
// Save a copy of the file with the redundant variable
qui: save denseWRedundancy.dta, replace
// Get rid of redundant indicator
drop has
// Size 95.37M
// Memory 160M
// Save with full optimization
qui: save denseWORedundancy.dta, replace
// See disk use:
! ls -ohU *se*.dta
sparseLong.dta 4.2G
sparseWide.dta 960M
denseWRedundancy.dta 105M
denseWORedundancy.dta 96M
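Dropping the records whose values can be inferred takes the file from 4.2G to 96M: roughly a 40-fold saving over the sparse long layout and a 10-fold saving over the wide one.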
String Storage
strL can reduce both disk storage and memory use. In both cases, the savings depend on the length of the strings and on how often values repeat.
// Define a Mata function used to create random strings
// Start the Mata interpreter
mata:
// Clear anything from Mata's memory
mata clear
// Does not return anything
void rstring(real scalar obs, real scalar strlen, string scalar varnm) {
// Declare variable to store the ASCII values of the random string
real matrix idchars
// Declare variable to store the string from the ASCII codes in idchars
string matrix ids
// Declares a variable to iterate over the matrices
real scalar i
// Creates a matrix of random ints used to create mappings to ASCII
// with dimensions of observations by number of characters in the ID
idchars = runiformint(obs, strlen, 48, 123)
// Creates the matrix to store the ID strings
ids = J(obs, 1, "")
// Iterate over the rows of the ID chars matrix
for(i = 1; i <= obs; i++) {
// Convert the row vector into a string scalar and store it
ids[i, 1] = char(idchars[i, .])
} // End loop over the matrix
// Store the ID strings in the Stata dataset
st_sstore(., varnm, ids)
} // End Mata Function definition
// End the Mata interpreter
end
// Loop over values of string lengths
foreach i in 10 13 44 45 55 {
// Clear data from memory
clear
// Set the pseudorandom number seed
set seed 7779311
// Set a local for the number of observations
loc obs 1000000
// Set a local for the length of the string ID; the lengths tested bracket the
// points where strL becomes more efficient
loc strlen `i'
// Set the average number of repeated observations per individual
loc mureps 8
// Set the number of observations in the dataset
set obs `obs'
// Create the container for the IDs
g str`strlen' id = ""
// Populate the ID variable using the Mata function defined above
mata: rstring(`obs', `strlen', "id")
// Create the number of repeated observations per individual
g byte iobs = rpoisson(`mureps')
// Expand the dataset to create the repeated observations
expandcl iobs, cl(id) gen(newcl)
// Drop the variables that aren't needed
drop newcl iobs
// Create a time variable with the sequence of records per individual
bys id: g byte time = _n
// Store the example dataset
save savedAsstr`strlen'.dta, replace
// Get memory report
memory
// sort the dataset
sort id time
// Recast the ids to strLs
recast strL id
// Store the strL version of the dataset
qui: save savedAsstr`strlen'L.dta, replace
// Get the memory report after recasting to strL
memory
// See if compress demotes the strL back to a fixed-width str#
// (coalescing strLs is compress's default behavior)
compress
} // End loop over string lengths
// String Length | Recasted Type
// 10 | str10
// 13 | str13
// 44 | str44
// 45 | strL
// 55 | strL
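With compress's defaults, strings up to 44 characters wide are demoted back to fixed-width str# types, while at 45 characters and longer the strL is kept.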
// Get the file sizes for each of the two files
! ls -ohU savedAsstr*.dta
savedAsstr10.dta 84M
savedAsstr10L.dta 99M
savedAsstr13.dta 107M
savedAsstr13L.dta 102M
savedAsstr44.dta 344M
savedAsstr44L.dta 131M
savedAsstr45.dta 352M
savedAsstr45L.dta 132M
savedAsstr55.dta 428M
savedAsstr55L.dta 142M
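On disk, the strL versions begin to pay off around length 13 and, by length 55, take roughly a third of the space of their fixed-width counterparts.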
Inefficient Data Typing
Always specify the storage type when creating variables: g byte adummy = mi(brains)
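To see what the default type costs, compare an indicator stored with the default type (float) against one stored as a byte. A minimal sketch (variable names are illustrative):
// Clear data from memory
clear
// Set the number of observations
set obs 1000000
// Default storage type (float): 4 bytes per observation
g untyped = runiform() < 0.5
// Explicit byte: 1 byte per observation
g byte typed = runiform() < 0.5
// Compare the storage types reported for the two variables
describe untyped typed
// compress can demote untyped variables after the fact, at the cost of a pass over the data
compress untyped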