Working Efficiently with Stata in Shared Computing Environments

Billy Buchanan
Senior Research Scientist
SAG Corporation
Slides:
https://wbuchanan.github.io/stataConference2022

Shared Computing Constraints

  • Disk
  • Compute
  • Memory
  • People

Disk/Memory Issues

  • Sparse Matrices
  • String Storage
  • Inefficient Data Typing

Sparse Matrices

  • reshape wide can shrink long datasets that store sparse data as explicit missing records.
  • If the sparsity is due to an inferable value (e.g., missing means not observed):
    1. reshape long ..., i(idvariables)
    2. drop if condition
    3. save optimalSizeFile.dta, replace
  • Use the Python API to store data in compressed formats.
							
								// Clear data from memory
								clear

								// Set the pseudorandom number seed
								set seed 7779311

								// Set a local for the number of observations
								loc obs 1000000

								// Set a local for the number of score variables
								loc nscores 500

								// Set the average number of scores observed for the observations
								loc muscores 10

								// Set the number of observations in the dataset
								set obs `obs'

								// Create an ID variable
								g long id = _n

								// Create a variable for the number of scores observed for each observation
								g byte nscores = rpoisson(`muscores')

								// Expand the dataset to create the score variable indices
								expandcl `nscores', cl(id) gen(scidx)

								// Create index within id
								bys id: g int scoreidx = _n

								// Create a random uniform to shuffle the scores within individuals
								bys id: g double shuffle = runiform()

								// Sort the data within id by the shuffle variable
								sort id shuffle

								// Create a new variable with a score for nscores scores within each individual
								by id: g byte value = round(runiformint(0, 100), 10) if _n <= nscores

								// Create an indicator for whether the individual has a specific score
								g byte has = !mi(value)

								// Drop the helper variables: the expandcl cluster ID and the shuffle key
								drop scidx shuffle

								// Save the version of the file that is long, but not fully optimized
								qui: save sparseLong.dta, replace

								// Size 5245.21M
								// Memory 6208M

								// Reshape the data into a common form that is extremely sparse
								greshape wide value has, i(id) j(scoreidx) nochecks

								// save in this format
								qui: save sparseWide.dta, replace

								// Size 958.44M
								// Memory 1248M

								// Make long again
								greshape long value has, i(id) j(scoreidx) nochecks

								// Size 5245.21M
								// Memory 6208M

								// Keep only records that have data
								keep if !mi(has, value)

								// Size 104.91M
								// Memory 192M

								// Save a copy of the file with the redundant variable
								qui: save denseWRedundancy.dta, replace

								// Get rid of redundant indicator
								drop has

								// Size 95.37M
								// Memory 160M

								// Save with full optimization
								qui: save denseWORedundancy.dta, replace

								// See disk use:
								! ls -ohU *se*.dta
								// sparseLong.dta        4.2G
								// sparseWide.dta        960M
								// denseWRedundancy.dta  105M
								// denseWORedundancy.dta 96M
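
The Python API bullet above could look something like the following sketch. It assumes Stata 16 or later with a working Python integration. The auto dataset and the file name auto.csv.gz are illustrative choices, not part of the original example, and only the Python standard library is used.

```stata
// Load an example dataset (illustrative; any dataset in memory would do)
sysuse auto, clear

python:
import csv
import gzip
from sfi import Data

# Collect the variable names to use as the header row
names = [Data.getVarName(i) for i in range(Data.getVarCount())]

# Pull all observations; Stata missing values are returned as None
rows = Data.get(missingval=None)

# Write a gzip-compressed CSV (hypothetical file name)
with gzip.open("auto.csv.gz", "wt", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(names)
    writer.writerows(rows)

end
```

A columnar format such as Parquet would likely compress better, but it needs a third-party Python package, which this sketch deliberately avoids.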
							
						

String Storage

  • strL can help with storage.
  • strL can help with memory too.
  • Both cases depend on string length and repeated values.
							
								// Defines a mata function used to create random strings
								// Start mata interpreter
								mata:

									// Clear anything from Mata's memory
									mata clear

									// Does not return anything
									void rstring(real scalar obs, real scalar strlen, string scalar varnm) {

										// Declare variable to store the ASCII values of the random string
										real matrix idchars

										// Declare variable to store the string from the ASCII codes in idchars
										string matrix ids

										// Declares a variable to iterate over the matrices
										real scalar i

										// Creates a matrix of random ints used to create mappings to ASCII
										// with dimensions of observations by number of characters in the ID
										idchars = runiformint(obs, strlen, 48, 123)

										// Creates the matrix to store the ID strings
										ids = J(obs, 1, "")

										// Iterate over the rows of the ID chars matrix
										for(i = 1; i <= obs; i++) {

											// Convert the row vector into a string scalar and store it
											ids[i, 1] = char(idchars[i, .])

										} // End loop over the matrix

										// Store the ID strings in the Stata dataset
										st_sstore(., varnm, ids)

									} // End Mata Function definition

								// End the Mata interpreter
								end

								// Loop over values of string lengths
								foreach i in 10 13 44 45 55 {

									// Clear data from memory
									clear

									// Set the pseudorandom number seed
									set seed 7779311

									// Set a local for the number of observations
									loc obs 1000000

									// Set a local for the length of the string ID; 15 seems to be the point
									// where the strL is more efficient
									loc strlen `i'

									// Set the average number of repeated observations per individual
									loc mureps 8

									// Set the number of observations in the dataset
									set obs `obs'

									// Create the container for the IDs
									g str`strlen' id = ""

									// Populate the ID variable using the Mata function defined above
									mata: rstring(`obs', `strlen', "id")

									// Create the number of repeated observations per individual
									g byte iobs = rpoisson(`mureps')

									// Expand the dataset to create the repeated observations
									expandcl iobs, cl(id) gen(newcl)

									// Drop the variables that aren't needed
									drop newcl iobs

									// Create a time variable with the sequence of records per individual
									bys id: g byte time = _n

									// Store the example dataset
									save savedAsstr`strlen'.dta, replace

									// Get memory report
									memory

									// sort the dataset
									sort id time

									// Recast the ids to strLs
									recast strL id

									// Store the compressed dataset
									qui: save savedAsstr`strlen'L.dta, replace

									// Get the memory report after recasting to strL
									memory

									// See if compress recasts the strL
									compress, coalesce

								} // End loop over string lengths

								// String Length  | Recasted Type
								// 10             | str10
								// 13             | str13
								// 44             | str44
								// 45             | strL
								// 55             | strL

								// Get the file sizes for each pair of files (str# vs. strL)
								! ls -ohU savedAsstr*.dta
								// savedAsstr10.dta  84M
								// savedAsstr10L.dta 99M
								// savedAsstr13.dta  107M
								// savedAsstr13L.dta 102M
								// savedAsstr44.dta  344M
								// savedAsstr44L.dta 131M
								// savedAsstr45.dta  352M
								// savedAsstr45L.dta 132M
								// savedAsstr55.dta  428M
								// savedAsstr55L.dta 142M
							
						

Inefficient Data Typing


  • Always specify types at creation:
    g byte adummy = mi(brains)
  • See help data types.
  • Use compress afterwards just to be sure.
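
A minimal sketch of the difference, using the auto dataset that ships with Stata. The variable names norep and expensive are made up for illustration.

```stata
// Load an example dataset
sysuse auto, clear

// Explicit type: one byte per observation
g byte norep = mi(rep78)

// No type given: generate uses the default type (float unless changed with set type)
g expensive = price > 6000

// Compare the storage types of the two indicators
describe norep expensive

// compress demotes expensive to the smallest type that holds its values
compress expensive
describe norep expensive
```

Typing at creation avoids the temporary memory overhead; compress only repairs it after the fact.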

Workflow and Documentation

  • Data Documentation and Organization
  • Code Organization

Data Documentation and Organization

  • Rising, B. (2007): self-validating datasets.
  • Use dataset characteristics to store metadata:
    • Primary Key - char _dta[pk] "id year industry"
    • Verify primary key using isid
    • Foreign Key - char _dta[fk_xyz.dta] "m:1 year industry"
  • Take full advantage of variable/value labels, notes, and characteristics.
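
One way the key metadata could be put to work is sketched below. The toy data, the tempfile, and the characteristic name fk_rates are assumptions made for illustration; only the pk/fk convention itself comes from the bullets above.

```stata
// Build a small lookup table keyed by year and industry (toy data)
clear
input int year byte industry double rate
2020 1 0.05
2020 2 0.07
2021 1 0.04
2021 2 0.06
end

// Record the primary key in the dataset, then verify it
char _dta[pk] "year industry"
isid `: char _dta[pk]'
tempfile rates
save `rates'

// Build the main file (toy data) and document both of its keys
clear
input long id int year byte industry
1 2020 1
1 2021 1
2 2020 2
2 2021 2
end
char _dta[pk] "id year"
char _dta[fk_rates] "m:1 year industry"
isid `: char _dta[pk]'

// Read the foreign key back and let it drive the merge
local fk : char _dta[fk_rates]
gettoken mtype keys : fk
merge `mtype' `keys' using `rates', assert(match) nogenerate
```

Because characteristics are saved with the .dta file, the merge type and key variables travel with the data instead of living only in a script.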

Code Organization

  • Use version control systems to avoid filepath shenanigans.
  • Prefer functionally named .ado files over many ##*.do named scripts.
  • Gould (2010): subroutines are not used enough.
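
As a sketch of what a functionally named .ado file with a subroutine might look like: the command name checkpk and the subroutine getkey are hypothetical, and the file assumes a primary key has been stored in _dta[pk].

```stata
*! checkpk.ado: verify the primary key stored in _dta[pk] (hypothetical example)
program define checkpk
    version 15

    // Delegate the lookup to a subroutine and copy its result right away
    getkey
    local key `r(key)'

    // isid exits with an error if the key does not uniquely identify records
    isid `key'
    display as text "primary key verified: `key'"
end

// Subroutine: returns the stored key in r(key) or fails with an error
program define getkey, rclass
    version 15
    local key : char _dta[pk]
    if `"`key'"' == "" {
        display as error "no primary key stored in _dta[pk]"
        exit 198
    }
    return local key `key'
end
```

Saved as checkpk.ado on the adopath, this can be called as checkpk from any project without hard-coded file paths.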

Compute Consumption

  • Compute Efficient Commands
  • Monitoring and Modifying Consumption
  • Maximizing Workflow Efficiency

Compute Efficient Commands

  • ftools from Correia offers Mata-based variants of several native Stata commands.
  • gtools from Caceres Bravo offers C-plugin-based variants of many native Stata commands as well.

  Command  | Options  | Shape | Time
  reshape  | N/A      | wide  | 9,743.43
  reshape  | N/A      | long  | 8,876.64
  greshape | N/A      | wide  |   137.44
  greshape | N/A      | long  |    46.19
  greshape | nochecks | wide  |   138.70
  greshape | nochecks | long  |    45.34
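
A comparison like the one in the table could be timed along the following lines. This is a small-scale sketch on simulated data, not the benchmark that produced the numbers above; the dataset dimensions and variable names are assumptions, and the greshape step only runs if gtools is installed.

```stata
// Simulate a small long panel: 10,000 ids with 5 waves each
clear
set seed 7779311
set obs 10000
g long id = _n
expand 5
bys id: g byte wave = _n
g double score = rnormal()

// Reset all timers
timer clear

// Time the native command
preserve
timer on 1
reshape wide score, i(id) j(wave)
timer off 1
restore

// Time the gtools variant only if it is available
capture which greshape
if _rc == 0 {
    preserve
    timer on 2
    greshape wide score, i(id) j(wave)
    timer off 2
    restore
}

// Timer 1 = reshape, timer 2 = greshape
timer list
```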

Monitoring and Modifying Consumption

  • Use set segmentsize to trade compute time against memory use.
  • StataOS returns results to Stata from shell commands you issue.
  • StataOS was created in response to a Statalist post about monitoring available memory on Unix.
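
A minimal sketch of inspecting and changing the segment size. The value 64m is an arbitrary illustration, not a recommendation; a sensible setting depends on the data and on how much memory the other users of the machine need.

```stata
// Start from an empty session so the setting is changed with no data in memory
clear all

// Show the current memory settings, including segmentsize
query memory

// Request larger segments: fewer allocations, but memory is claimed in bigger steps
set segmentsize 64m

// Confirm the new setting (reported in bytes)
display c(segmentsize)

// Load data and inspect how much memory is actually used and allocated
sysuse auto
memory
```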

Final Thoughts/Ideas

  • Memory estimation prefix for commands.
  • Automatic adjustment of memory segment size.
  • String data optimization/efficiency improvements.