Reading an arbitrary number of files into Stata made easy

Billy Buchanan, Ph.D.
Director, Office of Grants, Research, Accountability, & Data
Fayette County Public Schools

https://wbuchanan.github.io/stataConference2020-readit

Motivation

Stata users regularly look for solutions to reading and appending multiple files efficiently.

Bautista (2004) tried to find a solution when files used standardized naming.

Boyle (2013) tried to store multiple file names in a .txt file in order to merge/append files.

Picard (2014) proposed a solution using a combination of two programs he developed (filelist and runby) to handle cases where the data are all in the same format.

Rising (2020) also provided an elegant one line solution when all files are Stata formatted files.

Each of the solutions in those threads has its own advantages/disadvantages.

But they all share at least one constraint due to the functionality of append:

appending files requires them to already be in Stata's file format.

If the files aren't already Stata files, you also face I/O performance bottlenecks from converting each one before appending.

And while frames supports joins/merges, it doesn't support union/append operations at this time.

These workflows are also likely constrained by operating serially, one file at a time.

Introducing readit

The introduction of the Python API in Stata 16 provides many unique opportunities for the Stata community.

The pandas library is capable of reading/writing myriad file formats used in data science/analytics.

The pandas library also contains a lot of useful functionality for data munging.

The pandas library operates on objects in memory.
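As a minimal sketch of the idea readit builds on (the DataFrames here are hypothetical stand-ins for files on disk), pandas reads each format into an in-memory DataFrame, and pd.concat performs the union/append that frames currently lack:

```python
import pandas as pd

# Two small DataFrames standing in for files read from different
# formats (pd.read_csv, pd.read_spss, pd.read_stata, etc. each
# return a DataFrame just like these).
a = pd.DataFrame({"id": [1, 2], "price": [4099, 4749]})
b = pd.DataFrame({"id": [3], "price": [3799]})

# pd.concat appends the frames row-wise, regardless of which
# file format each one originally came from.
stacked = pd.concat([a, b], ignore_index=True)
print(stacked)
```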

Dependencies

Package       Version    Used for
-----------   --------   ------------------------------------------------
pandas        ≥1.0.0     All core functionality
pyarrow       ≥0.17.1    I/O of Apache Arrow/Parquet files
fastparquet   ≥0.4.0     I/O of Parquet files
xlrd          ≥1.2.0     I/O of MS Excel files
pyreadstat    ≥1.0.0     I/O of SPSS and SAS files
h5py          ≥2.10.0    I/O of Hierarchical Data Format (HDF5) files
Cython        ≥0.21      I/O of Hierarchical Data Format (HDF5) files
numexpr       ≥2.7.1     I/O of Hierarchical Data Format (HDF5) files
tables        ≥3.6.1     I/O of Hierarchical Data Format (HDF5) files

Installing dependencies:

    pip install 'pandas>=1.0.0'
    pip install 'pyarrow>=0.17.1'
    pip install 'fastparquet>=0.4.0'
    pip install 'xlrd>=1.2.0'
    pip install 'pyreadstat>=1.0.0'
    pip install 'h5py>=2.10.0'
    pip install 'Cython>=0.21'
    pip install 'numexpr>=2.7.1'
    pip install 'tables>=3.6.1'

Note: The pip binary must be on your path and should be the pip used by the Python installation that Stata references.
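One quick sanity check: run this in a `python:` block inside Stata and compare the path it prints with the executable reported by Stata's `python query` — installing with `python -m pip install ...` from that same interpreter guarantees pip and Stata agree on the environment:

```python
import sys

# The path of the interpreter currently running. If this matches the
# Python that Stata reports via `python query`, then installing with
# `python -m pip install 'pandas>=1.0.0'` (and so on) targets the
# environment Stata will actually use.
print(sys.executable)
```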

File Type Inference

Extension            File Type
------------------   --------------------
.dta                 Stata
.xls(x)              MS Excel
.csv                 Comma Delimited
.pkl/.pickle         Pickle
.sas7bdat/.xport     SAS
.tab/.tsv            Tab Delimited
.txt/.dat            Fixed-Width
.json                JSON
.html                HTML
.feather             Apache Arrow/Feather
.parquet             Parquet
.h5                  HDF5
.sav                 SPSS
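The table above amounts to a lookup from extension to pandas reader. A hypothetical sketch of that dispatch (readit's actual internals may differ, and only a subset of extensions is shown):

```python
import os

import pandas as pd

# Map file extensions to the pandas reader that handles them,
# mirroring a subset of the inference table above.
READERS = {
    ".dta": pd.read_stata,
    ".csv": pd.read_csv,
    ".json": pd.read_json,
    ".pkl": pd.read_pickle,
    ".sav": pd.read_spss,
    ".feather": pd.read_feather,
    ".parquet": pd.read_parquet,
}


def infer_reader(path):
    """Return the pandas reader matching the file's extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in READERS:
        raise ValueError(f"No reader registered for extension: {ext}")
    return READERS[ext]
```

Lower-casing the extension means `auto.CSV` and `auto.csv` dispatch to the same reader.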

Disclaimer:

This is a very early version of the program and work is already underway to refactor a non-trivial amount of the codebase.

Installation:

    net inst readit, replace                   ///
        from("https://wbuchanan.github.io/readit")

    /*
      Note: When specifying the file(s), I strongly recommend
      using double quotes to enclose the full string and using
      individual apostrophe/single quote characters to delimit
      each individual file/path within.

      Clear data from memory then read and append all SPSS files
      in a directory
    */
    readit "'test/*.sav'", c

    /*
      Read all the files starting with auto in the directory &
      then rename the gear_ratio variable after all the files
      are appended.
    */
    readit "'test/auto*'", ren('gear_ratio': 'gratio') c

    /*
      Read all of the Excel, SPSS, and Feather files in the directory,
      rename multiple columns, and pass arguments to the pandas API
      for CSV and SPSS files.
    */
    readit "'test/*.xlsx', 'test/*.feather', 'test/*.sav'", ///
        ren('src2' : 'backupsrc', 'gear_ratio' : 'gearRatio', 'datatype': 'filetype') ///
        "sep = ',', na_values = '', convert_categoricals = False"

Challenges/Limitations

  • Dependency checks use Python to determine whether a package of the same name is available, but they do not return the package's __version__ string, so the installed version cannot be verified.
  • Datetime variables have not been tested.
  • Products like StatTransfer can create issues when they attempt to optimize the storage of the data on disk.

Roadmap

Adding additional parsing capabilities to provide file specific flexibility when used interactively.

Depending on Zhao Xu's talk on the 30th, developing unit tests for the program.

Distribution of the Python code via PyPI/conda.

Using a JSON schema to define/validate a configuration file that can be passed as a single argument.

Concurrent (i.e., parallel) processing of file reads.

Configuration File Based Operation:

    {
        "renameAll": {"var1": "timestamp", "var2": "firmid", "var3": "stockprice"},
        "globalFileType": "",
        "globalOptions": {},
        "files": [
            {
                "filenm": "/Users/user/Desktop/myMessyData/file1.sas7bdat",
                "ext": ".sas7bdat",
                "filetype": "sas",
                "rename": {},
                "options": {}
            },
            {
                "filenm": "/Users/user/Desktop/myMessyData/file2.feather",
                "ext": ".feather",
                "filetype": "feather",
                "rename": {"xyz": "var3", "abc": "var1", "lmno": "var2"},
                "options": {}
            },
            {
                "filenm": "/Users/user/Desktop/myMessyData/file3.csv",
                "ext": ".csv",
                "filetype": "csv",
                "rename": {},
                "options": {
                    "sep": ",;",
                    "names": ["var1", "var2", "var3"],
                    "parse_dates": true
                }
            }
        ]
    }

If anyone is interested in collaborating on this, feel free to let me know.