Reading an arbitrary number of files into Stata made easy

Billy Buchanan, Ph.D.
Director, Office of Grants, Research, Accountability, & Data
Fayette County Public Schools

https://wbuchanan.github.io/stataConference2020-readit

Motivation

Stata users regularly look for solutions to reading and appending multiple files efficiently.

Bautista (2004) tried to find a solution when files used standardized naming.

Boyle (2013) tried to store multiple file names in a .txt file in order to merge/append files.

Picard (2014) proposed a solution using a combination of two programs he developed (filelist and runby) to handle cases where the data are all in the same format.

Rising (2020) also provided an elegant one line solution when all files are Stata formatted files.

Each of the solutions in those threads has its own advantages/disadvantages.

But they all share at least one constraint due to the functionality of append:

appending files requires them to already be in Stata's file format.

If the files aren't already Stata files, you also face I/O performance bottlenecks from converting each one before appending.

And while frames supports joins/merges, it doesn't support union/append operations at this time.

These workflows are also likely constrained by operating serially, one file at a time.

Introducing readit

The introduction of the Python API in Stata 16 provides many unique opportunities for the Stata community.

The pandas library is capable of reading/writing myriad file formats used in data science/analytics.

The pandas library also contains a lot of useful functionality for data munging.

The pandas library operates on objects in memory.
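As a minimal sketch of the idea readit builds on (the DataFrames here are hypothetical stand-ins for files on disk), pandas reads each format into an in-memory DataFrame, and pd.concat performs the union/append that frames currently lack:

```python
import pandas as pd

# Two small DataFrames standing in for files read from different
# formats (pd.read_csv, pd.read_spss, pd.read_stata, etc. each
# return a DataFrame just like these).
a = pd.DataFrame({"id": [1, 2], "price": [4099, 4749]})
b = pd.DataFrame({"id": [3], "price": [3799]})

# pd.concat appends the frames row-wise, regardless of which
# file format each one originally came from.
stacked = pd.concat([a, b], ignore_index=True)
print(stacked)
```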

Dependencies

Package       Version    Used for
-----------   --------   ------------------------------------------------
pandas        ≥1.0.0     All core functionality
pyarrow       ≥0.17.1    I/O of Apache Arrow/Parquet files
fastparquet   ≥0.4.0     I/O of Parquet files
xlrd          ≥1.2.0     I/O of MS Excel files
pyreadstat    ≥1.0.0     I/O of SPSS and SAS files
h5py          ≥2.10.0    I/O of Hierarchical Data Format (HDF5) files
Cython        ≥0.21      I/O of Hierarchical Data Format (HDF5) files
numexpr       ≥2.7.1     I/O of Hierarchical Data Format (HDF5) files
tables        ≥3.6.1     I/O of Hierarchical Data Format (HDF5) files

Installing dependencies:

    pip install 'pandas>=1.0.0'
    pip install 'pyarrow>=0.17.1'
    pip install 'fastparquet>=0.4.0'
    pip install 'xlrd>=1.2.0'
    pip install 'pyreadstat>=1.0.0'
    pip install 'h5py>=2.10.0'
    pip install 'Cython>=0.21'
    pip install 'numexpr>=2.7.1'
    pip install 'tables>=3.6.1'

Note: The pip binary must be on your path and should be the pip used by the Python installation that Stata references.
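One quick sanity check: run this in a `python:` block inside Stata and compare the path it prints with the executable reported by Stata's `python query` — installing with `python -m pip install ...` from that same interpreter guarantees pip and Stata agree on the environment:

```python
import sys

# The path of the interpreter currently running. If this matches the
# Python that Stata reports via `python query`, then installing with
# `python -m pip install 'pandas>=1.0.0'` (and so on) targets the
# environment Stata will actually use.
print(sys.executable)
```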

File Type Inference

Extension            File Type
------------------   --------------------
.dta                 Stata
.xls(x)              MS Excel
.csv                 Comma Delimited
.pkl/.pickle         Pickle
.sas7bdat/.xport     SAS
.tab/.tsv            Tab Delimited
.txt/.dat            Fixed-Width
.json                JSON
.html                HTML
.feather             Apache Arrow/Feather
.parquet             Parquet
.h5                  HDF5
.sav                 SPSS
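The table above amounts to a lookup from extension to pandas reader. A hypothetical sketch of that dispatch (readit's actual internals may differ, and only a subset of extensions is shown):

```python
import os

import pandas as pd

# Map file extensions to the pandas reader that handles them,
# mirroring a subset of the inference table above.
READERS = {
    ".dta": pd.read_stata,
    ".csv": pd.read_csv,
    ".json": pd.read_json,
    ".pkl": pd.read_pickle,
    ".sav": pd.read_spss,
    ".feather": pd.read_feather,
    ".parquet": pd.read_parquet,
}


def infer_reader(path):
    """Return the pandas reader matching the file's extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in READERS:
        raise ValueError(f"No reader registered for extension: {ext}")
    return READERS[ext]
```

Lower-casing the extension means `auto.CSV` and `auto.csv` dispatch to the same reader.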

Disclaimer:

This is a very early version of the program and work is already underway to refactor a non-trivial amount of the codebase.

Installation:

    net inst readit, replace                   ///
        from("https://wbuchanan.github.io/readit")

    /*
      Note: When specifying the file(s), I strongly recommend
      using double quotes to enclose the full string and using
      individual apostrophe/single quote characters to delimit
      each individual file/path within.

      Clear data from memory then read and append all SPSS files
      in a directory
    */
    readit "'test/*.sav'", c

    /*
      Read all the files starting with auto in the directory &
      then rename the gear_ratio variable after all the files
      are appended.
    */
    readit "'test/auto*'", ren('gear_ratio': 'gratio') c

    /*
      Read all of the Excel, SPSS, and Feather files in the directory,
      rename multiple columns, and pass arguments to the pandas API
      for CSV and SPSS files.
    */
    readit "'test/*.xlsx', 'test/*.feather', 'test/*.sav'", ///
        ren('src2' : 'backupsrc', 'gear_ratio' : 'gearRatio', 'datatype': 'filetype') ///
        "sep = ',', na_values = '', convert_categoricals = False"

Challenges/Limitations

  • Dependency checks use Python to determine whether a package of the same name is available, but they do not return the package's __version__ string, so the installed version cannot be verified.
  • Datetime variables have not been tested.
  • Products like StatTransfer can create issues when they attempt to optimize the storage of the data on disk.

Roadmap

Adding additional parsing capabilities to provide file specific flexibility when used interactively.

Depending on Zhao Xu's talk on the 30th, developing unit tests for the program.

Distribution of the Python code via PyPI/conda.

Using a JSON schema to define/validate a configuration file that can be passed as a single argument.

Concurrent (i.e., parallel) processing of file reads.

Configuration File Based Operation:

    {
        "renameAll": {"var1": "timestamp", "var2": "firmid", "var3": "stockprice"},
        "globalFileType": "",
        "globalOptions": {},
        "files": [
            {
                "filenm": "/Users/user/Desktop/myMessyData/file1.sas7bdat",
                "ext": ".sas7bdat",
                "filetype": "sas",
                "rename": {},
                "options": {}
            },
            {
                "filenm": "/Users/user/Desktop/myMessyData/file2.feather",
                "ext": ".feather",
                "filetype": "feather",
                "rename": {"xyz": "var3", "abc": "var1", "lmno": "var2"},
                "options": {}
            },
            {
                "filenm": "/Users/user/Desktop/myMessyData/file3.csv",
                "ext": ".csv",
                "filetype": "csv",
                "rename": {},
                "options": {
                    "sep": ",;",
                    "names": ["var1", "var2", "var3"],
                    "parse_dates": true
                }
            }
        ]
    }

If anyone is interested in collaborating on this, feel free to let me know.