Billy Buchanan, Ph.D. Director, Office of Grants, Research, Accountability, & Data Fayette County Public Schools
Bautista (2004) tried to find a solution when files used standardized naming.
Boyle (2013) tried to store multiple file names in a .txt file in order to merge/append files.
Picard (2014) proposed a solution using a combination of two programs he developed (filelist and runby) to handle cases where the data are all in the same format.
Rising (2020) also provided an elegant one line solution when all files are Stata formatted files.
Each of the solutions in those threads have their own advantages/disadvantages.
But they all share at least one constraint due to the functionality of append
.
append
ing files requires them to already be in Stata's file format.
If the files aren't already Stata files, then you also deal with I/O performance bottlenecks.
And while frames
supports joins/merges, it doesn't support union/append operations at this time.
It is also likely that this is constrained by serial operation.
readit
The introduction of the Python API in Stata 16 provides many unique opportunities for the Stata community.
The pandas library is capable of reading/writing myriad file formats used in data science/analytics.
The pandas library also contains a lot of useful functionality for data munging.
The pandas library operates on objects in memory.
Package | Version | Used for |
Pandas | ≥1.0.0 | All core functionality |
pyarrow | ≥0.17.1 | I/O of Apache Arrow/Parquet files |
fastparquet | ≥0.4.0 | I/O of Parquet files |
xlrd | ≥1.2.0 | I/O of MS Excel files |
pyreadstat | ≥1.0.0 | I/O of SPSS and SAS files |
h5py | ≥2.10.0 | I/O of Hadoop Distributed File System (HDF5) files |
Cython | ≥0.21 | I/O of Hadoop Distributed File System (HDF5) files |
numexpr | ≥2.7.1 | I/O of Hadoop Distributed File System (HDF5) files |
tables | ≥3.6.1 | I/O of Hadoop Distributed File System (HDF5) files |
pip install 'pandas>=1.0.0'
pip install 'pyarrow>=0.17.1'
pip install 'fastparquet>=0.4.0'
pip install 'xlrd>=1.2.0'
pip install 'pyreadstat>=1.0.0'
pip install 'h5py>=2.10.0'
pip install 'Cython>=0.21'
pip install 'numexpr>=2.7.1'
pip install 'tables>=3.6.1'
Note: The pip binary must be on your path and should be the pip used by the Python installation that Stata references.
Extension | File Type |
.dta | Stata |
.xls(x) | MS Excel |
.csv | Comma Delimited |
.pkl/.pickle | Pickle |
.sas7bdat/.xport | SAS |
.tab/.tsv | Tab Delimited |
.txt/.dat | Fixed-Width |
.json | JSON |
.html | HTML |
.feather | Apache Arrow/Parquet |
.parquet | Parquet |
.h5 | HDF5 |
.sav | SPSS |
This is a very early version of the program and work is already underway to refactor a non-trivial amount of the codebase.
net inst readit, replace ///
from("https://wbuchanan.github.io/readit")
/*
Note: When specifying the file(s), I strongly recommend
using double quotes to enclose the full string and using
individual apostrophe/single quote characters to delimit
each individual file/path within.
Clear data from memory then read and append all SPSS files
in a directory
*/
readit "'test/*.sav'", c
/*
Read all the files starting with auto in the directory &
then rename the gear_ratio variable after all the files
are appended.
*/
readit "'test/auto*'", ren('gear_ratio': 'gratio') c
/*
Read all of the Excel, SPSS, and Feather files in the directory,
rename multiple columns, and pass arguments to the pandas API
for CSV and SPSS files.
*/
readit "'test/*.xlsx', 'test/*.feather', 'test/*.sav'", ///
ren('src2' : 'backupsrc', 'gear_ratio' : 'gearRatio', 'datatype': 'filetype') ///
"sep = ',', na_values = '', convert_categoricals = False"
Python which
will determine if there is a package of the same name available, but does not return the __version__
string.Adding additional parsing capabilities to provide file specific flexibility when used interactively.
Depending on Zhao Xu's talk on the 30th developing unit tests for the program.
Distribution of the Python code via PyPI/conda.
Using a JSON schema to define/validate a configuration file that can be passed as a single argument.
Concurrent (i.e., parallel) processing of file reads.
{
"renameAll": {"var1": "timestamp", "var2": "firmid", "var3": "stockprice"},
"globalFileType": "",
"globalOptions": {},
"files": [
{
"filenm": "/Users/user/Desktop/myMessyData/file1.sas7bdat",
"ext" : ".sas7bdat",
"filetype": "sas",
"rename": {},
"options": {}
},
{
"filenm": "/Users/user/Desktop/myMessyData/file2.feather",
"ext" : ".feather",
"filetype": "feather",
"rename": { "xyz": "var3", "abc": "var1", "lmno": "var2"},
"options": {}
},
{
"filenm": "/Users/user/Desktop/myMessyData/file3.csv",
"ext" : ".csv",
"filetype": "csv",
"rename": {},
"options": {
"sep": ",;",
"names": ["var1", "var2", "var3"],
"parse_dates": true
}
}
]
}