Automating exploratory data analysis tasks with eda


Billy Buchanan
Fayette County Public Schools (Lexington, KY)
Director of Grants, Research, Accountability, & Data

Slides available at https://bit.ly/2O2hiLd

What is eda?



  • A tool to do more exploratory data analysis with less effort.
  • A combinatorial approach to univariate, bivariate, and some multivariate exploratory data analysis.
  • An example of standing on the shoulders of giants (Cox, 2017).

Why eda?



If you are in the audience, there is a good chance that you are a human.

Your time is valuable.

Spend your time on tasks that humans are uniquely qualified to do.

Try to remember this example for a few moments.

					

    // First load some data
    sysuse auto.dta, clear

    // I am going to assume that you can call pdflatex from your command line.
    // If that is correct, then the line below is all you need.
    eda, r(`"`c(pwd)'/edaexamples"') o("minexample") comp

					
				

What is eda?



The Syntax

Notice that only two options are required to use eda: output() and root(). Everything inside the square brackets is optional.

					
    eda [varlist] [using] [if] [in], output(string) root(string) [ idvars(varlist)
        strok[(varlist)] minnsize(integer) mincat(integer) maxcat(integer)
        catvars(varlist) contvars(varlist) authorname(string) reportname(string)
        scheme(string) keepgph grlablength(integer) missing percent nobargraphs
        bargraphopts(string) nopiecharts piechartopts(string) nohistograms
        histogramopts(string) kdensity kdensopts(string) fivenumsum
        fnsopts(string) nodistroplots distroplotopts(string) noladderplots noscatterplots
        lfit[(string)] qfit[(string)] lowess[(string)] fpfit[(string)] lfitci[(string)]
        qfitci[(string)] fpfitci[(string)] noboxplots nomosaic noheatmap nobubbleplots
        weighttype(int) compile pdflatex(string) bygraphs(string) byvars(varlist) byseq ]
					
				

What does eda do?

  • Checks for and optionally installs dependencies.
  • Classifies variables as categorical or continuous.
  • Creates a LaTeX document to store all your results.
  • Generates uni-, bi-, and some multi-variate data viz.
  • Generates LaTeX tables with appropriate summary statistics.
  • eda works with brewscheme.
  • eda assumes you want everything unless you tell it otherwise.
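
The classification step can be pictured with a short sketch. This is not eda's own code: the rule used here (a numeric variable with no more than a fixed number of distinct values is treated as categorical) and the cutoff of 5 are assumptions made only for illustration, loosely mirroring the maxcat() option in the syntax above.

```stata
// Illustrative sketch only; eda's actual classification logic may differ.
sysuse auto.dta, clear

// Hypothetical cutoff: this many distinct values or fewer counts as categorical
local maxcat 5

local catvars
local contvars

foreach v of varlist _all {
    // String variables are skipped in this sketch
    capture confirm numeric variable `v'
    if _rc {
        continue
    }

    // tabulate stores the number of distinct nonmissing values in r(r)
    quietly tabulate `v'
    if r(r) <= `maxcat' {
        local catvars `catvars' `v'
    }
    else {
        local contvars `contvars' `v'
    }
}

display "Categorical: `catvars'"
display "Continuous:  `contvars'"
```

With the auto data and this cutoff, only rep78 and foreign fall at or below the threshold. In eda itself, the catvars() and contvars() options shown in the syntax let you state the classification explicitly.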

What graphs does eda create?

  • Univariate:
    • symplot, quantile, qnorm, pnorm, & histograms
    • Bar graphs and pie charts for categorical data.
    • Ladder of powers graphs for continuous data.
  • Bivariate: scatterplots, box plots, & mosaic plots (i.e., spineplots).
  • Multivariate: bubbleplots & correlation heatmaps.
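
For comparison, this is roughly what producing a handful of these graphs by hand looks like for a single variable or pair of variables. The commands are standard Stata graph commands; the choice of variables is arbitrary, and this is a sketch of the manual workflow rather than the calls eda makes internally.

```stata
// Manual equivalents for a few of the graph types listed above.
sysuse auto.dta, clear

// Univariate, continuous variable
symplot price        // symmetry plot
quantile price       // quantile plot
qnorm price          // quantiles against the normal distribution
pnorm price          // standardized normal probability plot
histogram price
gladder price        // ladder-of-powers histograms

// Univariate, categorical variable
graph bar (count), over(rep78)
graph pie, over(rep78)

// Bivariate
scatter price mpg
graph box price, over(foreign)

// Mosaic plots rely on the user-written spineplot command,
// which has to be installed separately, so it is not run here.
```

Repeating this for every variable and every pair of variables is the part that eda automates.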

What else can eda do?

  • Allows you to turn off types of graphs with the no... options (e.g., nobargraphs, nopiecharts, noscatterplots).
  • Creates tables of summary statistics for you automatically.
  • Generates a batch (Windows) or shell (*nix) script to compile the LaTeX and clean up ancillary files.
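
A sketch of a more selective call, using only option names that appear in the syntax diagram. The variable list and the output name "subsetexample" are placeholders chosen for illustration, and the comments describe the expected behavior under the assumption that each no... option simply suppresses the matching graph type.

```stata
// Sketch: limit the report to four variables and skip two graph types.
sysuse auto.dta, clear

// root() and output() are the required options; "subsetexample" is a
// placeholder name. nopiecharts and noladderplots are assumed to drop
// the pie charts and the ladder-of-powers graphs from the report.
// The compile option is omitted here, so the LaTeX is assumed to be left
// for the generated batch/shell script to compile.
eda price mpg weight foreign, root(`"`c(pwd)'/edaexamples"') ///
    output("subsetexample") nopiecharts noladderplots
```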

Using eda with brewscheme

Future Directions

Known Issues/Limitations

  • eda is not complete, but currently works.
  • bygraphs tend to error out instead of returning an empty graph.
  • bygraphs are not yet implemented for all graph types.
  • Still need to add support for alpha layer transparency to brewscheme.

Improvements

  • Refactor more of the code into Mata.
  • Need to find a faster way to generate all of the different graphs.
  • Markdown/dyndoc support.