elkStackInstaller


ELK Stack in Education Guide

Why another technology tool?

Is it more valuable to know whether your users log in to your system, or to know how they are using the system once they are logged in? We believe the latter is more valuable. We also realize how difficult it can be to manage a data stream that can grow exponentially in all dimensions (e.g., observations and variables), but we wanted to find tools that would make it easier for other school districts and/or State Education Agencies to undertake similar work.

Simplifying the workflow

Although system administrators have long used log management tools to search through large volumes of system log data, these tools are less likely to be familiar to data analysts, research scientists, and data strategists. An emerging technology stack that can streamline this workflow, the ELK stack, is quickly maturing.

Elasticsearch. Elasticsearch is the core of this technology stack and is built on the Apache Lucene engine. Like Lucene, Elasticsearch offers end users high-performance text search capabilities, and it also functions as a NoSQL ("Not Only SQL") database system. However, rather than requiring specialized software drivers to connect to the data system, a user simply needs to know the URL and the manner in which Elasticsearch formats queries to retrieve and store data using standard tools like curl over HTTP. Additionally, several community- and developer-contributed plugins have already added functionality to consume data from traditional Relational Database Management Systems (RDBMS) using Java Database Connectivity (JDBC), consume delimited flat files, provide a more traditional SQL interface for querying the system, and provide connectivity to other popular front ends for systems management.
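As a rough sketch of what "just a URL and curl" means in practice, the hypothetical commands below store and retrieve a document over HTTP. The index name (`weblogs`), the document type and fields, and the local URL are placeholders we invented for illustration, and the exact endpoint paths vary across Elasticsearch versions.

```shell
# Hypothetical example: talk to a local Elasticsearch instance over HTTP.
# The index name ("weblogs") and the fields are placeholders, not part of
# this project; adjust them for your own deployment.
ES_URL="http://localhost:9200"

# Only attempt the requests if an instance is actually listening.
if curl -s --max-time 2 "$ES_URL" > /dev/null; then
    # Store a JSON document under an id using a plain HTTP PUT:
    curl -s -XPUT "$ES_URL/weblogs/report_view/1" \
         -d '{"user": "jdoe", "report": "attendance", "viewed": "2015-09-01T08:30:00"}'

    # Full-text search for documents matching a term:
    curl -s -XGET "$ES_URL/weblogs/_search?q=report:attendance"
else
    echo "No Elasticsearch instance reachable at $ES_URL"
fi
```

Because everything goes over plain HTTP, the same requests can be issued from any scripting language or analysis environment that can make web requests, with no database driver installed.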

Logstash. Logstash provides a middle layer designed to consume, parse, and store streams and/or static files in several different formats. It also provides an access point in the analytic pipeline where simple metrics can be calculated in near real time from log data being read into the system. More importantly, this single service layer also provides capabilities to route output (e.g., it can send analytic-tool log data to one Elasticsearch instance and web log data to a different instance). Given the variety of codecs (i.e., plugins used to parse the data) and the ease with which additional and/or custom parsers can be implemented, this tool can serve several roles related to the analysis of system data.
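To make the consume/parse/store pipeline concrete, here is a minimal sketch of what a Logstash configuration file might look like. The file path, the choice of the combined Apache log format, and the Elasticsearch host are assumptions for illustration, not values from this project.

```conf
# Hypothetical pipeline: read an Apache-style access log, parse each line,
# and ship the parsed events to a local Elasticsearch instance.
input {
  file {
    path => "/var/log/apache2/access.log"
    start_position => "beginning"
  }
}

filter {
  grok {
    # Built-in pattern for the combined Apache/NCSA access log format.
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
  }
}
```

Routing output to multiple destinations, as described above, amounts to adding further blocks to the `output` section, optionally wrapped in conditionals on event fields.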

Kibana. Kibana is the final tier of the technology stack and serves the role of a business intelligence/query tool. Although it isn’t truly real time, Kibana can visualize data streamed into Elasticsearch from Logstash with as little as a five-second delay. All of the graphics are, by default, interactive in terms of drill-down capabilities and tool tips. However, while the tool is well suited to exploratory data analysis, its analytical capabilities are fairly limited beyond that without extensive low-level programming.

Getting Started with the Technology

To help your school/district get up and running with this technology stack, we have created an installation script for *nix-based operating systems (e.g., OS X, Linux, etc.). However, the technology stack will still require some configuration before your organization can deploy the solution. In particular, your IT staff will need to configure the IP addresses for Elasticsearch and will need to create a Logstash configuration file that allows Logstash to collect, clean, and parse all of the system log files you want to analyze. We have also provided some examples of what these log files look like to give your staff additional resources/templates that will help reduce their time and effort.
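As an illustration of the kind of sanity check your IT staff might run after installation and configuration, the sketch below probes the stack's default service ports. The port numbers are the stack-wide defaults, not anything set by our script, so adjust them if your deployment differs.

```shell
# Quick post-install sanity check. 9200 and 5601 are the default ports
# for Elasticsearch and Kibana respectively; change them if your
# configuration uses different values.
ES_PORT=9200
KIBANA_PORT=5601

for port in "$ES_PORT" "$KIBANA_PORT"; do
    if curl -s --max-time 2 "http://localhost:${port}" > /dev/null; then
        echo "Service on port ${port} is responding"
    else
        echo "Nothing responding on port ${port}; check the service configuration"
    fi
done
```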

Use Cases

Before starting, we thought it might be helpful to frame our work in the context of typical use cases that we believe would be valuable to the greater education community.

What types of questions are best suited to technologies like the ELK stack?

Generally, this solution will provide you with immediate and actionable data centered on access and usage of your technology systems. For example, your IT department can use these data to better plan load balancing across servers, ensuring that all of the users in your district get a consistent, high-quality experience from your systems without jeopardizing the performance of other mission-critical systems. Instructional coaching staff can use these data to confirm whether educators within professional learning communities are viewing reports as a group (e.g., do the educators in a particular group view the same reports during PLC/planning times?). Community outreach/communications staff can use the data to better understand how parents use student data tools.

Additionally, the text search/analysis capabilities make this an excellent tool for mining the content being accessed to better understand what is being accessed. For example, do principals/educators in high-performing schools view/access the same reports as principals/educators in lower-performing schools? Your analytic and IT staff can also use these data to align work efforts more strategically with the needs/wants of your community of users. For example, if parents typically view only two of the seven available reports, these staff can decide whether to continue investing effort in developing and maintaining the five under-accessed reports, define short research studies to solicit parent feedback about the under-accessed versus typically accessed reports, and guide future development efforts centered on responding to user requests based on use/access metrics.

Why are these questions important?

At the highest levels of your organization, these questions and the corresponding data provide the necessary and relevant insight to budget, plan, and invest organizational resources strategically. Because technological solutions have myriad cost structures (e.g., perpetual vs. annual licensing, support contracts, local labor/training costs, etc.), a clearer understanding of how users in your organization use these tools provides insight to guide your investment strategies. Additionally, these data can be used to evaluate how access to data affects student outcomes in your district, as well as to guide professional learning/development opportunities that support and increase systemic data literacy throughout the organization.

How can investing effort in this area help to improve outcomes for children in your district?

Understanding and valuing the use of data is only the first step toward data-driven decision-making and building a culture of data use. To improve both service delivery to students and student outcomes, our discussions need to shift beyond valuing the use of data to a strategic dissection of what data best support student learning in your organization and how your educators and leaders are using those data to drive positive change for children.

Classification and Understanding User Groups

There are several “unsupervised” machine learning, classification, and clustering algorithms available for this type of work. For our work, we interpreted the behaviors of the users (e.g., which reports were viewed, when they were viewed, etc.) as indicators of an underlying taxonomy of user types or classes that could not be measured directly. In other words, these latent user types could be inferred from the activity of each user, in a way that is analogous to classifying student test performance as proficient/non-proficient. To do this, we cleaned/prepared the data provided by the assessment vendor and used Mplus 7.31 to fit multilevel latent class analysis (LCA) models that account for differences in the variance of activity within and between users.

How do you use the results to provide better support and services to your staff and students?

We believe these types of analyses can provide insight for decision-making about data infrastructure (e.g., can your organization’s hardware support the platforms sufficiently?), training (e.g., are staff members who should be accessing the system using it?), and planning (e.g., creating professional learning communities with a mixture of data-savvy staff members); help you understand how different types of data use are related to student outcomes; and position the organization to be more proactive in providing support and resources to users who may need additional help to make full use of the data platform. Additionally, the tools and techniques discussed here can be applied to several different scenarios (e.g., analyzing student interactions with learning technologies, analyzing student interactions with assessment platforms, etc.).

System Requirements

The installation script has been tested on Mac OS X and Ubuntu 14.*. The script should work on any system with a Bash shell interpreter and the curl utility (the tar command is also used, but it is typically available from a Bash terminal).

Instructions

You can copy/paste the script install.sh from here, or you can use:

git clone https://github.com/wbuchanan/elkStackInstaller
chmod +x elkStackInstaller/install.sh && sudo elkStackInstaller/install.sh

You can find the most up-to-date information about the ELK stack at http://www.elastic.co.

Cloning the repository should make it easier to keep things in sync on your system if there are future updates, while also giving you some form of version control.
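To pick up future changes after cloning, a standard pull is enough; this is ordinary git usage rather than anything specific to the installer.

```shell
# Update a previously cloned copy of the installer repository.
REPO_DIR="elkStackInstaller"

if [ -d "${REPO_DIR}/.git" ]; then
    # Fetch and merge the latest upstream changes into the local clone.
    git -C "$REPO_DIR" pull
else
    echo "No clone found at ${REPO_DIR}; run the git clone command above first"
fi
```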