Intro to Item Response Theory using Open Source Solutions

Billy Buchanan, Ph.D.
Director of Data, Research, and Accountability
Fayette County Public Schools

https://wbuchanan.github.io/kaacSlideDeck
  • What is Item Response Theory?
  • What is jMetrik?
  • Why You Need to Care
  • How do you do it?

What is Item Response Theory?



And now for a bit of math...

$$Pr(Y_{ij} = 1 | \alpha_i, \beta_i, c_i, d_i, \theta_j) = c_i + (d_i - c_i)\frac{\exp(\alpha_i(\theta_j-\beta_i))}{1 + \exp(\alpha_i(\theta_j-\beta_i))}$$

  • $\alpha$ is the "discrimination" parameter
  • $\beta$ is the "difficulty" parameter
  • $c$ is the "pseudoguessing" parameter
  • $d$ is the upper asymptote or highest probability of a correct response
  • $\theta$ is the "ability" parameter
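To make the formula concrete, here is a minimal sketch of the 4PL response probability in Python (the function and argument names are mine, not from jMetrik):

```python
import math

def p_4pl(theta, alpha, beta, c=0.0, d=1.0):
    """Probability of a correct response under the 4PL model.

    theta: person ability; alpha: discrimination; beta: difficulty;
    c: pseudoguessing (lower asymptote); d: upper asymptote.
    """
    z = math.exp(alpha * (theta - beta))
    return c + (d - c) * z / (1.0 + z)

# With c = 0 and d = 1 this reduces to the 2PL, and with alpha = 1
# it is the Rasch model: a person whose ability equals the item's
# difficulty has a 50% chance of a correct response.
print(p_4pl(0.0, 1.0, 0.0))  # 0.5
```

Setting `c` above zero raises the floor of the curve (a guesser still gets some items right), while `d` below one caps the ceiling.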


  • Due to time constraints, we'll only talk about one of the ways to estimate 1PL models.
  • More specifically, we'll be talking about fitting a Rasch model using the Joint Maximum Likelihood Estimator (JMLE) to the data.
  • However, for those interested, if you view the slides and click your down arrow, there are some brief explanations of other IRT models that are appropriate for other contexts.


Partial Credit Models (PCM)

$$Pr(Y_{ij} = k | \theta_j) = \frac{\exp\left(\sum_{t=1}^{k}\alpha(\theta_j-\beta_{it})\right)}{1 + \sum_{s=1}^{K}\exp\left(\sum_{t=1}^{s}\alpha(\theta_j-\beta_{it})\right)}$$

  • The $\alpha$ & $\theta$ parameters have the same meaning as they had in the other models.
  • The $\beta$ parameter is the difficulty associated with the $t^{th}$ response option on the $i^{th}$ item
  • The difference is that here we are predicting the probability of the respondent selecting the $k^{th}$ option from the response set if they have an ability of $\theta_j$
  • $\theta$ is also assumed to be $N(0, 1)$
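The category probabilities above can be sketched in Python; this is a minimal illustration of the formula (names are mine), not jMetrik's implementation:

```python
import math

def pcm_probs(theta, betas, alpha=1.0):
    """Category probabilities for one item under the PCM.

    betas[t] is the step difficulty beta_it for moving from category
    t to category t + 1; categories run 0..len(betas).
    """
    # Cumulative sums of alpha * (theta - beta_it); category 0
    # contributes exp(0) = 1, which is the "1 +" in the denominator.
    sums = [0.0]
    for b in betas:
        sums.append(sums[-1] + alpha * (theta - b))
    exps = [math.exp(s) for s in sums]
    total = sum(exps)
    return [e / total for e in exps]

# The probabilities always sum to 1 across the response categories.
probs = pcm_probs(theta=0.5, betas=[-1.0, 0.0, 1.0])
print(round(sum(probs), 10))  # 1.0
```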


Rating Scale Models (RSM)

$$Pr(Y_{ij} = k | \alpha, \beta_i, \theta_j) = \frac{\exp\left(\sum_{t=1}^{k}\alpha(\theta_j-(\beta_i+\tau_t))\right)}{1 + \sum_{s=1}^{K}\exp\left(\sum_{t=1}^{s}\alpha(\theta_j-(\beta_i+\tau_t))\right)}$$

  • There are some subtle but important differences between the Rating Scale and Partial Credit Models
  • Each item's step difficulties decompose into an overall item location $\beta_i$ plus a set of category thresholds $\tau_t$ shared across items, so the distances between the category difficulties are constrained to be equal across items (e.g., the step from scoring a 3 to a 4 is the same distance for every item on the scale)
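A minimal Python sketch of the constraint, assuming the common decomposition of step difficulties into an item location plus shared thresholds (variable names are mine):

```python
import math

def rsm_probs(theta, beta_i, taus, alpha=1.0):
    """Category probabilities under the RSM.

    The RSM is a PCM in which every item's step difficulties
    decompose as beta_i + tau_t; the thresholds taus are shared
    across all items, and only the location beta_i varies by item.
    """
    sums = [0.0]
    for tau in taus:
        sums.append(sums[-1] + alpha * (theta - (beta_i + tau)))
    exps = [math.exp(s) for s in sums]
    total = sum(exps)
    return [e / total for e in exps]
```

Two items on the same rating scale therefore have identically shaped category curves, shifted left or right by the difference in their `beta_i` values.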


Graded Response Models (GRM)

$$Pr(Y_{ij} \geq k | \theta_j) = \frac{\exp(\alpha_i(\theta_j-\beta_{ik}))}{1 + \exp(\alpha_i(\theta_j-\beta_{ik}))}$$

  • Here the interpretation of the $\beta$ parameter changes to indicate the difficulty of endorsing category $k$ or higher for the $i^{th}$ item
  • Additionally, unlike the PCM, the item discrimination parameters (i.e., the $\alpha_i$) are freely estimated
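Because the GRM models cumulative probabilities, the probability of any single category falls out as the difference between adjacent cumulative curves. A minimal Python sketch (names are mine):

```python
import math

def grm_cumulative(theta, alpha_i, betas):
    """P(Y >= k) for k = 1..K under the GRM; betas should be increasing."""
    return [1.0 / (1.0 + math.exp(-alpha_i * (theta - b))) for b in betas]

def grm_category_probs(theta, alpha_i, betas):
    """Category probabilities as differences of adjacent cumulative curves.

    P(Y >= 0) is 1 by definition and P(Y >= K + 1) is 0, which is why
    the list is padded on both ends before differencing.
    """
    cum = [1.0] + grm_cumulative(theta, alpha_i, betas) + [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(cum) - 1)]
```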


Nominal Response Models (NRM)

$$Pr(Y_{ij} = k | \theta_j) = \frac{\exp(\alpha_{ik}(\theta_j-\beta_{ik}))}{\sum_{h=1}^{K}\exp(\alpha_{ih}(\theta_j-\beta_{ih}))}$$

  • You can think of this as the unordered (nominal) analog to the GRM.
  • These models would be used in cases where the response choice has no inherent value that could be ordered (e.g., what is your favorite ice cream flavor?)
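The NRM probability is a softmax over the response options, each with its own slope and location. A minimal Python sketch (names are mine):

```python
import math

def nrm_probs(theta, alphas, betas):
    """Probability of each unordered response option under the NRM.

    alphas[k] and betas[k] are the slope and location for option k;
    the probabilities are a softmax over the K options.
    """
    exps = [math.exp(a * (theta - b)) for a, b in zip(alphas, betas)]
    total = sum(exps)
    return [e / total for e in exps]
```

Unlike the ordered models above, nothing here assumes that a "higher" option reflects more of the trait; each option simply competes for probability mass.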

What is jMetrik?



Get jMetrik Here

  • Java-based application for Psychometric analysis of data
  • Freely available (i.e., it does not cost you anything to use and you can even modify it if you so desire)
  • Some functionality has been integrated into other software platforms like Stata (the raschjmle program)
  • The source code for all of the math and user interface is publicly available:

Example Data and Source Code

All of the data used in these examples, and the source code used to simulate it, are publicly available.

To get the files, go to https://github.com/wbuchanan/kaacSlideDeck/tree/gh-pages

The file itemResponses.csv contains the simulated item responses used in the examples.

The file simulateItemResponses.R contains the R source code used to generate the simulated data.

Why You Need to Care



HR Issues



The Real Reason

  • Bad Measurement = Bad Decisions = Bad Outcomes
  • We should enable the adults working with children to make the best decisions based on the best possible data.
  • Bad Measurement + a Correct Decision Process = Bad Decisions = Bad Outcomes
  • The quality of your measurement is an empirical question, not a matter of opinion or feeling.
  • Just because you think you are measuring something doesn't mean you are measuring it.

Starting jMetrik



Initial view when starting the application


Menu Item used to launch creation of new Database


Shows dialog used to name the new database


Launches the Dialog to open a DB


Shows dialog where users select which database to open


Menu item used to launch the file import dialog


Shows the dialog used to select the file to import


Shows the change in the GUI after a file is loaded


Shows a preview of the data loaded into jMetrik


Shows the variable view option to view the data

Setting up Answer Keys



Shows where to click to launch the advanced scoring dialog


Setting up answer key for items with keyed response option a


Shows dialog after clicking the submit button


Shows variable view after refreshing the view to confirm columns are registered as item types


Setting up answer key for items with keyed response option b


Shows dialog after clicking the submit button


Shows variable view after refreshing the view to confirm columns are registered as item types


Setting up answer key for items with keyed response option c


Shows dialog after clicking the submit button


Shows variable view after refreshing the view to confirm columns are registered as item types


Setting up answer key for items with keyed response option d


Shows dialog after clicking the submit button


Shows variable view after refreshing the view to confirm columns are registered as item types


Setting up answer key for items with keyed response option e


Shows dialog after clicking the submit button


Shows variable view after refreshing the view to confirm columns are registered as item types

Item Frequencies



Shows menu option to click to launch item frequency dialog


Shows item frequency analysis dialog


Shows output from frequency analysis

Distractor Analysis



Shows menu option to click for distractor analysis


Shows distractor analysis dialog


Shows additional recommended options to select for distractor analyses


Shows button to click on to save results


Shows dialog to enter table name where results will be saved


Shows button to click to execute distractor analyses


Shows annotated output for distractor analyses

The Rasch Model



Warning to click back on the item responses before moving forward


Shows where to click in the menu to launch the JMLE Dialog Box for Rasch Model


Shows dialog box with default settings


Shows how to select all items in bulk


Verify that all items you want included in the analysis are located in the box on the right


Shows some optional configuration settings on the global tab of the dialog box


Shows the default view for the item tab


Shows suggested options to use on the item tab


Shows the default view for the person tab


Shows suggested options to use on the person tab


Shows where to click to fit the Rasch model to the data


Shows annotated output from the start of the text based output that appears after fitting the model to the data


Continuation of previous slide showing annotated text-based output after fitting the Rasch model to the data


Shows annotations for the table where the results are stored related to the item parameter estimates

IRT Plots



Annotated set up and menu location to generate Item/Test characteristic curves


Annotation showing what it will look like when items are selected


Shows suggested options to include when creating item characteristic curves


Shows buttons to click to select location where results will be saved and to execute the ICC graph generation


Annotation explaining beta parameter location and meaning


Annotation explaining item information function


Warning about a reversed ICC; items with reversed (negatively sloped) ICCs should be excluded from the test form


Annotation explaining test characteristic curve, test information function, and the standard error of measurement.


Annotation explaining the pseudoguessing parameter


Annotation explaining the d parameter


Annotation explaining discrimination parameter
  • The discrimination parameter has nothing to do with discrimination in the everyday, human sense of the word
  • You could think of it like how well Kentuckians can detect good from bad basketball players (e.g., discriminating taste in basketball players), particularly those who view basketball through blue and white lenses
  • This is basically about the range over which the item does a good job at identifying high and low skill, mastery, and ability
  • The slope should always be positive when using an item in a test form
  • For a Rasch model, this slope will always be equal to 1
  • For other types of 1PL models, the slope will be equal to the average item discrimination
  • In 2PL, 3PL, and 4PL models, the slope can and will vary by item
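The effect of the slope can be illustrated with a small 2PL sketch in Python (the parameter values are hypothetical, chosen only to show the contrast):

```python
import math

def p_2pl(theta, alpha, beta):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-alpha * (theta - beta)))

# A higher discrimination makes the ICC steeper near the item's
# difficulty, so the item separates nearby ability levels more
# sharply: compare the probability gap between abilities -0.5 and
# 0.5 for a low-slope and a high-slope item at the same difficulty.
low_sep = p_2pl(0.5, 0.5, 0.0) - p_2pl(-0.5, 0.5, 0.0)
high_sep = p_2pl(0.5, 2.0, 0.0) - p_2pl(-0.5, 2.0, 0.0)
print(high_sep > low_sep)  # True
```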


Item Characteristic Curve for example item number 7


Item Characteristic Curve for example item number 8


Item Characteristic Curve for example item number 9


Item Characteristic Curve for example item number 10


Item Characteristic Curve for example item number 11


Item Characteristic Curve for example item number 12


Item Characteristic Curve for example item number 13


Item Characteristic Curve for example item number 14


Item Characteristic Curve for example item number 15


Item Characteristic Curve for example item number 16


Item Characteristic Curve for example item number 17


Item Characteristic Curve for example item number 18


Item Characteristic Curve for example item number 19


Item Characteristic Curve for example item number 20

Differential Item Functioning



Shows Menu item where DIF analysis can be found


Shows annotated set up for DIF analysis


Shows dialog to name table where results will be saved


Shows button to click to execute DIF analysis


Explanation of class A DIF


Explanation of class B and C of DIF


Annotated DIF output with advice on handling items with higher levels of DIF


Differential Item Function Results table annotated

Oh...and by the way...


FCPS is Hiring Data Strategists

If you or someone you know is a data ninja, code wizard, or other type of quant who wants to enjoy life out in Wildcat Country, tell them to

email me