Beyond n-grams, tf-idf, and word indicators for text

If you want to follow along

There are scripts and instructions available here:

https://github.com/wbuchanan/stataConference2021

Some of the installation can take a bit of time, so you may want to start downloading/installing now.

Beyond n-grams, tf-idf, and word indicators for text:

Leveraging the Python API for vector embeddings

Billy Buchanan
Senior Research Scientist
SAG Corporation

I'm going to move a bit faster when introducing the concepts but will try to slow things down once I get to the code snippets in case anyone is interested in following along. If you have any questions, feel free to put them into the chat/Q&A feature.

This talk will share strategies that Stata users can use to get more informative word, sentence, and document vector embeddings of text in their data. While indicator and bag-of-words strategies can be useful for some types of text analytics, they lack the richness of the semantic relationships between words that provide meaning and structure to language. Vector space embeddings attempt to preserve these relationships and in doing so can provide more robust numerical representations of text data that can be used for subsequent analysis. I will share strategies for using existing tools from the Python ecosystem with Stata to leverage the advances in NLP in your Stata workflow.

Motivation

Bag of Words (BoW) models are not always capable of modeling the meaning in natural language.
BoW, TF-IDF, and N-grams typically result in highly sparse matrices with large dimensions.
Because word order can affect semantics, these methods can introduce substantial error into your models.

Bag of Words Example of Meaning Varying by Word Order¹
ID	Sentence	he	his	her	loved	only	that	told	wife
		Bag of Words Vector
1	Only he told his wife that he loved her.	2	1	1	1	1	1	1	1
2	He only told his wife that he loved her.	2	1	1	1	1	1	1	1
3	He told only his wife that he loved her.	2	1	1	1	1	1	1	1
4	He told his only wife that he loved her.	2	1	1	1	1	1	1	1
5	He told his wife only that he loved her.	2	1	1	1	1	1	1	1
6	He told his wife that only he loved her.	2	1	1	1	1	1	1	1
7	He told his wife that he only loved her.	2	1	1	1	1	1	1	1
8	He told his wife that he loved only her.	2	1	1	1	1	1	1	1
9	He told his wife that he loved her only.	2	1	1	1	1	1	1	1

Do these sentences all mean the same thing? How would a model built on the bag of words vectors distinguish between the meanings?
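If you want to check the table above yourself, here is a minimal sketch using only the Python standard library. The tokenizer (lowercase, alphabetic characters only) and the helper name `bag_of_words` are my own assumptions for illustration, not part of the talk's scripts.

```python
import re
from collections import Counter

sentences = [
    "Only he told his wife that he loved her.",
    "He only told his wife that he loved her.",
    "He told only his wife that he loved her.",
    "He told his only wife that he loved her.",
    "He told his wife only that he loved her.",
    "He told his wife that only he loved her.",
    "He told his wife that he only loved her.",
    "He told his wife that he loved only her.",
    "He told his wife that he loved her only.",
]


def bag_of_words(text: str) -> Counter:
    """Lowercase the text, keep alphabetic tokens, and count each one."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


bags = [bag_of_words(s) for s in sentences]

# Shared vocabulary across all sentences, in alphabetical order
vocab = sorted(set().union(*bags))

# One count vector per sentence, aligned to the vocabulary
vectors = [[bag[word] for word in vocab] for bag in bags]

print(vocab)
print(vectors[0])
print("distinct vectors:", len({tuple(v) for v in vectors}))
```

All nine sentences collapse to a single vector, so nothing downstream of the BoW step can tell them apart.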

N-Gram Example of Meaning Varying by Word Order¹
Sentence ID	only he	he told	told his	his wife	wife that	that he	he loved	loved her	he only	only told	told only	only his	his only	only wife	wife only	only that	that only	only loved	loved only	only her	her only
	N-Gram Vector
1	1	1	1	1	1	1	1	1	0	0	0	0	0	0	0	0	0	0	0	0	0
2	0	0	1	1	1	1	1	1	1	1	0	0	0	0	0	0	0	0	0	0	0
3	0	1	0	1	1	1	1	1	0	0	1	1	0	0	0	0	0	0	0	0	0
4	0	1	1	0	1	1	1	1	0	0	0	0	1	1	0	0	0	0	0	0	0
5	0	1	1	1	0	1	1	1	0	0	0	0	0	0	1	1	0	0	0	0	0
6	1	1	1	1	1	0	1	1	0	0	0	0	0	0	0	0	1	0	0	0	0
7	0	1	1	1	1	1	0	1	1	0	0	0	0	0	0	0	0	1	0	0	0
8	0	1	1	1	1	1	1	0	0	0	0	0	0	0	0	0	0	0	1	1	0
9	0	1	1	1	1	1	1	1	0	0	0	0	0	0	0	0	0	0	0	0	1

To accurately model the meaning of the sentences, how much sparser would the matrix need to get? How many different n-grams would need to be used to capture that information?
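To get a feel for how quickly the matrix grows, here is a small sketch that builds bigram indicators for the same nine sentences. The tokenizer and the helper name `bigrams` are assumptions for illustration; tokens are lowercased, so "Only he" and "only he" count as the same bigram.

```python
import re

sentences = [
    "Only he told his wife that he loved her.",
    "He only told his wife that he loved her.",
    "He told only his wife that he loved her.",
    "He told his only wife that he loved her.",
    "He told his wife only that he loved her.",
    "He told his wife that only he loved her.",
    "He told his wife that he only loved her.",
    "He told his wife that he loved only her.",
    "He told his wife that he loved her only.",
]


def bigrams(text: str) -> list:
    """Return the adjacent token pairs of a lowercased sentence."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return list(zip(tokens, tokens[1:]))


grams = [bigrams(s) for s in sentences]

# Every distinct bigram becomes its own column
vocab = sorted({g for sent in grams for g in sent})
vectors = [[sent.count(g) for g in vocab] for sent in grams]

distinct = len({tuple(v) for v in vectors})
nonzero_share = sum(1 for v in vectors for x in v if x) / (len(vectors) * len(vocab))

print(f"{len(vocab)} bigram columns, {distinct} distinct vectors")
print(f"share of non-zero cells: {nonzero_share:.2f}")
```

The bigrams do separate the nine sentences, but only by adding a column for every new word pair. Most cells are already zero for a nine-word sentence, and longer n-grams or larger corpora make that worse.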

How do the simpler methods work when trying to measure some form of sentiment?

BoW Example for Sentiment
Sentence ID	Sentence	I	apples	are	as	bad	be	did	expect	expected	half	not	of	the	this	to	were	would
		BoW Vector
1	I did not expect the apples to be this bad.	1	1	0	0	1	1	1	1	0	0	1	0	1	1	1	0	0
2	This half of the apples are bad.	0	1	1	0	1	0	0	0	0	1	0	1	1	1	0	0	0
3	The apples were not half bad.	0	1	0	0	1	0	0	0	0	1	1	0	1	0	0	1	0
4	Half of the apples were not bad.	0	1	0	0	1	0	0	0	0	1	1	1	1	0	0	1	0
5	The apples were not half as bad as I expected.	1	1	0	2	1	0	0	0	1	1	1	0	1	0	0	1	0
6	I expected the apples would not be half bad.	1	1	0	0	1	1	0	0	1	1	1	0	1	0	0	0	1
7	The apples were not bad.	0	1	0	0	1	0	0	0	0	0	1	0	1	0	0	1	0

Cosine Similarities Between Sentiment Examples
	Similarity to Other Sentence
Source Sentence	1	2	3	4	5	6	7
I did not expect the apples to be this bad.	1.00	0.48	0.52	0.48	0.46	0.63	0.57
This half of the apples are bad.	0.48	1.00	0.62	0.71	0.44	0.50	0.51
The apples were not half bad.	0.52	0.62	1.00	0.93	0.71	0.68	0.91
Half of the apples were not bad.	0.48	0.71	0.93	1.00	0.65	0.63	0.85
The apples were not half as bad as I expected.	0.46	0.44	0.71	0.65	1.00	0.67	0.65
I expected the apples would not be half bad.	0.63	0.50	0.68	0.63	0.67	1.00	0.60
The apples were not bad.	0.57	0.51	0.91	0.85	0.65	0.60	1.00

Do these similarities (higher values mean the BoW vectors are closer) accurately reflect how similar you would judge the sentiment contained in the sentences?
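The off-diagonal values in the table above match the cosine similarity of the BoW count vectors (cosine distance would be one minus these values). Here is a minimal sketch of that calculation with the standard library only; the tokenizer and function names are my own assumptions for illustration.

```python
import math
import re
from collections import Counter

sentences = [
    "I did not expect the apples to be this bad.",
    "This half of the apples are bad.",
    "The apples were not half bad.",
    "Half of the apples were not bad.",
    "The apples were not half as bad as I expected.",
    "I expected the apples would not be half bad.",
    "The apples were not bad.",
]


def bag_of_words(text: str) -> Counter:
    """Lowercase the text, keep alphabetic tokens, and count each one."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Dot product of two count vectors divided by the product of their norms."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


bags = [bag_of_words(s) for s in sentences]

# Print the full matrix, one row per source sentence
for row in bags:
    print("\t".join(f"{cosine_similarity(row, col):.2f}" for col in bags))
```

Sentences 3 and 4 come out as the most similar pair even though "not half bad" and "half ... were not bad" say quite different things about the apples.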

How do vector embeddings solve these issues?

Reduces sparsity of the data matrix.
Uses a fixed number of dimensions to represent word meanings and context simultaneously.
Vector embeddings can be aggregated to generate embeddings for hierarchical units of language.
Can draw on informative character- and sub-word-level information.
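As a rough illustration of the aggregation point above, here is a sketch that averages word vectors into a single sentence vector. The three-dimensional vectors in `toy_vectors` are made up for this example and do not come from any real model; real embeddings typically have hundreds of dimensions. Averaging is only one possible way to aggregate, although it is the default behind spaCy's document vectors for models that ship with static word vectors.

```python
from typing import Dict, List

# Hypothetical 3-dimensional word vectors, invented purely for illustration
toy_vectors: Dict[str, List[float]] = {
    "the": [0.0, 0.5, 0.0],
    "apples": [1.0, 0.0, 0.5],
    "were": [0.0, 0.5, 0.5],
    "not": [-1.0, 0.0, 0.0],
    "bad": [-0.5, 0.5, 1.0],
}


def mean_vector(tokens: List[str], table: Dict[str, List[float]]) -> List[float]:
    """Average the vectors of the tokens found in the table, dimension by dimension."""
    rows = [table[t] for t in tokens if t in table]
    if not rows:
        # No known tokens: fall back to a zero vector of the right length
        dim = len(next(iter(table.values())))
        return [0.0] * dim
    return [sum(column) / len(rows) for column in zip(*rows)]


sentence = ["the", "apples", "were", "not", "bad"]
doc_vector = mean_vector(sentence, toy_vectors)

# The sentence vector has the same fixed length as each word vector
print(len(doc_vector), doc_vector)
```

The sentence ends up with the same fixed number of dimensions as a single word, no matter how many tokens it contains.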

Vector embeddings are not a panacea for your NLP-related problems

Limitations/Disadvantages

Interpretability
Reproducibility^*
Domain Specificity/Generalizability
Computational Time^*

Unlike indicators for individual tokens, there isn't an easy way to interpret the dimensions of a word embedding.
While interpretability of the individual dimensions isn't an issue in the context of predictive modeling, embeddings may not be useful if the interest is in estimating parameters related to individual words.
Depending on the model and package being used, it may not be possible/easy to reproduce the embeddings exactly.
This is due to the use of a randomized starting vector and/or fine-tuning or training the model on your data.
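To make the randomized starting vector point concrete, here is a toy sketch using Python's standard `random` module. The function `initial_vector` is hypothetical and only stands in for whatever initialization a real model performs; the idea is the same as the `torch.manual_seed(0)` call used in the setup code for this talk.

```python
import random
from typing import List, Optional


def initial_vector(dim: int, seed: Optional[int] = None) -> List[float]:
    """Draw a random starting vector; the same seed gives the same draw."""
    rng = random.Random(seed)
    return [rng.uniform(-0.5, 0.5) for _ in range(dim)]


# Two runs with the same seed start from identical values
first = initial_vector(5, seed=0)
second = initial_vector(5, seed=0)
print(first == second)

# A different seed (or no seed at all) starts somewhere else
other = initial_vector(5, seed=1)
print(first == other)
```

Fixing the seed controls the starting point only. Depending on the package and hardware, other sources of nondeterminism may remain, so treat this as necessary rather than sufficient.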
Any modern language model will necessarily have some degree of domain specificity inherent to it. This means that while one pre-trained model may be amazing for one task, with your data it may behave like Donald Trump not getting his way and throw a huge temper tantrum.
However, there are new language models being released and shared all the time which may be close enough to your use case to be useful.
It can definitely take longer at times to get word embeddings and push them back into Stata compared with creating Bag of Words representations.
If you are tuning a pre-trained model to your data, the computational overhead can definitely increase significantly.
In that case, I would strongly recommend using a system that has one or more GPUs available so you can get the benefit of the GPUs while tuning/training the model generating the embeddings.

Getting Started

Python Packages for Vector Embeddings
Package Name	CUDA	pip	conda
spaCy^*†	Y	Y	Y
transformers^*†	Y	Y	Y
gensim^*	N	Y	Y
GloVe	N	Y	Y
fastText	N	Y	Y
TextBlob	N	Y	Y
NLTK^‡	N	Y	Y
simplerepresentations	N/A	Y	N

^* These packages provide access to several pre-trained models used to generate vector embeddings.
^† These packages will be used for subsequent examples.
^‡ While the Natural Language ToolKit (NLTK) doesn't provide word embeddings, it has a lot of other useful tools for working with text.

Installing spaCy

Installing Transformers

Get Stata's Python Interpreter Up and Running


				# The examples that I'll talk through will use spaCy, but I've included an example
				# that uses some transformers based models in the Jupyter notebook on GitHub
				import json
				import requests
				import pandas as pd
				from sfi import ValueLabel, Data, SFIToolkit
				import spacy
				import torch
				torch.manual_seed(0)

				# This will load the tokenizers and models using the BERT architecture
				from transformers import BertTokenizer, BertModel

				# This will initialize the tokenizer and download its pretrained vocabulary
				tokenizer = BertTokenizer.from_pretrained('bert-base-cased', do_lower_case = False)

				# We'll also load up the model for spaCy at this time
				nlp = spacy.load('en_core_web_lg')

Get data from source

					
				# List of the URLs containing the data set
				files = [ "https://raw.githubusercontent.com/DenisPeskov/2020_acl_diplomacy/master/data/test.jsonl",
				"https://raw.githubusercontent.com/DenisPeskov/2020_acl_diplomacy/master/data/train.jsonl",
				"https://raw.githubusercontent.com/DenisPeskov/2020_acl_diplomacy/master/data/validation.jsonl" ]

				# Function to handle dropping "variables" that prevent pandas from
				# reading the JSON object
				def normalizer(obs: dict, drop: list) -> pd.DataFrame:
				    # Loop over the "variables" to drop
				    for i in drop:
				        # Remove it from the dictionary object
				        del obs[i]
				    # Returns the Pandas dataframe
				    return pd.DataFrame.from_dict(obs)

				# Object to store each of the data frames
				data = []

				# Loop over each of the files from the URLs above
				for url in files:
				    # Get the raw content from the GitHub location
				    content = requests.get(url).content
				    # Split the JSON objects by new lines, pass each individual line to json.loads,
				    # pass the json.loads value to the normalizer function, and
				    # append the result to the data object defined outside of the loop
				    for line in content.decode('utf-8').splitlines():
				        data.append(normalizer(json.loads(line), [ "players", "game_id" ]))

Prep Data for Stata

					
				# Define a couple data mappings for later use
				labmap = { True: 1, False: 0, 'NOANNOTATION': -1 }
				cntrys = { 'austria': 0, 'england': 1, 'france': 2, 'germany': 3, 'italy': 4, 'russia': 5, 'turkey': 6 }
				seasons = { 'Fall': 0, 'Winter': 1, 'Spring': 2 }

				# Combine each of the data frames for each game into one large dataset
				dataset = pd.concat(data, axis = 0, join = 'inner', ignore_index = True, sort = False)

				# Recast data to appropriate types
				dataset['game_score'] = dataset['game_score'].astype('int')
				dataset['sender_labels'] = dataset['sender_labels'].astype('int')
				dataset['absolute_message_index'] = dataset['absolute_message_index'].astype('int')
				dataset['relative_message_index'] = dataset['relative_message_index'].astype('int')
				dataset['game_score_delta'] = dataset['game_score_delta'].astype('int')
				dataset['years'] = dataset['years'].astype('int')

				# Recodes text labels to numeric values
				dataset.replace({'receiver_labels': labmap, 'speakers': cntrys, 'receivers': cntrys, 'seasons': seasons}, inplace = True)

				# Creates an indicator for when the receiver correctly identifies the truthfulness of the message
				dataset['correct'] = (dataset['sender_labels'] == dataset['receiver_labels']).astype('int')

				# Get the number of tokens per message using spaCy's tokenizer
				dataset['tokens'] = dataset['messages'].apply(lambda x: len(nlp(x)))

				# This stores the spaCy object in a new variable named token
				dataset['token'] = dataset['messages'].apply(lambda x: nlp(x))

				# Now the data set can be expanded by unique tokens
				dataset = dataset.explode('token')

				# Make sure the token variable is cast as a string
				dataset['token'] = dataset['token'].astype('str')

				# Then add ID's for each token
				dataset['tokenid'] = dataset.groupby('messages').cumcount()

Load Data into Stata and Store Embeddings

					
				# Get the names of the variables
				varnms = dataset.columns

				# Sets the number of observations based on the messages column
				Data.setObsTotal(len(dataset['messages']))

				# Create the variables in Stata
				for var in varnms:

					# The messages and token variables are both string types
					if var not in [ 'messages', 'token' ]:

						# Adds the numeric variables to the data set
						Data.addVarLong(var)

					# We'll make the string types strLs just to make sure there won't be any storage issues
					else:

						# Adds the strL for the string variables
						Data.addVarStrL(var)

				# Now push the data into Stata
				Data.store(var = None, obs = None, val = dataset.values.tolist())

				# Create mapping of value labels to variables
				vallabmap = { 'sender_labels' : labmap, 'receiver_labels': labmap,
						      'seasons': seasons, 'speakers': cntrys, 'receivers': cntrys }

				# Loop over the dictionary containing the value label mappings
				for varnm, vallabs in vallabmap.items():

					# Create the value label
					ValueLabel.createLabel(varnm)

					# Now iterate over the value label mappings and assign to the appropriate value label
					[ ValueLabel.setLabelValue(varnm, value, str(label)) for label, value in vallabs.items() ]

					# Then assign the value label to the variable
					ValueLabel.setVarValueLabel(varnm, varnm)

				# Now create the variables to store the dimensions of the vector embedding
				[ Data.addVarDouble('wembed' + str(i)) for i in range(1, 301) ]

				# Gets all of the tokens and include a sequence ID in the iteration
				for ob, token in enumerate(dataset['token'].tolist()):

					# Gets the spaCy embedding for this token
					embed = nlp(token)

					# Store the word vector for this word in the variables we just created
					[ Data.storeAt("wembed" + str(dim + 1), ob, embed.vector[dim]) for dim in range(0, len(embed.vector)) ]

Fit a Model and Get Document/Message Embeddings Instead

					
					# You can now fit a model to the data:
					SFIToolkit.stata("logit correct i.speakers i.seasons i.years i.game_score wembed1-wembed300")

					# These results are fairly noisy, so maybe there would be better luck using document vectors
					SFIToolkit.stata("drop token tokenid wembed*")
					SFIToolkit.stata("duplicates drop")

					# Now use the same process used above, but using document vectors
					[ Data.addVarDouble('docembed' + str(i)) for i in range(1, 301) ]

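					# The pandas data frame still has one row per token, so rebuild the
					# message-level rows first; this assumes the resulting row order matches
					# the data left in Stata after -duplicates drop-
					messages = dataset.drop(columns = [ 'token', 'tokenid' ]).drop_duplicates()['messages'].tolist()

					# Then iterate over the messages (instead of individual tokens)
					for ob, message in enumerate(messages):

						# Gets the spaCy embedding for the message
						embed = nlp(message)

						# Stores the document/message/sentence embedding for this record
						[ Data.storeAt("docembed" + str(dim + 1), ob, embed.vector[dim]) for dim in range(0, len(embed.vector)) ]

					# This model fits the data a bit better than the previous model and is also noticeably faster.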
					SFIToolkit.stata("logit correct i.speakers i.seasons i.years i.game_score docembed1-docembed300")

Wrapping Up

Be mindful of compute resource consumption and availability.
The Python API and pystata have different functionality.
Look up information about available models and their training contexts.
You may need to train the model on your data for it to produce informative embeddings.

The Python API will provide a bit more flexibility with compute consumption by allowing you to work in something analogous to a streaming interface (e.g., streaming observations).
If you have substantial compute resources available you may be able to do everything in larger batches and can use notebooks effectively there as well.
Different models in the transformers library return embeddings with different numbers of dimensions.
Aside from an awareness of variable limits in Stata, you should also think about how the additional dimensions affect computational performance.
More importantly, there are highly context-specific models developed and shared openly that can provide a reasonable starting point (e.g., SciBERT), and a lot of work is being done in the medical field with electronic health records.
If you need to fine-tune or train the last layer or two of a pre-trained model, it may be better to manage that workflow largely in Python to avoid any additional competition for computing resources.
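The streaming versus larger-batch trade-off mentioned above can be sketched without any Stata or NLP dependencies. In this sketch, `batched` is a small helper and `embed_batch` is a hypothetical stand-in for a real model call. Inside Stata you would replace the `results` bookkeeping with writes back to the dataset at the matching observation index.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batched(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` items without loading everything at once."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def embed_batch(batch: List[str]) -> List[List[float]]:
    """Hypothetical stand-in for a model call: returns one tiny fake vector per text."""
    return [[float(len(text))] for text in batch]


messages = ["first message", "second", "third one", "fourth", "fifth message"]

results = {}
ob = 0  # running observation index, so each vector lands on the right row
for batch in batched(messages, size=2):
    for vector in embed_batch(batch):
        results[ob] = vector
        ob += 1

print(results)
```

A batch size of one behaves like streaming observations. Larger batches trade memory for fewer model calls, and spaCy's `nlp.pipe` offers a similar batching interface if you are using that library.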

"It's always good to end with a slogan." - Nicholas J. Cox, North American Stata Users Group Conference 2021