If you want to follow along

There are scripts and instructions available here:


https://github.com/wbuchanan/stataConference2021


Some of the installation steps can take a bit of time, so you may want to start downloading/installing now.

Beyond n-grams, tf-idf, and word indicators for text:

Leveraging the Python API for vector embeddings


Billy Buchanan
Senior Research Scientist
SAG Corporation

Motivation

  • Bag of Words (BoW) models are not always capable of capturing the meaning of natural language.
  • BoW, TF-IDF, and n-gram representations typically result in highly sparse, high-dimensional matrices.
  • Because word order can affect semantics, these methods can introduce substantial error into your models.
Bag of Words Example of Meaning Varying by Word Order

Bag of Words Vector (word counts per sentence)
ID  Sentence                                    he  his  her  loved  only  that  told  wife
1   Only he told his wife that he loved her.    2   1    1    1      1     1     1     1
2   He only told his wife that he loved her.    2   1    1    1      1     1     1     1
3   He told only his wife that he loved her.    2   1    1    1      1     1     1     1
4   He told his only wife that he loved her.    2   1    1    1      1     1     1     1
5   He told his wife only that he loved her.    2   1    1    1      1     1     1     1
6   He told his wife that only he loved her.    2   1    1    1      1     1     1     1
7   He told his wife that he only loved her.    2   1    1    1      1     1     1     1
8   He told his wife that he loved only her.    2   1    1    1      1     1     1     1
9   He told his wife that he loved her only.    2   1    1    1      1     1     1     1

Do these sentences all mean the same thing? How would a model built on the bag of words vectors distinguish between the meanings?
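
To see this concretely, here is a minimal sketch of the point, assuming a standard bag-of-words vectorizer such as scikit-learn's CountVectorizer (the slides do not say how the counts above were produced, so the library choice here is mine):

    # A minimal sketch: two of the reorderings above produce identical bag-of-words vectors
    from sklearn.feature_extraction.text import CountVectorizer

    sentences = [ "Only he told his wife that he loved her.",
                  "He told his wife that he loved her only." ]

    # Unigram counts; CountVectorizer lowercases and strips punctuation by default
    bow = CountVectorizer().fit_transform(sentences).toarray()

    print(bow[0])                    # [2 1 1 1 1 1 1 1]
    print((bow[0] == bow[1]).all())  # True: same vector, despite the different meanings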
N-Gram Example of Meaning Varying by Word Order

Bigram columns, in order:
(1) Only he  (2) he told  (3) told his  (4) his wife  (5) wife that  (6) that he  (7) he loved  (8) loved her
(9) he only  (10) only told  (11) told only  (12) only his  (13) his only  (14) only wife  (15) wife only  (16) only that
(17) that only  (18) only he  (19) only loved  (20) loved only  (21) only her  (22) her only

N-Gram Vector (1 = bigram appears in the sentence; sentence IDs as in the previous table)
ID   1 2 3 4 5 6 7 8   9 10 11 12 13 14 15 16 17 18 19 20 21 22
1    1 1 1 1 1 1 1 1   0  0  0  0  0  0  0  0  0  0  0  0  0  0
2    0 0 1 1 1 1 1 1   1  1  0  0  0  0  0  0  0  0  0  0  0  0
3    0 1 0 1 1 1 1 1   0  0  1  1  0  0  0  0  0  0  0  0  0  0
4    0 1 1 0 1 1 1 1   0  0  0  0  1  1  0  0  0  0  0  0  0  0
5    0 1 1 1 0 1 1 1   0  0  0  0  0  0  1  1  0  0  0  0  0  0
6    0 1 1 1 1 0 1 1   0  0  0  0  0  0  0  0  1  1  0  0  0  0
7    0 1 1 1 1 1 0 1   1  0  0  0  0  0  0  0  0  0  1  0  0  0
8    0 1 1 1 1 1 1 0   0  0  0  0  0  0  0  0  0  0  0  1  1  0
9    0 1 1 1 1 1 1 1   0  0  0  0  0  0  0  0  0  0  0  0  0  1

To accurately model the meaning of the sentences, how much sparser would the matrix need to get? How many different n-grams would need to be used to capture that information?
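
For a sense of how quickly the bigram representation grows, here is a small sketch (again assuming scikit-learn's CountVectorizer, which lowercases by default) that rebuilds the nine sentences and counts the distinct unigrams and bigrams:

    # A minimal sketch of how the bigram vocabulary grows relative to the unigram vocabulary
    from sklearn.feature_extraction.text import CountVectorizer

    base = "he told his wife that he loved her".split()

    # Insert "only" at each possible position to rebuild the nine sentences above
    sentences = [ " ".join(base[:i] + ["only"] + base[i:]) for i in range(len(base) + 1) ]

    unigrams = CountVectorizer(ngram_range = (1, 1)).fit(sentences)
    bigrams = CountVectorizer(ngram_range = (2, 2)).fit(sentences)

    print(len(unigrams.vocabulary_))  # 8 distinct words
    print(len(bigrams.vocabulary_))   # 21 distinct bigrams, for nine nine-word sentences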

How do the simpler methods work when trying to measure some form of sentiment?

BoW Example for Sentiment

BoW Vector (word counts per sentence)
ID  Sentence                                         I  apples  are  as  bad  be  did  expect  expected  half  not  of  the  this  to  were  would
1   I did not expect the apples to be this bad.      1  1       0    0   1    1   1    1       0         0     1    0   1    1     1   0     0
2   This half of the apples are bad.                 0  1       1    0   1    0   0    0       0         1     0    1   1    1     0   0     0
3   The apples were not half bad.                    0  1       0    0   1    0   0    0       0         1     1    0   1    0     0   1     0
4   Half of the apples were not bad.                 0  1       0    0   1    0   0    0       0         1     1    1   1    0     0   1     0
5   The apples were not half as bad as I expected.   1  1       0    2   1    0   0    0       1         1     1    0   1    0     0   1     0
6   I expected the apples would not be half bad.     1  1       0    0   1    1   0    0       1         1     1    0   1    0     0   0     1
7   The apples were not bad.                         0  1       0    0   1    0   0    0       0         0     1    0   1    0     0   1     0

Cosine Distances Between Sentiment Examples

                                                   Distance to Other Sentence
Source Sentence                                    1     2     3     4     5     6     7
1  I did not expect the apples to be this bad.     0     0.48  0.52  0.48  0.46  0.63  0.57
2  This half of the apples are bad.                0.48  0     0.62  0.71  0.44  0.50  0.51
3  The apples were not half bad.                   0.52  0.62  0     0.93  0.71  0.68  0.91
4  Half of the apples were not bad.                0.48  0.71  0.93  0     0.65  0.63  0.85
5  The apples were not half as bad as I expected.  0.46  0.44  0.71  0.65  0     0.67  0.65
6  I expected the apples would not be half bad.    0.63  0.50  0.68  0.63  0.67  0     0.60
7  The apples were not bad.                        0.57  0.51  0.91  0.85  0.65  0.60  0

Do these distances accurately reflect how similar you would judge the sentiment contained in the sentences?
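
For reference, pairwise cosine distances between bag-of-words vectors can be computed along these lines (a sketch assuming scikit-learn; the slides do not state how the values in the table were produced, so the exact numbers need not match):

    # A minimal sketch of computing pairwise cosine distances from BoW counts
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics.pairwise import cosine_distances

    sentences = [ "I did not expect the apples to be this bad.",
                  "This half of the apples are bad.",
                  "The apples were not half bad.",
                  "Half of the apples were not bad.",
                  "The apples were not half as bad as I expected.",
                  "I expected the apples would not be half bad.",
                  "The apples were not bad." ]

    # Keep single-character tokens such as "I" so the vocabulary matches the table above
    bow = CountVectorizer(token_pattern = r"(?u)\b\w+\b").fit_transform(sentences)

    # 7 x 7 symmetric matrix with zeros on the diagonal
    print(cosine_distances(bow).round(2))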

How do vector embeddings solve these issues?

  • Embeddings reduce the sparsity of the data matrix.
  • They use a fixed number of dimensions to represent word meaning and context simultaneously (see the sketch after this list).
  • They can be aggregated to generate embeddings for higher-level units of language (sentences, paragraphs, documents).
  • They can draw on informative character- and sub-word-level signals.
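
As a quick contrast with the sparse counts above, a pretrained embedding model returns a dense, fixed-length vector for any token or document. A minimal sketch with spaCy's en_core_web_lg model (the same model loaded in the examples that follow):

    # A minimal sketch: dense, fixed-length vectors from a pretrained spaCy model
    import spacy

    nlp = spacy.load('en_core_web_lg')

    doc1 = nlp("The apples were not half bad.")
    doc2 = nlp("The apples were not bad.")

    print(doc1.vector.shape)                # (300,) -- fixed length, regardless of vocabulary size
    print(doc1[0].vector.shape)             # (300,) -- each token also has a 300-dimensional vector
    print(round(doc1.similarity(doc2), 2))  # cosine similarity of the two document vectors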

Vector embeddings are not a panacea for your NLP-related problems

Limitations/Disadvantages

  • Interpretability
  • Reproducibility*
  • Domain Specificity/Generalizability
  • Computational Time*

Getting Started

Python Packages for Vector Embeddings
Package Name            CUDA  pip  conda
spaCy*†                 Y     Y    Y
transformers*†          Y     Y    Y
gensim*                 N     Y    Y
GloVe                   N     Y    Y
fastText                N     Y    Y
TextBlob                N     Y    Y
NLTK                    N     Y    Y
simplerepresentations   N/A   Y    N
* These packages provide access to several pre-trained models used to generate vector embeddings.
† These packages will be used in the subsequent examples.
While the Natural Language ToolKit (NLTK) doesn't provide word embeddings, it has a lot of other useful tools for working with text.
Installing spaCy
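
The installation commands for this slide are not in this extract; a typical setup looks something like the following (add a CUDA-enabled extra if you want GPU support):

    # Install spaCy and download the large English model used in the examples
    pip install -U spacy
    python -m spacy download en_core_web_lg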
Installing Transformers
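
Likewise, a typical transformers setup (the exact PyTorch build depends on your platform and CUDA version):

    # Install the Hugging Face transformers library and a PyTorch backend
    pip install transformers
    pip install torch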

Get Stata's Python Interpreter Up and Running


				# The examples that I'll talk through will use spaCy, but I've included an example
				# that uses some transformers based models in the Jupyter notebook on GitHub
				import json
				import requests
				import pandas as pd
				from sfi import ValueLabel, Data, SFIToolkit
				import spacy
				import torch
				torch.manual_seed(0)

				# This will load the tokenizers and models using the BERT architecture
				from transformers import BertTokenizer, BertModel

				# This will initialize the tokenizer and download its pretrained vocabulary; the BERT model itself is used in the notebook examples
				tokenizer = BertTokenizer.from_pretrained('bert-base-cased', do_lower_case = False)

				# We'll also load up the model for spaCy at this time
				nlp = spacy.load('en_core_web_lg')
				
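
The block above (and the ones that follow) is meant to run inside Stata's embedded Python interpreter, so the sfi calls can see the current dataset. One way to do that is interactively (type python, enter the code, then end); in a do-file it is a sketch like:

    * Running the embedded interpreter from a do-file
    python:
    import spacy
    nlp = spacy.load('en_core_web_lg')
    end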

Get data from source

					
				# List of the URLs containing the data set
				files = [ "https://raw.githubusercontent.com/DenisPeskov/2020_acl_diplomacy/master/data/test.jsonl",
				"https://raw.githubusercontent.com/DenisPeskov/2020_acl_diplomacy/master/data/train.jsonl",
				"https://raw.githubusercontent.com/DenisPeskov/2020_acl_diplomacy/master/data/validation.jsonl" ]

				# Function to handle dropping "variables" that prevent pandas from
				# reading the JSON object
				def normalizer(obs: dict, drop: list) -> pd.DataFrame:
				    # Loop over the "variables" to drop
				    for i in drop:
				        # Remove it from the dictionary object
				        del obs[i]
				    # Returns the Pandas dataframe
				    return pd.DataFrame.from_dict(obs)

				# Object to store each of the data frames
				data = []

				# Loop over each of the files from the URLs above
				for file in files:
				    # Get the raw content from the GitHub location
				    content = requests.get(file).content
				    # Split the JSON records on new lines, pass each line to json.loads,
				    # pass the result to the normalizer function, and
				    # append it to the data object defined outside of the loop
				    [ data.append(normalizer(json.loads(line), [ "players", "game_id" ])) for line in content.decode('utf-8').splitlines() ]


				

Prep Data for Stata

					
				# Define a couple data mappings for later use
				labmap = { True: 1, False: 0, 'NOANNOTATION': -1 }
				cntrys = { 'austria': 0, 'england': 1, 'france': 2, 'germany': 3, 'italy': 4, 'russia': 5, 'turkey': 6 }
				seasons = { 'Fall': 0, 'Winter': 1, 'Spring': 2 }

				# Combine each of the data frames for each game into one large dataset
				dataset = pd.concat(data, axis = 0, join = 'inner', ignore_index = True, sort = False)

				# Recast data to appropriate types
				dataset['game_score'] = dataset['game_score'].astype('int')
				dataset['sender_labels'] = dataset['sender_labels'].astype('int')
				dataset['absolute_message_index'] = dataset['absolute_message_index'].astype('int')
				dataset['relative_message_index'] = dataset['relative_message_index'].astype('int')
				dataset['game_score_delta'] = dataset['game_score_delta'].astype('int')
				dataset['years'] = dataset['years'].astype('int')

				# Recodes text labels to numeric values
				dataset.replace({'receiver_labels': labmap, 'speakers': cntrys, 'receivers': cntrys, 'seasons': seasons}, inplace = True)

				# Creates an indicator for when the receiver correctly identifies the truthfulness of the message
				dataset['correct'] = (dataset['sender_labels'] == dataset['receiver_labels']).astype('int')

				# Get the number of tokens per message using spaCy's tokenizer
				dataset['tokens'] = dataset['messages'].apply(lambda x: len(nlp(x)))

				# This stores the spaCy object in a new variable named token
				dataset['token'] = dataset['messages'].apply(lambda x: nlp(x))

				# Now the data set can be expanded by unique tokens
				dataset = dataset.explode('token')

				# Make sure the token variable is cast as a string
				dataset['token'] = dataset['token'].astype('str')

				# Then add ID's for each token
				dataset['tokenid'] = dataset.groupby('messages').cumcount()

				

Load Data into Stata and Store Embeddings

					
				# Get the names of the variables
				varnms = dataset.columns

				# Sets the number of observations based on the messages column
				Data.setObsTotal(len(dataset['messages']))

				# Create the variables in Stata
				for var in varnms:

					# The messages and token variables are both string types
					if var not in [ 'messages', 'token' ]:

						# Adds the numeric variables to the data set
						Data.addVarLong(var)

					# We'll make the string types strLs just to make sure there won't be any storage issues
					else:

						# Adds the strL for the string variables
						Data.addVarStrL(var)

				# Now push the data into Stata
				Data.store(var = None, obs = None, val = dataset.values.tolist())

				# Create mapping of value labels to variables
				vallabmap = { 'sender_labels' : labmap, 'receiver_labels': labmap,
						      'seasons': seasons, 'speakers': cntrys, 'receivers': cntrys }

				# Loop over the dictionary containing the value label mappings
				for varnm, vallabs in vallabmap.items():

					# Create the value label
					ValueLabel.createLabel(varnm)

					# Now iterate over the value label mappings and assign to the appropriate value label
					[ ValueLabel.setLabelValue(varnm, value, str(label)) for label, value in vallabs.items() ]

					# Then assign the value label to the variable
					ValueLabel.setVarValueLabel(varnm, varnm)

				# Now create the variables to store the dimensions of the vector embedding
				[ Data.addVarDouble('wembed' + str(i)) for i in range(1, 301) ]

				# Gets all of the tokens and include a sequence ID in the iteration
				for ob, token in enumerate(dataset['token'].tolist()):

					# Gets the spaCy embedding for this token
					embed = nlp(token)

					# Store the word vector for this word in the variables we just created
					[ Data.storeAt("wembed" + str(dim + 1), ob, embed.vector[dim]) for dim in range(0, len(embed.vector)) ]

				

Fit a Model and Get Document/Message Embeddings Instead

					
					# You can now fit a model to the data:
					SFIToolkit.stata("logit correct i.speakers i.seasons i.years i.game_score wembed1-wembed300")

					# These results are fairly noisy, so maybe there would be better luck using document vectors
					SFIToolkit.stata("drop token tokenid wembed*")
					SFIToolkit.stata("duplicates drop")

					# Mirror the Stata duplicates drop on the Python side so there is one row
					# per remaining observation (this assumes the row order in Stata still
					# matches the first-occurrence order of the records)
					docs = dataset.drop(columns = ['token', 'tokenid']).drop_duplicates()

					# Now use the same process used above, but using document vectors
					[ Data.addVarDouble('docembed' + str(i)) for i in range(1, 301) ]

					# Then iterate over the messages (instead of individual tokens)
					for ob, message in enumerate(docs['messages'].tolist()):

						# Gets the spaCy embedding for the entire message
						embed = nlp(message)

						# Stores the document/message/sentence embedding for this record
						[ Data.storeAt("docembed" + str(dim + 1), ob, embed.vector[dim]) for dim in range(0, len(embed.vector)) ]

					# This model fits the data a bit better than the previous model and is also noticeably faster.
					SFIToolkit.stata("logit correct i.speakers i.seasons i.years i.game_score docembed1-docembed300")

				

Wrapping Up

  • Be mindful of compute resource consumption and availability.
  • The Python API and pystata have different functionality.
  • Look up information about available models and their training contexts.
  • You may need to train the model on your data for it to produce informative embeddings.
[Image: Nicholas J. Cox]
"It's always good to end with a slogan." - Nicholas J. Cox, North American Stata Users Group Conference 2021