To accurately model the meaning of the sentences, how much sparser would the matrix need to get?
How many different n-grams would need to be used to capture that information?
How do the simpler methods work when trying to measure some form of sentiment?
BoW Example for Sentiment
(each word column holds one element of the sentence's BoW vector)

| Sentence ID | Sentence | I | apples | are | as | bad | be | did | expect | expected | half | not | of | the | this | to | were | would |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | I did not expect the apples to be this bad. | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 |
| 2 | This half of the apples are bad. | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 |
| 3 | The apples were not half bad. | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 4 | Half of the apples were not bad. | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 |
| 5 | The apples were not half as bad as I expected. | 1 | 1 | 0 | 2 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 6 | I expected the apples would not be half bad. | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
| 7 | The apples were not bad. | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
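A count matrix like the one above can be built in a few lines of Python. This is a minimal sketch using scikit-learn's CountVectorizer (not necessarily how the table was produced); the token pattern is loosened so that one-letter words such as "I" are kept, and the column order will be alphabetical rather than the order shown above.

from sklearn.feature_extraction.text import CountVectorizer

# The seven example sentences from the table above
sentences = [
    "I did not expect the apples to be this bad.",
    "This half of the apples are bad.",
    "The apples were not half bad.",
    "Half of the apples were not bad.",
    "The apples were not half as bad as I expected.",
    "I expected the apples would not be half bad.",
    "The apples were not bad.",
]

# Keep one-character tokens such as "I", which the default pattern drops
vectorizer = CountVectorizer(token_pattern = r"(?u)\b\w+\b")

# Rows are sentences, columns are vocabulary terms, cells are term counts
bow = vectorizer.fit_transform(sentences)
print(vectorizer.get_feature_names_out())
print(bow.toarray())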
Cosine Similarities Between Sentiment Examples
(cosine similarity of each source sentence's BoW vector to the sentence with the given ID)

| Source Sentence | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| 1. I did not expect the apples to be this bad. | 1 | 0.48 | 0.52 | 0.48 | 0.46 | 0.63 | 0.57 |
| 2. This half of the apples are bad. | 0.48 | 1 | 0.62 | 0.71 | 0.44 | 0.50 | 0.51 |
| 3. The apples were not half bad. | 0.52 | 0.62 | 1 | 0.93 | 0.71 | 0.68 | 0.91 |
| 4. Half of the apples were not bad. | 0.48 | 0.71 | 0.93 | 1 | 0.65 | 0.63 | 0.85 |
| 5. The apples were not half as bad as I expected. | 0.46 | 0.44 | 0.71 | 0.65 | 1 | 0.67 | 0.65 |
| 6. I expected the apples would not be half bad. | 0.63 | 0.50 | 0.68 | 0.63 | 0.67 | 1 | 0.60 |
| 7. The apples were not bad. | 0.57 | 0.51 | 0.91 | 0.85 | 0.65 | 0.60 | 1 |
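These values follow directly from the count vectors. Here is a minimal sketch using scikit-learn, assuming the bow matrix from the previous snippet; the off-diagonal entries should match the table above up to rounding.

from sklearn.metrics.pairwise import cosine_similarity

# Pairwise cosine similarities between the BoW count vectors
# (1 = identical direction, 0 = no shared vocabulary)
sims = cosine_similarity(bow)
print(sims.round(2))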
Do these similarity scores accurately reflect how similar you would judge the sentiment of the sentences to be?
How do vector embeddings solve these issues?
Reduce the sparsity of the data matrix.
Use a fixed number of dimensions to represent word meaning and context simultaneously.
Can be aggregated to generate embeddings for hierarchical units of language such as sentences, messages, and documents (see the sketch below).
Can draw on informative character- and sub-word-level signals.
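As a small illustration of the aggregation point, spaCy's document vector for a pipeline with static word vectors (such as en_core_web_lg) is just the average of its token vectors, so a sentence or message embedding can be built directly from word embeddings. This is a sketch, not part of the original slides.

import numpy as np
import spacy

nlp = spacy.load('en_core_web_lg')
doc = nlp("The apples were not half bad.")

# Average the individual word vectors to get a sentence-level embedding
manual = np.mean([ token.vector for token in doc ], axis = 0)

# For pipelines with static vectors this matches doc.vector (up to floating point)
print(np.allclose(manual, doc.vector))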
Vector embeddings are not a panacea for your NLP-related problems.
* These packages provide access to several pre-trained models used to generate vector embeddings.
† These packages will be used for subsequent examples.
‡ While the Natural Language Toolkit (NLTK) doesn't provide word embeddings, it offers many other useful tools for working with text.
Installing spaCy
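The install commands themselves are not reproduced here; as a sketch, one way to install spaCy into the same Python environment that Stata is configured to use (assuming pip is available) is to run the following from Stata's Python prompt:

import subprocess, sys
# Install spaCy into the interpreter Stata is using
subprocess.run([ sys.executable, "-m", "pip", "install", "-U", "spacy" ], check = True)
# Download the large English pipeline used in the examples below
subprocess.run([ sys.executable, "-m", "spacy", "download", "en_core_web_lg" ], check = True)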
Installing Transformers
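Similarly, a minimal sketch of installing the Hugging Face transformers library and PyTorch into Stata's Python environment (the original slide may show different commands):

import subprocess, sys
# Install PyTorch and the transformers library
subprocess.run([ sys.executable, "-m", "pip", "install", "torch", "transformers" ], check = True)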
Get Stata's Python Interpreter Up and Running
# The examples that I'll talk through will use spaCy, but I've included an example
# that uses some transformers-based models in the Jupyter notebook on GitHub
import json
import requests
import pandas as pd
from sfi import ValueLabel, Data, SFIToolkit
import spacy
import torch
torch.manual_seed(0)
# This will load the tokenizers and models using the BERT architecture
from transformers import BertTokenizer, BertModel
# This will initialize the tokenizer and download its pretrained vocabulary
tokenizer = BertTokenizer.from_pretrained('bert-base-cased', do_lower_case = False)
# We'll also load up the model for spaCy at this time
nlp = spacy.load('en_core_web_lg')
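The transformers-based example lives in the Jupyter notebook mentioned above. As a rough sketch (not the notebook's exact code), a message embedding can be obtained from BERT by mean-pooling the final hidden states of the tokens:

# A sketch of a BERT message embedding: mean-pool the final hidden states
model = BertModel.from_pretrained('bert-base-cased')
model.eval()
with torch.no_grad():
    # Tokenize one message and run it through the pretrained encoder
    encoded = tokenizer("The apples were not half bad.", return_tensors = 'pt')
    hidden = model(**encoded).last_hidden_state
    # Average across the token dimension to get a single 768-dimensional vector
    message_embedding = hidden.mean(dim = 1).squeeze()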
Get Data from Source
# List of the URLs containing the data set
files = [ "https://raw.githubusercontent.com/DenisPeskov/2020_acl_diplomacy/master/data/test.jsonl",
"https://raw.githubusercontent.com/DenisPeskov/2020_acl_diplomacy/master/data/train.jsonl",
"https://raw.githubusercontent.com/DenisPeskov/2020_acl_diplomacy/master/data/validation.jsonl" ]
# Function to handle dropping "variables" that prevent pandas from
# reading the JSON object
def normalizer(obs: dict, drop: list) -> pd.DataFrame:
    # Loop over the "variables" to drop
    for i in drop:
        # Remove it from the dictionary object
        del obs[i]
    # Return the pandas DataFrame
    return pd.DataFrame.from_dict(obs)
# Object to store each of the data frames
data = []
# Loop over each of the files from the URLs above
for i in files:
    # Get the raw content from the GitHub location
    content = requests.get(i).content
    # Split the JSON objects by new lines, pass each individual line to json.loads,
    # pass the json.loads value to the normalizer function, and
    # append the result to the data object defined outside of the loop
    [ data.append(normalizer(json.loads(line), [ "players", "game_id" ])) for line in content.decode('utf-8').splitlines() ]
Prep Data for Stata
# Define a couple data mappings for later use
labmap = { True: 1, False: 0, 'NOANNOTATION': -1 }
cntrys = { 'austria': 0, 'england': 1, 'france': 2, 'germany': 3, 'italy': 4, 'russia': 5, 'turkey': 6 }
seasons = { 'Fall': 0, 'Winter': 1, 'Spring': 2 }
# Combine each of the data frames for each game into one large dataset
dataset = pd.concat(data, axis = 0, join = 'inner', ignore_index = True, sort = False)
# Recast data to appropriate types
dataset['game_score'] = dataset['game_score'].astype('int')
dataset['sender_labels'] = dataset['sender_labels'].astype('int')
dataset['absolute_message_index'] = dataset['absolute_message_index'].astype('int')
dataset['relative_message_index'] = dataset['relative_message_index'].astype('int')
dataset['game_score_delta'] = dataset['game_score_delta'].astype('int')
dataset['years'] = dataset['years'].astype('int')
# Recodes text labels to numeric values
dataset.replace({'receiver_labels': labmap, 'speakers': cntrys, 'receivers': cntrys, 'seasons': seasons}, inplace = True)
# Creates an indicator for when the receiver correctly identifies the truthfulness of the message
dataset['correct'] = (dataset['sender_labels'] == dataset['receiver_labels']).astype('int')
# Get the number of tokens per message using spaCy's tokenizer
dataset['tokens'] = dataset['messages'].apply(lambda x: len(nlp(x)))
# This stores the spaCy object in a new variable named token
dataset['token'] = dataset['messages'].apply(lambda x: nlp(x))
# Now the data set can be expanded to one row per token
dataset = dataset.explode('token')
# Make sure the token variable is cast as a string
dataset['token'] = dataset['token'].astype('str')
# Then add IDs for each token (its position within the message)
dataset['tokenid'] = dataset.groupby('messages').cumcount()
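Before pushing anything to Stata it can help to eyeball the exploded data. This optional check (not part of the original workflow) just confirms that each message now spans one row per token, with tokenid counting up from 0 within each message:

# Peek at the first few exploded rows
print(dataset[[ 'messages', 'token', 'tokenid' ]].head(15))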
Load Data into Stata and Store Embeddings
# Get the names of the variables
varnms = dataset.columns
# Sets the number of observations based on the messages column
Data.setObsTotal(len(dataset['messages']))
# Create the variables in Stata
for var in varnms:
    # The messages and token variables are both string types
    if var not in [ 'messages', 'token' ]:
        # Adds the numeric variables to the data set
        Data.addVarLong(var)
    # We'll make the string types strLs just to make sure there won't be any storage issues
    else:
        # Adds the strL for the string variables
        Data.addVarStrL(var)
# Now push the data into Stata
Data.store(var = None, obs = None, val = dataset.values.tolist())
# Create mapping of value labels to variables
vallabmap = { 'sender_labels' : labmap, 'receiver_labels': labmap,
'seasons': seasons, 'speakers': cntrys, 'receivers': cntrys }
# Loop over the dictionary containing the value label mappings
for varnm, vallabs in vallabmap.items():
    # Create the value label
    ValueLabel.createLabel(varnm)
    # Now iterate over the value label mappings and assign to the appropriate value label
    [ ValueLabel.setLabelValue(varnm, value, str(label)) for label, value in vallabs.items() ]
    # Then assign the value label to the variable
    ValueLabel.setVarValueLabel(varnm, varnm)
# Now create the variables to store the dimensions of the vector embedding
[ Data.addVarDouble('wembed' + str(i)) for i in range(1, 301) ]
# Get all of the tokens and include a sequence ID (the observation number) in the iteration
for ob, token in enumerate(dataset['token'].tolist()):
    # Gets the spaCy embedding for this token
    embed = nlp(token)
    # Store the word vector for this word in the variables we just created
    [ Data.storeAt("wembed" + str(dim + 1), ob, embed.vector[dim]) for dim in range(0, len(embed.vector)) ]
Fit a Model and Get Document/Message Embeddings Instead
# You can now fit a model to the data:
SFIToolkit.stata("logit correct i.speakers i.seasons i.years i.game_score wembed1-wembed300")
# These results are fairly noisy, so maybe there would be better luck using document vectors
SFIToolkit.stata("drop token tokenid wembed*")
SFIToolkit.stata("duplicates drop")
# Now use the same process used above, but using document vectors
[ Data.addVarDouble('docembed' + str(i)) for i in range(1, 301) ]
# Then iterate over one row per message (instead of individual tokens), mirroring the
# drop and duplicates drop just done in Stata; this assumes the de-duplicated rows in
# Stata are still ordered by their first appearance here
messages = dataset.drop(columns = [ 'token', 'tokenid' ]).drop_duplicates()['messages'].tolist()
for ob, message in enumerate(messages):
    # Gets the spaCy embedding for the message
    embed = nlp(message)
    # Stores the document/message/sentence embedding for this record
    [ Data.storeAt("docembed" + str(dim + 1), ob, embed.vector[dim]) for dim in range(0, len(embed.vector)) ]
# This model fits the data a bit better than the previous model and is also noticeably faster.
SFIToolkit.stata("logit correct i.speakers i.seasons i.years i.game_score docembed1-docembed300")
Wrapping Up
Be mindful of compute resource consumption and availability.
The Python API and pystata have different functionality.
Look up information about available models and their training contexts.
You may need to train the model on your data for it to produce informative embeddings.
"It's always good to end with a slogan."
- Nicholas J. Cox,
North American Stata Users Group Conference 2021