101 Practical Natural Language Processing
Natural Language Processing (NLP) is the art of working with unstructured, free-form text. Almost every well-known NLP task, such as sentiment analysis, semantic search, or named entity recognition, requires a basic understanding of word vectors. This post focuses on the practical aspects of word vectors:
- What is a word vector?
- Why are they significant?
- How to get word vectors?
What is a word vector?
Simply put, a word vector is the representation, or embedding, of a word as an n-dimensional vector of numbers.
Why are they significant?
A human can easily tell that the words “pen” and “pencil” are used in the same context, or that they fall into the same category. This is because humans absorb a great deal of knowledge about a language through reading and talking. Our objective is to teach a machine that the two words are semantically similar. This is where word vectors come in.
In 2013, Mikolov et al. published the paper Efficient Estimation of Word Representations in Vector Space, in which they describe a set of elegant algorithms for obtaining vector representations of words that reflect semantic knowledge. These methods are collectively called word2vec, and they have changed the way NLP researchers deal with text.
How does word2vec work?
The data: word2vec is an unsupervised algorithm, in the sense that no manual labelling of data is required. The training data for word2vec is simply a large amount of text. word2vec models are often trained on Wikipedia data, as it is a free and open source of large amounts of well-written text.
The training TLDR: In layman's terms, word2vec looks at the n words surrounding each word in our data and learns its context.
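To make “looking at the surrounding n words” concrete, here is a minimal sketch of how (word, neighbour) pairs could be collected from one sentence. The function name context_pairs and its window parameter are made up for this illustration, and real word2vec implementations do more than this (for example, they subsample very frequent words).

```python
def context_pairs(tokens, window=2):
    """Collect (center word, neighbouring word) pairs within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


pairs = context_pairs(["Batman", "is", "a", "superhero"], window=1)
print(pairs)
# [('Batman', 'is'), ('is', 'Batman'), ('is', 'a'),
#  ('a', 'is'), ('a', 'superhero'), ('superhero', 'a')]
```

A larger window lets each word see more distant neighbours, at the cost of noisier context.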
At the beginning of training, word2vec assigns a random vector to each word in the data, and in each iteration it “pushes” the vector of a word closer to the vectors of the surrounding words.
Consider the two sentences “Batman is a superhero” and “Superman is the best superhero in the world.” Since the vector of each word is “pushed” closer to the vectors of the words surrounding it, “Batman” and “Superman” end up with similar vector representations even though they never occur near each other in our data, because they occur in similar contexts across our large corpus of text.
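The “pushing” idea can be mimicked with a deliberately naive toy. This is not the actual word2vec training procedure: real word2vec optimises a prediction objective and also pushes words away from words they do not occur with, which is what stops every vector from collapsing onto the same point, as eventually happens here. The dimension, window, step size, and number of passes below are arbitrary choices for the sketch.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


sentences = [
    "batman is a superhero".split(),
    "superman is the best superhero in the world".split(),
]

random.seed(0)
dim, window, step = 8, 2, 0.1  # arbitrary toy settings

# Start from a random vector for every distinct word.
vectors = {w: [random.uniform(-1, 1) for _ in range(dim)]
           for tokens in sentences for w in tokens}

before = cosine(vectors["batman"], vectors["superman"])

for _ in range(200):
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j == i:
                    continue
                neighbour = vectors[tokens[j]]
                # Move the word's vector a small step toward its neighbour's vector.
                vectors[word] = [a + step * (b - a)
                                 for a, b in zip(vectors[word], neighbour)]

after = cosine(vectors["batman"], vectors["superman"])
print(before, after)  # the similarity after the updates is higher than before
```

“batman” and “superman” never appear in the same sentence, yet their vectors move together because both are pulled toward shared neighbours such as “is” and “superhero”.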
Note: Since I want to focus on the more practical aspects of dealing with text, the above is a very generalized, simple explanation of word2vec that does not cover its different variants, such as CBOW and skip-gram. For an in-depth explanation, please take a look at the following resources.
The following image shows the words that are nearest (by cosine distance) to the word “football” in a pre-trained word2vec model.
Please take a look at https://projector.tensorflow.org/ for visualizing the word similarities and the word map generated by word2vec.
Properties of word embeddings generated by word2vec:
The vectors generated by word2vec have some very interesting properties.
The most obvious property of word embeddings is that the embeddings of semantically similar words have a higher cosine similarity to each other than the embeddings of unrelated words.
One other interesting property is that the vectors corresponding to the “important” words in the corpus tend to have a higher magnitude than the vectors of unimportant words or stop words. This property is very useful when analysing unstructured text, as it lets us easily see the significant words and the words semantically similar to them.
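As a small illustration of what “similarity value” means here, the sketch below computes cosine similarity and ranks neighbours with it. The three vectors are hand-written for the example, not taken from any trained model, and the helper names are my own.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors):
    """Rank every other word by cosine similarity to `word`, best first."""
    scores = [(other, cosine_similarity(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical 4-dimensional vectors, for illustration only.
toy_vectors = {
    "pen":    [0.9, 0.1, 0.3, 0.0],
    "pencil": [0.8, 0.2, 0.4, 0.1],
    "banana": [0.0, 0.9, 0.1, 0.7],
}

ranking = most_similar("pen", toy_vectors)
print(ranking)  # "pencil" is ranked ahead of "banana"
```

Cosine similarity ignores vector length and only compares direction, which is why it is the usual choice for comparing word vectors.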
See the most significant words of the Harry Potter books here.
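Ranking words by vector magnitude could be sketched as follows. The vectors are invented so that the content words come out longer than the stop words; with a trained gensim model, one would apply the same idea to the vectors in model.wv (for example with numpy.linalg.norm), which is an assumption about how you would wire it up rather than something shown in this post.

```python
import math


def vector_norm(v):
    """Euclidean length (magnitude) of a vector."""
    return math.sqrt(sum(a * a for a in v))


def rank_by_magnitude(vectors):
    """Return the words sorted from longest vector to shortest."""
    return sorted(vectors, key=lambda w: vector_norm(vectors[w]), reverse=True)


# Hypothetical vectors, for illustration only.
toy_vectors = {
    "the":    [0.1, -0.1, 0.05],
    "of":     [0.05, 0.05, -0.1],
    "wand":   [1.2, -0.8, 0.9],
    "potion": [-0.7, 1.1, 0.6],
}

ranked = rank_by_magnitude(toy_vectors)
print(ranked)  # ['wand', 'potion', 'the', 'of']
```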
Experimenting with word2vec:
Gensim is a very useful Python library for working with word vectors.
The following steps show how to install gensim, then train and experiment with a Word2Vec model:
- First, we need some text data to train the model on. Let’s use the text of Harry Potter and the Order of the Phoenix. You can download it from here. Note that even though this is an entire book, it is still a very small amount of text, and the results will not be as good as those of pre-trained word2vec models trained on large amounts of text such as the English Wikipedia.
- Now, let’s install the packages
This small tutorial needs the following packages, so make sure you install them before starting.
pip install --upgrade gensim
pip install nltk
After installing nltk, please run the following in a Python terminal to download nltk’s required data files.
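For the tokenizers used in this tutorial, the data in question is most likely the Punkt sentence tokenizer. That is my assumption based on the nltk.sent_tokenize and nltk.word_tokenize calls used later, so treat the resource names below as a sketch rather than the exact original command.

```python
import nltk

# 'punkt' holds the sentence tokenizer models used by nltk.sent_tokenize
# (assumed to be the data files this tutorial needs).
nltk.download('punkt')

# Newer NLTK releases look for 'punkt_tab' instead; fetching both is harmless.
nltk.download('punkt_tab')
```

Both calls need an internet connection the first time and do nothing if the data is already up to date.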
- After this is done, run the following lines one by one to train a basic Word2Vec model on the Harry Potter book.
import nltk
from gensim.models import Word2Vec

# Read the whole book into a single string
with open('Harry Potter and the Order of the Phoenix.txt') as f:
    data = f.read()
# Replace newline characters, tabs etc. with single spaces
data = ' '.join(data.split())
# Split the text into meaningful sentences, e.g.
# 'Well, it changes every day, you see,' said Harry.
data = nltk.sent_tokenize(data)
# Split each sentence into a list of words
data = [nltk.word_tokenize(line) for line in data]
# Each entry is now a list of tokens, e.g.
# ["'Well", ',', 'it', 'changes', 'every', 'day', ',', 'you', 'see', ',', "'", 'said', 'Harry', '.']
print(data[0])
# Train the model (min_count=1 keeps every word, however rare)
model = Word2Vec(data, min_count=1)
# Look up the words most similar to a query word ('Harry' is an assumed example query)
print(model.wv.most_similar('Harry'))
# Output of the form:
# [('Dudley', 0.6178644895553589), ('Lupin', 0.6073513031005859), ('Moody', 0.6009576916694641), ('Hermione', 0.5514477491378784), ('Dumbledore', 0.550061821937561), ('James', 0.549625813961029), ('Cho', 0.5431044697761536), ('Snape', 0.543087363243103), ('Ron', 0.5417178273200989), ('Tonks', 0.519599199295044)]
Here, we used nltk’s sentence tokenization function to split the raw text into sentences, tokenized each sentence into words, and passed the result to the Word2Vec model.
We use the sentence tokenization function instead of Python’s readlines() function because a single English sentence may be spread across multiple lines in a text file. In practice, we need to follow some data cleaning and preprocessing methods to get the best results, which I will cover in my next post.
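The difference between line-based and sentence-based splitting can be seen on a two-line snippet. In this sketch the second sentence (“Ron nodded.”) is made up, and the regular expression is only a crude stand-in for nltk.sent_tokenize, which handles abbreviations and other edge cases far better.

```python
import re

# A sentence wrapped across two lines, as often happens in plain-text books.
raw = "'Well, it changes every day,\nyou see,' said Harry. Ron nodded.\n"

# Line-based splitting cuts the first sentence in half.
lines = raw.splitlines()

# Collapsing whitespace first, then splitting after sentence-ending punctuation,
# keeps each sentence whole.
flat = " ".join(raw.split())
sentences = re.split(r"(?<=[.!?])\s+", flat)

print(lines)
# ["'Well, it changes every day,", "you see,' said Harry. Ron nodded."]
print(sentences)
# ["'Well, it changes every day, you see,' said Harry.", 'Ron nodded.']
```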
Apart from min_count=1, which keeps every word no matter how rarely it occurs, we used the default parameters provided by the gensim library. See https://radimrehurek.com/gensim/models/word2vec.html for the complete list of parameters, which must be tuned carefully to get the best results for any type of model.
After playing with the trained model for some time, you will notice that it has no vector for the word “harry”, even though it has one for “Harry”. Here lies one of the major shortcomings of word2vec. Since it is a word-level model, it cannot generate vectors for out-of-vocabulary words, and although a human can easily see that ‘Harry’ and ‘harry’ are the same, this model cannot. In my next post, I will discuss current state-of-the-art models that aim to overcome this, as well as the practical aspects of preparing data for training.
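Until then, one common workaround for the case problem is to fall back to other capitalisations at lookup time (or to lowercase the whole corpus before training). The helper below is a hypothetical sketch of the first option; it works on a plain dict here, and it should work the same way with gensim’s model.wv, which also supports the in and [] operations.

```python
def lookup(vectors, word):
    """Return the vector for `word`, trying a few case variants before giving up."""
    for candidate in (word, word.lower(), word.capitalize()):
        if candidate in vectors:
            return vectors[candidate]
    return None  # genuinely out of vocabulary


# Hypothetical stand-in vocabulary with made-up 2-dimensional vectors.
toy_vectors = {"Harry": [0.2, 0.7], "wand": [0.5, 0.1]}

print(lookup(toy_vectors, "harry"))      # [0.2, 0.7]
print(lookup(toy_vectors, "WAND"))       # [0.5, 0.1]
print(lookup(toy_vectors, "Voldemort"))  # None
```

This only papers over capitalisation; a word that never appeared in the training text in any form still has no vector.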
Note: To play with word vectors, download the pre-trained word vectors published by Google from here. These were trained on roughly 100 billion words from the Google News dataset and will give significantly better results than the model we trained above.