Don't say what you mean, embed it.
How we learned to translate language for computers
The Landscape of NLP
In this post, we explore the world of Natural Language Processing (NLP) embeddings, their significance, and how they revolutionize the way machines understand human language. We will walk through a conceptual history of NLP techniques for encoding language and see the role each has played in how computers came to process and understand human language.
A Long Road
Transformer models have revolutionized industries and reshaped the AI landscape. To many, chatbots and other applications of Large Language Models (LLMs) appear almost magical, resembling the emergence of sentient artificial life. It seems like you input a prompt, a computer processes it through unexplainable reasoning, and out come abstract concepts about life and meaning. But how exactly does a computer capture the meaning, or semantics, of language?
Natural Language Processing (NLP) encompasses a range of techniques, algorithms, and methods used to derive quantitative meaning from language. From predicting the sentiment of tweets to building Transformer models, NLP is at the core. What makes these Transformer models so different? So superior? What will we eventually achieve with these models?
To understand our current position and future direction, we must first understand our past. This post aims to provide a high-level overview of how we have developed the ability to process the semantics of language with computers. This is NOT an exhaustive summary or explanation of these methods and techniques. Instead, I will highlight key conceptual milestones that have enabled us to process language as easily as we do numbers.
This is the first of several posts that will trace the evolution of NLP techniques and our understanding of language. These posts will track how our methods have evolved, leading us to transformer models, LLMs, and chatbots.
Precursor: A Binary System
Computers execute algorithms using a strict binary language: 0s and 1s. Every action your computer takes can be expressed in binary. This concept is crucial to today's discussion.
Numbers and Categories
In our binary system, both numeric and categorical data have to be represented with specific kinds of symbolic encoding.
Example
Imagine we have a dataset of reviews. These reviews could be from Google Maps, Yelp, or any other popular service for finding businesses. Each review is a row in our data. Our goal is to uncover semantic significance and relationships within the reviews. A common NLP technique is to discard words that do not carry substantial semantic weight. Often, we strip all "filler" words (according to a computer's definition) that are not nouns, verbs, adverbs, or adjectives. This post will not delve into this technique, but you can find more information by searching for "how to remove stopwords from text?".
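To make this concrete, here is a minimal sketch of stopword removal using NLTK. The single review below is a hypothetical stand-in for one row of the dataset; a real pipeline might also filter by part of speech, as described above.

```python
# A minimal stopword-removal sketch with NLTK; the review is a
# hypothetical stand-in for one row of the dataset described above.
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)

review = "The tacos were good but the service was not made right"
stop_words = set(stopwords.words("english"))

# Keep only the words that are not in NLTK's stopword list.
tokens = review.lower().split()
kept = [t for t in tokens if t not in stop_words]
print(kept)  # ['tacos', 'good', 'service', 'made', 'right']
```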
Terminology
Documents (Docs, D): These are the reviews.
Vocabulary (Vocab, V): These are the unique words across the reviews.
Vector (N): A list or array of numeric values.
Independent Variable (X): This is what our reviews are turned into; these are the variables we use to find a pattern.
Dependent Variable (Y): This is our target. In this case, it is our review star rating. We find a pattern between the Y variable and the X variable(s).
Bag-of-Words
How do you represent multiple categories with numbers while giving equal weight to each? One-Hot-Encoders came about as a solution to this question: you can represent each category in a unique way while maintaining an equal representation of all of them. Bag-of-Words (BoW) is more or less One-Hot-Encoding applied to language data.
After dropping all words that are not nouns, verbs, adverbs, or adjectives (plus a couple of extra words dropped for the sake of simplicity), we line up each unique word used across all reviews (i.e. docs) as the columns. The matrix that is left is D x V: docs as rows and unique vocab as columns.
The values within this matrix are all 1s and 0s: a 1 if a word is used in the given review and a 0 if it is not. The One-Hot-Encoded version of our data will look like this:
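If you would rather see this in code than in a table, here is a small sketch using scikit-learn's CountVectorizer with binary=True, which produces exactly this kind of 1/0 matrix. The three toy reviews are hypothetical stand-ins for the real data.

```python
# A Bag-of-Words sketch with scikit-learn; the three toy reviews are
# hypothetical stand-ins for the dataset described above.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "good tacos good service",  # doc 1
    "food not made right",      # doc 2
    "good spot not right",      # doc 3
]

# binary=True marks a word as 1 if it appears in a doc (any number of
# times) and 0 otherwise, giving the D x V matrix described above.
vectorizer = CountVectorizer(binary=True)
bow = vectorizer.fit_transform(docs)

print(pd.DataFrame(bow.toarray(), columns=vectorizer.get_feature_names_out()))
```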
Why are 1s and 0s Good?
Why are 0s and 1s more meaningful than the original words? Well, now we have a symbolic, numeric representation of the words across all reviews. Further, each doc can be represented as a vector of 1s and 0s. We could compare the vectors of each review and see, mathematically, which reviews are closer or more distant than others.
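As a quick illustration of that comparison, here is a cosine-similarity sketch over hypothetical 1/0 rows that follow the toy vocabulary from the previous snippet; it is only meant to show the idea of measuring distance between review vectors.

```python
# Comparing document vectors with cosine similarity; the 1/0 rows are
# hypothetical and follow the toy vocabulary used above
# (food, good, made, not, right, service, spot, tacos).
import numpy as np

def cosine_similarity(a, b):
    """Closer to 1 means the two docs share more of their vocabulary."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

doc_1 = np.array([0, 1, 0, 0, 0, 1, 0, 1])  # "good tacos good service"
doc_2 = np.array([1, 0, 1, 1, 1, 0, 0, 0])  # "food not made right"
doc_3 = np.array([0, 1, 0, 1, 1, 0, 1, 0])  # "good spot not right"

print(cosine_similarity(doc_2, doc_3))  # 0.5: they share 'not' and 'right'
print(cosine_similarity(doc_1, doc_2))  # 0.0: no words in common
```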
However, do you see the immediate problem(s)? We are now giving equal weight to each and every word in this matrix. For example, good is the most common word in our data, appearing 3 times, yet it has the same value as the word spot, which only appears once.
Term-Frequency
NLP techniques have steadily matured and grown more sophisticated over time. BoW and One-Hot-Encoding, although very useful for dealing with classes, were not enough for the NLP community. Eventually, term-frequency, the idea of giving more weight to words that appear more often than others, came into the picture.
The standard way term-frequency is used is within the technique term-frequency inverse-document-frequency, or tf-idf. However, I am going to use a slightly simplified version of this technique in this post. Feel free to look up the slightly more sophisticated math that the full method uses on your own.
We are going to follow these steps:
- Get the total number of words across all docs (i.e. reviews)
- Get counts per unique word
- Divide the count of the word by the total number of words used
- Generate tf-idf matrix of weighted occurrence values
1. and 2. Count the words
Tally up the total number of words used across all docs and the number of occurrences of each unique word.
3. Divide to get term-frequencies
Now we just divide the number of occurrences of a word by the total number of words used. The resulting numbers are the weights of each word.
4. tf-idf matrix
Finally, we simply replace the 1s in our BoW matrix with the term-frequencies (or, weights). As you can see, now each document (i.e. review) is a sparse vector and all of them together make a sparse matrix.
The 1s were all replaced with the correct weights of their respective words in hot pink / purple!
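Putting the four steps together, here is a sketch of this simplified weighting over the same hypothetical toy reviews. Keep in mind that the full tf-idf formula (e.g. scikit-learn's TfidfVectorizer) also multiplies in an inverse-document-frequency factor, which this simplified version skips.

```python
# A sketch of the simplified term-frequency weighting described above,
# applied to the same hypothetical toy reviews.
from collections import Counter

docs = [
    ["good", "tacos", "good", "service"],
    ["food", "not", "made", "right"],
    ["good", "spot", "not", "right"],
]

# Steps 1 and 2: total number of words and counts per unique word.
all_words = [word for doc in docs for word in doc]
total = len(all_words)
counts = Counter(all_words)

# Step 3: each word's weight is its count divided by the total.
weights = {word: count / total for word, count in counts.items()}

# Step 4: replace the 1s in the BoW matrix with these weights.
vocab = sorted(weights)
weighted = [[weights[w] if w in doc else 0.0 for w in vocab] for doc in docs]
for row in weighted:
    print([round(v, 3) for v in row])
```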
Is it Really Sparse?
This post is just showing a simple, toy example. If you had 10,000 reviews to work with, you would have many more unique words. Each of those unique words would usually not appear in each review. Thus, the review vectors and the tf-idf matrix would be sparse.
Translating from English to Vector
Now we have a mathematical representation of all of our reviews. Remember, the tf-idf matrix is in dimensions D x V: documents by unique vocabulary.
NLP is really the branch of Data Science that specifically deals with translating language, e.g. English, into a numerical representation, often vectors. See this comparison of the reviews alongside their vector representations.
How do you plot language?
The power of NLP techniques can be summarized with the following image. It shows that we took reviews written in the English language and turned them into points in a mathematical space. Here is the embedding of my reviews in the mathematical space spanned by the words good and right.
Note: For readability, I simplified this plot to only use 2 of the 8 dimensions.
I color-coded the docs (i.e. the reviews) to match the different colors of the points on the plot. I also showed the values for doc 4 and where those values came from.
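For the curious, here is a rough matplotlib sketch of how such a plot can be drawn. The coordinates come from the hypothetical toy weights computed earlier, not from the actual reviews behind the image above.

```python
# Plotting documents as points, keeping only the 'good' and 'right'
# dimensions; the weights are from the hypothetical toy example above.
import matplotlib.pyplot as plt

points = {
    "doc 1": (0.25, 0.0),    # uses 'good' but not 'right'
    "doc 2": (0.0, 0.167),   # uses 'right' but not 'good'
    "doc 3": (0.25, 0.167),  # uses both
}

for label, (x, y) in points.items():
    plt.scatter(x, y)
    plt.annotate(label, (x, y))

plt.xlabel("weight of 'good'")
plt.ylabel("weight of 'right'")
plt.title("Reviews as points in the 'good' / 'right' space")
plt.show()
```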
Double Entendre: Right?
A double entendre is a word or phrase that has 2 distinct meanings. A major problem with the image above and the methods we have used in this post so far is that they do not take into account the context in which words were used. Many words in most languages have several different meanings. Those meanings are context-specific.
The most obvious example of this in my example data is the word right. Right can be used to describe a direction (e.g. 'on the right'), confirmation (e.g. 'right on'), correctness (e.g. 'not made right' could be 'not made correct'), and many other potential meanings and uses.
As a sneak peek at what lies ahead in post 2, I will discuss how methods that use shallow neural networks generally allow a single representation per word, whereas models that use deep neural networks can have a single representation per meaning per word.
Why Translate our Language?
Mathematics is the language of our Universe. Further, computers are only able to process the language of mathematics (or at least their abstraction of it). The major, recent significance of chatbots and other applications of Transformer models comes from the fact that we have learned more sophisticated ways to translate our language.
We are not just encoding the words themselves, but rather their semantics. Embeddings are vectors that represent relationships between the meanings of the words that make up our language. We definitely have more work to do, and it is exciting to see where we will go from here.
Next Steps
The trick is that the nuanced semantic richness of our human-created languages is actually very hard to represent with one set of numbers. We are still a few steps away from our initial goal of tracing the history of NLP methods from Bag-of-Words and One-Hot-Encoding (binary encoding) to Transformer models. In the next blog post, we are going to discuss the vector representation of words and docs with Word2vec, shallow vs deep neural networks, and Attention Heads. All of this will naturally lead up to the use of Transformer models.
Master AI and Data Science!
Passionate about AI and Data Science? Elevate your skills with personalized, one-on-one tutoring sessions. Ready to dive deeper? Contact Oliver now!