In the age of information, language models have become an essential tool for various applications, from chatbots to recommendation systems. Among these, Large Language Models stand out for their ability to understand and generate human-like text. But how do they work? This article aims to demystify the underlying mechanisms of Large Language Models, focusing on the concept of word-to-vector calculations and embeddings.
What is a Large Language Model?
If you know anything about this subject, you’ve probably heard that LLMs are trained to “predict the next word” and that they require huge amounts of text to do this. But that is where the explanation stops. The details of how they predict the next word are often treated as a deep mystery. A Large Language Model is a machine learning model trained on a vast dataset comprising text from books, websites, and other sources. It learns the statistical properties of the language, such as syntax, semantics, and context, to generate text that closely resembles human language. These models are often built using neural networks, particularly the Transformer architecture, which enables them to handle long sequences of text effectively.
How Do They Work?
The functioning of a Large Language Model can be broken down into the following steps: tokenization, embedding, contextualization, and decoding.
Types of Tokenization
Word Tokenization: Splits the text into words based on spaces or punctuation marks.
Sub-word Tokenization: Breaks down words into smaller meaningful units.
Character Tokenization: Splits the text into individual characters.
Importance: Tokenization is crucial for preparing the text for numerical processing. It helps the model to understand the boundaries between different words or sub-words, which is essential for capturing the semantics of the text.
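The three tokenization styles above can be sketched in a few lines of Python. The vocabulary passed to the subword tokenizer is invented for illustration; real systems learn their vocabulary with algorithms such as BPE or WordPiece.

```python
import re

def word_tokenize(text):
    # Split into words; punctuation becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

def char_tokenize(text):
    # Every character is a token.
    return list(text)

def subword_tokenize(word, vocab):
    # Toy greedy longest-match segmentation against a hand-picked
    # vocabulary; real tokenizers learn theirs (BPE, WordPiece, etc.).
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # fall back to a single character
            i += 1
    return tokens

print(word_tokenize("The cat sat on the mat."))
# ['The', 'cat', 'sat', 'on', 'the', 'mat', '.']
print(subword_tokenize("unhappiness", {"un", "happi", "ness"}))
# ['un', 'happi', 'ness']
```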
Word embeddings are often pre-trained on a large corpus of text. The most common methods for generating embeddings include Word2Vec, GloVe, and FastText.
Word2Vec: Uses neural networks to learn word associations from a text corpus.
GloVe: Stands for “Global Vectors for Word Representation” and is based on matrix factorization techniques.
FastText: Similar to Word2Vec but considers subword information.
Let’s consider the word “apple.” In a 3D embedding space, “apple” could be represented as a vector like [0.2, 0.4, 0.1], where each dimension could represent a feature like ‘tastiness,’ ‘color,’ or ‘shape.’
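Continuing the hypothetical 3D example, here is a sketch of how similarity between such vectors is commonly measured using cosine similarity. The vectors, including the extra “car” entry, are invented for illustration; real embedding dimensions are learned rather than hand-labeled.

```python
import math

# Hypothetical 3-D vectors; real embeddings have hundreds of
# learned, unlabeled dimensions.
embeddings = {
    "apple":  [0.2, 0.4, 0.1],
    "orange": [0.25, 0.35, 0.15],
    "car":    [0.9, 0.05, 0.8],
}

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means identical
    # direction, values near 0 mean unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

print(cosine_similarity(embeddings["apple"], embeddings["orange"]))  # close to 1
print(cosine_similarity(embeddings["apple"], embeddings["car"]))     # much lower
```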
After embedding, the next step is to understand the context in which each token appears. This is done by processing these vectors through a neural network, often a Transformer architecture.
How it Works
The Transformer architecture uses mechanisms like attention to weigh the importance of different parts of the input text. For example, in the sentence “The cat sat on the mat,” the word “cat” is more closely related to “mat” than to “The.”
Contextualization allows the model to capture relationships between tokens, which is crucial for understanding the meaning of a sentence or a paragraph.
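As a rough sketch of the attention idea, here is a scaled dot-product attention calculation for a single query over a tiny sentence. The 2-D vectors are invented for illustration; real models learn separate query, key, and value projections and use many attention heads.

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    # Scaled dot-product attention for a single query vector.
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    # Output is the weighted sum of the value vectors.
    output = [sum(w * v[i] for w, v in zip(weights, values))
              for i in range(len(values[0]))]
    return output, weights

# Hypothetical 2-D vectors for "The", "cat", "mat" -- invented for
# illustration; a real model learns these representations.
keys = values = [[0.1, 0.0], [0.9, 0.4], [0.8, 0.5]]
query = [0.9, 0.4]  # the vector for "cat" attending over the sentence

output, weights = attention(query, keys, values)
print(weights)  # "cat" weights "cat" and "mat" more heavily than "The"
```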
Word-to-Vector Calculations and Embeddings
To understand how language models work, you first need to understand how they represent words. Humans represent English words with a sequence of letters, like C-A-T for “cat.” Language models use a long list of numbers called a “word vector.” For example, here’s one way to represent the word cat as a vector:
[0.0074, 0.0030, -0.0105, 0.0742, 0.0765, -0.0011, 0.0265, 0.0106, 0.0191, 0.0038, -0.0468, -0.0212, 0.0091, 0.0030, -0.0563, -0.0396, -0.0998, -0.0796, …, 0.0002]
The actual vector for “cat” is 300 numbers long.
So why do we use such a complex notation to represent words? This approach is not new. For example:
Washington, DC, is located at 38.9 degrees north and 77 degrees west. We can represent the locations of these cities using vector notation (here, [degrees north, degrees west]):
• Washington, DC, is at [38.9, 77]
• New York is at [40.7, 74]
• London is at [51.5, 0.1]
• Paris is at [48.9, -2.4]
This is useful for reasoning about spatial relationships. You can tell New York is close to Washington, DC, because 38.9 is close to 40.7 and 77 is close to 74. By the same token, Paris is close to London. But Paris is far from Washington, DC. Words are too complex to represent in only two dimensions, so language models use vector spaces with hundreds or even thousands of dimensions. In GPT-3, for example, each word was represented by a vector with 12,288 dimensions. The human mind can’t envision a space with that many dimensions, but computers are perfectly capable of reasoning about them and producing useful results.
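The closeness argument above can be checked directly by computing distances between the city vectors (plain Euclidean distance here, ignoring the Earth’s curvature):

```python
import math

# [degrees north, degrees west] pairs from the text.
cities = {
    "Washington, DC": [38.9, 77.0],
    "New York":       [40.7, 74.0],
    "London":         [51.5, 0.1],
    "Paris":          [48.9, -2.4],
}

def distance(a, b):
    # Euclidean distance between two vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(distance(cities["Washington, DC"], cities["New York"]))  # small
print(distance(cities["London"], cities["Paris"]))             # small
print(distance(cities["Paris"], cities["Washington, DC"]))     # large
```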
Here is a very simple example of representing fruits in a simple 3-dimensional space:
Suppose we have three words: “apple,” “orange,” and “fruit.” In a simplistic model, we can represent these words in a 3-dimensional space where each axis represents a feature like ‘sweetness,’ ‘color,’ and ‘edibility.’
Here, each word is represented as a point in this 3D space. These points are the word vectors, and the space they inhabit is the embedding space. The closer the vectors, the more similar the words are in meaning. Researchers have been experimenting with word vectors for decades, but the concept really took off when Google announced its word2vec project in 2013.
How is the vector information stored?
Information about the meaning of words, sentences, and documents is stored as vectors, often in a vector database. Vector databases make it possible to quickly search and compare enormous collections of vectors. Modern embedding models capture the semantics behind text and translate it into vectors, which lets an LLM-based system efficiently compare sentences with each other. Of course, this also means such a system cannot know everything, only the information that has been vectorized, which is in essence a frozen view of the world at a certain point in time.
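A minimal sketch of what a vector database does under the hood: brute-force similarity search over stored embeddings. The sentence vectors here are invented for illustration, and production systems use approximate indexes (for example HNSW) rather than exhaustive comparison.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def nearest(query, store, k=2):
    # Brute-force nearest-neighbor search: rank every stored vector
    # by similarity to the query and return the top k names.
    ranked = sorted(store, key=lambda name: cosine_similarity(query, store[name]),
                    reverse=True)
    return ranked[:k]

# Hypothetical sentence embeddings, invented for illustration.
store = {
    "cats are pets":     [0.9, 0.1, 0.2],
    "dogs are pets":     [0.8, 0.2, 0.3],
    "stocks fell today": [0.1, 0.9, 0.7],
}
query = [0.85, 0.15, 0.25]  # pretend embedding of "my cat is an animal"
print(nearest(query, store))  # the two pet sentences rank first
```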
Because these vectors are built from the way humans use words, they end up reflecting many of the biases that are present in human language. For example, in some word vector models, “Doctor minus man plus woman” yields “Nurse.” Mitigating biases like this is an area of active research.
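The “minus/plus” arithmetic works because analogies show up as consistent offsets between vectors. Here is a toy sketch using the classic king/man/woman/queen example; the vectors are invented so the arithmetic comes out cleanly, whereas real models learn them from corpora.

```python
import math

# Tiny invented vectors chosen so the analogy works out; real models
# learn vectors with hundreds of dimensions from text corpora.
vecs = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.2, 0.8],
    "apple": [0.5, 0.5, 0.1],
}

def add(a, b): return [x + y for x, y in zip(a, b)]
def sub(a, b): return [x - y for x, y in zip(a, b)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

# king - man + woman ~= ?
target = add(sub(vecs["king"], vecs["man"]), vecs["woman"])
best = max((w for w in vecs if w not in ("king", "man", "woman")),
           key=lambda w: cosine(target, vecs[w]))
print(best)  # 'queen'
```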
How do LLMs figure out context or meaning in a sentence?
A simple word vector scheme like this doesn’t capture an important fact about natural language: Words often have multiple meanings.
For example, the word “bank” can refer to a financial institution, to the act of turning an airplane, or to the land next to a river. Or consider the following sentences:
• John picks up a magazine.
• Susan works for a magazine.
The meanings of “magazine” in these two sentences are related but subtly different. John picks up a physical magazine, while Susan works for an organization that publishes physical magazines.
Words that have two or more unrelated meanings, like “bank,” are called homonyms, while words with two closely related meanings, like “magazine,” are what linguists call polysemous. LLMs are designed so that they can represent the same word with different vectors, depending on the context in which that word appears.
So how do LLMs resolve these ambiguities and figure out meaning without understanding the real world around them? Consider a sentence such as “John wants to deposit money at his bank.” What follows is an oversimplified view of how a 96-layer model processes it: at each of the 96 transformer layers, the word vectors are modified slightly.
In this simplified model, the first transformer layer figures out that “wants” and “deposit” are both verbs (both words can also be nouns). It stores this by modifying the word vectors to represent the new information. These new vectors, known as the hidden state, are passed up to the next layer. The second transformer layer adds two other bits of context: it clarifies that “bank” refers to a financial institution rather than a riverbank, and that “his” is a pronoun that refers to John. The first few layers of the LLM focus on understanding the sentence’s syntax and resolving ambiguities like the ones above. Later layers work to develop a high-level understanding of the passage. The goal is for the highest and final layer of the network to output a hidden state for the final word that includes all the information necessary to predict the next word.
The transformer has a two-step process for updating the hidden state for each word of the input passage: an attention step, in which each word looks at the other words in the passage and gathers information from the ones relevant to it, and a feed-forward step, in which each word independently processes the information gathered during attention.
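The two steps can be sketched as follows. This is a heavily simplified stand-in: real layers use learned weight matrices, multiple attention heads, residual connections, and normalization, all omitted here.

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_step(states):
    # Step 1: each position gathers information from every position,
    # weighted by dot-product similarity. (Real layers use learned
    # query/key/value projections and multiple heads.)
    d = len(states[0])
    new = []
    for q in states:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in states]
        w = softmax(scores)
        new.append([sum(wi * s[i] for wi, s in zip(w, states))
                    for i in range(d)])
    return new

def feed_forward_step(states):
    # Step 2: each position is transformed independently; a fixed
    # ReLU-style nonlinearity stands in for learned weights here.
    return [[max(0.0, x) for x in s] for s in states]

def transformer(states, layers=3):
    # A model like GPT-3 stacks ~96 such layers; we run just a few.
    for _ in range(layers):
        states = feed_forward_step(attention_step(states))
    return states

hidden = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.4]]  # invented 2-D states
print(transformer(hidden))
```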
How it all Works Together
The contextualized vectors are passed through a decoder, another part of the Transformer architecture, which generates the output text token by token. The model uses a probability distribution to pick the most likely next token based on the context.
If the input is “How are you,” the model might generate an output like “I’m fine, thank you,” based on the probabilities of each token appearing after the input context.
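That final step can be sketched with invented probabilities for the tokens that might follow “How are you.” Greedy decoding always takes the most likely token, while sampling (optionally with a temperature) introduces variety:

```python
import random

# Hypothetical next-token probabilities -- invented for illustration.
next_token_probs = {
    "I'm": 0.55,
    "Fine": 0.25,
    "Good": 0.15,
    "Banana": 0.05,
}

def greedy(probs):
    # Always pick the single most likely token.
    return max(probs, key=probs.get)

def sample(probs, temperature=1.0, rng=random):
    # Temperature < 1 sharpens the distribution; > 1 flattens it.
    weights = [p ** (1.0 / temperature) for p in probs.values()]
    return rng.choices(list(probs), weights=weights, k=1)[0]

print(greedy(next_token_probs))  # "I'm"
print(sample(next_token_probs))  # usually "I'm", sometimes another token
```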
When you put the different functions together, you get the following end-to-end flow: tokenize the input, embed the tokens, contextualize the vectors through the transformer layers, and finally decode the result into output text.
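The end-to-end flow can be sketched as a chain of toy stand-ins; every function below is a placeholder for the real component it names, with invented vectors and a canned reply.

```python
def tokenize(text):
    # Stand-in for a real tokenizer.
    return text.lower().split()

def embed(tokens, table):
    # Look up a (hypothetical) vector for each token.
    return [table.get(t, [0.0, 0.0]) for t in tokens]

def contextualize(vectors):
    # Stand-in for the transformer: blend each vector with the
    # average of the whole sequence.
    n = len(vectors)
    mean = [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]
    return [[(x + m) / 2 for x, m in zip(v, mean)] for v in vectors]

def decode(vectors):
    # Stand-in for the decoder: a canned reply instead of real
    # token-by-token generation.
    return "I'm fine, thank you."

table = {"how": [0.1, 0.2], "are": [0.3, 0.1], "you": [0.2, 0.4]}
tokens = tokenize("How are you")
reply = decode(contextualize(embed(tokens, table)))
print(reply)
```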
Large Language Models (LLMs) have revolutionized the way we interact with machines, offering a more human-like experience. They are a type of artificial intelligence (AI) trained on massive amounts of text data, including books, articles, code, and other forms of written communication; multimodal variants also train on images. LLMs use this data to learn the relationships between words and phrases, as well as the meaning of different types of text. This allows them to perform a variety of tasks, such as translating languages, writing different kinds of creative content, and answering questions in an informative way. As the technology continues to develop, these models are expected to become even more sophisticated, opening up new possibilities for a wide range of applications.