Language Models are Mathematical

April 21, 2023

By: AEOP Membership Council Member Iishaan Inabathini

The sudden growth in machine learning that started with the popularity of deep learning in 2009 still hasn’t slowed down. Machine learning has reached a stage where the idea of artificial general intelligence seems achievable, maybe not even too far away.

The popular ChatGPT language model can answer questions in a way that reads as if a person wrote the reply. It can craft stories and write in different styles. What’s surprising about these machine learning models, which almost seem to think as they respond, is that they are purely mathematical. After a text input is transformed into a numerical form, the most powerful language models are mostly just a series of multiplications and additions.

This idea seems difficult to imagine but makes sense once explained. These models take their input as vectors (in this context, lists of numbers that also have a geometric meaning, which the calculus used in training relies on). The most powerful language models right now, which are based on the transformer architecture, first split the text input into pieces called tokens; a token may be a whole word, part of a word, or a symbol. Next, each token is transformed into a vector representation. The process of mapping each token to a fixed-length (fixed-dimension) vector is called embedding. The model works with these vectors from input to output, and the output vectors are then turned back into words, usually by a much simpler method. The output vector has the same number of elements as the vocabulary (think of a dictionary without definitions), and each element corresponds to a word or symbol in that vocabulary. In the simplest case, the word chosen is the one whose element has the largest value in the output vector; in practice, models often sample from the highest-scoring candidates instead.
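As a rough illustration of the steps above, here is a minimal Python sketch of an embedding lookup and of choosing the output word with the largest value. The four-word vocabulary, the embedding values, and the function names embed and pick_token are all made up for this example; real models learn their embeddings and use vocabularies with tens of thousands of entries.

```python
# Toy vocabulary (hypothetical): each token gets an index.
vocab = ["the", "cat", "sat", "mat"]
token_to_id = {token: index for index, token in enumerate(vocab)}

# Embedding table: one fixed-length vector (3 numbers here) per token.
# These values are invented; a real model learns them during training.
embedding_table = [
    [0.1, 0.3, -0.2],   # "the"
    [0.7, -0.1, 0.4],   # "cat"
    [-0.5, 0.2, 0.9],   # "sat"
    [0.0, 0.6, -0.3],   # "mat"
]

def embed(tokens):
    """Map each token to its vector by looking up its row in the table."""
    return [embedding_table[token_to_id[token]] for token in tokens]

def pick_token(output_vector):
    """Return the vocabulary entry whose element has the largest value."""
    best_index = max(range(len(output_vector)), key=lambda i: output_vector[i])
    return vocab[best_index]

vectors = embed(["the", "cat"])
# An invented output vector with one score per vocabulary entry.
chosen = pick_token([0.2, 1.5, 3.1, -0.4])
print(vectors)
print(chosen)
```

The output vector has one element per vocabulary entry, so the position of the largest element identifies the chosen word directly.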

In GPT models (generative pre-trained transformers, among the most capable language models at the time of writing), the embedding process is learned. This means that the model has parameters, also called weights, that are adjusted through training to make the embedding more effective. As the model is trained, the embeddings come to convert tokens to vectors in a way that captures semantic and syntactic relationships between the tokens. Another process used alongside embedding is positional encoding, which ensures that the vector representation of a token also records where that token sits in the input sequence. The original transformer model, released by a team at Google in 2017, used a fixed positional encoding. Positional encoding can also be learned, just like the embedding.
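The fixed positional encoding of the original transformer paper (Vaswani et al., 2017) uses sine and cosine waves of different frequencies. The sketch below is one way to write it in plain Python; the function names positional_encoding and add_position are chosen for this example, and adding the encoding to the embedding elementwise follows the paper.

```python
import math

def positional_encoding(position, dim):
    """Sinusoidal encoding for one position (after Vaswani et al., 2017)."""
    encoding = []
    for k in range(dim):
        # Dimensions 2i and 2i+1 share one frequency: 1 / 10000^(2i / dim).
        angle = position / (10000 ** ((2 * (k // 2)) / dim))
        # Even dimensions use sine, odd dimensions use cosine.
        encoding.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return encoding

def add_position(embedding, position):
    """Add the positional encoding to a token's embedding, element by element."""
    encoding = positional_encoding(position, len(embedding))
    return [e + p for e, p in zip(embedding, encoding)]

# The same embedding placed at two different positions gives different vectors.
first = add_position([0.5, 0.5, 0.5, 0.5], 0)
second = add_position([0.5, 0.5, 0.5, 0.5], 1)
print(first)
print(second)
```

Because the encoding depends only on the position and the vector length, nothing here needs to be learned, which is what makes it a fixed encoding.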

Once the language model has a vector for each token, it stacks the vectors into a matrix (think of the vectors stacked together to form a grid of values). Computers are much more efficient at doing mathematical computations on a matrix than on one vector at a time, which makes this matrix representation fundamental and ubiquitous in machine learning. Some of the operations used affect each token’s vector independently, while the attention step lets every token’s vector draw on information from the others. The matrix passes through a series of transformer blocks, the main building block of the most powerful language models today. Each block, which is a mathematical function at its essence, multiplies the input matrix by weight matrices, divides the results elementwise by a scaling factor, and applies the softmax function, which turns each row of scores into a probability distribution. The weight matrices, like the weights used in the embedding, are learned as the model is trained. These weight matrices become massive: GPT-3, for example, has around 175 billion parameters, which are the values in these matrices.
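To make the matrix operations more concrete, here is a simplified sketch of the attention step at the heart of a transformer block, written with plain Python lists. It assumes the division mentioned above is the scaling by the square root of the vector length used in Vaswani et al. (2017), and it leaves out the other parts of a real block (multiple heads, feed-forward layers, normalization). The weight matrices below are invented identity matrices, not learned values.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    """Turn a row of scores into positive values that sum to 1."""
    largest = max(row)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(v - largest) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(x, w_q, w_k, w_v):
    """Scaled dot-product attention over a matrix x with one row per token."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(q[0]))
    scores = [[s / scale for s in row] for row in matmul(q, transpose(k))]
    weights = [softmax(row) for row in scores]  # one distribution per token
    return matmul(weights, v), weights

# Two tokens, each a 2-number vector; identity weights keep the example readable.
x = [[1.0, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
output, weights = attention(x, identity, identity, identity)
print(weights)
print(output)
```

Each row of weights says how much one token draws on every token in the sequence, and the output rows are the correspondingly weighted mixes of the value vectors.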

Before training, the model’s weights are randomly initialized. A model in this state is useless, so the outputs it produces are meaningless. Training is therefore a very important step, and how well the model is trained plays a fundamental role in its performance. The optimization methods most commonly used in deep learning are forms of gradient descent, a method from differential calculus that computes the partial derivative of the model’s error with respect to each parameter and then nudges each parameter in the direction that reduces the error. Working these derivatives out by hand gets quite complicated with vector and matrix calculus, and becomes unmanageable for functions this complex. Nowadays, an algorithm called reverse-mode automatic differentiation is used, which can calculate all of these partial derivatives as long as the derivative of each individual kind of operation is defined once. For example, in the case of transformer models, the derivatives of matrix multiplication and of the softmax function only have to be defined once. This works mostly because of the chain rule of calculus.
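The idea of defining each operation’s derivative once can be shown with a very small reverse-mode automatic differentiation sketch. The Value class and the train function below are hypothetical names for this example and handle only single numbers with addition and multiplication; real libraries apply the same idea to whole matrices. The starting weight is fixed at 0.0 rather than random so the run is repeatable.

```python
class Value:
    """A number that remembers how it was computed so gradients can flow back."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents          # the Values this one was computed from
        self.local_grads = local_grads  # derivative with respect to each parent

    def __add__(self, other):
        # Derivative of a + b is 1 with respect to a and 1 with respect to b.
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        # Derivative of a * b is b with respect to a and a with respect to b.
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        """Apply the chain rule from the output back to every input."""
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in zip(node.parents, node.local_grads):
                parent.grad += local * node.grad

def train(steps=50, learning_rate=0.1):
    """Gradient descent on one weight w so that w * x gets close to y."""
    w, x, y = 0.0, 2.0, 6.0
    for _ in range(steps):
        weight = Value(w)
        error = weight * Value(x) + Value(-y)
        loss = error * error              # squared error
        loss.backward()                   # fills in weight.grad
        w -= learning_rate * weight.grad  # step against the gradient
    return w

print(train())
```

Only addition and multiplication needed their derivatives written out; the backward pass combines them for any expression built from those two operations.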

It should also be noted that these mathematical computations are enormous even for a computer. Going through this training process on a model as large as GPT-3 is incredibly expensive. By one estimate (Li, 2020), a single V100 GPU, which used to be one of the most powerful GPUs for machine learning on the market, would take 355 years to train the model, and training it on the lowest-cost cloud GPU on the market would cost around $4,600,000. It is clear that this kind of artificial intelligence would be out of reach without the powerful computers that exist today, despite the relatively simple mathematical definition of the model.
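A quick back-of-the-envelope check relates the two figures above. The sketch below uses only the 355 GPU-years and the $4,600,000 quoted in this article; the hourly price it prints is implied by those two numbers rather than taken from any price list, and the wall_clock_days helper assumes, unrealistically, that the work splits perfectly across GPUs.

```python
GPU_YEARS = 355             # single-GPU training time quoted above
TOTAL_COST_USD = 4_600_000  # cloud training cost quoted above
HOURS_PER_YEAR = 365 * 24

# Total GPU time expressed in hours.
gpu_hours = GPU_YEARS * HOURS_PER_YEAR

# Hourly price implied by the two quoted figures (a derived value, not a quote).
implied_rate = TOTAL_COST_USD / gpu_hours

def wall_clock_days(num_gpus):
    """Days needed if the work split perfectly across num_gpus (an idealization)."""
    return GPU_YEARS * 365 / num_gpus

print(gpu_hours)
print(round(implied_rate, 2))
print(wall_clock_days(1000))
```

Under that idealized assumption, a thousand such GPUs would still need roughly four months, which gives a sense of why training at this scale is limited to well-funded organizations.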

This article barely scratches the surface of the mathematics behind machine learning. It is a field that is built on probability, statistics, linear algebra, tensor calculus, optimization, and computer science. From the simplest to the most advanced, machine learning models, whether conversational models or image generation models, have deep mathematical backgrounds and meanings. The models are defined by math and implemented by computers.

References:

Alammar, J. (2020). The Illustrated Transformer. Github.io. https://jalammar.github.io/illustrated-transformer/

Li, C. (2020, June 3). OpenAI’s GPT-3 Language Model: A Technical Overview. Lambdalabs.com; Lambda, Inc. https://lambdalabs.com/blog/demystifying-gpt-3

Understanding the Natural Language Processing Architecture of Chat-GPT: A Quick Dive into Its Inner Workings. (2023). Linkedin.com. https://www.linkedin.com/pulse/understanding-natural-language-processing-chat-gpt-quick-rutenberg

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. ArXiv.org. https://arxiv.org/abs/1706.03762
