Technology Blogs by SAP
Louenas
Product and Topic Expert

Introduction:

Generative AI (GenAI) keeps generating news and innovative applications across many domains and industries. The potential impact of GenAI on businesses and society is becoming difficult to deny. Thought leaders have compared its impact to the invention of the transistor, the internet, and even fire.

As part of the Partner Ecosystem Success group at SAP, I have had the opportunity to talk to partners and customers, and they confirm that Business AI is at the top of the agenda, as SAP's CEO Christian K. and Walter S., Head of AI at SAP, express in the first video linked below.

SAP is at the forefront of this new enterprise revolution, and rightfully so: SAP applications constitute the digital platform of business around the world, where business data, the fuel of generative Business AI, is generated and transformed. As the saying goes, with great power comes great responsibility; therefore, SAP is strategically focusing on relevant, reliable, and responsible AI.

On the GenAI technology front, here are the latest announcements from SAP:

Generative AI hub: This is the place where SAP and its ecosystem will integrate LLMs and AI into new business processes in a cost-efficient manner.

Joule: The generative Business AI assistant from SAP. Joule will become the main UX for SAP applications like S/4HANA Cloud and SuccessFactors HCM, providing information retrieval, navigational assistance, transactional assistance, and analytics assistance with ad-hoc, natural-language querying capabilities.

SAP Build Code: This is where the developer community will build new user experiences at far greater speed, thanks to embedded productivity tools powered by GenAI.

SAP HANA Database Vector Engine: This is the in-memory database from SAP, integrating under a single set of APIs all the engines required for transactional workloads, analytics/ML, and now GenAI, thanks to the added support for a vector engine. The vector engine is the technical component used for managing embeddings and using them to ground GenAI business use cases. Grounding is a term used in the context of RAG (Retrieval-Augmented Generation) for conditioning the LLM with the relevant context and thereby limiting hallucinations (a conceptual sketch of this retrieval step follows below).
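
To make the grounding idea more concrete, here is a minimal, purely conceptual sketch of the retrieval step in plain Python with NumPy. It does not use the SAP HANA Cloud vector engine APIs, and the document vectors and question vector are made up for illustration; in a real scenario they would come from an embedding model and be stored and searched in the database.

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical embedding vectors of stored business documents
document_vectors = {
    "invoice_policy":  np.array([0.9, 0.1, 0.0]),
    "travel_policy":   np.array([0.1, 0.9, 0.0]),
    "security_policy": np.array([0.0, 0.2, 0.9]),
}

# Hypothetical embedding of the user's question
question_vector = np.array([0.85, 0.15, 0.05])

# Retrieve the most similar document and use it to ground the prompt
best_doc = max(document_vectors, key=lambda name: cosine_similarity(question_vector, document_vectors[name]))
grounded_prompt = f"Using the context from '{best_doc}', answer the user's question."
print(grounded_prompt)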

Understanding Transformers and Embeddings:

Talking to partners and customers about GenAI is fun, but for more productive brainstorming, it's sometimes important to level-set on the technical side with some GenAI basics, such as embeddings and Transformers.

There are many great blogs and videos explaining these topics in detail using GPT-2, which is a great reference model since most LLMs use a very similar architecture. The resources that helped me most, and that seem best suited for people with a technical background, are from Niels Rogge of Hugging Face and from Andrej Karpathy.

Transformers like GPT act on tokens. Tokens are sub-words, and the set of all of an LLM's sub-words is its vocabulary. GPT-2 has a vocabulary size of 50,257 tokens, with each token represented by an embedding of 768 dimensions. In comparison, GPT-4 seems to use a vocabulary of about 100,256 tokens, and OpenAI's default embedding vectors have 1,536 dimensions (a larger size of 3,072 dimensions comes with the text-embedding-3-large model).
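
If you want to verify these GPT-2 numbers yourself, a quick sanity check with the Hugging Face transformers library (the same one used in the code below) looks like this:

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Vocabulary size: 50257 tokens
print(tokenizer.vocab_size)

# Token embedding matrix: one 768-dimensional vector per vocabulary entry, i.e. shape (50257, 768)
print(model.get_input_embeddings().weight.shape)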

The tokens are produced by a tokenization process that tries to maximize the compression of language into a vocabulary of adequate size. In GPT, the embedding vectors are learned during the Transformer's training phase (see the second video below from Niels).

You can experiment with tokenization on OpenAI's online tokenizer. For example, "hello world, my name is" produces these 6 tokens: ['hello', ' world', ',', ' my', ' name', ' is']. Translated literally to French as "bonjour monde, mon nom est", it produces 7 tokens. Less common words produce more tokens; therefore, fun fact, non-English languages tend to produce more tokens, i.e., a higher dollar cost for the same number of words in a given prompt.
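
If you prefer to inspect tokens locally instead of on the web page, here is a small sketch using OpenAI's tiktoken library (assuming it is installed, e.g. via pip install tiktoken); note that the exact counts depend on which encoding you pick:

import tiktoken

# cl100k_base is the encoding used by the GPT-4 / GPT-3.5 era models
encoding = tiktoken.get_encoding("cl100k_base")

english = "hello world, my name is"
french = "bonjour monde, mon nom est"

# Show the individual tokens of the English prompt
print([encoding.decode([token_id]) for token_id in encoding.encode(english)])

# Compare how many tokens each language needs for the same sentence
print(len(encoding.encode(english)), len(encoding.encode(french)))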

One forward pass through the Transformer architecture of GPT-2, i.e., predicting one next token, can be summarized in Python as follows. The code is self-explanatory with detailed inline comments.

 

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the GPT-2 tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Prompt
prompt = "hello, my name is" 

# Tokenize the input text and return their IDs
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Pass the tokenized input through the GPT-2 model to get the logits (unnormalized scores) representing the likelihood of each possible next token.
logits = model(input_ids).logits

# Slice the output to get the logits only for the last position in the sequence, as we're predicting the very next token
next_token_logits = logits[:, -1, :]

# Find the index (ID) of the token with the highest score (greedy decoding)
next_token_id = torch.argmax(next_token_logits, dim=-1)

# Decode, i.e., convert the numerical token ID back into its corresponding text
next_token = tokenizer.decode(next_token_id[0])

# Print the predicted next token
print(f"Next token is {next_token}")

 

When running this program in a Jupyter Notebook or Visual Studio Code, the predicted next token will be "John". As a developer, you can add a loop that keeps appending next_token to the end of the prompt and performs another forward pass to see what the following tokens will be.
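
Here is a minimal sketch of such a loop, reusing the tokenizer and model already loaded in the snippet above; it greedily appends ten more tokens to the prompt:

import torch

# Start from the same prompt and greedily generate 10 additional tokens
generated_ids = tokenizer("hello, my name is", return_tensors="pt").input_ids
for _ in range(10):
    with torch.no_grad():
        logits = model(generated_ids).logits
    # Take the most likely next token and append it to the running sequence
    next_token_id = torch.argmax(logits[:, -1, :], dim=-1)
    generated_ids = torch.cat([generated_ids, next_token_id.unsqueeze(-1)], dim=-1)

print(tokenizer.decode(generated_ids[0]))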

The videos from Niels are very informative but can be a bit technical for some community members. To make the concepts easier to explain, I spent a lot of time thinking about how to demonstrate them without resorting to the phrase "Some magic happens here…". Here is what I came up with, hoping it will help you understand and explain broader GenAI topics to your partners and customers.

Imagine a 3-dimensional space representing some concepts like Fruits, Programming languages, and Software companies.

(Image: a 3-dimensional embedding space with clusters for Fruits, Programming languages, and Software companies)

The following dictionary, named "embeddings", is the algebraic representation of such a universe. The set of labels is the vocabulary, and the X, Y, and Z coordinates are the corresponding embedding vectors. Please read through the Python code; it is purposefully well documented to be self-explanatory.

 

import numpy as np
import random

# Pre-defined (random) word embeddings
embeddings = {
    # Fruits I love
    "Apple":        np.array([0.85,          0.10,       0]),
    "Banana":       np.array([0.91,          0.05,       0]),
    "Orange":       np.array([0.90,          0.06,       0]),
    "Figs":         np.array([0.88,          0.10,       0]),
    # Programming Languages I know
    "Java":         np.array([0.10,          0,          0.85]),
    "JavaScript":   np.array([0.05,          0,          0.91]),
    "Python":       np.array([0.06,          0,          0.90]),
    "ABAP":         np.array([0.10,          0,          0.88]),
    # Enterprise software companies :)
    "SAP":          np.array([0,             0.85,       0.10]),
    "Oracle":       np.array([0,             0.91,       0.05]),
    "Microsoft":    np.array([0,             0.90,       0.06]),
    "Salesforce":   np.array([0,             0.88,       0.10]),
}

# Prompt 
prompt = "I love ABAP and"

# Split the prompt into individual words
prompt_words = prompt.split()

# Function to get the embedding for a word
# Return a vector of ones if the word is not in the vocabulary
# Since 'I', 'love', and 'and' are not in our vocabulary, each gets an embedding of [1, 1, 1],
# so they have little effect on which category wins. In real LLMs, unknown tokens are
# represented by a special token.
def get_embedding(word):
    return embeddings.get(word, np.ones(3))

# Calculate the average embedding of the prompt
prompt_embedding = np.mean([get_embedding(word) for word in prompt_words], axis=0)

# Get our vocabulary 
vocabulary = list(embeddings.keys())  
# Remove words in the prompt from the vocabulary so that they won't be used to complete the prompt
filtered_vocabulary = [word for word in vocabulary if word not in prompt_words]

# Calculate dot products between the prompt embedding and the remaining vocabulary embeddings
dot_products = {word: np.dot(prompt_embedding, get_embedding(word)) for word in filtered_vocabulary}

# Add some randomness in the chosen next word. 
# First sort the words by their dot products and select the top two
top_two_words = sorted(dot_products, key=dot_products.get, reverse=True)[:2]
# Randomly select one of the two highest words. 
predicted_word = random.choice(top_two_words)

print(f"Predicted next word: {predicted_word}")

 

When running this code in a Jupyter Notebook or Visual Studio Code, it will complete the prompt with the next programming language that I love 🙂 If you type "I like Apple and", it will complete with another fruit. Go ahead and experiment with your own categories, and I hope this simulation helps you appreciate the scale of a real LLM like GPT-4 Turbo, which acts on a vocabulary of 100,256 tokens and embedding vectors of up to 3,072 dimensions!

This simple code can predict the next word, classify words, and so on, from the given vocabulary.
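
As a small illustration of the classification idea, here is a sketch that reuses the embeddings dictionary above and finds the nearest neighbours of a word via cosine similarity; the top matches all come from the same cluster, which is the essence of embedding-based classification:

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

word = "Python"

# Compare the chosen word with every other word in the toy vocabulary
similarities = {
    other: cosine_similarity(embeddings[word], vector)
    for other, vector in embeddings.items()
    if other != word
}

# The three closest words are all programming languages
top_matches = sorted(similarities, key=similarities.get, reverse=True)[:3]
print(f"'{word}' is most similar to: {top_matches}")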

Conclusion:

To be clear, this is a very simplistic illustration of how high-school algebra, i.e., matrix multiplication, sits at the centre of today's technology revolution. Among other things, Transformer neural networks favour GPUs/TPUs/LPUs for their strength in matrix operations and the parallelization of such workloads. It's important not to underestimate the other concepts in the Transformer architecture, such as tokenization, encoding, embeddings, and attention, as well as Reinforcement Learning, alignment, etc.; for those, try the presentations from Niels or Andrej.

Understanding the technical concepts behind GenAI, such as Transformers and Embeddings, is crucial for productive brainstorming and discussions on GenAI applications in various domains and industries. This blog post aimed to provide a primer on these concepts, encouraging readers to explore the provided resources and code examples to deepen their understanding.

References:

SAP's CEO Christian K. and Walter S., Head of AI at SAP on Business AI

Niels Rogge's videos:

Andrej Karpathy's videos: