Gunter
Product and Topic Expert


 A practitioner's view on embeddings... with a touch of SAP BTP!

I'm writing these lines to save you time in understanding what embeddings are, what they are used for, and how to pick 'the right' embedding for your application with performance, data security, and latency in mind. Over the past year I came across this topic so many times that I want to share a braindump of practical findings. Let me know your thoughts and suggestions below.

Introduction to Generative AI and Embeddings

Generative AI refers to a class of artificial intelligence that specializes in creating new content, whether that be text, images, sounds, or even video. Unlike discriminative models that classify or predict based on input data, generative models can generate novel data samples. Applications of generative AI are vast and include tasks such as synthesizing realistic human speech, generating art or music, designing new drugs, and creating virtual environments for gaming and simulations.

At the heart of generative AI's ability to produce new content are embeddings. Embeddings are dense, low-dimensional, and continuous vector representations of high-dimensional data. They are the foundation upon which generative models understand and manipulate data. For example, in natural language processing (NLP), words, sentences, or entire documents are converted into vectors that capture semantic meaning and context. In image processing, embeddings might represent key features of an image that allow a model to generate similar but unique images.

The role of embeddings in generative AI cannot be overstated. They serve as a bridge between the raw, often unstructured data and the sophisticated neural networks that process them. By translating data into a format that AI models can efficiently work with, embeddings enable models to discern patterns, make associations, and ultimately generate new content that is coherent and contextually relevant.

The importance of embeddings lies in their ability to capture the essence of the data. For text, this means understanding synonyms, analogies, and the subtleties of language. For images, it involves recognizing shapes, textures, and colors. This transformation of raw data into a meaningful vector space is what allows generative AI to be creative and insightful, pushing the boundaries of what machines can produce.

In the following sections, we will delve deeper into the types of embeddings, their applications in various domains of generative AI, and the critical considerations one must make when choosing the right embedding model for a given task.

Understanding Different Types of Embeddings

There are quite a few types of embeddings to distinguish: word embeddings, sentence embeddings, image embeddings, audio embeddings, and recently multimodal embeddings (e.g. combining image and text). For this blog, I'd like to focus on word and sentence embeddings.

Word Embeddings

Word embeddings are vector representations of individual words. They capture the semantic meaning of words by placing semantically similar words close to each other in the embedding space (see the sketch after this list). Word embeddings are typically used when:

  1. Word-Level Tasks: You are working on tasks that require understanding or processing at the word level, such as part-of-speech tagging, named entity recognition, or word sense disambiguation.
  2. Fine-Grained Analysis: You need to perform a fine-grained analysis where the meaning of individual words is crucial, such as in lexical semantics studies.
  3. Input for Other Models: You are building models that take word embeddings as input and aggregate them in some way to understand larger units of text, such as sentences or documents.
  4. Limited Context: The task at hand does not require a lot of contextual information, and the meaning of words can be understood in isolation or with minimal context.
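
To make this concrete, here's a minimal sketch using gensim's pre-trained GloVe vectors; the model choice is purely illustrative, and the snippet downloads the vectors on first use:

```python
# A minimal word-embedding sketch with gensim's pre-trained GloVe vectors
# (illustrative model choice; downloads ~66 MB on first use).
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-50")  # each word -> 50-dimensional vector

# Semantically similar words sit close together in the vector space.
print(glove.most_similar("apple", topn=3))

# Word vectors even support analogies: king - man + woman ~ queen.
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```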

Sentence Embeddings

Sentence embeddings, on the other hand, are vector representations of entire sentences or phrases. They are designed to capture the meaning of the sentence as a whole, taking word order and the interactions between words into account (see the sketch after this list). Sentence embeddings are typically used when:

  1. Sentence-Level Tasks: You are working on tasks that require understanding the meaning of entire sentences or phrases, such as sentiment analysis, natural language inference, or text classification.
  2. Semantic Similarity: You need to compare sentences or paragraphs for semantic similarity, such as in information retrieval, document clustering, or duplicate detection.
  3. Contextual Meaning: The task requires capturing the context in which words are used, as the meaning of a sentence often depends on more than just the individual words it contains.
  4. Pre-Trained Models: You are using pre-trained models like BERT, GPT, or Sentence-BERT, which can generate sentence embeddings that capture deep contextualized meanings.
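
Here is a minimal sketch with the sentence-transformers library; the model name is an example that also appears in my benchmark table further below:

```python
# A minimal sentence-embedding sketch with sentence-transformers
# (model name is an example from the comparison table below).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The invoice was paid last week.",
    "We settled the bill seven days ago.",
    "My cat enjoys sleeping on the sofa.",
]
embeddings = model.encode(sentences)

# The two paraphrases should score much higher against each other
# than against the unrelated third sentence.
print(util.cos_sim(embeddings[0], embeddings[1]))  # high
print(util.cos_sim(embeddings[0], embeddings[2]))  # low
```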

When to Choose One Over the Other

Well, you can read about that in the literature, but what I found is that modern embedding models derived from GPT-style models often combine both: feed in single words and you get word embeddings; feed in sentences and you get the advantages of sentence embeddings. That's nice!

RAG, or: Why one should care about embeddings


Retrieval-Augmented Generation: that's likely one of the big use cases for embeddings. In a nutshell: large language models still have only a limited context window. GPT-4 Turbo is said to offer a 128k-token context window - roughly 300-400 pages of text. Impressive! Still, we want more: for one, it's not much compared to the information out there; and besides, commercial models charge per token, so why search for the needle in the haystack if we have to pay for the haystack and the needle? It would be better to say 'find the needle' across terabytes of data without the data even going to the cloud. Cheaper, faster, less energy-consuming.

That's where vector stores enter the stage (e.g. SAP HANA Vector Engine) to hold a complex vector representation of what was once your text.
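
To make the retrieval step tangible, here is an illustrative in-memory sketch; in a real application the chunk vectors would of course live in the vector store rather than in a Python list, and the model choice is just an example (the e5 family expects "query:"/"passage:" input prefixes):

```python
# Illustrative, in-memory version of the retrieval step in RAG.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-small")

# 1. Embed your document chunks once and keep them next to the raw text.
chunks = ["The customer pays within 30 days net.",
          "Deliveries are scheduled every Monday.",
          "Our office cat is named Felix."]
chunk_vecs = model.encode([f"passage: {c}" for c in chunks], normalize_embeddings=True)

# 2. At query time, only the question gets embedded.
query_vec = model.encode("query: What are the payment terms?", normalize_embeddings=True)

# 3. With normalized vectors the dot product equals cosine similarity;
#    only the top-scoring chunks are sent to the LLM as context.
scores = chunk_vecs @ query_vec
for i in np.argsort(scores)[::-1][:2]:
    print(f"{scores[i]:.3f}  {chunks[i]}")
```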

What embedding models do for your text

The vector representation of your former text is generated by the embedding model you pick for that task. For example, a model might place an apple close to a pear, while a tomato would be further away. Now imagine that with other dimensions that establish context or represent language: an apple in Japanese would likely not be far away from an 'English' apple - something that may be very much wanted to allow multilingual searches for meaning across international texts. Here's a nice playground to understand that idea better.
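
Here's a tiny sketch of that apple/pear/tomato idea with a multilingual model (the one I ended up picking later in this blog):

```python
# Cross-lingual proximity sketch; "query:" prefixes are what the
# e5 model family expects for its inputs.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/multilingual-e5-small")

words = ["query: apple", "query: pear", "query: tomato", "query: りんご"]  # りんご = apple
vecs = model.encode(words, normalize_embeddings=True)

print(util.cos_sim(vecs[0], vecs[1]))  # apple vs. pear: close
print(util.cos_sim(vecs[0], vecs[2]))  # apple vs. tomato: further away
print(util.cos_sim(vecs[0], vecs[3]))  # English vs. Japanese apple: very close
```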

The universal solution to embedding

'My company uses GPT models from (Azure) OpenAI, so I'll just use their embedding model - done!' Ah yes, you can do that, but one reason I wrote this blog is that I think it's not that simple. Consider these aspects:

1. Performance of the embedding model

Here the embeddings from OpenAI certainly have a point: they perform quite well and deliver acceptable results on average. But they are not the best. Say you need great performance for clustering, pairing, or classification of text - there is no 'one for all' model (the MTEB leaderboard in the references ranks models per task). Or you want great performance with Chinese text only, and so on. So the key is not just that a model delivers an embedding, but that the embedding lets you find what you search for later.

2. Data privacy in the cloud

We all work with cloud solutions - and we all do it by weighing the benefits and risks of processing. Sending your business text data to an entity outside of your defined cloud space just to retrieve embeddings might be an unnecessary risk.

3. Cost of embedding

Commercial models almost always charge for creating embeddings - not much, but if you embed gigabytes and terabytes of data, it adds up.
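
As a rough illustration (the price is an assumed example, not a quote): 1 TB of plain text corresponds to roughly 250 billion tokens at ~4 bytes per token; at $0.10 per million tokens, that's about $25,000 to embed the corpus just once - and re-embedding after a chunking change or a model upgrade multiplies that.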

4. Latency generating embeddings

This can be an important point to consider: you move large amounts of text data out to a cloud service, you move even larger embedding data back in (if your vector engine sits with your cloud or on-premise entity), and depending on the size of the chunks the service can digest, you sometimes experience massive network latency.

Practical considerations for embeddings

I suggest considering the above four points when starting a generative AI project that needs embeddings to represent your business data. If any of these four points is a concern, it's already worth looking into embedding models other than the cloud services'.

Option 1 - Creating embeddings inside your SAP BTP application

Say you want to convert your business text data to embeddings in an extension you build on one of SAP BTP's runtimes: Cloud Foundry or Kyma. Then - for the moment - the throughput from text to embeddings per unit of time on CPU is crucial. You also need to understand what you want to derive from the text later, as mentioned above (e.g. retrieval, clustering, or other tasks). Are your texts in one language only? Good! There are high-performing models that can process texts with great speed even on CPUs.
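
Here is the kind of quick throughput check I mean, as a sketch you can adapt; the model names are examples taken from the comparison table further below:

```python
# Quick-and-dirty check of text-to-embedding throughput on CPU.
import time
from sentence_transformers import SentenceTransformer

texts = ["A representative business sentence of typical length."] * 500

for name in ["all-MiniLM-L6-v2", "intfloat/multilingual-e5-small"]:
    model = SentenceTransformer(name, device="cpu")
    start = time.perf_counter()
    model.encode(texts, batch_size=32, show_progress_bar=False)
    elapsed = time.perf_counter() - start
    print(f"{name}: {len(texts) / elapsed:.0f} texts/second on CPU")
```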

Option 2 - Creating embeddings on SAP AI Core

Many well-performing models demand a tribute: computing power. One approach you can take is deploying the model of your choice on SAP AI Core to leverage GPU acceleration. For that, using SentenceTransformers is an approach I found very handy: wrap the model in a web service that you can then consume from your extension application.
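
As an illustration, such an embedding web service could look like the sketch below; the route and payload shape are my own assumptions, not an official SAP API:

```python
# Minimal embedding web service sketch (FastAPI); containerize it and
# deploy on SAP AI Core to get GPU acceleration.
from fastapi import FastAPI
from pydantic import BaseModel
from sentence_transformers import SentenceTransformer

app = FastAPI()
model = SentenceTransformer("intfloat/multilingual-e5-small")  # uses GPU if available

class EmbedRequest(BaseModel):
    texts: list[str]

@app.post("/v1/embeddings")
def embed(req: EmbedRequest):
    vectors = model.encode(req.texts, normalize_embeddings=True)
    return {"embeddings": vectors.tolist()}

# Run locally with: uvicorn service:app --port 8080
```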

Sharing my own findings with some embedding models

I was looking for a model with good overall performance and multilingual capability, while having only CPU available to compute the embeddings. The table below contains my findings.

| Model | Time to embed (seconds) | Japanese/English retrieval | Multilanguage practical test |
|---|---|---|---|
| all-MiniLM-L6-v2 | 1.473 | 0.154 | not good |
| all-MiniLM-L12-v2 | 1.5 | 0.217 | not good |
| intfloat/multilingual-e5-small | 2.724 | 0.931 | ok |
| avsolatorio/GIST-Embedding-v0 | 7.813 | 0.639 | not measured |
| intfloat/multilingual-e5-base | 16.037 | 0.929 | ok |
| thenlper/gte-large | 19.825 | not measured | not good |
| BAAI/bge-m3 | 21.767 | 0.879 | ok |
| llmrails/ember-v1 | 29.61 | not measured | not good |
| WhereIsAI/UAE-Large-V1 | 31.877 | not measured | not measured |
| intfloat/multilingual-e5-large | 48.056 | not measured | not measured |
| intfloat/e5-large | not measured | 0.739 | not measured |
| intfloat/e5-base | not measured | 0.724 | not measured |

Time to embed was measured on my own notebook (CPU only): how long it took to embed a Japanese document of 56 A4 pages.

Japanese/English retrieval measures how well the model pairs two sentences that have the same meaning in Japanese and English. That is an important characteristic if you want to run, say, an English search on a Japanese text or vice versa.

The last column, finally, is my very subjective judgement considering all three criteria for my use case. In the end I decided to use intfloat/multilingual-e5-small for my project.
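
For transparency, here is a minimal version of how such a cross-language retrieval check can be computed; the sentence pair below is illustrative, my actual test set is not reproduced here:

```python
# Cross-language retrieval check: embed a sentence pair with identical
# meaning in English and Japanese and compare the vectors.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/multilingual-e5-small")

en = model.encode("query: The delivery arrives on Monday.", normalize_embeddings=True)
ja = model.encode("query: 配達は月曜日に届きます。", normalize_embeddings=True)

# Values close to 1.0 mean both languages land in the same spot.
print(util.cos_sim(en, ja))
```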

Conclusion

As we wrap up this discussion on embeddings, it's clear that their role in Generative AI is both fundamental and multifaceted. This blog has aimed to demystify the concept of embeddings, providing a practical viewpoint on their selection and application.

By considering the key factors of performance, data privacy, cost, and latency, we've navigated the complexities that come with choosing the right embedding for your specific needs. The shared experiences and findings serve as a guide to help you make informed decisions, ensuring that your AI projects are not only effective but also aligned with your operational constraints and goals. Make the most out of embeddings to enhance your business data's representation!

References

MTEB Leaderboard @ Hugging Face - ranked embedding models
