Embedding Models: Powering Intelligent RAG Systems with Vector Generation
Overview
Embedding models are AI models that convert input text into arrays of numbers known as embeddings, or vectors. They offer a more convenient way to generate these vectors than traditional methods, making the RAG (Retrieval Augmented Generation) process simpler and more efficient.
What Are Embedding Models?
Embedding models transform text into numerical representations (vectors) that capture semantic meaning. Unlike traditional approaches such as TF-IDF, embedding models can do the following (illustrated by the short sketch after this list):
- Understand context and semantic relationships
- Generate dense vector representations
- Capture nuanced meanings in text
- Provide better similarity matching
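To make "semantic relationships" concrete, here is a minimal sketch (not from the original guide) that embeds two related sentences and one unrelated sentence with Ollama and compares them by cosine similarity. It assumes the mxbai-embed-large model used later in this guide is already pulled and that numpy is installed:

import ollama
import numpy as np

def cosine(a, b):
    # Cosine similarity: close to 1.0 for similar meanings, lower for unrelated text
    a, b = np.array(a), np.array(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sentences = [
    "Python is known for its simplicity and readability.",  # related pair
    "Python is an easy language for beginners to learn.",   # related pair
    "The 1995 World Cup final was decided on penalties.",   # unrelated
]

# One embedding (vector) per sentence
vectors = [ollama.embed(model="mxbai-embed-large", input=s)["embeddings"][0]
           for s in sentences]

print("related  :", cosine(vectors[0], vectors[1]))  # expected to be relatively high
print("unrelated:", cosine(vectors[0], vectors[2]))  # expected to be noticeably lower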
Role in RAG (Retrieval Augmented Generation)
Knowledge Base Preparation
As part of the RAG pipeline, embedding models play a crucial role in preparing the knowledge base:
- Data Collection: Gather information from various data sources, including:
  - Files (PDFs, documents, etc.)
  - Web content
  - Databases
  - Other structured/unstructured data
- Vector Database Creation: Use embedding models to convert text into vectors and store them in a vector database
- Semantic Search: Enable efficient retrieval of relevant information based on semantic similarity
RAG Process Enhancement
Once the vector database is established, it serves as the foundation for generating more accurate chat responses (a compressed code preview follows this list):
- Query Processing: Convert user queries into vectors using the same embedding model
- Similarity Search: Find the most relevant content from the knowledge base
- Context Enhancement: Provide relevant context to the language model for better response generation
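These three steps map onto only a handful of calls. The compressed sketch below previews the flow; every call in it is walked through in detail later in this guide, and it assumes a ChromaDB collection already populated with embeddings (the function name answer is just illustrative):

import ollama

def answer(prompt, collection):
    # 1. Query processing: embed the user prompt with the same embedding model
    query_vec = ollama.embed(model="mxbai-embed-large", input=prompt)["embeddings"][0]
    # 2. Similarity search: fetch the most relevant document from the vector database
    results = collection.query(query_embeddings=[query_vec], n_results=1)
    context = results["documents"][0][0]
    # 3. Context enhancement: stuff the retrieved context into the LLM prompt
    output = ollama.generate(
        model="llama3.2",
        prompt=f"Using this data: {context}. Respond to this prompt: {prompt}",
    )
    return output["response"]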
Traditional vs. Embedding Model Approaches
TF-IDF Approach
- Based on term frequency and inverse document frequency
- Limited to exact word matches
- Doesn't capture semantic meaning
- Suitable for simple keyword-based searches
Embedding Models
- Capture semantic relationships
- Handle synonyms and related concepts
- Provide dense vector representations
- Better for complex query understanding (see the comparison sketch below)
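The difference is easiest to see side by side. The sketch below (an illustration, not part of this guide's repository) scores the same query/document pair with TF-IDF and with an Ollama embedding model; it additionally assumes scikit-learn is installed:

import ollama
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

query = "Which language is easy for beginners?"
doc = "Python is known for its simplicity and readability."

# TF-IDF: similarity depends on shared terms, and these sentences share almost none
tfidf = TfidfVectorizer().fit_transform([query, doc])
print("TF-IDF similarity   :", cosine_similarity(tfidf[0], tfidf[1])[0][0])

# Embeddings: "easy for beginners" and "simplicity and readability" are semantically close
q_vec = ollama.embed(model="mxbai-embed-large", input=query)["embeddings"][0]
d_vec = ollama.embed(model="mxbai-embed-large", input=doc)["embeddings"][0]
print("Embedding similarity:", cosine_similarity([q_vec], [d_vec])[0][0])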
Demonstration Setup with Ollama
For practical implementation, this guide uses local embedding models running on Ollama to generate embeddings. This approach offers:
- Local Processing: No need for external API calls
- Privacy: Data stays on your local machine
- Cost Efficiency: No per-request charges
- Customization: Ability to fine-tune for specific use cases
Available Ollama Embedding Models
Browse the complete list of embedding models at: https://ollama.com/search?c=embedding
Popular embedding models include:
| Model | Parameter Size | Description |
|---|---|---|
| mxbai-embed-large | 334M | High-performance embedding model |
| nomic-embed-text | 137M | Balanced performance and efficiency |
| all-minilm | 23M | Lightweight model for basic tasks |
Installation and Setup
Prerequisites
- Install Ollama and start the local Ollama server
  - Follow the guide: Ollama Local AI Model Platform
- Pull an embedding model:
  ollama pull mxbai-embed-large
- Install the required Python packages:
  pip install ollama chromadb
Generating Embeddings
Using the REST API
curl --request POST \
--url http://localhost:11434/api/embed \
--header 'content-type: application/json' \
--data '{
"model": "mxbai-embed-large",
"input": "Python is a high-level, interpreted programming language known for its simplicity and readability."
}'
Using the Python Library
import ollama

response = ollama.embed(
    model='mxbai-embed-large',
    input='Python is a high-level, interpreted programming language known for its simplicity and readability.',
)
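The call returns a dictionary whose embeddings field contains one vector per input string. A quick way to inspect the result (the printed length is whatever dimensionality the chosen model produces):

vector = response["embeddings"][0]  # first (and only) embedding in the response
print(len(vector))                  # dimensionality of the embedding vector
print(vector[:5])                   # first few components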
How RAG Works with Embeddings to Generate More Accurate Chat Responses
Let's walk through how embeddings are created for a given text and how RAG uses them to generate more accurate chat responses from a knowledge base.
Example Implementation
Assume we have the following document. We'll use an embedding model to generate vectors from it and store them in a Chroma vector database. Chroma is an open-source database designed to store embeddings efficiently.
Sample Content
"Python is a high-level, interpreted programming language known for its simplicity and readability. Java is a widely used, object-oriented programming language and computing platform first released in 1995. Java and Python are both popular, high-level programming languages, but they have different strengths and weaknesses."
Document Chunking
First, split the document into chunks, producing a collection of documents called a corpus:
documents = [
"Python is a high-level, interpreted programming language known for its simplicity and readability.",
"Java is a widely used, object-oriented programming language and computing platform first released in 1995.",
"Java and Python are both popular, high-level programming languages, but they have different strengths and weaknesses."
]
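Here the chunks are written out by hand because the sample document is tiny. For longer documents you would normally split the text programmatically; the helper below is a hypothetical sketch (not part of this guide's repository) that packs whole sentences into size-limited chunks:

def chunk_by_sentence(text, max_chars=200):
    # Greedily pack whole sentences into chunks of at most max_chars characters
    sentences = [s.strip() for s in text.split(". ") if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence.endswith("."):
            sentence += "."
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

# Applied to the sample content above, this reproduces the three chunks listed manually
documents = chunk_by_sentence(
    "Python is a high-level, interpreted programming language known for its "
    "simplicity and readability. Java is a widely used, object-oriented "
    "programming language and computing platform first released in 1995. "
    "Java and Python are both popular, high-level programming languages, "
    "but they have different strengths and weaknesses."
)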
Generate Embeddings and Store in Vector Database
This process is called preparing the knowledge base. We use embedding models to generate the embeddings and store them in a vector database. ChromaDB is an open-source vector database.
import ollama
import chromadb

# Initialize ChromaDB client and create collection
client = chromadb.Client()
collection = client.create_collection(name="docs")

documents = [
    "Python is a high-level, interpreted programming language known for its simplicity and readability.",
    "Java is a widely used, object-oriented programming language and computing platform first released in 1995.",
    "Java and Python are both popular, high-level programming languages, but they have different strengths and weaknesses."
]

# Store each document in a vector embedding database
for i, d in enumerate(documents):
    print(f"  Processing document {i+1}: {d[:50]}...")
    response = ollama.embed(model="mxbai-embed-large", input=d)
    # Extract the first (and only) embedding from the response
    embeddings = response["embeddings"][0]
    collection.add(
        ids=[str(i)],
        embeddings=[embeddings],
        documents=[d]
    )
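A quick sanity check that all three documents were indexed; count() is part of ChromaDB's collection API:

print(collection.count())  # expected output: 3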
RAG Process
Retrieve
In the retrieval phase, we will use the user prompt to get the most relevant document:
prompt = "What is python?"
response = ollama.embed(
model="mxbai-embed-large",
input=prompt
)
# Extract the first embedding and pass it correctly
query_embedding = response["embeddings"][0]
results = collection.query(
query_embeddings=[query_embedding],
n_results=1
)
# Get the most relevant document
data = results['documents'][0][0]
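It can also be useful to ask for more than one candidate and look at the distance scores Chroma returns alongside the matched documents; the variable name top_matches below is just illustrative:

top_matches = collection.query(
    query_embeddings=[query_embedding],
    n_results=2,
    include=["documents", "distances"]  # return matched text plus distance scores
)
for doc, dist in zip(top_matches["documents"][0], top_matches["distances"][0]):
    print(f"{dist:.4f}  {doc[:60]}...")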
Augment and Generate
This phase combines the user prompt with the contextual data retrieved in the previous phase, a technique known as prompt stuffing. Finally, we use an AI model to generate an accurate response from our knowledge base. This way, the AI model uses our data to generate the most up-to-date and accurate chat response for the user prompt.
# Generate a response combining the prompt and data we retrieved
output = ollama.generate(
model="llama3.2",
prompt=f"Using this data: {data}. Respond to this prompt: {prompt}"
)
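The generated text is returned under the response key of the output dictionary, so printing the grounded answer is a one-liner:

print(output['response'])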
Complete Code Example
# embedding-model.py
import ollama
import chromadb

# ANSI color codes for terminal formatting
class Colors:
    HEADER = '\033[95m'    # Magenta
    BLUE = '\033[94m'      # Blue
    CYAN = '\033[96m'      # Cyan
    GREEN = '\033[92m'     # Green
    YELLOW = '\033[93m'    # Yellow
    RED = '\033[91m'       # Red
    BOLD = '\033[1m'       # Bold
    UNDERLINE = '\033[4m'  # Underline
    END = '\033[0m'        # Reset to default

# Sample documents for the knowledge base
documents = [
    "Python is a high-level, interpreted programming language known for its simplicity and readability.",
    "Java is a widely used, object-oriented programming language and computing platform first released in 1995.",
    "Java and Python are both popular, high-level programming languages, but they have different strengths and weaknesses."
]

print(Colors.BOLD + Colors.HEADER + "=" * 80)
print("AICodeGeek RAG SYSTEM - RETRIEVAL AUGMENTED GENERATION")
print("=" * 80 + Colors.END)

# Initialize ChromaDB client and create collection
client = chromadb.Client()
collection = client.create_collection(name="docs")

print(f"\n{Colors.BLUE}Building Knowledge Base...{Colors.END}")
print(Colors.BLUE + "-" * 40 + Colors.END)

# Store each document in a vector embedding database
for i, d in enumerate(documents):
    print(f"  Processing document {i+1}: {d[:50]}...")
    response = ollama.embed(model="mxbai-embed-large", input=d)
    # Extract the first (and only) embedding from the response
    embeddings = response["embeddings"][0]
    collection.add(
        ids=[str(i)],
        embeddings=[embeddings],
        documents=[d]
    )

print(f"Successfully indexed {len(documents)} documents!")

# User query
prompt = "What is python?"
print(f"\nUser Query: '{prompt}'")
print("-" * 40)

# Generate an embedding for the input and retrieve the most relevant doc
print("Searching knowledge base...")
response = ollama.embed(
    model="mxbai-embed-large",
    input=prompt
)

# Extract the first embedding and pass it correctly
query_embedding = response["embeddings"][0]
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=1
)

# Get the most relevant document
data = results['documents'][0][0]
print("Most relevant document found:")
print(f"  -> {data}")

# Generate a response combining the prompt and data we retrieved
print("\nGenerating AI response using llama3.2...")
output = ollama.generate(
    model="llama3.2",
    prompt=f"Using this data: {data}. Respond to this prompt: {prompt}"
)

# Display the final response in a nice format with colors
print("\n" + Colors.CYAN + "=" * 80)
print("AI RESPONSE")
print("=" * 80 + Colors.END)
print()

# Format the response with proper line breaks and indentation
response_text = output['response'].strip()

# Split into paragraphs for better readability
paragraphs = response_text.split('\n\n')
for paragraph in paragraphs:
    if paragraph.strip():
        # Wrap long lines for better readability
        words = paragraph.strip().split()
        line = ""
        for word in words:
            if len(line + word) > 75:
                print(f"{Colors.GREEN}  {line}{Colors.END}")
                line = word + " "
            else:
                line += word + " "
        if line.strip():
            print(f"{Colors.GREEN}  {line.strip()}{Colors.END}")
        print()

print(Colors.CYAN + "=" * 80)
print("RAG Process Complete!")
print("=" * 80 + Colors.END)
Running the Example
Source Code Repository
GitHub Repository - Ollama Embeddings Example
Git Repository Setup
- Clone the repository:
  git clone https://github.com/AI-Code-Geek/ollama-embeddings.git
  cd ollama-embeddings
- Create a Python virtual environment:
  python3 -m venv .venv
- Activate the virtual environment:
  # On Windows
  .venv\Scripts\activate
  # On macOS/Linux
  source .venv/bin/activate
- Install the required Python packages:
  pip install -r requirements.txt
- Run the example:
  python embedding-model.py
Conclusion
This guide demonstrates how embedding models power intelligent RAG systems by transforming text into meaningful vector representations. By leveraging local models through Ollama, you can build privacy-focused, cost-effective RAG systems that provide accurate responses based on your specific knowledge base.
The combination of embedding models and vector databases creates a powerful foundation for building intelligent applications that can understand and retrieve contextually relevant information, making AI responses more accurate and domain-specific.