RAG Deployment Part 1: Semantic Chunking, Agentic Rephrase and Reranking; Chroma Database

1.Setup Environment
2.Prepare Custom Data
- 2.1.Basic Imports
- 2.2.Constants
- 2.3.Modified print for Jupyter Notebook
- 2.4.Custom models before vectorization
- 2.5.Model for vectorization
- 2.6.Models for semantic chunking
- 2.7.Get title and tags from blog markdown
- 2.8.fetch_documents
3.Start of Semantic Chunking
- 3.1.make_user_prompt and make_user_messages
- 3.2.Example of the chunks
- 3.3.Start Chunking
  - 3.3.1.create_chunks
  - 3.3.2.Save the chunks locally
  - 3.3.3.Retrieve the chunks
4.Vector Embeddings and Chroma Database
5.Fetch and Rerank results from Vector DB According to the Question
- 5.1.fetch_context_unranked
- 5.2.rerank
- 5.3.fetch_reranked_context
6.Prompts to Answer Question with Agentic Reranking
- 6.1.rewrite_query
- 6.2.answer_question
7.Local Experiments
- 7.1.rewrite_query
- 7.2.answer_question
- 7.3.The Blog Explorer
8.Reference

December 16, 2025

Llm

Rag

Vectordb

1. Setup Environment

We run uv init and uv sync with the following pyproject.toml.

The LangChain packages are included, but we can remove them whenever we want because our implementation directly uses the clients imported from openai and litellm.

[project]
name = "app"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = "==3.12.8"
dependencies = [
    "anthropic>=0.69.0",
    "beautifulsoup4>=4.14.2",
    "chromadb>=1.1.0",
    "datasets==3.6.0",
    "feedparser>=6.0.12",
    "google-genai>=1.41.0",
    "google-generativeai>=0.8.5",
    "gradio>=5.47.2",
    "ipykernel>=6.30.1",
    "ipywidgets>=8.1.7",
    "jupyter-dash>=0.4.2",
    "langchain>=0.3.27",
    "langchain-chroma>=0.2.6",
    "langchain-community>=0.3.30",
    "langchain-core>=0.3.76",
    "langchain-openai>=0.3.33",
    "langchain-text-splitters>=0.3.11",
    "litellm>=1.77.5",
    "matplotlib>=3.10.6",
    "nbformat>=5.10.4",
    "modal>=1.1.4",
    "numpy>=2.3.3",
    "ollama>=0.6.0",
    "openai>=1.109.1",
    "pandas>=2.3.3",
    "plotly>=6.3.0",
    "protobuf==3.20.2",
    "psutil>=7.1.0",
    "pydub>=0.25.1",
    "python-dotenv>=1.1.1",
    "requests>=2.32.5",
    "scikit-learn>=1.7.2",
    "scipy>=1.16.2",
    "sentence-transformers>=5.1.1",
    "setuptools>=80.9.0",
    "speedtest-cli>=2.1.3",
    "tiktoken>=0.11.0",
    "torch>=2.8.0",
    "tqdm>=4.67.1",
    "transformers>=4.56.2",
    "wandb>=0.22.1",
    "langchain-huggingface>=1.0.0",
    "langchain-ollama>=1.0.0",
    "langchain-anthropic>=1.0.1",
    "langchain-experimental>=0.0.42",
    "groq>=0.33.0",
    "xgboost>=3.1.1",
    "python-frontmatter>=1.1.0",
    "pgvector>=0.4.2",
    "psycopg2-binary>=2.9.11",
]

2. Prepare Custom Data

2.1. Basic Imports

from pathlib import Path
from openai import OpenAI
from dotenv import load_dotenv
from pydantic import BaseModel, Field
from chromadb import PersistentClient
from tqdm import tqdm
from litellm import completion
import numpy as np
from sklearn.manifold import TSNE
import plotly.graph_objects as go
import os
from typing import TypedDict

2.2. Constants

load_dotenv(override=True)

DB_NAME = "preprocessed_db"
collection_name = "docs"
embedding_model = "text-embedding-3-large"
KNOWLEDGE_BASE_PATH = Path("knowledge-base")
AVERAGE_CHUNK_SIZE = 2500
KNOWLEDGE_GLOB_EXPRESSION = "../src/mds/articles/**/*.md"
RETRIEVAL_K = 10

# litellm reads the Azure credentials from these environment variables
os.environ["AZURE_API_KEY"] = os.getenv("AZURE_OPENAI_API_KEY")
os.environ["AZURE_API_BASE"] = os.getenv("AZURE_OPENAI_ENDPOINT")

MODEL = f"azure/{os.getenv('AZURE_OPENAI_MODEL')}"

2.3. Modified print for Jupyter Notebook

This wraps long text to a fixed width when printing in a Jupyter notebook:

import textwrap

def printw(text: str):
    wrapped = textwrap.fill(text, width=100)
    print(wrapped)

2.4. Custom models before vectorization

2.5. Model for vectorization

class CustomDocument(TypedDict):
    tags: str
    title: str
    text: str

2.6. Models for semantic chunking

class Result(BaseModel):
    page_content: str
    metadata: dict

class Chunk(BaseModel):
    headline: str = Field(
        description="A brief heading for this chunk, typically a few words, that is most likely to be surfaced in a query. This headline must be in English")
    summary: str = Field(
        description="A few sentences summarizing the content of this chunk to answer common questions, this summary must be in English")
    original_text: str = Field(
        description="The original text of this chunk from the provided document, exactly as is, not changed in any way")

    def as_result(self, document):
        # Carry the article's title and tags along with every chunk
        metadata = {"title": document["title"], "tags": document["tags"]}
        return Result(page_content=self.headline + "\n\n" + self.summary + "\n\n" + self.original_text, metadata=metadata)

class Chunks(BaseModel):
    chunks: list[Chunk]

2.7. Get title and tags from blog markdown

Our markdown files have the following format:

---
title: some title
tags: a, b, c
---

## Title

Contents ...

We get the title and tags from markdowns as follows:

import frontmatter

def get_tags_and_title_from_blogpost(filepath: str) -> tuple[str, str]:
    try:
        blog_post = frontmatter.load(filepath)
    except Exception as e:
        print(f"Error loading {filepath}: {e}")
        raise

    # The front matter key is "tags", matching the format shown above
    tags = blog_post.get("tags", "")
    title = blog_post.get("title", "")

    # Normalize tags to a sorted, comma-separated string
    if isinstance(tags, list):
        tags = ",".join(sorted(tags))
    elif isinstance(tags, str) and "," in tags:
        tags = ",".join(sorted([t.strip() for t in tags.split(",")]))
    return tags, title
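To make the tag handling concrete, here is a minimal standalone sketch of just the normalization step. The helper name normalize_tags is hypothetical (it does not appear in the original code); it only mirrors the list branch and the comma-separated-string branch above.

```python
def normalize_tags(tags) -> str:
    """Hypothetical helper mirroring the tag normalization above."""
    if isinstance(tags, list):
        return ",".join(sorted(tags))
    if isinstance(tags, str) and "," in tags:
        return ",".join(sorted(t.strip() for t in tags.split(",")))
    return tags  # a single tag (or empty string) is returned unchanged


print(normalize_tags(["rag", "llm"]))  # list form from YAML
print(normalize_tags("c, a, b"))       # comma-separated string form
```

Sorting the tags gives the same string for the same set of tags regardless of how they were ordered in the front matter.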

2.8. fetch_documents

Finally we prepare all documents:

import frontmatter
import glob
import re

def fetch_documents(knowledge_glob_path: str) -> list[CustomDocument]:
    """A homemade version of the LangChain DirectoryLoader"""

    documents: list[CustomDocument] = []

    for file in glob.glob(knowledge_glob_path, recursive=True):
        # Load with frontmatter to automatically strip the --- section
        blog_post = frontmatter.load(file)
        tags, title = get_tags_and_title_from_blogpost(file)

        # Get content without frontmatter
        text = blog_post.content

        # Remove <style>...</style> blocks (including multiline)
        text = re.sub(r'<style[^>]*>.*?</style>', '', text, flags=re.DOTALL | re.IGNORECASE)

        # Clean up extra whitespace
        text = text.strip()

        documents.append(CustomDocument(tags=tags, title=title, text=text))

    print(f"Loaded {len(documents)} documents")
    return documents

documents = fetch_documents(KNOWLEDGE_GLOB_EXPRESSION)
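The style-stripping step can be checked in isolation. The sketch below applies the same regular expression to a made-up sample; the function name strip_style_blocks and the sample text are assumptions for illustration only.

```python
import re

# Same pattern as in fetch_documents: DOTALL lets ".*?" span line breaks
STYLE_BLOCK = re.compile(r"<style[^>]*>.*?</style>", flags=re.DOTALL | re.IGNORECASE)


def strip_style_blocks(text: str) -> str:
    """Remove <style>...</style> blocks, including multi-line ones."""
    return STYLE_BLOCK.sub("", text).strip()


sample = """<style>
  img { max-width: 660px; }
</style>

## Title

Contents ..."""

cleaned = strip_style_blocks(sample)
print(cleaned)
```

The non-greedy `.*?` matters when a post contains more than one style block: a greedy match would also delete the article text between them.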

3. Start of Semantic Chunking

3.1. make_user_prompt and make_user_messages

def make_user_prompt(document: CustomDocument):
    how_many = (len(document["text"]) // AVERAGE_CHUNK_SIZE) + 1
    return f"""
        You take a document and you split the document into overlapping chunks for a KnowledgeBase.

        The document is from the articles from Blog of James Lee.
        The document is of tags: {document["tags"]}
        The document has title: {document["title"]}

        A chatbot will use these chunks to answer questions about the articles and retrieve a related list of articles for the reader.
        You should divide up the document as you see fit, being sure that the entire document is returned in the chunks - don't leave anything out.
        This document should probably be split into {how_many} chunks, but you can have more or less as appropriate.
        There should be overlap between the chunks as appropriate; typically about 25% overlap or about 50 words, so you have the same text in multiple chunks for best retrieval results.

        For each chunk, you should provide a headline, a summary, and the original text of the chunk.
        Together your chunks should represent the entire document with overlap.

        Here is the document:

        {document["text"]}

        Respond with the chunks.
    """

def make_user_messages(document: CustomDocument):
    return [
        {"role": "user", "content": make_user_prompt(document)},
    ]
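The suggested chunk count in the prompt is plain integer arithmetic on the character length of the document. A small sketch, using the same AVERAGE_CHUNK_SIZE of 2500 and a hypothetical helper name suggested_chunk_count:

```python
AVERAGE_CHUNK_SIZE = 2500  # same value as in the constants section


def suggested_chunk_count(text: str) -> int:
    """Mirror of the how_many computation in make_user_prompt."""
    return (len(text) // AVERAGE_CHUNK_SIZE) + 1


# An 800-character post suggests 1 chunk, 2,500 characters suggests 2,
# and 6,000 characters suggests 3.
for length in (800, 2500, 6000):
    print(length, suggested_chunk_count("x" * length))
```

The number is only a hint: the prompt explicitly lets the model produce more or fewer chunks.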

3.2. Example of the chunks

messages = make_user_messages(documents[1])
response = completion(model=MODEL, messages=messages, response_format=Chunks)
reply = response.choices[0].message.content
doc_as_chunks = Chunks.model_validate_json(reply).chunks

Here the model_validate_json method of pydantic.BaseModel parses a JSON string and returns a validated instance of the corresponding Python class.

Let's print a summary from Chunk as an example (recall the definition of Chunk from 〈2.6. Models for semantic chunking〉):

summary = doc_as_chunks[0].summary
printw(summary)

which results in:

This article from James Lee's blog reflects on his illustrations drawn during middle school. Despite
some drawings being embarrassing, he shares a series of images that capture his early artistic
efforts, documenting his personal history and growth in art.

3.3. Start Chunking

3.3.1. create_chunks

import pickle
import os
from tenacity import retry, wait_exponential

# Assumption: the original snippet does not show how `wait` is defined;
# exponential backoff between attempts is one reasonable choice.
wait = wait_exponential(multiplier=1, min=10, max=240)

@retry(wait=wait)
def process_document(document):
    messages = make_user_messages(document)
    response = completion(model=MODEL, messages=messages, response_format=Chunks)
    reply = response.choices[0].message.content
    doc_as_chunks = Chunks.model_validate_json(reply).chunks
    return [chunk.as_result(document) for chunk in doc_as_chunks]

def create_chunks(documents):
    chunks = []
    for doc in tqdm(documents):
        chunks.extend(process_document(doc))
    return chunks
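To show what retrying with a growing wait does, here is a dependency-free sketch of the idea. It is not how the retry decorator is implemented; retry_with_backoff, flaky_call and all parameter names are hypothetical, and the sleep function is injectable so the example runs instantly.

```python
import time


def retry_with_backoff(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**attempt, then try again."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))


# Stand-in for an LLM call that fails twice (e.g. rate limited), then succeeds
calls = {"count": 0}


def flaky_call():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"


delays = []
result = retry_with_backoff(flaky_call, sleep=delays.append)  # record delays instead of sleeping
print(result, delays)
```

Chunking one document per request means a single transient failure only costs one retry of that document, not a restart of the whole loop.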

3.3.2. Save the chunks locally

import pickle

chunks = create_chunks(documents)
with open("chunks.pkl", "wb") as f:
    pickle.dump(chunks, f)

3.3.3. Retrieve the chunks

def get_chunks(documents, cache_file="chunks.pkl", force_refresh=False):
    """Get chunks from cache or create new ones."""

    if os.path.exists(cache_file) and not force_refresh:
        print(f"Loading chunks from {cache_file}...")
        with open(cache_file, "rb") as f:
            return pickle.load(f)

    print("Creating chunks...")
    chunks = create_chunks(documents)

    with open(cache_file, "wb") as f:
        pickle.dump(chunks, f)
    print(f"Saved {len(chunks)} chunks to {cache_file}")

    return chunks

Now we start with:

chunks = get_chunks(documents)

4. Vector Embeddings and Chroma Database

4.1. create_embeddings

from openai import AzureOpenAI
import os

client = AzureOpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version=os.getenv("AZURE_API_VERSION"),
    # https://shellscriptmanager.openai.azure.com
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT")
)

EMBEDDING_MODEL = "text-embedding-ada-002"


def create_embeddings(batch_of_texts: list[str]) -> list[list[float]]:
    response = client.embeddings.create(
        model=EMBEDDING_MODEL,
        input=batch_of_texts
    )
    # Debug output: inspect the raw response and the first embedding
    print(response)
    print(response.data[0])
    return [e.embedding for e in response.data]

Although the API accepts a whole list of texts in one call, sending every chunk in a single request is not practical: we soon get a rate-limit error.

We therefore embed the texts in smaller batches, with a pause between requests:

4.2. create_embeddings_batched

import time
from typing import List

def create_embeddings_batched(texts: List[str], batch_size: int = 100) -> List[List[float]]:
    """Create embeddings in batches to avoid rate limits"""
    all_embeddings = []

    for i in tqdm(range(0, len(texts), batch_size), desc="Creating embeddings"):
        batch = texts[i:i + batch_size]

        try:
            response = client.embeddings.create(
                model=EMBEDDING_MODEL,
                input=batch
            )
            all_embeddings.extend([e.embedding for e in response.data])
        except Exception as e:
            if "rate limit" in str(e).lower():
                print("Rate limit hit, waiting 60 seconds...")
                time.sleep(60)
                # Retry the same batch once
                response = client.embeddings.create(
                    model=EMBEDDING_MODEL,
                    input=batch
                )
                all_embeddings.extend([e.embedding for e in response.data])
            else:
                raise

        # Add delay between batches to avoid rate limits
        if i + batch_size < len(texts):
            time.sleep(2)  # 2 second delay between batches

    return all_embeddings

4.3. Save vectors into local chroma db

For quick experiments we first save everything into a local Chroma database, which persists to SQLite on disk. Once all the experiments are done, we will upload these vectors to PostgreSQL.

def save_vector_embeddings(chunks):
    chroma = PersistentClient(path=DB_NAME)
    # Start from a clean collection on every run
    if collection_name in [c.name for c in chroma.list_collections()]:
        chroma.delete_collection(collection_name)

    texts = [chunk.page_content for chunk in chunks]

    # Use batched version instead
    vectors = create_embeddings_batched(texts, batch_size=50)

    collection = chroma.get_or_create_collection(collection_name)

    ids = [str(i) for i in range(len(chunks))]
    metas = [chunk.metadata for chunk in chunks]

    collection.add(ids=ids, embeddings=vectors,
                   documents=texts, metadatas=metas)
    print(f"Vectorstore created with {collection.count()} documents")

Creating the embeddings takes about five minutes for 1,411 chunks:

Creating embeddings: 100%|██████████| 29/29 [05:15<00:00, 10.86s/it]

Vectorstore created with 1411 documents
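The 29 steps in the progress bar follow from the batch size: 1,411 texts in batches of 50 give 28 full batches plus one final batch of 11. A minimal sketch of the slicing, with a hypothetical helper iter_batches and placeholder texts:

```python
from typing import Iterator


def iter_batches(items: list, batch_size: int) -> Iterator[list]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


texts = [f"chunk {i}" for i in range(1411)]  # placeholder texts
batches = list(iter_batches(texts, batch_size=50))
print(len(batches), len(batches[-1]))  # 29 batches, the last one holds 11 texts
```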

4.4. Graph plotting for the embeddings in 3D

chroma = PersistentClient(path=DB_NAME)
collection = chroma.get_or_create_collection(collection_name)
result = collection.get(include=['embeddings', 'documents', 'metadatas'])
vectors = np.array(result['embeddings'])
documents = result['documents']
metadatas = result['metadatas']
doc_tags = [metadata['tags'] for metadata in metadatas]

We then reduce the vectors to three dimensions with t-SNE and plot them:

tsne = TSNE(n_components=3, random_state=42)
reduced_vectors = tsne.fit_transform(vectors)

# Create the 3D scatter plot
fig = go.Figure(data=[go.Scatter3d(
    x=reduced_vectors[:, 0],
    y=reduced_vectors[:, 1],
    z=reduced_vectors[:, 2],
    mode='markers',
    marker=dict(size=5, opacity=0.8),
    text=[f"Tags: {m['tags']}<br>Title: {m['title']}<br>Text: {d[:100]}..."
          for m, d in zip(metadatas, documents)],
    hoverinfo='text'
)])

# In a 3D figure the axis titles live under `scene`
fig.update_layout(title='3D Chroma Vector Store Visualization',
                  scene=dict(xaxis_title='x', yaxis_title='y', zaxis_title='z'),
                  width=800,
                  height=600,
                  margin=dict(r=20, b=10, l=10, t=40)
                  )

fig.show()

5. Fetch and Rerank results from Vector DB According to the Question

5.1. fetch_context_unranked

def fetch_context_unranked(question):
    # query = openai.embeddings.create(model=embedding_model, input=[question]).data[0].embedding
    query = create_embeddings([question])
    # `collection` is the Chroma collection opened in section 4.4
    results = collection.query(query_embeddings=query, n_results=RETRIEVAL_K)
    chunks = []
    for result in zip(results["documents"][0], results["metadatas"][0]):
        chunks.append(Result(page_content=result[0], metadata=result[1]))
    return chunks
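Conceptually, the query returns the stored vectors closest to the question's embedding. The toy sketch below does this by brute force on two-dimensional vectors. It is an illustration only: the vector store uses an index rather than a full scan, and the metric a collection uses depends on its configuration; squared Euclidean distance is assumed here. The names squared_l2 and top_k are hypothetical.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query, vectors, k):
    """Return indices of the k vectors closest to the query, nearest first."""
    return heapq.nsmallest(k, range(len(vectors)),
                           key=lambda i: squared_l2(query, vectors[i]))


toy_vectors = [[0.0, 0.0], [1.0, 1.0], [0.9, 1.1], [5.0, 5.0]]
nearest = top_k([1.0, 1.0], toy_vectors, k=2)
print(nearest)  # indices of the two nearest toy vectors
```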

5.2. rerank

class RankOrder(BaseModel):
    order: list[int] = Field(
        description="The order of relevance of chunks, from most relevant to least relevant, by chunk id number"
    )

def rerank(question, chunks):
    system_prompt = """
You are a document re-ranker.
You are provided with a question and a list of relevant chunks of text from a query of a knowledge base.
The chunks are provided in the order they were retrieved; this should be approximately ordered by relevance, but you may be able to improve on that.
You must rank order the provided chunks by relevance to the question, with the most relevant chunk first.
Reply only with the list of ranked chunk ids, nothing else. Include all the chunk ids you are provided with, reranked.
"""
    user_prompt = f"The user has asked the following question:\n\n{question}\n\nOrder all the chunks of text by relevance to the question, from most relevant to least relevant. Include all the chunk ids you are provided with, reranked.\n\n"
    user_prompt += "Here are the chunks:\n\n"
    # Chunk ids are 1-based in the prompt
    for index, chunk in enumerate(chunks):
        user_prompt += f"# CHUNK ID: {index + 1}:\n\n{chunk.page_content}\n\n"
    user_prompt += "Reply only with the list of ranked chunk ids, nothing else."
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]
    response = completion(model=MODEL, messages=messages,
                          response_format=RankOrder)
    reply = response.choices[0].message.content
    order = RankOrder.model_validate_json(reply).order
    print(order)
    # Convert the 1-based ids back to list indices
    return [chunks[i - 1] for i in order]
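The final list comprehension trusts the model to return every id exactly once. If it returns a duplicate, an out-of-range id, or drops one, rerank either raises an IndexError or silently loses a chunk. One possible hardening, sketched with a hypothetical helper apply_order that is not part of the original code:

```python
def apply_order(chunks: list, order: list[int]) -> list:
    """Reorder chunks by 1-based ids, tolerating bad or missing ids."""
    seen = set()
    ranked = []
    for chunk_id in order:
        index = chunk_id - 1
        # Skip ids that are out of range or already used
        if 0 <= index < len(chunks) and index not in seen:
            seen.add(index)
            ranked.append(chunks[index])
    # Append anything the model left out, in original retrieval order
    ranked.extend(chunk for i, chunk in enumerate(chunks) if i not in seen)
    return ranked


# The duplicate 3 and the invalid 9 are ignored; "b" and "d" are appended
print(apply_order(["a", "b", "c", "d"], [3, 1, 3, 9]))
```

Under this assumption, the last line of rerank could become `return apply_order(chunks, order)`.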

We combine to get:

5.3. fetch_reranked_context

def fetch_reranked_context(question):
    chunks = fetch_context_unranked(question)
    return rerank(question, chunks)

6. Prompts to Answer Question with Agentic Reranking

6.1. rewrite_query

SYSTEM_PROMPT = """
You are a knowledgeable, friendly assistant to search for articles in the blog of James Lee.
You are chatting with a user about finding related articles.
Your answer will be evaluated for accuracy, relevance and completeness, so make sure it only answers the question and fully answers it.
If you don't know the answer, say so.
For context, here are specific extracts from the Knowledge Base that might be directly relevant to the user's question:
{context}

With this context, please answer the user's question. Be accurate, relevant and complete.
"""

def make_rag_messages(question, history, chunks):
    context = "\n\n".join(
        f"Extract from article titled '{chunk.metadata['title']}':\n{chunk.page_content}" for chunk in chunks)
    system_prompt = SYSTEM_PROMPT.format(context=context)
    return [{"role": "system", "content": system_prompt}] + history + [{"role": "user", "content": question}]

def rewrite_query(question, history=None):
    """Rewrite the user's question to be a more specific question that is more likely to surface relevant content in the Knowledge Base."""

    # Avoid a mutable default argument
    history = history or []

    sys_message = f"""
        You are in a conversation with a user, answering questions about the articles from the blog of James Lee.
        You are about to look up information in a Knowledge Base to answer the user's question.

        This is the history of your conversation so far with the user:
        {history}

        And this is the user's current question:
        {question}

        Respond only with a single, refined question that you will use to search the Knowledge Base.
        It should be a VERY short specific question most likely to surface content. Focus on the question details.
        IMPORTANT: Respond ONLY with the knowledgebase query, nothing else.
    """
    response = completion(model=MODEL, messages=[
                          {"role": "system", "content": sys_message}])
    return response.choices[0].message.content
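Note that rewrite_query interpolates history straight into the prompt, so the model sees the Python repr of a list of dicts. That works, but a plain transcript may be easier for the model to read. A possible variant, with a hypothetical helper format_history and invented sample messages:

```python
def format_history(history: list[dict]) -> str:
    """Render chat history as one 'role: content' line per message."""
    if not history:
        return "(no previous messages)"
    return "\n".join(f"{m['role']}: {m['content']}" for m in history)


history = [
    {"role": "user", "content": "Any posts about RAG?"},
    {"role": "assistant", "content": "Yes, there is a RAG deployment series."},
]
print(format_history(history))
```

Under this assumption, the prompt would use `{format_history(history)}` in place of `{history}`.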

6.2. answer_question

def answer_question(question: str, history: list[dict] | None = None) -> tuple[str, list]:
    """
    Answer a question using RAG and return the answer and the retrieved context
    """
    history = history or []
    # Search with the rewritten query, but answer the user's original question
    query = rewrite_query(question, history)
    print(query)
    chunks = fetch_reranked_context(query)
    messages = make_rag_messages(question, history, chunks)
    response = completion(model=MODEL, messages=messages)
    return response.choices[0].message.content, chunks

7. Local Experiments

7.1. rewrite_query

rewrite_query("find me an article about the departure / resignation of staffs in a company", [])