Building Your First RAG System: Making AI Smart About Your Business

Part 2 of 5: Retrieval-Augmented Generation

Welcome back! Today we’re diving into RAG systems — one of the most practical ways to make AI useful for your specific business needs.

What Exactly is RAG?

Let me explain RAG with a simple analogy. Imagine you’re taking an open-book exam. Instead of relying only on what you memorized, you can look up information in your textbooks to give better, more accurate answers. That’s essentially what RAG does for AI models.

RAG = AI Model + Your Custom Knowledge Base

The AI model provides the “intelligence” to understand questions and generate human-like responses, while your knowledge base provides the specific, up-to-date information about your business.

Why Should You Care About RAG?

Here’s the thing about large language models like ChatGPT — they’re incredibly smart, but they don’t know anything about your specific business. They can’t tell customers about your latest product features, your company policies, or answer questions using your internal documentation.

RAG solves this by:

  • Making AI responses more accurate and relevant to your business
  • Keeping information up-to-date (just update your documents, no retraining needed)
  • Providing transparency (you can see which documents the AI used to answer)
  • Being cost-effective (much cheaper than fine-tuning large models)

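Before wiring up real models, it helps to see the control flow RAG adds around an LLM: retrieve the most relevant text, then generate an answer from it. Here's a toy sketch in plain Python, where keyword overlap stands in for embeddings and a string template stands in for the model call (both are illustrative stand-ins, not the real APIs):

```python
# Toy sketch of the retrieve-then-generate loop at the heart of RAG.
# Real systems use vector embeddings and an LLM; keyword overlap and a
# template stand in for both here, just to show the control flow.
docs = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available Monday to Friday, 9am to 5pm.",
    "The Pro plan costs $49 per month and includes priority support.",
]

def retrieve(question, docs, k=1):
    """Score each doc by word overlap with the question, return the top k."""
    q_words = set(question.lower().split())
    scored = sorted(docs, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)
    return scored[:k]

def generate(question, context):
    """Stand-in for the LLM call: stitch the retrieved context into an answer."""
    return f"Based on our records: {' '.join(context)}"

question = "How much does the Pro plan cost?"
context = retrieve(question, docs)
print(generate(question, context))
```

Everything that follows in this article is this same loop, with real embeddings doing the retrieval and Gemini doing the generation.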
Let’s Build One Together

I’ll show you how to build a RAG system using Google’s Gemini AI. We’ll create a system that can answer questions about “MyNextDeveloper”. Here’s the step-by-step process:

Step 1: Install the Required Tools

!pip install -q -U google-generativeai langchain-google-genai langchain faiss-cpu pypdf langchain-community

These libraries will handle the heavy lifting for us:

  • google-generativeai: Access to Google’s Gemini models
  • langchain-google-genai: Gemini integration with LangChain
  • langchain / langchain-community: Framework for building AI applications (the community package provides integrations like the FAISS vector store)
  • faiss-cpu: Fast similarity search for finding relevant documents
  • pypdf: Loading PDF documents (not needed for this minimal example, but handy for real knowledge bases)

Step 2: Set Up Google AI API

import os
from google.colab import userdata
import google.generativeai as genai

GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')
genai.configure(api_key=GOOGLE_API_KEY)

You’ll need to get a free API key from Google AI Studio and add it to your environment.
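Note that `google.colab.userdata` only works inside Colab. If you run this as a regular script, you'd typically read the key from an environment variable instead. A minimal sketch, assuming you've exported the key as `GOOGLE_API_KEY`:

```python
import os

def load_api_key(name="GOOGLE_API_KEY"):
    """Read the API key from the environment; fail fast if it's missing."""
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"Set the {name} environment variable first.")
    return key
```

Failing fast here is deliberate: a missing key otherwise surfaces later as a confusing authentication error deep inside the API client.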

Step 3: Prepare Your Knowledge Base

raw_text = """
MyNextDeveloper (MND) is a remote-first tech company founded in 2022 in Mumbai. They focus on supplying pre-vetted, highly skilled developers to startups. Their mission is to solve the trust gap between startups and engineers by emphasizing empathy, communication, and transparency. The company promotes agile practices, test-driven development, and POSH-compliant culture. MND offers services like staff augmentation, API and web development, UI/UX design, and AI/ML solutions. Their tech stack includes Angular, React, Next.js, Node.js, Django, Python, Docker, and AWS.
Clients appreciate their transparency and delivery quality. Project pricing starts at $10,000 with hourly rates between $25 and $49. MND claims startups can save up to 30% through their freelance model.
They operate globally, serving startups, Y Combinator and Techstars alumni. It is headquartered in Malabar Hill, Mumbai.
"""


# Convert to LangChain Document
from langchain_core.documents import Document
documents = [Document(page_content=raw_text)]
print(f"Loaded {len(documents)} document(s).")

In a real application, this would be your product documentation, FAQ, company policies, or any text-based information about your business.

Step 4: Split the Information into Chunks

from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=50,
    length_function=len,
    add_start_index=True,
)
chunks = text_splitter.split_documents(documents)
print(f"Split into {len(chunks)} chunks.")

Why do we split the text? AI models work better with smaller, focused pieces of information. The chunk_overlap=50 ensures we don't lose context at chunk boundaries, while add_start_index=True helps us track where information came from.
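To build intuition for what `chunk_size` and `chunk_overlap` actually do, here's a deliberately simplified sliding-window splitter in plain Python. (LangChain's `RecursiveCharacterTextSplitter` is smarter: it also tries to break on paragraph and sentence boundaries rather than mid-word.)

```python
def split_with_overlap(text, chunk_size=300, chunk_overlap=50):
    """Naive fixed-size splitter: each chunk starts chunk_size - chunk_overlap
    characters after the previous one, so neighbours share chunk_overlap chars."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "x" * 700
chunks = split_with_overlap(text)
print(len(chunks))               # 3 chunks, starting at offsets 0, 250, 500
print([len(c) for c in chunks])  # [300, 300, 200]
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, which is exactly the "don't lose context at chunk boundaries" point above.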

Step 5: Create Embeddings and Vector Store

from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_community.vectorstores import FAISS
embeddings = GoogleGenerativeAIEmbeddings(
    model="models/embedding-001",
    google_api_key=GOOGLE_API_KEY
)
print("Creating FAISS vector store...")
vector_store = FAISS.from_documents(chunks, embeddings)
print("FAISS vector store created.")

This is the most important step. Each chunk of text gets converted into a “vector” (a numerical representation of its meaning). When someone asks a question, we find the chunks whose vectors are most similar to the question’s vector.
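Under the hood, "most similar" usually means highest cosine similarity between vectors. Here's a toy illustration with hand-made 3-dimensional vectors (real embeddings like `embedding-001` have hundreds of dimensions, but the math is the same):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Pretend embeddings: a question about pricing should land near the pricing chunk.
chunk_vectors = {
    "pricing chunk":  [0.9, 0.1, 0.0],
    "services chunk": [0.1, 0.9, 0.1],
}
question_vector = [0.8, 0.2, 0.1]

best = max(chunk_vectors, key=lambda name: cosine_similarity(question_vector, chunk_vectors[name]))
print(best)  # pricing chunk
```

FAISS does essentially this comparison, but with index structures that stay fast when you have millions of chunks instead of two.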

Step 6: Set Up the Question-Answering System

from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.chains import RetrievalQA
from langchain_core.prompts import PromptTemplate
retriever = vector_store.as_retriever(search_kwargs={"k": 3})
llm = ChatGoogleGenerativeAI(
    model="gemini-1.5-flash",
    temperature=0.3,
    google_api_key=GOOGLE_API_KEY
)
prompt_template = """
You are an AI assistant that answers questions based on the provided context.
If the answer is not available in the context, politely state that you don't know.
Context:
{context}
Question:
{question}
Answer:
"""
RAG_PROMPT = PromptTemplate.from_template(prompt_template)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": RAG_PROMPT}
)

The temperature=0.3 keeps responses focused and factual, while search_kwargs={"k": 3} means we'll retrieve the 3 most relevant chunks for each question.
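The `chain_type="stuff"` option means the chain simply "stuffs" all k retrieved chunks into the `{context}` slot of the prompt before calling the model. A sketch of that assembly step in plain Python (the `build_stuffed_prompt` helper is illustrative, not a LangChain internal):

```python
prompt_template = """You are an AI assistant that answers questions based on the provided context.
If the answer is not available in the context, politely state that you don't know.
Context:
{context}
Question:
{question}
Answer:
"""

def build_stuffed_prompt(question, retrieved_chunks):
    """Join the top-k retrieved chunks and substitute them into the template."""
    context = "\n\n".join(retrieved_chunks)
    return prompt_template.format(context=context, question=question)

chunks = ["MND is a remote-first tech company.", "Hourly rates are $25 to $49."]
print(build_stuffed_prompt("What are MND's rates?", chunks))
```

This is also why k matters: every extra chunk makes the final prompt longer, which costs more tokens and can dilute the model's focus.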

Step 7: Test It Out

questions = [
    "What does MyNextDeveloper (MND) do?",
    "Where is MND headquartered?",
    "What services does MND offer?",
    "What technologies does MND use?",
    "How does MND help startups save money?",
    "What is the capital of France?"  # This should trigger "I don't know"
]
for i, question in enumerate(questions):
    print(f"\nQuestion {i+1}: {question}")
    response = qa_chain.invoke({"query": question})
    print(f"Answer: {response['result']}")
    if 'source_documents' in response:
        print("Source Documents:")
        for doc in response['source_documents']:
            print(f"- Content (truncated): {doc.page_content[:150]}...")

Sample Output:

Question 1: What does MyNextDeveloper (MND) do?
Answer: MyNextDeveloper (MND) is a remote-first tech company that supplies pre-vetted, highly skilled developers to startups. They offer services such as staff augmentation, API and web development, UI/UX design, and AI/ML solutions. Their goal is to bridge the trust gap between startups and engineers through empathy, communication, and transparency.
Question 6: What is the capital of France?
Answer: I'm sorry, but I don't know the capital of France. The provided text is about a company called MND and doesn't contain information about France's capital.

Notice how the system correctly answers questions about MND but politely declines to answer questions outside its knowledge base.

Google Colab

Here’s the Google Colab notebook for this article, runnable in one click: link

Real-World Applications

Here are some practical ways you can use RAG systems:

Customer Support: Upload your FAQ, product manuals, and support tickets. Let the AI handle common questions 24/7.

Internal Knowledge Base: Upload company policies, procedures, and documentation. Employees can ask questions in natural language instead of digging through files.

Sales Assistant: Upload product specifications, pricing, and case studies. Your sales team gets instant access to any information they need during calls.

Content Creation: Upload your existing content, blog posts, and marketing materials. Generate new content that’s consistent with your brand voice.

The Limitations (Let’s Be Honest)

RAG isn’t perfect:

  • Quality depends on your data: Garbage in, garbage out
  • Can hallucinate: Sometimes the AI might make up information that sounds plausible
  • Requires maintenance: You need to keep your knowledge base updated
  • Cost considerations: API calls can add up with heavy usage

Key Implementation Insights

From our working example, here are some important technical considerations:

Chunk Size Matters: We used 300 characters with 50-character overlap. This balances context preservation with retrieval precision.

Embedding Quality: Google’s embedding-001 model provides good general-purpose embeddings, but you might need domain-specific embeddings for specialized content.

Retrieval Strategy: The k=3 parameter means we retrieve 3 most relevant chunks. You can adjust this based on your use case.

Prompt Engineering: Our prompt explicitly tells the AI to say “I don’t know” when information isn’t available. This prevents hallucination.

Tips for Success

  1. Start small: Begin with one specific use case rather than trying to cover everything
  2. Keep documents updated: Outdated information leads to frustrated users
  3. Test extensively: Try lots of different questions to understand where your system works well and where it doesn’t
  4. Monitor usage: Keep an eye on what questions people are asking to improve your knowledge base
  5. Iterate on chunk size: Experiment with different chunk sizes and overlap values for your specific content
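For tip 5, a quick loop gives you a feel for the numbers before you re-embed anything. Here a naive fixed-size splitter stands in for the real one, just to show how chunk size trades off count against context:

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Naive fixed-size splitter used only to compare chunk counts."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

sample = "lorem " * 200  # 1200 characters of stand-in content
for size in (150, 300, 600):
    chunks = split_with_overlap(sample, size, 50)
    print(f"chunk_size={size}: {len(chunks)} chunks")
```

Smaller chunks mean more precise retrieval but less context per chunk (and more embedding calls); larger chunks mean the opposite. The sweet spot depends on your content, so measure rather than guess.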

What’s Next

In our next article, we’ll explore fine-tuning — taking an existing AI model and training it specifically for your use case. While RAG gives AI access to your information, fine-tuning actually changes how the AI “thinks” about your domain.

RAG is often the best starting point because it’s simpler, more transparent, and easier to maintain. But sometimes you need that extra level of customization that only fine-tuning can provide.