<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Sunil_</title>
    <description>The latest articles on Forem by Sunil_ (@sunil8521).</description>
    <link>https://forem.com/sunil8521</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1820176%2Ff4d97f60-5b5c-4b04-a435-f984b5455f71.jpg</url>
      <title>Forem: Sunil_</title>
      <link>https://forem.com/sunil8521</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/sunil8521"/>
    <language>en</language>
    <item>
      <title>Building a RAG System with Vertex AI, Pinecone, and LangChain (Step-by-Step Guide)</title>
      <dc:creator>Sunil_</dc:creator>
      <pubDate>Sat, 13 Sep 2025 05:22:58 +0000</pubDate>
      <link>https://forem.com/sunil8521/building-a-rag-system-with-vertex-ai-pinecone-and-langchain-step-by-step-guide-2d18</link>
      <guid>https://forem.com/sunil8521/building-a-rag-system-with-vertex-ai-pinecone-and-langchain-step-by-step-guide-2d18</guid>
      <description>&lt;h2&gt;
  
  
  What is RAG (Retrieval-Augmented Generation)?
&lt;/h2&gt;

&lt;p&gt;RAG stands for &lt;strong&gt;Retrieval-Augmented Generation&lt;/strong&gt;. It is a way to make AI models smarter by giving them &lt;strong&gt;external knowledge&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Normally, an LLM (like GPT or Gemini) only knows what it learned during training.&lt;/li&gt;
&lt;li&gt;With RAG, you &lt;strong&gt;embed your documents into vectors&lt;/strong&gt; and &lt;strong&gt;store them in a vector database&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;When a user asks a question, the system &lt;strong&gt;finds the most relevant documents&lt;/strong&gt; and passes them to the LLM.&lt;/li&gt;
&lt;li&gt;The LLM then generates an answer &lt;strong&gt;based on the retrieved documents&lt;/strong&gt;, so it can answer questions about content it hasn’t seen before.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Simple analogy:&lt;/strong&gt;&lt;br&gt;
Think of it like a student using a &lt;strong&gt;textbook&lt;/strong&gt;. The LLM is the student, and the vector store is the textbook. When asked a question, the student looks up the textbook and answers accurately instead of just guessing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7muw1tvxkerkaakcfkey.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7muw1tvxkerkaakcfkey.png" alt="workflow of RAG" width="606" height="644"&gt;&lt;/a&gt;&lt;/p&gt;
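
&lt;p&gt;In code, that whole loop boils down to three steps: retrieve, augment, and generate. Here’s a minimal conceptual sketch — the two helper functions are placeholders I made up for illustration, not a real API; we build the real Pinecone and Vertex AI versions later in this guide:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Conceptual RAG loop — placeholder helpers, replaced by real Pinecone/Vertex AI calls later.
async function retrieveRelevantChunks(question: string, k: number) {
  return ["...relevant chunk 1...", "...relevant chunk 2..."]; // placeholder data
}
async function generateAnswer(prompt: string) {
  return "an answer based only on the context in the prompt"; // placeholder
}

async function answerWithRag(question: string) {
  // 1. Retrieve: find the document chunks most similar to the question
  const chunks = await retrieveRelevantChunks(question, 3);

  // 2. Augment: put the retrieved chunks into the prompt as context
  const prompt = `Answer only from this context:\n${chunks.join("\n\n")}\n\nQuestion: ${question}`;

  // 3. Generate: the LLM answers from that context instead of guessing
  return generateAnswer(prompt);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
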
&lt;h3&gt;
  
  
  1. Create a Google Cloud Account
&lt;/h3&gt;

&lt;p&gt;To get started, you need a Google Cloud account. Google gives every new user &lt;strong&gt;$300 in free credits&lt;/strong&gt; that you can spend on services like Vertex AI. &lt;/p&gt;

&lt;p&gt;Go to &lt;a href="https://cloud.google.com/" rel="noopener noreferrer"&gt;Google Cloud&lt;/a&gt; and sign up with your Google account.&lt;/p&gt;
&lt;h3&gt;
  
  
  2. Enable the Vertex AI API
&lt;/h3&gt;

&lt;p&gt;Once your Google Cloud account is ready, the next step is to turn on the &lt;strong&gt;Vertex AI API&lt;/strong&gt; for your project.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;In the Google Cloud console, open the menu and go to &lt;strong&gt;“APIs &amp;amp; Services” → “Library.”&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Search for &lt;strong&gt;Vertex AI API.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Click on it and hit &lt;strong&gt;“Enable.”&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  3. Create a Service Account
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;In the Google Cloud console, go to &lt;strong&gt;“IAM &amp;amp; Admin” → “Service Accounts.”&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;“Create Service Account.”&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Give it a name (for example: &lt;code&gt;vertex-rag-sa&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Assign it a role such as &lt;strong&gt;Vertex AI User&lt;/strong&gt; (this allows it to access Vertex AI).&lt;/li&gt;
&lt;li&gt;After creating the service account, click on it, go to the &lt;strong&gt;“Keys”&lt;/strong&gt; section, and choose &lt;strong&gt;“Add Key → Create New Key.”&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Select &lt;strong&gt;JSON&lt;/strong&gt; as the key type and download the file. You’ll use this JSON file later to authenticate your JavaScript code when calling Vertex AI.&lt;/li&gt;
&lt;/ol&gt;


&lt;h3&gt;
  
  
  4. Initialize Node.js Project and Install Libraries
&lt;/h3&gt;

&lt;p&gt;First, let’s create a new Node.js project. Open your terminal and run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm init &lt;span class="nt"&gt;-y&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates a basic &lt;code&gt;package.json&lt;/code&gt; file that holds your project’s metadata and will track the dependencies we install next.&lt;/p&gt;

&lt;p&gt;Next, we need some libraries to make RAG work with Vertex AI and Pinecone. Install them like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; @langchain/pinecone @langchain/google-vertexai @pinecone-database/pinecone @langchain/textsplitters @langchain/community
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here’s what they do:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;@langchain/pinecone&lt;/code&gt; → lets LangChain talk to Pinecone.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;@langchain/google-vertexai&lt;/code&gt; → lets us use Vertex AI embeddings and LLMs.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;@pinecone-database/pinecone&lt;/code&gt; → the Pinecone client to store and search your vectors.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;@langchain/community&lt;/code&gt; → provides document loaders, such as the PDF loader we’ll use to read files into LangChain.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;@langchain/textsplitters&lt;/code&gt; → splits large documents into smaller chunks so each embedding captures focused, searchable context.&lt;/li&gt;
&lt;/ul&gt;
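
&lt;p&gt;As an optional sanity check (it needs the Google credentials from the service-account JSON, set up in the &lt;code&gt;.env&lt;/code&gt; section later), you can embed one sentence and confirm the vector size. This is a hedged sketch — the explicit &lt;code&gt;model&lt;/code&gt; option is my assumption, so adjust it to your setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { VertexAIEmbeddings } from "@langchain/google-vertexai";

// Embed a single sentence and print the vector length.
// text-embedding-004 produces 768-dimensional vectors, which is why the
// Pinecone index we create in the next step uses dimension 768.
const embeddings = new VertexAIEmbeddings({ model: "text-embedding-004" });
const vector = await embeddings.embedQuery("Hello RAG");
console.log(vector.length); // → 768
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;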

&lt;h3&gt;
  
  
  5. What we are building
&lt;/h3&gt;

&lt;p&gt;Before writing any code, let’s talk about what we are going to build. We’ll create a &lt;strong&gt;RAG (Retrieval-Augmented Generation) system&lt;/strong&gt; that can answer questions about a company’s policy.&lt;/p&gt;

&lt;p&gt;Basically:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We take a &lt;strong&gt;PDF document&lt;/strong&gt; that has the company policy.&lt;/li&gt;
&lt;li&gt;We break it into small chunks and &lt;strong&gt;store it in a vector database&lt;/strong&gt; (Pinecone).&lt;/li&gt;
&lt;li&gt;When we ask a question, our system &lt;strong&gt;retrieves the most relevant parts&lt;/strong&gt; from the PDF and gives an answer using Vertex AI.&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;For this tutorial, you don’t need a real company PDF. I’m using a &lt;strong&gt;dummy PDF&lt;/strong&gt; that we can generate using GPT. You can replace it with your own document later.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  6. Load the document and set up Pinecone
&lt;/h3&gt;

&lt;p&gt;Now we need two things:&lt;br&gt;
1. &lt;strong&gt;Load our PDF&lt;/strong&gt; into LangChain.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use the PDF loader from &lt;code&gt;@langchain/community&lt;/code&gt; to read the file.&lt;/li&gt;
&lt;li&gt;Split the text into smaller chunks using &lt;code&gt;@langchain/textsplitters&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Smaller chunks give the embedding model focused text to work with, which improves retrieval quality.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;2. &lt;strong&gt;Set up Pinecone&lt;/strong&gt; to store the embeddings.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Go to &lt;a href="https://www.pinecone.io/" rel="noopener noreferrer"&gt;Pinecone&lt;/a&gt; and create a free account.&lt;/li&gt;
&lt;li&gt;Create a new &lt;strong&gt;index&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Choose &lt;strong&gt;Custom settings&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Since we’re using the &lt;code&gt;text-embedding-004&lt;/code&gt; model from Vertex AI:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dimension&lt;/strong&gt; → &lt;code&gt;768&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vector Type&lt;/strong&gt; → &lt;code&gt;Dense&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metric&lt;/strong&gt; → &lt;code&gt;Cosine&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If you use another embedding model, make sure to update the &lt;strong&gt;dimension value&lt;/strong&gt; based on that model’s output size. (If you’d rather create the index from code, see the sketch right after this list.)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;After creating the index, go to the &lt;strong&gt;API Keys section&lt;/strong&gt; in Pinecone. Copy your &lt;strong&gt;API Key&lt;/strong&gt; and add it to your &lt;code&gt;.env&lt;/code&gt; file like this:&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PINECONE_API_KEY=your-pinecone-api-key-here
&lt;/code&gt;&lt;/pre&gt;




&lt;/li&gt;

&lt;/ul&gt;
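
&lt;p&gt;If you’d rather create the index from code than click through the dashboard, here’s a minimal sketch using the Pinecone Node client. The index name &lt;code&gt;google-embed&lt;/code&gt; and the serverless cloud/region are assumptions — use whatever matches your account:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { Pinecone } from "@pinecone-database/pinecone";

const pc = new Pinecone({ apiKey: process.env.PINECONE_API_KEY as string });

// Dimension 768 matches Vertex AI's text-embedding-004 output size.
await pc.createIndex({
  name: "google-embed",   // assumed index name; reused in the code below
  dimension: 768,
  metric: "cosine",
  spec: { serverless: { cloud: "aws", region: "us-east-1" } }, // assumption: a serverless index
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;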

&lt;h3&gt;
  
  
  7. Load the document and push it into Pinecone
&lt;/h3&gt;

&lt;p&gt;Now let’s write the code to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Load our PDF&lt;/strong&gt; (Company Policy).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Split it into chunks&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generate embeddings with Vertex AI&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Store them inside Pinecone&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here’s the code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;PDFLoader&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@langchain/community/document_loaders/fs/pdf&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;RecursiveCharacterTextSplitter&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@langchain/textsplitters&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;VertexAIEmbeddings&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@langchain/google-vertexai&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;PineconeStore&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@langchain/pinecone&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Pinecone&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;PineconeClient&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@pinecone-database/pinecone&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// 1. Setup Pinecone client&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;pc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;PineconeClient&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;PINECONE_API_KEY&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;pineconeIndex&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;pc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;google-embed&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// "google-embed" = your Pinecone index name&lt;/span&gt;

&lt;span class="c1"&gt;// 2. Setup embeddings model (Vertex AI)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;embeddings_model_google&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;VertexAIEmbeddings&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="c1"&gt;// 3. Create vector store connection&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;vectorStore&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;PineconeStore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;embeddings_model_google&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;pineconeIndex&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;maxConcurrency&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// 4. Load PDF, split it, and add to Pinecone&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;loader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;pdf&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;loader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;PDFLoader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;pdf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;splitPages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;loader&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;✅ Document loaded&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;textSplitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;chunkSize&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;// each chunk ~500 characters&lt;/span&gt;
      &lt;span class="na"&gt;chunkOverlap&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="c1"&gt;// overlap to keep context&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;texts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;textSplitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;splitText&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]?.&lt;/span&gt;&lt;span class="nx"&gt;pageContent&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;✅ Document split into chunks&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;documents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;texts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;pageContent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]?.&lt;/span&gt;&lt;span class="nx"&gt;metadata&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="p"&gt;{},&lt;/span&gt;
    &lt;span class="p"&gt;}));&lt;/span&gt;

    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;⏳ Adding to vector database...&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;vectorStore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addDocuments&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;🎉 Added to Pinecone database!&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;er&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;❌ Error while processing document&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;er&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Run the loader with our PDF&lt;/span&gt;
&lt;span class="nf"&gt;loader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;./Company_Policy.pdf&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  🔑 Important: &lt;code&gt;.env&lt;/code&gt; Setup
&lt;/h3&gt;

&lt;p&gt;Make sure you have the following in your &lt;strong&gt;&lt;code&gt;.env&lt;/code&gt; file&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PINECONE_API_KEY=your-pinecone-api-key
GOOGLE_APPLICATION_CREDENTIALS=embeddings-test-471806-e1b74d261b5b.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;👉 &lt;code&gt;GOOGLE_APPLICATION_CREDENTIALS&lt;/code&gt; points to the JSON key file we downloaded in step 3.&lt;/p&gt;
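
&lt;p&gt;👉 One more note: Node.js doesn’t read &lt;code&gt;.env&lt;/code&gt; files on its own, and the snippets above just assume the variables are already in &lt;code&gt;process.env&lt;/code&gt;. A common way to load them (an assumption about your setup, not something the code above includes) is the &lt;code&gt;dotenv&lt;/code&gt; package — install it with &lt;code&gt;npm install dotenv&lt;/code&gt; and add this at the very top of your entry file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Loads .env into process.env before any other code runs,
// so PINECONE_API_KEY and GOOGLE_APPLICATION_CREDENTIALS are available.
import "dotenv/config";
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Newer Node versions can also do this without a package via &lt;code&gt;node --env-file=.env&lt;/code&gt;.&lt;/p&gt;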

&lt;p&gt;👉 After running the loader script, your PDF chunks should now be stored in Pinecone.&lt;br&gt;
You can go to your &lt;strong&gt;Pinecone dashboard&lt;/strong&gt; and open the index you created.&lt;/p&gt;

&lt;p&gt;There, you’ll see something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7626w2zl1nqs6vujbvtv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7626w2zl1nqs6vujbvtv.png" alt="Pinecone dashboard" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  8. Retrieve documents and use LLM to answer
&lt;/h3&gt;

&lt;p&gt;Now that we’ve stored our company policy PDF in Pinecone, we can &lt;strong&gt;search for relevant chunks&lt;/strong&gt; and use Vertex AI (LLM) to answer questions.&lt;/p&gt;

&lt;p&gt;👉 We’ll use the &lt;code&gt;vectorStore&lt;/code&gt; we created earlier to run &lt;strong&gt;similarity search&lt;/strong&gt; on the vector database. This will pull out the most relevant pieces of text related to our question. Then we’ll pass those chunks into an LLM (Gemini on Vertex AI) to get the final answer.&lt;/p&gt;

&lt;p&gt;Here’s the vector store setup we wrote earlier, shown again for reference:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// embeddings_model_google and pineconeIndex &lt;/span&gt;
&lt;span class="c1"&gt;// were already defined in the previous step&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;vectorStore&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;PineconeStore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;embeddings_model_google&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;pineconeIndex&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;maxConcurrency&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And here’s the retrieval and answering code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;vectorStore&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;./prepare.js&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;ChatVertexAI&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@langchain/google-vertexai&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;SystemMessage&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@langchain/core/messages&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// 1. Setup LLM&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ChatVertexAI&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gemini-2.5-flash&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// fast + good for QA&lt;/span&gt;
  &lt;span class="na"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;            &lt;span class="c1"&gt;// deterministic answers&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// 2. System prompt for strict rules&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;SYSTEM_PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;SystemMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="s2"&gt;`You are a helpful and knowledgeable assistant named General Dao. 
   Your role is to answer user questions only from the given context.
   Rules:
   - If the answer exists in the context, give a clear and concise response.
   - If it’s not in the context, say you don’t have enough information.
   - Don’t make up answers. Stay professional and helpful.`&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// 3. Main function&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;What is the company policy on leave?&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// Example question, this query is hardcoded, but you can make it dynamic using an API or user input&lt;/span&gt;

  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Run similarity search → get 3 most relevant chunks&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;vectorStore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;similaritySearch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Merge results into one context string&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;final_res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pageContent&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Build human prompt&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;HUMAN_PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="s2"&gt;`Q: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;q&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;\n\nContext:\n&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;final_res&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Ask the LLM&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;aiMsg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="nx"&gt;SYSTEM_PROMPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;HUMAN_PROMPT&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;

    &lt;span class="c1"&gt;// Print answer&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Answer:&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;aiMsg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;er&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;❌ Error:&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;er&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🔑 &lt;strong&gt;How it works:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;User asks a question.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;vectorStore.similaritySearch()&lt;/code&gt; finds the most relevant chunks from Pinecone.&lt;/li&gt;
&lt;li&gt;We send those chunks + question to Vertex AI (Gemini model).&lt;/li&gt;
&lt;li&gt;The LLM generates an answer strictly from the retrieved context.&lt;/li&gt;
&lt;/ol&gt;
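
&lt;p&gt;The question in the example is hardcoded, but as the comment in the code says, you can make it dynamic. One hedged way to do that — reusing the &lt;code&gt;llm&lt;/code&gt;, &lt;code&gt;SYSTEM_PROMPT&lt;/code&gt;, and &lt;code&gt;vectorStore&lt;/code&gt; defined above (the wrapper function name is mine, not part of the original code) — is to wrap the same steps in a small function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Reusable wrapper around the same retrieve → augment → generate steps.
// Assumes llm, SYSTEM_PROMPT, vectorStore, and HumanMessage from the code above are in scope.
async function answerQuestion(question: string) {
  const docs = await vectorStore.similaritySearch(question, 3);
  const context = docs.map((d) =&amp;gt; d.pageContent).join("\n\n");

  const aiMsg = await llm.invoke([
    SYSTEM_PROMPT,
    new HumanMessage(`Q: ${question}\n\nContext:\n${context}`),
  ]);
  return aiMsg.content;
}

// Example: call it with any question, e.g. from an API route or CLI input.
console.log(await answerQuestion("How many paid leave days do employees get?"));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;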

&lt;h2&gt;
  
  
  📚 What’s Next?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;You can read more about &lt;strong&gt;LangChain vector store integrations&lt;/strong&gt; in the official docs:
👉 &lt;a href="https://docs.langchain.com/oss/javascript/integrations/vectorstores" rel="noopener noreferrer"&gt;LangChain VectorStores Documentation&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>webdev</category>
      <category>javascript</category>
      <category>ai</category>
      <category>rag</category>
    </item>
  </channel>
</rss>
