Build a RAG Pipeline With LangChain and Pinecone

💡 A step-by-step guide to building a production-ready RAG pipeline using LangChain and Pinecone vector database.

Retrieval-Augmented Generation (RAG) has become the go-to architecture for developers building AI applications that need accurate, context-aware responses grounded in custom data. This tutorial walks you through building a complete RAG pipeline using LangChain and Pinecone, from document ingestion to intelligent query answering — with production-ready code patterns you can deploy today.

Unlike fine-tuning, which can cost thousands of dollars and has to be repeated whenever the underlying data changes, RAG lets you dynamically inject relevant context into LLM prompts at query time. The combination of LangChain's orchestration framework and Pinecone's managed vector database makes this approach accessible, scalable, and surprisingly fast to implement.

Key Takeaways at a Glance

  • RAG pipelines reduce hallucinations by grounding LLM responses in your actual data
  • LangChain 0.3+ provides modular components for document loading, splitting, embedding, and retrieval
  • Pinecone's free tier is enough for most prototypes and small production apps (older pod-based plans capped it at roughly 100,000 vectors; serverless plans meter storage instead)
  • Total setup time is under 60 minutes for a working prototype
  • The architecture supports OpenAI, Anthropic Claude, and open-source models interchangeably
  • Estimated cost for a small-scale deployment: under $20/month using OpenAI's text-embedding-3-small model

Understanding the RAG Architecture Before You Build

RAG works in 2 phases: an offline ingestion phase and a real-time query phase. During ingestion, you load documents, split them into chunks, generate vector embeddings, and store them in a vector database like Pinecone.

During the query phase, a user's question gets converted into an embedding. Pinecone performs a similarity search to find the most relevant document chunks, which then get injected into the LLM prompt as context.
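To make the query phase concrete, here is a minimal sketch of a similarity search in plain Python. The three-dimensional vectors and the names `store` and `top_k` are made up for illustration; in the real pipeline the vectors come from an embedding model and Pinecone performs the search.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, store, k=2):
    """Return the k chunk texts whose vectors are most similar to the query."""
    scored = sorted(store, key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True)
    return [text for text, _ in scored[:k]]

# Toy 3-dimensional "embeddings" (assumed values); real ones have 1536+ dimensions.
store = [
    ("Refunds are processed within 5 days.", [0.9, 0.1, 0.0]),
    ("Our office is closed on Sundays.", [0.0, 0.2, 0.9]),
    ("Refund requests need an order number.", [0.8, 0.3, 0.1]),
]
query_vec = [1.0, 0.2, 0.0]  # stands in for the embedded question "How do refunds work?"
context_chunks = top_k(query_vec, store, k=2)
```

The two refund chunks score far higher than the office-hours chunk, so only they would be passed to the LLM as context.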

This architecture is fundamentally different from approaches like fine-tuning GPT-4 or training custom models. RAG keeps your base model unchanged while dynamically providing relevant context, which makes it ideal for applications where data changes frequently: knowledge bases, legal documents, product catalogs, and internal wikis.

Step 1: Install Dependencies and Configure API Keys

Start by installing the required Python packages. You will need LangChain's core library, the Pinecone client, and OpenAI's SDK for embeddings and the LLM.

Here are the essential packages to install:

  • langchain (v0.3+): the core orchestration framework
  • langchain-openai: OpenAI integration for embeddings and chat models
  • langchain-pinecone: Pinecone vector store integration
  • pinecone-client (v3+): Pinecone's official Python SDK (newer releases are published under the package name pinecone)
  • tiktoken: tokenizer for managing chunk sizes
  • python-dotenv: for secure API key management

You will need 2 API keys: one from OpenAI (starting at $5 credit for new accounts) and one from Pinecone (free tier available at pinecone.io). Store both in a .env file and never commit them to version control.
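One way to handle the keys is to read them from the environment and fail early when one is missing. The sketch below uses only the standard library; `load_api_keys` is a hypothetical helper name, and the variable names `OPENAI_API_KEY` and `PINECONE_API_KEY` are the conventional ones the two SDKs look for.

```python
import os

REQUIRED_KEYS = ("OPENAI_API_KEY", "PINECONE_API_KEY")

def load_api_keys(env=os.environ):
    """Return the required keys, failing fast with a clear error if any is missing."""
    # With python-dotenv installed, call dotenv.load_dotenv() before this
    # function to copy the entries of your .env file into os.environ.
    missing = [name for name in REQUIRED_KEYS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in REQUIRED_KEYS}
```

Failing at startup is usually easier to debug than an authentication error halfway through ingestion.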

Create a new Pinecone index through the dashboard or programmatically. Set the dimension to 1536 if using OpenAI's text-embedding-3-small model, or 3072 for text-embedding-3-large. Choose cosine as the similarity metric: it is the usual choice for text embeddings, and because OpenAI's embeddings are normalized to unit length, cosine and dot product rank results identically.
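Creating the index programmatically might look like the sketch below. It assumes the v3+ Pinecone SDK and a serverless index; the cloud and region values are placeholders, and `index_settings` and `create_index` are illustrative helper names rather than part of any library.

```python
EMBEDDING_DIMENSIONS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}

def index_settings(embedding_model):
    """Settings the index must match; the dimension has to equal the embedding size."""
    return {"dimension": EMBEDDING_DIMENSIONS[embedding_model], "metric": "cosine"}

def create_index(api_key, name, embedding_model="text-embedding-3-small"):
    """Create a serverless index if it does not exist yet, then return a handle to it."""
    # Imported lazily so index_settings works without the SDK installed.
    from pinecone import Pinecone, ServerlessSpec

    pc = Pinecone(api_key=api_key)
    if name not in pc.list_indexes().names():
        pc.create_index(
            name=name,
            spec=ServerlessSpec(cloud="aws", region="us-east-1"),  # assumed placeholders
            **index_settings(embedding_model),
        )
    return pc.Index(name)
```

Keeping the model-to-dimension mapping in one place avoids the most common setup error: an index whose dimension does not match the embedding model.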

Step 2: Load and Chunk Your Documents Intelligently

Document loading is where most RAG pipelines silently fail. LangChain provides over 80 document loaders for PDFs, web pages, databases, Notion exports, and more.

For PDF files, use PyPDFLoader. For web content, WebBaseLoader handles HTML parsing automatically. For structured data, CSVLoader and JSONLoader preserve metadata that becomes critical during retrieval.

Chunking strategy directly impacts retrieval quality. The RecursiveCharacterTextSplitter is the recommended default because it respects natural text boundaries — paragraphs, sentences, then words — rather than cutting mid-sentence.

Optimal chunking parameters based on production experience:

  • chunk_size: 500-1000 characters for general text (start with 800)
  • chunk_overlap: 100-200 characters to preserve context across boundaries
  • separators: Use the default hierarchy — paragraph breaks, newlines, spaces
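  • length_function: Defaults to character count. If you switch to tiktoken token counting for accuracy, chunk_size and chunk_overlap are measured in tokens, so scale the values above down (English text averages roughly 4 characters per token)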

Compared to naive fixed-size splitting, recursive splitting improves retrieval accuracy by 15-25% according to benchmarks published by the LangChain team. One common mistake is setting chunks too large — anything over 2000 characters tends to dilute relevance during similarity search.
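The idea behind recursive splitting can be sketched in a few lines of plain Python. This is a simplified illustration, not LangChain's implementation: it leaves out chunk_overlap, and `recursive_split` is a hypothetical name. In the pipeline itself you would use RecursiveCharacterTextSplitter.

```python
def recursive_split(text, chunk_size=800, separators=("\n\n", "\n", " ")):
    """Split on the coarsest separator first, recursing only into oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separator left: fall back to a hard cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small neighbours into one chunk
            continue
        if current.strip():
            chunks.append(current)
        current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Still too long: try again with the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current.strip():
        chunks.append(current)
    return chunks

sample = "Para one is short.\n\nPara two is also short.\n\n" + "word " * 50
chunks = recursive_split(sample, chunk_size=60)
```

The two short paragraphs stay together in the first chunk, while the long run of words is broken at spaces rather than mid-word.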

Step 3: Generate Embeddings and Store in Pinecone

Embedding generation converts your text chunks into high-dimensional vectors that capture semantic meaning. OpenAI's text-embedding-3-small model offers the best price-to-performance ratio at $0.02 per 1 million tokens — roughly 3,000 pages of text for just 2 cents.

LangChain's OpenAIEmbeddings class batches requests automatically; its chunk_size parameter sets how many texts go into each API call. For large document sets (over 10,000 chunks), lowering the batch size to around 100 helps avoid rate limits.
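If you call an embedding API yourself, or want explicit control over request sizes, batching is straightforward to sketch. Here `embed_batch` is an assumed callable that takes a list of strings and returns one vector per string; `batched` and `embed_all` are illustrative names.

```python
def batched(items, batch_size=100):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_all(chunks, embed_batch, batch_size=100):
    """Embed chunks batch by batch and return the vectors in the original order."""
    vectors = []
    for batch in batched(chunks, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors
```

With LangChain, the embed_documents method of an OpenAIEmbeddings instance could be passed as `embed_batch`.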

Once embeddings are generated, push them to Pinecone using PineconeVectorStore.from_documents(). This single method call handles the entire pipeline — embedding generation, vector formatting, and upsert to Pinecone — in one step.

Include metadata with each vector. Store the source filename, page number, chunk index, and any relevant tags. This metadata enables filtered searches later — for example, retrieving only chunks from a specific document or date range.
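As a rough sketch of what that metadata could look like, the helper below builds one record per chunk. The record layout, the ID scheme, and the function names are assumptions for illustration; the filter uses Pinecone's MongoDB-style operator syntax.

```python
def chunk_records(source, pages, tags=()):
    """Attach source, page number, and chunk index to every chunk of a document.

    pages is a list of lists: pages[p] holds the chunk texts from page p + 1.
    """
    records, chunk_index = [], 0
    for page_number, chunks in enumerate(pages, start=1):
        for text in chunks:
            records.append({
                "id": f"{source}#chunk-{chunk_index}",  # hypothetical ID scheme
                "text": text,
                "metadata": {
                    "source": source,
                    "page": page_number,
                    "chunk_index": chunk_index,
                    "tags": list(tags),
                },
            })
            chunk_index += 1
    return records

def only_source(source):
    """Metadata filter that restricts a search to chunks from one source file."""
    return {"source": {"$eq": source}}
```

A filter like `only_source("handbook.pdf")` can then be passed as the filter argument of the vector store's similarity search.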

Handling Large-Scale Ingestion

For datasets exceeding 50,000 documents, consider these optimizations:

  • Use Pinecone's serverless indexes (launched in January 2024) for automatic scaling
  • Process documents in parallel using Python's concurrent.futures
  • Implement incremental ingestion — hash each document and skip unchanged ones
  • Monitor your Pinecone dashboard for index fullness and query latency
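Incremental and parallel ingestion, the second and third items above, can be sketched with the standard library alone. The function names and the shape of the state dictionary are assumptions; where you persist the hashes (a JSON file, a database table) is up to you.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def content_hash(text):
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_ingestion(documents, seen_hashes):
    """Decide which documents need (re)ingestion.

    documents maps a document ID to its text; seen_hashes maps a document ID
    to the hash recorded on the previous run. Returns (to_ingest, updated_hashes).
    """
    to_ingest, updated = [], dict(seen_hashes)
    for doc_id, text in documents.items():
        digest = content_hash(text)
        if seen_hashes.get(doc_id) != digest:
            to_ingest.append(doc_id)
            updated[doc_id] = digest
    return to_ingest, updated

def ingest_parallel(doc_ids, ingest_one, max_workers=4):
    """Run ingest_one over the documents in a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(ingest_one, doc_ids))
```

Threads suit this workload because ingestion spends most of its time waiting on embedding and upsert requests rather than on the CPU.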

Pinecone's serverless tier starts at $0.00 for the first 2GB of storage and 2 million read units, making it significantly cheaper than self-hosted alternatives like running Weaviate or Milvus on AWS.

Step 4: Build the Retrieval and Generation Chain

The retrieval chain is where everything comes together. You can connect your Pinecone retriever to an LLM with LangChain's legacy RetrievalQA chain (deprecated in recent releases) or with the newer LCEL (LangChain Expression Language) syntax, which is the better choice for new code.

Configure the retriever with search_type='similarity' and k=4 to return the top 4 most relevant chunks. For more nuanced retrieval, use search_type='mmr' (Maximal Marginal Relevance), which balances relevance with diversity — preventing the retriever from returning 4 nearly identical chunks.
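To show what MMR does under the hood, here is a minimal sketch of the selection loop in plain Python. It illustrates the general algorithm under simplifying assumptions, not LangChain's or Pinecone's code, and the toy vectors are invented.

```python
import math

def _cos(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr(query_vec, candidates, k=4, lambda_mult=0.5):
    """Pick k candidate indices, trading query relevance against redundancy.

    lambda_mult=1.0 is plain similarity ranking; lower values favour diversity.
    """
    selected, remaining = [], list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = _cos(query_vec, candidates[i])
            # Similarity to the closest chunk that has already been picked.
            redundancy = max((_cos(candidates[i], candidates[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

query = [1.0, 0.2]
candidates = [[1.0, 0.0], [1.0, 0.05], [0.7, 0.7]]  # the first two are near-duplicates
diverse = mmr(query, candidates, k=2, lambda_mult=0.5)
plain = mmr(query, candidates, k=2, lambda_mult=1.0)
```

With these vectors, plain ranking returns both near-duplicates, while MMR keeps one of them and adds the third, more distinct chunk. In LangChain the same trade-off is exposed through the lambda_mult and fetch_k entries of search_kwargs.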

The prompt template is critical. Structure it to clearly separate the retrieved context from the user's question. Instruct the model to only answer based on the provided context and to say 'I don't have enough information' when the context is insufficient. This reduces hallucinations dramatically.
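A prompt along those lines could be assembled as in the sketch below. The exact wording of the template is an assumption, and `build_prompt` is an illustrative helper; adapt both to your domain.

```python
PROMPT_TEMPLATE = """Answer the question using only the context below.
If the context is not sufficient, reply exactly: I don't have enough information.

Context:
{context}

Question: {question}
Answer:"""

def build_prompt(question, chunks):
    """Number each retrieved chunk and place it in a clearly delimited context block."""
    context = "\n\n".join(f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks, start=1))
    return PROMPT_TEMPLATE.format(context=context, question=question.strip())
```

Numbering the chunks also makes it possible to ask the model to cite which passage supports its answer.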

For the LLM, GPT-4o-mini at $0.15 per million input tokens offers excellent quality for RAG applications. Anthropic's Claude 3.5 Haiku is a strong alternative with a 200K context window, at about $0.80 per million input tokens (the $0.25 rate applies to the older Claude 3 Haiku). Both models follow retrieval-grounded instructions reliably.
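Putting this step together, an LCEL chain might be wired up as in the following sketch. It assumes langchain-core is installed and that `retriever` and `llm` are LangChain runnables you have already created; `build_rag_chain` and `format_docs` are illustrative names, and the prompt wording is an assumption.

```python
def format_docs(docs):
    """Join retrieved documents into one context string."""
    return "\n\n".join(doc.page_content for doc in docs)

def build_rag_chain(retriever, llm):
    """Wire retriever -> prompt -> LLM -> plain string using LCEL's pipe operator."""
    # Imported lazily so format_docs stays usable without LangChain installed.
    from langchain_core.output_parsers import StrOutputParser
    from langchain_core.prompts import ChatPromptTemplate
    from langchain_core.runnables import RunnablePassthrough

    prompt = ChatPromptTemplate.from_template(
        "Answer using only this context. If it is insufficient, say "
        "\"I don't have enough information.\"\n\n"
        "Context:\n{context}\n\nQuestion: {question}"
    )
    return (
        # The question is sent to the retriever and also passed through unchanged.
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )
```

A typical call would pass `vectorstore.as_retriever(search_kwargs={"k": 4})` and `ChatOpenAI(model="gpt-4o-mini", temperature=0)`, then run `chain.invoke("your question")`. Because the LLM is just one link in the pipe, swapping in a chat model from another provider leaves the rest of the chain untouched.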

Advanced Retrieval Techniques

Once your basic pipeline works, consider these enhancements that can boost answer quality by 20-40%:

  • Multi-query retrieval: Generate 3-5 variations of the user's question to capture different semantic angles
  • Contextual compression: Use an LLM to extract only the relevant sentences from each retrieved chunk
  • Hybrid search: Combine Pinecone's dense vector search with BM25 sparse retrieval for better keyword matching
  • Re-ranking: Add a cross-encoder model (like Cohere Rerank at $1 per 1000 queries) to re-score retrieved results
  • Parent document retrieval: Store small chunks for search but return the full parent document for context
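For hybrid search, the dense and sparse result lists have to be merged somehow. One common option, chosen here for illustration and not prescribed by LangChain or Pinecone, is reciprocal rank fusion (RRF). The chunk IDs are invented, and k=60 is the constant conventionally used with RRF.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of chunk IDs into one list.

    Each ID scores sum(1 / (k + rank)) over the lists it appears in (rank starts at 1),
    so chunks ranked well by several retrievers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)

dense_results = ["c1", "c2", "c3"]   # from vector search
sparse_results = ["c3", "c1", "c4"]  # from BM25 keyword search
fused = reciprocal_rank_fusion([dense_results, sparse_results])
```

RRF only needs ranks, not scores, which sidesteps the problem that cosine similarities and BM25 scores live on different scales.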

Step 5: Evaluate and Optimize Your Pipeline

Evaluation is non-negotiable for production RAG systems. Without measurement, you are guessing at quality. The RAGAS framework (Retrieval Augmented Generation Assessment) provides 4 key metrics.

Measure faithfulness (does the answer stick to the context?), answer relevancy (does it actually answer the question?), context precision (are the retrieved chunks relevant?), and context recall (did retrieval find all necessary information?).

Create a test set of 50-100 question-answer pairs that represent real user queries. Run these through your pipeline weekly and track metrics over time. A faithfulness score below 0.8 usually indicates prompt engineering issues, while low context precision points to chunking or embedding problems.
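If you label which chunks are relevant for each test question, the two retrieval metrics can be approximated without any LLM calls. The sketch below is a simplified, label-based stand-in and not the RAGAS implementation, which relies on LLM-based judgments; the function names and test-set shape are assumptions.

```python
def context_precision(retrieved_ids, relevant_ids):
    """Share of retrieved chunks that are labelled relevant."""
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for chunk_id in retrieved_ids if chunk_id in relevant_ids)
    return hits / len(retrieved_ids)

def context_recall(retrieved_ids, relevant_ids):
    """Share of labelled-relevant chunks that retrieval found."""
    if not relevant_ids:
        return 1.0
    hits = sum(1 for chunk_id in relevant_ids if chunk_id in retrieved_ids)
    return hits / len(relevant_ids)

def evaluate(test_set, retrieve):
    """Average both metrics over a test set of (question, relevant_ids) pairs."""
    if not test_set:
        raise ValueError("test_set must not be empty")
    precisions, recalls = [], []
    for question, relevant_ids in test_set:
        retrieved = retrieve(question)
        precisions.append(context_precision(retrieved, relevant_ids))
        recalls.append(context_recall(retrieved, relevant_ids))
    n = len(test_set)
    return {"context_precision": sum(precisions) / n, "context_recall": sum(recalls) / n}
```

Tracking these two numbers after every chunking or retriever change gives a quick signal before you spend tokens on a full LLM-judged evaluation run.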

What This Means for Development Teams

RAG democratizes AI-powered search for organizations of every size. A solo developer can build a production-quality knowledge assistant in a weekend. Enterprise teams can scale the same architecture to millions of documents with Pinecone's serverless infrastructure.

The total cost for a small deployment — 100,000 document chunks, 1,000 queries per day — runs approximately $15-25/month. That is a fraction of the $10,000+ cost for fine-tuning a custom model, and the data stays current without retraining.
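A back-of-the-envelope estimate for your own workload can be sketched as follows. The token counts per chunk and per query are assumptions, as is the $0.60 per million output-token price used in the example; only the $0.02 embedding price and the $0.15 input price come from the figures above.

```python
def embedding_cost(num_chunks, tokens_per_chunk, price_per_m=0.02):
    """One-time cost in dollars of embedding the corpus; price is per 1 million tokens."""
    return num_chunks * tokens_per_chunk / 1_000_000 * price_per_m

def monthly_llm_cost(queries_per_day, input_tokens_per_query, output_tokens_per_query,
                     input_price_per_m, output_price_per_m, days=30):
    """LLM spend per month in dollars; prices are per 1 million tokens."""
    queries = queries_per_day * days
    input_cost = queries * input_tokens_per_query / 1_000_000 * input_price_per_m
    output_cost = queries * output_tokens_per_query / 1_000_000 * output_price_per_m
    return input_cost + output_cost

# Assumed workload: 100,000 chunks of ~200 tokens; 1,000 queries/day with
# ~1,000 input tokens (question plus 4 chunks) and ~200 output tokens each.
one_time = embedding_cost(100_000, 200)
per_month = monthly_llm_cost(1_000, 1_000, 200, input_price_per_m=0.15, output_price_per_m=0.60)
```

Under these assumptions, embedding costs about $0.40 once and the LLM about $8.10 per month; vector storage, query-time embeddings, and any re-ranking come on top of that.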

Looking Ahead: Where RAG Is Heading in 2025

Agentic RAG is the next evolution. Instead of a simple retrieve-then-generate flow, AI agents will dynamically decide which data sources to query, when to perform multi-hop retrieval, and how to synthesize information from multiple vector stores.

LangChain's LangGraph framework already supports agentic RAG patterns. Pinecone's recent addition of inference endpoints — which bundle embedding generation directly into the database — eliminates the need for a separate embedding API call, reducing latency by 30-50%.

Expect RAG to become the standard backend for enterprise AI applications throughout 2025, replacing traditional search infrastructure across customer support, legal research, healthcare documentation, and financial analysis. The developers who master this architecture now will have a significant advantage as the market matures.

Start with the free tiers of both Pinecone and OpenAI. Build a simple prototype with 100 documents. Then iterate on chunking, retrieval, and prompting — that is where the real performance gains live.