AI Infrastructure Architecture

Vector Stores and RAG in n8n: Pinecone, Qdrant, and Document Pipelines

Published on Mar 26, 2026 · By Anshul Namdev

The Problem That RAG Solves

Every LLM has a knowledge cutoff date and a fixed context window. Ask GPT-4o about your company's internal product documentation, a PDF you uploaded last week, or data from your proprietary database and it simply cannot answer accurately — because it has never seen that information.

Retrieval-Augmented Generation (RAG) is the architectural pattern that solves this. Instead of stuffing your entire document into the prompt (which is expensive and breaks at scale), RAG works in two distinct phases:

  1. Ingestion: Documents are split into smaller chunks, converted into numerical vector embeddings, and stored in a Vector Store database.
  2. Retrieval: When a user asks a question, the question itself is embedded, and the Vector Store finds the most semantically similar document chunks. Only those relevant chunks are sent to the LLM as context.
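The two phases can be sketched in a few lines of plain Python. This is a toy illustration, not n8n code: a bag-of-words embedding stands in for a real model such as text-embedding-3-small, and a Python list stands in for the vector store, but the shape of the pipeline is the same.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

# Phase 1: ingestion -- split documents into chunks and embed them.
chunks = [
    "The refund policy allows returns within 30 days.",
    "Our API rate limit is 100 requests per minute.",
]
VOCAB = sorted({w for c in chunks for w in tokenize(c)})  # fixed at ingestion time

def embed(text):
    """Toy bag-of-words embedding; a real pipeline calls an embedding model."""
    counts = Counter(tokenize(text))
    return [float(counts[w]) for w in VOCAB]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

store = [(chunk, embed(chunk)) for chunk in chunks]  # the "vector store"

# Phase 2: retrieval -- embed the question, return the most similar chunk.
question = "How many days do customers have to return a product?"
q_vec = embed(question)
best_chunk, _ = max(store, key=lambda item: cosine(q_vec, item[1]))
print(best_chunk)  # only this chunk is sent to the LLM as context
```

Note that the question never has to share exact wording with the stored text for this to work at scale; real embedding models capture semantic similarity, which is what makes RAG more robust than keyword search.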

n8n has first-class nodes for both phases, connecting directly to the major Vector Store providers and embedding models without requiring any custom code.

Vector Store Options in n8n

| Provider | Hosting | Free Tier | n8n Node | Best For |
| --- | --- | --- | --- | --- |
| Pinecone | Managed Cloud | Yes | Native | Fastest production setup |
| Qdrant | Self-host / Cloud | Yes | Native | Self-hosted + data privacy |
| Supabase (pgvector) | Managed Cloud | Yes | Native | Already using Supabase |
| Weaviate | Self-host / Cloud | Yes | Native | Multi-modal data |
| Chroma | Self-host | Free (OSS) | Native | Local dev prototyping |
| In-Memory Store | n8n Built-in | Free | Native | Quick testing, no infra |

Full Workflow: Ingestion and Retrieval

The workflow below shows the complete dual-pipeline architecture: the ingestion sub-graph on the left (Manual Trigger feeding Pinecone in insert mode with document loader and embeddings sub-nodes) and the retrieval sub-graph on the right (Chat Trigger feeding an AI Agent with OpenRouter, Simple Memory, and Pinecone in retrieve-as-tool mode).


Here is exactly what each node does in the ingestion chain:

  • Manual Trigger: Starts the ingestion run on demand. In a production setup, this would typically be replaced with a Schedule Trigger (to re-sync nightly) or a Webhook Trigger (to react to new document uploads).
  • Default Data Loader: Receives your raw data from the previous node (a file, a text string, or binary data) and splits it into chunks. The chunking strategy (chunk size, overlap) is configurable in the node's options. This node connects to Pinecone via the ai_document sub-connection type.
  • Embeddings OpenAI: Takes each chunk and calls the OpenAI Embeddings API (text-embedding-3-small by default) to produce a high-dimensional numerical vector for each piece of text. This connects via the ai_embedding sub-connection type. You can swap this for any other embedding model node in n8n.
  • Pinecone Vector Store (insert mode): The orchestrator. It receives the chunks from the Data Loader and their embeddings from the Embeddings node, then writes every chunk and its corresponding vector to your Pinecone index.
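The Data Loader's chunking step can be pictured as a sliding window over the text. Below is a minimal character-based sketch; n8n's actual splitters (such as the recursive character text splitter) prefer to break on separators like paragraphs and sentences, so treat this as an approximation of the chunk size and overlap options, not their exact behavior:

```python
def chunk_text(text, chunk_size=200, chunk_overlap=20):
    """Split text into overlapping windows. chunk_size and chunk_overlap
    mirror the two options exposed on the Default Data Loader."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = "x" * 500
chunks = chunk_text(doc)
# Each chunk begins chunk_overlap characters before the previous one ended,
# so text near a boundary always appears with some surrounding context.
print(len(chunks), len(chunks[0]), len(chunks[-1]))  # → 3 200 140
```

The overlap is what prevents a sentence that straddles a chunk boundary from being split into two fragments that are each meaningless on their own.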

Setting Up Pinecone

Before connecting the n8n node, your Pinecone environment needs to be provisioned correctly. The critical configuration decision is the index dimension, which must match the output dimension of your embedding model exactly.

  • OpenAI text-embedding-3-small: 1536 dimensions
  • OpenAI text-embedding-3-large: 3072 dimensions
  • Cohere embed-english-v3.0: 1024 dimensions

A dimension mismatch will cause the Pinecone node to throw an error at insert time. Create the index in the Pinecone console, select cosine as the distance metric for text similarity tasks, and copy the API key into n8n's Pinecone credential. That is the entire infrastructure setup required on the free tier.
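The mismatch is easy to guard against before it reaches Pinecone. A small sanity-check sketch (the model-to-dimension table is just the list above in code form; the function name is illustrative):

```python
# Output dimensions for the embedding models listed above.
EMBEDDING_DIMS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "embed-english-v3.0": 1024,
}

def check_dimensions(model: str, index_dim: int) -> None:
    """Fail fast if the index dimension does not match the embedding
    model's output dimension, instead of erroring at insert time."""
    expected = EMBEDDING_DIMS[model]
    if expected != index_dim:
        raise ValueError(
            f"{model} emits {expected}-dim vectors but the index "
            f"expects {index_dim}; recreate the index or switch models"
        )

check_dimensions("text-embedding-3-small", 1536)  # OK: dimensions match
```

The same check applies to Qdrant and every other store discussed below; the dimension is a property of the embedding model, not of the database.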

Part 2: Retrieval with an AI Agent

Once documents are stored, the retrieval pipeline is a live chatbot workflow. A user sends a message, the AI Agent decides it needs document context, queries Pinecone, reads the returned chunks, and generates a grounded response.

The key architectural decision in this workflow is the mode: retrieve-as-tool setting on the Pinecone node. This tells n8n to expose the vector store as a callable tool to the AI Agent, rather than pulling documents automatically. The agent itself decides when context is needed and calls the tool with a query string derived from the user's question.

  • Chat Trigger: Exposes a live chat interface (or webhook endpoint) that accepts user messages. Each new message triggers a fresh agent execution.
  • OpenRouter Chat Model: The LLM powering the agent's reasoning. OpenRouter is used here, giving you the flexibility to swap between GPT-4o, Claude, or any open-weight model without changing the agent architecture.
  • Simple Memory: A sliding window buffer that keeps the last N conversation turns in context so the agent maintains conversational continuity across messages.
  • Pinecone Vector Store (retrieve-as-tool): When the agent determines it needs document knowledge, it calls this tool with a natural language query. The node embeds the query, searches Pinecone for the top-K most similar chunks, and returns the results to the agent as text context.
  • Embeddings OpenAI (shared): Notice that the same Embeddings node is shared between both the ingestion and the retrieval workflows via the ai_embedding connection on the retrieval Pinecone node. This is by design: the same embedding model must be used at query time as was used at ingestion time. Switching models requires re-indexing all documents.
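Conceptually, retrieve-as-tool wraps the vector search in a function the agent may choose to call. A stripped-down sketch, with hand-made two-dimensional vectors standing in for real embeddings (the retrieve_tool name is illustrative; top-K is the real knob on the node):

```python
import math

def cosine(a, b):
    norm = lambda v: math.sqrt(sum(x * x for x in v))
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm(a) * norm(b))

def retrieve_tool(query_vec, store, top_k=2):
    """What the node does when the agent calls it: rank stored chunks by
    similarity to the (already embedded) query and return the top-K
    joined as plain text for the agent to read."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return "\n\n".join(chunk for chunk, _ in ranked[:top_k])

store = [
    ("Pricing starts at $20 per seat per month.", [1.0, 0.0]),
    ("Refunds are issued within 30 days.", [0.0, 1.0]),
    ("Pricing and refunds are handled by billing.", [0.7, 0.7]),
]
context = retrieve_tool([1.0, 0.1], store)  # agent asked about pricing
print(context)
```

Because the store is exposed as a tool, the agent skips this call entirely for messages like "thanks!" that need no document context, which is exactly the token saving discussed in the mistakes section below.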

Quick Setup: Qdrant as a Self-Hosted Alternative

If you are running a self-hosted n8n deployment and want to keep data entirely within your own infrastructure, Qdrant is the recommended alternative to Pinecone. It runs as a single Docker container alongside your n8n instance.

  1. Add Qdrant to your docker-compose.yml: image: qdrant/qdrant, exposing port 6333.
  2. Create a collection via the Qdrant REST API or the built-in dashboard at http://localhost:6333/dashboard. Set the vector size to match your embedding model (1536 for OpenAI's text-embedding-3-small).
  3. In n8n, add a Qdrant credential pointing to http://qdrant:6333 (using the Docker service name). No API key is required for local deployments.
  4. Replace the Pinecone Vector Store nodes in both workflows with Qdrant Vector Store nodes and select your collection. All other nodes remain identical.
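Step 1 amounts to a few lines of compose configuration; the service and volume names here are illustrative and should be adapted to your existing stack:

```yaml
# docker-compose.yml excerpt -- runs Qdrant alongside n8n
services:
  qdrant:
    image: qdrant/qdrant
    ports:
      - "6333:6333"
    volumes:
      - qdrant_storage:/qdrant/storage  # persist vectors across restarts

volumes:
  qdrant_storage:
```

With the container running, step 2 is a single REST call following Qdrant's collections API, e.g. `curl -X PUT http://localhost:6333/collections/docs -H 'Content-Type: application/json' -d '{"vectors": {"size": 1536, "distance": "Cosine"}}'` (the collection name `docs` is an example).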

The entire RAG architecture is provider-agnostic at the n8n level. Swapping Pinecone for Qdrant, Weaviate, or Supabase requires only changing the vector store node — the embedding model, data loader, and AI agent chain stay exactly the same.

What Goes Wrong: Common RAG Mistakes in n8n

  • Mismatched embedding models: Ingesting with OpenAI embeddings and querying with a different model produces garbage results. The vector space is model-specific. Re-index everything if you change the embedding model.
  • Chunk size too large: With a chunk size of 2000 tokens, each retrieved chunk consumes the majority of the agent's usable context. Use 256 to 512 tokens with a 10 to 20 token overlap as a starting point, and tune empirically.
  • Not using retrieve-as-tool mode: Connecting Pinecone directly to the agent's input (rather than as a tool) bypasses the agent's reasoning. The agent then always retrieves context regardless of whether it is actually needed, wasting tokens.
  • Ingesting binary files without extracting text first: The Default Data Loader handles plain text. For PDFs, Word documents, or HTML, use the Extract From File node upstream to convert to text before the data loader step.

The workflow JSON for both pipelines shown here is based on a real n8n export and can be imported directly into the n8n editor. Replace the Pinecone index name, OpenAI credentials, and OpenRouter credentials with your own before executing.