The model you pick matters less than who serves it. Compare pricing, speed, free tiers, and n8n integration quality across every major inference provider.
Most people building AI workflows in n8n obsess over which model to use. But the question that determines your actual costs, latency, and reliability is different: who is serving that model?
An AI inference provider is the cloud infrastructure between your n8n workflow and the model weights. They handle GPU allocation, request queuing, batching, and rate limits. The same Llama 3.1 70B model can cost $0.13/M tokens on one provider and $0.88/M on another. It can respond in 200ms on one and 2 seconds on another.
Why not just use OpenAI directly? You can. But you are locked into OpenAI's models, OpenAI's pricing, and OpenAI's outages. Inference providers give you access to hundreds of models (open-weight and proprietary) through a single API key, often at significantly lower prices. When OpenAI goes down, your workflows keep running on a different provider.
Why not just use the model lab's own API? Because most model labs (Meta, Mistral, Google) either do not offer direct API access, charge premium rates, or have limited infrastructure. Inference providers specialize in serving models at scale, often faster and cheaper than the labs themselves.
Prerequisite: This article assumes you have chosen your model or model family. If not, start with How to Choose the Right AI Model for n8n first, then come back here to pick the best provider for that model.
Prices are per million tokens as of mid-2026. Input and output tokens are priced separately by most providers. "Free tier" describes what you get without paying anything.
| Provider | Free Tier | Llama 3.1 70B (Input) | Llama 3.1 70B (Output) | GPT-4o Access | Claude Access | Speed |
|---|---|---|---|---|---|---|
| OpenRouter | $5 credit | $0.13 | $0.40 | Yes (routed) | Yes (routed) | Fast |
| Groq | Generous RPM free | $0.59 | $0.79 | No | No | Fastest |
| Together AI | $25 credit | $0.88 | $0.88 | No | No | Fast |
| Fireworks AI | $1 credit | $0.90 | $0.90 | No | No | Very Fast |
| Mistral (Direct) | Free tier (rate-limited) | N/A | N/A | No | No | Fast |
| OpenAI (Direct) | None | N/A | N/A | Yes ($2.50/$10) | No | Fast |
| Hugging Face | Free serverless | Varies | Varies | No | No | Slow (free) |
| Ollama (Local) | Free forever | $0 | $0 | No | No | Hardware-bound |
Reading these prices: $0.13/M input tokens means processing 1 million input tokens costs $0.13. A typical n8n workflow step processes 500-2,000 tokens. At $0.13/M, processing 10,000 workflow executions with ~1,000 tokens each costs roughly $0.0013, or about one-tenth of a cent. AI inference in n8n is extremely cheap at scale.
If you only set up one inference provider in n8n, make it OpenRouter. It is not a model lab. It is a unified routing layer that sits on top of virtually every major AI provider. A single API key gives you access to OpenAI, Anthropic, Google Gemini, Mistral, DeepSeek, Qwen, Meta Llama, Cohere, and hundreds more.
n8n integration: OpenRouter exposes a fully OpenAI-compatible API endpoint. In n8n's AI Agent node, select "OpenAI" as the credential type, set the base URL to https://openrouter.ai/api/v1, and paste your OpenRouter API key. No custom HTTP Request nodes needed. It works natively with all AI nodes.
Price routing: OpenRouter lets you route to the cheapest available provider for any given model at real-time market prices. You can run DeepSeek R1 at a fraction of the direct API cost, or have OpenRouter automatically pick the cheapest provider serving your chosen model.
Automatic fallback: If your primary model's provider goes down mid-workflow, OpenRouter can transparently retry on a backup provider. This is resilience you would otherwise have to build yourself with n8n If-nodes and error branches.
Free tier: New accounts receive $5 in credits automatically. That is enough for hundreds of test executions across multiple models. No credit card required to start.
When NOT to use OpenRouter: When you need the absolute lowest latency (OpenRouter adds ~50ms routing overhead), when you need guaranteed GDPR-compliant EU data processing (use Mistral Direct), or when you want fully local inference (use Ollama).
Groq operates custom-built LPU (Language Processing Unit) hardware designed specifically for inference. The result is genuinely fast token generation: 500-800 tokens per second for models like Llama 3 and Mixtral. For context, standard GPU inference from OpenAI generates roughly 60-80 tokens per second.
Best use in n8n: High-volume pipeline steps where speed matters more than model variety. Running 1,000 records through sentiment analysis, entity extraction, or text classification? Groq processes the batch in a fraction of the time any other provider takes.
Free tier: Groq provides a genuinely generous free tier measured in Requests Per Minute (RPM) and daily token limits, not a credit system that expires. For non-commercial workflows or testing, you can run entire AI pipelines at zero cost indefinitely.
n8n integration: Groq exposes an OpenAI-compatible API. In n8n, use the OpenAI credential with base URL https://api.groq.com/openai/v1. Works natively with AI Agent, Chat Model, and LLM Chain nodes.
| Model | Input $/M | Output $/M | Context | Speed (tok/s) |
|---|---|---|---|---|
| Llama 3.1 70B | $0.59 | $0.79 | 128K | ~250 |
| Llama 3.1 8B | $0.05 | $0.08 | 128K | ~750 |
| Mixtral 8x7B | $0.24 | $0.24 | 32K | ~500 |
| Gemma 2 9B | $0.20 | $0.20 | 8K | ~600 |
Limitation: Groq's model catalog is intentionally narrow: primarily open-weight models like Llama 3, Mixtral, and Gemma. You will not find Claude, GPT-4o, or Gemini here. If your workflow requires proprietary models, Groq is not your answer.
Together AI has built one of the broadest open-weight model serving platforms available. If you need access to fine-tuned, specialized, or obscure open-source models, Together is the definitive answer.
Free tier: New accounts receive $25 in credit, the largest of any provider on this list. That is enough to genuinely stress-test production workflows before committing to paid usage.
Model depth: Together serves over 100 models including Llama 3.1 405B, Qwen 2.5, Mistral variants, DeepSeek, and specialized research models. For anyone building RAG pipelines or domain-specific agents, this breadth matters.
Pricing advantage: Together aggressively undercuts proprietary API costs. Running Llama 3.1 70B via Together is 5-10x cheaper per token compared to an equivalent proprietary model from OpenAI or Anthropic. The tradeoff is slightly higher latency than Groq.
n8n integration: OpenAI-compatible API. Base URL: https://api.together.xyz/v1. Works with all n8n AI nodes using the OpenAI credential type.
Custom fine-tuning: Together offers fine-tuning as a service. Upload your training data, fine-tune a base model, and serve it through the same API endpoint. This is powerful for n8n workflows that need domain-specific accuracy (medical, legal, financial text processing).
Mistral AI (Direct) is a French AI lab with outstanding open and proprietary models. Accessing their API directly at api.mistral.ai gives you their premium models: Mistral Large 2 for complex reasoning and Mistral Nemo for ultra-fast inference. For European businesses, Mistral processes data within EU infrastructure, making it the strongest GDPR-compliance choice. n8n has a dedicated Mistral AI credential type requiring no workarounds.
| Mistral Model | Input $/M | Output $/M | Context | Best For |
|---|---|---|---|---|
| Mistral Large 2 | $2.00 | $6.00 | 128K | Complex reasoning, coding |
| Mistral Nemo | $0.15 | $0.15 | 128K | Fast classification, extraction |
| Mistral Small | $0.10 | $0.30 | 32K | Cost-effective general use |
Fireworks AI focuses on production-grade inference with ultra-low latency. Their standout feature is optimized JSON mode and structured output for open-weight models. For n8n AI Agents that rely on tool calling from non-OpenAI models, Fireworks reduces schema hallucination errors more reliably than most alternatives. Pricing is comparable to Together AI, with a small $1 free credit to start.
Hugging Face Inference API hosts the world's largest repository of open-source models (900,000+ checkpoints). Use it when you need a highly specialized or domain-specific model not available anywhere else: medical NLP, legal document parsing, specific-language models. The free serverless tier throttles heavily during peak demand. For production, Dedicated Inference Endpoints are priced per GPU-hour and can get expensive.
Ollama is not a cloud provider. It is a local runtime that downloads and runs open-weight models directly on your hardware. For self-hosted n8n deployments, the Ollama combination is the ultimate zero-cost, zero-data-leakage AI stack.
Privacy: Zero data leaves your infrastructure. Every token is generated on your hardware. For workflows processing sensitive data (financial records, PII, proprietary IP), Ollama is the only architecturally sound option.
n8n integration: n8n has a dedicated Ollama node in the AI section. Point the credential to your Ollama server address (default: http://localhost:11434). The integration is first-class and works natively with AI Agent orchestration.
Cost: $0/month for inference. Your only cost is the hardware. A used RTX 3060 12GB (~$200) runs Llama 3.1 8B at ~30 tokens/second. An RTX 4090 24GB (~$1,500) runs Llama 3.1 70B (quantized) at reasonable speeds.
| Hardware | VRAM | Max Model Size | Speed (Llama 8B) | Approx. Cost |
|---|---|---|---|---|
| CPU Only (16 GB RAM) | N/A | 7B (Q4) | ~5 tok/s | $0 |
| RTX 3060 12GB | 12 GB | 13B (Q4) | ~30 tok/s | ~$200 |
| RTX 4090 24GB | 24 GB | 70B (Q4) | ~45 tok/s | ~$1,500 |
| Mac M2 Ultra 192GB | 192 GB unified | 405B (Q4) | ~20 tok/s | ~$4,000+ |
Limitation: Performance is entirely bound by your local hardware. Without a dedicated GPU, inference will be 10-100x slower than any cloud provider. Ollama also does not support proprietary models like GPT-4o or Claude.
Rather than picking one provider and hoping for the best, apply this tiered approach to your n8n AI stack:
One key, all models, price routing, automatic fallback. Set up once, switch models in seconds. Best starting point for any n8n AI setup.
Route high-frequency nodes (sentiment, classification, extraction) to Groq for 5-10x faster processing. Free tier covers most testing.
For workflows touching personal data, financial records, or proprietary business data. Zero data leaves your infrastructure.
EU-hosted infrastructure with GDPR compliance. Native n8n credential. Best for European businesses processing regulated data.
When you need domain-specific fine-tuned models not available commercially. Together for serving, HF for discovery.
Best-in-class structured output from open-weight models. Reduces JSON hallucination in n8n AI Agent tool calls.
The recommended architecture: Use OpenRouter as your primary API key in n8n. This gives you instant access to all providers through one integration point. Add provider-specific credentials (Groq, Mistral, Ollama) only where you need guaranteed access to features not routed via OpenRouter.
Yes. Each AI node in n8n can use a different credential. You could route a classification step through Groq for speed, a complex reasoning step through OpenRouter (GPT-4o), and an embedding step through Ollama for privacy. Each node is independent.
The n8n execution will fail at that node. To handle this, use n8n's error handling: add an Error Trigger node or configure the AI node's "On Error" setting to continue with a fallback. OpenRouter handles this automatically with its built-in failover routing.
OpenRouter does add a small margin on top of the underlying provider's price, but it actively routes to the cheapest provider for each model. In practice, OpenRouter's price for many models is lower than going direct to the model lab because they negotiate volume discounts. The routing, fallback, and multi-provider access justify any markup.
Very little. A typical n8n AI node processes 500-2,000 tokens per execution. At $0.13/M input tokens (OpenRouter Llama 70B), running 10,000 executions costs about $0.65. Even heavy usage (100,000 executions/month with GPT-4o at $2.50/M) costs roughly $25/month. The VPS hosting cost almost always exceeds the inference cost.
Only if your VPS has a GPU (rare and expensive) or you are willing to accept very slow CPU-only inference. Most VPS providers offer CPU-only servers. For practical local inference, run Ollama on a separate machine with a dedicated GPU and connect n8n to it over your local network.
Together AI ($25 credit) for the most generous initial credit. Groq for ongoing free usage (no credit expiry, RPM-based limits). Oracle Cloud + Ollama for fully free, unlimited local inference if you can provision the ARM instance.
OpenRouter, Together AI, and Fireworks all serve embedding models. OpenRouter also routes to OpenAI's DALL-E and other image generation APIs. For embeddings in n8n RAG pipelines, Together AI's embedding models are the cheapest option. Ollama supports local embedding models like nomic-embed-text.