MCP + LiteLLM stack: why we use this pattern ourselves
Model Context Protocol + LiteLLM proxy = a two-layer exit strategy for vendor lock-in in agentic AI. Architecture, cost model, and the reasoning for why it fits clients with sovereignty concerns.
Agentic AI stacks in 2024-2026 have a complex lock-in pattern (see the post Vendor lock-in 4 layer di era agentic AI). The two most load-bearing layers, the foundation model (Layer 1) and the runtime environment (Layer 3), have a mature solution pattern: MCP (Model Context Protocol) plus a LiteLLM proxy.
This article is a concrete deep dive into why we use this pattern ourselves in Capital Commerce's internal operations, and when it fits clients.
What is MCP
Model Context Protocol is an open standard that Anthropic released in November 2024. It standardizes how LLMs talk to external tools (database queries, API calls, the file system, browsing) through an abstraction that different model vendors all understand.
The problem MCP solves:
- Before MCP: every LLM provider had its own proprietary tool-calling format
  - Anthropic: tool use format with tool_use content blocks
  - OpenAI: function calling with a JSON Schema spec
  - Google Gemini: function declarations with different field names
  - Every provider switch = rewriting the tool integration
- After MCP: tools are defined in an MCP-compliant server, and LLM providers adapt to the MCP standard
  - Tool integration is written once and used across providers
  - Switching model provider = a config change, not a codebase rewrite
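To make the "written once, used across providers" point concrete, here is a minimal sketch of one tool definition being reshaped into two provider formats. The tool (query_leads) and the helper names are hypothetical; the field names (inputSchema, input_schema, function.parameters) follow the public formats of MCP tool listings, Anthropic tool use, and OpenAI chat-completions function calling respectively.

```python
from typing import Any

# One tool, described once in the shape MCP servers use when listing tools
# (name, description, inputSchema as JSON Schema).
# The tool itself ("query_leads") is a hypothetical example.
MCP_TOOL: dict[str, Any] = {
    "name": "query_leads",
    "description": "Return the most recent leads from the CRM database.",
    "inputSchema": {
        "type": "object",
        "properties": {"limit": {"type": "integer", "minimum": 1}},
        "required": ["limit"],
    },
}


def to_anthropic(tool: dict[str, Any]) -> dict[str, Any]:
    """Anthropic tool use expects the JSON Schema under `input_schema`."""
    return {
        "name": tool["name"],
        "description": tool["description"],
        "input_schema": tool["inputSchema"],
    }


def to_openai(tool: dict[str, Any]) -> dict[str, Any]:
    """OpenAI function calling nests the schema under `function.parameters`."""
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool["description"],
            "parameters": tool["inputSchema"],
        },
    }
```

Without a shared standard, each of these adapters lives in application code and has to be maintained per provider; with MCP the definition stays in the server and the adaptation moves to the client or provider side.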
Who supports MCP as of 2026:
- Claude (Anthropic): native, first-class
- OpenAI GPT: adapter library available
- Google Gemini: adapter library available
- Open-source models via vLLM / Ollama: community adapters
Adoption is not yet universal, but the trajectory is clear: we expect this standard to become the default by 2027-2028.
What is LiteLLM
LiteLLM is an open-source library plus proxy server that abstracts multiple LLM providers behind one OpenAI-compatible API endpoint.
The problem LiteLLM solves:
- Before LiteLLM: your codebase calls the Anthropic SDK directly. Switching to OpenAI means a rewrite at every call site. Adding a fallback model means manual try/except logic per call.
- After LiteLLM: your codebase calls the LiteLLM endpoint. Provider routing, fallback, load balancing, and the budget guard are handled by the proxy. Switching or adding a provider is a config update on the proxy; the codebase does not change.
The load-bearing LiteLLM features:
- Multi-provider routing: 1 endpoint, 100+ providers supported (Anthropic, OpenAI, Google, Cohere, AWS Bedrock, Azure OpenAI, self-hosted Ollama, and so on)
- Fallback chain: primary provider down → auto-route to secondary → tertiary
- Cost tracking: token usage tracked per model and per team
- Budget guard: hard cap on monthly spend per model/team
- Caching: request deduplication for identical prompt + params (saves cost on batch jobs)
- Streaming + non-streaming: both modes supported transparently
- Tool-calling abstraction: translates formats between providers (MCP integration is a work in progress)
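The fallback chain in the list above can be pictured as a loop over providers in priority order. The following is a minimal sketch of that idea in plain Python, not LiteLLM's actual implementation; the provider functions are hypothetical stand-ins for real API calls.

```python
from typing import Callable, Sequence

Provider = Callable[[str], str]


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the chain has failed."""


def call_with_fallback(
    chain: Sequence[tuple[str, Provider]], prompt: str
) -> tuple[str, str]:
    """Try providers in order; return (provider_name, response) from the first success."""
    errors: list[str] = []
    for name, provider in chain:
        try:
            return name, provider(prompt)
        except Exception as exc:  # a real proxy would only fall back on retryable errors
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Hypothetical stand-ins for real provider calls.
def primary_down(prompt: str) -> str:
    raise ConnectionError("primary provider unavailable")


def secondary_ok(prompt: str) -> str:
    return f"echo: {prompt}"


if __name__ == "__main__":
    used, answer = call_with_fallback(
        [("anthropic", primary_down), ("openrouter", secondary_ok)], "ping"
    )
    print(used, answer)
```

The point of running this logic in a proxy rather than in application code is that the chain is defined once, in config, instead of being repeated at every call site.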
The stack architecture we use
┌────────────────────────────────────────────────────────────┐
│ Application code (Python / Next.js / n8n)                  │
│  - calls LiteLLM endpoint via OpenAI-compatible SDK        │
└─────────────────────────────┬──────────────────────────────┘
                              │
                              ▼
┌────────────────────────────────────────────────────────────┐
│ LiteLLM proxy 127.0.0.1:4000 (self-hosted, systemd unit)   │
│  - Model groups with fallback chains                       │
│  - Cost tracking + budget guard                            │
│  - Cache (Redis if needed)                                 │
└─────┬───────────────┬───────────────┬───────────────┬──────┘
      │               │               │               │
      ▼               ▼               ▼               ▼
┌────────────┐  ┌────────────┐  ┌────────────┐  ┌────────────┐
│ Anthropic  │  │ OpenRouter │  │   Local    │  │   Future   │
│   direct   │  │ (fallback) │  │   Ollama   │  │  provider  │
└────────────┘  └────────────┘  └────────────┘  └────────────┘
Plus an MCP layer for tool calling:
LLM (via LiteLLM) ←──MCP protocol──→ MCP servers
                                     │
                                     ├── Postgres MCP server
                                     ├── n8n MCP server
                                     ├── File system MCP server
                                     ├── HTTP fetch MCP server
                                     └── Custom domain MCP servers
The model groups we define
From [[System Architecture/scripts/litellm/config.yaml]] (vault-tracked):
model_list:
  # Tier 1: high reasoning, complex tasks
  - model_name: capcom-tier1
    litellm_params:
      model: claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY
    fallback_models:
      - model: openrouter/anthropic/claude-sonnet-4-6
      - model: openrouter/openai/gpt-4o
  # Tier 2: fast and low-cost, for classification + drafts
  - model_name: capcom-tier2
    litellm_params:
      model: claude-haiku-4-5
    fallback_models:
      - model: openrouter/anthropic/claude-haiku-4-5
      - model: openrouter/openai/gpt-4o-mini
  # Tier 3: experiments, small scope, low budget
  - model_name: capcom-tier3-local
    litellm_params:
      model: ollama/qwen2.5:7b  # self-hosted when offline operation is needed
Each workflow in n8n or a Python service calls by model_name (e.g. capcom-tier1). LiteLLM routing handles the provider and the fallback.
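As an illustration of calling by model_name, the sketch below builds an OpenAI-style chat completion request against the proxy using only the standard library. It assumes the proxy listens on 127.0.0.1:4000 as described in the architecture section; the API key value and the helper name are placeholders, not taken from our config.

```python
import json
import urllib.request

# Assumptions: the proxy address matches the architecture section, and
# "sk-local-example" is a placeholder for a key issued by the proxy.
PROXY_URL = "http://127.0.0.1:4000/v1/chat/completions"


def build_chat_request(model_group: str, prompt: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = {
        "model": model_group,  # e.g. "capcom-tier1"; the proxy resolves the provider
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        PROXY_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


req = build_chat_request("capcom-tier1", "Classify this lead", "sk-local-example")
# Sending it would be: urllib.request.urlopen(req, timeout=30)
```

Note that nothing in the request names a vendor: the only provider-specific knowledge lives in the proxy config.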
Concrete cost model
Savings patterns we have observed in past engagements and internal ops:
Without LiteLLM (direct to provider)
The client calls Anthropic directly from the codebase. Total spend per month: Rp 2-5 million.
During a 4-hour Anthropic outage (rare, but it happened in Q1 2025), workflows get stuck. Recovery is manual.
With the LiteLLM proxy
The client calls LiteLLM. Routing config: Anthropic primary + OpenRouter fallback.
During an Anthropic outage, the fallback auto-routes to OpenRouter (to the same Claude model or an alternative). Workflows continue with roughly +200ms latency (proxy + fallback overhead).
LiteLLM proxy overhead:
- Self-hosted VPS resources: ~$5/month (Hetzner CX21, or co-tenant on an existing droplet)
- Latency: +50-200ms per call (negligible for B2B workflows)
- Maintenance: ~0.5-1 manday/month (config updates, handling model deprecations)
Net benefit: resilient uptime, optional provider switching, and a hard-enforced budget guard.
Savings from cost tracking
LiteLLM cost tracking surfaces real spend per model and per workflow. The savings patterns that emerge:
- Identifying workflows that should be tier 2 (Haiku) but mistakenly call tier 1 (Sonnet): saves 70-80% per call
- Identifying inefficient prompts (verbose system messages, redundant context): refactoring saves 30-50% of tokens
- The budget guard prevents a cost spike when a bug loops on calls (the incident is caught before the bill grows)
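The arithmetic behind tier misrouting and the budget guard can be sketched in a few lines. This is an illustrative model, not LiteLLM's implementation: the class and the per-token prices are placeholders chosen for round numbers, so substitute the current price sheet before drawing conclusions.

```python
from collections import defaultdict
from dataclasses import dataclass, field


class BudgetExceeded(RuntimeError):
    """Raised when a call would push spend past the monthly cap."""


@dataclass
class SpendTracker:
    monthly_cap_usd: float
    # USD per 1M tokens as (input, output). Placeholder numbers, not real prices.
    prices: dict[str, tuple[float, float]]
    spend_by_workflow: dict[str, float] = field(
        default_factory=lambda: defaultdict(float)
    )

    def cost(self, model: str, tokens_in: int, tokens_out: int) -> float:
        price_in, price_out = self.prices[model]
        return (tokens_in * price_in + tokens_out * price_out) / 1_000_000

    @property
    def total(self) -> float:
        return sum(self.spend_by_workflow.values())

    def record(self, workflow: str, model: str, tokens_in: int, tokens_out: int) -> float:
        """Record one call, refusing it if the cap would be exceeded."""
        call_cost = self.cost(model, tokens_in, tokens_out)
        if self.total + call_cost > self.monthly_cap_usd:
            raise BudgetExceeded(f"{workflow}: cap of ${self.monthly_cap_usd} reached")
        self.spend_by_workflow[workflow] += call_cost
        return call_cost


tracker = SpendTracker(
    monthly_cap_usd=1.0,
    prices={"tier1": (4.0, 20.0), "tier2": (1.0, 5.0)},
)
tier1_cost = tracker.cost("tier1", 10_000, 2_000)
tier2_cost = tracker.cost("tier2", 10_000, 2_000)
saving = 1 - tier2_cost / tier1_cost  # share saved by routing this call to tier 2
```

With these placeholder prices, moving the same call from tier 1 to tier 2 saves about 75%, and a runaway loop is stopped by the exception in record() rather than discovered on the invoice.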
MCP tool integration pattern
For clients with ≥3 specialist agents that call tools, the MCP layer is worth the investment:
Basic pattern
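# Application code (Python)
from anthropic import Anthropic

# Point the SDK at the LiteLLM proxy instead of the Anthropic API
client = Anthropic(base_url="http://127.0.0.1:4000")

response = client.messages.create(
    model="capcom-tier1",  # resolved by the LiteLLM model group
    max_tokens=1024,  # required by the Messages API
    # Illustrative sketch: mcp_servers support, field names, and transports
    # depend on the SDK and proxy versions in use
    mcp_servers=[
        {"name": "postgres", "url": "ws://localhost:8001"},
        {"name": "n8n", "url": "ws://localhost:8002"},
        {"name": "files", "url": "stdio:///opt/mcp-servers/files"},
    ],
    messages=[
        {"role": "user", "content": "Query the last 10 leads in Postgres and summarize them"}
    ],
)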
The LLM's tool calls are auto-routed to the relevant MCP server. The server responds to the LLM, and the LLM composes the final answer.
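Under the hood, that round trip is a JSON-RPC 2.0 exchange: MCP clients invoke a tool with the tools/call method and receive a result whose content is a list of typed parts. The sketch below shows the message shapes only; the tool names and the tool-to-server mapping are hypothetical, and transport, initialization, and error handling are left out.

```python
import itertools
import json
from typing import Any

_ids = itertools.count(1)

# Hypothetical mapping from tool name to the MCP server that exposes it.
TOOL_TO_SERVER = {"query": "postgres", "run_workflow": "n8n", "read_file": "files"}


def build_tool_call(tool: str, arguments: dict[str, Any]) -> tuple[str, str]:
    """Return (server_name, JSON-RPC request body) for one tool invocation."""
    server = TOOL_TO_SERVER[tool]
    message = {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return server, json.dumps(message)


def extract_text(response_body: str) -> str:
    """Join the text parts of a tools/call result so they can be fed back to the LLM."""
    result = json.loads(response_body)["result"]
    return "\n".join(p["text"] for p in result["content"] if p["type"] == "text")
```

Because every server speaks the same message shape, adding a new tool means registering another server, not teaching the application a new calling convention.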
Switching providers
When Anthropic deprecates a model or costs shift significantly, switch:
# config.yaml: a single change
- model_name: capcom-tier1
  litellm_params:
    model: openrouter/anthropic/claude-sonnet-5  # bumped to next-gen
The codebase does not change. The MCP tool integration does not change. Workflows continue.
When this pattern fits
Clients with:
- A medium-to-large agentic AI stack (10-50+ manday investment)
- Long-term vendor lock-in concerns (3-5 year horizon)
- LLM spend above $200/month (the break-even point for LiteLLM maintenance overhead)
- A technical team with at least 1 person familiar with Python + Linux + Docker
- A plan to keep a multi-provider exit option as a governance posture
When it does NOT fit
- Small scope (< 10 manday LLM workflow): over-engineering
- A single-provider posture is fine: clients who are explicitly OK with lock-in to 1 vendor
- Limited technical team: maintenance overhead exceeds capacity
- Real-time consumer-facing chat: proxy latency overhead is unacceptable
I say this candidly in the initial consultation: a multi-layer abstraction pattern is not free.
Honest trade-offs
Cost of LiteLLM:
- Setup time: 0.5-1 manday upfront
- Maintenance: 0.5-1 manday/month ongoing
- Latency overhead: 50-200ms per call
- Operational complexity: 1 more service to monitor and back up
Cost of MCP:
- Setup time: 1-3 mandays upfront (building a custom MCP server when no community server exists for the tool)
- Maintenance: 0.5-1 manday/month ongoing
- Tool-calling latency: +100-300ms per round trip
- Operational complexity: another service plus a protocol to debug when issues arise
Savings that offset the cost:
- Provider switch without a codebase rewrite (saves 5-15 mandays per switch event)
- Cost optimization via budget guard + tier routing (saves 20-40% of LLM spend when tuned)
- Uptime improvement via the fallback chain (saves downtime cost)
- Future-proofing against MCP standard adoption (no migration cost once the ecosystem matures)
The net benefit is positive for clients who fit the profile. For clients outside it, I recommend going direct to a single provider.
The stack we use ourselves
The stack in Capital Commerce's internal operations:
- LiteLLM v1.50+ self-hosted on a VPS in sgp1
- Anthropic Claude Sonnet 4.5 + Haiku 4.5 as primary
- OpenRouter as the fallback chain
- MCP servers for Postgres + n8n + Obsidian vault read
Monthly cost saving vs direct provider calls: ~25% (cost tracking + tier routing optimization). Uptime improvement: 1 outage event that would otherwise have halted workflows was auto-mitigated via fallback.
Closing
MCP + LiteLLM is a pattern with trade-offs, and I state them explicitly. It is not one-size-fits-all. When it fits, the long-term ROI is positive. When it does not, going direct to a single provider is leaner.
For clients in Tier 1A (burned in-house AI builders) who are stuck in compounding vendor lock-in, this pattern is usually part of the recovery solution. For fresh builds, it makes sense when the scope is clearly medium-to-large from Day 1.
Internal tags: Pillar 4 · Sovereign · Deep dive · S1 Cross-reference: /services/ai-operating-system · /pattern · Decisions/ADR-002-openrouter-litellm-migration · Decisions/ADR-012-claude-local-mcp-architecture
