
# Model Sizing, Quantization, and VRAM Budgeting for Local Deployment


**Owner:** Vadim Rudakov, rudakow.wadim@gmail.com · **Version:** 1.0.0 · **Birth:** 2025-10-19 · **Last Modified:** 2026-02-05


Deploying language models locally — whether as the Editor in the aidx pipeline or as a standalone inference endpoint — requires engineering three interrelated budgets: model weights, KV cache memory, and quantization loss. This article provides the sizing rationale that complements the model classification (Agentic / General Purpose / Thinking tiers) with concrete VRAM arithmetic.

## 1. Model Size Tiers and the aidx Role Map

The aidx framework assigns models to specific pipeline phases based on their capability tier. Size alone does not determine role — instruction adherence and reasoning depth matter more (see General Purpose vs Agentic Models).

| Tier | Parameter Range | aidx Role | Representative Models | Hardware Target |
|---|---|---|---|---|
| Micro | 125M – 3B | Researcher (RAG retrieval), classifier, router | ministral, phi-3-mini | CPU / mobile / edge |
| Editor | 7B – 14B | Editor (Phase 3: Execution) | qwen2.5-coder:14b-instruct-q4_K_M | Consumer GPU (8–16 GB VRAM) |
| Architect | 70B+ / Cloud API | Architect (Phase 2: Planning) | Claude 4.0 Sonnet, Gemini 3 Flash, DeepSeek-V3 | Cloud API |
| Thinking | Cloud API | Pre-flight verification | OpenAI o2, Gemini 3 (DeepThink), DeepSeek-R1 | Cloud API |

Key insight: The local GPU budget is reserved for the Editor tier. Architect and Thinking models run via cloud API, so their parameter count is irrelevant to your VRAM planning. Plan your hardware around the Editor model.

## 2. The “Short-Term Memory” Tax (KV Cache)

When you load a model, you are not just paying for the weights (the “brain”). Every token of conversation history allocates KV cache (the “short-term memory”) on the GPU.

  • The Problem: A model that fits in VRAM at initialization can OOM (Out of Memory) mid-conversation as the KV cache grows with each turn.

  • The Rule of Thumb: For a 14B model at Q4 quantization, budget ~2 GB of VRAM per 8,192 tokens of active context.
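The rule of thumb can be sanity-checked from first principles. A minimal sketch, assuming a grouped-query attention (GQA) architecture with illustrative dimensions (48 layers, 8 KV heads of width 128, FP16 cache values — placeholder numbers, not the actual qwen2.5 spec):

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> int:
    """KV cache size: one K and one V tensor per layer, FP16 values by default."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value

# Illustrative GQA config for a ~14B model (assumed values):
gib = kv_cache_bytes(n_layers=48, n_kv_heads=8, head_dim=128,
                     context_tokens=8192) / 1024**3
print(f"{gib:.1f} GiB")  # ~1.5 GiB at 8K context for this config
```

Models without GQA (full multi-head KV) cost several times more per token, which is why the conservative ~2 GB planning figure is reasonable.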

This is why the aidx framework enforces a Hard Reset at the Architect→Editor transition (ADR-26005):

The Editor instance is launched without the Architect’s message history. It receives only artifacts/plan.md as input, keeping KV cache usage below 4 GB and leaving maximum headroom for model weights.

The `max-chat-history-tokens: 2048` setting in the aidx configuration is a Context Gate — it caps the Editor’s KV cache growth to prevent the OOM crash that long aider sessions would otherwise produce.
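The gate’s effect can be sketched as a simple eviction loop. Here `count_tokens` stands in for whatever tokenizer the deployment uses — a hypothetical interface, not the actual aidx implementation:

```python
def apply_context_gate(messages: list[str], max_tokens: int, count_tokens) -> list[str]:
    """Evict the oldest turns until the history fits under the token cap."""
    kept = list(messages)
    while kept and sum(count_tokens(m) for m in kept) > max_tokens:
        kept.pop(0)  # oldest turn goes first; the newest context survives
    return kept

# Crude whitespace "tokenizer" for illustration only:
history = ["long architect discussion here", "apply the plan", "run tests"]
gated = apply_context_gate(history, max_tokens=5,
                           count_tokens=lambda m: len(m.split()))
```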

## 3. Quantization: Trading Precision for Deployment

Quantization reduces model weights from 16-bit floats to lower-bit integers, shrinking VRAM requirements at the cost of minor accuracy loss.

| Format | Size Reduction | Accuracy Impact | When to Use |
|---|---|---|---|
| FP16 (baseline) | — | — | Benchmarking, maximum quality |
| Q8_0 | ~50% | Negligible | When VRAM is available but you want a safety margin |
| Q4_K_M | ~75% | ~1% logic degradation | Default for local Editor deployment; best balance of size and quality |
| Q4_0 | ~75% | ~2–3% degradation | Budget hardware; test thoroughly before production use |

Practical example: qwen2.5-coder:14b at FP16 requires ~28 GB VRAM. At Q4_K_M, it fits in ~8 GB, leaving headroom for KV cache on a 12 GB consumer GPU.
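The arithmetic behind that example, assuming Q4_K_M averages roughly 4.8 effective bits per weight (mixed 4/6-bit blocks; the exact figure varies by model):

```python
def weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Weight storage in GiB for a given average bit width per parameter."""
    return n_params * bits_per_weight / 8 / 1024**3

fp16 = weight_gib(14e9, 16.0)  # ~26 GiB for a 14B model at FP16
q4km = weight_gib(14e9, 4.8)   # ~7.8 GiB, assuming ~4.8 effective bits/weight
```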

## 4. Production Patterns

### Pattern A: The Verifier Cascade

Instead of routing everything to a large cloud model, use a two-stage local pipeline:

  1. The Drafter (7B): Produces a fast, rough answer.

  2. The Verifier (14B): Checks the draft against your rules/schema.

This maps naturally to the aidx Architect→Editor flow: the Architect drafts the plan (cloud), and the Editor executes against the codebase (local). The Verifier Cascade extends this to fully local pipelines where cloud access is unavailable.
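A minimal sketch of the cascade, with `draft_model` and `verify_model` as callables wrapping your two local endpoints. Both the interface and the PASS/FAIL review protocol are assumptions for illustration:

```python
def verifier_cascade(prompt: str, draft_model, verify_model,
                     max_retries: int = 2) -> str:
    """Draft with the small model, check with the larger one, retry on failure."""
    draft = ""
    for _ in range(max_retries + 1):
        draft = draft_model(prompt)
        verdict = verify_model(f"Check this draft against the rules:\n{draft}")
        if verdict.strip().upper().startswith("PASS"):
            return draft
        # Feed the reviewer's objection back into the next drafting attempt
        prompt = f"{prompt}\n\nPrevious draft was rejected: {verdict}"
    return draft  # out of retries: surface the last draft (or raise, per policy)
```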

### Pattern B: Hybrid Routing

Use a micro model (≤3B) as a gatekeeper to classify incoming requests:

  • Simple requests (greetings, FAQ lookups) → local Editor model.

  • Complex requests (multi-file refactors, architectural decisions) → cloud Architect API.

In the aidx context, this is the Researcher role (Phase 1): ministral performs lightweight RAG retrieval to determine what context the Architect needs, avoiding expensive cloud API calls for work that can be handled locally.
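Sketched as a dispatch function, where `classify` is the micro-model gatekeeper returning a label (hypothetical interface and label set):

```python
def route(request: str, classify, local_model, cloud_model) -> str:
    """Send simple requests to the local Editor, complex ones to the cloud API."""
    label = classify(request)  # micro model returns e.g. "simple" or "complex"
    handler = local_model if label == "simple" else cloud_model
    return handler(request)
```

In practice the default branch should be the cloud model, so that ambiguous or misclassified requests degrade toward higher quality rather than lower.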

## 5. VRAM Budget Checklist

For a target local deployment (e.g., 12 GB consumer GPU):

  1. Select the Editor model: qwen2.5-coder:14b or equivalent in the 7B–14B range.

  2. Quantize to Q4_K_M: Reduces ~28 GB → ~8 GB for a 14B model.

  3. Reserve KV cache headroom: 2–4 GB depending on the `max-chat-history-tokens` setting.

  4. Verify total: Model weights + KV cache + OS overhead (~500 MB) must fit within VRAM.

  5. Enforce structured output: If the Editor produces structured output, enforce schema compliance with Pydantic or Outlines to prevent broken responses.

  6. Stress test: Run the longest expected input through the pipeline to verify no OOM under peak KV cache load.

| Component | Budget (12 GB GPU) |
|---|---|
| Model weights (14B Q4_K_M) | ~8 GB |
| KV cache (2048-token gate) | ~1 GB |
| OS / framework overhead | ~0.5 GB |
| Available headroom | ~2.5 GB |
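The checklist reduces to a single inequality, sketched here with the article’s 12 GB figures:

```python
def vram_budget(vram_gb: float, weights_gb: float, kv_cache_gb: float,
                overhead_gb: float = 0.5) -> tuple[bool, float]:
    """Return (fits, headroom_gb) for a planned local deployment."""
    used = weights_gb + kv_cache_gb + overhead_gb
    return used <= vram_gb, vram_gb - used

fits, headroom = vram_budget(vram_gb=12, weights_gb=8.0, kv_cache_gb=1.0)
# 14B at Q4_K_M behind a 2048-token gate on a 12 GB GPU: fits, ~2.5 GB headroom
```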