
Local LLM Inference: A Practical Handbook for Hybrid Host/Device Execution and KV Cache Offloading


Owner: Vadim Rudakov, lefthand67@gmail.com
Version: 0.7.0
Birth: 2025-11-23
Modified: 2025-11-24


INFO: The handbook is optimized for environments supporting Mermaid.js diagrams. For static export, rasterized versions are available in Appendix B.

When a user hits “enter” after typing a prompt, the system triggers a complex collaboration between the Host (Central Processing Unit, CPU) and the Device (Graphics Processing Unit, GPU). This architecture is called Hybrid Execution.

Your job as an AI Engineer is to manage the trade-offs between CPU’s vast memory capacity and GPU’s raw speed. Where should your precious data live and be processed?

📘 Glossary of Acronyms

| Acronym | Full Name | Context/Role |
| --- | --- | --- |
| CPU | Central Processing Unit | Orchestrates processing, I/O, and pre/post tasks. |
| GPU | Graphics Processing Unit | Executes parallel matrix compute operations. |
| VRAM | Video RAM | High-speed memory on the GPU; the major capacity limit. |
| RAM | Random Access Memory | CPU system memory used for offloading. |
| Host | n/a | The main system processor, almost always the CPU, together with its system memory (RAM). It manages the entire system, initiates tasks, and handles the flow of data to and from accelerator cards. |
| Device | n/a | A parallel processing unit or accelerator used to offload computationally intensive tasks from the Host: GPUs, TPUs (Tensor Processing Units), FPGAs (Field-Programmable Gate Arrays), or other custom AI chips. |
| KV Cache | Key-Value Cache | Stores attention Key and Value vectors for past tokens, eliminating re-computation in the Self-Attention layer. |
| FLOPS | Floating Point Operations Per Second | Theoretical peak compute throughput of a GPU; rarely achieved in LLM inference due to memory bottlenecks. |
| TTFT | Time To First Token | Latency metric for the Prefill phase; measures how long after supplying a prompt the first output token appears. |
| TPOT | Time Per Output Token | Latency metric for the Decode phase; quantifies how long it takes to generate each token after the first. |
| TPS | Tokens Per Second | Throughput metric during the Decode phase: $\text{TPS} = \frac{1}{\text{TPOT}}$. |
| PCI-E | Peripheral Component Interconnect Express | High-speed CPU-GPU interconnect; the bottleneck for offloading. |
| VRAM Bandwidth | GB/s data transfer rate between GPU cores and VRAM | Primary constraint for both the Prefill (TTFT) and Decode (TPS) phases. |
| PagedAttention™ | vLLM's unified memory technique | Allows non-contiguous KV cache blocks to be stored efficiently, dramatically reducing memory fragmentation and maximizing throughput. (Requires datacenter GPUs.) |
| Q4_K_M | Quantization format | 4-bit quantization with per-channel scaling and K-means clustering (GGUF). |

0. Key Metrics for Inference: Throughput vs. Latency

Performance metrics are often confused, yet understanding their distinctions is critical to optimizing Local LLM inference. Let’s clarify the key terms engineers use to measure model responsiveness and speed.

Key Metrics Overview

| Metric | What It Measures | Dominant Phase | Hardware Bottleneck | Relationship |
| --- | --- | --- | --- | --- |
| Time To First Token (TTFT) | Latency: time to generate the first token | Prefill | SSD speed, PCI-E bandwidth | Independent of TPS |
| Time Per Output Token (TPOT) | Latency: time to generate each subsequent token | Decode | GPU VRAM bandwidth | $\text{TPOT} = \frac{1}{\text{TPS}}$ |
| Tokens Per Second (TPS) | Throughput: tokens generated per second | Decode | GPU VRAM bandwidth | $\text{TPS} = \frac{1}{\text{TPOT}}$ |

Definitions and Context

Why This Distinction Matters

Optimizing local LLM inference is fundamentally a balancing act between minimizing latency and maximizing throughput.

Insight for Engineers: When colleagues say “optimize TPS,” they mean reducing bandwidth bottlenecks during token generation. This is often achieved by managing the KV Cache effectively or applying quantization to reduce data size.

Practical Example

Consider the Mistral 7B model on an 8GB VRAM GPU with 32GB RAM: you want a low TTFT so the response starts quickly, and a high TPS so the response streams quickly once it has started.

However, these two goals can compete for resource allocation — optimizing for one may degrade the other, so thoughtful system tuning is necessary.
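To put rough numbers on this scenario, the sketch below combines the TTFT and TPS figures used elsewhere in this handbook (0.5 s and 20 tokens/s, respectively; both illustrative rather than measured) into an end-to-end response time.

```python
# Back-of-the-envelope latency/throughput math for the Mistral 7B / 8GB VRAM scenario.
# The TTFT and TPS values are the illustrative figures used in this handbook, not measurements.
ttft_s = 0.5        # Time To First Token (Prefill latency)
tps = 20.0          # Tokens Per Second during Decode
tpot_s = 1.0 / tps  # Time Per Output Token, the reciprocal of TPS

n_output_tokens = 500
total_s = ttft_s + n_output_tokens * tpot_s  # end-to-end time for one response

print(f"TPOT: {tpot_s * 1000:.0f} ms/token")                   # 50 ms/token
print(f"Total for {n_output_tokens} tokens: {total_s:.1f} s")  # 25.5 s
```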


1. The Local Inference Pipeline: A Guided Scenario

We will follow a typical prompt journey using the scenario: running a Mistral 7B model locally with a high-end CPU, 32GB system RAM, and a consumer 8GB VRAM GPU.

1.1 Phase 1: The Prefill (Fast, Parallel Compute)

The model processes the entire prompt in parallel during this phase, marked by high GPU utilization and measured as Time To First Token (TTFT).

| Step | Dominant Role | Action | Primary Bottleneck |
| --- | --- | --- | --- |
| 1. Cold Start / I/O | Host (CPU) | Load weights from SSD to system RAM | SSD sequential read speed |
| 2. Preprocessing | Host (CPU) | Tokenize prompt, prepare tensors, transfer | PCI-E bandwidth (Host → Device) |
| 3. Compute & Cache | Device (GPU) | Matrix multiplies, builds KV Cache | VRAM bandwidth, Device FLOPS† |

† GPU compute (FLOPS) is rarely the true bottleneck. Real-world Prefill is memory-bound by the VRAM bandwidth feeding data to the cores. See Appendix A for more information.
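As a rough sketch of why steps 1 and 2 dominate a cold-start TTFT, the snippet below divides the quantized weight size by assumed SSD and PCI-E transfer rates; the 4.4 GB figure is the Q4_K_M size cited in Section 5, while the bandwidth numbers are ballpark assumptions, not benchmarks.

```python
# Rough cold-start I/O estimate for Prefill (steps 1-2 above).
# Bandwidth figures are ballpark assumptions; measure your own hardware for real numbers.
weights_gb = 4.4       # Mistral 7B Q4_K_M (see Section 5)
ssd_read_gbps = 3.5    # assumed NVMe sequential read throughput, GB/s
pcie_gbps = 25.0       # assumed effective PCI-E 4.0 x16 throughput, GB/s

ssd_time_s = weights_gb / ssd_read_gbps   # SSD -> Host RAM
pcie_time_s = weights_gb / pcie_gbps      # Host RAM -> Device VRAM

print(f"SSD load:       ~{ssd_time_s:.2f} s")   # ~1.26 s
print(f"PCI-E transfer: ~{pcie_time_s:.2f} s")  # ~0.18 s
```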

1.2 Phase 2: The Decode (Sequential Memory Access)

Latency: Time Per Output Token (TPOT)

After the first token is generated, subsequent tokens are produced one at a time, each referencing the expanding KV Cache. This loop is bandwidth-bound and is measured by Time Per Output Token (TPOT).

Throughput: Tokens Per Second (TPS)

Tokens Per Second (TPS) is the standard performance metric used to measure the throughput or speed of an LLM serving engine. It is the reciprocal of TPOT:

$$\text{TPS} = \frac{1}{\text{TPOT}}$$
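In practice, all three metrics can be derived from timestamps taken while consuming a token stream. The sketch below is framework-agnostic; fake_token_stream is a hypothetical stand-in for a real engine's streaming API, with delays chosen to mimic a 0.5 s TTFT and 20 TPS.

```python
# Minimal sketch: measuring TTFT, TPOT, and TPS from any streaming token source.
# `fake_token_stream` is a hypothetical stand-in for a real engine's streaming API.
import time

def fake_token_stream(n_tokens: int = 50):
    time.sleep(0.5)          # simulated Prefill (TTFT)
    for _ in range(n_tokens):
        time.sleep(0.05)     # simulated per-token Decode latency (TPOT)
        yield "tok"

start = time.perf_counter()
first_token_at = None
count = 0
for _ in fake_token_stream():
    if first_token_at is None:
        first_token_at = time.perf_counter()  # TTFT endpoint
    count += 1
end = time.perf_counter()

ttft = first_token_at - start
tpot = (end - first_token_at) / max(count - 1, 1)  # average latency after the first token
print(f"TTFT: {ttft:.2f} s, TPOT: {tpot * 1000:.0f} ms, TPS: {1 / tpot:.1f} tok/s")
```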


2. Memory Pressure: The KV Cache and Hybrid Execution Bottlenecks

2.1 The Key-Value (KV) Cache: The VRAM Killer

The KV Cache stores attention vectors for past tokens, significantly reducing computations but consuming high-speed Device VRAM.

| KV Cache Characteristic | Engineering Challenge |
| --- | --- |
| Linear growth | Cache size grows linearly with conversation length. |
| VRAM limit | Cache saturation (e.g., 8GB VRAM) causes stalls or crashes. (Saturation means VRAM is fully allocated, forcing data eviction or page faults.) |

Memory Pressure Timeline: see the rasterized diagram in Appendix B.

🔥 Critical Pitfall: Exceeding 4,000 tokens on 8GB VRAM stalls or crashes inference without aggressive memory management.

2.2 Why the KV Cache Grows Linearly

The Key-Value (KV) Cache is a critical optimization for Transformer-based models (like Large Language Models).

The Need for Caching

Without caching, generating token $N$ would require re-computing the attention Keys and Values for all previous tokens at every step, making each new token progressively more expensive.

How Caching Causes Growth

With the cache, each new token only appends its own Key and Value vectors (one set per layer). Generation stays fast, but the cache grows by a fixed amount per token, i.e., linearly with sequence length.

The VRAM Bottleneck

The issue arises because VRAM is a finite, expensive resource.

In short, the problem is a fundamental trade-off: the linear growth of the cache (essential for fast $O(N)$ generation) eventually clashes with the fixed limit of the Device's VRAM (which is required for high-speed operation).
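To see the linear growth concretely, here is a minimal sketch of the arithmetic. The layer count, head count, head dimension, and FP16 cache precision below are illustrative placeholders assuming classic multi-head attention; models using grouped-query attention store considerably less per token, and engines may quantize the cache further.

```python
# Generic KV Cache size estimator (a sketch; parameters are illustrative placeholders).
# Per token, every layer stores one Key and one Value vector per KV head.
def kv_cache_bytes(n_tokens, n_layers=32, n_kv_heads=32, head_dim=128, bytes_per_elem=2):
    # Factor 2 accounts for Key + Value; bytes_per_elem=2 assumes an FP16 cache.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens

for n in (1_024, 4_096, 16_384):
    gib = kv_cache_bytes(n) / 2**30
    print(f"{n:>6} tokens -> {gib:.2f} GiB")  # growth is strictly linear in n_tokens
```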

2.3 KV Cache Offloading: The Hybrid Solution

Hybrid execution frameworks like llama.cpp offload KV Cache blocks to Host RAM when VRAM fills.

Offloading Event Workflow: see the KV Cache Growth & Offload Trigger diagram in Appendix B.

When the Device needs an offloaded block (cache miss), the data must be retrieved from Host RAM, crossing the PCI-E Bus twice. This retrieval process adds significant latency.

(For a step-by-step visualization of this PCI-E Latency Hit, refer to the conditional block in the Complete LLM Inference Pipeline Diagram in Section 3.)
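To get a feel for that latency hit, the sketch below compares reading one cache block from local VRAM against fetching it from Host RAM over PCI-E; the block size and bandwidth figures are assumptions chosen for illustration, not measurements.

```python
# Rough illustration of the PCI-E latency hit for one offloaded KV Cache block.
# Block size and bandwidth values are assumptions, not measurements.
block_mb = 64.0
vram_bw_gbps = 500.0   # assumed consumer-GPU VRAM bandwidth, GB/s
pcie_bw_gbps = 25.0    # assumed effective PCI-E 4.0 x16 throughput, GB/s

t_vram_ms = block_mb / 1000 / vram_bw_gbps * 1000
t_pcie_ms = block_mb / 1000 / pcie_bw_gbps * 1000

print(f"VRAM read:   {t_vram_ms:.2f} ms")
print(f"PCI-E fetch: {t_pcie_ms:.2f} ms ({t_pcie_ms / t_vram_ms:.0f}x slower)")
# The end-to-end TPOT penalty is smaller than this raw bandwidth ratio (this handbook
# cites 5x-10x) because only part of the cache lives in Host RAM at any given time.
```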

| Bottleneck | Symptom/Error | Cause | Actionable Troubleshooting |
| --- | --- | --- | --- |
| Prefill Latency | High TTFT (> 1.5 s for a 512-token prompt) | Slow SSD I/O or PCI-E bottleneck during weight transfer | 1. Upgrade to an NVMe Gen4 SSD. 2. Ensure the GPU sits in an x16 PCI-E slot. 3. Pin the process to isolated CPU cores. [→ Deep Dive: OS Tuning] |
| Decode Throughput | Low TPS (< 15 tokens/sec for Mistral-7B) | VRAM bandwidth saturation during KV Cache access (memory-bound ops) | 1. Apply Q4_K_M quantization. 2. Reduce the context window to 2K tokens. 3. Enable partial KV cache offloading. [→ Deep Dive: Quantization Trade-offs] |
| Memory Crash | Fatal VRAM error at ~4K tokens | KV Cache > VRAM capacity (e.g., 4.2GB cache on an 8GB VRAM GPU) | 1. Enable KV cache offloading. 2. Set --cache-offload-percentage 70. 3. Monitor VRAM usage pre-crash. [→ Deep Dive: Memory Management] |
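For the "Monitor VRAM usage pre-crash" step above, a minimal polling sketch might look like the following; it assumes an NVIDIA GPU with nvidia-smi available on the PATH, and the 90% warning threshold is an arbitrary example.

```python
# Minimal VRAM usage poller (assumes an NVIDIA GPU with `nvidia-smi` on the PATH).
import subprocess
import time

def vram_used_total_mib():
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    used, total = out.strip().splitlines()[0].split(", ")  # first GPU only
    return int(used), int(total)

while True:
    used, total = vram_used_total_mib()
    print(f"VRAM: {used}/{total} MiB ({used / total:.0%})")
    if used / total > 0.9:  # arbitrary example threshold
        print("Warning: approaching VRAM saturation; consider KV cache offloading.")
    time.sleep(5)
```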

💡 Interactive Checkpoint: Your logs show:

  1. TTFT is excellent (0.5 s).

  2. After 3 minutes, TPS drops from 20 to 3.

Can you identify the cause? (Answer at the end.)


3. Complete LLM Inference Pipeline Diagram

This sequence diagram illustrates the complete process of Large Language Model (LLM) inference on a single-GPU system, from model loading to iterative token generation, explicitly highlighting critical performance bottlenecks and the hybrid execution logic.

This structured analysis serves as the key to the diagram, explicitly linking each stage of the pipeline to its governing performance metric, hardware driver, and primary bottleneck.

| Phase/Event | Driver | Performance Metric | Bottleneck & Implication |
| --- | --- | --- | --- |
| Phase I: Prefill | Model loading, prompt processing, initial KV Cache build | TTFT (Time To First Token) | Primarily I/O bottlenecks: SSD sequential read speed and PCI-E bandwidth (Host RAM → Device VRAM) during initial weight and prompt transfer. |
| Phase II: Decode (Steady State) | Iterative token generation, KV Cache read/write | TPOT (Time Per Output Token) & TPS | VRAM bandwidth saturation. Speed is limited by how quickly the Device can access the growing KV Cache in its local high-speed memory. |
| Conditional: Offloading | KV Cache size exceeds available VRAM capacity (e.g., > 8GB) | Latency spike (high TPOT) | PCI-E latency hit: offloaded blocks must be retrieved from slower Host RAM across the PCI-E bus, adding a significant delay (5× to 10× penalty) to the token generation loop. |
| Hybrid Benefit | Utilizing vast Host RAM (e.g., 32GB) to extend context | Maximum context length | Enables context windows far exceeding VRAM capacity (e.g., 16K tokens), trading sustained high TPS for the ability to handle long conversations. |

4. Frameworks: The Hybrid Execution Engines

Real-world hybrid execution depends on inference kernels that optimize Host/Device coordination. Key players in 2025:

| Framework | Hybrid Execution Superpower | Critical Limitation | Mistral-7B (8GB VRAM) Tip |
| --- | --- | --- | --- |
| llama.cpp | KV cache offloading + Host/Device layer splitting | High Host (CPU) overhead during paging | n_gpu_layers=40 + --split-mode row |
| vLLM | PagedAttention™ (unified Device VRAM / Host RAM cache) | Requires NVIDIA datacenter Devices | ❌ Not consumer-Device compatible |
| HuggingFace TGI | Speculative decoding + pipeline parallelism | No Host offload support | Use only with 16GB+ Device VRAM |

llama.cpp Hybrid Methods Explained

For local deployments, llama.cpp uses two distinct methods for hybrid execution:

  1. KV Cache Offloading: The process of moving the conversation state (the growing KV Cache) from Device VRAM to Host RAM to extend the maximum context length beyond the Device’s memory limit. This primarily manages memory capacity.

  2. Layer Splitting: The most effective hybrid technique for load balancing, which assigns the computation for the model’s lower layers (compute-heavy) to the Device (GPU), while the upper layers (memory-intensive) are strategically placed on the Host (CPU) (using Host RAM). This manages computational load and Device VRAM capacity simultaneously.

💡 Battle-Tested Insight:

For local deployments (our scenario), llama.cpp is the only framework that reliably handles KV cache offloading on consumer Devices (GPUs). Its GGUF quantization support (Q4_K_M) and CUDA graph capture make it the de facto standard for sub-24GB Device VRAM setups.
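As a hedged illustration of how layer splitting is typically configured, here is a minimal llama-cpp-python sketch. The GGUF path and numeric values are placeholders for our Mistral 7B / 8GB VRAM scenario rather than tuned settings, and parameter names can vary between llama-cpp-python releases; the equivalent llama.cpp CLI flags are -ngl (GPU layers) and --split-mode.

```python
# Minimal layer-splitting sketch with llama-cpp-python (pip install llama-cpp-python).
# The GGUF path and numbers below are illustrative placeholders, not tuned settings.
from llama_cpp import Llama

llm = Llama(
    model_path="./mistral-7b-instruct.Q4_K_M.gguf",  # hypothetical local GGUF file
    n_gpu_layers=40,  # upper bound on layers placed on the Device (GPU); the rest run on the Host (CPU)
    n_ctx=4096,       # context window; a larger value grows the KV Cache accordingly
)

out = llm("Explain KV cache offloading in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```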


5. Quantization: The Enabler for Local LLMs

Quantization is the process of reducing the precision of the model’s weights (e.g., from 16-bit to 4-bit integers), which dramatically cuts the model’s VRAM footprint and memory bandwidth requirement.

Trade-offs of Quantization

| Aspect | Description | Impact |
| --- | --- | --- |
| Model Size | Reduces weight size by ~4× (from 16-bit to 4-bit). | Critical: allows models to fit into limited VRAM (e.g., Mistral 7B: ≈14GB FP16 → ≈4.5GB Q4_K_M). |
| Inference Speed | Reduces the size of the data moved over the VRAM bus. | Positive: increases effective VRAM bandwidth, leading to higher TPS during the Decode phase. |
| Accuracy | Loss of precision can introduce small errors. | Minor: modern formats (e.g., Q4_K_M, Q5_K_S) mitigate this, with a negligible quality drop for many tasks. |

The Power of GGUF (Q4_K_M)

The llama.cpp framework’s GGUF format is the standard for local LLM quantization. The Q4_K_M variant uses advanced techniques (like per-channel scaling and K-means clustering) to achieve high compression with minimal accuracy loss.

| Model | Original Size (FP16) | Q4_K_M Size (GGUF) | VRAM Saved |
| --- | --- | --- | --- |
| Mistral 7B | 14 GB | 4.4 GB | 68.5% |

Engineering Mandate: Always deploy the highest practical quantization (e.g., Q4_K_M or Q5_K_M) before resorting to full KV Cache offloading, as quantization is a global speed optimization, whereas offloading is a localized latency penalty.
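The arithmetic behind these numbers is simple enough to sanity-check by hand. The sketch below assumes roughly 7.2B parameters for Mistral 7B and about 4.85 effective bits per weight for Q4_K_M (block scales add overhead beyond the raw 4 bits); both are approximations rather than exact figures.

```python
# Back-of-the-envelope model-size arithmetic (approximations, not exact file sizes).
n_params = 7.24e9    # approximate parameter count for Mistral 7B
fp16_bits = 16
q4_k_m_bits = 4.85   # approximate effective bits/weight for Q4_K_M (scales add overhead)

fp16_gb = n_params * fp16_bits / 8 / 1e9
q4_gb = n_params * q4_k_m_bits / 8 / 1e9

print(f"FP16:   ~{fp16_gb:.1f} GB")                                     # ~14.5 GB
print(f"Q4_K_M: ~{q4_gb:.1f} GB ({1 - q4_gb / fp16_gb:.0%} smaller)")   # ~4.4 GB, ~70% smaller
```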


Key Takeaways for the AI Engineer

  1. TTFT vs. TPOT: TTFT is CPU/SSD latency; TPOT is GPU/VRAM bandwidth.

  2. Bandwidth over FLOPS: Decode speed depends more on memory bandwidth than raw compute.

  3. Quantization First: Use GGUF to reduce memory footprint and boost effective VRAM bandwidth.

  4. Hybrid Execution is Necessary: Must configure KV Cache offloading (for context length) and/or Layer Splitting (for model fit) for large models or long contexts on consumer hardware.

✅ Checkpoint Answer: TTFT confirms Prefill ran well. The TPS drop signals that KV Cache offloading is active, with slow PCI-E reads causing up to 10× latency.


Appendix A: Deep Dive on Compute vs. Bandwidth

GPU FLOPS Decoded: Theoretical Peak vs. Real World

FLOPS (Floating Point Operations Per Second) measures a GPU’s theoretical peak compute throughput for matrix math. However, LLM inference rarely saturates FLOPS due to memory constraints.

⚠️ Critical Reality Check:

In actual LLM workloads, the Decode phase is memory-bound: the GPU spends most of its time waiting on VRAM reads of weights and the KV Cache, so effective throughput lands far below the advertised FLOPS figure.

💡 When FLOPS Actually Matters:

Only when all these conditions are met:

  1. Weights fully pre-loaded in VRAM (no PCI-E transfers during prefill).

  2. Prompt length > 1,024 tokens (sufficient parallelism).

  3. Using FP16/BF16 precision (no quantization).

  4. Optimized kernels (e.g., FlashAttention-2).

Example: The RTX 4090 (83 TFLOPS FP16)

The 83 TFLOPS FP16 rating for an RTX 4090 is a peak theoretical number, useful for comparing raw potential, but not a predictor of LLM inference speed in real deployments. This number is achievable only under ideal conditions (dense matrices, perfect memory access) that modern LLM workloads (sparse, memory-bound attention layers) rarely meet.

Real-world LLM inference on an RTX 4090 typically achieves < 25 TFLOPS of effective throughput, often much lower during decode due to memory-bound behavior.

Key Takeaway: The Prefill phase is memory-bound (VRAM bandwidth), not compute-bound. Optimizing TTFT requires maximizing VRAM bandwidth utilization and ensuring weights are pre-loaded.
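A rough roofline-style estimate makes this concrete: during Decode, each generated token requires streaming (at minimum) the full set of model weights from VRAM, so VRAM bandwidth alone caps TPS. The sketch below assumes the commonly published ~1 TB/s bandwidth figure for the RTX 4090 and the 4.4 GB Q4_K_M weight size from Section 5; both are approximations.

```python
# Roofline-style decode ceiling: TPS is bounded by VRAM bandwidth / bytes read per token.
weights_gb = 4.4        # Mistral 7B Q4_K_M weight size (Section 5)
vram_bw_gbps = 1000.0   # approximate published RTX 4090 VRAM bandwidth, GB/s

max_tps = vram_bw_gbps / weights_gb  # ignores KV Cache reads, which lower this further
print(f"Bandwidth-bound ceiling: ~{max_tps:.0f} tokens/s")  # ~227 tokens/s
# Real-world numbers are lower still; note that raw FP16 FLOPS never enters this estimate.
```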

Appendix B. Static Diagrams

1. Complete LLM Inference Pipeline Diagram

Complete LLM Inference Pipeline Diagram

2. KV Cache Growth & Offload Trigger Diagram

KV Cache Growth & Offload Trigger

3. Memory Pressure Timeline

Memory Pressure Timeline