Bereket Lemma // ML Systems Engineering

GPU: NVIDIA L4 24GB | Engine: vLLM 0.16.0 | Configs: 18 | Runs: 10/config | Warmup: 3 iters | Date: 2026-04-05
What You're Looking At
This benchmark compares two ways of running the same AI model: FP16 (full precision) and INT4-AWQ (compressed). Compression makes the model run faster and use less memory; the tests check whether quality suffers in the process. Below you'll find speed (throughput), response time (latency), and how performance scales under different loads.
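
For readers who want to reproduce this kind of A/B comparison, here is a minimal sketch using vLLM's offline Python API. The model IDs and the quantization string are illustrative assumptions, not the exact configs behind these numbers, and in practice each engine should run in its own process so GPU memory is fully released between configs.

    # Sketch: time greedy generation for an FP16 and an INT4-AWQ build of the
    # same model. Model IDs and the quantization value are assumptions.
    import time
    from vllm import LLM, SamplingParams

    prompts = ["Summarize the benefits of model quantization."] * 8
    params = SamplingParams(temperature=0.0, max_tokens=128)  # greedy decoding

    configs = {
        "fp16": dict(model="meta-llama/Llama-3.1-8B-Instruct", dtype="float16"),
        # "awq" also works; vLLM picks the Marlin kernel when the GPU supports it.
        "int4-awq": dict(model="hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
                         quantization="awq_marlin"),
    }

    for name, kwargs in configs.items():
        llm = LLM(**kwargs)
        start = time.perf_counter()
        outputs = llm.generate(prompts, params)
        elapsed = time.perf_counter() - start
        tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
        print(f"{name}: {tokens / elapsed:.1f} tok/s")
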
Throughput Speedup: 3.3x (the compressed model generates tokens this many times faster)
P99 Latency Improved: 37.4% (worst-case response time got faster)
Peak Speed: 452.3 tok/s (max tokens per second the model can generate)
Test Scenarios: 18 (different batch sizes and input lengths)
Overview
Latency (response time): How long it takes the model to respond. Lower is better for interactive apps.
Throughput (speed): How many tokens the model generates per second. Higher is better for processing large batches.
P50 vs P99: P50 is typical performance; P99 is worst-case. Production systems care about P99.
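
To make P50 vs P99 concrete, here is a small sketch of how the percentiles fall out of repeated timed runs; the latency samples are made up. Note how a single slow run drags P99 far above P50, which is exactly the tail behavior production systems watch.

    # Sketch: percentile latency from 10 timed runs (sample values are made up).
    import numpy as np

    latencies_ms = np.array([812, 790, 805, 798, 1120, 801, 795, 808, 799, 803])
    p50 = np.percentile(latencies_ms, 50)  # typical run
    p99 = np.percentile(latencies_ms, 99)  # worst case, dominated by the outlier
    print(f"P50 = {p50:.0f} ms, P99 = {p99:.0f} ms")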

Charts (interactive on the live page):
Avg Latency (ms): lower is better.
Avg Throughput (tok/s): higher is better; 3.3x with AWQ-Marlin.
Throughput Scaling by Batch Size.

Methodology
Warmup: 3 iterations (cold-start elimination)
Measurement: 10 runs/config (statistical stability)
Decoding: greedy (T=0.0), deterministic output
Engine: vLLM 0.16.0 (PagedAttention + continuous batching)
Quantization: AWQ-Marlin (INT4 Marlin kernel, fast path)
Hardware: NVIDIA L4 24GB (Google Cloud us-west1-a)
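
The measurement loop this table describes can be sketched as follows; run_batch is a hypothetical stand-in for one call into the vLLM engine with a given batch size and input length.

    # Sketch: 3 warmup iterations, then 10 timed runs per configuration.
    # run_batch is hypothetical; it issues one generation batch to the engine.
    import time
    import numpy as np

    def benchmark(run_batch, warmup=3, runs=10):
        for _ in range(warmup):   # discard cold-start effects (JIT, caches)
            run_batch()
        samples = []
        for _ in range(runs):     # repeated runs for statistical stability
            start = time.perf_counter()
            run_batch()
            samples.append((time.perf_counter() - start) * 1000)  # ms
        lat = np.array(samples)
        return {"p50_ms": float(np.percentile(lat, 50)),
                "p99_ms": float(np.percentile(lat, 99))}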