Bereket Lemma // ML Systems Engineering

GPU: NVIDIA L4 24GB | Engine: vLLM 0.16.0 | Configs: 18 | Runs: 10/config | Warmup: 3 iters | Date: 2026-04-05
What You're Looking At
This benchmark compares two ways of running the same AI model: FP16 (full precision) and INT4-AWQ (compressed). Compression makes the model run faster and use less memory; the tests check whether quality suffers in the process. Below you'll find speed (throughput), response time (latency), and how performance scales under different loads.
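
For readers who want to reproduce this kind of A/B comparison, here is a minimal sketch using vLLM's offline Python API. The model IDs and the quantization string are illustrative assumptions, not the exact configs behind these numbers, and in practice each engine should run in its own process so GPU memory is fully released between configs.

    # Sketch: time greedy generation for an FP16 and an INT4-AWQ build of the
    # same model. Model IDs and the quantization value are assumptions.
    import time
    from vllm import LLM, SamplingParams

    prompts = ["Summarize the benefits of model quantization."] * 8
    params = SamplingParams(temperature=0.0, max_tokens=128)  # greedy decoding

    configs = {
        "fp16": dict(model="meta-llama/Llama-3.1-8B-Instruct", dtype="float16"),
        # "awq" also works; vLLM picks the Marlin kernel when the GPU supports it.
        "int4-awq": dict(model="hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
                         quantization="awq_marlin"),
    }

    for name, kwargs in configs.items():
        llm = LLM(**kwargs)
        start = time.perf_counter()
        outputs = llm.generate(prompts, params)
        elapsed = time.perf_counter() - start
        tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
        print(f"{name}: {tokens / elapsed:.1f} tok/s")
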
Throughput Speedup: 3.3x (the compressed model generates tokens this many times faster)
P99 Latency Improved: 37.4% (worst-case response time got faster)
Peak Speed: 452.3 tok/s (max tokens per second the model can generate)
Test Scenarios: 18 (different batch sizes and input lengths)
Overview
Latency (response time): How long it takes the model to respond. Lower is better for interactive apps.
Throughput (speed): How many tokens the model generates per second. Higher is better for processing large batches.
P50 vs P99: P50 is typical performance; P99 is worst-case. Production systems care about P99.
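
To make P50 vs P99 concrete, here is a small sketch of how the percentiles fall out of repeated timed runs; the latency samples are made up. Note how a single slow run drags P99 far above P50, which is exactly the tail behavior production systems watch.

    # Sketch: percentile latency from 10 timed runs (sample values are made up).
    import numpy as np

    latencies_ms = np.array([812, 790, 805, 798, 1120, 801, 795, 808, 799, 803])
    p50 = np.percentile(latencies_ms, 50)  # typical run
    p99 = np.percentile(latencies_ms, 99)  # worst case, dominated by the outlier
    print(f"P50 = {p50:.0f} ms, P99 = {p99:.0f} ms")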

Charts (interactive on the live page):
Avg Latency (ms): lower is better.
Avg Throughput (tok/s): higher is better; 3.3x with AWQ-Marlin.
Throughput Scaling by Batch Size.

Methodology
Warmup: 3 iterations (cold-start elimination)
Measurement: 10 runs/config (statistical stability)
Decoding: greedy (T=0.0), deterministic output
Engine: vLLM 0.16.0 (PagedAttention + continuous batching)
Quantization: AWQ-Marlin (INT4 Marlin kernel, fast path)
Hardware: NVIDIA L4 24GB (Google Cloud us-west1-a)
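
The measurement loop this table describes can be sketched as follows; run_batch is a hypothetical stand-in for one call into the vLLM engine with a given batch size and input length.

    # Sketch: 3 warmup iterations, then 10 timed runs per configuration.
    # run_batch is hypothetical; it issues one generation batch to the engine.
    import time
    import numpy as np

    def benchmark(run_batch, warmup=3, runs=10):
        for _ in range(warmup):   # discard cold-start effects (JIT, caches)
            run_batch()
        samples = []
        for _ in range(runs):     # repeated runs for statistical stability
            start = time.perf_counter()
            run_batch()
            samples.append((time.perf_counter() - start) * 1000)  # ms
        lat = np.array(samples)
        return {"p50_ms": float(np.percentile(lat, 50)),
                "p99_ms": float(np.percentile(lat, 99))}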