Latency (response time): How long the model takes to respond to a single request, measured from sending the request to receiving the response. Lower is better for interactive apps.
Throughput (speed): How much output the model generates per second, usually measured in tokens per second rather than words. Higher is better for processing large batches.
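P50 vs P99: P50 (the median) is the latency that half of all requests finish within, so it reflects typical performance. P99 is the latency that 99% of requests finish within, so it reflects the slow tail, not the single worst case. Production systems usually care about P99, because the slowest 1% of requests still reach real users.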
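
To make these definitions concrete, here is a minimal Python sketch that computes P50, P99, and throughput from measurements. The function names `percentile` and `tokens_per_second` and the sample latencies are hypothetical, chosen only for illustration. The percentile uses the nearest-rank method, which is one of several common conventions, so other tools may report slightly different values.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that
    at least p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def tokens_per_second(total_tokens, elapsed_seconds):
    """Throughput: generated tokens divided by wall-clock time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return total_tokens / elapsed_seconds


# Hypothetical latencies in milliseconds for 100 requests:
# 98 fast requests and 2 slow outliers.
latencies_ms = [200] * 98 + [1500, 4000]

p50 = percentile(latencies_ms, 50)   # typical request
p99 = percentile(latencies_ms, 99)   # slow tail
worst = max(latencies_ms)            # true worst case

print(f"P50={p50} ms, P99={p99} ms, max={worst} ms")
print(f"throughput={tokens_per_second(1200, 8.0)} tokens/s")
```

With this sample data, P50 is 200 ms, P99 is 1500 ms, and the maximum is 4000 ms, which illustrates that P99 describes the slow tail and can still be well below the single slowest request.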