Opening the Box of LLM Serving
Step by Step Through LLM Inference
In this article, we'll take a detailed look inside the black box of Large Language Model (LLM) inference and explain step-by-step how requests are served.
Step 1: Request Handling
When serving an LLM at scale, we need to handle the request properly: terminate TLS and verify certificates, run security checks, enforce rate limits, and so on. Then we route the request to the appropriate server.
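As an illustration of just one of these checks, here is a minimal sketch of a naive per-client rate limiter; the window size, request limit, and client key are made-up assumptions, and production gateways typically use a dedicated rate-limiting service instead.

```python
# Naive sliding-window rate limiter; all limits are illustrative assumptions.
import time
from collections import defaultdict

WINDOW_SECONDS = 60
MAX_REQUESTS_PER_WINDOW = 100
request_log = defaultdict(list)   # client_id -> timestamps of recent requests

def allow_request(client_id: str) -> bool:
    now = time.time()
    # Keep only the timestamps that fall inside the current window.
    recent = [t for t in request_log[client_id] if now - t < WINDOW_SECONDS]
    request_log[client_id] = recent
    if len(recent) >= MAX_REQUESTS_PER_WINDOW:
        return False                      # reject: rate limit exceeded
    request_log[client_id].append(now)
    return True                           # accept and route to a model server

print(allow_request("client-123"))        # True for the first request
```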
Step 2: Tokenization
- The raw input question could come in any language. We need to convert the input text into integer tokens using the model's vocabulary.
- We also need to include relevant context, such as previous chat history, to enhance the quality of the response.
- We add predefined system prompts, such as "limit your response to 1000 words" or "imagine you are a 5-year-old child, how would you reply?".
- Finally, the model has a maximum context length, so we apply sliding-window rules to stay within it (hey, the sliding-window coding question 😆). A rough sketch of this step follows below.
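As a rough sketch, tokenization with a Hugging Face tokenizer might look like the following; the model name, prompt strings, and 1024-token window are placeholder assumptions.

```python
# Minimal tokenization sketch; the model, prompts, and window size are placeholders.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # hypothetical model choice

system_prompt = "Limit your response to 1000 words."
chat_history = "User: Hi!\nAssistant: Hello! How can I help?\n"
user_question = "Explain how a KV-cache works."

# Assemble system prompt + chat history + new question into one string.
full_prompt = f"{system_prompt}\n{chat_history}User: {user_question}\nAssistant:"

# Truncate from the left so the most recent context survives the model's
# maximum context length (a simple sliding-window policy).
tokenizer.truncation_side = "left"
encoded = tokenizer(full_prompt, truncation=True, max_length=1024)
print(encoded["input_ids"][:10])   # integer token IDs
```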
Step 3: Transfer Tokens from CPU to GPU
The tokenized input (integer token IDs) and any past key/value (KV) tensors (for ongoing conversations) are transferred from the CPU’s RAM to the GPU’s memory.
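In PyTorch terms this transfer is just a host-to-device copy; the tensor shapes below are made-up examples.

```python
# Sketch of moving token IDs (and cached K/V tensors) onto the GPU.
import torch

input_ids = torch.tensor([[15496, 11, 995]])   # token IDs sitting in CPU RAM
past_k = torch.randn(1, 12, 3, 64)             # example cached keys (batch, heads, seq, head_dim)
past_v = torch.randn(1, 12, 3, 64)             # example cached values

device = "cuda" if torch.cuda.is_available() else "cpu"
input_ids = input_ids.to(device)               # host-to-device copy
past_k, past_v = past_k.to(device), past_v.to(device)
```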
Step 4: First Forward Pass over the Prompt (Prompt Processing)
The GPU now has the tokens, but the model does not take raw token IDs as input. We first map the integer IDs into dense embedding vectors, which serve as the model's input.
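A minimal sketch of this embedding lookup, with arbitrary example sizes:

```python
# Map integer token IDs to dense embedding vectors; sizes are illustrative.
import torch
import torch.nn as nn

vocab_size, hidden_size = 50257, 768
embedding = nn.Embedding(vocab_size, hidden_size)

input_ids = torch.tensor([[15496, 11, 995]])   # (batch, seq_len)
hidden_states = embedding(input_ids)           # (batch, seq_len, hidden_size)
print(hidden_states.shape)                     # torch.Size([1, 3, 768])
```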
How the KV-cache is used
Every self-attention head produces Key and Value matrices for each token in each layer, and these are buffered so we don't have to recompute them on the next decoding step. Without the cache, each new token would require recomputing attention over the entire prefix, making each step O(n²); with the cache, after the first pass each additional token costs O(n).
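A minimal single-head sketch of how one decoding step reuses the cache, using PyTorch 2.x's scaled_dot_product_attention; all dimensions are illustrative.

```python
# One decoding step with a KV-cache for a single attention head.
import torch
import torch.nn.functional as F

head_dim = 64
k_cache = torch.randn(1, 10, head_dim)   # keys for the 10 prompt tokens
v_cache = torch.randn(1, 10, head_dim)   # values for the 10 prompt tokens

# The new token's query/key/value from the current step.
q_new = torch.randn(1, 1, head_dim)
k_new = torch.randn(1, 1, head_dim)
v_new = torch.randn(1, 1, head_dim)

# Append to the cache instead of recomputing K/V for the whole prefix.
k_cache = torch.cat([k_cache, k_new], dim=1)   # (1, 11, head_dim)
v_cache = torch.cat([v_cache, v_new], dim=1)

# Scaled dot-product attention of the new token over the full cached context.
attn_out = F.scaled_dot_product_attention(q_new, k_cache, v_cache)
print(attn_out.shape)                          # torch.Size([1, 1, 64])
```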
Now let's take a look at what we have in GPU memory:
- Model weights
- Token IDs representing the processed prompt
- KV-cache: after the first forward pass over the whole prompt, we store each layer's K and V tensors so we never recompute them.
How big can the KV-cache get? With thousands of tokens in the context and, say, 48 layers in the model, the total cache can easily reach several GB.
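Here is a back-of-the-envelope estimate, assuming an illustrative 48-layer model with a 6144-dimensional hidden state, a 4096-token context, and fp16 values (none of these numbers come from a specific model).

```python
# Back-of-the-envelope KV-cache size estimate; all numbers are illustrative.
num_layers = 48
num_heads = 48
head_dim = 128            # hidden_size = num_heads * head_dim = 6144
seq_len = 4096            # thousands of tokens in the context
bytes_per_value = 2       # fp16

# 2x because we store both K and V for every layer.
kv_cache_bytes = 2 * num_layers * num_heads * head_dim * seq_len * bytes_per_value
print(f"{kv_cache_bytes / 1024**3:.1f} GiB")   # ~4.5 GiB for a single sequence
```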
Step 5: Decoding Loop (Autoregressive Generation)
To generate each additional token, the following process repeats (a framework-level sketch follows after this list):
- Fetch the last token embedding and the relevant KV-cache states.
- Perform a single-step forward pass through the transformer layers. For each layer:
  - Fetch the previously cached keys/values from the KV-cache.
  - Append the new keys/values from the current step to the existing KV-cache.
  - Perform scaled dot-product attention with this extended context.
  - Apply the remaining transformer operations (e.g., layer normalization, feed-forward network, residual connections).
  - Pass the result as the input to the next layer and repeat.
- After the final Transformer layer, you have the final hidden vector for the current position.
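Here is a framework-level sketch of this loop using Hugging Face's past_key_values cache; the model choice is arbitrary and greedy token selection is used only for brevity (sampling is covered in the next step).

```python
# Autoregressive decoding loop that reuses the KV-cache between steps.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
past_key_values = None
generated = input_ids

with torch.no_grad():
    for _ in range(10):
        # After the first pass, only the newest token is fed in;
        # the cache already holds K/V for everything before it.
        out = model(input_ids, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=-1)
        input_ids = next_token   # single-step forward pass from now on

print(tokenizer.decode(generated[0]))
```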
Step 6: From Hidden State to a New Token
The final hidden state is transformed into probabilities to choose the next word. This includes the following steps (a small sketch follows this list):
- Logits Calculation: Project h_final onto the vocabulary space using the LM head, yielding logits.
- Softmax: Convert the logits into probabilities.
- Sampling: Select the next token from the probability distribution using techniques like top-k, top-p, temperature adjustment, or repetition penalties.
- Finally, stream the new token to the client.
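A small sketch of this projection-softmax-sampling pipeline, with an arbitrary temperature and top-k value and a randomly initialized LM head standing in for the real one.

```python
# Hidden state -> logits -> probabilities -> sampled token (illustrative values).
import torch
import torch.nn.functional as F

hidden_size, vocab_size = 768, 50257
h_final = torch.randn(1, hidden_size)              # final hidden vector
lm_head = torch.nn.Linear(hidden_size, vocab_size, bias=False)

logits = lm_head(h_final)                          # (1, vocab_size)
logits = logits / 0.8                              # temperature < 1 sharpens the distribution

# Top-k: keep only the k most likely tokens before sampling.
top_k = 50
topk_logits, topk_ids = torch.topk(logits, top_k, dim=-1)
probs = F.softmax(topk_logits, dim=-1)             # softmax over the kept logits
next_token = topk_ids.gather(-1, torch.multinomial(probs, num_samples=1))
print(next_token.item())
```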
Why Sampling over Greedy?
Greedy decoding makes the model “play it safe”:
- Pros: lowest perplexity, fast, predictable.
- Cons: repetitive patterns (e.g., "I am ... I am ..."), lack of surprise, and a tendency to get stuck in loops (e.g., "Because because because ...").
Sampling lets the model explore the full distribution it learned during training, so you occasionally get less likely but more interesting continuations.
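A tiny illustration of the difference, using a made-up next-token distribution:

```python
# Greedy vs. sampled decoding on the same (made-up) distribution.
import torch

probs = torch.tensor([0.50, 0.30, 0.15, 0.05])    # next-token probabilities

greedy_choice = probs.argmax().item()              # always picks token 0
sampled_choices = [torch.multinomial(probs, 1).item() for _ in range(10)]

print(greedy_choice)       # 0 every single time
print(sampled_choices)     # mostly 0, but sometimes 1, 2, or 3
```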
Step 7: Post-processing and Delivery
After generating tokens, we handle the response (a small detokenization sketch follows this list):
- Detokenization: Convert integer token IDs back into readable UTF-8 strings.
- Safety and Policy Filters: Apply final checks to block any harmful, toxic, or prohibited content.
- HTTP Response: Deliver the response (normally streaming each token).
- Monitoring: Record detailed logs, traces, latency, and token-usage metrics, and send them to the monitoring backend for analysis.
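A small sketch of detokenization and token-by-token streaming; the token IDs are arbitrary examples, and real servers handle byte-level merging of partial tokens more carefully.

```python
# Detokenize generated IDs and stream them back piece by piece.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
generated_ids = [464, 3139, 286, 4881, 318, 6342]   # arbitrary example token IDs

# Detokenization: integer IDs back to a UTF-8 string.
text = tokenizer.decode(generated_ids, skip_special_tokens=True)
print(text)

# Streaming: decode and emit tokens incrementally as they are produced.
def stream_tokens(token_ids):
    for tid in token_ids:
        yield tokenizer.decode([tid], skip_special_tokens=True)

for piece in stream_tokens(generated_ids):
    print(piece, end="", flush=True)
```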
Step 8: Resource Cleanup
Finally, the KV-cache and request context are freed, or moved to a longer-lived session cache if the vendor supports conversational continuity across requests.
ML Infra System Design Private Course 📒
Would you like to master modern ML infrastructure from FAANG engineers with 15+ years of experience?
This is not limited to MLE roles: regular SDE roles are increasingly incorporating deep ML-related work, and interviews are evolving accordingly.
Contact us at [email protected]