Generative AI systems using Large Language Models (LLMs) like GPT-4o process natural language input through a series of computational steps: tokenization, numerical representation, neural processing, and text generation. While these models have achieved impressive performance in understanding and generating text, they sometimes struggle with seemingly simple tasks, such as counting letters in a word. This article provides a detailed breakdown of how an LLM transforms a prompt into a response and why it can struggle with problems like “How many Rs are in ‘strawberry’?”

In short, LLMs don’t think in words or letters, but in numbers called tokens. The model doesn’t understand the ideas in the prompt; instead, it predicts what text should come next by converting the prompt into numerical tokens, making its prediction as numerical tokens, and then converting that prediction back into text for the response.

Step 1: Tokenization – Breaking Text into Units

The first step in processing any prompt is tokenization, where the model converts text into discrete units called tokens. Tokens can be:

  • Whole words (common in simple cases)
  • Subwords (frequent in complex or rare words)
  • Characters (sometimes, especially for uncommon text patterns)
  • Punctuation and spaces (often treated as separate tokens)

For example, the prompt:

How many Rs are in strawberry?

Might be tokenized by an LLM into something like:

["How", "many", "R", "s", "are", "in", "straw", "berry", "?"]

Or, in a subword-based tokenizer (like Byte Pair Encoding or WordPiece), it might look like:

["How", "many", "Rs", "are", "in", "straw", "##berry", "?"]

Each of these tokens is mapped to a unique number (Token ID) from the model’s vocabulary.
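To make this concrete, here is a minimal sketch using OpenAI’s open-source tiktoken library (assumed to be installed via pip install tiktoken). The exact token boundaries vary by tokenizer and model, so the output shown in the comments is illustrative rather than guaranteed:

import tiktoken  # assumes: pip install tiktoken

# Load a BPE tokenizer (this encoding name is one used by several GPT models;
# other models use different encodings, so token boundaries will differ).
enc = tiktoken.get_encoding("cl100k_base")

prompt = "How many Rs are in strawberry?"
token_ids = enc.encode(prompt)

# Decode each ID individually to see the text pieces the model works with.
pieces = [enc.decode([tid]) for tid in token_ids]
print(pieces)     # e.g. ['How', ' many', ' Rs', ' are', ' in', ' straw', 'berry', '?']
print(token_ids)  # the numeric token IDs the model actually processes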


Step 2: Converting Tokens to Numbers

Each token is assigned a numerical ID based on the model’s vocabulary. For instance, in a GPT-like model, the token IDs might look something like this:

Token    Token ID
How      2345
many     6782
Rs       8921
are      4321
in       9876
straw    2234
berry    5678
?        1357

These IDs are indices into the model’s vocabulary; internally, each one is mapped to a numerical embedding vector, which is the representation the model actually processes.
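Conceptually, the lookup is just a mapping from token text to ID, and inside the model each ID then selects a learned embedding vector. The toy sketch below reuses the illustrative vocabulary and IDs from the table above; real models have vocabularies of tens of thousands of entries and much larger embedding dimensions, and the embedding matrix here is random purely for illustration:

import numpy as np

# Toy vocabulary using the illustrative IDs from the table above.
vocab = {"How": 2345, "many": 6782, "Rs": 8921, "are": 4321,
         "in": 9876, "straw": 2234, "berry": 5678, "?": 1357}

tokens = ["How", "many", "Rs", "are", "in", "straw", "berry", "?"]
token_ids = [vocab[t] for t in tokens]
print(token_ids)  # [2345, 6782, 8921, 4321, 9876, 2234, 5678, 1357]

# Inside the model, each ID indexes a row of a learned embedding matrix.
embedding_matrix = np.random.rand(10000, 8)
embeddings = embedding_matrix[token_ids]
print(embeddings.shape)  # (8, 8): one 8-dimensional vector per token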


Step 3: Predicting the Next Token

Once the input is converted into numerical representations, the model processes these token embeddings using a transformer architecture. The key components involved are:

  • Self-Attention Mechanism: Determines which other tokens in the prompt are most relevant to each token.
  • Positional Encoding: Keeps track of the order of tokens in the sequence.
  • Feedforward Networks: Transform each token’s representation to support better predictions.

At each step, the model produces a probability distribution over its vocabulary and selects (or samples) the next token from it. However, because LLMs are trained on massive datasets of text patterns rather than explicit rules, they rely on statistical inference rather than step-by-step logical computation.
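The last part of each prediction step can be pictured as turning a vector of raw scores (logits) into a probability distribution and picking a token from it. Here is a toy sketch of that step; the candidate tokens and logit values are invented for illustration, and a real model scores its entire vocabulary rather than a handful of candidates:

import numpy as np

# Invented candidate next tokens and raw scores (logits), for illustration only.
candidates = ["2", "3", "4", "several"]
logits = np.array([1.8, 2.3, 0.4, -0.5])

# Softmax converts logits into a probability distribution over the candidates.
probs = np.exp(logits) / np.sum(np.exp(logits))
for token, p in zip(candidates, probs):
    print(f"{token!r}: {p:.2f}")

# Greedy decoding picks the most probable token; sampling strategies
# (temperature, top-p) instead draw from the distribution.
next_token = candidates[int(np.argmax(probs))]
print("Next token:", next_token)  # '3' with these made-up logits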


Step 4: Converting the Output Back to Text

Once the model predicts tokens, they are converted back to human-readable text using a process called detokenization. The numerical outputs are mapped back to words, subwords, or characters based on the model’s vocabulary.

For example, the LLM might return the following tokens as a prediction:

["There", "are", "4", "r", "s", "in", "strawberry", "."]

It would be detokenized to: “There are 4 Rs in strawberry.
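As a minimal sketch of that round trip, tiktoken’s decode() maps token IDs back to text (again assuming tiktoken is installed; the sentence is re-encoded first so the IDs are real rather than hard-coded):

import tiktoken  # assumes: pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Round-trip a sentence so the IDs come from encode() rather than being hard-coded.
output_ids = enc.encode("There are 3 Rs in strawberry.")

# decode() maps the numeric token IDs back to human-readable text.
print(enc.decode(output_ids))  # There are 3 Rs in strawberry.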


Why LLMs Struggle with Letter Counting: The ‘Strawberry’ Problem

A well-known issue with LLMs is their difficulty in correctly answering questions like:

"How many Rs are in strawberry?"

1. Tokenization Can Interfere with Accurate Counting

If a word is split into multiple tokens, the model doesn’t “see” it as a single unit. For example, strawberry might be tokenized as:

  • ["straw", "##berry"] (subword tokens)
  • ["stra", "wberry"] (alternative tokenization)

Since the model does not operate at the raw character level, it doesn’t naturally “see” all the instances of the letter ‘r’.
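You can inspect this directly. The sketch below (again assuming tiktoken is installed) prints the pieces a BPE tokenizer actually hands to the model; the exact split depends on the tokenizer and model, but the point is that none of the pieces is an individual letter:

import tiktoken  # assumes: pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# The model receives subword IDs, not characters, so no token corresponds
# to a single 'r'. The exact split varies by tokenizer and model.
ids = enc.encode("strawberry")
print([enc.decode([i]) for i in ids])  # e.g. ['str', 'aw', 'berry']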

2. LLMs Rely on Statistical Predictions, Not Step-by-Step Counting

LLMs generate responses based on probability distributions over text sequences they have seen during training. If the model has seen the phrase “There are 2 Rs in strawberry” more frequently than the correct statement “There are 3 Rs in strawberry,” it is likely to reproduce the incorrect pattern.

3. Lack of Explicit Logical Computation

Unlike a traditional program designed to count characters in a string, an LLM does not execute an explicit algorithm like:

word = "strawberry"
count = word.lower().count("r")
print(count)  # Correctly outputs 2

Instead, the LLM tries to infer the pattern based on training data rather than computing the exact count using logic.


Conclusion

The process of tokenization, numerical representation, and prediction underlies how LLMs process prompts and generate responses. However, because LLMs rely on statistical inference rather than explicit logical computation, they sometimes fail at tasks requiring exact counting. This is why simple-seeming questions, like counting letters in strawberry, can lead to incorrect answers. Understanding these limitations helps us improve LLMs through better prompt engineering, fine-tuning, and hybrid approaches that combine deep learning with explicit computation methods.
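As a simplified illustration of the hybrid idea, the counting itself can be delegated to explicit code while the model only phrases the answer. The function below is a hypothetical tool, not an actual API of any LLM provider; real systems wire this up through function/tool calling:

def count_letter(word: str, letter: str) -> int:
    """Hypothetical tool: deterministically count occurrences of a letter in a word."""
    return word.lower().count(letter.lower())

# The deterministic result can then be handed back to the model to phrase a reply.
count = count_letter("strawberry", "r")
print(f"There are {count} Rs in 'strawberry'.")  # There are 3 Rs in 'strawberry'.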

Chris Pietschmann is a Microsoft MVP, HashiCorp Ambassador, and Microsoft Certified Trainer (MCT) with 20+ years of experience designing and building Cloud & Enterprise systems. He has worked with companies of all sizes from startups to large enterprises. He has a passion for technology and sharing what he learns with others to help enable them to learn faster and be more productive.