Lesson 01 / 12 · 9 min · Free

Tokens, Context Windows, and How LLMs Actually Think

The mechanics behind why AI behaves the way it does — and what this means for every prompt you write

Every surprising thing an LLM does — every hallucination, every brilliant inference, every bizarre refusal — is explained by two concepts: tokenisation and probability. Once you understand these at the level of a builder rather than a user, you can predict model behaviour and design prompts that work the first time.

Tokens — what models actually read

1 token ≈ 4 characters of English text, or ¾ of a word. Code is more token-hungry: Python and TypeScript use more tokens per logical unit than prose.
Non-English languages use more tokens per word. Japanese, Arabic, and Chinese often require 2–4× as many tokens as equivalent English, which matters for cost at scale.
The tokeniser boundary affects model behaviour. "2024" might be one token, while "2 0 2 4" is at least four. Ask a model to count letters in a word and it may fail, because it sees token IDs rather than individual letters.
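To make the token boundary point concrete, here is a minimal sketch of a greedy longest-match tokeniser. It is a toy, not how any production tokeniser is implemented: `TOY_VOCAB` and `toy_tokenise` are hypothetical names, and the vocabulary is invented purely for illustration.

```python
# Toy greedy longest-match tokeniser. TOY_VOCAB is a made-up vocabulary for
# illustration only; real tokenisers learn tens of thousands of entries.
TOY_VOCAB = {"straw", "berry", "2024", "un", "believ", "able"}


def toy_tokenise(text: str, vocab: set[str] = TOY_VOCAB) -> list[str]:
    """Split text into the longest known chunks, falling back to single characters."""
    longest = max(len(entry) for entry in vocab)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible chunk first, then shrink.
        for size in range(min(longest, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if size == 1 or chunk in vocab:
                tokens.append(chunk)
                i += size
                break
    return tokens


print(toy_tokenise("strawberry"))    # ['straw', 'berry']
print(toy_tokenise("unbelievable"))  # ['un', 'believ', 'able']
print(toy_tokenise("2024"))          # ['2024']
print(toy_tokenise("Zyx"))           # ['Z', 'y', 'x'] (unknown word, split per character)
```

In this toy, the model would receive `['straw', 'berry']` and never the ten individual letters, which is the intuition behind letter-counting failures. The unknown name falling apart into single characters mirrors how rare words cost more tokens.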

Quick token estimation

Rule of thumb:
  1,000 words  ≈ 1,333 tokens
  1 A4 page    ≈ 600–800 tokens
  1 novel      ≈ 100,000–150,000 tokens

Claude 3.7 Sonnet context window: 200,000 tokens (≈ 150,000 words)
That is roughly a mid-sized codebase, a long legal contract, or a 200-page report in a single request.

Cost example (Claude 3.7 Sonnet, June 2026):
  Input:  $3.00 per million tokens
  Output: $15.00 per million tokens

  A 1,000-word prompt + 500-word response ≈ 1,333 input + 667 output tokens
  ≈ $0.004 + $0.010 = $0.014 per call
  At 10,000 calls/month: ≈ $140
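The estimate above can be sketched as a small calculator that prices input and output tokens separately. The constants are copied from the rule of thumb and price list in this lesson; the function names are hypothetical, and real token counts will differ from a word-based estimate.

```python
# Assumed constants, taken from the rule of thumb and price list in this lesson.
TOKENS_PER_WORD = 4 / 3        # 1 token is roughly 3/4 of a word
INPUT_USD_PER_MTOK = 3.00      # USD per million input tokens
OUTPUT_USD_PER_MTOK = 15.00    # USD per million output tokens


def estimate_tokens(words: int) -> int:
    """Rough word-count to token-count conversion."""
    return round(words * TOKENS_PER_WORD)


def estimate_cost(prompt_words: int, response_words: int) -> float:
    """Estimated USD cost of one call, pricing input and output separately."""
    input_tokens = estimate_tokens(prompt_words)
    output_tokens = estimate_tokens(response_words)
    return (input_tokens * INPUT_USD_PER_MTOK
            + output_tokens * OUTPUT_USD_PER_MTOK) / 1_000_000


per_call = estimate_cost(prompt_words=1000, response_words=500)
print(f"per call: ${per_call:.4f}")
print(f"10,000 calls: ${per_call * 10_000:.2f}")
```

Output tokens dominate the bill here even though the response is half the length of the prompt, because they are priced five times higher.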

💡

Models do not read words — they read chunks of characters

A token is roughly 3–4 characters of English text. Exact splits depend on the tokeniser: a long word such as "Unbelievable" is typically split into several tokens, "AI" is usually 1 token, and "Hello world" is usually 2 tokens. Models process tokens, not words, and this matters for three reasons: cost (APIs charge per token), speed (more tokens means slower responses), and edge cases (rare words get split into many tokens, which is why models sometimes mangle unusual names or code strings).

Context window — the model's working memory

The context window is everything the model can "see" when generating a response: your system prompt, the entire conversation history, any documents you pasted, and the model's own previous output. Nothing outside the context window exists for the model.

Position bias is real. Information at the start and end of the context tends to be used more reliably than information in the middle. Put your most important instructions at the top of the system prompt. Do not bury the critical constraint in the middle of a 5,000-word document.
Long contexts degrade quality before hitting the limit. Models can start to "lose track" of earlier material well before the window is full, and the exact point varies by model and task. As a conservative rule of thumb for production systems, keep context under 100k tokens.
Every token in context costs money, so manage it deliberately. Clear conversation history between unrelated tasks. Use summaries instead of full history for long-running agents.
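One way to manage context deliberately is to keep the system prompt and only as much recent history as fits a token budget. The sketch below assumes the 4-characters-per-token rule of thumb instead of a real tokeniser, and `trim_history` is a hypothetical helper, not part of any SDK.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate using the 4-characters-per-token rule of thumb."""
    return max(1, len(text) // 4)


def trim_history(system_prompt: str, messages: list[str], budget: int) -> list[str]:
    """Keep the system prompt plus the most recent messages that fit the budget."""
    remaining = budget - estimate_tokens(system_prompt)
    kept = []
    for message in reversed(messages):  # walk from newest to oldest
        cost = estimate_tokens(message)
        if cost > remaining:
            break  # everything older than this is dropped
        kept.append(message)
        remaining -= cost
    # Restore chronological order, with the system prompt first.
    return [system_prompt] + kept[::-1]


history = ["old " * 100, "recent question", "latest answer with detail"]
context = trim_history("You are a careful assistant.", history, budget=50)
print(len(context), "items kept")
```

A production agent would more likely replace the dropped messages with a summary than discard them outright, but the budget check is the same idea.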

Temperature and sampling — determinism vs creativity

Temperature 0 = most probable token every time (deterministic for practical purposes). Use for: code generation, data extraction, classification, anything where correctness matters more than variety.
Temperature 1 = standard randomness (the default). Use for: general conversation, writing, brainstorming. The model explores more of its probability distribution.
Temperature > 1 = high creativity with higher error rate. Use sparingly: fiction, poetry, wild brainstorms. Expect more mistakes. Not every API allows values above 1; some cap temperature at 1.
Extended thinking (Claude) is not steered by temperature. When Claude reasons step by step before answering, the thinking phase uses its own sampling strategy. At the time of writing, Anthropic's documentation describes extended thinking as incompatible with custom temperature values, so do not rely on temperature to shape the reasoning.
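The effect of temperature can be sketched with a temperature-scaled softmax over candidate token scores. This is the textbook formulation, not a description of any vendor's internal sampler; the candidate tokens and scores are invented for illustration.

```python
import math
import random


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        # Temperature 0 is treated as greedy: all probability on the top score.
        top = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == top else 0.0 for i in range(len(logits))]
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(tokens: list[str], logits: list[float], temperature: float) -> str:
    """Draw one token according to the temperature-adjusted probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return random.choices(tokens, weights=probs, k=1)[0]


tokens = ["Paris", "Lyon", "Berlin"]  # hypothetical candidate tokens
logits = [4.0, 2.0, 1.0]              # hypothetical model scores
for t in (0, 0.5, 1.0, 1.5):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

As the temperature rises, probability mass moves from the top candidate to the less likely ones, which is where both the variety and the extra mistakes come from.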

Why models hallucinate — the real explanation

Ask for citations and verify them independently. Models can generate plausible-looking but non-existent paper titles, URLs, and quotes with complete confidence.
Give the model the information and ask it to reason from it rather than recall facts. "Based on this document, what is the deadline?" is far more reliable. "What was the deadline in the 2023 contract?" is risky.
Extended thinking reduces but does not eliminate hallucination. The model reasons through its uncertainty before answering, but it can still be confidently wrong where it lacks the underlying information.
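The first two points above can be sketched as a prompt builder that supplies the source text, plus a cheap check that a quoted passage really appears in it. Both function names and the prompt wording are illustrative assumptions, not a prescribed template.

```python
def build_grounded_prompt(document: str, question: str) -> str:
    """Wrap a source document and a question so the model answers from the document."""
    return (
        "Answer the question using only the document below.\n"
        "If the document does not contain the answer, reply exactly: NOT FOUND.\n"
        "Quote the sentence you relied on.\n\n"
        f"<document>\n{document.strip()}\n</document>\n\n"
        f"Question: {question.strip()}"
    )


def quote_appears_in_source(quote: str, document: str) -> bool:
    """Check that a quoted passage occurs in the source, ignoring case and spacing."""
    def normalise(text: str) -> str:
        return " ".join(text.lower().split())
    return normalise(quote) in normalise(document)


contract = "The supplier must deliver all units by 14 March.\nLate delivery incurs a fee."
print(build_grounded_prompt(contract, "What is the deadline?"))
print(quote_appears_in_source("deliver all units by 14 March", contract))  # True
print(quote_appears_in_source("deliver all units by 1 April", contract))   # False
```

The quote check only confirms that the text exists in the source. It does not confirm that the model interpreted it correctly, so it complements human verification without replacing it.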

💡

Hallucination is the model completing a plausible pattern rather than retrieving a fact

If you ask "Who wrote Pride and Prejudice?" the model has seen the answer countless times in training: Jane Austen. High confidence, correct. If you ask "What is the population of the suburb of Kildare Heights, County Cork?" the model has seen almost no training data for this. But it has seen thousands of sentences of the form "The population of [place] is [number]." So it produces the most plausible-sounding number and presents it with the same confidence as the Jane Austen answer. The model has no reliable internal signal that separates "I am making this up" from "I learned this."

🎯

Try this

Open platform.openai.com/tokenizer (free, no login). Paste a paragraph from your most recent project description and note the token count. Then translate the same text into Japanese with a translation tool, paste it in, and compare. The difference is why multilingual AI apps cost more than English-only ones. Also note how code tokenises differently from prose. Exact counts differ between tokenisers, so treat the numbers as indicative for other models such as Claude.
