LLM Cheat Sheet

You don't need to understand how a car engine works to drive one, but knowing the difference between horsepower and torque helps when someone's trying to sell you something. This page is that, for AI.

These concepts come up constantly in articles, Reddit threads, and product announcements. Most explanations assume you're either a developer or completely new to computers. This one assumes you're neither.

Jump to: Tokens · Context Window · Parameters · Model Names · Reasoning & Thinking · Temperature · Embeddings · Tools · Fine-Tuning · Distillation · Quantization · Hallucination · RAG · Multimodal · LSP

Tokens

The most fundamental unit of AI text. A token isn't exactly a word and isn't exactly a letter. It's somewhere in between: roughly a syllable or a short word. "Unbelievable" might be three tokens. "Cat" is one. A space before a word often counts as part of the next token.

Why does this matter? Because models don't read text the way you do. They break it into tokens and write their replies one token at a time, predicting what comes next based on everything before it. When a model has a "token limit," that's how many of these chunks it can hold in its head at once, counting both what you send and what it writes back.

Tokens also determine cost. Most commercial AI APIs charge per token, so a long back-and-forth conversation is more expensive than a short one.

A useful rule of thumb for English text: 1,000 tokens is roughly 750 words.
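
If you like seeing the arithmetic, here is a rough Python sketch of that rule of thumb. The function names and the price are made up for illustration. Real tokenizers give exact counts, and every provider sets its own prices.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~750 words per 1,000 tokens rule of thumb."""
    words = len(text.split())
    return round(words * 1000 / 750)


def estimate_cost(tokens: int, price_per_million_tokens: float) -> float:
    """Cost in dollars for a token count (the price is a hypothetical input)."""
    return tokens / 1_000_000 * price_per_million_tokens


message = "word " * 750               # a stand-in for 750 words of English text
tokens = estimate_tokens(message)
print(tokens)                          # 1000
print(estimate_cost(tokens, 3.0))      # assumes a made-up price of $3 per million tokens
```

Treat the result as a ballpark only. Code, non-English text, and unusual words tend to split into more tokens than plain English prose.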

Context Window

Think of the context window as the model's working memory for a conversation. Everything inside it, your messages, the model's replies, any documents you've pasted in, is what the model can actually "see" when forming a response. Anything outside the window might as well not exist.

Early models had small context windows, around 4,000 tokens, which meant long conversations would cause the model to "forget" things you said at the beginning. Modern models can handle hundreds of thousands of tokens, which is enough to load an entire novel.
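
One simple way to picture that "forgetting" is a program that keeps only the newest messages that fit in the window. The sketch below is an assumption about how a basic chat app might do it, not a description of any particular product. The token estimate is the rough rule of thumb from the Tokens section, and the function names are invented for this example.

```python
def estimate_tokens(text: str) -> int:
    # Rough estimate: about 4 tokens for every 3 words of English.
    return round(len(text.split()) * 4 / 3)


def fit_to_window(messages: list[str], max_tokens: int) -> list[str]:
    """Keep the most recent messages that fit in the budget, dropping the oldest."""
    kept: list[str] = []
    used = 0
    for message in reversed(messages):      # walk from newest to oldest
        cost = estimate_tokens(message)
        if used + cost > max_tokens:
            break                           # everything older is "forgotten"
        kept.append(message)
        used += cost
    return list(reversed(kept))             # restore chronological order


conversation = [
    "My name is Sam and I am allergic to peanuts.",
    "Can you suggest a quick dinner?",
    "What about a dessert to go with it?",
]
print(fit_to_window(conversation, max_tokens=20))
```

With a budget of 20 tokens, only the last two messages survive, so the model would never see the peanut allergy. Real products often use smarter strategies, such as summarizing older messages instead of dropping them.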

A bigger context window isn't always better in practice. Models sometimes struggle to pay attention to things buried in the middle of a very long context, a phenomenon researchers call "lost in the middle." But for most everyday use, more context is a genuine improvement.

Parameters

When someone says a model has "7 billion parameters" or "70 billion parameters," they're describing the size of the model's internal knowledge structure. Parameters are the numerical weights stored in a neural network, adjusted during training until the model gets good at predicting text.

More parameters generally means more capability, but also more compute to run. A 70B model needs serious hardware. A 7B model can often run on a decent laptop or phone.

The relationship between parameters and intelligence isn't perfectly linear. Newer, more efficient training techniques let smaller models punch above their weight. A 2025 model with 8 billion parameters might outperform a 2023 model with 30 billion on many tasks.

Model Names: What Does "gemma4:9b" Actually Mean?

Model names look cryptic but follow a pattern once you know what to look for.

Take gemma4:9b as an example. "Gemma" is the model family, created by Google. "4" is the version number. "9b" means 9 billion parameters. So the full name tells you: fourth version of Google's Gemma model, with 9 billion parameters.

Another example: llama3.2:3b is Meta's Llama model, version 3.2, with 3 billion parameters.
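
The pattern is regular enough that a few lines of Python can pull the pieces apart. This is only a sketch: the pattern and function name are made up here, and it handles names shaped like the two examples above, not every naming scheme in the wild.

```python
import re

# Hypothetical pattern for names shaped like "family<version>:<size>b".
# Real model names vary a lot, so this only covers the two examples above.
NAME_PATTERN = re.compile(r"^(?P<family>[a-z]+)(?P<version>[\d.]+):(?P<size>[\d.]+)b$")


def parse_model_name(name: str) -> dict:
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized model name: {name}")
    return {
        "family": match["family"],
        "version": match["version"],
        "billions_of_parameters": float(match["size"]),
    }


print(parse_model_name("gemma4:9b"))
print(parse_model_name("llama3.2:3b"))
```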

Commercial models follow slightly different naming conventions. Anthropic uses names like claude-sonnet-4 where the word in the middle (Haiku, Sonnet, Opus) signals where the model sits on the speed/intelligence tradeoff. Haiku is fast and cheap. Opus is slower but more capable. Sonnet is the middle ground.

OpenAI uses gpt-4o where "o" stands for "omni," meaning the model handles text, images, and audio. The number is the generation.

When you see suffixes like -instruct or -chat, those indicate the model has been fine-tuned to follow instructions or hold conversations, as opposed to the raw "base" model that just predicts the next word without any guidance about being helpful.

Reasoning and Thinking

Standard AI models generate responses token by token, straight through. You ask something, and the model starts writing an answer immediately.

Reasoning models work differently. Before producing a final response, they generate a long internal "scratchpad" where they work through the problem, check their logic, and sometimes backtrack and try a different approach. You can often see this process exposed as a "thinking" section that appears before the answer.

The analogy is the difference between someone blurting out an answer and someone pausing to think it through on paper first. The second approach takes longer and uses more compute, but handles complex problems, especially math, logic, and multi-step planning, much more reliably.

Not every problem benefits from this. For a simple factual question, reasoning just adds latency. For "help me debug this tax situation" or "plan a trip across five countries," it can make a significant difference.

Temperature

Temperature controls how predictable or creative a model's outputs are. It's a number, typically between 0 and 1 (or sometimes 0 and 2 depending on the platform), that adjusts how the model weighs its options at each token.

At low temperatures, the model almost always picks the most probable next token. Outputs are consistent and conservative. Ask the same question twice and you'll get nearly identical answers.

At high temperatures, the model is more likely to pick surprising or less common tokens. Outputs get more varied and creative, but also more prone to errors and weird tangents.
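
Under the hood, a common way to apply temperature is to divide the model's raw scores for each candidate token by the temperature before turning them into probabilities. The sketch below shows that idea with three invented candidates and invented scores. A real model scores tens of thousands of tokens at every step.

```python
import math


def softmax_with_temperature(scores: list[float], temperature: float) -> list[float]:
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)                            # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for three candidate next tokens.
candidates = ["blue", "cloudy", "falling"]
scores = [2.0, 1.0, 0.1]

for temperature in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(scores, temperature)
    print(temperature, [round(p, 2) for p in probs])
```

At 0.2 nearly all of the probability lands on the top candidate. At 2.0 the three options end up much closer together, which is where the extra variety comes from. A temperature of exactly 0 would divide by zero in this sketch; systems that accept 0 typically treat it as "always pick the top token."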

For extracting factual information or writing code, lower temperatures work better. For brainstorming, creative writing, or generating diverse options, higher temperatures can help. Most chat products set a moderate temperature and don't expose the setting to users.

Embeddings

Embeddings are a way of turning meaning into coordinates.

Here's the core idea. Imagine a map, except instead of physical locations, every word, sentence, or document gets plotted based on what it means. Things with similar meanings end up close together on this map. "Furious" and "livid" are neighbors. "Furious" and "spreadsheet" are on opposite ends of the continent.

In practice this map has hundreds of dimensions instead of two (you can't visualize it, but the math works the same way), and each point on it is represented as a long list of numbers. That list of numbers is the embedding.

What makes this genuinely useful is that the relationships between concepts get preserved spatially. The famous demonstration of this: if you take the embedding for "king," subtract the embedding for "man," and add the embedding for "woman," the nearest embedding to where you land is the one for "queen." The model hasn't been told that kings and queens are gender counterparts. It learned the relationship purely from seeing how those words appear in context, and the geometry captures it.
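
Here is the king and queen arithmetic as a toy Python example. The vectors are hand-made two-dimensional stand-ins chosen so the example works. Real embeddings are learned from data, have hundreds of dimensions, and give messier results.

```python
import math

# Hand-made 2-D "embeddings" purely for illustration: [royalty, femininity].
# Real embeddings are learned from data and have hundreds of dimensions.
vectors = {
    "king":   [0.9, 0.1],
    "queen":  [0.9, 0.9],
    "man":    [0.1, 0.1],
    "woman":  [0.1, 0.9],
    "prince": [0.8, 0.1],
}


def cosine_similarity(a, b):
    """Similarity of direction between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# king - man + woman
target = [k - m + w for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])]

# Find the closest remaining word, skipping the three words used in the sum.
candidates = {w: v for w, v in vectors.items() if w not in ("king", "man", "woman")}
closest = max(candidates, key=lambda w: cosine_similarity(target, candidates[w]))
print(closest)  # queen
```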

You've already been using embeddings without knowing it. When Spotify's radio feature surfaces a song you've never heard but somehow fits your mood perfectly, that's embeddings. Spotify maps songs into a space based on tempo, key, energy, instrumentation, and hundreds of other properties. Your listening history creates a kind of gravitational center in that space, and the recommendation engine finds songs that orbit nearby. It's not matching "you liked this rock song, here's another rock song." It's matching the actual sonic fingerprint at a level below genre labels.

The same principle powers the "Find similar" button on Pinterest, the "More like this" feature on streaming platforms, and search tools that understand what you meant rather than just matching your exact words. Type "something to watch when you're sad and want to feel understood" into a good search and it returns relevant results not because those words appear in any movie description, but because the semantic space of that query is close to the semantic space of certain films.

For AI specifically, developers use embeddings to let models search through large collections of text, like a company's entire documentation or years of email archives, without having to load all of it into the context window at once. The system converts a query into an embedding, finds the documents with nearby embeddings, and pulls only those into context so the model can answer the question. This is the mechanism behind most "chat with your documents" products.

Tools (Also Called Function Calling)

A base language model knows a lot, but it's stuck in the past. Its training data has a cutoff date, it can't browse the internet, and it can't take actions in the world.

Tools change that. When a model has access to tools, it can call out to external systems mid-conversation. Common examples: web search, running a calculator, reading a file, checking a calendar, or sending an email.

The model decides when to use a tool based on the conversation. If you ask it what the weather is today, and it has access to a weather API, it will call that API, get the current data, and weave the result into its reply. From the outside it looks seamless.
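
Behind that seamless look there is usually a small loop: the model replies with a structured request naming a tool, the surrounding program runs it, and the result goes back into the conversation. The sketch below fakes both the model and the weather service, so every name and value in it is hypothetical. Real tool-calling formats differ between providers.

```python
import json


def get_weather(city: str) -> str:
    # Hypothetical stand-in for a real weather API call.
    return f"18 degrees and cloudy in {city}"


# Registry of tools the model is allowed to call.
TOOLS = {"get_weather": get_weather}


def fake_model(user_message: str) -> str:
    # Stand-in for a real model: it "decides" to request a tool
    # by replying with a small JSON object instead of plain text.
    return json.dumps({"tool": "get_weather", "arguments": {"city": "Lisbon"}})


def handle_turn(user_message: str) -> str:
    reply = json.loads(fake_model(user_message))
    tool = TOOLS[reply["tool"]]              # look up the requested tool
    result = tool(**reply["arguments"])      # run it with the model's arguments
    # A real system would send `result` back to the model to phrase the answer.
    return f"Right now it's {result}."


print(handle_turn("What's the weather in Lisbon today?"))
```

The important design point is that the model never runs anything itself. It only asks, and the program around it decides which tools exist and whether to run them.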

This is what makes AI "agents" possible. An agent is a model that can use tools repeatedly, in sequence, to accomplish a goal that requires multiple steps. Give it access to your browser, a code interpreter, and your file system and it can do things that would have seemed like science fiction a few years ago.

Fine-Tuning

Pre-training is when a model learns from a massive dataset of internet text, books, and code. Fine-tuning happens after that, using a much smaller, curated dataset to adjust the model's behavior for a specific purpose.

A general-purpose model fine-tuned on medical records becomes better at clinical terminology. One fine-tuned on customer service transcripts gets better at de-escalating complaints. One fine-tuned on your company's internal docs learns your vocabulary and product names.

Fine-tuning doesn't reprogram the model from scratch. It nudges the existing weights in a direction. Think of it like breaking in a new employee who already has general skills; the fine-tuning is the onboarding.

Distillation

Training a large model is enormously expensive. Running one is too. Distillation is a technique for building a smaller, cheaper model that behaves a lot like a larger one — by training it specifically to imitate the bigger model's outputs rather than learning from raw data from scratch.

The intuition: instead of having a student learn from textbooks alone, you have them learn from a teacher who's already mastered the material. The teacher model generates answers, explanations, and reasoning traces. The student model learns to reproduce that behavior. The result is a smaller model that captures a surprising amount of the teacher's capability.
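
One common way to measure "how well is the student imitating the teacher" is to compare the probabilities each model assigns to the next token. The sketch below uses cross-entropy for that comparison, with invented probabilities. It shows only the scoring step, not a full training loop, and other distillation setups train directly on text the teacher wrote instead.

```python
import math


def cross_entropy(teacher_probs, student_probs):
    """How far the student's predictions are from the teacher's (lower is better)."""
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))


# Made-up probabilities over three candidate next tokens.
teacher = [0.7, 0.2, 0.1]
student_before = [0.34, 0.33, 0.33]   # untrained student: nearly uniform guesses
student_after = [0.65, 0.25, 0.10]    # after training to imitate the teacher

print(cross_entropy(teacher, student_before))
print(cross_entropy(teacher, student_after))
```

The second number is smaller than the first. Training repeatedly adjusts the student's weights to push that number down across many examples.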

This is why capable small models have gotten dramatically better in recent years. A 7 billion parameter model today often outperforms much larger models from a few years ago — partly because it's been distilled from something bigger.

What you trade away: specialized knowledge at the edges, performance on very hard problems, and the raw ceiling of what the model can do on its best day. What you keep: general fluency, common-task performance, and a model you can actually run on a laptop.

	Large "teacher" model	Distilled "student" model
Size	Hundreds of billions of parameters	Billions, sometimes millions
Cost to run	Expensive — needs powerful hardware	Cheap — runs on consumer hardware
General capability	High	Close, for most everyday tasks
Edge-case performance	Strong	Degrades on very hard or niche problems
Speed	Slower	Faster

Quantization

Model weights are stored as numbers, and like most numbers in computing, you can choose how much precision to store them with. A 32-bit floating point number is very precise. A 4-bit integer is much rougher. Quantization is the process of converting a model's weights to lower precision to make it smaller and faster.

The analogy: imagine you have a high-resolution photo that's 50MB. If you reduce it to a JPEG at 80% quality, it drops to 5MB and looks almost identical. At 20% quality it's 500KB but looks noticeably degraded. Quantization does the same thing to model weights — rounding values to fewer bits, accepting some precision loss in exchange for a dramatically smaller file and faster inference.
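
The rounding itself can be sketched in a few lines. This example uses one of the simplest possible schemes, a single shared scale factor mapping each weight to an integer between -127 and 127, with made-up weights. Real quantization formats are more elaborate, so read it as an illustration of the idea only.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Map floats onto integers in [-127, 127] using one shared scale factor."""
    # Assumes at least one weight is non-zero, otherwise the scale would be 0.
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) for w in weights], scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.8123, -0.4571, 0.0100, 0.2999]   # made-up example weights
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

print(quantized)
for original, approx in zip(weights, restored):
    print(f"{original:+.4f} -> {approx:+.4f}")
```

Each restored value is close to the original but not identical. That small gap, repeated across billions of weights, is the precision loss described above.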

This is what makes it possible to run capable models on your laptop or phone. A model that requires 80GB of GPU memory at full precision might fit in about 10GB at 4-bit precision.

The tradeoff is accuracy. For most tasks and most users, the difference between full precision and 8-bit quantization is basically undetectable. As you push further — to 4-bit or 2-bit — you start to see more degradation, especially on tasks requiring precise arithmetic, careful reasoning, or uncommon knowledge.

Precision	Bits per weight	Approx. size vs. full	Quality loss	Runs on
FP32 (full)	32	1× (baseline)	None	High-end server GPUs
FP16 / BF16	16	~0.5×	Negligible	Most modern GPUs
INT8 / Q8	8	~0.25×	Minimal	Consumer GPUs, fast laptops
Q4	4	~0.125×	Noticeable on hard tasks	Laptops, phones
Q2	2	~0.06×	Significant	Ultra-low-power devices
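
The size column in the table is simple arithmetic: number of parameters times bits per weight. The sketch below works it out for a hypothetical 7-billion-parameter model. It counts only the weights themselves, so actually running a model needs some extra memory on top.

```python
def model_size_gb(parameters: float, bits_per_weight: int) -> float:
    """Approximate storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return parameters * bits_per_weight / 8 / 1e9


# A hypothetical 7-billion-parameter model at each precision in the table.
for label, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("Q4", 4), ("Q2", 2)]:
    print(f"{label}: {model_size_gb(7e9, bits):.1f} GB")
```

For this example that works out to 28 GB at FP32 and 3.5 GB at Q4, which is the difference between needing a server and fitting on a laptop.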

When you download a model through something like Ollama, names like q4_K_M or q8_0 in the filename are telling you the quantization level. The number after the "q" is roughly the bits per weight, so higher numbers mean higher quality and larger file size.

Hallucination

When a model confidently states something false, that's called a hallucination. It doesn't mean the model is lying or malfunctioning in the traditional sense. It's a consequence of how these models work: they generate plausible-sounding text, and sometimes plausible-sounding text happens to be wrong.

Hallucinations are more common when the model is operating near the edge of its training data, asked about obscure topics, or pushed to fill in details it doesn't actually have. Asking for specific citations, recent statistics, or niche technical facts is where you're most likely to see them.

The practical response: treat AI outputs the way you'd treat a confident but occasionally unreliable friend who reads a lot. Useful for direction, worth verifying before you act on anything important.

RAG (Retrieval-Augmented Generation)

RAG is a technique for giving a model access to specific information without fine-tuning it. Instead of baking knowledge into the model's weights, you retrieve relevant documents at query time and include them in the context window.

When you ask a question, the system first searches a database of documents (using embeddings to find relevant ones), grabs the most relevant chunks, and adds them to the prompt before sending it to the model. The model then answers based on both its training and those retrieved documents.
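
Those steps can be sketched end to end in a small Python example. To stay self-contained it swaps in a crude stand-in for embeddings (plain word counts), which only matches shared vocabulary and not meaning. The documents, function names, and prompt wording are all invented for illustration.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Stand-in for a real embedding model: just counts words.
    # Real embeddings capture meaning, not only shared vocabulary.
    return Counter(text.lower().replace("?", "").replace(".", "").split())


def similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# A made-up document database.
documents = [
    "Refunds are issued within 14 days of a return.",
    "Our office is closed on public holidays.",
    "Shipping to Canada takes five business days.",
]


def build_prompt(question: str, top_k: int = 1) -> str:
    query = embed(question)
    # Rank documents by similarity to the question and keep the best ones.
    ranked = sorted(documents, key=lambda d: similarity(query, embed(d)), reverse=True)
    context = "\n".join(ranked[:top_k])
    return f"Answer using this context:\n{context}\n\nQuestion: {question}"


print(build_prompt("How long does shipping to Canada take?"))
```

The printed prompt contains only the shipping document plus the question, and that combined text is what would be sent to the model.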

This is how most "chat with your documents" products work. It's also easier to keep current than fine-tuning, because you can update the document database without retraining the model.

Multimodal

A multimodal model can work with more than one type of data. Most early language models only handled text. A multimodal model might accept images, audio, video, or documents alongside text and reason across all of them at once.

When you paste a screenshot into Claude or ChatGPT and ask a question about it, you're using multimodal capabilities. When a model can generate images in addition to describing them, that's also multimodal, just in the output direction.

The category is expanding quickly. Models that can see, hear, and read simultaneously are starting to feel less like chatbots and more like general-purpose assistants.

LSP (Language Server Protocol)

If you've ever used VS Code and noticed that it gives you autocomplete suggestions, highlights errors before you run your code, or lets you jump to a function definition with a single click, those features are typically powered by something called the Language Server Protocol.

LSP is an agreement between code editors and the tools that understand code. Instead of every editor having to build deep knowledge of every programming language from scratch, LSP provides a common language for editors and "language servers" to talk to each other. The editor says "what's at this cursor position?" and the language server responds with whatever it knows: type information, documentation, errors, suggestions.
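
For the curious, this is roughly what one of those questions looks like on the wire. LSP messages are JSON with a small Content-Length header in front. The sketch below builds a "hover" request, which is the protocol's way of asking what is at a given position. The file path, position, and helper function name are made up for illustration.

```python
import json


def encode_lsp_message(payload: dict) -> bytes:
    """Frame a JSON payload for LSP: a Content-Length header, a blank line, then the JSON body."""
    body = json.dumps(payload).encode("utf-8")
    header = f"Content-Length: {len(body)}\r\n\r\n".encode("ascii")
    return header + body


# "What's at this cursor position?" expressed as a hover request.
# The file path and position are made up for illustration.
hover_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "textDocument/hover",
    "params": {
        "textDocument": {"uri": "file:///home/me/project/app.py"},
        "position": {"line": 11, "character": 4},   # zero-based line and column
    },
}

print(encode_lsp_message(hover_request).decode("utf-8"))
```

The language server would answer with a message in the same format, carrying the matching id and whatever it knows about that spot in the file.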

Where AI comes in: some AI coding tools plug into this same channel. GitHub Copilot, for example, ships a language server that editors such as Neovim can talk to; it passes requests to a language model and sends the results back. Other tools, such as Cursor and Continue, hook in mainly through the editor's own extension system instead (Cursor is itself a fork of VS Code). Either way, when you're typing and a grey suggestion appears, the editor has asked an AI-backed service for a completion, and the service returned one.

This is also why a tool built as a language server can work across many editors. The protocol is standardized. Build one AI-backed language server, and it can plug into VS Code, Neovim, Emacs, and anything else that speaks LSP, without rebuilding the integration for each one.

For most people, LSP is invisible infrastructure. You experience the effects — better autocomplete, smarter suggestions, context-aware completions — without ever knowing what's coordinating behind the scenes. But if you're setting up an AI coding tool and you see references to language servers or LSP configuration, this is what they're referring to.