<!DOCTYPE html>
<html>
<head>
<style>.ck-link { text-decoration: underline; }</style>
<style>@media only screen { .email * { word-break: break-word; } }
@media screen and (max-width: 384px) { .mail-message-content { width: 414px !important; } }
@media only screen and (max-width:600px) { .ck-mobile-font-size { font-size:50px !important; } }</style>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body style="margin:0;padding:0;-webkit-font-smoothing:antialiased;-ms-text-size-adjust:100%;-webkit-text-size-adjust:100%">
<div class="email" style="margin-left:18px;margin-right:18px">
<table cellpadding="0" cellspacing="0" style="border-collapse:separate;mso-table-lspace:0pt;mso-table-rspace:0pt;width:100%;margin:0 auto"><tbody><tr><td style="vertical-align:top">
<div>
<p style="margin:1em 0;font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,Oxygen-Sans,Ubuntu,Cantarell,'Helvetica Neue',sans-serif;color:#2d2d2f;font-size:16px;line-height:1.5">Hey Andrea,</p>
<p style="margin:1em 0;font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,Oxygen-Sans,Ubuntu,Cantarell,'Helvetica Neue',sans-serif;color:#2d2d2f;font-size:16px;line-height:1.5">I've been thinking a lot more about the fundamentals of LLMs - and it dawned on me recently that most devs don't even know what tokens are.</p>
<p style="margin:1em 0;font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,Oxygen-Sans,Ubuntu,Cantarell,'Helvetica Neue',sans-serif;color:#2d2d2f;font-size:16px;line-height:1.5">If that's you - no shame. Here's an email to get you up to speed.</p>
<p style="margin:1em 0;font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,Oxygen-Sans,Ubuntu,Cantarell,'Helvetica Neue',sans-serif;color:#2d2d2f;font-size:16px;line-height:1.5">Tokens are the fundamental building blocks that help Large Language Models (LLMs) process text. Understanding them is essential, especially since you're billed based on token usage.</p>
<h2 style="font-weight:bold;font-style:normal;font-size:1em;margin:0;font-size:1.5em;margin:1em 0;font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,Oxygen-Sans,Ubuntu,Cantarell,'Helvetica Neue',sans-serif;color:#000000;line-height:1.5">What Are Tokens?</h2>
<p style="margin:1em 0;font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,Oxygen-Sans,Ubuntu,Cantarell,'Helvetica Neue',sans-serif;color:#2d2d2f;font-size:16px;line-height:1.5">A token is a chunk of text that the tokenizer recognizes, and each one is represented by a number. Those numbers, not the raw characters, are what the LLM actually "thinks" in. The process of converting text into tokens is called <strong>encoding</strong>.</p>
<p style="margin:1em 0;font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,Oxygen-Sans,Ubuntu,Cantarell,'Helvetica Neue',sans-serif;color:#2d2d2f;font-size:16px;line-height:1.5">The tokenization process works in two parts:</p>
<ol style="margin:1em 0;margin-left:1em;padding:0;font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,Oxygen-Sans,Ubuntu,Cantarell,'Helvetica Neue',sans-serif;color:#2d2d2f;font-size:16px;line-height:1.5;list-style-position:outside !important">
<li style="padding:0 0 0 1em;margin:0;margin-left:1em">The tokenizer splits text into tokens it recognizes</li>
<li style="padding:0 0 0 1em;margin:0;margin-left:1em">These tokens are converted into numbers</li>
</ol>
<p style="margin:1em 0;font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,Oxygen-Sans,Ubuntu,Cantarell,'Helvetica Neue',sans-serif;color:#2d2d2f;font-size:16px;line-height:1.5"><img width="400px" src="https://res.cloudinary.com/total-typescript/image/upload/v1758555807/aihero.dev/emails/tokens/encoding.png" alt="Encoding" style="border:0 none;display:block;height:auto;line-height:100%;outline:none;-webkit-text-decoration:none;text-decoration:none;max-width:100%"></p>
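<p style="margin:1em 0;font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,Oxygen-Sans,Ubuntu,Cantarell,'Helvetica Neue',sans-serif;color:#2d2d2f;font-size:16px;line-height:1.5">Here's a minimal Python sketch of those two steps. The vocabulary and its IDs are made up for illustration, and the "take the longest match" rule is a simplification rather than how any particular production tokenizer works.</p>
<pre class="hljs"><code># Toy vocabulary (hypothetical): token text mapped to token ID.
VOCAB = {"the": 0, "cat": 1, "sat": 2, " ": 3, "c": 4, "a": 5, "t": 6, "s": 7}

def split_into_tokens(text, vocab):
    """Step 1: greedily take the longest vocabulary entry at each position."""
    longest = max(len(piece) for piece in vocab)
    pieces = []
    position = 0
    while position != len(text):
        for size in range(longest, 0, -1):
            candidate = text[position:position + size]
            if candidate in vocab:
                pieces.append(candidate)
                position += len(candidate)
                break
        else:
            raise ValueError(f"no token covers {text[position]!r}")
    return pieces

def encode(text, vocab):
    """Step 2: look up the number for each piece."""
    return [vocab[piece] for piece in split_into_tokens(text, vocab)]

print(split_into_tokens("the cat sat", VOCAB))  # ['the', ' ', 'cat', ' ', 'sat']
print(encode("the cat sat", VOCAB))             # [0, 3, 1, 3, 2]
</code></pre>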
<p style="margin:1em 0;font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,Oxygen-Sans,Ubuntu,Cantarell,'Helvetica Neue',sans-serif;color:#2d2d2f;font-size:16px;line-height:1.5">Decoding is the reverse process:</p>
<ol style="margin:1em 0;margin-left:1em;padding:0;font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,Oxygen-Sans,Ubuntu,Cantarell,'Helvetica Neue',sans-serif;color:#2d2d2f;font-size:16px;line-height:1.5;list-style-position:outside !important">
<li style="padding:0 0 0 1em;margin:0;margin-left:1em">Numbers are converted back into text tokens</li>
<li style="padding:0 0 0 1em;margin:0;margin-left:1em">The tokens are joined together to form the output</li>
</ol>
<p style="margin:1em 0;font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,Oxygen-Sans,Ubuntu,Cantarell,'Helvetica Neue',sans-serif;color:#2d2d2f;font-size:16px;line-height:1.5"><img width="400px" src="https://res.cloudinary.com/total-typescript/image/upload/v1758555807/aihero.dev/emails/tokens/decoding.png" alt="Decoding" style="border:0 none;display:block;height:auto;line-height:100%;outline:none;-webkit-text-decoration:none;text-decoration:none;max-width:100%"></p>
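<p style="margin:1em 0;font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,Oxygen-Sans,Ubuntu,Cantarell,'Helvetica Neue',sans-serif;color:#2d2d2f;font-size:16px;line-height:1.5">Decoding can be sketched the same way. Again, the ID-to-text table below is a made-up toy, not a real model's vocabulary.</p>
<pre class="hljs"><code># Toy vocabulary (hypothetical): token ID mapped to token text.
ID_TO_TEXT = {0: "the", 1: "cat", 2: "sat", 3: " "}

def decode(token_ids, id_to_text):
    # Step 1: convert each number back into its text token.
    pieces = [id_to_text[token_id] for token_id in token_ids]
    # Step 2: join the text tokens to form the output.
    return "".join(pieces)

print(decode([0, 3, 1, 3, 2], ID_TO_TEXT))  # the cat sat
</code></pre>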
<h2 style="font-weight:bold;font-style:normal;font-size:1em;margin:0;font-size:1.5em;margin:1em 0;font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,Oxygen-Sans,Ubuntu,Cantarell,'Helvetica Neue',sans-serif;color:#000000;line-height:1.5">The LLM Process Flow</h2>
<p style="margin:1em 0;font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,Oxygen-Sans,Ubuntu,Cantarell,'Helvetica Neue',sans-serif;color:#2d2d2f;font-size:16px;line-height:1.5">The complete LLM process looks like this:</p>
<ol style="margin:1em 0;margin-left:1em;padding:0;font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,Oxygen-Sans,Ubuntu,Cantarell,'Helvetica Neue',sans-serif;color:#2d2d2f;font-size:16px;line-height:1.5;list-style-position:outside !important">
<li style="padding:0 0 0 1em;margin:0;margin-left:1em">Tokenizer encodes your input text into tokens</li>
<li style="padding:0 0 0 1em;margin:0;margin-left:1em">LLM processes your tokens</li>
<li style="padding:0 0 0 1em;margin:0;margin-left:1em">LLM produces output tokens</li>
<li style="padding:0 0 0 1em;margin:0;margin-left:1em">Output tokens are decoded back into readable text</li>
</ol>
<p style="margin:1em 0;font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,Oxygen-Sans,Ubuntu,Cantarell,'Helvetica Neue',sans-serif;color:#2d2d2f;font-size:16px;line-height:1.5"><img width="400px" src="https://res.cloudinary.com/total-typescript/image/upload/v1758555807/aihero.dev/emails/tokens/llm-process.png" alt="LLM Process Flow" style="border:0 none;display:block;height:auto;line-height:100%;outline:none;-webkit-text-decoration:none;text-decoration:none;max-width:100%"></p>
<p style="margin:1em 0;font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,Oxygen-Sans,Ubuntu,Cantarell,'Helvetica Neue',sans-serif;color:#2d2d2f;font-size:16px;line-height:1.5">To clarify, input tokens include:</p>
<ul style="margin:1em 0;margin-left:1em;padding:0;font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,Oxygen-Sans,Ubuntu,Cantarell,'Helvetica Neue',sans-serif;color:#2d2d2f;font-size:16px;line-height:1.5;list-style-position:outside !important">
<li style="padding:0 0 0 1em;margin:0;margin-left:1em">Your conversation history with the LLM</li>
<li style="padding:0 0 0 1em;margin:0;margin-left:1em">System prompts</li>
<li style="padding:0 0 0 1em;margin:0;margin-left:1em">Tool definitions</li>
</ul>
<p style="margin:1em 0;font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,Oxygen-Sans,Ubuntu,Cantarell,'Helvetica Neue',sans-serif;color:#2d2d2f;font-size:16px;line-height:1.5">Output tokens are what the LLM sends back as a response.</p>
<p style="margin:1em 0;font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,Oxygen-Sans,Ubuntu,Cantarell,'Helvetica Neue',sans-serif;color:#2d2d2f;font-size:16px;line-height:1.5">You're billed for both input and output tokens, typically at different rates, and output tokens usually cost more. One way to save money is to design your prompts to generate fewer output tokens.</p>
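<p style="margin:1em 0;font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,Oxygen-Sans,Ubuntu,Cantarell,'Helvetica Neue',sans-serif;color:#2d2d2f;font-size:16px;line-height:1.5">To make the billing concrete, here's a rough estimate in Python. The prices and token counts are invented for the example, so check your provider's pricing page for real numbers.</p>
<pre class="hljs"><code># Hypothetical prices in dollars per million tokens.
INPUT_PRICE_PER_MILLION = 3.00
OUTPUT_PRICE_PER_MILLION = 15.00

def estimate_cost(input_tokens, output_tokens):
    input_cost = input_tokens / 1_000_000 * INPUT_PRICE_PER_MILLION
    output_cost = output_tokens / 1_000_000 * OUTPUT_PRICE_PER_MILLION
    return input_cost + output_cost

# Input tokens cover the system prompt, tool definitions and conversation history.
input_tokens = 400 + 600 + 1_000  # hypothetical counts for those three parts

verbose = estimate_cost(input_tokens, output_tokens=1_000)
concise = estimate_cost(input_tokens, output_tokens=200)
print(round(verbose, 4), round(concise, 4))  # 0.021 0.009
</code></pre>
<p style="margin:1em 0;font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,Oxygen-Sans,Ubuntu,Cantarell,'Helvetica Neue',sans-serif;color:#2d2d2f;font-size:16px;line-height:1.5">With these made-up rates, the same request costs less than half as much when the response is 200 tokens instead of 1,000.</p>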
<h2 style="font-weight:bold;font-style:normal;font-size:1em;margin:0;font-size:1.5em;margin:1em 0;font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,Oxygen-Sans,Ubuntu,Cantarell,'Helvetica Neue',sans-serif;color:#000000;line-height:1.5">How Tokens Are Created</h2>
<p style="margin:1em 0;font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,Oxygen-Sans,Ubuntu,Cantarell,'Helvetica Neue',sans-serif;color:#2d2d2f;font-size:16px;line-height:1.5">A tokenizer's vocabulary is built from a large corpus of text - similar to what's used to train the LLM itself. Let's imagine a tiny corpus consisting of just one sentence: "the cat sat on the mat".</p>
<p style="margin:1em 0;font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,Oxygen-Sans,Ubuntu,Cantarell,'Helvetica Neue',sans-serif;color:#2d2d2f;font-size:16px;line-height:1.5"><img width="400px" src="https://res.cloudinary.com/total-typescript/image/upload/v1758555808/aihero.dev/emails/tokens/tokenizer-training.png" alt="Tokenization" style="border:0 none;display:block;height:auto;line-height:100%;outline:none;-webkit-text-decoration:none;text-decoration:none;max-width:100%"></p>
<p style="margin:1em 0;font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,Oxygen-Sans,Ubuntu,Cantarell,'Helvetica Neue',sans-serif;color:#2d2d2f;font-size:16px;line-height:1.5">First, all individual characters are extracted:</p>
<pre class="hljs"><code>T H E space C A T space S A T space O N space T H E space M A T
</code></pre>
<p style="margin:1em 0;font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,Oxygen-Sans,Ubuntu,Cantarell,'Helvetica Neue',sans-serif;color:#2d2d2f;font-size:16px;line-height:1.5">Each unique character becomes its own token in the vocabulary.</p>
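<p style="margin:1em 0;font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,Oxygen-Sans,Ubuntu,Cantarell,'Helvetica Neue',sans-serif;color:#2d2d2f;font-size:16px;line-height:1.5">As a quick sketch, this first step could look like the following in Python. Numbering the characters in order of first appearance is just a choice made for the example.</p>
<pre class="hljs"><code>corpus = "the cat sat on the mat"

# Give each distinct character an ID, in order of first appearance.
base_vocab = {}
for character in corpus:
    if character not in base_vocab:
        base_vocab[character] = len(base_vocab)

print(base_vocab)
# {'t': 0, 'h': 1, 'e': 2, ' ': 3, 'c': 4, 'a': 5, 's': 6, 'o': 7, 'n': 8, 'm': 9}
</code></pre>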
<p style="margin:1em 0;font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,Oxygen-Sans,Ubuntu,Cantarell,'Helvetica Neue',sans-serif;color:#2d2d2f;font-size:16px;line-height:1.5">Next, common groupings of characters are identified:</p>
<ul style="margin:1em 0;margin-left:1em;padding:0;font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,Oxygen-Sans,Ubuntu,Cantarell,'Helvetica Neue',sans-serif;color:#2d2d2f;font-size:16px;line-height:1.5;list-style-position:outside !important">
<li style="padding:0 0 0 1em;margin:0;margin-left:1em">"TH" appears in "the" (twice)</li>
<li style="padding:0 0 0 1em;margin:0;margin-left:1em">"HE" appears in "the" (twice)</li>
<li style="padding:0 0 0 1em;margin:0;margin-left:1em">"AT" appears in "cat", "sat", and "mat"</li>
</ul>
<p style="margin:1em 0;font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,Oxygen-Sans,Ubuntu,Cantarell,'Helvetica Neue',sans-serif;color:#2d2d2f;font-size:16px;line-height:1.5">Each of these groupings also gets assigned its own token.</p>
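<p style="margin:1em 0;font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,Oxygen-Sans,Ubuntu,Cantarell,'Helvetica Neue',sans-serif;color:#2d2d2f;font-size:16px;line-height:1.5">Finding those groupings is mostly a counting exercise. This sketch counts adjacent character pairs inside each word. Not letting pairs cross a space is an assumption made to keep things simple.</p>
<pre class="hljs"><code>from collections import Counter

corpus = "the cat sat on the mat"

# Count every pair of neighbouring characters within each word.
pair_counts = Counter()
for word in corpus.split(" "):
    for left, right in zip(word, word[1:]):
        pair_counts[left + right] += 1

for pair in ("th", "he", "at"):
    print(pair, pair_counts[pair])
# th 2
# he 2
# at 3
</code></pre>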
<p style="margin:1em 0;font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,Oxygen-Sans,Ubuntu,Cantarell,'Helvetica Neue',sans-serif;color:#2d2d2f;font-size:16px;line-height:1.5">Then, existing groupings are merged into longer ones - like "TH" + "E" creating "THE" (the word "the"), which gets its own token.</p>
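<p style="margin:1em 0;font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,Oxygen-Sans,Ubuntu,Cantarell,'Helvetica Neue',sans-serif;color:#2d2d2f;font-size:16px;line-height:1.5">One common way to do this kind of repeated grouping is byte-pair-encoding-style merging: find the most frequent adjacent pair, merge it into one token, and repeat. Here's a toy Python version, assuming that approach. The function names are made up for the example, and real tokenizers add plenty of details this leaves out.</p>
<pre class="hljs"><code>from collections import Counter

def count_pairs(words):
    counts = Counter()
    for symbols in words:
        for left, right in zip(symbols, symbols[1:]):
            counts[(left, right)] += 1
    return counts

def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` with one merged symbol."""
    merged = []
    index = 0
    while index != len(symbols):
        if tuple(symbols[index:index + 2]) == pair:
            merged.append(pair[0] + pair[1])
            index += 2
        else:
            merged.append(symbols[index])
            index += 1
    return merged

def train(corpus, merge_count):
    words = [list(word) for word in corpus.split(" ")]
    merges = []
    for _ in range(merge_count):
        counts = count_pairs(words)
        if not counts:
            break
        best = max(counts, key=counts.get)  # ties go to the pair seen first
        merges.append(best)
        words = [merge_pair(symbols, best) for symbols in words]
    return merges, words

merges, words = train("the cat sat on the mat", merge_count=3)
print(merges)  # [('a', 't'), ('t', 'h'), ('th', 'e')]
print(words)   # [['the'], ['c', 'at'], ['s', 'at'], ['o', 'n'], ['the'], ['m', 'at']]
</code></pre>
<p style="margin:1em 0;font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,Oxygen-Sans,Ubuntu,Cantarell,'Helvetica Neue',sans-serif;color:#2d2d2f;font-size:16px;line-height:1.5">After three merges, "the" is a single token, built from "th" plus "e".</p>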
<h2 style="font-weight:bold;font-style:normal;font-size:1em;margin:0;font-size:1.5em;margin:1em 0;font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,Oxygen-Sans,Ubuntu,Cantarell,'Helvetica Neue',sans-serif;color:#000000;line-height:1.5">Vocabulary Size Matters</h2>
<p style="margin:1em 0;font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,Oxygen-Sans,Ubuntu,Cantarell,'Helvetica Neue',sans-serif;color:#2d2d2f;font-size:16px;line-height:1.5">The goal is to create a large vocabulary of tokens because larger vocabularies can split words into fewer tokens, making processing more efficient.</p>
<p style="margin:1em 0;font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,Oxygen-Sans,Ubuntu,Cantarell,'Helvetica Neue',sans-serif;color:#2d2d2f;font-size:16px;line-height:1.5"><img width="400px" src="https://res.cloudinary.com/total-typescript/image/upload/v1758555808/aihero.dev/emails/tokens/vocabulary-size.png" alt="Vocabulary Size" style="border:0 none;display:block;height:auto;line-height:100%;outline:none;-webkit-text-decoration:none;text-decoration:none;max-width:100%"></p>
<p style="margin:1em 0;font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,Oxygen-Sans,Ubuntu,Cantarell,'Helvetica Neue',sans-serif;color:#2d2d2f;font-size:16px;line-height:1.5">For example, a vocabulary size of 1,000 tokens might split "understanding" into 5 tokens. A vocabulary size of 50,000 tokens might split it into 3 tokens, and a vocabulary size of 200,000 tokens might split it into 2 tokens.</p>
<p style="margin:1em 0;font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,Oxygen-Sans,Ubuntu,Cantarell,'Helvetica Neue',sans-serif;color:#2d2d2f;font-size:16px;line-height:1.5">Fewer tokens for the same text means less for the model to process and, since you're billed per token, less to pay for.</p>
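<p style="margin:1em 0;font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,Oxygen-Sans,Ubuntu,Cantarell,'Helvetica Neue',sans-serif;color:#2d2d2f;font-size:16px;line-height:1.5">You can see the effect with a toy tokenizer. The three vocabularies below are tiny, hand-picked stand-ins for small, medium and large ones, and the longest-match splitting is a simplification.</p>
<pre class="hljs"><code>def tokenize(word, vocab):
    """Greedy longest-match split; single characters are always allowed."""
    pieces = []
    position = 0
    while position != len(word):
        for end in range(len(word), position, -1):
            candidate = word[position:end]
            if candidate in vocab or len(candidate) == 1:
                pieces.append(candidate)
                position = end
                break
    return pieces

# Hypothetical vocabularies: each one extends the previous with longer entries.
small = {"und", "er", "st", "and", "ing"}
medium = small | {"under", "stand"}
large = medium | {"understand"}

for name, vocab in (("small", small), ("medium", medium), ("large", large)):
    pieces = tokenize("understanding", vocab)
    print(name, len(pieces), pieces)
# small 5 ['und', 'er', 'st', 'and', 'ing']
# medium 3 ['under', 'stand', 'ing']
# large 2 ['understand', 'ing']
</code></pre>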
<h2 style="font-weight:bold;font-style:normal;font-size:1em;margin:0;font-size:1.5em;margin:1em 0;font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,Oxygen-Sans,Ubuntu,Cantarell,'Helvetica Neue',sans-serif;color:#000000;line-height:1.5">Handling Unusual Words</h2>
<p style="margin:1em 0;font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,Oxygen-Sans,Ubuntu,Cantarell,'Helvetica Neue',sans-serif;color:#2d2d2f;font-size:16px;line-height:1.5">The tokenizer struggles with uncommon words. For example, "O Frabjous Day" from Lewis Carroll's poem "Jabberwocky" gets split into many tokens because "Frabjous" is a made-up word that rarely appears in the training corpus.</p>
<p style="margin:1em 0;font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,Oxygen-Sans,Ubuntu,Cantarell,'Helvetica Neue',sans-serif;color:#2d2d2f;font-size:16px;line-height:1.5"><img width="400px" src="https://res.cloudinary.com/total-typescript/image/upload/v1758555807/aihero.dev/emails/tokens/unusual-words.png" alt="Unusual Words" style="border:0 none;display:block;height:auto;line-height:100%;outline:none;-webkit-text-decoration:none;text-decoration:none;max-width:100%"></p>
<p style="margin:1em 0;font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,Oxygen-Sans,Ubuntu,Cantarell,'Helvetica Neue',sans-serif;color:#2d2d2f;font-size:16px;line-height:1.5">The tokenizer turns the phrase into 7 tokens - more than we'd expect from only 15 characters.</p>
<h2 style="font-weight:bold;font-style:normal;font-size:1em;margin:0;font-size:1.5em;margin:1em 0;font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,Oxygen-Sans,Ubuntu,Cantarell,'Helvetica Neue',sans-serif;color:#000000;line-height:1.5">Final Thoughts</h2>
<p style="margin:1em 0;font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,Oxygen-Sans,Ubuntu,Cantarell,'Helvetica Neue',sans-serif;color:#2d2d2f;font-size:16px;line-height:1.5">I hope that helps demystify tokens a bit. I found the <a href="https://7751c435.click.kit-mail3.com/p9u2lvnpmmtqu35z7zkfqhpqekqzehrhkk55z/reh8h9umz3me60t2/aHR0cHM6Ly90aWt0b2tlbml6ZXIudmVyY2VsLmFwcC8=" style="color:#0875c1" url-id="1848120104">tiktokenizer playground</a> really useful for understanding this stuff.</p>
<p style="margin:1em 0;font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,Oxygen-Sans,Ubuntu,Cantarell,'Helvetica Neue',sans-serif;color:#2d2d2f;font-size:16px;line-height:1.5">Let me know if you have any questions, and what you'd like me to cover next.</p>
<p style="margin:1em 0;font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,Oxygen-Sans,Ubuntu,Cantarell,'Helvetica Neue',sans-serif;color:#2d2d2f;font-size:16px;line-height:1.5">Matt</p>
</div>
</td></tr></tbody></table>
</div>
</body>
</html>