Tokenization and Subword Tokenization in Generative AI: A Complete Guide

Karthikeyan Dhanakotti
Sep 8, 2024

Introduction to Tokenization

Tokenization is the process of breaking down text into smaller units called tokens. These tokens can be words, subwords, or even characters, depending on the approach. Tokenization is fundamental in Natural Language Processing (NLP) and Generative AI because it converts raw text into a form that models can process and understand.

Example:

  • Text: “Generative AI is fascinating!”
  • Tokens: ['Generative', 'AI', 'is', 'fascinating', '!']
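
As a quick illustration, here is a minimal word-level tokenizer sketch in Python, assuming a simple regex that separates words from punctuation (real tokenizers are considerably more sophisticated):

```python
import re

def word_tokenize(text):
    # Match runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

print(word_tokenize("Generative AI is fascinating!"))
# ['Generative', 'AI', 'is', 'fascinating', '!']
```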

Subword Tokenization

Subword tokenization breaks words into smaller, meaningful units. This is especially useful for handling rare or unknown words: instead of mapping them to an out-of-vocabulary token, the tokenizer decomposes them into pieces that are more likely to appear in the model's vocabulary.

Example:

  • Word: “fascination”
  • Subword Tokens: ['fas', 'cina', 'tion'] (the exact split depends on the tokenizer's learned vocabulary)
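
To make the mechanics concrete, here is a minimal sketch of greedy longest-match subword segmentation (the scheme WordPiece-style tokenizers use), with a toy vocabulary chosen purely to reproduce the split above:

```python
def subword_tokenize(word, vocab):
    # Greedily take the longest vocabulary entry matching the remaining text.
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:           # No piece matches: fall back to an unknown token.
            return ["[UNK]"]
        tokens.append(word[start:end])
        start = end
    return tokens

vocab = {"fas", "cina", "tion"}    # toy vocabulary, for illustration only
print(subword_tokenize("fascination", vocab))
# ['fas', 'cina', 'tion']
```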

Subword tokenization is important in Generative AI because it lets models represent rare or unseen words efficiently. Instead of growing the vocabulary to cover every possible word, the model works with a smaller, fixed-size vocabulary while still being able to represent, and generate, any word as a sequence of subword units.
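
In practice, production models rely on learned subword tokenizers such as Byte-Pair Encoding (BPE), WordPiece, or SentencePiece. As a usage sketch, assuming the Hugging Face transformers library is installed, you can inspect how GPT-2's byte-level BPE tokenizer splits text:

```python
from transformers import AutoTokenizer

# Load GPT-2's byte-level BPE tokenizer (downloads the vocabulary on first use).
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Generative AI is fascinating!"
tokens = tokenizer.tokenize(text)   # subword strings ('Ġ' marks a leading space)
ids = tokenizer.encode(text)        # integer IDs the model actually consumes
print(tokens)
print(ids)
```

Common words typically survive as single tokens, while rare words are split into several pieces; the exact splits depend on the vocabulary learned during the tokenizer's training.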
