Tokenization and Subword Tokenization in Generative AI: A Complete Guide

Karthikeyan Dhanakotti
6 min read · Sep 8, 2024

Introduction to Tokenization

Tokenization is the process of breaking down text into smaller units called tokens. These tokens can be words, subwords, or even characters, depending on the approach. Tokenization is fundamental in Natural Language Processing (NLP) and Generative AI because it converts raw text into a form that models can process and understand.

Example:

  • Text: “Generative AI is fascinating!”
  • Tokens: [‘Generative’, ‘AI’, ‘is’, ‘fascinating’, ‘!’]
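To make this concrete, here is a minimal Python sketch of word-level tokenization using the standard re module. Production tokenizers are considerably more sophisticated, but the core idea is the same:

```python
import re

def simple_tokenize(text: str) -> list[str]:
    """Split text into word and punctuation tokens."""
    # \w+ matches runs of word characters; [^\w\s] matches a single
    # punctuation mark, so "fascinating!" becomes two tokens.
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("Generative AI is fascinating!"))
# ['Generative', 'AI', 'is', 'fascinating', '!']
```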

Subword Tokenization

Subword tokenization breaks words into smaller, meaningful units. This is especially useful for handling rare or unknown words, since they can be decomposed into pieces that are more likely to appear in the model’s vocabulary.

Example:

  • Word: “fascination”
  • Subword Tokens: [‘fas’, ‘cina’, ‘tion’]
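In practice, the splits are not hand-chosen; they come from a vocabulary learned from data. Here is a brief sketch, assuming the Hugging Face transformers library is installed, that shows how GPT-2’s byte-level BPE tokenizer handles the same word. The exact pieces depend on the merges learned during training, so they differ from the illustration above:

```python
from transformers import AutoTokenizer

# GPT-2 ships with a byte-level BPE vocabulary learned from web text.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

print(tokenizer.tokenize("fascination"))
# e.g. ['fasc', 'ination'] -- the exact pieces depend on the learned merges
```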

Subword tokenization is important in Generative AI because it allows models to represent rare or unseen words efficiently. Instead of growing the vocabulary to cover every possible word, a model can work with a compact subword vocabulary and still represent, and generate, virtually any text.
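Continuing the sketch above, even an invented word maps onto known subword pieces, so the model never needs a catch-all “unknown” token. (“unfascinatable” here is a made-up word, chosen purely for illustration.)

```python
# An out-of-vocabulary word still decomposes into familiar subwords.
print(tokenizer.tokenize("unfascinatable"))
# e.g. ['unf', 'asc', 'inat', 'able'] -- pieces vary with the vocabulary

# The whole vocabulary stays small relative to the space of possible words.
print(tokenizer.vocab_size)  # 50257 for GPT-2
```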


Written by Karthikeyan Dhanakotti

AI/ML & Data Science Leader @ Microsoft, Mentor/Speaker, AI/ML Enthusiast | Microsoft Certified.
