
[docs] tokenizer summary #43965

Open
stevhliu wants to merge 2 commits into huggingface:main from stevhliu:tokenizer-summary

Conversation

@stevhliu (Member) commented:

updates tokenizer_summary.md:

  • condenses the intro and subword tokenization sections, since this doc is mostly about BPE/Unigram/WordPiece
  • removes some redundant introductory motivation content and links to the course for more info
  • places BPE/Unigram/WordPiece closer to the top since they are the most relevant, and moves word/character-level tokenization closer to the bottom since it's not as important
  • adds relevant videos from the course for BPE/Unigram/WordPiece

@HuggingFaceDocBuilderDev commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

stevhliu requested a review from itazap on February 12, 2026 at 22:27.
More specifically, we will look at the three main types of tokenizers used in 🤗 Transformers: [Byte-Pair Encoding
(BPE)](#byte-pair-encoding), [WordPiece](#wordpiece), and [SentencePiece](#sentencepiece), and show examples
of which tokenizer type is used by which model.
For instance, `annoyingly` might be split into `["annoying", "ly"]` or `["annoy", "ing", "ly"]` depending on the vocabulary. The model never encounters a truly unknown word because it can decompose it into known subwords.

Collaborator commented:

I completely understand what you mean here, but by definition the `unk_token` is reserved for truly unknown tokens/words, so this could be a bit misleading. Maybe we can rephrase it to say:

"This decomposition allows the model to represent unseen words using known subwords."


### WordPiece

WordPiece is the subword tokenization algorithm used for [BERT](model_doc/bert), [DistilBERT](model_doc/distilbert), and [Electra](model_doc/electra). The algorithm was outlined in [Japanese and Korean Voice Search (Schuster et al., 2012)](https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf) and is very similar to BPE.

Collaborator commented:

Could be nice to leave the links to the models that use each algorithm, so there's an example of each in case you want to load the tokenizer and see how it would tokenize a given string :)
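
As a sketch of that suggestion, loading a tokenizer and inspecting how it splits a string could look like the following; the `bert-base-uncased` checkpoint is assumed here as a WordPiece example:

```python
from transformers import AutoTokenizer

# BERT uses a WordPiece tokenizer; continuation pieces carry a "##" prefix.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# A word absent from the vocabulary is split into known pieces,
# e.g. something like ["annoying", "##ly"], depending on the vocabulary.
print(tokenizer.tokenize("annoyingly"))
```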

@itazap (Collaborator) left a comment:

Thanks for the change! I made a comment on some wording, and it might be nice to restore the references (links) to the models that use each type of algorithm?
