
[docs] tokenizer summary #43965

Open
stevhliu wants to merge 2 commits into huggingface:main from stevhliu:tokenizer-summary

Conversation

@stevhliu (Member) commented:

updates tokenizer_summary.md:

  • condenses the intro and subword tokenization sections, since this doc is mostly about BPE/Unigram/WordPiece
  • removes some redundant introductory motivation content and links to the course for more info
  • places BPE/Unigram/WordPiece closer to the top since they are the most relevant, and moves word/character-level tokenization closer to the bottom since it's not as important
  • adds relevant videos from the course for BPE/Unigram/WordPiece

@HuggingFaceDocBuilderDev commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

stevhliu requested a review from itazap on February 12, 2026 at 22:27.
More specifically, we will look at the three main types of tokenizers used in 🤗 Transformers: [Byte-Pair Encoding
(BPE)](#byte-pair-encoding), [WordPiece](#wordpiece), and [SentencePiece](#sentencepiece), and show examples
of which tokenizer type is used by which model.
For instance, `annoyingly` might be split into `["annoying", "ly"]` or `["annoy", "ing", "ly"]` depending on the vocabulary. The model never encounters a truly unknown word because it can decompose it into known subwords.

Collaborator commented:

I completely understand what you mean here, but by definition the `unk_token` is reserved for truly unknown tokens/words, so this could be a bit misleading. Maybe we can rephrase it to say:

"This decomposition allows the model to represent unseen words using known subwords."


### WordPiece

WordPiece is the subword tokenization algorithm used for [BERT](model_doc/bert), [DistilBERT](model_doc/distilbert), and [Electra](model_doc/electra). The algorithm was outlined in [Japanese and Korean Voice Search (Schuster et al., 2012)](https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf) and is very similar to BPE.

Collaborator commented:

Could be nice to leave the links to the models that use each algorithm, so there's an example of each in case you want to load the tokenizer and see how it would tokenize a given string :)
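
As a sketch of that suggestion, loading a tokenizer and inspecting how it splits a string could look like the following; the `bert-base-uncased` checkpoint is assumed here as a WordPiece example:

```python
from transformers import AutoTokenizer

# BERT uses a WordPiece tokenizer; continuation pieces carry a "##" prefix.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# A word absent from the vocabulary is split into known pieces,
# e.g. something like ["annoying", "##ly"], depending on the vocabulary.
print(tokenizer.tokenize("annoyingly"))
```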

@itazap (Collaborator) left a comment:

Thanks for the change! I made a comment on some wording, and it might be nice to restore the references (links) to the models that use each type of algorithm?
