> More specifically, we will look at the three main types of tokenizers used in 🤗 Transformers: [Byte-Pair Encoding (BPE)](#byte-pair-encoding), [WordPiece](#wordpiece), and [SentencePiece](#sentencepiece), and show examples of which tokenizer type is used by which model.
>
> For instance, `annoyingly` might be split into `["annoying", "ly"]` or `["annoy", "ing", "ly"]` depending on the vocabulary. The model never encounters a truly unknown word because it can decompose it into known subwords.
I completely understand what you mean here, but by definition the `unk_token` is reserved for truly "unknown" tokens/words, so this could be a bit misleading. Maybe we can rephrase it to say:

"This decomposition allows the model to represent unseen words using known subwords."
> ### WordPiece
>
> WordPiece is the subword tokenization algorithm used for [BERT](model_doc/bert), [DistilBERT](model_doc/distilbert), and [Electra](model_doc/electra). The algorithm was outlined in [Japanese and Korean …]
It could be nice to keep the links to models that use each algorithm, so there is an example of each in case you want to load the tokenizer and see how it would tokenize a given string :)
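For example, a quick sketch of what that could look like (GPT-2, BERT, and XLNet as illustrative checkpoints for BPE, WordPiece, and SentencePiece respectively):

```python
from transformers import AutoTokenizer

text = "Don't you love 🤗 Transformers?"

# Example checkpoints, one per algorithm family; any model using the
# corresponding tokenizer type would work just as well.
for checkpoint in ["gpt2", "bert-base-uncased", "xlnet-base-cased"]:
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    print(checkpoint, tokenizer.tokenize(text))
```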
itazap left a comment:
Thanks for the change! I made a comment on some wording; also, it might be nice to restore the references (links) to models that use each type of algorithm.
updates (tokenizer_summary.md)