unigram

The 1/3 million most frequent words, all lowercase, with counts.

kitoken

Fast and versatile tokenizer for language models, supporting BPE, Unigram and WordPiece tokenization.