@travismorton
Last active December 15, 2022 06:22
Whisper Tokenizer with Elixir Tokenizers
languages = %{
  en: "english",
  zh: "chinese",
  de: "german",
  es: "spanish",
  ru: "russian",
  ko: "korean",
  fr: "french",
  ja: "japanese",
  pt: "portuguese",
  tr: "turkish",
  pl: "polish",
  ca: "catalan",
  nl: "dutch",
  ar: "arabic",
  sv: "swedish",
  it: "italian",
  id: "indonesian",
  hi: "hindi",
  fi: "finnish",
  vi: "vietnamese",
  he: "hebrew",
  uk: "ukrainian",
  el: "greek",
  ms: "malay",
  cs: "czech",
  ro: "romanian",
  da: "danish",
  hu: "hungarian",
  ta: "tamil",
  no: "norwegian",
  th: "thai",
  ur: "urdu",
  hr: "croatian",
  bg: "bulgarian",
  lt: "lithuanian",
  la: "latin",
  mi: "maori",
  ml: "malayalam",
  cy: "welsh",
  sk: "slovak",
  te: "telugu",
  fa: "persian",
  lv: "latvian",
  bn: "bengali",
  sr: "serbian",
  az: "azerbaijani",
  sl: "slovenian",
  kn: "kannada",
  et: "estonian",
  mk: "macedonian",
  br: "breton",
  eu: "basque",
  is: "icelandic",
  hy: "armenian",
  ne: "nepali",
  mn: "mongolian",
  bs: "bosnian",
  kk: "kazakh",
  sq: "albanian",
  sw: "swahili",
  gl: "galician",
  mr: "marathi",
  pa: "punjabi",
  si: "sinhala",
  km: "khmer",
  sn: "shona",
  yo: "yoruba",
  so: "somali",
  af: "afrikaans",
  oc: "occitan",
  ka: "georgian",
  be: "belarusian",
  tg: "tajik",
  sd: "sindhi",
  gu: "gujarati",
  am: "amharic",
  yi: "yiddish",
  lo: "lao",
  uz: "uzbek",
  fo: "faroese",
  ht: "haitian creole",
  ps: "pashto",
  tk: "turkmen",
  nn: "nynorsk",
  mt: "maltese",
  sa: "sanskrit",
  lb: "luxembourgish",
  my: "myanmar",
  bo: "tibetan",
  tl: "tagalog",
  mg: "malagasy",
  as: "assamese",
  tt: "tatar",
  haw: "hawaiian",
  ln: "lingala",
  ha: "hausa",
  ba: "bashkir",
  jw: "javanese",
  su: "sundanese"
}
special_tokens =
  Enum.concat([
    ["<|endoftext|>", "<|startoftranscript|>"],
    languages |> Map.keys() |> Enum.map(fn x -> "<|#{Atom.to_string(x)}|>" end),
    [
      "<|translate|>",
      "<|transcribe|>",
      "<|startoflm|>",
      "<|startofprev|>",
      "<|nospeech|>",
      "<|notimestamps|>"
    ]
  ])
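A quick sanity check on the list built above (a sketch; the count follows directly from the code: 2 leading tokens, one tag per entry in the 99-language map, and 6 task/control tokens):

```elixir
# 2 + 99 + 6 special tokens in total.
107 = length(special_tokens)

# The two fixed leading tokens come first, before the language tags.
["<|endoftext|>", "<|startoftranscript|>"] = Enum.take(special_tokens, 2)
```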
{:ok, whisper_tokenizer} =
  Tokenizers.Tokenizer.from_pretrained("gpt2", additional_special_tokens: special_tokens)

# encode/2 returns {:ok, %Tokenizers.Encoding{}}, so pattern-match the tuple
# rather than inspecting it directly.
{:ok, encoded} =
  Tokenizers.Tokenizer.encode(
    whisper_tokenizer,
    "<|startoftranscript|><|en|><|transcribe|>hello this is a test<|endoftext|>"
  )

IO.inspect(encoded)
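To check that the special tokens survive a round trip, the encoding can be decoded back to text. This is a sketch against the elixir-nx/tokenizers API (`Tokenizers.Encoding.get_ids/1`, `Tokenizers.Tokenizer.decode/2`); verify the names and options against your installed version:

```elixir
# Re-encode the same prompt and unwrap the {:ok, encoding} tuple.
{:ok, encoding} =
  Tokenizers.Tokenizer.encode(
    whisper_tokenizer,
    "<|startoftranscript|><|en|><|transcribe|>hello this is a test<|endoftext|>"
  )

# Pull out the token ids and decode them back to a string. Depending on the
# library version, decode may skip special tokens by default; pass
# skip_special_tokens: false if you want to see the <|...|> markers.
ids = Tokenizers.Encoding.get_ids(encoding)
{:ok, text} = Tokenizers.Tokenizer.decode(whisper_tokenizer, ids)
IO.inspect(text)
```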