@santhoshtr
Created February 22, 2024 03:54
Malayalam tokens in Gemma 7B Model
"ന്": 26465,
"ക്": 28298,
"ത്": 31691,
"ക്ക": 41627,
"ന്ന": 45828,
"▁പ": 46110,
"▁ക": 49867,
"തി": 50292,
"്ട": 52078,
"ും: 55511,
"സ്": 56250,
"▁വ": 56408,
"പ്": 60447,
"ങ്": 60890,
"▁മ": 62282,
"▁ന": 64327,
"▁അ": 65024,
"യി": 68592,
"ച്": 70049,
"്‍": 70738,
"രി": 70996,
"▁സ": 73204,
"്യ": 73559,
"ങ്ങ": 73915,
"റ്": 74401,
"ുന്ന": 76180,
"മാ": 78652,
"്ര": 79257,
"ട്ട": 82176,
"ുക": 83986,
"ത്ത": 86186,
"പ്പ": 87194,
"ില": 88121,
"ത്തി": 89496,
"ച്ച": 89730,
"ിയ": 94368,
"▁എ": 96956,
"ണ്": 98322,
"▁ച": 99683,
"രു": 100092,
"ണ്ട": 100237,
"▁ത": 100796,
"്ല": 101778,
"ിക്ക": 105910,
"ടെ": 106304,
"▁ആ": 109671,
"റ്റ": 111519,
"വി": 112478,
"▁ഇ": 120320,
"ാര": 127822,
"യാ": 128513,
"ള്": 130931,
"റെ": 131583,
"മ്": 132672,
"ള്ള": 134097,
"ാന": 137146,
"▁പ്ര": 137788,
"ിക": 137920,
"▁നി": 139359,
"ത്ര": 142560,
"ഞ്": 143526,
"ങ്ങള": 145145,
"ിൽ": 147455,
"▁ഉ": 148449,
"ല്ല": 149222,
"ുന്നു": 149712,
"ങ്ങൾ": 150055,
"▁ഒ": 150183,
"രാ": 150918,
"ുടെ": 151975,
"▁ശ": 154875,
"ന്റെ": 154932,
"▁വി": 155253,
"ര്‍": 155410,
"ക്ക്": 158982,
"െയ": 161407,
"▁ബ": 161725,
"ാല": 164592,
"▁സ്": 164843,
"▁ജ": 167775,
"ുള്ള": 168905,
"ക്ഷ": 169697,
"റി": 172984,
"നി": 173576,
"ടു": 174769,
"ദ്": 178393,
"ന്‍": 179785,
"വാ": 181397,
"▁ല": 182503,
"താ": 183347,
"ടി": 188302,
"ാൻ": 197408,
"ല്‍": 203007,
"തു": 203401,
"ഞ്ഞ": 203465,
"ിച്ച": 204224,
"ായ": 204632,
"ങ്ക": 204818,
"ണം": 205798,
"ംബ": 206216,
"▁ഗ": 206313,
"ായി": 209033,
"▁ഒരു": 212230,
"യും: 213025,
"ാം: 213341,
"▁എന്ന": 216063,
"ന്ന്": 218016,
"സി": 218161,
"▁ദ": 218553,
"രിക്ക": 218841,
"▁ര": 220732,
"വല": 222775,
"മ്പ": 228544,
"ുറ": 232172,
"ുകൾ": 234779,
"▁ചെയ": 234848,
"്": 236027,
"ി": 236484,
"ക": 236585,
"ന": 236672,
"ു": 236782,
"ത": 236850,
"ാ": 236871,
"ര": 237095,
"യ": 237134,
"പ": 237277,
"ട": 237404,
"വ": 237418,
"ം": 237428,
"മ": 237496,
"ല": 237516,
"െ": 237674,
"സ": 237758,
"റ": 237827,
"ച": 238066,
"ണ": 238224,
"ള": 238263,
"ങ": 238383,
"ോ": 238418,
"േ": 238541,
"അ": 238879,
"ർ": 239088,
"ദ": 239097,
"ൽ": 239247,
"ീ": 239274,
"ശ": 239422,
"ഷ": 239491,
"ഗ": 239505,
"ൾ": 239511,
"ബ": 239580,
"ൂ": 239657,
"ൻ": 239663,
"എ": 239801,
"ജ": 239844,
"ഞ": 240064,
"ഹ": 240140,
"ആ": 240232,
"ധ": 240233,
"ഡ": 240346,
"ഇ": 240378,
"ഭ": 240468,
"ൊ": 240609,
"ഴ": 240698,
"ഒ": 241040,
"ഉ": 241057,
"ൈ": 241121,
"ഫ": 241135,
"ഥ": 241536,
"ഖ": 241745,
"ൃ": 242235,
"ഓ": 242954,
"ൺ": 243092,
"ഈ": 243140,
"ഏ": 243434,
"ഘ": 243857,
"ൗ": 243985,
"ഠ": 245096,
"ഐ": 245644,
"ഃ": 247042,
"ഛ": 247396,
"ഔ": 248104,
"ൌ": 248305,
"ഊ": 249142,
"൦": 249408,
"൧": 250067,
"ൎ": 250737,
"൨": 251155,
"ഢ": 251402,
"ഀ": 251432,
"൯": 251810,
"൪": 252004,
"൭": 252332,
"൫": 252474,
"൩": 252534,
"൬": 252804,
"ഋ": 253533,
"൮": 254195,
@santhoshtr
Author

This list was generated as follows:

Download tokenizer.json from https://huggingface.co/google/gemma-7b/blob/main/tokenizer.json and then run:

ugrep "▁?[ഀ-ൿ]+" tokenizer.json
