@yakovsh
Last active May 23, 2022 19:43
Removing Vowels from Hebrew Unicode Text
/*
* A question that recently came up is how to remove vowel points from Hebrew text in Unicode
* (or from text in any similar language). The Hebrew Unicode chart shows that the vowel points all
* fall between 0x0591 (1425) and 0x05C7 (1479). Note that this range also holds the cantillation
* marks and a few punctuation characters such as maqaf (0x05BE) and sof pasuq (0x05C3), so they are
* removed as well. With this range and JavaScript's charCodeAt function, stripping them out is simple:
*
* Live demo is available here:
* https://jsfiddle.net/js0ge7gn/
*/
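function stripVowels(rawString)
{
    var newString = '';
    // Declare j locally so the loop does not create an implicit global
    for (var j = 0; j < rawString.length; j++) {
        var code = rawString.charCodeAt(j);
        // Keep every character outside the range 1425-1479 (0x0591-0x05C7)
        if (code < 1425 || code > 1479) {
            newString = newString + rawString.charAt(j);
        }
    }
    return newString;
}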
/* @shimondoodkin suggested a much shorter way to do this */
function stripVowels2(rawString) {
    return rawString.replace(/[\u0591-\u05C7]/g, "");
}
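
The range `\u0591-\u05C7` also contains four characters that are not combining marks: maqaf (U+05BE), paseq (U+05C0), sof pasuq (U+05C3) and nun hafukha (U+05C6), so both functions above delete them too. If that is not wanted, one option is to list only the combining marks. This is a minimal sketch along those lines; the names `HEBREW_MARKS` and `stripMarksOnly` are made up for illustration and are not part of the gist.

```javascript
// Hypothetical variant: remove only the combining points and accents of the
// Hebrew block, leaving maqaf, paseq, sof pasuq and nun hafukha in place.
const HEBREW_MARKS = /[\u0591-\u05BD\u05BF\u05C1\u05C2\u05C4\u05C5\u05C7]/g;

function stripMarksOnly(rawString) {
  return rawString.replace(HEBREW_MARKS, "");
}

// alef + segol + tav + maqaf: the segol is removed, the maqaf is kept
console.log(stripMarksOnly("\u05D0\u05B6\u05EA\u05BE") === "\u05D0\u05EA\u05BE"); // true
```

Whether the punctuation should stay depends on the use case; the original range is the simpler choice when the goal is plain letters only.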
@dir01

dir01 commented Apr 10, 2020

Golang example based on other comments on this gist: https://play.golang.org/p/B9VRe-L5lS3

@drjrodriguez

drjrodriguez commented Feb 27, 2021

When stripping nequdot from Torah passages (and essentially trying to modernize the spelling for myself), I realized that modern Hebrew actually writes in more vavs (and maybe yuds too). So I wrote this variation, which checks whether the o / u sound is already represented by a letter before deleting the vowel point. If it isn't represented, I add a vav.

`function stripVowels(rawString){
  var newString = '';
  //Declare j locally so the loop does not create an implicit global
  for(var j=0; j<rawString.length; j++) {
    var code = rawString.charCodeAt(j);
    var next = rawString[j+1];
    //If it has an O that isn't otherwise represented, then add a vav
    if(code == 1465 && next != "ו" && next != "א" && next != "ה"){
      newString += "ו";
    }
    //If it has a U, also add a vav
    else if(code == 1467){
      newString += "ו";
    }
    //Turn Hebrew hyphen into space
    else if(code == 1470) newString += " ";
    //Keep everything outside the range of vowel points and accents (1425-1479)
    else if(code<1425 || code>1479){
      newString += rawString.charAt(j);
    }
  }
  return(newString);
}

//O == 1465
//hyphen == 1470
//U == 1467`

Here's an example of the difference (pointed input first, output second):

וְיִשְׂרָאֵל אָהַב אֶת־יֹוסֵף מִכָּל־בָּנָיו כִּי־בֶן־זְקֻנִים הוּא לֹו וְעָשָׂה לֹו כְּתֹנֶת פַּסִּים ׃
וישראל אהב את יוסף מכל בניו כי בן זקונים הוא לו ועשה לו כתונת פסים
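
The check above looks only at the character right after the holam, which fits texts that store holam male as consonant, holam, vav (as in the sample verse). Text that stores the holam after the vav (vav, then U+05B9) would get a second vav. A rough sketch that also looks at the preceding letter and skips over any marks between the holam and the next letter might look like this. `stripVowelsPlene`, `isCombiningMark` and the constants are hypothetical names, and a consonantal vav that carries a holam is simply treated as already representing the vowel.

```javascript
// Hypothetical names; the code points are those used in the comment above.
const HOLAM = 0x05B9;  // 1465
const QUBUTS = 0x05BB; // 1467
const MAQAF = 0x05BE;  // 1470
const VAV = "\u05D5", ALEF = "\u05D0", HE = "\u05D4";

// True for combining points and accents, false for letters and punctuation.
function isCombiningMark(code) {
  return (code >= 0x0591 && code <= 0x05BD) || code === 0x05BF ||
         code === 0x05C1 || code === 0x05C2 ||
         code === 0x05C4 || code === 0x05C5 || code === 0x05C7;
}

function stripVowelsPlene(rawString) {
  let out = "";
  for (let j = 0; j < rawString.length; j++) {
    const code = rawString.charCodeAt(j);
    if (code === HOLAM) {
      // Last character copied so far, normally the letter carrying the holam.
      const prev = out.charAt(out.length - 1);
      // First character after the holam that is not another mark.
      let k = j + 1;
      while (k < rawString.length && isCombiningMark(rawString.charCodeAt(k))) k++;
      const next = rawString.charAt(k);
      if (prev !== VAV && next !== VAV && next !== ALEF && next !== HE) out += VAV;
    } else if (code === QUBUTS) {
      out += VAV; // a U sound always gets a vav
    } else if (code === MAQAF) {
      out += " "; // Hebrew hyphen becomes a space
    } else if (code < 0x0591 || code > 0x05C7) {
      out += rawString.charAt(j); // keep everything outside the marks range
    }
  }
  return out;
}
```

For input in the same encoding as the sample verse this should behave like the function above; the difference shows up only when the vav precedes the holam or when another mark sits between the holam and the next letter.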

@drjrodriguez

I don't know why the code font didn't kick in.
