@yakovsh
Last active May 23, 2022 19:43
Removing Vowels from Hebrew Unicode Text
/*
* One of the questions that recently came up is how to remove vowels from Hebrew text in Unicode
* (or text in a similar language). A quick look at the Hebrew Unicode chart shows that the vowel points,
* along with the cantillation marks and a few punctuation marks such as the maqaf, are all located
* between 0x0591 (1425) and 0x05C7 (1479). With this range and JavaScript's charCodeAt function, it
* is trivial to strip them out as follows.
*
* Live demo is available here:
* https://jsfiddle.net/js0ge7gn/
*/
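function stripVowels(rawString) {
  var newString = '';
  // Declare j locally; the original loop leaked it as an implicit global.
  for (var j = 0; j < rawString.length; j++) {
    var code = rawString.charCodeAt(j);
    // Keep only characters outside U+0591..U+05C7 (1425..1479).
    if (code < 1425 || code > 1479) {
      newString += rawString.charAt(j);
    }
  }
  return newString;
}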
/* @shimondoodkin suggested an even shorter way to do this */
function stripVowels2(rawString) {
  return rawString.replace(/[\u0591-\u05C7]/g, "");
}
@yakovsh

yakovsh commented Jan 17, 2016

Dovid Harrison suggested the following algorithm in Ruby:

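def stripVowels(rawString)
  newString = ""
  # each_char iterates over characters (not bytes) for UTF-8 strings
  rawString.each_char do |c|
    # Keep only characters outside U+0591..U+05C7 (1425..1479)
    newString << c if c.ord < 1425 || c.ord > 1479
  end
  return newString
end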

@yakovsh

yakovsh commented Jan 17, 2016

Yarix suggested a Google Sheets function to do the same:

=REGEXREPLACE(B1,"[\x{0591}-\x{05C7}]","")

@yakovsh

yakovsh commented Jan 17, 2016

Gnarlodious suggested the following in Python:

import unicodedata
nikudnik = "הַבַּיְתָה"
normalized = unicodedata.normalize('NFKD', nikudnik)  # Reduce Hebrew vowel (ניקוד) marks
flattened = "".join([c for c in normalized if not unicodedata.combining(c)])
flattened
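
For comparison with the JavaScript versions at the top of the gist, here is a rough JavaScript counterpart of the normalization approach. It is a sketch rather than a port: stripMarks is a hypothetical name, and \p{M} (the Unicode "Mark" category) only approximates Python's unicodedata.combining() test. It assumes an engine that supports ES2018 Unicode property escapes.

```javascript
// Sketch: strip Hebrew points via normalization instead of a fixed code-point range.
// stripMarks is a hypothetical helper name, not part of the original gist.
function stripMarks(text) {
  return text
    .normalize('NFKD')        // decompose precomposed and presentation forms
    .replace(/\p{M}/gu, '');  // drop every combining mark (vowel points, cantillation)
}

// The sample word from the Python snippet, written with escapes so the
// code points are unambiguous: he, patah, bet, patah, dagesh, yod, sheva, tav, qamats, he
const nikudnik = '\u05D4\u05B7\u05D1\u05B7\u05BC\u05D9\u05B0\u05EA\u05B8\u05D4';
console.log(stripMarks(nikudnik));
```

Because the maqaf (U+05BE) is classified as punctuation rather than as a mark, this variant leaves it in place.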

@flamholz

FYI: the jsfiddle seems to unintentionally strip dashes and non-vowel punctuation. For example, I put in Bereshit 1:12

"וַתּוֹצֵ֨א הָאָ֜רֶץ דֶּ֠שֶׁא עֵ֣שֶׂב מַזְרִ֤יעַ זֶ֨רַע֙ לְמִינֵ֔הוּ וְעֵ֧ץ עֹֽשֶׂה־פְּרִ֛י אֲשֶׁ֥ר זַרְעוֹ־ב֖וֹ לְמִינֵ֑הוּ וַיַּ֥רְא אֱלֹהִ֖ים כִּי־טֽוֹב"

and got

"ותוצא הארץ דשא עשב מזריע זרע למינהו ועץ עשהפרי אשר זרעובו למינהו וירא אלהים כיטוב"

@altoP

altoP commented Dec 20, 2017

And remove vowels in Java:

private String removeVowels(String hebString) {
    // StringBuilder avoids re-allocating the string on every kept character
    StringBuilder newString = new StringBuilder();
    for (int j = 0; j < hebString.length(); j++) {
        char c = hebString.charAt(j);
        // Keep only characters outside U+0591..U+05C7 (1425..1479)
        if (c < 1425 || c > 1479) {
            newString.append(c);
        }
    }
    return newString.toString();
}

@avrtau

avrtau commented Jan 30, 2018

@flamholz You can simply exclude the maqaf (the "dash" connecting two words in Hebrew text, Unicode U+05BE) from the regex: return rawString.replace(/[\u0591-\u05BD\u05BF-\u05C7]/g,"");
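
The maqaf is one of a handful of characters in U+0591..U+05C7 that are not combining marks. As a rough sketch (the helper name nonMarksInRange is made up for illustration, and the result depends on the Unicode data shipped with the JavaScript engine), they can be listed like this:

```javascript
// Sketch: find code points in a range that are NOT combining marks,
// i.e. candidates to exclude from the stripping regex.
// nonMarksInRange is a hypothetical helper name.
function nonMarksInRange(start, end) {
  const found = [];
  for (let cp = start; cp <= end; cp++) {
    const ch = String.fromCodePoint(cp);
    // \p{M} matches the Unicode "Mark" category (requires the u flag)
    if (!/^\p{M}$/u.test(ch)) {
      found.push('U+' + cp.toString(16).toUpperCase().padStart(4, '0'));
    }
  }
  return found;
}

console.log(nonMarksInRange(0x0591, 0x05C7));
```

With current Unicode data this should report U+05BE (maqaf), U+05C0 (paseq), U+05C3 (sof pasuq) and U+05C6 (nun hafukha). Which of these to keep is a judgment call that depends on the text being processed.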

@bsesic

bsesic commented May 24, 2018

Thanks for the Python script, @yakovsh! 👍
Based on that script I created a more advanced script in Python. 🐍

🎉 It has a GUI, using Tkinter!

It can read and write text files, and it can also remove only the Taamei haMikra characters if you wish.

🔥 The script is part of an open source Bible software I am writing, called Emmeth.

Check it out here: https://github.com/Emmeth/tools/blob/master/scripts/remove_nikkud.py

@canjecricketer

So this thread has been super helpful. I've been using the Google Sheets function suggested by yakovsh, with avrtau's modification extended to keep both the maqqef and the sof pasuq. The ranges go side by side inside the character class (parentheses or "OR" inside the brackets would be matched as literal characters): =REGEXREPLACE(B1,"[\x{0591}-\x{05BD}\x{05BF}-\x{05C2}\x{05C4}-\x{05C7}]","")
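
For reference, the same idea (keep the maqqef U+05BE and the sof pasuq U+05C3, strip the rest of the range) might look like this in JavaScript. This is only a sketch; stripKeepingPunctuation is a hypothetical name, not something from the gist.

```javascript
// Sketch: strip U+0591..U+05C7 except the maqqef (U+05BE) and sof pasuq (U+05C3).
// stripKeepingPunctuation is a hypothetical helper name.
function stripKeepingPunctuation(text) {
  // Three adjacent ranges inside one character class skip the two kept code points.
  return text.replace(/[\u0591-\u05BD\u05BF-\u05C2\u05C4-\u05C7]/g, '');
}

// Pointed "ki-tov" followed by a sof pasuq, written with escapes for clarity
const sample = '\u05DB\u05BC\u05B4\u05D9\u05BE\u05D8\u05D5\u05B9\u05D1\u05C3';
console.log(stripKeepingPunctuation(sample));
```

Note that the paseq (U+05C0) and nun hafukha (U+05C6) still fall inside the stripped ranges here.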

But what I want to know is whether and how I could incorporate similar code into a macro for Word, since that is ultimately what I am using this for: to remove nikkud from Hebrew words and text extracts in a Word document. Currently I copy the vocalized text from the Word doc, paste it into my Google sheet, and then copy/paste that output over the vocalized text back in Word. It would be faster and more efficient if I could just select the vocalized text and run a macro on it with a shortcut key. I'm fairly sure this is possible, but I don't know which, if any, of the code listed above would work. I'm pretty new to Word macros; I'm familiar with creating and editing them, and I understand the basics, but I don't know the syntax myself. Thanks for any suggestions!

@orish1

orish1 commented Jan 1, 2020

Thanks for the ideas above. I imported all the data in an Excel spreadsheet and couldn't find a way to do it elegantly using a regexreplace function, say.

This is a bit brute-force, but it worked for me and only took about 10 seconds to convert around 10,000 words and sentences:

Function stripVowels(rawString)
    Dim stripped As String
    Dim H As Long
    stripped = rawString
    ' Remove each code point from U+0591 (1425) to U+05C7 (1479) in turn
    For H = 1425 To 1479
        stripped = Replace(stripped, ChrW(H), "")
    Next
    stripVowels = stripped
End Function

And then just type "=stripVowels(A2)" in the cell(s) where you want nikkud-less Hebrew text (obviously replace "A2" with whatever the cell is of the original text).

:)

@orish1

orish1 commented Jan 1, 2020

Here's the variation required for a Word Find-and-Replace macro:

Sub StripNkudot()
    Dim H As Long
    Application.DisplayAlerts = False
    ' Run one Replace All per code point from U+0591 (1425) to U+05C7 (1479)
    For H = 1425 To 1479
        Selection.Find.ClearFormatting
        Selection.Find.Replacement.ClearFormatting
        With Selection.Find
            .Text = ChrW(H)
            .Replacement.Text = ""
            .Forward = True
            .Wrap = wdFindStop
            .Format = False
            .MatchCase = False
            .MatchWholeWord = False
            .MatchKashida = False
            .MatchDiacritics = False
            .MatchAlefHamza = False
            .MatchControl = False
            .MatchWildcards = False
            .MatchSoundsLike = False
            .MatchAllWordForms = False
        End With
        Selection.Find.Execute replace:=wdReplaceAll
    Next
    Application.DisplayAlerts = True
End Sub

(You might want to comment out DisplayAlerts=False if you want to see what's going on, but if you have a lot of text then switching off the display will speed up the process considerably.)

@dir01

dir01 commented Apr 10, 2020

Golang example based on other comments on this gist: https://play.golang.org/p/B9VRe-L5lS3

@drjrodriguez

drjrodriguez commented Feb 27, 2021

When stripping nequdot from Torah passages (and essentially trying to modernize the spelling for myself), I realized that modern Hebrew actually writes in more vav's (and maybe yud's too). So I wrote this variation that checks to see if the o / u sound is represented before deleting one of its various representations. If it isn't represented, I ADD a vav.

`function stripVowels(rawString){
  var newString = '';
  for(var j=0; j<rawString.length; j++) {
    //If it has an O that isn't otherwise represented, then add a vav
    if(rawString.charCodeAt(j) == 1465 && rawString[j+1] != "ו" && rawString[j+1] != "א" && rawString[j+1] != "ה"){
      newString += "ו";
    }
    //If it has a U, also add a vav
    else if(rawString.charCodeAt(j) == 1467){
      newString += "ו";
    }
    //Turn Hebrew hyphen into space
    else if(rawString.charCodeAt(j) == 1470) newString += " ";
    //Get rid of anything that's not a normal letter or punctuation
    else if(rawString.charCodeAt(j)<1425 || rawString.charCodeAt(j)>1479){
      newString += rawString.charAt(j);
    }
  }
  return(newString);
}

//O == 1465
//hyphen == 1470
//U == 1467`

Here's an example of the difference:

וְיִשְׂרָאֵל אָהַב אֶת־יֹוסֵף מִכָּל־בָּנָיו כִּי־בֶן־זְקֻנִים הוּא לֹו וְעָשָׂה לֹו כְּתֹנֶת פַּסִּים ׃
וישראל אהב את יוסף מכל בניו כי בן זקונים הוא לו ועשה לו כתונת פסים

@drjrodriguez

I don't know why the code font didn't kick in.
