@atuyosi
Created August 24, 2016 12:51
Converts the CMUdict pronunciation dictionary from ARPAbet notation to IPA.
#!/usr/bin/env ruby
require 'csv'
require 'json'

mt = {}       # ARPAbet symbol => IPA string
newdict = {}  # headword => array of IPA pronunciations
MAPFILE = "map_to_ipa_from_arpabet.tsv"
DICTFILE = "../cmudict/cmudict.dict"
debug = false
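
# Convert an array of ARPAbet symbols into one bracketed IPA string.
# A '#' token starts a trailing comment, so conversion stops there.
def convert_arpabet(arpabets, mt)
  punc = String.new(encoding: Encoding::UTF_8)
  arpabets.each do |n|
    break if n == '#'
    # Fail with a clear message instead of a TypeError when a symbol is unmapped.
    punc << mt.fetch(n) { raise KeyError, "no IPA mapping for ARPAbet symbol #{n.inspect}" }
  end
  "[#{punc}]"
end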
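
# Load the map file: the first column is the ARPAbet symbol, the second the IPA
# value. Any \uXXXX escapes in the second column are decoded to UTF-8 characters.
CSV.foreach(MAPFILE, col_sep: "\t", headers: false) do |row|
  next unless row.length > 1
  punc = row[1].downcase.gsub(/\\u([0-9a-f]{4})/) { [$1.hex].pack('U') }
  mt[row[0]] = punc
end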
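
# Read the dictionary line by line; each entry is "word PH1 PH2 ...".
# File.foreach closes the file when the block finishes.
File.foreach(DICTFILE) do |line|
  surf, *arpabets = line.split
  next if surf.nil? # skip blank lines
  puts arpabets.to_s if debug
  # Alternate pronunciations appear as "word(2)"; strip the suffix so that
  # all variants are collected under the same headword.
  surf = Regexp.last_match[1] if surf =~ /(.+)\(\d+\)$/
  punc_ipa = convert_arpabet(arpabets, mt)
  (newdict[surf] ||= []) << punc_ipa
end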

puts newdict.length if debug

# Write the headword => [IPA, ...] table as pretty-printed JSON.
File.open("output.json", 'w') do |file|
  file.write JSON.pretty_generate(newdict)
  # Alternative without pretty-printing: JSON.dump(newdict, file)
end
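
The conversion above can be tried without the real map and dictionary files. The following is a minimal sketch, not the gist's own code: the sample rows, the dictionary line, and the helper names `unescape_unicode`, `build_table`, and `to_ipa` are all hypothetical, and the ARPAbet-to-IPA pairs are common conventions that may differ from what `map_to_ipa_from_arpabet.tsv` actually contains.

```ruby
# Hypothetical sample rows in the shape the script seems to expect from the TSV
# map: an ARPAbet symbol, then an IPA value that may contain \uXXXX escapes.
SAMPLE_MAP_ROWS = [
  ["HH",  "h"],
  ["AH0", "\\u0259"],
  ["L",   "l"],
  ["OW1", "o\\u028a"]
].freeze

# Decode \uXXXX escapes (four hex digits) into UTF-8 characters.
def unescape_unicode(text)
  text.gsub(/\\u([0-9a-fA-F]{4})/) { [$1.hex].pack('U') }
end

# Build the lookup table: ARPAbet symbol => decoded IPA string.
def build_table(rows)
  rows.map { |arpabet, ipa| [arpabet, unescape_unicode(ipa)] }.to_h
end

# Join the IPA values for all symbols that precede a '#' comment marker.
def to_ipa(arpabets, table)
  symbols = arpabets.take_while { |sym| sym != '#' }
  "[#{symbols.map { |sym| table.fetch(sym) }.join}]"
end

table = build_table(SAMPLE_MAP_ROWS)
word, *arpabets = "hello HH AH0 L OW1 # sample comment".split
puts "#{word} => #{to_ipa(arpabets, table)}"
# prints: hello => [həloʊ]
```

Using `fetch` here means an ARPAbet symbol missing from the table raises a `KeyError` naming the symbol, which makes gaps in the map file easy to spot.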