@Kaljurand
Created January 9, 2014 12:50
Experiment with segmenting Estonian place names with Morfessor. The main goal was to split off a meaningful suffix. The final tally keeps only those parts that are at least 4 characters long.
$ wc placenames.txt
4416 4422 36452 placenames.txt
$ morfessor -t placenames.txt -s placenames_model.pickled
INFO:morfessor.io:Reading corpus from 'placenames.txt'...
INFO:morfessor.io:Detected utf-8 encoding
INFO:morfessor.io:Detected utf-8 encoding
INFO:morfessor.io:Done.
INFO:morfessor.baseline:Compounds in training data: 4417 types / 4417 tokens
INFO:morfessor.baseline:Starting batch training
INFO:morfessor.baseline:Epochs: 0 Cost: 121595.428272
...........................................................
INFO:morfessor.baseline:Epochs: 1 Cost: 106209.31456
...........................................................
INFO:morfessor.baseline:Epochs: 2 Cost: 99081.6022444
...........................................................
INFO:morfessor.baseline:Epochs: 3 Cost: 94382.9848076
...........................................................
INFO:morfessor.baseline:Epochs: 4 Cost: 92101.3947599
...........................................................
INFO:morfessor.baseline:Epochs: 5 Cost: 91442.8257696
...........................................................
INFO:morfessor.baseline:Epochs: 6 Cost: 91180.9919002
...........................................................
INFO:morfessor.baseline:Epochs: 7 Cost: 91105.2827758
...........................................................
INFO:morfessor.baseline:Epochs: 8 Cost: 91058.3609378
...........................................................
INFO:morfessor.baseline:Epochs: 9 Cost: 91053.0997569
INFO:morfessor.baseline:Done.
Epochs: 9
Final cost: 91053.0997569
Training time: 36.919s
INFO:morfessor.io:Saving model to 'placenames_model.pickled'...
INFO:morfessor.io:Done.
$ morfessor -l placenames_model.pickled -T placenames.txt -o placenames.segmented
$ cat placenames.segmented | tr ' ' '\012' | grep "...." | sort | uniq -c | sort -nr
208 vere
178 küla
71 metsa
44 ssaare
33 järve
31 Vana
29 mõisa
26 nurme
26 Metsa
26 aste
25 välja
24 pere
24 palu
24 lepa
22 saare
21 otsa
19 salu
17 kese
16 Väike
16 taguse
15 Suure
14 Taga
14 Suur
14 Kari
13 ranna
13 Kure
13 Kivi
13 jala
12 Nõmme
12 Jaani
11 selja
11 Paju
11 Mets
11 Kiri
11 Järve
10 Tamme
10 pera
10 Palu
10 Pala
10 nurga
10 Must
10 laane
10 Allik
9 vitsa
9 Veski
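
The counting step above (one part per line, keep parts of at least 4 characters, tally by frequency) can also be sketched in Python. This is only an illustration under assumptions: the function name `count_long_parts` and the sample lines are hypothetical, and the input is assumed to be whitespace-separated segmentations, one place name per line, as in placenames.segmented.

```python
from collections import Counter


def count_long_parts(lines, min_len=4):
    """Tally segment parts that are at least min_len characters long.

    Hypothetical helper, not part of Morfessor. Counting is case-sensitive,
    so "Metsa" and "metsa" are tallied separately, as in the listing above.
    """
    counts = Counter()
    for line in lines:
        # split() breaks on any whitespace; the shell version splits on spaces
        for part in line.split():
            if len(part) >= min_len:
                counts[part] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample lines standing in for placenames.segmented
    sample = ["Vana küla", "Metsa küla", "Suur e metsa"]
    for part, n in count_long_parts(sample).most_common():
        print(n, part)
```

To run it on the real output, the sample list could be replaced by the lines of placenames.segmented opened with encoding="utf-8". Length is measured in characters rather than bytes, so "küla" counts as 4.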