Last active August 29, 2015 14:28
NGram extraction using Stackable traits in Scala
Dit is een correcte Nederlandse zin volgens het van Dale woordenboek.
De actie die een succes werd, wordt volgend jaar herhaald.
De man op de voorgrond is de voorzitter.
Morgen moet je hem maar gaan helpen.,
Het ongeluk had een langdurige onderbreking tot gevolg.
Ik vind dit een erg lelijke bank
Na een lang gevecht moest hij toch het onderspit delven.
De Donau, de op één na langste rivier van Europa, mondt uit in de Zwarte Zee.
Kees van Kooten, de schrijver van het boekenweekgeschenk van 2013, vindt dat Nederland een begeesterde politicus nodig heeft.
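import scala.io.Source
// Needed only if Boot lives outside the extractor package.
import com.github.ngram.extractor._

object Boot extends App {
  // Traits mixed in later sit higher in the stack, so arities resolves to List(3, 2, 1).
  val extractor = new SentenceAnalyzer
    with NGramExtraction
    with Unigrams
    with Bigrams
    with Trigrams
  // Materialize the lines so they can be traversed more than once.
  val sentences = Source.fromInputStream(getClass.getResourceAsStream("/input.txt")).getLines().toList
  println(extractor.analyze(sentences))
}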
package com.github.ngram.extractor

// Each mixed-in trait contributes its arity (the n in n-gram) to the list.
trait NGrams { def arities: List[Int] }
// Base of the stack: contributes no arities.
trait NGramExtraction extends NGrams { override def arities: List[Int] = List.empty[Int] }
trait Unigrams extends NGrams { this: NGramExtraction => abstract override def arities: List[Int] = 1 :: super.arities }
trait Bigrams extends NGrams { this: NGramExtraction => abstract override def arities: List[Int] = 2 :: super.arities }
trait Trigrams extends NGrams { this: NGramExtraction => abstract override def arities: List[Int] = 3 :: super.arities }

trait SentenceAnalyzer {
  this: NGrams =>

  def analyze(sentences: Iterable[String]): List[List[String]] = {
    // Tokenize once; splitting on runs of whitespace avoids empty tokens between repeated spaces.
    val tokenizedSentences = sentences.map(_.split("\\s+").toList)
    arities.flatMap { n =>
      // sliding(n) yields a single short window for sentences with fewer than n tokens; drop it.
      tokenizedSentences.flatMap(_.sliding(n)).filter(_.size == n)
    }
  }
}
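
The effect of the mixin order is easier to see on a tiny input. The following is a minimal, self-contained sketch and not part of the original gist: it repeats the trait definitions (without Trigrams) inside a hypothetical `StackableDemo` object and runs them on the made-up sentence "a b c". It assumes the same whitespace tokenization as the analyzer above.

```scala
object StackableDemo {
  trait NGrams { def arities: List[Int] }

  // Base of the stack: contributes no arities.
  trait NGramExtraction extends NGrams { override def arities: List[Int] = Nil }

  // Each stackable trait prepends its own arity to whatever sits below it.
  trait Unigrams extends NGrams { this: NGramExtraction =>
    abstract override def arities: List[Int] = 1 :: super.arities
  }
  trait Bigrams extends NGrams { this: NGramExtraction =>
    abstract override def arities: List[Int] = 2 :: super.arities
  }

  trait SentenceAnalyzer { this: NGrams =>
    def analyze(sentences: Iterable[String]): List[List[String]] = {
      val tokenized = sentences.map(_.split("\\s+").toList)
      // Keep only full-size windows; sliding(n) returns a short one for short inputs.
      arities.flatMap(n => tokenized.flatMap(_.sliding(n)).filter(_.size == n))
    }
  }

  def main(args: Array[String]): Unit = {
    val extractor = new SentenceAnalyzer with NGramExtraction with Unigrams with Bigrams
    // Bigrams is mixed in last, so its arity comes first.
    println(extractor.arities)                 // List(2, 1)
    println(extractor.analyze(List("a b c")))  // List(List(a, b), List(b, c), List(a), List(b), List(c))
  }
}
```

Swapping `with Unigrams` and `with Bigrams` would give `List(1, 2)` and list the unigrams first, since `super.arities` always resolves to the trait mixed in just before.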