@tyrcho
Last active August 29, 2015 14:04
Computing a frequency table in Scala using Breeze

Computes the table for 2264 docs containing 20475 words in 23 seconds (Core i5).

You need to have Breeze installed.

With Maven:

<dependency>
    <groupId>org.scalanlp</groupId>
    <artifactId>breeze_2.10</artifactId>
    <version>0.8.1</version>
</dependency>
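
For sbt users, the same coordinates could be declared roughly as follows. This is a sketch that mirrors the Maven snippet above; the exact `scalaVersion` is an assumption (any 2.10.x release should match the `breeze_2.10` artifact).

```scala
// build.sbt (sketch, not part of the original gist)
scalaVersion := "2.10.4" // assumption: a 2.10.x release, to match breeze_2.10

// Same group, artifact and version as the Maven dependency above.
libraryDependencies += "org.scalanlp" % "breeze_2.10" % "0.8.1"
```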

FrequencyTable.scala

10000_txt_earn net 1
10000_txt_earn rogers 4
10000_txt_earn earnings 2
10000_txt_earn switch 1
10000_txt_earn conn 1
10000_txt_earn revenues 2
10000_txt_earn cts 1
10000_txt_earn company 1
10000_txt_earn ago 1
10000_txt_earn circuit 1
10000_txt_earn 114000 1
10000_txt_earn dlrs 2
10000_txt_earn sale 2
10000_txt_earn 26 1
10000_txt_earn line 1
10000_txt_earn first 2
10000_txt_earn said 4
10000_txt_earn supplier 1
10000_txt_earn early 1
10000_txt_earn higher 1
10000_txt_earn major 1
10000_txt_earn terms 1
10000_txt_earn four 1
10000_txt_earn share 1
10000_txt_earn 329 1
10000_txt_earn second 1
10000_txt_earn quarter 5
10000_txt_earn hartford 1
10000_txt_earn 1st 1
10000_txt_earn agreement 1
10000_txt_earn product 1
10000_txt_earn posted 1
10000_txt_earn march 1
10000_txt_earn qtr 1
10000_txt_earn significantly 2
10000_txt_earn disclosed 1
10000_txt_earn year 2
10000_txt_earn completed 1
10000_txt_earn molded 1
10000_txt_earn mln 1
10000_txt_earn will 2
10000_txt_earn reached 1
10000_txt_earn expects 1
10000_txt_earn last 1
10000_txt_earn corp 1
10000_txt_earn somewhat 1
10000_txt_earn sees 1
10000_txt_earn rog 1
10000_txt_earn reuter 1
10054_txt_earn raises 1
10054_txt_earn company 1
10054_txt_earn presplit 1
10054_txt_earn may 1
10054_txt_earn number 1
10054_txt_earn approved 1
10054_txt_earn twoforone 1
10054_txt_earn dividend 4
10054_txt_earn 67 1
10054_txt_earn qtly 1
10054_txt_earn shareholders 2
10054_txt_earn share 1
10054_txt_earn cts 2
10054_txt_earn mln 2
10054_txt_earn raised 1
10054_txt_earn quarterly 1
10054_txt_earn pct 2
10054_txt_earn authorized 1
10054_txt_earn fds 1
10054_txt_earn department 2
10054_txt_earn new 1
10054_txt_earn time 1
10054_txt_earn 200 1
10054_txt_earn march 1
10054_txt_earn 26 1
10054_txt_earn shares 2
10054_txt_earn also 1
10054_txt_earn 74 1
10054_txt_earn common 1
10054_txt_earn split 2
10054_txt_earn 11 1
10054_txt_earn reuter 1
10054_txt_earn increase 2
10054_txt_earn stock 4
10054_txt_earn april 2
10054_txt_earn 24 1
10054_txt_earn distributed 1
10054_txt_earn 10 1
10054_txt_earn approve 1
10054_txt_earn form 1
10054_txt_earn will 2
10054_txt_earn said 5
10054_txt_earn basis 1
10054_txt_earn record 1
10054_txt_earn ask 1
10054_txt_earn 400 1
10054_txt_earn 105 1
10054_txt_earn inc 1
10054_txt_earn cincinnati 1
10054_txt_earn federated 4
10054_txt_earn payable 1
10054_txt_earn stores 1
10054_txt_earn 100 1
10080_txt_crude disposal 1
10080_txt_crude amoco 1
10080_txt_crude real 1
10080_txt_crude looking 2
10080_txt_crude ownership 1
10080_txt_crude longrange 1
10080_txt_crude jumped 1
10080_txt_crude 118 3
10080_txt_crude year 5
10080_txt_crude market 4
10080_txt_crude 3734 1
10080_txt_crude announced 1
10080_txt_crude ucl 1
10080_txt_crude can 2
10080_txt_crude australian 1
10080_txt_crude ball 1
10080_txt_crude 8334 1
10080_txt_crude swirl 1
10080_txt_crude rose 4
10080_txt_crude 5958 1
10080_txt_crude way 1
10080_txt_crude ground 1
10080_txt_crude north 1
10080_txt_crude boost 1
10080_txt_crude texaco 1
10080_txt_crude british 8
10080_txt_crude offer 3
10080_txt_crude 134 1
10080_txt_crude york 1
10080_txt_crude rothschild 1
10080_txt_crude dean 1
10080_txt_crude extremely 1
10080_txt_crude values 2
10080_txt_crude speculated 1
10080_txt_crude unattractive 1
10080_txt_crude buy 1
10080_txt_crude court 1
10080_txt_crude bid 3
10080_txt_crude bruce 1
10080_txt_crude ahc 1
10080_txt_crude prescott 1
10080_txt_crude carl 1
10080_txt_crude bpbp 1
10080_txt_crude move 1
10080_txt_crude 74 2
10080_txt_crude view 1
10080_txt_crude eugene 1
10080_txt_crude bearish 1
10080_txt_crude revenues 2
10080_txt_crude hearts 1
10080_txt_crude 1986 1
10080_txt_crude clear 1
10080_txt_crude 317 1
10080_txt_crude industry 1
10080_txt_crude 14 1
10080_txt_crude uk 1
10080_txt_crude mln 2
10080_txt_crude acquisition 1
10080_txt_crude losses 1
10080_txt_crude earlier 3
10080_txt_crude point 2
10080_txt_crude beginning 1
10080_txt_crude planned 1
10080_txt_crude indicates 1
10080_txt_crude last 4
10080_txt_crude unocal 1
10080_txt_crude affirmation 1
10080_txt_crude usx 1
10080_txt_crude bp 3
10080_txt_crude 45 1
10080_txt_crude pct 4
10080_txt_crude inc 1
10080_txt_crude courted 1
10080_txt_crude high 1
10080_txt_crude behind 1
10080_txt_crude dlrs 11
10080_txt_crude going 2
10080_txt_crude heavy 1
10080_txt_crude co 1
10080_txt_crude march 1
10080_txt_crude street 1
10080_txt_crude petroleum 7
10080_txt_crude positions 1
10080_txt_crude governments 1
10080_txt_crude response 1
10080_txt_crude us 10
10080_txt_crude changed 1
10080_txt_crude expectations 1
10080_txt_crude higher 2
10080_txt_crude six 1
10080_txt_crude shearson 1
10080_txt_crude partners 1
10080_txt_crude less 1
10080_txt_crude fact 1
10080_txt_crude years 3
10080_txt_crude just 1
10080_txt_crude 15 1
10080_txt_crude rest 2
10080_txt_crude brothers 1
10080_txt_crude witter 1
10080_txt_crude united 1
10080_txt_crude slightly 1
10080_txt_crude benchmark 1
10080_txt_crude 3434 1
10080_txt_crude analysts 7
10080_txt_crude majors 1
10080_txt_crude oil 16
10080_txt_crude issues 1
10080_txt_crude alaskan 2
10080_txt_crude already 1
10080_txt_crude rumors 1
10080_txt_crude government 1
10080_txt_crude place 1
10080_txt_crude 214 1
10080_txt_crude trading 2
10080_txt_crude 18 1
10080_txt_crude situations 1
10080_txt_crude opec 1
10080_txt_crude holding 1
10080_txt_crude pay 1
10080_txt_crude analyst 3
10080_txt_crude maintained 1
10080_txt_crude 1860 1
10080_txt_crude reuter 1
10080_txt_crude states 1
10080_txt_crude prices 3
10080_txt_crude compared 1
10080_txt_crude west 1
10080_txt_crude strong 1
10080_txt_crude per 4
10080_txt_crude exxon 1
10080_txt_crude hasty 1
10080_txt_crude dlr 2
10080_txt_crude 26 1
10080_txt_crude continue 1
10080_txt_crude session 1
10080_txt_crude round 1
10080_txt_crude visibility 1
10080_txt_crude mentioned 1
10080_txt_crude 7118 1
10080_txt_crude might 1
10080_txt_crude billion 4
10080_txt_crude one 3
10080_txt_crude two 1
10080_txt_crude robert 1
10080_txt_crude restructured 1
10080_txt_crude firms 1
10080_txt_crude acquiring 1
10080_txt_crude texas 1
10080_txt_crude 614 1
10080_txt_crude increase 1
10080_txt_crude made 1
10080_txt_crude think 4
10080_txt_crude concern 1
10080_txt_crude retreat 1
10080_txt_crude new 1
10080_txt_crude margoshes 4
10080_txt_crude found 1
10080_txt_crude investor 1
10080_txt_crude largest 1
10080_txt_crude icahn 1
10080_txt_crude energy 1
10080_txt_crude around 2
10080_txt_crude outlook 1
10080_txt_crude lazier 3
10080_txt_crude signalled 1
10080_txt_crude shows 1
10080_txt_crude wall 1
10080_txt_crude huge 1
10080_txt_crude position 1
10080_txt_crude ago 1
10080_txt_crude 8812 1
10080_txt_crude interests 1
10080_txt_crude appropriately 1
10080_txt_crude 70 2
10080_txt_crude lehman 1
10080_txt_crude said 20
10080_txt_crude raise 1
10080_txt_crude also 2
10080_txt_crude confidence 1
10080_txt_crude oils 2
10080_txt_crude will 1
10080_txt_crude 1382 1
10080_txt_crude plan 1
10080_txt_crude may 1
10080_txt_crude 308 1
10080_txt_crude won 1
10080_txt_crude stay 1
10080_txt_crude today 3
10080_txt_crude amerada 1
10080_txt_crude hardtoreplace 1
10080_txt_crude 138 1
10080_txt_crude matchmaking 1
10080_txt_crude plans 1
10080_txt_crude attention 1
10080_txt_crude possibly 1
10080_txt_crude others 1
10080_txt_crude oxy 1
10080_txt_crude profit 1
10080_txt_crude raises 2
10080_txt_crude 1002 1
10080_txt_crude rosario 1
10080_txt_crude 5878 1
10080_txt_crude prudhoe 1
10080_txt_crude major 1
10080_txt_crude become 1
10080_txt_crude crack 1
10080_txt_crude able 1
10080_txt_crude signal 1
10080_txt_crude tender 1
10080_txt_crude chevron 1
10080_txt_crude sometime 1
10080_txt_crude sanford 1
10080_txt_crude company 2
10080_txt_crude ilacqua 1
10080_txt_crude brightest 1
10080_txt_crude falling 2
10080_txt_crude climbed 2
10080_txt_crude share 3
10080_txt_crude exceeded 1
10080_txt_crude implication 1
10080_txt_crude chv 1
10080_txt_crude corp 7
10080_txt_crude particularly 1
10080_txt_crude targets 1
10080_txt_crude xon 1
10080_txt_crude bps 3
10080_txt_crude unit 1
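
The rows above are shown with whitespace between the columns, while the program below splits each line on commas and fails with a `MatchError` on any row that does not have exactly three fields. A minimal sketch of a more tolerant parser follows, assuming every row is a `doc, word, count` triple; `Entry` and `EntryParser` are hypothetical names that do not appear in the gist.

```scala
import scala.util.Try

// Hypothetical helper types, not part of the original gist.
final case class Entry(doc: String, word: String, count: Int)

object EntryParser {
  // Parses one row of the form "doc,word,count"; whitespace-separated
  // columns are accepted too. Returns None for blank or malformed rows.
  def parse(line: String): Option[Entry] =
    line.trim.split("[,\\s]+") match {
      case Array(doc, word, n) if doc.nonEmpty && word.nonEmpty =>
        Try(n.toInt).toOption.map(count => Entry(doc, word, count))
      case _ => None
    }
}
```

With such a helper, `lines.flatMap(EntryParser.parse)` would silently skip bad rows instead of aborting; whether skipping is the right behaviour depends on how clean the input file is.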
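import breeze.linalg._
import scala.collection.mutable.Map

object FrequencyTable extends App {
  // Input rows are "doc,word,count". Three rows for a synthetic "test" document
  // are appended so that it can be compared against every other document.
  val lines = io.Source.fromFile("t:/frequency.csv").getLines.take(10000000).toList ++
    (List("washington", "taxes", "treasury") map { w => s"test,$w,1" })

  // Row index per document and column index per word,
  // assigned in order of first appearance.
  val docs = Map.empty[String, Int]
  val words = Map.empty[String, Int]
  var docCount = -1
  var wordCount = -1

  init()
  val matrix = initMatrix()

  // Only the product is timed. prod(i, j) is the dot product of the
  // word-count rows of documents i and j.
  val time = System.nanoTime / 1000000
  val prod = matrix * matrix.t
  val duration = System.nanoTime / 1000000 - time
  println(s"computed in $duration ms")

  val rows = prod.rows
  val cols = prod.cols
  assert(cols == docs.size)

  // "test" is registered last, so it is the last row (unless the file already
  // contains a document named "test"). Sort its entries by descending value.
  val lastRow = prod(rows - 1 until rows, 0 until cols).iterator.toList.sortBy(-_._2)
  println(prod.toDense)
  // Print (document index -> value) for every document sharing a word with "test".
  for (((_, doc), value) <- lastRow if value > 0)
    println(doc -> value)

  // First pass: assign an index to every document and every word.
  def init(): Unit = {
    for {
      l <- lines
      Array(doc, word, _) = l.split(",")
    } {
      addDoc(doc)
      addWord(word)
    }
    println(s"${docs.size} docs")
    println(s"${words.size} words")
  }

  // Second pass: fill a sparse docs x words matrix with the counts.
  def initMatrix(): CSCMatrix[Int] = {
    val builder = new CSCMatrix.Builder[Int](rows = docs.size, cols = words.size)
    for {
      l <- lines
      Array(doc, word, value) = l.split(",")
      row = docs(doc)
      col = words(word)
    } {
      builder.add(row, col, value.toInt)
    }
    builder.result()
  }

  // Return the index of a document, registering it on first sight.
  def addDoc(doc: String): Int =
    docs.getOrElseUpdate(doc, { docCount += 1; docCount })

  // Return the index of a word, registering it on first sight.
  def addWord(w: String): Int =
    words.getOrElseUpdate(w, { wordCount += 1; wordCount })
}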
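
To make the meaning of `matrix * matrix.t` concrete, here is a small dependency-free sketch of the same computation on plain Scala collections. It is an illustration only, not the gist's Breeze code: `DotProductSketch`, `dot` and `rank` are hypothetical names, the three counts per document are copied from the sample listing, and the `query` rows are made up in the same spirit as the `test` rows the program appends.

```scala
object DotProductSketch {
  // (doc, word, count) triples. The "query" rows are invented for illustration.
  val triples: Seq[(String, String, Int)] = Seq(
    ("10000_txt_earn", "company", 1), ("10000_txt_earn", "share", 1), ("10000_txt_earn", "mln", 1),
    ("10054_txt_earn", "company", 1), ("10054_txt_earn", "share", 1), ("10054_txt_earn", "mln", 2),
    ("10080_txt_crude", "company", 2), ("10080_txt_crude", "share", 3), ("10080_txt_crude", "mln", 2),
    ("query", "share", 1), ("query", "mln", 1)
  )

  // One sparse row per document: word -> count.
  val rows: Map[String, Map[String, Int]] =
    triples.groupBy(_._1).map { case (doc, ts) => doc -> ts.map(t => t._2 -> t._3).toMap }

  // Entry (a, b) of M * M^T: sum over words of count_a(word) * count_b(word).
  def dot(a: String, b: String): Int = {
    val (ra, rb) = (rows(a), rows(b))
    ra.iterator.map { case (word, count) => count * rb.getOrElse(word, 0) }.sum
  }

  // Rank every other document by its product with the query row, highest
  // first, dropping documents that share no word with the query.
  def rank(query: String): Seq[(String, Int)] =
    rows.keys.toSeq
      .filter(_ != query)
      .map(doc => doc -> dot(query, doc))
      .filter(_._2 > 0)
      .sortBy { case (doc, score) => (-score, doc) }
}
```

For example, `DotProductSketch.rank("query")` lists `10080_txt_crude` first with a score of 5 (3·1 for `share` plus 2·1 for `mln`). Unlike the program above, this sketch leaves the query's product with itself out of the ranking and reports document names rather than row indices.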