Skip to content

Instantly share code, notes, and snippets.

@willf
Created December 23, 2011 16:54
Show Gist options
  • Save willf/1514745 to your computer and use it in GitHub Desktop.
Save willf/1514745 to your computer and use it in GitHub Desktop.
Create an inverted index from a file
/**
* From a file that contains
* doc_id w1 w2 w3 ... lines, separated by tabs
* return an inverted index Map of w -> Set(doc_id)
*
* @param filename well isn't it obvious
* @return Map[String,Set[String]]
*/
import scala.collection.immutable.Map
def invertedIndex(filename:String) = {
io.Source.fromFile(filename).getLines. // this is an iterator over lines
map(_.split("\t")). // split at tabs
filter(_.size > 0). // make sure there is at least one item
map(x => x.drop(1).map(y => (y,x(0)) )). // get inverted pairs for all lines
toList. // ? required but i'm not sure why...
flatMap(x => x). // flatten to pairs -- you could filter on these
groupBy(_._1). // group by the first key
map(p => (p._1,p._2.map(_._2).toSet)) // map over groups values, turning 2nd value into sets
}
@lambdaknight
Copy link

Looks pretty good. Only a few minor things:

  1. flatMap(x => x) has it's own method called flatten
  2. map(p => (p._1,p.2.map(._2).toSet)) can be simplified a bit using mapValues and is made a bit more understandable.

So, it would look like this:

def invertedIndex(filename:String) = {
  io.Source.fromFile(filename).getLines.  // this is an iterator over lines
  map(_.split(" ")).               // split at tabs
  filter(_.size > 0).               // make sure there is at least one item
  map(x => x.drop(1).map(y => (y,x(0)) )). // get inverted pairs for all lines
  toList.
  flatten.                  // flatten to pairs -- you could filter on these
  groupBy(_._1).                    // group by the first key
  mapValues(_.map(_._2).toSet) // map over groups values, turning 2nd value into sets
}

Alternatively, you could do it with a for comprehension:

def invertedIndex(filename:String) = {
  (for {
    line <- io.Source.fromFile(filename).getLines
    (doc_id :: ws) = line.split("\t").toList if ws.size > 0
    w <- ws
  } yield {
    w -> doc_id
  }).toList.groupBy(_._1).mapValues(_.map(_._2).toSet)
}

I like that method a bit better, but the choice is yours!

@willf
Copy link
Author

willf commented Dec 23, 2011 via email

@lambdaknight
Copy link

Well, the issue at hand is that you get an Iterator[String] from io.Source.fromFile(foo).getLines and groupBy isn't defined on Iterators, only on Iterables. As far as the flatten, that is unavoidable, I think. You start with a list of strings and then break each of those lists in to their own list of strings, so you're going to have that nested list no matter what you do, so you'll need to flatten it somehow. The only reason the for-comprehension solution doesn't have one is because it has a flatMap hidden inside of it.

@willf
Copy link
Author

willf commented Dec 23, 2011

Yes -- thanks.

I think the mapValues is incorrect, since I need to return String -> Set[String] pairs, so:

def invertedIndex(filename:String) = {
  io.Source.fromFile(filename).getLines.  // this is an iterator over lines
  map(_.split("\t")).               // split at tabs
  filter(_.size > 0).               // make sure there is at least one item
  map(x => x.drop(1).map(y => (y,x(0)) )). // get inverted pairs for all lines
  toList.                                // convert to list (sigh)
  flatten.                               // flatten to pairs -- you could filter on these
  groupBy(_._1).                    // group by the first key
  map(p => (p._1,p._2.map(_._2).toSet)) // map over groups values, turning 2nd value into sets
}

I still don't think I understand why flatten isn't defined on Iterators.

@willf
Copy link
Author

willf commented Dec 23, 2011

Ah, but it is defined on Streams:

def invertedIndex(fn:String) = {
  io.Source.fromFile(fn).getLines.        // this is an iterator over lines
  map(_.split("\t")).                     // split at tabs
  filter(_.size > 0).                     // make sure there is at least one item
  map(x => x.drop(1).map(y => (y,x(0)) )) // get inverted pairs for all lines
  toStream.                               // convert to Stream
  flatten.                                // flatten to pairs -- you could filter on these
  groupBy(_._1).                          // group by the first key
  map(p => (p._1,p._2.map(_._2).toSet))   // map over groups values, turning 2nd value into sets
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment