@myedibleenso
Created September 16, 2017 18:14
Snippet to test nxmlreader on a subset of PubMed OA
import java.io.File
import scala.util.Random
import ai.lum.common.FileUtils._
import ai.lum.common.RandomUtils._
import ai.lum.nxmlreader.NxmlReader

// Fixed seed so the same sample is drawn on every run
val rand = new Random(42)
val nxmlDir = new File("/net/kate/storage/data/nlp/corpora/bmgf/OA-100K-sample/data/nlp/corpora/pmc_openaccess/pmc_aug2016/")
// Recursively collect every .nxml file under the corpus directory
val nxmlFiles = nxmlDir.listFilesByWildcard("*.nxml", recursive = true)
// Draw a random sample of 1000 files
val nxmlFilesSample = rand.sample(nxmlFiles, 1000).toVector
val reader = new NxmlReader()

// Parse each sampled file; report the path and message of any that fail
nxmlFilesSample.foreach { f =>
  val contents = f.readString()
  try {
    reader.parse(contents)
  } catch {
    case e: Exception =>
      println(s"problem with ${f.getCanonicalPath}")
      println(e.getMessage)
  }
}
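
The loop above prints each failure as it happens. If a summary is more useful, one option is to collect the failures first and report them at the end. The following is only a minimal sketch using the Scala standard library: `ParseReport` and `collectFailures` are hypothetical names that are not part of nxmlreader, and `_.toInt` is a stand-in parser used so the example runs on its own; in the snippet above the parser would be `reader.parse`.

```scala
import scala.util.{Failure, Success, Try}

object ParseReport {
  // Hypothetical helper (not part of nxmlreader): applies `parse` to each
  // (name, contents) pair and returns (name, error message) for the failures.
  // Note that Try only catches non-fatal exceptions.
  def collectFailures[A](inputs: Seq[(String, String)])(parse: String => A): Seq[(String, String)] =
    inputs.flatMap { case (name, contents) =>
      Try(parse(contents)) match {
        case Success(_) => None
        case Failure(e) => Some(name -> String.valueOf(e.getMessage))
      }
    }

  def main(args: Array[String]): Unit = {
    // Stand-in inputs and parser, for illustration only.
    val inputs = Seq("a.nxml" -> "12", "b.nxml" -> "oops")
    val failures = collectFailures(inputs)(_.toInt)
    failures.foreach { case (name, msg) => println(s"problem with $name: $msg") }
    println(s"${failures.size} of ${inputs.size} inputs failed")
  }
}
```

Returning the failures as data, rather than printing inside the loop, makes it straightforward to count them or write the failing paths to a file for later inspection.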