@Mortimerp9
Created January 11, 2014 14:38
Jackson: extract a single field from a JSON InputStream or String in Scala, without parsing everything. I often deserialize whole objects with Jackson only to read one field out of the JSON. That is not very efficient; Jackson provides a streaming API that can be used to make it faster.

If the field contains an array or an object, its JSON representation is returned as a string, so you can chain these calls to reach a leaf lower in the JSON tree.
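To make the chaining idea concrete, here is a minimal self-contained sketch. It is not the code from this gist: `FieldSketch` and `field` are hypothetical names, it assumes Jackson 2.x core on the classpath, and it lets Jackson copy the subtree with `JsonGenerator.copyCurrentStructure` instead of copying tokens by hand. It matches the first field with the given name in document order, at any depth.

```scala
import java.io.StringWriter
import com.fasterxml.jackson.core.{JsonFactory, JsonToken}

// Hypothetical helper, for illustration only.
object FieldSketch {
  private val factory = new JsonFactory()

  /** Returns the value of the first field called `name`, or None if there is none. */
  def field(name: String, json: String): Option[String] = {
    val parser = factory.createParser(json)
    try {
      // Advance token by token until the wanted field name shows up (or input ends).
      var tok = parser.nextToken()
      while (tok != null && !(tok == JsonToken.FIELD_NAME && parser.getCurrentName == name))
        tok = parser.nextToken()

      if (tok == null) None
      else parser.nextToken() match {
        case null => None // input ended right after the field name
        case JsonToken.START_OBJECT | JsonToken.START_ARRAY =>
          // Re-serialize the whole subtree the parser currently points at.
          val out = new StringWriter()
          val gen = factory.createGenerator(out)
          gen.copyCurrentStructure(parser)
          gen.close()
          Some(out.toString)
        case _ => Some(parser.getText) // scalar: return its text
      }
    } finally parser.close()
  }
}

// Chaining: the first call returns the object as JSON text, the second reads a leaf from it.
val doc = """{"id":"1","from":{"category":"Movie","name":"Anchorman"}}"""
val name = FieldSketch.field("from", doc).flatMap(FieldSketch.field("name", _))
// name == Some("Anchorman")
```

Nothing after the matched value is ever tokenized, which is where the speed-up for fields near the top of a document comes from.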

Here are a few benchmarks run on a 396 KB JSON file from the Facebook API. In the output below:

  • First is a method deserializeToMap (not shown here) that uses the Jackson ObjectMapper to extract a Map[String, Any]
  • Second is the stream parser

The reported time ratio is Second divided by First, so a smaller ratio means the stream parser is faster. Even for short JSON this is still worth it. I guess that mapping to a specific object instead of using deserializeToMap would be even slower.
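Since deserializeToMap is not shown, here is one way it could look. This is a guess, not the method that was benchmarked: `MapSketch` is a hypothetical name, it assumes jackson-databind on the classpath and Scala 2.13+ (for `scala.jdk.CollectionConverters`), and nested objects and arrays stay as `java.util` collections because no Scala module is registered.

```scala
import com.fasterxml.jackson.databind.ObjectMapper
import scala.jdk.CollectionConverters._

// Hypothetical stand-in for the deserializeToMap used in the benchmarks.
object MapSketch {
  private val mapper = new ObjectMapper()

  /** Parses the whole document and exposes its top-level fields as a Scala Map. */
  def deserializeToMap(json: String): Map[String, Any] =
    mapper.readValue(json, classOf[java.util.Map[String, AnyRef]]).asScala.toMap
}
```

Whatever the exact implementation, this approach has to build the full map before a single field can be read, which is the cost the stream parser avoids.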

From a File, using an InputStream

This might not be fair as the map approach loads the whole file in memory first.

def stream = {
   val json = new java.io.FileInputStream("/tmp/test.json")
   ScalaJson.getFieldOnly("test", json)
}
def map={
   val json:String = scala.io.Source.fromFile("/tmp/test.json").mkString
   val m = ScalaJson.deserializeToMap(json)
   m("test")
}

When the field is at the top of the file

scala> th.pbenchOff() {map}{stream}
Benchmark comparison (in 16.15 s)
Significantly different (p ~= 0)
  Time ratio:    0.00042   95% CI 0.00040 - 0.00044   (n=20)
    First     28.14 ms   95% CI 27.66 ms - 28.61 ms
    Second    11.83 us   95% CI 11.37 us - 12.29 us

When the field is at the bottom of the file

scala> th.pbenchOff() {map}{stream}
Benchmark comparison (in 8.259 s)
Significantly different (p ~= 0)
  Time ratio:    0.04471   95% CI 0.04068 - 0.04873   (n=20)
    First     32.45 ms   95% CI 30.75 ms - 34.15 ms
    Second    1.451 ms   95% CI 1.344 ms - 1.557 ms

Extracting from a String already in memory

This might be fairer for cases where the json is already in memory.

val json:String = scala.io.Source.fromFile("/tmp/test.json").mkString
def stream(j: String) = {
   ScalaJson.getFieldOnly("test", j)
}
def map(j: String)={
   val m = ScalaJson.deserializeToMap(j)
   m("test")
}

When the field is at the top of the file

scala> th.pbenchOff() {map(json)}{stream(json)}
Benchmark comparison (in 8.818 s)
Significantly different (p ~= 0)
  Time ratio:    0.00018   95% CI 0.00017 - 0.00018   (n=20)
    First     1.647 ms   95% CI 1.634 ms - 1.659 ms
    Second    289.7 ns   95% CI 286.6 ns - 292.8 ns

When the field is missing from the file

scala> th.pbenchOff() {try{map(json)} catch{case _ =>}}{try{stream(json)} catch{case _ =>}}
Benchmark comparison (in 9.264 s)
Significantly different (p ~= 0)
  Time ratio:    0.80233   95% CI 0.79113 - 0.81353   (n=20)
    First     1.685 ms   95% CI 1.670 ms - 1.700 ms
    Second    1.352 ms   95% CI 1.337 ms - 1.367 ms

When the field is at the bottom of the file

scala> th.pbenchOff() {map(json)}{stream(json)}
Benchmark comparison (in 9.183 s)
Significantly different (p ~= 0)
  Time ratio:    0.82000   95% CI 0.81205 - 0.82795   (n=20)
    First     1.651 ms   95% CI 1.641 ms - 1.662 ms
    Second    1.354 ms   95% CI 1.344 ms - 1.364 ms

From a smaller json string

val json = """{
      "id": "14975", 
      "from": {
        "category": "Movie", 
        "name": "Anchorman"
      }, 
      "message": "Just realized that Facebook is NOT a personal journal for private thoughts."
}"""

When the field is near the top

scala>  th.pbenchOff() {map(json)}{stream(json)}
Benchmark comparison (in 9.888 s)
Significantly different (p ~= 0)
  Time ratio:    0.21309   95% CI 0.21096 - 0.21522   (n=20)
    First     1.639 us   95% CI 1.629 us - 1.650 us
    Second    349.3 ns   95% CI 346.7 ns - 352.0 ns

When the field is missing

scala> th.pbenchOff() {try{map(json)} catch{case _ =>}}{try{stream(json)} catch{case _ =>}}
Benchmark comparison (in 11.12 s)
Significantly different (p ~= 0)
  Time ratio:    0.16361   95% CI 0.16248 - 0.16474   (n=20)
    First     5.794 us   95% CI 5.762 us - 5.826 us
    Second    947.9 ns   95% CI 944.0 ns - 951.9 ns

When the field is at the bottom

scala>  th.pbenchOff() {map(json)}{stream(json)}
Benchmark comparison (in 16.82 s)
Significantly different (p ~= 0)
  Time ratio:    0.63468   95% CI 0.62477 - 0.64458   (n=20)
    First     1.591 us   95% CI 1.578 us - 1.605 us
    Second    1.010 us   95% CI 996.7 ns - 1.023 us
import com.fasterxml.jackson.core._

object ScalaJson {

  lazy val factory = new JsonFactory()

  // Enables json.jsonField("a")
  implicit class JsonString(json: String) {
    def jsonField(name: String) = ScalaJson.getFieldOnly(name, json)
  }

  // Enables chaining: json.jsonField("a").jsonField("b")
  implicit class JsonOptString(jsonOpt: Option[String]) {
    def jsonField(name: String) = jsonOpt.flatMap(ScalaJson.getFieldOnly(name, _))
  }

  /**
   * Extract just one value, as a string, from a JSON input stream. This should be faster than deserializing a map
   * or a specific object mapping. If the field contains an array or an object, their JSON representation is returned.
   * If the JSON is not valid, this might still succeed.
   * @param fieldName the name of the field to extract
   * @param json the JSON to look into
   * @return the value extracted from the JSON if it's found, otherwise None.
   */
  def getFieldOnly(fieldName: String, json: java.io.InputStream): Option[String] = {
    val parser = factory.createParser(json)
    getFieldOnlyFromParser(fieldName, parser)
  }

  /**
   * Extract just one value, as a string, from a JSON string. This should be faster than deserializing a map
   * or a specific object mapping. If the field contains an array or an object, their JSON representation is returned.
   * If the JSON is not valid, this might still succeed.
   * @param fieldName the name of the field to extract
   * @param json the JSON to look into
   * @return the value extracted from the JSON if it's found, otherwise None.
   */
  def getFieldOnly(fieldName: String, json: String): Option[String] = {
    val parser = factory.createParser(json)
    getFieldOnlyFromParser(fieldName, parser)
  }

  /**
   * A generic method doing the work from the parser, so we can accept both InputStream and String.
   * @param fieldName the name of the field to extract
   * @param parser will be closed in this method
   * @return the value extracted from the JSON if it's found, otherwise None.
   */
  private def getFieldOnlyFromParser(fieldName: String, parser: JsonParser): Option[String] = {

    // Re-serializes the array or object opened by `start`, reading the following tokens from tokenIt.
    def readJson(tokenIt: Iterator[JsonToken], start: JsonToken): String = {
      import scala.util.control.Breaks._
      var opens = 0 // nesting depth of arrays/objects currently open
      val until = start match {
        case JsonToken.START_ARRAY => JsonToken.END_ARRAY
        case JsonToken.START_OBJECT => JsonToken.END_OBJECT
        case _ => throw new UnsupportedOperationException(s"don't know how to process $start")
      }
      val str = new java.io.StringWriter()
      val gen = factory.createGenerator(str)
      def write(tok: JsonToken): Unit = {
        tok match {
          case JsonToken.START_ARRAY =>
            gen.writeStartArray()
            opens += 1
          case JsonToken.END_ARRAY =>
            gen.writeEndArray()
            opens -= 1
          case JsonToken.START_OBJECT =>
            gen.writeStartObject()
            opens += 1
          case JsonToken.END_OBJECT =>
            gen.writeEndObject()
            opens -= 1
          case JsonToken.FIELD_NAME => gen.writeFieldName(parser.getText)
          case JsonToken.VALUE_FALSE => gen.writeBoolean(false)
          case JsonToken.VALUE_TRUE => gen.writeBoolean(true)
          case JsonToken.VALUE_NULL => gen.writeNull()
          // Copy the number's text verbatim: getFloatValue/getIntValue would lose precision or overflow.
          case JsonToken.VALUE_NUMBER_FLOAT => gen.writeNumber(parser.getText)
          case JsonToken.VALUE_NUMBER_INT => gen.writeNumber(parser.getText)
          case JsonToken.VALUE_STRING => gen.writeString(parser.getText)
          case _ => throw new UnsupportedOperationException(s"don't know how to process $tok")
        }
      }
      write(start)
      breakable {
        while (tokenIt.hasNext) {
          val tok = tokenIt.next()
          write(tok)
          if (opens == 0 && tok.equals(until)) {
            break
          }
        }
      }
      gen.close()
      str.toString
    }

    // NOTE: the text of this method is cut off after readJson in this copy of the gist.
    // The lookup below is an assumed reconstruction, not the original code: it returns
    // the first field named fieldName in document order, at any depth.
    try {
      // Lazily pulls tokens from the parser; nextToken() returns null at end of input.
      val tokenIt = Iterator.continually(parser.nextToken()).takeWhile(_ != null)
      var found = false
      while (!found && tokenIt.hasNext) {
        found = tokenIt.next() == JsonToken.FIELD_NAME && parser.getCurrentName == fieldName
      }
      if (found && tokenIt.hasNext) {
        tokenIt.next() match {
          case start @ (JsonToken.START_ARRAY | JsonToken.START_OBJECT) => Some(readJson(tokenIt, start))
          case _ => Some(parser.getText) // scalar value: return its text
        }
      } else None
    } finally {
      parser.close()
    }
  }
}
import org.scalatest.FunSuite
import org.scalatest.matchers.ShouldMatchers

class ScalaJsonTest extends FunSuite with ShouldMatchers {

  val json2 = """{
    "id": "149754",
    "array": ["a","b","c"],
    "from": {
      "category": "Movie",
      "name": "Anchorman"
    },
    "nestO": {
      "array": ["a", "b"],
      "object": { "val": 10 }
    },
    "nestA": [["a", "b"], "c", ["d", "e"]]
  }"""

  test("getFieldOnly should find a top level field when it's present") {
    val v = ScalaJson.getFieldOnly("id", json2)
    v should be(Some("149754"))
  }

  test("getFieldOnly should return None for missing fields") {
    val v = ScalaJson.getFieldOnly("missing", json2)
    v should be(None)
  }

  test("getFieldOnly should work with object") {
    ScalaJson.getFieldOnly("from", json2) should be(Some("""{"category":"Movie","name":"Anchorman"}"""))
  }

  test("getFieldOnly should work with array") {
    ScalaJson.getFieldOnly("array", json2) should be(Some("""["a","b","c"]"""))
  }

  test("getFieldOnly works on nested array") {
    ScalaJson.getFieldOnly("nestA", json2) should be(Some("""[["a","b"],"c",["d","e"]]"""))
  }

  test("getFieldOnly works on nested objects") {
    ScalaJson.getFieldOnly("nestO", json2) should be(Some("""{"array":["a","b"],"object":{"val":10}}"""))
  }

  test("implicit conversion should allow chaining") {
    import ScalaJson._
    json2.jsonField("from").jsonField("name") should be(Some("Anchorman"))
  }
}