@bbpennel
Last active February 26, 2018 20:57

Data: https://docs.google.com/spreadsheets/d/1el-h-UPaLsXMxtOjnx-B26AvYqX57a8iHWTMo0GinzM/edit?usp=sharing

Source: https://github.com/bbpennel/rdf-serialization-metrics

Jena RDF Format Performance Comparison

When serializing a model, the impact of the RDF format is generally minor: all three formats serialized 10,000 properties/literals in under 1 second. At 1,000,000 literals, JSON-LD starts to underperform by a factor of around 5, but still took only 4.5 seconds.

When deserializing, Turtle moderately outperformed N-Triples, particularly at large scale. For example, at 10,000 literals it completed 2.2 times faster (1028ms vs 2233ms).

However, both outperformed JSON-LD, particularly at 5,000 literals and above. Deserializing 10,000 literals took about 185 times longer from JSON-LD than from N-Triples (4134ms vs 22ms). At 1,000,000 literals, JSON-LD had not completed after 2 hours, while Turtle completed in 1897ms. Additionally, JSON-LD's performance degraded 1-2 orders of magnitude faster than that of the other formats as more literals/properties were added.
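The comparisons above are simple ratios of elapsed times. A minimal sketch of that arithmetic, using only the timings quoted in the text (the helper name `ratio` is made up for illustration):

```python
def ratio(slower_ms, faster_ms):
    """Return how many times longer the slower run took."""
    if faster_ms <= 0:
        raise ValueError("faster_ms must be positive")
    return slower_ms / faster_ms

# Deserialization timings quoted in the text, in milliseconds.
ntriples_vs_turtle = ratio(2233, 1028)  # N-Triples time over Turtle time
jsonld_vs_ntriples = ratio(4134, 22)    # JSON-LD time over N-Triples time

print(f"N-Triples / Turtle: {ntriples_vs_turtle:.1f}x")
print(f"JSON-LD / N-Triples: {jsonld_vs_ntriples:.0f}x")
```

With the rounded timings shown here, the first ratio matches the 2.2 figure and the second lands close to the quoted factor of about 185.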

In conclusion, N-Triples and Turtle are both acceptable formats at small or large scale, with Turtle performing slightly better when deserializing and N-Triples performing a little better when serializing. JSON-LD is not recommended past about 1,000 literals/properties.

Fcrepo Metadata Retrieval Time Comparison versus RDF Format

For this case, I created container resources to test the effect of RDF serialization on retrieving resources whose literals/triples were added in various ways: a resource with default triples, a resource with ldp:contains to resources, a resource with relationships populated by an indirect container, a resource with literals added, a resource with the same property to many resources, and a resource with many properties to the same resource. This was tested against Fcrepo 4.7.4 running in Glassfish.

The RDF format specified via the Accept header was a minor factor, accounting for a 1-7% difference in execution time between formats. JSON-LD very slightly underperformed at scale, and Turtle did slightly better than the others. Based on the results from the Jena serialization testing, serialization may have represented < 1% of the retrieval time (1077ms to retrieve 10,000 properties to different resources from Fcrepo using Turtle, versus 8.8ms to serialize directly in Jena).
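A comparison like this can be approximated by requesting the same resource with different Accept headers and timing each request. The sketch below is not the harness used for these results; the function names, the repeat count, and the example URL are assumptions for illustration. Only the three media types are standard.

```python
import time
import urllib.request

# Standard media types for the three formats compared here.
FORMATS = {
    "Turtle": "text/turtle",
    "N-Triples": "application/n-triples",
    "JSON-LD": "application/ld+json",
}

def fetch(url, accept):
    """GET the URL with the given Accept header and return the body bytes."""
    request = urllib.request.Request(url, headers={"Accept": accept})
    with urllib.request.urlopen(request) as response:
        return response.read()

def time_formats(url, fetcher=fetch, repeats=5):
    """Return the mean retrieval time in milliseconds for each format."""
    results = {}
    for name, media_type in FORMATS.items():
        elapsed = []
        for _ in range(repeats):
            start = time.perf_counter()
            fetcher(url, media_type)
            elapsed.append((time.perf_counter() - start) * 1000.0)
        results[name] = sum(elapsed) / len(elapsed)
    return results

# Example call against a hypothetical local repository URL:
# print(time_formats("http://localhost:8080/rest/test_container"))
```

Passing the fetcher as a parameter keeps the timing loop separate from the HTTP call, so the loop can be exercised without a running repository.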

I have not yet compared the effect of RDF format on PUT/POST time, but could if others are interested.

The type of containment used to associate a resource with other resources had a more significant impact. On average, retrieving a container that uses an indirect container takes twice as long as retrieving one that uses a direct container or is a simple container. It also underperformed non-containment relationship properties by about 43%.

I did not test the 1,000,000 relationship scenario. The default pair tree identifier behavior was used for objects created.

Other notes

The RDF Formats were selected to exactly match the defaults used in Fedora 4 (such as Turtle/Pretty Format and JSON-LD/compact pretty).

The "base_record" and "blank" cases are included as baselines, both representing a container created in Fcrepo 4.7 without any additional triples added.

Turtle/Pretty performed slightly better than the other two available Turtle formats (Blocks and Flat).

Increasing memory may improve some of the more expensive JSON-LD cases.
