@spanishgum
Last active October 4, 2016 15:25
My abstract submission for the AGU meeting in California.
Hey friends, I figured I would just post a gist here so y'all can get a feel for what I am working on.
You'll find the actual abstract below (between the lines). Or skip further down to see a layman's terms
description for the less technical folk :)
-----------------------------------------------------------------------
Enhancing SAMOS Data Access in DOMS via a Neo4j Property Graph Database.
The Shipboard Automated Meteorological and Oceanographic System (SAMOS) initiative provides routine
access to high-quality marine meteorological and near-surface oceanographic observations from research
vessels. The Distributed Oceanographic Match-Up Service (DOMS) under development is a centralized service
that allows researchers to easily match in situ and satellite oceanographic data from distributed
sources to facilitate satellite calibration, validation, and retrieval algorithm development. The service
currently uses Apache Solr as a backend search engine on each node in the distributed network. While Solr
is a high-performance solution that facilitates creation and maintenance of indexed data, it is limited
in the sense that its schema is fixed. The property graph model escapes this limitation by creating
relationships between data objects.
The authors will present the development of the SAMOS Neo4j property graph database including new search
possibilities that take advantage of the property graph model, performance comparisons with Apache Solr,
and a vision for graph databases as a storage tool for oceanographic data. The integration of the SAMOS
Neo4j graph into DOMS will also be described. Currently, the Neo4j database contains spatial and temporal
records from SAMOS, which are modeled as a time tree and an R-tree using the GraphAware and Spatial plugin
tools for Neo4j. These extensions provide callable Java procedures within Cypher (Neo4j's query language)
that generate in-graph structures. Once generated, these structures can be queried using procedures from
these libraries, or directly via Cypher statements.
Neo4j excels at performing relationship and path-based queries, which challenge relational SQL databases
because their table-based design requires memory-intensive joins. Consider a user who
wants to find records over several years, but only for specific months. If a traditional database only
stores timestamps, this type of query would be complex and likely prohibitively slow. Using the time tree
model, one can specify a path from the root to the data which restricts resolutions to certain
timeframes (e.g., months). This query can be executed without joins, unions, or other compute-intensive
operations, giving Neo4j a computational advantage over the SQL database alternative.
------------------------------------------------------------------
OK. So what the heck was all that, right? Don't worry too much about all the acronyms. Rather, consider
the simple idea that I am trying to process LOTS of data REALLY quickly. I have millions of data points
of the following form -> (latitude, longitude, time, ...other sciency variables...). Essentially I am
orchestrating a series of software tools to facilitate moving all this stuff around across a distributed
network (currently between people here at FSU-Florida, some at NCAR-Colorado, and some at JPL-California).
The core of this research is really me trying to figure out how I can take advantage of a 'graph database'.
It stores information in a very different way than traditional systems. The challenging part is figuring
out how to take full advantage of the graph concept. This means shortest path finding, subgraph matching,
etc. These algorithms are easy to call since they are built into the system. BUT, the real work is figuring
out HOW to use them, and WHAT data structures I can build internally so that these algorithms actually do
something meaningful - and quickly.
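For anyone curious, the time tree idea from the abstract can be sketched in a few lines of Python. This is
only a toy: it is not Neo4j or the plugin mentioned above, just nested dictionaries standing in for the
year, month, and day nodes. The TimeTree class, its insert and query methods, and the sample coordinates
are all made up for illustration. The point is that "specific months over several years" becomes a walk
down a few branches instead of a scan over every timestamp.

```python
from collections import defaultdict
from datetime import datetime


class TimeTree:
    """Toy in-memory time tree: root -> year -> month -> day -> records."""

    def __init__(self):
        # Nested dicts play the role of graph nodes and the links between them.
        self.root = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))

    def insert(self, record):
        # A record is a (latitude, longitude, time, ...) tuple; time is a datetime.
        t = record[2]
        self.root[t.year][t.month][t.day].append(record)

    def query(self, years, months):
        """Collect records that fall in the given months of the given years."""
        found = []
        for year in years:
            year_node = self.root.get(year)  # .get() avoids creating empty branches
            if year_node is None:
                continue
            for month in months:
                month_node = year_node.get(month)
                if month_node is None:
                    continue
                for day in sorted(month_node):
                    found.extend(month_node[day])
        return found


tree = TimeTree()
tree.insert((30.4, -84.3, datetime(2014, 7, 4, 12, 0)))
tree.insert((30.5, -84.2, datetime(2015, 1, 15, 6, 30)))
tree.insert((30.6, -84.1, datetime(2015, 7, 20, 18, 45)))

# Summer months only, across three years: follow just the matching branches.
summer = tree.query(years=range(2014, 2017), months=[6, 7, 8])
print(len(summer))  # 2 (the July 2014 and July 2015 records)
```

In the real graph the same walk would presumably be expressed as a path pattern from the tree root down to
the data nodes, which is where the built-in graph algorithms come in.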
As of now, I'm by no means coming out with anything out of this world. There are lots of people like
me working on this kind of stuff, but what separates my work is the domain. Working with geospatial data
on this scale is more common today, but working with things like satellites, ships, and buoys, i.e. objects
moving on a global scale, is still an active area of research.
Thanks for your support, guys :)
-Adam