Data Management items leftover from R2

# Outstanding R2 DM Items

## Other

  • Consider a technology (e.g. Vagrant) for providing a common environment among developers

## Coverage Model

  • The current use of HDF attributes may be a poor choice for a few reasons and should be reevaluated and/or replaced with another mechanism for storing metadata:
    • After many ingestion events, metadata files end up very large and mostly 'unallocated' (i.e. empty) due to B-tree change history
    • May leak memory (unverified)
    • May not be particularly fast (unverified)
  • Container blocks on HDF IO operations
  • Support for fill values, missing values, nil values, and NaN all needs to be reworked and properly supported
  • Non-isomorphic parameter functions need proper support; they currently require too much understanding from the client, who must know the "window" the function requires and account for it when requesting data (a sketch follows this list)
    • The infrastructure surrounding parameter functions does not properly support shape-in != shape-out
  • Alignment / coherence with the RDT could be improved; it's not currently awful, but it could be better
  • A true service API to the coverage is needed; ingestion/retrieve currently serve as the de facto API, but a proper API should be developed and then used by ingestion/retrieve
  • Coverage Doctor could still use some improvement
  • CRS implementation within coverage model is metadata-level only - just the EPSG code; it is not tied to any system or projection functionality
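
As a concrete illustration of the non-isomorphic case above, here is a minimal numpy sketch of a windowed, QC-style function whose output is smaller than its input. The function and the `retrieve` call are hypothetical stand-ins, not the actual coverage model API.

```python
import numpy as np

def spike_test(values, window=1):
    """Hypothetical QC-style spike test: flags samples that deviate
    sharply from their neighbours. Consumes len(values) samples but
    produces only len(values) - 2*window flags, since the edge
    samples lack the context the window requires."""
    v = np.asarray(values, dtype=float)
    flags = np.zeros(len(v) - 2 * window, dtype=bool)
    for i in range(window, len(v) - window):
        neighbours = np.concatenate([v[i - window:i], v[i + 1:i + window + 1]])
        flags[i - window] = abs(v[i] - neighbours.mean()) > 3 * neighbours.std()
    return flags

# For the client to get flags for samples 100..200, the infrastructure
# (not the client) should expand the request and trim the result:
# padded = retrieve(99, 201)               # hypothetical retrieval call
# flags = spike_test(padded, window=1)     # aligns with samples 100..200
```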

## Ingestion

  • Management of the coverage cache is currently mostly solid, but needs some attention to make it fully correct
  • Tight coupling between the ingestion worker and the instrument: non-HA ingestion workers tend to be slow, bulky, and to block the CPU
  • Pausing and resuming ingestion of a data stream is a shortcut: it currently uses reentrant locks, but should use a queue mechanism
  • The mechanism for finding lookup values is inefficient: they are fetched from the database for every granule to ensure updates to the LVs are picked up (a caching sketch follows this list)
  • Realtime QC processing does not support non-isomorphic functions
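
The caching sketch referenced above: a hypothetical TTL cache that still picks up lookup-value updates without a database round trip per granule. `fetch_fn` and the usage names are illustrative, not the actual ingestion worker API.

```python
import time

class LookupValueCache(object):
    """Hypothetical TTL cache for lookup values: refreshes from the
    database at most once per `ttl_seconds`, so LV updates are still
    picked up without a per-granule query."""

    def __init__(self, fetch_fn, ttl_seconds=60):
        self._fetch = fetch_fn          # e.g. a wrapper around the DB read
        self._ttl = ttl_seconds
        self._values = None
        self._loaded_at = 0.0

    def get(self):
        now = time.time()
        if self._values is None or now - self._loaded_at > self._ttl:
            self._values = self._fetch()
            self._loaded_at = now
        return self._values

# Usage inside the per-granule loop (names illustrative):
# cache = LookupValueCache(lambda: db.read('lookup_values'), ttl_seconds=30)
# for granule in stream:
#     process(granule, cache.get())   # one DB hit per TTL window, not per granule
```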

## Replay / Retrieve

  • The accuracy/completeness of Replay is questionable: it was implemented very early (much has changed since), it has no clients, and there are very few tests
    • At this time, all "Real Time" clients (e.g. viz, UI) use polled calls to Retrieve (the pattern is sketched below)
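
For reference, the polling pattern those clients follow looks roughly like this; `retrieve_last` and the interval are illustrative placeholders, not the actual client code.

```python
import time

def poll_retrieve(retrieve_last, interval_s=5.0):
    """Hypothetical polling loop standing in for a real-time subscription:
    the client repeatedly calls Retrieve instead of receiving pushed data."""
    last_seen = None
    while True:
        granule = retrieve_last()        # illustrative Retrieve call
        if granule is not None and granule != last_seen:
            last_seen = granule
            print('new data:', granule)
        time.sleep(interval_s)           # latency is bounded by the poll interval
```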

## Parameter Management

  • Managing parameter context, dictionary, stream, and function definitions from the Google Docs preload spreadsheet is very difficult and prevents some things from being done properly
  • Relationship between DataProduct, StreamDefinition and ViewCoverage should be reevaluated and strengthened/realigned
  • Coherence/alignment between ParameterContext resource (ion) and ParameterContext object (coverage model) needs improvement
  • Lookup values: the strategy for parsing, storing, and accessing them within the system (e.g. from ingestion) should be rethought; it is not very efficient and precludes some use cases (e.g. values for the local range test)
  • Dynamic parameter creation (e.g. lookup values, calibration coefficients) is a shortcut and should be rethought/refactored
  • Overall lack of tests:
    • ParameterTypes/Values need more robust end-to-end testing
    • Need more extensive "fail case" tests for ParameterValues (e.g. invalid values, shapes, etc.); a sketch follows this list
    • Need tests for all of the REAL workflows for the instruments
      • Requires simulators and/or sample data for all instruments
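
The kind of "fail case" coverage referenced above, as a hypothetical pytest-style sketch: invalid shapes and dtypes should raise rather than silently coerce. `set_value` is an illustrative stand-in, not the real coverage-model setter.

```python
import numpy as np
import pytest

def set_value(store, values, expected_shape, expected_dtype):
    """Hypothetical stand-in for a ParameterValue setter that validates
    its input instead of silently coercing it."""
    arr = np.asarray(values)
    if arr.shape != expected_shape:
        raise ValueError('shape %s != expected %s' % (arr.shape, expected_shape))
    if not np.can_cast(arr.dtype, expected_dtype):
        raise TypeError('dtype %s not castable to %s' % (arr.dtype, expected_dtype))
    store[:] = arr.astype(expected_dtype)

def test_rejects_bad_shape():
    store = np.zeros(10)
    with pytest.raises(ValueError):
        set_value(store, np.zeros(5), expected_shape=(10,), expected_dtype='f8')

def test_rejects_bad_dtype():
    store = np.zeros(10)
    with pytest.raises(TypeError):
        set_value(store, np.array(['a'] * 10), expected_shape=(10,), expected_dtype='f8')
```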

## Discovery

  • Geospatial bounding searches (e.g. contains, intersects) are still outstanding; a sketch of the predicate semantics follows
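
A minimal sketch, using shapely, of the query predicates Discovery would need to expose. This illustrates the semantics only; it is not the Discovery service implementation.

```python
from shapely.geometry import box, Point

search_bounds = box(-125.0, 40.0, -120.0, 45.0)      # lon/lat bounding box
dataset_footprint = box(-123.0, 42.0, -118.0, 47.0)

print(search_bounds.contains(Point(-122.0, 41.0)))    # True: point inside box
print(search_bounds.intersects(dataset_footprint))    # True: boxes overlap
print(search_bounds.contains(dataset_footprint))      # False: only partial overlap
```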

## PubSub / Data Transport

  • Infrastructure for having "derived" StreamDefs / DataProducts is not being utilized to its full extent, which results in endpoints (e.g. viz) needing parameter-filtering logic that shouldn't be necessary (sketched after this list)
  • Additional helper utilities are needed in the RDT to help endpoints work with Streams/Granules more easily and efficiently
  • Reference Designator was stored in StreamDefinitions as a last-ditch measure and should be rethought
  • Topic Topologies: implemented, but not tested, exposed, or utilized
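
A minimal sketch of the parameter-filtering logic endpoints currently carry themselves; with properly derived StreamDefs/DataProducts, the subscription itself would be restricted to the parameters of interest. A plain dict stands in here for the real RDT/Granule types.

```python
def filter_parameters(record_dict, wanted):
    """Keep only the parameters an endpoint cares about. Endpoints like
    viz each re-implement this today; a derived StreamDefinition would
    let the subscription carry only `wanted` in the first place."""
    return {name: values for name, values in record_dict.items()
            if name in wanted}

# granule = {'time': [...], 'temp': [...], 'raw_counts': [...], 'cond': [...]}
# filter_parameters(granule, wanted={'time', 'temp'})
# -> {'time': [...], 'temp': [...]}
```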

## Data Processes / Transforms (SA)

  • Buffering of granules: needed to support things such as real-time processing of non-isomorphic functions (e.g. many of the QC functions); a sketch follows this list
  • Stream Multiplexing: some work was done, but was then abandoned - it is incomplete and needs reevaluation
  • "Loading" and launching of arbitrary transforms that are uploaded via eggs

## DataProducts (SA)

  • Derived DataProducts need more comprehensive support
    • Creation & management
    • PubSub support

## User Notification (SA)

  • Status unknown; this falls within SA's knowledge domain

## DataExternalization (EOI)

  • PyDAP & ERDDAP incorporation is not particularly good: needs reevaluation and fixes
  • Externalization Framework is not solid: needs improvement
  • Catalog externalization is completely lacking
  • Mechanism for External Subscription/Delivery (i.e. Dispatcher) is non-existent
  • Data Agent Framework may have recently become divergent (need to evaluate)