Data Management items leftover from R2

# Outstanding R2 DM Items

## Other

  • Consider a technology (e.g. Vagrant) for providing a common environment among developers

## Coverage Model

  • The current use of HDF attributes may be a poor choice for a few reasons and should be reevaluated and/or replaced with another mechanism for storing metadata:
    • After many ingestion events, metadata files end up very large and mostly 'unallocated' (i.e. empty) due to B-tree change history
    • May leak memory (unverified)
    • May not be particularly fast (unverified)
  • Container blocks on HDF IO operations
  • Support for fill values, missing values, nil values, and NaN all needs to be reworked and properly supported
  • Non-isomorphic parameter functions need proper support; they currently require too much understanding from the client, who must know the "window" the function requires and account for it when requesting data (a sketch follows this list)
    • The infrastructure surrounding parameter functions does not properly support shape-in != shape-out
  • Alignment / coherence with the RDT could be improved; it's not currently awful, but it could be better
  • A true service API to the coverage is needed; ingestion/retrieve currently serve as the de facto API, but a proper API should be developed and then used by ingestion/retrieve
  • Coverage Doctor could still use some improvement
  • CRS implementation within coverage model is metadata-level only - just the EPSG code; it is not tied to any system or projection functionality
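
As a concrete illustration of the non-isomorphic case above, here is a minimal numpy sketch of a windowed, QC-style function whose output is smaller than its input. The function and the `retrieve` call are hypothetical stand-ins, not the actual coverage model API.

```python
import numpy as np

def spike_test(values, window=1):
    """Hypothetical QC-style spike test: flags samples that deviate
    sharply from their neighbours. Consumes len(values) samples but
    produces only len(values) - 2*window flags, since the edge
    samples lack the context the window requires."""
    v = np.asarray(values, dtype=float)
    flags = np.zeros(len(v) - 2 * window, dtype=bool)
    for i in range(window, len(v) - window):
        neighbours = np.concatenate([v[i - window:i], v[i + 1:i + window + 1]])
        flags[i - window] = abs(v[i] - neighbours.mean()) > 3 * neighbours.std()
    return flags

# For the client to get flags for samples 100..200, the infrastructure
# (not the client) should expand the request and trim the result:
# padded = retrieve(99, 201)               # hypothetical retrieval call
# flags = spike_test(padded, window=1)     # aligns with samples 100..200
```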

## Ingestion

  • Management of the coverage cache is currently mostly solid, but needs some attention to make it fully correct
  • Tight coupling between the ingestion worker and the instrument: non-HA ingestion workers tend to be slow, bulky, and to block the CPU
  • Pausing and resuming ingestion of a data stream is a shortcut: it currently uses reentrant locks, but should use a queue mechanism
  • The mechanism for finding lookup values is inefficient: they are fetched from the database for every granule to ensure updates to the LVs are picked up (a caching sketch follows this list)
  • Realtime QC processing does not support non-isomorphic functions
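
The caching sketch referenced above: a hypothetical TTL cache that still picks up lookup-value updates without a database round trip per granule. `fetch_fn` and the usage names are illustrative, not the actual ingestion worker API.

```python
import time

class LookupValueCache(object):
    """Hypothetical TTL cache for lookup values: refreshes from the
    database at most once per `ttl_seconds`, so LV updates are still
    picked up without a per-granule query."""

    def __init__(self, fetch_fn, ttl_seconds=60):
        self._fetch = fetch_fn          # e.g. a wrapper around the DB read
        self._ttl = ttl_seconds
        self._values = None
        self._loaded_at = 0.0

    def get(self):
        now = time.time()
        if self._values is None or now - self._loaded_at > self._ttl:
            self._values = self._fetch()
            self._loaded_at = now
        return self._values

# Usage inside the per-granule loop (names illustrative):
# cache = LookupValueCache(lambda: db.read('lookup_values'), ttl_seconds=30)
# for granule in stream:
#     process(granule, cache.get())   # one DB hit per TTL window, not per granule
```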

## Replay / Retrieve

  • The accuracy/completeness of Replay is questionable: it was implemented very early (much has changed since), it has no clients, and there are very few tests
    • At this time, all "Real Time" clients (e.g. viz, UI) use polled calls to Retrieve (the pattern is sketched below)
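
For reference, the polling pattern those clients follow looks roughly like this; `retrieve_last` and the interval are illustrative placeholders, not the actual client code.

```python
import time

def poll_retrieve(retrieve_last, interval_s=5.0):
    """Hypothetical polling loop standing in for a real-time subscription:
    the client repeatedly calls Retrieve instead of receiving pushed data."""
    last_seen = None
    while True:
        granule = retrieve_last()        # illustrative Retrieve call
        if granule is not None and granule != last_seen:
            last_seen = granule
            print('new data:', granule)
        time.sleep(interval_s)           # latency is bounded by the poll interval
```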

## Parameter Management

  • Managing parameter context, dictionary, stream, and function definitions from the Google Docs preload spreadsheet is very difficult and prevents some things from being done properly
  • Relationship between DataProduct, StreamDefinition and ViewCoverage should be reevaluated and strengthened/realigned
  • Coherence/alignment between ParameterContext resource (ion) and ParameterContext object (coverage model) needs improvement
  • Lookup values: the strategy for parsing, storing, and accessing them within the system (e.g. from ingestion) should be rethought; it is not very efficient and precludes some use cases (e.g. values for the local range test)
  • Dynamic parameter creation (e.g. lookup values, calibration coefficients) is a shortcut and should be rethought/refactored
  • Overall lack of tests:
    • ParameterTypes/Values need more robust end-to-end testing
    • Need more extensive "fail case" tests for ParameterValues (e.g. invalid values, shapes, etc.); a sketch follows this list
    • Need tests for all of the REAL workflows for the instruments
      • Requires simulators and/or sample data for all instruments
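
The kind of "fail case" coverage referenced above, as a hypothetical pytest-style sketch: invalid shapes and dtypes should raise rather than silently coerce. `set_value` is an illustrative stand-in, not the real coverage-model setter.

```python
import numpy as np
import pytest

def set_value(store, values, expected_shape, expected_dtype):
    """Hypothetical stand-in for a ParameterValue setter that validates
    its input instead of silently coercing it."""
    arr = np.asarray(values)
    if arr.shape != expected_shape:
        raise ValueError('shape %s != expected %s' % (arr.shape, expected_shape))
    if not np.can_cast(arr.dtype, expected_dtype):
        raise TypeError('dtype %s not castable to %s' % (arr.dtype, expected_dtype))
    store[:] = arr.astype(expected_dtype)

def test_rejects_bad_shape():
    store = np.zeros(10)
    with pytest.raises(ValueError):
        set_value(store, np.zeros(5), expected_shape=(10,), expected_dtype='f8')

def test_rejects_bad_dtype():
    store = np.zeros(10)
    with pytest.raises(TypeError):
        set_value(store, np.array(['a'] * 10), expected_shape=(10,), expected_dtype='f8')
```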

## Discovery

  • Geospatial bounding searches (e.g. contains, intersects) are still outstanding; a sketch of the predicate semantics follows
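
A minimal sketch, using shapely, of the query predicates Discovery would need to expose. This illustrates the semantics only; it is not the Discovery service implementation.

```python
from shapely.geometry import box, Point

search_bounds = box(-125.0, 40.0, -120.0, 45.0)      # lon/lat bounding box
dataset_footprint = box(-123.0, 42.0, -118.0, 47.0)

print(search_bounds.contains(Point(-122.0, 41.0)))    # True: point inside box
print(search_bounds.intersects(dataset_footprint))    # True: boxes overlap
print(search_bounds.contains(dataset_footprint))      # False: only partial overlap
```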

## PubSub / Data Transport

  • Infrastructure for having "derived" StreamDefs / DataProducts is not being utilized to its full extent, which results in endpoints (e.g. viz) needing parameter-filtering logic that shouldn't be necessary (sketched after this list)
  • Additional helper utilities are needed in the RDT to help endpoints work with Streams/Granules more easily and efficiently
  • Reference Designator was stored in StreamDefinitions as a last-ditch measure and should be rethought
  • Topic Topologies: implemented, but not tested, exposed, or utilized
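
A minimal sketch of the parameter-filtering logic endpoints currently carry themselves; with properly derived StreamDefs/DataProducts, the subscription itself would be restricted to the parameters of interest. A plain dict stands in here for the real RDT/Granule types.

```python
def filter_parameters(record_dict, wanted):
    """Keep only the parameters an endpoint cares about. Endpoints like
    viz each re-implement this today; a derived StreamDefinition would
    let the subscription carry only `wanted` in the first place."""
    return {name: values for name, values in record_dict.items()
            if name in wanted}

# granule = {'time': [...], 'temp': [...], 'raw_counts': [...], 'cond': [...]}
# filter_parameters(granule, wanted={'time', 'temp'})
# -> {'time': [...], 'temp': [...]}
```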

## Data Processes / Transforms (SA)

  • Buffering of granules: needed to support things such as real-time processing of non-isomorphic functions (e.g. many of the QC functions); a sketch follows this list
  • Stream Multiplexing: some work was done, but was then abandoned - it is incomplete and needs reevaluation
  • "Loading" and launching of arbitrary transforms that are uploaded via eggs

## DataProducts (SA)

  • Derived DataProducts need more comprehensive support
    • Creation & management
    • PubSub support

## User Notification (SA)

  • Status unknown; this falls within SA's knowledge domain

## DataExternalization (EOI)

  • PyDAP & ERDDAP incorporation is not particularly good: needs reevaluation and fixes
  • Externalization Framework is not solid: needs improvement
  • Catalog externalization is completely lacking
  • Mechanism for External Subscription/Delivery (i.e. Dispatcher) is non-existent
  • Data Agent Framework may have recently become divergent (need to evaluate)