incorrect validation

A resource descriptor whose fields have format set to None passes validate, but fails with dump_to_sql.

{'name': 'srm_services', 'path': 'res_1.csv', 'profile': 'tabular-data-resource', 'schema': {'fields': [{'name': 'id', 'type': 'string', 'format': None, 'constraints': {}}, {'name': 'source', 'type': 'string', 'format': None, 'constraints': {}}, {'name': 'name', 'type': 'string', 'format': None, 'constraints': {}}, {'name': 'payment_required', 'type': 'string', 'format': None, 'constraints': {}}, {'name': 'urls', 'type': 'string', 'format': None, 'constraints': {}}, {'name': 'description', 'type': 'string', 'format': None, 'constraints': {}}, {'name': 'details', 'type': 'string', 'format': None, 'constraints': {}}, {'name': 'status', 'type': 'string', 'format': None, 'constraints': {}}, {'name': 'payment_details', 'type': 'string', 'format': None, 'constraints': {}}], 'missingValues': ['']}}

Solution: set format to "default"

{'name': 'srm_services', 'path': 'res_1.csv', 'profile': 'tabular-data-resource', 'schema': {'fields': [{'name': 'id', 'type': 'string', 'format': 'default', 'constraints': {}}, {'name': 'source', 'type': 'string', 'format': 'default', 'constraints': {}}, {'name': 'name', 'type': 'string', 'format': 'default', 'constraints': {}}, {'name': 'payment_required', 'type': 'string', 'format': 'default', 'constraints': {}}, {'name': 'urls', 'type': 'string', 'format': 'default', 'constraints': {}}, {'name': 'description', 'type': 'string', 'format': 'default', 'constraints': {}}, {'name': 'details', 'type': 'string', 'format': 'default', 'constraints': {}}, {'name': 'status', 'type': 'string', 'format': 'default', 'constraints': {}}, {'name': 'payment_details', 'type': 'string', 'format': 'default', 'constraints': {}}], 'missingValues': ['']}}
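
A minimal sketch of that workaround, assuming the descriptor above is held in a dict called descriptor:

for field in descriptor['schema']['fields']:
    # Normalise format=None to 'default' so dump_to_sql accepts the schema
    if field.get('format') is None:
        field['format'] = 'default'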

validate does not check that primary key values are unique

does it work with constraints.unique?
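
For reference, a field-level unique constraint would look like this (a sketch only; whether validate enforces it is the open question):

'schema': {
    'fields': [
        {'name': 'id', 'type': 'string', 'format': 'default', 'constraints': {'unique': True}}
    ],
    'primaryKey': ['id']
}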

load from SQL and text fields

Investigate: some fields are being read out as any rather than string, for no clear reason. This breaks downstream in the pipeline (Elasticsearch doesn't handle any).
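
A possible workaround sketch while investigating: force the affected columns back to string after loading, using set_type (the column names here are illustrative):

from dataflows import Flow, load, set_type

Flow(
    load(DUMP_DB, table='srm_services'),
    set_type('description', type='string'),
    set_type('details', type='string'),
)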

array fields with None

Investigate: experienced some exceptions on rows where an array field is None. Can't really deal with this upstream in the current case (don't have any type data from Airtable). Not sure this is the actual error; it just looked like it from inspecting the data, since the exception message itself was fairly cryptic.
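
If None arrays really are the cause, a row-level workaround sketch (field names are illustrative; this relies on dataflows treating a one-argument function over a row dict as a row processor):

def fix_null_arrays(row):
    # Replace None with an empty list so array handling downstream doesn't choke
    for key in ('situations', 'responses'):
        if row.get(key) is None:
            row[key] = []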

error handling

It would be really great to have a flag on Flow that, on any exception in the flow, drops you into a debugging REPL, ideally with some handy variables set up to inspect the current state. This would be more convenient than on_error on DataStreamProcessor for processors in the flow, as many processors just implement part of that interface as functions and don't have on_error. Plus, you set up your desired behaviour once, on the flow, for development.
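
Until such a flag exists, a rough workaround sketch for plain-function steps only (it won't cover built-in DataStreamProcessor steps):

import pdb
import sys

def debug_on_error(step):
    # Wrap a rows-processor so any exception drops into a post-mortem REPL
    def wrapper(rows):
        try:
            yield from step(rows)
        except Exception:
            pdb.post_mortem(sys.exc_info()[2])
            raise
    return wrapper

Used as Flow(load(...), debug_on_error(my_step), ...).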

breakpoint processor

Related to the above: the same or similar machinery, but inserted as a processor at any point in the flow, so not triggered by an exception, allowing simple inspection of the current state at that point in the pipeline.
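
A sketch of what that could look like today, assuming dataflows' convention that a generator function taking an iterator of rows is applied per resource:

import pdb

def breakpoint_step(rows):
    # Pause on the first row of each resource and inspect state in pdb
    first = True
    for row in rows:
        if first:
            pdb.set_trace()
            first = False
        yield row

Drop it into the flow at the point of interest: Flow(load(...), breakpoint_step, dump_to_path(...)).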

non-intuitive resource naming on load from SQL (maybe more?)

You have a database called, for example, srm, and want to load a table from it:

load(DUMP_DB, table="srm_services")

One would expect a resource in the package called srm_services (the table > resource mapping is conceptually correct), but the resource gets called srm, which conceptually is the name of the package (a collection of tables > a collection of resources, i.e. a package).
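
A workaround sketch, renaming the loaded resource with update_resource (assuming it really does come out named srm as described):

from dataflows import Flow, load, update_resource

Flow(
    load(DUMP_DB, table='srm_services'),
    update_resource('srm', name='srm_services', path='srm_services.csv'),
)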

load from SQL (Postgres) doesn't map JSONB to array or object

psql srm

srm=# \d+ srm_services
                                    Table "public.srm_services"
      Column      | Type  | Collation | Nullable | Default | Storage  | Stats target | Description 
------------------+-------+-----------+----------+---------+----------+--------------+-------------
 __airtable_id    | text  |           |          |         | extended |              | 
 id               | text  |           |          |         | extended |              | 
 source           | text  |           |          |         | extended |              | 
 name             | text  |           |          |         | extended |              | 
 payment_required | text  |           |          |         | extended |              | 
 urls             | text  |           |          |         | extended |              | 
 description      | text  |           |          |         | extended |              | 
 details          | text  |           |          |         | extended |              | 
 situations       | jsonb |           |          |         | extended |              | 
 status           | text  |           |          |         | extended |              | 
 payment_details  | text  |           |          |         | extended |              | 
 responses        | jsonb |           |          |         | extended |              | 
Access method: heap

but

Flow(
    load(DUMP_DB, table="srm_services"),
)

generates the following descriptor:

{'profile': 'data-package', 'resources': [{'path': 'srm.sql', 'profile': 'tabular-data-resource', 'name': 'srm', 'schema': {'fields': [{'name': '__airtable_id', 'type': 'string', 'format': 'default'}, {'name': 'id', 'type': 'string', 'format': 'default'}, {'name': 'source', 'type': 'string', 'format': 'default'}, {'name': 'name', 'type': 'string', 'format': 'default'}, {'name': 'payment_required', 'type': 'string', 'format': 'default'}, {'name': 'urls', 'type': 'string', 'format': 'default'}, {'name': 'description', 'type': 'any', 'format': 'default'}, {'name': 'details', 'type': 'any', 'format': 'default'}, {'name': 'situations', 'type': 'any', 'format': 'default'}, {'name': 'status', 'type': 'string', 'format': 'default'}, {'name': 'payment_details', 'type': 'any', 'format': 'default'}, {'name': 'responses', 'type': 'any', 'format': 'default'}], 'missingValues': ['']}, 'format': 'sql'}]}

I guess without inspecting the data the converter can't know what to do with a JSONB field. So what probably needs to happen is that tableschema_sql should map array columns to Postgres array fields (which would require that all items in the array are of the same type).
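
In the meantime, a workaround sketch overriding the inferred types after loading (this assumes the JSONB values come back as parsed Python lists, and that both columns really are arrays):

from dataflows import Flow, load, set_type

Flow(
    load(DUMP_DB, table='srm_services'),
    set_type('situations', type='array'),
    set_type('responses', type='array'),
)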

dump_to_es fails if no primary key

Because if a primary key is not present in the schema, it sets the default to [] ... but ES doesn't have a concept of a primary key.
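
A workaround sketch using a standard package processor to make sure a primaryKey exists before dump_to_es ('id' here is illustrative; pick a field that is genuinely unique):

def ensure_primary_key(package):
    # Add a primaryKey to any resource schema that lacks one
    for resource in package.pkg.descriptor['resources']:
        resource['schema'].setdefault('primaryKey', ['id'])
    yield package.pkg
    yield from package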

dump a datapackage, then try to read it back: not everything is read

Here is a dumped DP:

location/
location/datapackage.json
location/responses.csv
location/situations.csv
location/services.csv
location/organizations.csv
location/branches.csv
location/locations.csv

But the following flow:

DF.Flow(
    DF.load('location/datapackage.json'),
    DF.dump_to_path('other_location'),
).process()

produces:

other_location/
other_location/organizations.csv
other_location/responses.csv
other_location/situations.csv

!!!!
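
A workaround sketch: name every resource explicitly when loading the dumped package instead of relying on the default selection (this assumes load's resources argument, and that resource names match the CSV file names above):

DF.Flow(
    DF.load('location/datapackage.json',
            resources=['responses', 'situations', 'services',
                       'organizations', 'branches', 'locations']),
    DF.dump_to_path('other_location'),
).process()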

processors

DF.map_rows

DF.dereference

join on array fields (?)
