@chriddyp
Created December 21, 2018 03:50

Data Types in DataTable

Many features in DataTable seem to be related to a notion of data types. That is, certain features may have different behaviour depending on the type of data in the column.

Which Data Types? I imagine that we'll at least have these data types:

  • string
  • number
  • date
  • null/None

Other data types to consider in this discussion include:

  • ip address
  • bool
  • email address
  • float / int
  • time delta
  • time
  • category / enum

However, it's premature to conclude that this is the definitive list without considering all of the features that might be affected by the type of the column.

For example, consider a bool type. This is a common data type in programming languages, but does it warrant any unique behaviour in a dash_table.DataTable? It may be helpful to filter bool-types, but would it behave any differently than a string column type that has just two unique values? Or perhaps it will have its own unique renderers in the future.

So, in order to determine the set of data types in the table, let's first consider the dash_table.DataTable features that may be influenced by data types. With this consideration, we may find a higher level abstraction.

Data Type Features

  • Filtering

    • e.g. the default operators in the filtering UI may be different depending on the type
  • Sorting

    • e.g. numerical data may sort differently than dates, times, or strings
  • Editing the cells

    • e.g. an "enum" data type might be editable via a dropdown
    • e.g. a "numerical" input might reject non-numerical characters (as well as .)
  • Cell renderers

    • e.g. displaying numerical data in a column as a bar chart (each cell is a horizontal solid bar)
  • Cell value formatting

    • e.g. ability to control the number of decimal places displayed for numerical data
  • Cell style

    • e.g. right aligning numerical data, left aligning strings
  • Conditional formatting (via filtering)

    • conditional formatting uses the filtering expression language
  • Cell value validation

    • e.g. typing strings in a numerical column might display a validation error in the UI
    • e.g. a regex validation might be allowed in string columns
  • Cell value coercion, normalization, and parsing

    • e.g. if raw data was supplied as "43.12x" in a numerical column, it might be coerced to 43.12
    • e.g. if "2018/10/16" was supplied as a date, it might be normalized to "2018-10-16" when passed into the callback
    • e.g. if a date column is represented as "2018-10-16", there might be the ability to convert it back to a number (ms since epoch) when passed back to a callback
    • e.g. if the user converts data in a table back into a server-side data structure, then do we only use data types that are available in that language (e.g. no "ip address") or do we leave that up to the developer?
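The coercion and normalization examples above can be sketched in a few lines of Python. This is an illustrative sketch, not the actual dash_table implementation; the helper names (`coerce_number`, `normalize_date`) and the set of accepted date formats are assumptions:

```python
import re
from datetime import datetime

def coerce_number(raw):
    """Hypothetical coercion: pull the leading numeric part out of a raw
    value, e.g. "43.12x" -> 43.12. Returns None if nothing numeric is found."""
    match = re.match(r"[-+]?\d*\.?\d+", str(raw))
    return float(match.group()) if match else None

def normalize_date(raw):
    """Hypothetical normalization: accept a few common date formats and
    normalize to ISO "YYYY-MM-DD". Returns None if no format matches."""
    for fmt in ("%Y-%m-%d", "%Y/%m/%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(str(raw), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None

coerce_number("43.12x")       # -> 43.12
normalize_date("2018/10/16")  # -> "2018-10-16"
```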

One way to think about data types is 'if the class of values has unique behaviour with respect to one of the features listed above, then it may be considered to be its own data type'.

Other Considerations

Beyond how these data types manifest as dash_table features, there are a few other considerations:

  • Compatibility with Python and R data types (numpy, pandas dataframes) and how data is cast to and from callbacks
  • Is there inheritance? e.g. a number type but also an int and a float type?
  • Are the features above strictly 1-1 with the data types? e.g. an "enum" type might not be rendered as a dropdown, but it would have cell value validation.
  • How well defined are these features? How can we design this system so that JS-savvy users and customers can plug-in their own data types?

Data Types and Filtering

In this discussion, I'm considering "filtering v2". These features do not necessarily exist in the current version of the table.

In filtering, we will have several interfaces:

  • Human-readable, per-column filtering expression

    • Each column has its own input box and the user can type natural, intuitive expressions:
      • 50.3
      • < 50.3
      • john
      • = john
      • == john
      • is john
      • = john or = david
      • == john or == david
      • < 50 or > 70
      • > 50 and < 70
  • Human-readable, table query expression

    • A per-table query input box where users can write expressions in a language similar to the per-column expressions, with the exception that they must reference the column names:
      • weight == 50.3
      • weight < 50.3 and name == john
      • weight < 50.3 and name is john
  • Human-readable, table search

    • A dead-simple string contains search across the table. All cells are treated as a string and the expression does the "contains" operator across all stringified cells
  • Computer-readable filtering representation

    • A stable, machine-readable representation of the filtering expression.
    • The higher-level expressions would be translated into this low level representation for consumption via callbacks (and e.g. backend filtering)
    • It might be structured as a JSON expression:
    {
        "lhs": {
            "lhs": "weight",
            "operator": "<",
            "rhs": 50
        },
        "operator": "and",
        "rhs": {
            "lhs": "height",
            "operator": ">=",
            "rhs": 90
        }
    }

    which, in a higher level language, might be written as

    weight < 50 and height >= 90
    
  • Alternative computer-readable "translated" representations

    • There may be multiple representations in order to ease backend consumption:
      • e.g. pd.query (NumExpr)
      • e.g. dplyr filter
      • e.g. sql
    • These representations may have different limitations w.r.t. data types and operators. For example, pd.query doesn't have a date type or a "contains" operator.
    • Callbacks could also write data to this representation
  • A human-friendly user-interface for filtering.

    • Range-slider for numbers and/or dates
    • Searchable checklist for strings
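To make the machine-readable representation concrete, here is a minimal sketch of a recursive evaluator for the JSON expression tree shown above, applied to a single row. The evaluator and its operator table are illustrative assumptions, not part of dash_table:

```python
import operator

# Operator table for the hypothetical evaluator. "contains" and "=" follow
# the semantics described in this document; everything else is standard.
OPS = {
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
    "=": operator.eq, "==": operator.eq,
    "and": lambda a, b: a and b,
    "or": lambda a, b: a or b,
    "contains": lambda a, b: str(b) in str(a),
}

def evaluate(node, row):
    """Recursively evaluate a filter expression tree against one row (a dict).
    Dict nodes are operator applications; string leaves resolve to the row's
    column value (falling back to the literal), other leaves pass through."""
    if isinstance(node, dict):
        lhs = evaluate(node["lhs"], row)
        rhs = evaluate(node["rhs"], row)
        return OPS[node["operator"]](lhs, rhs)
    return row.get(node, node) if isinstance(node, str) else node

expr = {
    "lhs": {"lhs": "weight", "operator": "<", "rhs": 50},
    "operator": "and",
    "rhs": {"lhs": "height", "operator": ">=", "rhs": 90},
}
evaluate(expr, {"weight": 45, "height": 92})  # -> True
```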

So, in which ways would these features be affected by data types?

Data Types Will Affect the "Human-readable, per-column filtering expression"

The way that a filtering expression is interpreted may depend on the data type.

  • string and enum
    • Default operator is still "contains"
    • >, >=, <, <= will filter alphanumerically
    • Regex will be supported
  • number
    • Default operator is =, contains is not available
    • >, >=, <, <= will filter numerically
    • Decimal places will be supported
  • date
    • Default operator is "contains" and will filter all dates at a lower resolution. For example:
      • Searching 2015 will filter all values with the year 2015, regardless of the month, day, or time
      • Searching 2015-10-05 will filter all values on the day 2015-10-05, regardless of the time
      • In this mode, it will not be possible to search just "January" across all years
    • >, >=, <, <= will filter by date, starting at the lowest date (e.g. > 2015 is equivalent to > 2015-01-01 00:00:00)
    • = won't be supported. The user can search a specific date with the default, implicit "contains" operator

Open Question - Data Types Could Affect the Filtering Representations via Casting

What do we do if the data types aren't or can't be defined up-front (consider a table that was created through manual entry or by uploading a CSV)?

In other words, how should filtering behave if the data type isn't defined?

Some options:

  1. We assume the column is a string. < 50 just doesn't work.
  2. We incorporate type casting into the filter expression language. That is, a user could type e.g. num(pressure) < 50
  3. We allow the user to change the type of the column via the column's setting UI (plotly/dash-table#316)
  4. We infer the data type if it's not defined by taking a random sample of the column. This gets written to derived_columns

(I'm in favor of (4))
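Option (4) could look something like the sketch below: sample the column, try the most specific type first, and fall back to string. The sample size, the type names, and the helper itself are assumptions for illustration; only the idea of writing the result to derived_columns comes from the proposal above:

```python
import random

def infer_column_type(values, sample_size=100, seed=0):
    """Sketch of option (4): infer a column's type from a random sample.

    Blank and None cells are ignored; an all-empty sample defaults to
    "string". A real implementation would also try dates, bools, etc.
    """
    rng = random.Random(seed)
    sample = rng.sample(values, min(sample_size, len(values)))
    sample = [v for v in sample if v not in (None, "")]
    if not sample:
        return "string"

    def is_number(v):
        try:
            float(v)
            return True
        except (TypeError, ValueError):
            return False

    if all(is_number(v) for v in sample):
        return "number"
    return "string"

infer_column_type(["1.5", "2", "30"])  # -> "number"
infer_column_type(["1.5", "x", "30"])  # -> "string"
```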

Data Types could affect how the "human-readable" expressions are interpreted

In the human readable expression language, users shouldn't have to wrap their strings in quotes. That is, users should be able to write name contains John or just John rather than name contains "John" or "John".

The conversion between these human-readable expressions and the machine-readable expressions will need to incorporate the data type of the column. If a user types 054, then it could be translated into machine-readable code as either zipcode contains "054" or zipcode equals 054, depending on the type of the zipcode column.
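A minimal sketch of that type-aware translation step, using the JSON representation from earlier. The function name is hypothetical, not the actual translation layer:

```python
def to_machine_readable(column, col_type, text):
    """Hypothetical translation of an unquoted human-readable value into the
    machine-readable representation. The column's type decides between a
    string "contains" and a numeric equality, so "054" keeps its leading
    zero in a string column but becomes 54.0 in a numeric one."""
    if col_type == "number":
        return {"lhs": column, "operator": "=", "rhs": float(text)}
    return {"lhs": column, "operator": "contains", "rhs": str(text)}

to_machine_readable("zipcode", "string", "054")
# -> {"lhs": "zipcode", "operator": "contains", "rhs": "054"}
to_machine_readable("zipcode", "number", "054")
# -> {"lhs": "zipcode", "operator": "=", "rhs": 54.0}
```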

Data Types could affect the "human friendly user interface" for filtering

The filtering user interface may offer several different renderers.

Our filtering UI could render different UIs by default depending on the type (e.g. slider for numbers, dropdown for strings). If no UI is available for the data type (e.g. if we have a data type plug-in system and the user didn't write a filtering UI), then it could fall back to the simple text input.

While this UI may be inferred by its data type, it should be possible to override this behaviour. (I suspect that many features will be this way).
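The type-to-UI mapping with an override and a text-input fallback is a simple lookup. The registry contents and names here are assumptions for illustration:

```python
# Hypothetical registry mapping data types to default filtering UIs.
FILTER_UIS = {
    "number": "range-slider",
    "date": "range-slider",
    "string": "searchable-checklist",
}

def filter_ui_for(col_type, override=None):
    """Pick the filtering UI for a column: an explicit per-column override
    wins, then the type's default, then the plain text input fallback."""
    return override or FILTER_UIS.get(col_type, "text-input")

filter_ui_for("number")                         # -> "range-slider"
filter_ui_for("ip-address")                     # -> "text-input" (no UI registered)
filter_ui_for("number", override="text-input")  # -> "text-input"
```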

Data Types could affect how the alternative computer-readable queries are generated

When generating the computer-readable queries, the type of the column will need to be considered, as certain dash_table.DataTable data types might not be supported in the computer-readable language. For example, dash_table.DataTable might support a date type (for features like displaying a date picker in a cell) that the target language does not.

It's conceivable that the front-end could provide certain features for data types that aren't supported by the back-end query language.

For example, pd.query doesn't natively support dates. It's unclear how this type of filter should be translated.
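As a sketch of one such translation, the JSON tree could be serialized into a pandas `DataFrame.query` string, rejecting operators without a direct equivalent. This serializer is an illustrative assumption, not an existing dash_table feature:

```python
def to_pandas_query(node):
    """Hypothetical serializer from the JSON filter tree into a string for
    pandas `DataFrame.query`. Operators with no query equivalent (e.g.
    "contains") raise, surfacing exactly the limitation discussed above."""
    if isinstance(node, dict):
        op = node["operator"]
        if op in ("and", "or"):
            return f"({to_pandas_query(node['lhs'])}) {op} ({to_pandas_query(node['rhs'])})"
        if op in ("<", "<=", ">", ">=", "==", "!="):
            return f"{to_pandas_query(node['lhs'])} {op} {to_pandas_query(node['rhs'])}"
        if op == "=":
            return f"{to_pandas_query(node['lhs'])} == {to_pandas_query(node['rhs'])}"
        raise ValueError(f"no pandas.query equivalent for operator {op!r}")
    # Leaves: numbers become literals, strings are assumed to be column names.
    return repr(node) if isinstance(node, (int, float)) else str(node)

expr = {
    "lhs": {"lhs": "weight", "operator": "<", "rhs": 50},
    "operator": "and",
    "rhs": {"lhs": "height", "operator": ">=", "rhs": 90},
}
to_pandas_query(expr)  # -> "(weight < 50) and (height >= 90)"
```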


How data types could affect sorting

Accurate sorting needs to account for the data type. If we considered all columns to be strings, then certain sorting operations would be wrong.

This is clear for numbers:

>>> 2 > 10
False
>>> '2' > '10'
True

Other considerations:

  1. dates

    For dates formatted year-month-day, string comparison will provide the correct answer. However, if we accept dates in alternative formats (e.g. YYYY/DD/MM), then sorting will need to do an internal conversion.

  2. empty strings

    '' < 'a' but users don't always want to see empty cells when they sort from least to greatest. Should empty strings be a null type? Or is this just a special option when filtering strings?
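Both considerations (numeric vs string comparison, and empty cells sorting last) can be handled with a type-aware sort key. This is a sketch under assumed type names, not the table's actual sorting code:

```python
def sort_key(value, col_type):
    """Hypothetical type-aware sort key. The first tuple element pushes
    empty/None cells to the end; the second compares numerically for
    number columns and lexically otherwise."""
    if value in (None, ""):
        return (1, 0)
    if col_type == "number":
        return (0, float(value))
    return (0, str(value))

# Numeric sort keeps "2" before "10" and pushes the empty cell last:
sorted(["2", "10", ""], key=lambda v: sort_key(v, "number"))  # -> ["2", "10", ""]
# A naive string sort would put "10" before "2":
sorted(["2", "10"], key=lambda v: sort_key(v, "string"))      # -> ["10", "2"]
```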


[... work in progress ...]


I think what I'm converging to is that data types influence the default behaviour of all of these features. Of course, everything will be overridable, but we want the default behaviour to be useful, thoughtful, and good for data exploration.

So, when we consider data types (in particular, which data types we want to support) we need to consider what the default behaviour is for each feature and whether the feature would benefit from a particular dedicated data-type.

@tantrev

tantrev commented Jan 8, 2019

It'd be really nice if chemical structures of small molecules (such as those demoed in the drug discovery dash app) and images could be added to this list.
