@chriddyp
Created December 21, 2018 03:50

Data Types in DataTable

Many features in DataTable seem to be related to a notion of data types. That is, certain features may have different behaviour depending on the type of data in the column.

Which Data Types? I imagine that we'll at least have these data types:

  • string
  • number
  • date
  • null/None

Other data types to consider in this discussion include:

  • ip address
  • bool
  • email address
  • float / int
  • time delta
  • time
  • category / enum

However, it's premature to conclude that this is the definitive list without considering all of the features that might be affected by the type of the column.

For example, consider a bool type. This is a common data type in programming languages, but does it warrant any unique behaviour in a dash_table.DataTable? It may be helpful to filter bool-types, but would it behave any differently than a string column type that has just two unique values? Or perhaps it will have its own unique renderers in the future.

So, in order to determine the set of data types in the table, let's first consider the dash_table.DataTable features that may be influenced by data types. With this consideration, we may find a higher level abstraction.

Data Type Features

  • Filtering

    • e.g. the default operators in the filtering UI may be different depending on the type
  • Sorting

    • e.g. numerical data may sort differently than dates, times, or strings
  • Editing the cells

    • e.g. an "enum" data type might be editable via a dropdown
    • e.g. a "numerical" input might reject non-numerical characters (as well as .)
  • Cell renderers

    • e.g. displaying numerical data in a column as a bar chart (each cell is a horizontal solid bar)
  • Cell value formatting

    • e.g. ability to control the number of decimal places displayed for numerical data
  • Cell style

    • e.g. right aligning numerical data, left aligning strings
  • Conditional formatting (via filtering)

    • conditional formatting uses the filtering expression language
  • Cell value validation

    • e.g. typing strings in a numerical column might display a validation error in the UI
    • e.g. a regex validation might be allowed in string columns
  • Cell value coercion, normalization, and parsing

    • e.g. if raw data was supplied as "43.12x" in a numerical column, it might be coerced to 43.12
    • e.g. if "2018/10/16" was supplied as a date, it might be normalized to "2018-10-16" when passed into the callback
    • e.g. if a date column is represented as "2018-10-16", there might be the ability to convert it back to a number (ms since epoch) when passed back to a callback
    • e.g. if the user converts data in a table back into a server-side data structure, then do we only use data types that are available in that language (e.g. no "ip address") or do we leave that up to the developer?
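The coercion and normalization examples above can be sketched in a few lines of Python. This is an illustrative sketch, not the actual dash_table implementation; the helper names (`coerce_number`, `normalize_date`) and the set of accepted date formats are assumptions:

```python
import re
from datetime import datetime

def coerce_number(raw):
    """Hypothetical coercion: pull the leading numeric part out of a raw
    value, e.g. "43.12x" -> 43.12. Returns None if nothing numeric is found."""
    match = re.match(r"[-+]?\d*\.?\d+", str(raw))
    return float(match.group()) if match else None

def normalize_date(raw):
    """Hypothetical normalization: accept a few common date formats and
    normalize to ISO "YYYY-MM-DD". Returns None if no format matches."""
    for fmt in ("%Y-%m-%d", "%Y/%m/%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(str(raw), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None

coerce_number("43.12x")       # -> 43.12
normalize_date("2018/10/16")  # -> "2018-10-16"
```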

One way to think about data types is 'if the class of values has unique behaviour with respect to one of the features listed above, then it may be considered to be its own data type'.

Other Considerations

Beyond how these data types manifest as dash_table features, there are a few other considerations:

  • Compatibility with Python and R data types (numpy, pandas dataframes) and how data is cast to and from callbacks
  • Is there inheritance? e.g. a number type but also an int and a float type?
  • Are the features above strictly 1-1 with the data types? e.g. an "enum" type might not be rendered as a dropdown, but it would have cell value validation.
  • How well defined are these features? How can we design this system so that JS-savvy users and customers can plug-in their own data types?

Data Types and Filtering

In this discussion, I'm considering "filtering v2". These features do not necessarily exist in the current version of the table.

In filtering, we will have several interfaces:

  • Human-readable, per-column filtering expression

    • Each column has its own input box and the user can type natural, intuitive expressions:
      • 50.3
      • < 50.3
      • john
      • = john
      • == john
      • is john
      • = john or = david
      • == john or == david
      • < 50 or > 70
      • > 50 and < 70
  • Human-readable, table query expression

    • A per-table query input box where users can write expressions in a language similar to the per-column expressions, with the exception that they must reference the column names:
      • weight == 50.3
      • weight < 50.3 and name == john
      • weight < 50.3 and name is john
  • Human-readable, table search

    • A dead-simple string contains search across the table. All cells are treated as a string and the expression does the "contains" operator across all stringified cells
  • Computer-readable filtering representation

    • A stable, machine-readable representation of the filtering expression.
    • The higher-level expressions would be translated into this low level representation for consumption via callbacks (and e.g. backend filtering)
    • It might be structured as a JSON expression:
    {
        "lhs": {
            "lhs": "weight",
            "operator": "<",
            "rhs": 50
        },
        "operator": "and",
        "rhs": {
            "lhs": "height",
            "operator": ">=",
            "rhs": 90
        }
    }

    which, in a higher level language, might be written as

    weight < 50 and height >= 90
    
  • Alternative computer-readable "translated" representations

    • There may be multiple representations in order to ease backend consumption:
      • e.g. pd.query (NumExpr)
      • e.g. dplyr filter
      • e.g. sql
    • These representations may have different limitations w.r.t. data types and operators. For example, pd.query doesn't have a date type or a "contains" operator.
    • Callbacks could also write data to this representation
  • A human-friendly user-interface for filtering.

    • Range-slider for numbers and/or dates
    • Searchable checklist for strings
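To make the machine-readable representation concrete, here is a minimal sketch of a recursive evaluator for the JSON expression tree shown above, applied to a single row. The evaluator and its operator table are illustrative assumptions, not part of dash_table:

```python
import operator

# Operator table for the hypothetical evaluator. "contains" and "=" follow
# the semantics described in this document; everything else is standard.
OPS = {
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
    "=": operator.eq, "==": operator.eq,
    "and": lambda a, b: a and b,
    "or": lambda a, b: a or b,
    "contains": lambda a, b: str(b) in str(a),
}

def evaluate(node, row):
    """Recursively evaluate a filter expression tree against one row (a dict).
    Dict nodes are operator applications; string leaves resolve to the row's
    column value (falling back to the literal), other leaves pass through."""
    if isinstance(node, dict):
        lhs = evaluate(node["lhs"], row)
        rhs = evaluate(node["rhs"], row)
        return OPS[node["operator"]](lhs, rhs)
    return row.get(node, node) if isinstance(node, str) else node

expr = {
    "lhs": {"lhs": "weight", "operator": "<", "rhs": 50},
    "operator": "and",
    "rhs": {"lhs": "height", "operator": ">=", "rhs": 90},
}
evaluate(expr, {"weight": 45, "height": 92})  # -> True
```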

So, in which ways would these features be affected by data types?

Data Types Will Affect the "Human-readable, per-column filtering expression"

The way that a filtering expression is interpreted may depend on the data type.

  • string and enum
    • Default operator is still "contains"
    • >, >=, <, <= will filter alphanumerically
    • Regex will be supported
  • number
    • Default operator is =, contains is not available
    • >, >=, <, <= will filter numerically
    • Decimal places will be supported
  • date
    • Default operator is "contains" and will filter all dates at a lower resolution. For example:
      • Searching 2015 will filter all values with the year 2015, regardless of the month, day, or time
      • Searching 2015-10-05 will filter all values on the day 2015-10-05, regardless of the time
      • In this mode, it will not be possible to search just "January" across all years
    • >, >=, <, <= will filter by date, starting at the lowest date (e.g. > 2015 is equivalent to > 2015-01-01 00:00:00)
    • = won't be supported. The user can search a specific date with the default, implicit "contains" operator

Open Question - Data Types Could Affect the Filtering Representations via Casting

What do we do if the data types aren't or can't be defined up-front (consider a table that was created through manual entry or by uploading a CSV)?

In other words, how should filtering behave if the data type isn't defined?

Some options:

  1. We assume the column is a string. < 50 just doesn't work.
  2. We incorporate type casting into the filter expression language. That is, a user could type e.g. num(pressure) < 50
  3. We allow the user to change the type of the column via the column's setting UI (plotly/dash-table#316)
  4. We infer the data type if it's not defined by taking a random sample of the column. This gets written to derived_columns

(I'm in favor of (4))
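Option (4) could look something like the sketch below: sample the column, try the most specific type first, and fall back to string. The sample size, the type names, and the helper itself are assumptions for illustration; only the idea of writing the result to derived_columns comes from the proposal above:

```python
import random

def infer_column_type(values, sample_size=100, seed=0):
    """Sketch of option (4): infer a column's type from a random sample.

    Blank and None cells are ignored; an all-empty sample defaults to
    "string". A real implementation would also try dates, bools, etc.
    """
    rng = random.Random(seed)
    sample = rng.sample(values, min(sample_size, len(values)))
    sample = [v for v in sample if v not in (None, "")]
    if not sample:
        return "string"

    def is_number(v):
        try:
            float(v)
            return True
        except (TypeError, ValueError):
            return False

    if all(is_number(v) for v in sample):
        return "number"
    return "string"

infer_column_type(["1.5", "2", "30"])  # -> "number"
infer_column_type(["1.5", "x", "30"])  # -> "string"
```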

Data Types could affect how the "human-readable" expressions are interpreted

In the human readable expression language, users shouldn't have to wrap their strings in quotes. That is, users should be able to write name contains John or just John rather than name contains "John" or "John".

The conversion between these human-readable expressions and the machine-readable expressions will need to incorporate the data type of the column. If a user types 054, then it could be translated into machine-readable code as either zipcode contains "054" or zipcode equals 054, depending on the type of the zipcode column.
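A minimal sketch of that type-aware translation step, using the JSON representation from earlier. The function name is hypothetical, not the actual translation layer:

```python
def to_machine_readable(column, col_type, text):
    """Hypothetical translation of an unquoted human-readable value into the
    machine-readable representation. The column's type decides between a
    string "contains" and a numeric equality, so "054" keeps its leading
    zero in a string column but becomes 54.0 in a numeric one."""
    if col_type == "number":
        return {"lhs": column, "operator": "=", "rhs": float(text)}
    return {"lhs": column, "operator": "contains", "rhs": str(text)}

to_machine_readable("zipcode", "string", "054")
# -> {"lhs": "zipcode", "operator": "contains", "rhs": "054"}
to_machine_readable("zipcode", "number", "054")
# -> {"lhs": "zipcode", "operator": "=", "rhs": 54.0}
```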

Data Types could affect the "human friendly user interface" for filtering

The filtering user interface may offer several different renderers.

Our filtering UI could render different UIs by default depending on the type (e.g. slider for numbers, dropdown for strings). If no UI is available for the data type (e.g. if we have a data type plug-in system and the user didn't write a filtering UI), then it could fall back to the simple text input.

While this UI may be inferred by its data type, it should be possible to override this behaviour. (I suspect that many features will be this way).
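The type-to-UI mapping with an override and a text-input fallback is a simple lookup. The registry contents and names here are assumptions for illustration:

```python
# Hypothetical registry mapping data types to default filtering UIs.
FILTER_UIS = {
    "number": "range-slider",
    "date": "range-slider",
    "string": "searchable-checklist",
}

def filter_ui_for(col_type, override=None):
    """Pick the filtering UI for a column: an explicit per-column override
    wins, then the type's default, then the plain text input fallback."""
    return override or FILTER_UIS.get(col_type, "text-input")

filter_ui_for("number")                         # -> "range-slider"
filter_ui_for("ip-address")                     # -> "text-input" (no UI registered)
filter_ui_for("number", override="text-input")  # -> "text-input"
```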

Data Types could affect how the alternative computer-readable queries are generated

When generating the computer-readable queries, the type of the column will need to be considered, as certain dash_table.DataTable data types might not be supported in the computer-readable language. For example, dash_table.DataTable might support a date type (for features like displaying a date picker in a cell) that the target language does not.

It's conceivable that the front-end could provide certain features for data types that aren't supported by the back-end query language.

For example, pd.query doesn't natively support dates. It's unclear how this type of filter should be translated.
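As a sketch of one such translation, the JSON tree could be serialized into a pandas `DataFrame.query` string, rejecting operators without a direct equivalent. This serializer is an illustrative assumption, not an existing dash_table feature:

```python
def to_pandas_query(node):
    """Hypothetical serializer from the JSON filter tree into a string for
    pandas `DataFrame.query`. Operators with no query equivalent (e.g.
    "contains") raise, surfacing exactly the limitation discussed above."""
    if isinstance(node, dict):
        op = node["operator"]
        if op in ("and", "or"):
            return f"({to_pandas_query(node['lhs'])}) {op} ({to_pandas_query(node['rhs'])})"
        if op in ("<", "<=", ">", ">=", "==", "!="):
            return f"{to_pandas_query(node['lhs'])} {op} {to_pandas_query(node['rhs'])}"
        if op == "=":
            return f"{to_pandas_query(node['lhs'])} == {to_pandas_query(node['rhs'])}"
        raise ValueError(f"no pandas.query equivalent for operator {op!r}")
    # Leaves: numbers become literals, strings are assumed to be column names.
    return repr(node) if isinstance(node, (int, float)) else str(node)

expr = {
    "lhs": {"lhs": "weight", "operator": "<", "rhs": 50},
    "operator": "and",
    "rhs": {"lhs": "height", "operator": ">=", "rhs": 90},
}
to_pandas_query(expr)  # -> "(weight < 50) and (height >= 90)"
```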


How data types could affect sorting

Accurate sorting needs to account for the data type. If we considered all columns to be strings, then certain sorting operations would be wrong.

This is clear for numbers:

>>> 2 > 10
False
>>> '2' > '10'
True

Other considerations:

  1. dates

    For dates formatted year-month-day, string comparison will provide the correct answer. However, if we accept dates in alternative formats (e.g. YYYY/DD/MM), then sorting will need to do an internal conversion.

  2. empty strings

    '' < 'a' but users don't always want to see empty cells when they sort from least to greatest. Should empty strings be a null type? Or is this just a special option when filtering strings?
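Both considerations (numeric vs string comparison, and empty cells sorting last) can be handled with a type-aware sort key. This is a sketch under assumed type names, not the table's actual sorting code:

```python
def sort_key(value, col_type):
    """Hypothetical type-aware sort key. The first tuple element pushes
    empty/None cells to the end; the second compares numerically for
    number columns and lexically otherwise."""
    if value in (None, ""):
        return (1, 0)
    if col_type == "number":
        return (0, float(value))
    return (0, str(value))

# Numeric sort keeps "2" before "10" and pushes the empty cell last:
sorted(["2", "10", ""], key=lambda v: sort_key(v, "number"))  # -> ["2", "10", ""]
# A naive string sort would put "10" before "2":
sorted(["2", "10"], key=lambda v: sort_key(v, "string"))      # -> ["10", "2"]
```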


[... work in progress ...]


I think what I'm converging to is that data types influence the default behaviour of all of these features. Of course, everything will be overridable, but we want the default behaviour to be useful, thoughtful, and good for data exploration.

So, when we consider data types (in particular, which data types we want to support) we need to consider what the default behaviour is for each feature and whether the feature would benefit from a particular dedicated data-type.

@tantrev

tantrev commented Jan 8, 2019

It'd be really nice if chemical structures of small molecules (such as those demoed in the drug discovery dash app) and images could be added to this list.
