Many features in `DataTable` seem to be related to a notion of data types. That is, certain features may have different behaviour depending on the type of data in the column.
Which Data Types?
I imagine that we'll at least have these data types:
- string
- number
- date
- null / None
Other data types to consider in this discussion include:
- ip address
- bool
- email address
- float / int
- time delta
- time
- category / enum
However, it's premature to conclude that this is the definitive list without considering all of the features that might be affected by the type of the column.
For example, consider a `bool` type. This is a common data type in programming languages, but does it warrant any unique behaviour in a `dash_table.DataTable`? It may be helpful to filter bool-type columns, but would that behave any differently than a `string` column that has just two unique values? Or perhaps it gets its own unique renderers in the future.
So, in order to determine the set of data types in the table, let's first consider the `dash_table.DataTable` features that may be influenced by data types. With this consideration, we may find a higher-level abstraction.
Data Type Features
- **Filtering**
  - e.g. the default operators in the filtering UI may be different depending on the type
- **Sorting**
  - e.g. numerical data may sort differently than dates, times, or strings
- **Editing the cells**
  - e.g. an "enum" data type might be editable via a dropdown
  - e.g. a "numerical" input might reject non-numerical characters (as well as `.`)
- **Cell renderers**
  - e.g. displaying numerical data in a column as a bar chart (each cell is a horizontal solid bar)
- **Cell value formatting**
  - e.g. ability to control the number of decimal places displayed for numerical data
- **Cell style**
  - e.g. right-aligning numerical data, left-aligning strings
- **Conditional formatting (via filtering)**
  - conditional formatting uses the filtering expression language
- **Cell value validation**
  - e.g. typing strings into a numerical column might display a validation error in the UI
  - e.g. a regex validation might be allowed in string columns
- **Cell value coercion, normalization, and parsing**
  - e.g. if raw data was supplied as "43.12x" in a numerical column it might be coerced to `43.12`
  - e.g. if "2018/10/16" was supplied as a date, it might be coerced to "2018-10-16" when passed into the callback
  - e.g. if a date column is represented as "2018-10-16", there might be the ability to convert it back to a number (ms since epoch) when passed back to a callback
  - e.g. if the user converts data in a table back into a server-side data structure, do we only use data types that are available in that language (e.g. no "ip address"), or do we leave that up to the developer?
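To make the "43.12x" coercion example concrete, here is a minimal sketch; the `coerce_number` helper is hypothetical and not part of any DataTable API:

```python
import re

# Hypothetical sketch: coerce a raw cell like "43.12x" in a numerical column
# by parsing the leading numeric portion and discarding trailing characters.
def coerce_number(raw):
    match = re.match(r"[+-]?\d+(\.\d+)?", raw.strip())
    return float(match.group()) if match else None

print(coerce_number("43.12x"))  # 43.12
print(coerce_number("oak"))     # None
```

A real implementation would also need to decide whether coercion failures become `null` values or validation errors surfaced in the UI.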
One way to think about data types is 'if the class of values has unique behaviour with respect to one of the features listed above, then it may be considered to be its own data type'.
Other Considerations
Beyond how these data types manifest as `dash_table` features, there are a few other considerations:
- Compatibility with Python or R data types (numpy, pandas data frames) in how data is cast to and from callbacks
- Is there inheritance? e.g. a `number` type but also an `int` and a `float` type?
- Are the features above strictly 1-1 with the data types? e.g. an "enum" type might not be rendered as a dropdown, but it would have cell value validation.
- How well defined are these features? How can we design this system so that JS-savvy users and customers can plug in their own data types?
In this discussion, I'm considering "filtering v2". These features do not necessarily exist in the current version of the table.
In filtering, we will have several interfaces:
- **Human-readable, per-column filtering expression**
  - Each column has its own input box and the user can type natural, intuitive expressions:
    - `50.3`
    - `< 50.3`
    - `john`
    - `= john`
    - `== john`
    - `is john`
    - `= john or = david`
    - `== john or == david`
    - `< 50 or > 70`
    - `> 50 and < 70`
- **Human-readable, table query expression**
  - A per-table query input box where users can write expressions in a language similar to the per-column expressions, with the exception that they reference the column names:
    - `weight == 50.3`
    - `weight < 50.3 and name == john`
    - `weight < 50.3 and name is john`
- **Human-readable, table search**
  - A dead-simple string-contains search across the table. All cells are treated as strings and the "contains" operator is applied across all stringified cells.
- **Computer-readable filtering representation**
  - A stable, machine-readable representation of the filtering expression.
  - The higher-level expressions would be translated into this low-level representation for consumption via callbacks (and e.g. backend filtering).
  - It might be structured as a JSON expression: `{ "lhs": { "lhs": "weight", "operator": "<", "rhs": 50 }, "operator": "and", "rhs": { "lhs": "height", "operator": ">=", "rhs": 90 } }`, which, in a higher-level language, might be written as `weight < 50 and height >= 90`.
- **Alternative computer-readable "translated" representations**
  - There may be multiple representations in order to ease backend consumption:
    - e.g. `pd.query` (NumExpr)
    - e.g. dplyr `filter`
    - e.g. SQL
  - These representations may have different limitations w.r.t. data types and operators. For example, `pd.query` doesn't have a date type nor a "contains" operator.
  - Callbacks could also write data to this representation.
- **A human-friendly user interface for filtering**
  - Range slider for numbers and/or dates
  - Searchable checklist for strings
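To make the machine-readable representation concrete, here is a minimal sketch that walks an expression tree of the JSON shape above and renders it back into the higher-level query language. The `render_query` helper is hypothetical, not an existing API:

```python
# Hypothetical sketch: each non-leaf node has the shape
# {"lhs": ..., "operator": ..., "rhs": ...} as in the JSON example above.
def render_query(node):
    """Recursively render a filter-expression node as a query string."""
    if not isinstance(node, dict):
        # Leaf: a column name or a literal value.
        return str(node)
    return "{} {} {}".format(
        render_query(node["lhs"]), node["operator"], render_query(node["rhs"])
    )

expr = {
    "lhs": {"lhs": "weight", "operator": "<", "rhs": 50},
    "operator": "and",
    "rhs": {"lhs": "height", "operator": ">=", "rhs": 90},
}
print(render_query(expr))  # weight < 50 and height >= 90
```

The same tree walk, with different renderers per node, is one way the alternative "translated" representations (`pd.query`, dplyr, SQL) could be generated from a single canonical form.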
So, in which ways would these features be affected by data types?
Data Types Will Affect the "Human-readable, per-column filtering expression"
The way that a filtering expression is interpreted may depend on the data type.
`string` and `enum`
- Default operator is still "contains"
- `>`, `>=`, `<`, `<=` will filter alphanumerically
- Regex will be supported

`number`
- Default operator is `=`; `contains` is not available
- `>`, `>=`, `<`, `<=` will filter numerically
- Decimal places will be supported

`date`
- Default operator is "contains" and will filter all dates at a lower resolution. For example:
  - Searching `2015` will filter all values with the year `2015`, regardless of the month, day, or time
  - Searching `2015-10-05` will filter all values on the day `2015-10-05`, regardless of the time
  - In this mode, it will not be possible to search just "January" across all years
- `>`, `>=`, `<`, `<=` will filter by date, starting at the lowest date (e.g. `> 2015` is equivalent to `> 2015-01-01 00:00:00`)
- `=` won't be supported. The user can search a specific date with the default and implicit "contains" operator
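The date type's "contains at a lower resolution" behaviour amounts to a prefix match on ISO-formatted values. A minimal sketch, assuming a hypothetical `date_contains` helper and ISO-formatted cells:

```python
# Hypothetical sketch: a partial-date query like "2015" or "2015-10-05"
# matches any ISO-formatted value that starts with that prefix, which is
# what "filter at a lower resolution" means for this representation.
def date_contains(query, value):
    """True if `value` falls inside the (partial) date `query`."""
    return value.startswith(query)

rows = ["2015-10-05 08:00:00", "2015-11-01 09:30:00", "2016-01-02 10:00:00"]

print([v for v in rows if date_contains("2015", v)])       # first two rows
print([v for v in rows if date_contains("2015-10-05", v)])  # first row only
```

Note that this prefix approach is exactly why searching "January" across all years is impossible in this mode: the month is never the leading component of the string.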
Open Question - Data Types Could Affect the Filtering Representations via Casting
What do we do if the data types aren't or can't be defined up front (consider a table that was created through manual entry or by uploading a CSV)?
In other words, how should filtering behave if the data type isn't defined?
Some options:
1. We assume the column is a string. `< 50` just doesn't work.
2. We incorporate type casting into the filter expression language. That is, a user could type e.g. `num(pressure) < 50`.
3. We allow the user to change the type of the column via the column's settings UI (plotly/dash-table#316).
4. We infer the data type if it's not defined by taking a random sample of the column. This gets written to `derived_columns`.

(I'm in favor of (4))
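Option (4) could be sketched roughly as follows. The `infer_column_type` helper, the sample size, and the `"number"`/`"string"` labels are all assumptions for illustration, not proposed API:

```python
import random

# Hypothetical sketch: infer an undeclared column's type by sampling its
# values and checking whether they all parse as numbers.
def infer_column_type(values, sample_size=10):
    sample = random.sample(values, min(sample_size, len(values)))
    try:
        for v in sample:
            float(v)
        return "number"
    except (TypeError, ValueError):
        return "string"

print(infer_column_type(["43.12", "7", "0.5"]))     # number
print(infer_column_type(["43.12x", "oak", "elm"]))  # string
```

A real version would need more candidate types (date, bool, ...), a policy for mixed samples, and a way to record the inference in `derived_columns` so callbacks can see what was decided.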
Data Types could affect how the "human-readable" expressions are interpreted
In the human-readable expression language, users shouldn't have to wrap their strings in quotes. That is, users should be able to write `name contains John` or just `John` rather than `name contains "John"` or `"John"`.
The conversion between these human-readable expressions and the machine-readable expressions will need to incorporate the data type of the column. If a user types `054`, then it could get translated into machine-readable code as either `zipcode contains "054"` or `zipcode equals 054`, depending on the type of the `zipcode` column.
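A minimal sketch of this type-dependent translation step, using the `zipcode` example; the `to_machine_expr` helper is hypothetical:

```python
# Hypothetical sketch: the same bare user token becomes a different
# machine-readable clause depending on the column's declared type.
def to_machine_expr(column, token, column_type):
    if column_type == "string":
        # String columns default to "contains" with an implicit quoted literal.
        return '{} contains "{}"'.format(column, token)
    # Numeric columns default to equality against the parsed number.
    return "{} equals {}".format(column, float(token))

print(to_machine_expr("zipcode", "054", "string"))  # zipcode contains "054"
print(to_machine_expr("zipcode", "054", "number"))  # zipcode equals 54.0
```

The second line also illustrates a subtle hazard of getting the type wrong: parsing `054` as a number silently drops the leading zero.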
Data Types could affect the "human friendly user interface" for filtering
The filtering user interface may have several different renderers.
Our filtering UI could render different controls by default depending on the type (e.g. slider for numbers, dropdown for strings). If no UI is available for the data type (e.g. if we have a data-type plug-in system and the user didn't write a filtering UI), then it could fall back to the simple text input.
While this UI may be inferred by its data type, it should be possible to override this behaviour. (I suspect that many features will be this way).
Data Types could affect how the alternative computer-readable queries are generated
When generating the computer-readable queries, the type of the column will need to be considered, as certain `dash_table.DataTable` data types might not be supported in the computer-readable language. For example, `dash_table.DataTable` might support a `date` type (for features like displaying a date picker in a cell).
It's conceivable that the front-end could provide certain features for data types that aren't supported by the back-end query language.
For example, `pd.query` doesn't natively support dates, so it's unclear how this type of filter should be translated.
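One possible workaround, sketched below under assumptions (the helper name and the `_ms` column convention are hypothetical), is to rewrite date comparisons as numeric comparisons on milliseconds since the epoch, which any numeric query backend can evaluate:

```python
from datetime import datetime, timezone

# Hypothetical sketch: convert an ISO date string to epoch milliseconds so a
# date filter can be rewritten as a plain numeric filter for the backend.
def date_to_epoch_ms(s):
    dt = datetime.strptime(s, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp() * 1000)

# "birth_date > 2015-01-01" could become a numeric filter on a derived column:
print("birth_date_ms > {}".format(date_to_epoch_ms("2015-01-01")))
```

This pushes the date-parsing burden onto the translation layer rather than the backend language, at the cost of maintaining a numeric shadow of each date column.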
How data types could affect sorting
Accurate sorting needs to account for the data type. If we considered all columns to be strings, then certain sorting operations would be wrong.
This is clear for numbers:
```python
>>> 2 > 10
False
>>> '2' > '10'
True
```
Other considerations:
- **dates**: For dates formatted year-month-day, string comparison will provide the correct answer. However, if we accept dates in alternative formats (e.g. `YYYY/DD/MM`), then sorting will need to do an internal conversion.
- **empty strings**: `'' < 'a'`, but users don't always want to see empty cells first when they sort from least to greatest. Should empty strings be a `null` type? Or is this just a special option when filtering strings?
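The empty-string consideration can be handled with a sort key that sinks empty cells to the bottom of an ascending sort. This is a sketch of one possible default, not current DataTable behaviour:

```python
# Hypothetical sketch: tuples sort element-wise and True > False, so keying
# on (is_empty, value) pushes empty cells after all non-empty ones.
def sort_key(value):
    return (value == "", value)

cells = ["banana", "", "apple", ""]
print(sorted(cells, key=sort_key))  # ['apple', 'banana', '', '']
```

The same trick generalizes to a `null` type: treat "missing" as a separate first tuple element rather than as an ordinary value that happens to compare low.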
[... work in progress ...]
I think what I'm converging to is that data types influence the default behaviour of all of these features. Of course, everything will be overridable, but we want the default behaviour to be useful, thoughtful, and good for data exploration.
So, when we consider data types (in particular, which data types we want to support) we need to consider what the default behaviour is for each feature and whether the feature would benefit from a particular dedicated data-type.
It'd be really nice if chemical structures of small molecules (such as those demoed in the drug discovery dash app) and images could be added to this list.