SQLite DB that is annotated by a `datapackage.json`

SQLite DB with metadata in data package

As far as I know the docs are not super clear on this, but if your resources have the correct path and SQL dialect (new in v5), the workflow you described should work with a call to Resource.to_pandas.

    {
      "name": "date",
      "path": "sqlite:///database.db",
      "dialect": {
        "sql": {
          "table": "date"
        }
      },
      "schema": {...}
    }
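
If you only need a single table, a descriptor like the one above can also be loaded on its own. Here is a minimal sketch; `Resource.from_descriptor` is the generic loader and the two schema fields are placeholders, not the real schema:

```python
from frictionless import Resource

# Illustrative descriptor mirroring the snippet above; replace the schema
# fields with the ones that actually describe your table.
descriptor = {
    "name": "date",
    "path": "sqlite:///database.db",
    "dialect": {"sql": {"table": "date"}},
    "schema": {
        "fields": [
            {"name": "id", "type": "integer"},
            {"name": "date_val", "type": "date"},
        ]
    },
}

df = Resource.from_descriptor(descriptor).to_pandas()
```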

Here is an example. First reading the data package

import pandas as pd
from frictionless import Package

package = Package('datapackage.json')

and now checking that the types were correctly converted from the original schema

sales = package.get_resource('sales').to_pandas()
# unit_price and total_price are stored as text in sqlite with European separators (e.g. 1.049,99)
sales.dtypes
date_id          int64
product_id       int64
quantity         int64
unit_price     float64
total_price    float64
dtype: object
date = package.get_resource('date').to_pandas()
date.dtypes
date_val     datetime64[ns]
year                  int64
yearmonth            object
weekend                bool
dtype: object
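
The price columns come out as `float64` even though SQLite stores them as text, because the schema declares them as `number` fields with `decimalChar` "," and `groupChar` ".". Conceptually the conversion does something like the following (a hand-rolled sketch for illustration, not the library's internal code):

```python
def parse_localized_number(text: str, decimal_char: str = ",", group_char: str = ".") -> float:
    # Mimic a Table Schema `number` field with decimalChar="," and groupChar=".":
    # drop the grouping separator, then normalize the decimal mark.
    return float(text.replace(group_char, "").replace(decimal_char, "."))

parse_localized_number("1.049,99")  # -> 1049.99
```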

You can also do validation. Here a regex-pattern error and a foreign-key error are raised:

package.validate()
{'valid': False,
 'stats': {'tasks': 3, 'errors': 2, 'warnings': 0, 'seconds': 0.019},
 'warnings': [],
 'errors': [],
 'tasks': [{'name': 'date',
            'type': 'table',
            'valid': True,
            'place': 'sqlite:///database.db',
            'labels': ['id', 'date_val', 'year', 'yearmonth', 'weekend'],
            'stats': {'errors': 0,
                      'warnings': 0,
                      'seconds': 0.004,
                      'fields': 5,
                      'rows': 5},
            'warnings': [],
            'errors': []},
           {'name': 'product',
            'type': 'table',
            'valid': False,
            'place': 'sqlite:///database.db',
            'labels': ['id', 'code', 'name', 'description', 'category'],
            'stats': {'errors': 1,
                      'warnings': 0,
                      'seconds': 0.004,
                      'fields': 5,
                      'rows': 3},
            'warnings': [],
            'errors': [{'type': 'constraint-error',
                        'title': 'Constraint Error',
                        'description': 'A field value does not conform to a '
                                       'constraint.',
                        'message': 'The cell "sku002" in row at position "4" '
                                   'and field "code" at position "2" does not '
                                   'conform to a constraint: constraint '
                                   '"pattern" is "SKU\\d{3}"',
                        'tags': ['#table', '#row', '#cell'],
                        'note': 'constraint "pattern" is "SKU\\d{3}"',
                        'cells': ['3',
                                  'sku002',
                                  'Product C',
                                  'Yet another great product',
                                  'Category 2'],
                        'rowNumber': 4,
                        'cell': 'sku002',
                        'fieldName': 'code',
                        'fieldNumber': 2}]},
           {'name': 'sales',
            'type': 'table',
            'valid': False,
            'place': 'sqlite:///database.db',
            'labels': ['sale_id',
                       'date_id',
                       'product_id',
                       'quantity',
                       'unit_price',
                       'total_price'],
            'stats': {'errors': 1,
                      'warnings': 0,
                      'seconds': 0.01,
                      'fields': 6,
                      'rows': 5},
            'warnings': [],
            'errors': [{'type': 'foreign-key',
                        'title': 'ForeignKey Error',
                        'description': 'Values in the foreign key fields '
                                       'should exist in the reference table',
                        'message': 'Row at position "6" violates the foreign '
                                   'key: for "product_id": values "4" not '
                                   'found in the lookup table "product" as '
                                   '"id"',
                        'tags': ['#table', '#row'],
                        'note': 'for "product_id": values "4" not found in the '
                                'lookup table "product" as "id"',
                        'cells': ['5', '5', '4', '3', '14,99', '44,97'],
                        'rowNumber': 6,
                        'fieldNames': ['product_id'],
                        'fieldCells': ['4'],
                        'referenceName': 'product',
                        'referenceFieldNames': ['id']}]}]}
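
The report is also usable programmatically, e.g. to fail a pipeline or log individual problems. A minimal sketch using only the attributes visible in the output above (`valid`, `tasks`, `errors`):

```python
report = package.validate()
if not report.valid:
    for task in report.tasks:
        for error in task.errors:
            print(f"{task.name}: [{error.type}] {error.message}")
```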

Source

---
title: SQLite DB with metadata in data package
format: gfm
jupyter: python3
---
As far as I know the docs are not super clear on this, but if your resources have the correct `path` and SQL `dialect` (new in v5), the workflow you described should work with a call to [`Resource.to_pandas`](https://framework.frictionlessdata.io/docs/formats/pandas.html).
```json
{
  "name": "date",
  "path": "sqlite:///database.db",
  "dialect": {
    "sql": {
      "table": "date"
    }
  },
  "schema": {...}
}
```
Here is an example. First reading the data package
```{python}
import pandas as pd
from frictionless import Package
package = Package('datapackage.json')
```
and now checking that the types were correctly converted from the [original schema](create.sql)
```{python}
sales = package.get_resource('sales').to_pandas()
# unit_price and total_price are stored as text in sqlite with European separators (e.g. 1.049,99)
sales.dtypes
```
```{python}
date = package.get_resource('date').to_pandas()
date.dtypes
```
You can also do validation. Here a regex-pattern error and a foreign-key error are raised:
```{python}
package.validate()
```
-- [Source](https://frictionlessdata.slack.com/archives/C0369HZ2SLT/p1680887543670399)
-- create.sql
-- Create the date dimension table
CREATE TABLE date (
    id INTEGER PRIMARY KEY,
    date_val DATE NOT NULL,
    year INTEGER NOT NULL,
    yearmonth TEXT NOT NULL,
    weekend TEXT NOT NULL
);

-- Create the product dimension table
CREATE TABLE product (
    id INTEGER PRIMARY KEY,
    code TEXT NOT NULL,
    name TEXT NOT NULL,
    description TEXT,
    category TEXT NOT NULL
);

-- Create the sales fact table
CREATE TABLE sales (
    sale_id INTEGER PRIMARY KEY,
    date_id INTEGER NOT NULL,
    product_id INTEGER NOT NULL,
    quantity INTEGER NOT NULL,
    unit_price TEXT NOT NULL,
    total_price TEXT NOT NULL,
    FOREIGN KEY (date_id) REFERENCES date(id),
    FOREIGN KEY (product_id) REFERENCES product(id)
);
-- Insert some sample data into the date table
INSERT INTO date (id, date_val, year, yearmonth, weekend)
VALUES
    (1, '2023-04-01', 2023, '2023-04', '1'),
    (2, '2023-04-02', 2023, '2023-04', '1'),
    (3, '2023-05-03', 2023, '2023-05', '0'),
    (4, '2023-05-04', 2023, '2023-05', '0'),
    (5, '2023-05-05', 2023, '2023-05', '0');

-- Insert some sample data into the product table
INSERT INTO product (id, code, name, description, category)
VALUES
    (1, 'SKU001', 'Product A', 'A great product', 'Category 1'),
    (2, 'SKU001', 'Product B', 'Another great product', 'Category 1'),
    (3, 'sku002', 'Product C', 'Yet another great product', 'Category 2');

-- Insert some sample data into the sales table
INSERT INTO sales (sale_id, date_id, product_id, quantity, unit_price, total_price)
VALUES
    (1, 1, 1, 10, '9,99', '99,90'),
    (2, 2, 2, 5, '19,99', '99,95'),
    (3, 3, 1, 2, '12,50', '25,00'),
    (4, 4, 3, 1, '1.049,99', '1.049,99'),
    (5, 5, 4, 3, '14,99', '44,97');
{
  "name": "sqlite",
  "title": "Data package with resource data stored in sqlite",
  "profile": "data-package",
  "resources": [
    {
      "name": "date",
      "path": "sqlite:///database.db",
      "dialect": {"sql": {"table": "date"}},
      "schema": {
        "fields": [
          {"name": "id", "type": "integer"},
          {"name": "date_val", "type": "date", "constraints": {"required": true}},
          {"name": "year", "type": "integer", "constraints": {"required": true}},
          {"name": "yearmonth", "type": "yearmonth", "constraints": {"required": true}},
          {
            "name": "weekend",
            "type": "boolean",
            "constraints": {"required": true},
            "trueValues": ["1"],
            "falseValues": ["0"]
          }
        ],
        "primaryKey": ["id"]
      }
    },
    {
      "name": "product",
      "path": "sqlite:///database.db",
      "dialect": {"sql": {"table": "product"}},
      "schema": {
        "fields": [
          {"name": "id", "type": "integer"},
          {"name": "code", "type": "string", "constraints": {"required": true, "pattern": "SKU\\d{3}"}},
          {"name": "name", "type": "string", "constraints": {"required": true}},
          {"name": "description", "type": "string"},
          {"name": "category", "type": "string", "constraints": {"required": true}}
        ],
        "primaryKey": ["id"]
      }
    },
    {
      "name": "sales",
      "type": "table",
      "path": "sqlite:///database.db",
      "dialect": {"sql": {"table": "sales"}},
      "schema": {
        "fields": [
          {"name": "sale_id", "type": "integer"},
          {"name": "date_id", "type": "integer", "constraints": {"required": true}},
          {"name": "product_id", "type": "integer", "constraints": {"required": true}},
          {"name": "quantity", "type": "integer", "constraints": {"required": true}},
          {
            "name": "unit_price",
            "type": "number",
            "constraints": {"required": true},
            "decimalChar": ",",
            "groupChar": "."
          },
          {
            "name": "total_price",
            "type": "number",
            "constraints": {"required": true},
            "decimalChar": ",",
            "groupChar": "."
          }
        ],
        "primaryKey": ["sale_id"],
        "foreignKeys": [
          {"fields": ["product_id"], "reference": {"resource": "product", "fields": ["id"]}},
          {"fields": ["date_id"], "reference": {"resource": "date", "fields": ["id"]}}
        ]
      }
    }
  ]
}
# Python dependencies (pinned versions)
anyio==3.6.2
appnope==0.1.3
argon2-cffi==21.3.0
argon2-cffi-bindings==21.2.0
arrow==1.2.3
asttokens==2.2.1
attrs==22.2.0
backcall==0.2.0
beautifulsoup4==4.12.2
bleach==6.0.0
certifi==2022.12.7
cffi==1.15.1
chardet==5.1.0
charset-normalizer==3.1.0
click==8.1.3
colorama==0.4.6
comm==0.1.3
commonmark==0.9.1
debugpy==1.6.7
decorator==5.1.1
defusedxml==0.7.1
executing==1.2.0
fastjsonschema==2.16.3
fqdn==1.5.1
frictionless==5.10.5
humanize==4.6.0
idna==3.4
ipykernel==6.22.0
ipython==8.12.0
ipython-genutils==0.2.0
ipywidgets==8.0.6
isodate==0.6.1
isoduration==20.11.0
jedi==0.18.2
Jinja2==3.1.2
jsonpointer==2.3
jsonschema==4.17.3
jupyter==1.0.0
jupyter-console==6.6.3
jupyter-events==0.6.3
jupyter_client==8.1.0
jupyter_core==5.3.0
jupyter_server==2.5.0
jupyter_server_terminals==0.4.4
jupyterlab-pygments==0.2.2
jupyterlab-widgets==3.0.7
marko==1.3.0
MarkupSafe==2.1.2
matplotlib-inline==0.1.6
mistune==2.0.5
nbclassic==0.5.5
nbclient==0.7.3
nbconvert==7.3.0
nbformat==5.8.0
nest-asyncio==1.5.6
notebook==6.5.4
notebook_shim==0.2.2
numpy==1.24.2
packaging==23.0
pandas==2.0.0
pandocfilters==1.5.0
parso==0.8.3
petl==1.7.12
pexpect==4.8.0
pickleshare==0.7.5
platformdirs==3.2.0
prometheus-client==0.16.0
prompt-toolkit==3.0.38
psutil==5.9.4
ptyprocess==0.7.0
pure-eval==0.2.2
pycparser==2.21
pydantic==1.10.7
Pygments==2.14.0
pyrsistent==0.19.3
python-dateutil==2.8.2
python-json-logger==2.0.7
python-slugify==8.0.1
pytz==2023.3
PyYAML==6.0
pyzmq==25.0.2
qtconsole==5.4.2
QtPy==2.3.1
requests==2.28.2
rfc3339-validator==0.1.4
rfc3986==2.0.0
rfc3986-validator==0.1.1
rich==12.6.0
Send2Trash==1.8.0
shellingham==1.5.0.post1
simpleeval==0.9.13
six==1.16.0
sniffio==1.3.0
soupsieve==2.4
SQLAlchemy==2.0.9
stack-data==0.6.2
stringcase==1.2.0
tabulate==0.9.0
terminado==0.17.1
text-unidecode==1.3
tinycss2==1.2.1
tornado==6.2
traitlets==5.9.0
typer==0.7.0
typing_extensions==4.5.0
tzdata==2023.3
uri-template==1.2.0
urllib3==1.26.15
validators==0.20.0
wcwidth==0.2.6
webcolors==1.13
webencodings==0.5.1
websocket-client==1.5.1
widgetsnbextension==4.0.7