@cboettig
Created October 6, 2022 04:13
---
title: "Demo"
format: html
editor: visual
jupyter: python3
---
```{bash}
python -m venv venv
source venv/bin/activate
pip install aqueduct-ml scikit-learn
```
```{python}
import pandas as pd
import aqueduct
from aqueduct import op, check, metric
import os
```
```{python}
# use your own address & key
address = "https://aqueduct.thelio.carlboettiger.info/"
api_key = os.environ['AQUEDUCT_TOKEN']
client = aqueduct.Client(api_key, address)
```
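If `AQUEDUCT_TOKEN` is not set, `os.environ['AQUEDUCT_TOKEN']` fails with a bare `KeyError`. A minimal sketch of a friendlier lookup follows; the helper name `require_env` is hypothetical and not part of Aqueduct.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        # Missing or empty: fail early with a message that names the variable.
        raise RuntimeError(
            f"Set the {name} environment variable before running this notebook."
        )
    return value
```

With this in place, the chunk above could use `api_key = require_env("AQUEDUCT_TOKEN")` instead of indexing `os.environ` directly.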
```{python}
demodb = client.integration('aqueduct_demo')
wines = demodb.sql("select * from wine;")
# peek at the data
wines.get().head()
```
```{python}
# The @op decorator here allows Aqueduct to run this function as
# a part of the Aqueduct workflow. It tells Aqueduct that when
# we execute this function, we're defining a step in the workflow.
# While the results can be retrieved immediately, nothing is
# published until we call `publish_flow()` below.
@op()
def fix_residual_sugar(df):
    '''
    This function takes in a DataFrame representing wines data and cleans
    the DataFrame by replacing any missing values in the `residual_sugar`
    column with the values that would be predicted based on the other columns.
    Internally, this function uses the sklearn LinearRegression model to
    predict what the values of the `residual_sugar` column should be when
    they are missing.
    '''
    from sklearn.linear_model import LinearRegression

    # Convert residual_sugar back to numeric values with missing values as NaN
    df['residual_sugar'] = pd.to_numeric(df['residual_sugar'], errors='coerce')
    print("missing residual sugar values:", df['residual_sugar'].isna().sum())

    # Fit a LinearRegression model on the other float columns, which is everything
    # but quality, residual_sugar, and id.
    imputer = LinearRegression()
    other_cols = df.columns[df.dtypes == 'float'].difference(['quality', 'residual_sugar', 'id'])
    imputer.fit(df.dropna()[other_cols], df.dropna()['residual_sugar'])

    # Use our newly-trained imputer to predict the missing values of `residual_sugar`
    # and replace the NaN values with our new predicted values.
    predicted_sugar = imputer.predict(df[df['residual_sugar'].isna()][other_cols])
    df.loc[df['residual_sugar'].isna(), 'residual_sugar'] = predicted_sugar
    return df
```
Running the following chunk raises an error when this Quarto notebook is executed from RStudio.
```{python}
wines_cleaned = fix_residual_sugar(wines)
wines_cleaned.get().head()
```