Created October 6, 2022 04:13
---
title: "Demo"
format: html
editor: visual
jupyter: python3
---
```{bash}
python -m venv venv
source venv/bin/activate
# The PyPI package is named scikit-learn; the `sklearn` name on PyPI is a deprecated alias.
pip install aqueduct-ml scikit-learn
```
```{python}
import os

import pandas as pd
import aqueduct
from aqueduct import op, check, metric
```
```{python}
# use your own address & key
address = "https://aqueduct.thelio.carlboettiger.info/"
api_key = os.environ['AQUEDUCT_TOKEN']
client = aqueduct.Client(api_key, address)
```
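If `AQUEDUCT_TOKEN` is not set, `os.environ['AQUEDUCT_TOKEN']` fails with a bare `KeyError`. A minimal sketch of a more explicit lookup follows; `require_env` is a hypothetical helper introduced here for illustration, not part of the Aqueduct SDK.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        # Hypothetical message; adjust the wording to your own setup.
        raise RuntimeError(
            f"Set the {name} environment variable before running this notebook."
        )
    return value


# Usage in the cell above would look like:
# api_key = require_env("AQUEDUCT_TOKEN")
```

The helper treats an empty or whitespace-only value the same as a missing one, which catches a token that was exported but left blank.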
```{python}
demodb = client.integration('aqueduct_demo')
wines = demodb.sql("select * from wine;")
# peek at the data
wines.get().head()
```
```{python}
# The @op decorator allows Aqueduct to run this function as
# part of an Aqueduct workflow. It tells Aqueduct that when
# we execute this function, we're defining a step in the workflow.
# While the results can be retrieved immediately, nothing is
# published until we call `publish_flow()` below.
@op()
def fix_residual_sugar(df):
    '''
    This function takes in a DataFrame representing wines data and cleans
    the DataFrame by replacing any missing values in the `residual_sugar`
    column with the values that would be predicted based on the other columns.
    Internally, this function uses the sklearn LinearRegression model to
    predict what the values of the `residual_sugar` column should be when
    they are missing.
    '''
    from sklearn.linear_model import LinearRegression

    # Convert residual_sugar back to numeric values with missing values as NaN
    df['residual_sugar'] = pd.to_numeric(df['residual_sugar'], errors='coerce')
    print("missing residual sugar values:", df['residual_sugar'].isna().sum())

    # Fit a LinearRegression model on the other float columns, which is
    # everything but quality, residual_sugar, and id.
    imputer = LinearRegression()
    other_cols = df.columns[df.dtypes == 'float'].difference(['quality', 'residual_sugar', 'id'])
    imputer.fit(df.dropna()[other_cols], df.dropna()['residual_sugar'])

    # Use our newly-trained imputer to predict the missing values of `residual_sugar`
    # and replace the NaN values with our new predicted values.
    predicted_sugar = imputer.predict(df[df['residual_sugar'].isna()][other_cols])
    df.loc[df['residual_sugar'].isna(), 'residual_sugar'] = predicted_sugar
    return df
```
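To check the imputation logic without an Aqueduct server, the same regression-based fill can be sketched on a small synthetic DataFrame. Everything here is an assumption for illustration: `impute_with_regression` is a hypothetical stand-alone variant of `fix_residual_sugar` (no `@op` decorator), and the `toy` data is made up so that `residual_sugar` is exactly `2 * alcohol + 1`.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression


def impute_with_regression(df: pd.DataFrame, target: str, exclude=()) -> pd.DataFrame:
    """Fill missing values in `target` with predictions from the other float columns."""
    df = df.copy()  # leave the caller's DataFrame untouched
    # Non-numeric entries become NaN, as in fix_residual_sugar
    df[target] = pd.to_numeric(df[target], errors="coerce")
    missing = df[target].isna()
    if not missing.any():
        return df  # nothing to impute; avoids calling predict on zero rows
    features = df.columns[df.dtypes == "float"].difference([target, *exclude])
    # Train only on rows where the target and all features are present
    complete = df.dropna(subset=[target, *features])
    model = LinearRegression().fit(complete[features], complete[target])
    # Assumes the feature columns are complete in the rows being filled
    df.loc[missing, target] = model.predict(df.loc[missing, features])
    return df


# Synthetic data (not the wine table): one unparseable residual_sugar entry
toy = pd.DataFrame({
    "alcohol": [9.0, 10.0, 11.0, 12.0, 13.0],
    "residual_sugar": ["19.0", "21.0", "bad", "25.0", "27.0"],
})
toy_filled = impute_with_regression(toy, "residual_sugar")
print(toy_filled)
```

Because the toy relationship is exactly linear, the filled value for the row with `alcohol == 11.0` should land very close to 23, which makes this a quick sanity check before running the decorated version against the real table.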
The following cell raises an error when this Quarto notebook is run from RStudio.
```{python}
wines_cleaned = fix_residual_sugar(wines)
wines_cleaned.get().head()
```
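The error message itself is not recorded here. One thing worth ruling out (an assumption, not a diagnosis) is that RStudio runs the Python cells with a different interpreter than the `venv` created in the bash cell, since activating an environment in a shell cell does not carry over to later cells. A minimal sketch of a check, where `describe_python_env` is a hypothetical helper using only the standard library:

```python
import importlib.util
import sys


def describe_python_env(modules=("aqueduct", "sklearn", "pandas")):
    """Report which interpreter is running and which modules it can import."""
    report = {"executable": sys.executable, "prefix": sys.prefix}
    for name in modules:
        # find_spec returns None when a top-level module cannot be located
        report[name] = importlib.util.find_spec(name) is not None
    return report


for key, value in describe_python_env().items():
    print(f"{key}: {value}")
```

If `executable` does not point inside `venv/`, or any module reports `False`, the cell is not using the environment that the bash cell installed into.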