cboettig/aqueduct.qmd

## aqueduct.qmd
---
title: "Demo"
format: html
editor: visual
jupyter: python3
---

```{bash}
python -m venv venv
source venv/bin/activate
pip install aqueduct-ml scikit-learn
```


```{python}
import pandas as pd
import aqueduct
from aqueduct import op, check, metric
import os
```

```{python}
# use your own address & key
address = "https://aqueduct.thelio.carlboettiger.info/"
api_key = os.environ['AQUEDUCT_TOKEN']

client = aqueduct.Client(api_key, address)
```
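`os.environ['AQUEDUCT_TOKEN']` raises a bare `KeyError` if the variable is not visible to the Python process that renders the notebook. A minimal sketch of a more explicit check follows; `read_token` is a hypothetical helper introduced here for illustration, not part of the Aqueduct SDK.

```python
import os


def read_token(env=os.environ, name="AQUEDUCT_TOKEN"):
    """Return the API key from `env`, or fail with a readable message.

    `read_token` is a hypothetical helper; only the variable name
    AQUEDUCT_TOKEN comes from the notebook above.
    """
    token = env.get(name, "").strip()
    if not token:
        raise RuntimeError(
            f"{name} is not set (or is empty) in this Python session; "
            "export it before rendering the notebook."
        )
    return token
```

With this in place, `api_key = read_token()` could stand in for the direct `os.environ` lookup, so a missing token is reported before any call to the Aqueduct server is attempted.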


```{python}
demodb = client.integration('aqueduct_demo')
wines = demodb.sql("select * from wine;")

# peek at the data
wines.get().head()
```

```{python}
# The @op decorator here allows Aqueduct to run this function as
# a part of the Aqueduct workflow. It tells Aqueduct that when
# we execute this function, we're defining a step in the workflow.
# While the results can be retrieved immediately, nothing is
# published until `publish_flow()` is called (not shown in this demo).
@op()
def fix_residual_sugar(df):
    '''
    This function takes in a DataFrame representing wines data and cleans
    the DataFrame by replacing any missing values in the `residual_sugar`
    column with the values that would be predicted based on the other columns.

    Internally, this function uses the sklearn LinearRegression model to
    predict what the values of the `residual_sugar` column should be when
    they are missing.
    '''
    from sklearn.linear_model import LinearRegression

    # Convert residual_sugar back to numeric values with missing values as NaN
    df['residual_sugar'] = pd.to_numeric(df['residual_sugar'], errors='coerce')
    print("missing residual sugar values:", df['residual_sugar'].isna().sum())

    # Fit a LinearRegression model on the remaining float columns, i.e. everything
    # except quality, residual_sugar, and id.
    imputer = LinearRegression()
    other_cols = df.columns[df.dtypes == 'float'].difference(['quality', 'residual_sugar', 'id'])
    complete = df.dropna()
    imputer.fit(complete[other_cols], complete['residual_sugar'])

    # Use the newly trained imputer to predict the missing values of `residual_sugar`
    # and replace the NaN values with the predictions. Skip this when nothing is
    # missing, since `predict` rejects an empty input.
    missing = df['residual_sugar'].isna()
    if missing.any():
        df.loc[missing, 'residual_sugar'] = imputer.predict(df.loc[missing, other_cols])
    return df


```

The following cell raises an error when this Quarto notebook is run from RStudio.

```{python}
wines_cleaned = fix_residual_sugar(wines)
wines_cleaned.get().head()
```
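One way to narrow down whether a failure comes from the imputation logic itself or from the Aqueduct/RStudio environment is to run the same steps on a small local DataFrame, with no `@op` decorator and no server connection. The sketch below assumes only pandas, NumPy, and scikit-learn; the function name `impute_residual_sugar` and the toy columns `alcohol` and `ph` are made up for illustration and are not taken from the real `wine` table.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def impute_residual_sugar(df: pd.DataFrame) -> pd.DataFrame:
    """Plain-pandas version of the imputation step (no Aqueduct involved)."""
    df = df.copy()  # leave the caller's DataFrame untouched
    df["residual_sugar"] = pd.to_numeric(df["residual_sugar"], errors="coerce")
    missing = df["residual_sugar"].isna()
    if not missing.any():
        return df  # nothing to impute

    # Predictors: every float column except the target and the excluded ones.
    float_cols = df.select_dtypes(include="float").columns
    predictors = float_cols.difference(["quality", "residual_sugar", "id"])

    # Train only on rows where the predictors and the target are all present.
    complete = df.dropna(subset=[*predictors, "residual_sugar"])
    model = LinearRegression().fit(complete[predictors], complete["residual_sugar"])
    df.loc[missing, "residual_sugar"] = model.predict(df.loc[missing, predictors])
    return df


# Synthetic stand-in for the wine table; all values here are invented.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "id": range(20),
    "alcohol": rng.uniform(8.0, 14.0, size=20),
    "ph": rng.uniform(2.8, 3.8, size=20),
    "quality": rng.integers(3, 9, size=20),
})
# Store the target as text with two unparseable entries, mimicking dirty input.
toy["residual_sugar"] = (2.0 * toy["alcohol"] - toy["ph"]).astype(str)
toy.loc[[3, 7], "residual_sugar"] = "unknown"

cleaned = impute_residual_sugar(toy)
print(cleaned["residual_sugar"].isna().sum())
```

If this runs cleanly in the same Python environment that RStudio uses, the imputation code is unlikely to be the source of the error, and attention can shift to the Aqueduct client, the `@op` wrapper, or the rendering setup.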