Last active April 5, 2020 10:37
omega|ml tutorial notebook
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<img style='float:left' src='https://omegaml.omegaml.io/static/logo.a3fc30c8aa01.jpg'>\n", | |
"<br><br><br><br>\n", | |
"\n", | |
"**Work with data & machine learning models**\n", | |
"\n", | |
"* easily store data in a high-performance data cluster (MongoDB)\n", | |
"* store your fitted or unfitted scikit-learn models\n", | |
"* run predictions on the compute cluster directly from stored data\n", | |
"* store & use remote data (ftp, http, s3)\n", | |
"\n", | |
"**Easily use compute resources in the cluster**\n", | |
"\n", | |
"* fit models in the compute cluster, in parallel\n", | |
"* perform grid search\n", | |
"* all asynchronously\n", | |
"\n", | |
"**Share data, notebooks**\n", | |
"\n", | |
"* write, store & share notebooks directly online, no setup required\n", | |
"* run jobs on a regular schedule\n", | |
"* share notebooks and data across users\n", | |
"\n", | |
"**Automatic REST API for any client**\n", | |
"\n", | |
"* datasets\n", | |
"* models\n", | |
"* jobs (reports)\n", | |
"* arbitrary custom scripts (python)\n", | |
"\n", | |
"**On-Premise or On-Cloud Custom Installation**\n", | |
"\n", | |
"* customizable backends (e.g. Spark, R, SAS)\n", | |
"* custom runtimes (e.g. dask, Spark)\n", | |
"* arbitrary data storage extensions API\n", | |
"* custom data types extensions API\n", | |
"* native-Python data streaming API (like Spark Streaming, much simpler)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"!pip install pandas_datareader tqdm" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import omegaml as om \n", | |
"om.setup()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# list datasets stored in cluster\n", | |
"om.datasets.list()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# list models stored in cluster\n", | |
"om.models.list()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# list jobs & results stored in cluster\n", | |
"om.jobs.list()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Enterprise Edition\n", | |
"# list custom scripts stored in cluster\n", | |
"# om.scripts.list()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# store any python data\n", | |
"om.datasets.put(['any data'], 'mydata')\n", | |
"om.datasets.get('mydata')" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# store numpy arrays and pandas dataframes\n", | |
"import pandas as pd\n", | |
"from sklearn.datasets import load_iris\n", | |
"X, y = load_iris(return_X_y=True)\n", | |
"data = pd.DataFrame(X)\n", | |
"data['y'] = y\n", | |
"data.head()\n", | |
"om.datasets.put(data, 'iris')\n", | |
"om.datasets.get('iris').head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Enterprise Edition\n", | |
"# store remote datasets as a reference (no copy)\n", | |
"# om.datasets.put('http://data.cityofnewyork.us/api/views/kku6-nxdu/rows.csv?accessType=DOWNLOAD', 'demographics')\n", | |
"# om.datasets.get('demographics')" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# store financial time series including indices\n", | |
"%matplotlib inline\n", | |
"import pandas as pd\n", | |
"import pandas_datareader.data as web\n", | |
"import datetime\n", | |
"\n", | |
"start = datetime.datetime(2017, 1, 1)\n", | |
"end = datetime.datetime(2018, 1, 31)\n", | |
"prices = web.DataReader(\"GOOGL\", 'yahoo', start, end)\n", | |
"prices.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# get data back in their original format\n", | |
"om.datasets.put(prices, 'google', append=False)\n", | |
"prices = om.datasets.get('google')\n", | |
"prices.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# filter data in the database -- notice the nice syntax\n", | |
"%time om.datasets.get('google', Close__gte=900, Close__lte=920)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# build a large dataset (about 1M rows) by appending the prices repeatedly\n", | |
"from tqdm import tqdm\n", | |
"N = 1e6\n", | |
"ldf_google_large = om.datasets.getl('google-large')\n", | |
"dupl = int((N - len(ldf_google_large or [])) / len(prices) + 1)\n", | |
"for i in tqdm(range(dupl)):\n", | |
" om.datasets.put(prices, 'google-large')\n", | |
"print(\"google-large has {} records\".format(len(om.datasets.getl('google-large'))))" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# filter & aggregate data locally (loads the whole large dataset first)\n", | |
"def getdata():\n", | |
" data = om.datasets.get('google-large')\n", | |
" return data[(data.Close >= 900) & (data.Close <= 920)].mean() \n", | |
"\n", | |
"%time getdata()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# filter and aggregate by database - 2-3x faster\n", | |
"%time om.datasets.getl('google-large', Close__gte=900, Close__lte=920).mean().iloc[0]" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# index based access by loading data first\n", | |
"def getdata():\n", | |
" dfx = om.datasets.get('google-large')\n", | |
" return dfx.loc[pd.to_datetime('2017-01-03')]\n", | |
"%time getdata()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# index-based access directly in database\n", | |
"dfx = om.datasets.getl('google-large')\n", | |
"%time dfx.loc[pd.to_datetime('2017-01-03')].value" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# train models locally\n", | |
"%matplotlib inline\n", | |
"import pandas as pd \n", | |
"\n", | |
"from sklearn.svm import SVR\n", | |
"\n", | |
"prices = om.datasets.get('google')\n", | |
"X = prices[['High', 'Low']].rolling(5).mean().dropna()\n", | |
"y = prices.iloc[4:]['Close']\n", | |
"print(X.shape, y.shape)\n", | |
"\n", | |
"train_loc = X.shape[0] // 2\n", | |
"\n", | |
"model = SVR(kernel='linear', tol=0.1)\n", | |
"model.fit(X.iloc[0:train_loc], y.iloc[0:train_loc])\n", | |
"\n", | |
"r2 = model.score(X, y)\n", | |
"yhat = pd.DataFrame({'yhat': model.predict(X[train_loc:])})\n", | |
"yhat.index = X.index[train_loc:]\n", | |
"\n", | |
"ax = prices.iloc[train_loc:]['Close'].plot()\n", | |
"yhat.plot(color='r', ax=ax)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# predict remotely\n", | |
"\n", | |
"# store models and new data\n", | |
"om.models.put(model, 'google-predict')\n", | |
"om.datasets.put(X[train_loc:], 'google-rolling', append=False)\n", | |
"\n", | |
"# then predict remotely\n", | |
"pred = om.runtime.model('google-predict').predict('google-rolling[High,Low]').get()\n", | |
"\n", | |
"# show results\n", | |
"pred = pd.DataFrame({'yhat': pred}, index=range(len(pred)))\n", | |
"actual = om.datasets.get('google[Close]').iloc[train_loc:]\n", | |
"pred.index = actual.index[:len(pred)]\n", | |
"ax = actual.plot()\n", | |
"pred.plot(color='r', ax=ax)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# we can also train remotely\n", | |
"import matplotlib.pyplot as plt\n", | |
"from mpl_toolkits.mplot3d import Axes3D\n", | |
"import numpy as np\n", | |
"\n", | |
"iris = load_iris()\n", | |
"X = iris.data\n", | |
"y = iris.target\n", | |
"\n", | |
"df = pd.DataFrame(X)\n", | |
"df['y'] = y\n", | |
"\n", | |
"from sklearn.cluster import KMeans\n", | |
"model = KMeans(n_clusters=8)\n", | |
"\n", | |
"# fit & predict remote\n", | |
"om.models.drop('iris-model', True)\n", | |
"om.models.put(model, 'iris-model')\n", | |
"om.runtime.model('iris-model').fit(X, y).get()\n", | |
"\n", | |
"# get back remote fitted model and show results\n", | |
"model = om.models.get('iris-model')\n", | |
"labels = model.labels_\n", | |
"\n", | |
"fig = plt.figure(figsize=(4, 3))\n", | |
"ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)\n", | |
"ax.scatter(X[:, 3], X[:, 0], X[:, 2],\n", | |
" c=labels.astype(float), edgecolor='k')\n", | |
"fig.show()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# we store lots of information on models\n", | |
"om.models.metadata('iris-model').attributes" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# perform gridsearch on cluster\n", | |
"om.datasets.put(df, 'iris', append=False)\n", | |
"params = {\n", | |
" 'n_clusters': range(1,8),\n", | |
" }\n", | |
"om.runtime.model('iris-model').gridsearch('iris[^y]', 'iris[y]', parameters=params).get()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# see what gridsearch results we have\n", | |
"gsresult = om.models.metadata('iris-model')['attributes']['gridsearch']" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# look at gridsearch results\n", | |
"gsModel = gsresult[0]['gsModel']\n", | |
"gs = om.models.get(gsModel)\n", | |
"gs.best_estimator_" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# use the model REST API \n", | |
"import requests\n", | |
"from omegaml.client.auth import OmegaRestApiAuth\n", | |
"import omegaml as om \n", | |
"\n", | |
"# -- setup authentication and API URL\n", | |
"auth = OmegaRestApiAuth.make_from(om)\n", | |
"url = getattr(om.defaults, 'OMEGA_RESTAPI_URL', 'http://localhost:5000')\n", | |
"modelname = 'iris-model'\n", | |
"dataset = 'iris'\n", | |
"# -- prepare dataset\n", | |
"om.datasets.put(pd.DataFrame(X), 'iris', append=False)\n", | |
"# -- call REST API\n", | |
"print('Requesting from', url)\n", | |
"resp = requests.put('{url}/api/v1/model/{modelname}/predict?datax={dataset}'.format(**locals()), auth=auth)\n", | |
"resp.json()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# use the datasets REST API \n", | |
"import requests\n", | |
"\n", | |
"print('Requesting from', url)\n", | |
"resp = requests.get('{url}/api/v1/dataset/{dataset}'.format(**locals()), auth=auth)\n", | |
"resp.json()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Enterprise Edition\n", | |
"# deploy lambda-style arbitrary algorithms\n", | |
"# om.scripts.put('pkg:///app/omegapkg/demo/helloworld/', 'helloworld')" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Enterprise Edition\n", | |
"# run lambdas\n", | |
"# from datetime import datetime\n", | |
"# dtnow = datetime.now().isoformat()\n", | |
"# om.runtime.script('helloworld').run(foo=dtnow).get()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Enterprise Edition\n", | |
"# use REST API to run lambdas\n", | |
"# import requests\n", | |
"# from omegacommon.auth import OmegaRestApiAuth\n", | |
"# auth = OmegaRestApiAuth(**auth_config)\n", | |
"# resp = requests.post('https://omegaml.omegaml.io/api/v1/script/helloworld/run/', \n", | |
"# params=dict(foo=dtnow), auth=auth)\n", | |
"# resp.json()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# run jobs (python notebooks) online\n", | |
"if 'scheduled-report.ipynb' in om.jobs.list():\n", | |
" om.runtime.job('scheduled-report').run()\n", | |
" om.jobs.list()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Enterprise Edition\n", | |
"\n", | |
"### per-user online dashboard \n", | |
"http://omegaml.omegaml.io/dashboard\n", | |
" \n", | |
"### per-user online notebook automated setup\n", | |
"http://omjobs.omegaml.io/" | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.6.8" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 2 | |
} |
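
A note on the filter syntax used in the notebook above: calls such as `om.datasets.get('google', Close__gte=900, Close__lte=920)` pass keyword arguments of the form `<column>__<operator>`, and the notebook describes the filtering as happening in the database (MongoDB). The sketch below only illustrates how such keywords could be translated into a MongoDB-style query document. It is not omega|ml's implementation; the helper name `kwargs_to_mongo_query` and the `OPERATORS` table are hypothetical and cover just a handful of comparison operators.

```python
# Hypothetical helper for illustration only; NOT part of omega|ml.
# It maps Django-style keyword filters onto MongoDB comparison operators.
OPERATORS = {
    "gte": "$gte",
    "lte": "$lte",
    "gt": "$gt",
    "lt": "$lt",
    "ne": "$ne",
}


def kwargs_to_mongo_query(**filters):
    """Translate e.g. Close__gte=900 into {'Close': {'$gte': 900}}."""
    query = {}
    for key, value in filters.items():
        column, sep, op = key.partition("__")
        if sep and op in OPERATORS:
            mongo_op = OPERATORS[op]
        else:
            # No recognized operator suffix: treat the whole key as a
            # column name and compare for equality.
            column, mongo_op = key, "$eq"
        # Conditions on the same column are merged into one sub-document.
        query.setdefault(column, {})[mongo_op] = value
    return query


print(kwargs_to_mongo_query(Close__gte=900, Close__lte=920))
```

Because both conditions end up in one query document, a database can apply them before any rows are sent to the client, which is the idea behind the in-database filter cells in the notebook.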