getting started with featurestream.io
{
"metadata": {
"name": "getting-started"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Getting started\n",
"The following example opens a stream, asynchronously adds some events from a csv file, and retrieves a prediction.\n",
"Import the library and give it your access key:"
]
},
{
"cell_type": "code",
"collapsed": true,
"input": [
"import featurestream as fs\n",
"fs.set_access('your_access_key')\n",
"# do a quick health check on the service\n",
"print 'healthy=',fs.check_health()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"set access key to your_access_key\n",
"healthy= True\n"
]
}
],
"prompt_number": 21
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We're going to load some events from a CSV file (this is an example that from the KDD'99 cup - see http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html)\n",
"\n",
"Import the `featurestream.csv` library and get an iterator of events from a CSV file:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import featurestream.csv as csv\n",
"events = csv.csv_iterator('../resources/KDDTrain_1Percent.csv')"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"guessing types..\n",
"types= [('0', 'NUMERIC'), ('1', 'CATEGORIC'), ('2', 'CATEGORIC'), ('3', 'CATEGORIC'), ('4', 'NUMERIC'), ('5', 'NUMERIC'), ('6', 'NUMERIC'), ('7', 'NUMERIC'), ('8', 'NUMERIC'), ('9', 'NUMERIC'), ('10', 'NUMERIC'), ('11', 'NUMERIC'), ('12', 'NUMERIC'), ('13', 'NUMERIC'), ('14', 'NUMERIC'), ('15', 'NUMERIC'), ('16', 'NUMERIC'), ('17', 'NUMERIC'), ('18', 'NUMERIC'), ('19', 'NUMERIC'), ('20', 'NUMERIC'), ('21', 'NUMERIC'), ('22', 'NUMERIC'), ('23', 'NUMERIC'), ('24', 'NUMERIC'), ('25', 'NUMERIC'), ('26', 'NUMERIC'), ('27', 'NUMERIC'), ('28', 'NUMERIC'), ('29', 'NUMERIC'), ('30', 'NUMERIC'), ('31', 'NUMERIC'), ('32', 'NUMERIC'), ('33', 'NUMERIC'), ('34', 'NUMERIC'), ('35', 'NUMERIC'), ('36', 'NUMERIC'), ('37', 'NUMERIC'), ('38', 'NUMERIC'), ('39', 'NUMERIC'), ('40', 'NUMERIC'), ('41', 'CATEGORIC')]\n"
]
}
],
"prompt_number": 4
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The parser automatically tries to infer types based on a sample of the file; in this case we don't want to change its type inference; see later for how to do this and more advanced use. If the CSV file has no header, the parser creates variable names `0,1,2,3,...` according to the column numbers (see the documentation for more details on parsing CSV files and other formats)\n",
"\n",
"Look at the first event; we'll use this later."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"e = events.next()\n",
"e"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 5,
"text": [
"{'0': 0.0,\n",
" '1': 'tcp',\n",
" '10': 0.0,\n",
" '11': 0.0,\n",
" '12': 0.0,\n",
" '13': 0.0,\n",
" '14': 0.0,\n",
" '15': 0.0,\n",
" '16': 0.0,\n",
" '17': 0.0,\n",
" '18': 0.0,\n",
" '19': 0.0,\n",
" '2': 'ftp_data',\n",
" '20': 0.0,\n",
" '21': 0.0,\n",
" '22': 2.0,\n",
" '23': 2.0,\n",
" '24': 0.0,\n",
" '25': 0.0,\n",
" '26': 0.0,\n",
" '27': 0.0,\n",
" '28': 1.0,\n",
" '29': 0.0,\n",
" '3': 'SF',\n",
" '30': 0.0,\n",
" '31': 150.0,\n",
" '32': 25.0,\n",
" '33': 0.17,\n",
" '34': 0.03,\n",
" '35': 0.17,\n",
" '36': 0.0,\n",
" '37': 0.0,\n",
" '38': 0.0,\n",
" '39': 0.05,\n",
" '4': 491.0,\n",
" '40': 0.0,\n",
" '41': 'normal',\n",
" '5': 0.0,\n",
" '6': 0.0,\n",
" '7': 0.0,\n",
" '8': 0.0,\n",
" '9': 0.0}"
]
}
],
"prompt_number": 5
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Events are simple JSON maps `{'name1':value1, ..., 'name_k':value_k}`. If `value` is enclosed in quotes then it is treated as a categoric type, otherwise it is treated as numeric type. For example `event={'some_numeric_val':12.1, 'some_categoric_val':'True', 'numeric_as_categoric':'12.1'}`. You can also specify explicit types if you want; see `api.py` for further documentation. (The engine also supports other types including textual and datetime - TODO describe this later!)\n",
"\n",
"Start a new stream:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"stream = fs.start_stream(targets={'41':'CATEGORIC','40':'NUMERIC'})"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 6
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This should try to create a stream with two targets, one for column `41` with categoric type and one for column `40` with numeric type. Check that the stream was created successfully:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"stream"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 7,
"text": [
"Stream[stream_id=3121598785123213694, targets={'40': 'NUMERIC', '41': 'CATEGORIC'}, endpoint=http://vm:8088/mungio/api]"
]
}
],
"prompt_number": 7
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A stream is created by calling `start_stream(targets)` where `targets` is a map of target names to their types, either CATEGORIC or NUMERIC at present. Each stream is uniquely identified by its `stream_id`. If you close your python console or lose the stream handle, you can call `fs.get_stream(stream_id)` to retrieve the stream object."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"stream.stream_id"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 8,
"text": [
"3121598785123213694L"
]
}
],
"prompt_number": 8
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Send the events iterator asynchronously to the stream:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"t=stream.train_iterator(events, batch=500)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 9
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This returns an object `t` which gives you access to the training process (see the documentation for more details).\n",
"\n",
"Wait for the stream to consume some of the events (there are 2500 events in the file, wait until at least 1500 are done!) (almost all the time is spent transferring data, particularly since the servers are in the `us-east-1` AWS region currently).\n",
"Examine the progress:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"t"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 10,
"text": [
"AsyncTrainer[stream_id=3121598785123213694, is_running=False, train_count=2499, error_count=0, batch=500]"
]
}
],
"prompt_number": 10
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"See if it predicts one of the original events correctly:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"stream.predict(e)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 11,
"text": [
"{u'40': 0.0035392940503651336, u'41': u'normal'}"
]
}
],
"prompt_number": 11
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This returns a simple prediction for each target.\n",
"You can also get estimated probabilities for categoric targets by using `predict_full`:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"stream.predict_full(e)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 12,
"text": [
"{u'40': 0.0035392940503651336,\n",
" u'41': {u'anomaly': 0.30498747069759924, u'normal': 0.6950125293024008}}"
]
}
],
"prompt_number": 12
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Featurestream's engine is very good at handling missing values, or noisy data. In particular, for missing values, it can 'integrate them out' to get predictions. For example, the following (predicting with the empty event) returns the distribution of the entire stream:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"stream.predict_full({})"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 14,
"text": [
"{u'40': 0.12046304950468253,\n",
" u'41': {u'anomaly': 0.47833171345263276, u'normal': 0.5216682865473672}}"
]
}
],
"prompt_number": 14
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So, about 47.7% of events had variable 'type' as 'anomaly' and 52.3% as 'normal', and the average value of variable 'oo' was 0.12. In the future, we can allow returning more full values for numeric targets, including distributions. The ability to leave out missing values makes featurestream very powerful for handling a wide range of real-life data sources."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Examine which variables are most related to a target variable:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"stream.related_fields('41')"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 15,
"text": [
"[(u'2', 0.3770347330783282),\n",
" (u'1', 0.19141914188838352),\n",
" (u'3', 0.19008269160361074),\n",
" (u'4', 0.08723446747760971),\n",
" (u'5', 0.0684207229686264)]"
]
}
],
"prompt_number": 15
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This returns a distribution over all variables, summing to 1, which describes how strongly each variable contributes to predicting the value of the target variable. This allows you to understand more about the structure of your data. By default, it returns the top 5 variables but you can change this by passing the argument `k=10` (for the top 10) or `k=-1` (for all fields).\n",
"\n",
"Examine the stream statistics for one of the targets:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"stream.get_stats()['41']"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 16,
"text": [
"{u'accuracy': 0.8783513405362144,\n",
" u'auc': -1.0,\n",
" u'confusion': {u'anomaly': {u'anomaly': 1001, u'normal': 192},\n",
" u'normal': {u'anomaly': 112, u'normal': 1194}},\n",
" u'exp_accuracy': [0.8783513405362154,\n",
" 0.9243864078858117,\n",
" 0.9387295341257783,\n",
" 0.948508066478172,\n",
" 0.962130093642003],\n",
" u'n_correct': 2195.0,\n",
" u'n_models': 30,\n",
" u'n_total': 2499.0,\n",
" u'scores': {u'anomaly': {u'F1': 0.8681699913269733,\n",
" u'precision': 0.8390611902766136,\n",
" u'recall': 0.89937106918239},\n",
" u'normal': {u'F1': 0.887072808320951,\n",
" u'precision': 0.9142419601837672,\n",
" u'recall': 0.8614718614718615}},\n",
" u'type': u'classification'}"
]
}
],
"prompt_number": 16
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The section below about stats explains what these statistics represent. Featurestream calculates these statistics without you having to do k-fold cross-validation, training/test set splits, and so on. Furthermore, they are computed in near real-time as your stream is ingested.\n",
"\n",
"You can also examine the stats for the numeric target:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"stream.get_stats()['40']"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 17,
"text": [
"{u'correlation_coefficient': 0.9080827150546844,\n",
" u'exp_rmse': [0.047846214020193026,\n",
" 0.026736475934155627,\n",
" 0.022184095577206565,\n",
" 0.022028100688291842,\n",
" 0.02069833929335216],\n",
" u'mean_abs_error': 0.04784621402019295,\n",
" u'n_models': 30,\n",
" u'n_predictable': 2498,\n",
" u'n_total': 2499,\n",
" u'n_unpredictable': 1,\n",
" u'rmse': 0.13471805922350763,\n",
" u'type': u'regression'}"
]
}
],
"prompt_number": 17
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How did we do so far?"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"stream.get_stats()['41']['accuracy']"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 18,
"text": [
"0.8783513405362144"
]
}
],
"prompt_number": 18
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Pretty good, we hope!\n",
"\n",
"You can list the streams that you've created:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"fs.get_streams()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 19,
"text": [
"[u'8390918127711644986',\n",
" u'1048662742599671471',\n",
" u'1837065825648386056',\n",
" u'499061556669715890',\n",
" u'8401330973023360610',\n",
" u'8224193276057961194',\n",
" u'3660291060781236780',\n",
" u'7196764205700459945',\n",
" u'2591390975362313808',\n",
" u'8293131819037029096',\n",
" u'1980880942405231734',\n",
" u'1810798868927243447',\n",
" u'2985362765030371318',\n",
" u'8572597830220354586',\n",
" u'505419779668312605',\n",
" u'4001582126205069442',\n",
" u'5622142170443720935',\n",
" u'3121598785123213694']"
]
}
],
"prompt_number": 19
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It's a good idea to close the stream once you're done:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"stream.close()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 20,
"text": [
"True"
]
}
],
"prompt_number": 20
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We hope this example gives you a flavor of what featurestream.io can do for you. You've just scratched the surface! See below for some more details about the calls and objects used above, and more information. We will also be updating this document as we improve the service and add more functionality. We'd love to get some more use case examples. Please say [hello@featurestream.io](hello@featurestream.io)"
]
}
],
"metadata": {}
}
]
}
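As a sanity check on the classification statistics returned by `stream.get_stats()['41']` in the notebook, the reported accuracy, precision, recall and F1 values can be re-derived from the `confusion` entry alone. The sketch below is not part of the featurestream library; `classification_scores` is a hypothetical helper name. It assumes that the outer key of `confusion` is the predicted label and the inner key is the actual label. The notebook does not state this, but it is the reading under which the numbers match the reported `scores`.

```python
# Confusion matrix copied from the get_stats()['41'] output in the notebook.
# Assumption (not confirmed by the source): outer key = predicted label,
# inner key = actual label.
confusion = {
    'anomaly': {'anomaly': 1001, 'normal': 192},
    'normal': {'anomaly': 112, 'normal': 1194},
}


def classification_scores(confusion):
    """Return (accuracy, per-label scores) from a nested confusion dict.

    Hypothetical helper for illustration; it does no zero-division
    handling, so every label must be predicted and observed at least once.
    """
    labels = sorted(confusion)
    n_total = sum(sum(row.values()) for row in confusion.values())
    n_correct = sum(confusion[label][label] for label in labels)
    scores = {}
    for label in labels:
        tp = confusion[label][label]
        predicted = sum(confusion[label].values())         # row total
        actual = sum(confusion[p][label] for p in labels)  # column total
        precision = float(tp) / predicted
        recall = float(tp) / actual
        f1 = 2 * precision * recall / (precision + recall)
        scores[label] = {'precision': precision, 'recall': recall, 'F1': f1}
    return float(n_correct) / n_total, scores


accuracy, scores = classification_scores(confusion)
print(accuracy)
print(scores['anomaly'])
print(scores['normal'])
```

The diagonal sums to 2195 out of 2499 events, which agrees with the `n_correct` and `n_total` fields in the same output. The explicit `float()` conversions keep the division correct under Python 2, which the notebook uses, as well as Python 3.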