Model Prediction APIs (Part 3): Testing

In the last post, we improved the error handling of our prediction API and considered the nuanced decision about which records we should score. In this post, we'll look at how we can test our API using pytest.

As always, we're going to use Python 3, and I'm going to assume that you are either using Anaconda or that you've set up an environment with these packages installed: flask, scikit-learn, and pytest.

Adjustments for Better Testing

Before we get into writing the actual tests, there's one thing that I'd like to change about the response we're generating. Currently, we're only sending back the predicted class (iris type). While this is what the users of our prediction API need, it doesn't provide much information for us to test. The predicted class is chosen because our model's score for that class is higher than the scores for the other classes. Because only the winning class is reported, the predicted class could be the same even if the underlying scores are somewhat different. This is analogous to a function that performs complicated and precise calculations but returns a value rounded to the nearest integer. Even if the returned integer values for many inputs match the expected values, the underlying calculations may not be correct. Ideally, we'd like to verify that the precise results of the calculations are correct before they are rounded.
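
As a toy illustration (entirely separate from our API), here are two functions whose underlying calculations differ but whose rounded outputs can still agree; the functions and values are made up for this example:

def scaled_sum(values):
    # The intended calculation.
    return round(sum(v * 1.10 for v in values))

def buggy_scaled_sum(values):
    # A subtly wrong scaling factor...
    return round(sum(v * 1.12 for v in values))

# ...yet both calls round to the same integer for this input.
print(scaled_sum([1.0, 2.0, 3.0]))        # 7
print(buggy_scaled_sum([1.0, 2.0, 3.0]))  # 7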

To provide ourselves with more data to verify that we're correctly scoring the model, we're going to change the API response to include the probability of each class. This is accomplished by calling model.predict_proba instead of model.predict. We can still use argmax to get the index of the largest probability, which gives us the predicted label, and we'll return the full set of class probabilities to the user in the response.

# Filename: predict_api.py
from flask import Flask, request, jsonify
from sklearn.externals import joblib

app = Flask(__name__)

MODEL = joblib.load('iris-rf-v1.0.pkl')
MODEL_LABELS = ['setosa', 'versicolor', 'virginica']

HTTP_BAD_REQUEST = 400

@app.route('/predict')
def predict():
    sepal_length = request.args.get('sepal_length', default=5.8, type=float)
    sepal_width = request.args.get('sepal_width', default=3.0, type=float)
    petal_length = request.args.get('petal_length', default=3.9, type=float)
    petal_width = request.args.get('petal_width', default=1.2, type=float)

    features = [[sepal_length, sepal_width, petal_length, petal_width]]

    # Changed section.
    probabilities = MODEL.predict_proba(features)[0]
    label_index = probabilities.argmax()
    label = MODEL_LABELS[label_index]
    class_probabilities = dict(zip(MODEL_LABELS, probabilities.tolist()))
    return jsonify(status='complete', label=label,
                   probabilities=class_probabilities)

if __name__ == '__main__':
    app.run(debug=True)

We can then run the API (python predict_api.py) and make a test call through requests:

import requests

data = {'petal_length': 5.1, 'petal_width': 2.3,
        'sepal_length': 6.9, 'sepal_width': 3.1}

response = requests.get('http://127.0.0.1:5000/predict', params=data)
print('Status code: {}'.format(response.status_code))
print('Payload:\n{}'.format(response.text))

# Response should be something like this:
# Status code: 200
# Payload:
# {
#   "label": "virginica",
#   "probabilities": {
#     "setosa": 0.0,
#     "versicolor": 0.2,
#     "virginica": 0.8
#   },
#   "status": "complete"
# }

Basic Test of the API

We're going to begin with a simple example where we score the same record as before, but we'll use pytest to do it. We first need to create a testing file: test_predict_api.py. For now, we can put this in the same directory as our API file: predict_api.py. Note: pytest is capable of automatically locating testing files and functions, but you need to follow its naming conventions for this to work. By default, it will inspect any file whose name begins with the "test_" prefix.

# test_predict_api.py
import json
from predict_api import app

def test_single_api_call():
    data = {'petal_length': 5.1, 'petal_width': 2.3,
            'sepal_length': 6.9, 'sepal_width': 3.1}
    expected_response = {
        "label": "virginica",
        "probabilities": {
            "setosa": 0.0,
            "versicolor": 0.2,
            "virginica": 0.8
        },
        "status": "complete"
    }

    with app.test_client() as client:
        # Test client uses "query_string" instead of "params"
        response = client.get('/predict', query_string=data)
        # Check that we got "200 OK" back.
        assert response.status_code == 200
        # response.data returns bytes, convert to a dict.
        assert json.loads(response.data) == expected_response

One important thing to call out is that we're using a test_client to test calls to our API. This is a feature of Flask. Here, we create a new instance using a context manager, and from it we can simulate GET requests to our API. The query_string keyword argument plays the same role as params in the requests package; it allows us to pass data that is used to build the query string.

From the response, we check that we received the "200 OK" status, and finally we check that the payload of the response matches what we expected. Since the payload comes back as bytes, we use json.loads() to convert it into a dict.

We can now execute the tests using pytest at the command line.

$ pytest
============================= test session starts ==============================
platform linux -- Python 3.6.2, pytest-3.2.1, py-1.4.34, pluggy-0.4.0
rootdir: /home/chris/part3_testing, inifile:
collected 1 item

test_predict_api.py .

=========================== 1 passed in 0.82 seconds ===========================

Fail First

Whenever you're writing automated tests, it's important to verify that you're actually testing something--that your tests can fail. While this is an obvious statement, a common pitfall is writing tests that don't actually test anything. When they pass, the developer assumes that the code being tested is correct, when in reality the tests are only passing because they were poorly written.

One way of preventing this kind of error is to use test-driven development (TDD). We're not going to go into this in depth, but it's a development process where:

  • You start by writing a test for a new feature.
  • You verify that the test fails.
  • You write code to implement that feature.
  • You verify that the test now passes.

If you haven't tried TDD before, I definitely recommend it. It takes discipline, especially when getting started, and buy-in from other developers and stakeholders. However, dedication to the process is rewarded with fewer bugs and lower stress when implementing new features.

If you're not up for TDD, the lazy method that I've used to verify that each test is actually testing something is to change my assert expressions to explicitly fail. For example, we'd change assert response.status_code == 200 to assert response.status_code != 200. If you make this change and rerun the tests, you should receive a failure similar to this:

============================ FAILURES ============================
______________________ test_single_api_call ______________________
    def test_single_api_call():
        data = {'petal_length': 5.1, 'petal_width': 2.3,
                'sepal_length': 6.9, 'sepal_width': 3.1}
        expected_response = {
            "label": "virginica",
            "probabilities": {
                "setosa": 0.0,
                "versicolor": 0.2,
                "virginica": 0.8
            },
            "status": "complete"
        }

        with app.test_client() as client:
            # Test client uses "query_string" instead of "params"
            response = client.get('/predict', query_string=data)
            # Check that we got "200 OK" back.
>           assert response.status_code != 200
E           assert 200 != 200
E            +  where 200 = <Response streamed [200 OK]>.status_code

test_predict_api.py:22: AssertionError

If you're going to use this method, be aware that pytest will only report the first AssertionError that occurs. So, you must change each assert separately and retest.

More Testing

We now have a test call for our API, and it's working. How can we extend this to test multiple calls with different values for the features and different expected results (labels and probabilities)? One quick option is to use the test dataset that we created during the model build. However, we need the class probabilities and predicted label to use as the expected result for each input record.

One important thing to note is that we are testing the API platform and not the model itself. Basically, this means that we don't care if the model is making false predictions; we just want to verify that the model output as scored on the API platform matches the model output from the build/offline/development environment. We'll also need to test that the preparation of features (e.g., mean imputation) is done correctly on the API platform. We're going to save that for the next section of this post.

Since our test dataset may change with each new version of our model, we should incorporate the generation of these data into our model build. I did some light refactoring (more is needed) of our model build script and added the dataset code to the bottom. At the top, there's a function called prep_test_cases which just reformats the features and probabilities into a list of dictionaries with the format:

[{
    "features": {
        "sepal_length": 6.1, "sepal_width": 2.8,
        "petal_length": 4.7, "petal_width": 1.2},
    "expected_status_code": 200,
    "expected_response": {
        "label": "versicolor",
        "probabilities": {
            "setosa": 0.0, "versicolor": 1.0, "virginica": 0.0},
        "status": "complete"}
  }, ...
]

Here's the modified version of our model build code that incorporates the test dataset generation:

import json
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.externals import joblib

MODEL_VERSION = '1.0'

def prep_test_cases(all_features, all_probs, feature_names, target_names):
    all_test_cases = []
    for feat_vec, prob_vec in zip(all_features, all_probs):
        feat_dict = dict(zip(feature_names, feat_vec))
        prob_dict = dict(zip(target_names, prob_vec))
        expected_label = target_names[prob_vec.argmax()]
        expected_response = dict(label=expected_label,
                                 probabilities=prob_dict,
                                 status='complete')
        test_case = dict(features=feat_dict,
                         expected_status_code=200,
                         expected_response=expected_response)
        all_test_cases.append(test_case)
    return all_test_cases

def main():
    # Grab the dataset from scikit-learn
    data = datasets.load_iris()
    X = data['data']
    y = data['target']
    target_names = data['target_names']
    feature_names = [f.replace(' (cm)', '').replace(' ', '_')
                     for f in data.feature_names]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                        random_state=42)
    # Build and train the model
    model = RandomForestClassifier(random_state=101)
    model.fit(X_train, y_train)
    print("Score on the training set is: {:2}"
          .format(model.score(X_train, y_train)))
    print("Score on the test set is: {:.2}"
          .format(model.score(X_test, y_test)))

    # Save the model
    model_filename = 'iris-rf-v{}.pkl'.format(MODEL_VERSION)
    print("Saving model to {}...".format(model_filename))
    joblib.dump(model, model_filename)

    # ***** Generate test data *****
    print('Generating test data...')
    all_probs = model.predict_proba(X_test)
    all_test_cases = prep_test_cases(X_test,
                                     all_probs,
                                     feature_names,
                                     target_names)
    test_data_fname = 'testdata_iris_v{}.json'.format(MODEL_VERSION)
    with open(test_data_fname, 'w') as fout:
        json.dump(all_test_cases, fout)

if __name__ == '__main__':
    main()

Now that we've generated our test data, we need to add a new test (or refactor) to score all the records in this file and check the responses:

# test_predict_api.py
import json
from path import Path
from predict_api import app

DATA_DIR = Path(__file__).abspath().dirname()

def test_api():
    dataset_fname = DATA_DIR.joinpath('testdata_iris_v1.0.json')
    # Load all the test cases
    with open(dataset_fname) as f:
        test_data = json.load(f)

    with app.test_client() as client:
        for test_case in test_data:
            features = test_case['features']
            expected_response = test_case['expected_response']
            expected_status_code = test_case['expected_status_code']
            # Test client uses "query_string" instead of "params"
            response = client.get('/predict', query_string=features)
            # Check that we got "200 OK" back.
            assert response.status_code == expected_status_code
            # response.data returns a byte array, convert to a dict.
            assert json.loads(response.data) == expected_response

Because each test case in the data file carries both the features (API inputs) and the expected response (API outputs), the test code stays very simple. One deficiency with this approach is that the class probabilities are floats, and we're doing an exact comparison of these values. Typically, some tolerance is allowed when comparing float values so that values that are very close are considered equivalent. To handle this, we'd need to parse the expected response and use pytest.approx() when comparing each value in the probabilities. It doesn't require much more code; a rough sketch of one way to do it is shown below.
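
Here's that sketch. The helper name assert_prediction_response is made up for illustration, and it assumes the response structure shown earlier:

import pytest

def assert_prediction_response(actual, expected, rel=1e-6):
    # Compare the non-numeric fields exactly.
    assert actual['status'] == expected['status']
    assert actual['label'] == expected['label']
    # Check that the same classes are present, then compare each
    # class probability with a relative tolerance.
    assert set(actual['probabilities']) == set(expected['probabilities'])
    for class_name, expected_prob in expected['probabilities'].items():
        assert actual['probabilities'][class_name] == pytest.approx(expected_prob, rel=rel)

In test_api(), the final assert would then become assert_prediction_response(json.loads(response.data), expected_response).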

Handling Missing Values

Our API is configured to use mean imputation to replace bad or missing values, but our test dataset doesn't include any records with missing values. However, this is not an issue since we can simulate these data using the data we already have. We simply need to replace the existing values with the mean value for a feature and rescore the records. To our model build script we'll add the following after our original test data generation code:

# ***** Generate test data (Missing) *****
print('Generating test data with missing values...')

# Each group refers to the column indexes with missing features.
# Start with each column by itself, then all pairs, triples...
missing_grps = [(0,), (1,), (2,), (3,),
                (0, 1), (0, 2), (0, 3),
                (1, 2), (1, 3), (2, 3),
                (0, 1, 2), (0, 1, 3),
                (0, 2, 3), (1, 2, 3)]
X_mean = X_train.mean(axis=0).round(1)
all_features = []
all_probs = []
for missing_cols in missing_grps:
    # Cast to "object" type to allow None value (otherwise it's nan).
    X_missing = X_test.copy().astype('object')
    X_scored = X_test.copy()
    for col in missing_cols:
        X_missing[:, col] = None
        X_scored[:, col] = X_mean[col]
    # Use the imputed one to find expected probabilities
    all_probs.extend(model.predict_proba(X_scored))
    all_features.extend(X_missing)

# Add for (0, 1, 2, 3) case. All missing
all_features.extend([[None, None, None, None]])
all_probs.extend(model.predict_proba([X_mean]))
all_test_cases_missing = prep_test_cases(all_features,
                                         all_probs,
                                         feature_names,
                                         target_names)

test_data_fname = 'testdata_iris_missing_v{}.json'.format(MODEL_VERSION)
with open(test_data_fname, 'w') as fout:
    json.dump(all_test_cases_missing, fout)

We likely don't need to be this thorough, but here, for each set of columns that could be missing, we make a copy of the test dataset and replace the values in those columns with the column means. We then score those records to get the prediction probabilities that we expect the API to return, since the missing values will be imputed with the means. The feature vectors are also stored with None for the features that have missing values. To make it easier on the testing side, we'll filter out these features before storing each test case in the JSON file. We can do this by modifying how we create feat_dict in the prep_test_cases function. Here's the revised function:

def prep_test_cases(all_features, all_probs, feature_names, target_names):
    all_test_cases = []
    for feat_vec, prob_vec in zip(all_features, all_probs):
        # Drop features that have value == None
        # Here's the change >>>>>>
        feat_dict = {name: val for name, val
                     in zip(feature_names, feat_vec)
                     if val is not None}
        prob_dict = dict(zip(target_names, prob_vec))
        expected_label = target_names[prob_vec.argmax()]
        expected_response = dict(label=expected_label,
                                 probabilities=prob_dict,
                                 status='complete')
        test_case = dict(features=feat_dict,
                         expected_status_code=200,
                         expected_response=expected_response)
        all_test_cases.append(test_case)
    return all_test_cases

We also need to change our tests to use this new file. While we could just copy the last test function test_api() and replace the filename testdata_iris_v1.0.json, that would result in duplicated code. Since we need the test function to be exactly the same except for the filename, a better approach is to use pytest's parametrize functionality. We simply add a decorator that allows us to specify arguments for the test function and rerun the test for each of these values. In this case, we'll pass in the filename:

# test_predict_api.py
import json
import pytest
from path import Path
from predict_api import app

# Find the directory where this script is.
# **ASSUMES THAT THE TEST DATASET FILES ARE HERE.
DATA_DIR = Path(__file__).abspath().dirname()

@pytest.mark.parametrize('filename',
                         ['testdata_iris_v1.0.json',
                          'testdata_iris_missing_v1.0.json'])
def test_api_from_file(filename):
    dataset_fname = DATA_DIR.joinpath(filename)
    # Load all the test cases
    with open(dataset_fname) as f:
        test_data = json.load(f)

    with app.test_client() as client:
        for test_case in test_data:
            features = test_case['features']
            expected_response = test_case['expected_response']
            expected_status_code = test_case['expected_status_code']
            # Test client uses "query_string" instead of "params"
            response = client.get('/predict', query_string=features)
            # Check that we got "200 OK" back.
            assert response.status_code == expected_status_code
            # response.data returns a byte array, convert to a dict.
            assert json.loads(response.data) == expected_response

Testing Errors

In the last post on error handling (XXXXLINKXXXX), I mentioned that we could be more selective about which records we were willing to score, but I did not provide an example of this. Here we'll tweak our API to look at a simple example where we reject requests that are missing data for petal_width. We'll score all other records, using mean imputation if needed.

As a slight tangent, how did I choose petal_width? Well, if we look at the feature importances (using model.feature_importances_), we see that the fourth feature (petal_width) has a normalized score of 0.51. Since this is the most important feature in our model, it makes the most sense to reject records that are missing this feature rather than just substituting the mean value.
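
If you'd like to check this yourself, something like the following should work against the pickled model from the build step (the exact importances depend on your trained model):

from sklearn.externals import joblib

model = joblib.load('iris-rf-v1.0.pkl')
feature_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
for name, importance in zip(feature_names, model.feature_importances_):
    print('{}: {:.2f}'.format(name, importance))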

A simple way to implement this is to remove the default value for petal_width and then handle the case if it's missing. Nearly all of the code remains the same, but I've included it here for context.

@app.route('/predict')
def predict():
    sepal_length = request.args.get('sepal_length', default=5.8, type=float)
    sepal_width = request.args.get('sepal_width', default=3.0, type=float)
    petal_length = request.args.get('petal_length', default=3.9, type=float)
    # CHANGED: Don't impute for petal_width
    petal_width = request.args.get('petal_width', default=None, type=float)

    # CHANGED: If this is missing, return an error
    if petal_width is None:
        # Provide the caller with feedback on why the record is unscorable.
        message = ('Record cannot be scored because petal_width '
                   'is missing or has an unacceptable value.')
        response = jsonify(status='error',
                           error_message=message)
        # Sets the status code to 400
        response.status_code = HTTP_BAD_REQUEST
        return response

    features = [[sepal_length, sepal_width, petal_length, petal_width]]

    # Changed section.
    probabilities = MODEL.predict_proba(features)[0]
    label_index = probabilities.argmax()
    label = MODEL_LABELS[label_index]
    class_probabilities = dict(zip(MODEL_LABELS, probabilities.tolist()))
    return jsonify(status='complete', label=label,
                   probabilities=class_probabilities)

We can do a quick test to make sure it works for a simple case:

>>> import requests
>>> resp = requests.get('http://localhost:5000/predict?sepal_length=5&sepal_width=3.1&petal_length=2.5')
>>> resp.status_code
400
>>> print(resp.text)
{
  "error_message": "Record cannot be scored because petal_width is missing or has an unacceptable value.",
  "status": "error"
}

Great! It works! Now, we just need to add this to our test suite.

For simplicity, I'm going to skip showing how to modify our old missing value tests and just implement the new tests that handle missing or bad values for petal_width. Basically, I removed all tuples from missing_grps that had a 3 in them (index of petal_width) and the test where all features are missing.
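
For reference, the trimmed list in the build script would look something like this:

# Only groups that exclude column index 3 (petal_width) remain, since
# requests missing petal_width are now rejected instead of imputed.
missing_grps = [(0,), (1,), (2,),
                (0, 1), (0, 2), (1, 2),
                (0, 1, 2)]

The lines that handled the all-missing (0, 1, 2, 3) case should be dropped as well.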

For our new tests, we could use the same JSON format, which would be a cleaner implementation. For clarity though, I'm just going to implement these tests in a separate function that has two test cases for petal_width: one where it's missing and one where it has a bad value ("junk").

@pytest.mark.parametrize('data',
                         [{'petal_length': 5.1,
                           'sepal_length': 6.9,
                           'sepal_width': 3.1},
                          {'petal_length': 5.1,
                           'petal_width': 'junk',
                           'sepal_length': 6.9,
                           'sepal_width': 3.1},])
def test_reject_requests_missing_petal_width(data):
    expected_response = {
        "error_message": (
            "Record cannot be scored because petal_width "
            "is missing or has an unacceptable value."),
        "status": "error"
    }

    with app.test_client() as client:
        # Test client uses "query_string" instead of "params"
        response = client.get('/predict', query_string=data)
        # Check that we got "400 Bad Request" back.
        assert response.status_code == 400
        # response.data returns a byte array, convert to a dict.
        assert json.loads(response.data) == expected_response

We can rerun our tests and verify that these pass. Of course, we should also try to change the == to != to verify that they fail for each condition too. This will help ensure that we're testing what we actually think we're testing.

Identifying Issues

Now that we have our tests, we might wonder if they're actually going to catch bugs in our code. Perhaps you already found some issues when you were creating these tests and trying them on your API code. If not, here are some simple experiments that you can try (do each of these independently):

  • In the API code, change the default (mean imputation) value for sepal_length from 5.8 to 5.3.
  • In the API, put back the mean imputation for petal_width. You should see failures in the tests that are expecting the API to return "400 Bad Request" when petal_width is missing.
  • In the API, change the text of the error message that is sent when petal_width is missing.
  • We can also mimic the case where someone accidentally modifies the model and tries to deploy it. To test this, we can build an alternate version of the model, deploy it, but keep using the test data files (JSON) from the original model. A quick way to implement this is to use a different value for random_state in the training/test set split (e.g., random_state=30). Remember to change the model output filename in joblib.dump() to something else (e.g., 'iris-rf-altmodel.pkl'); you'll need to change MODEL in the API to reference this file. Also, make sure you don't execute the code that generates the test data files, as that would rebuild them based on the alternate model. When you rerun your tests, you will likely see failures in all tests except for the ones that reject requests when petal_width is missing or invalid. If your tests still pass, try another random_state, as it's possible the alternate model is equivalent to the original because the training set stayed the same or the changes weren't enough to alter the model.

Our tests are definitely catching problems, but are we catching all of the problems? The simple answer is that we probably are not catching everything. While creating this post, I tried changing the mean imputation (default value) for sepal_width to 3.1 instead of 3.0. When I reran the tests, they all passed. Perhaps this isn't a big deal; perhaps our model just isn't that sensitive to small shifts in sepal_width around the mean value, and this is the feature of lowest importance. However, we used our test set for test cases, and these data points don't necessarily fall near the boundaries between classes. If we had more test cases, or just better test cases, we might have been able to catch this type of bug.

Generally, it's hard to test completely. But just because we know we're not catching everything doesn't mean we shouldn't write tests, and we should continue to extend them as new bugs are found and fixed.

Wrapping Up

We have seen that automated tests can help us find bugs in our code. While we started with testing a single API call, we were able to quickly move towards a framework for running numerous test cases, and it only required adding a little extra code.

Testing is an important topic, so it's likely we'll revisit it in the future. Here's a quick preview of some areas that we didn't cover:

  • Speed of Tests: It's beneficial to run tests often while you're changing your code. This makes it easier to detect errors early when refactoring existing code or adding new features. If the tests take a while to run, developers are less likely to do this. One approach is to separate tests that run quickly from those that take more time.
  • Mocking & Patching: We saw that small changes to the default (mean imputation) value for sepal_width didn't cause our tests to fail. If this was a requirement, we could use patching to intercept the call to model.predict_proba() during scoring to verify that the correct values are being substituted.
  • Fixtures: This is a feature of pytest that lets you create, configure, and destroy resources so that each test can run in a clean and consistent environment. If you're familiar with "setup" and "teardown" in other unit testing frameworks, fixtures are an extension of this idea.
  • Integration of Subsystems: Currently, we just have our model and our API. In subsequent posts, we'll look at adding a database backend and perhaps some other services. How do we test these? How do we test the system as a whole?
  • Test Coverage: Did we test every line of our code? We could create a test coverage report to help us see which lines of our code were run during testing and which were not. This won't tell us if we've handled all possible cases, but it can give us information into where our test suite is falling short.
  • Advanced Testing Methodologies: We're unlikely to cover these topics, but I wanted to mention them. With property-based testing (see hypothesis), you create parameterized tests and the framework generates an extensive set of test cases for you. This can lead to more comprehensive tests without requiring you to think up all the edge cases. Mutation testing (see Cosmic Ray) takes a very different approach. It works with your existing test cases and actually modifies your source code (code under test) in some small way (mutation) to see if your existing tests fail. If all tests still pass, your test code is incomplete as it's unable to find the bugs that were introduced by the mutation.
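
To make the mocking and patching idea a bit more concrete, here's a rough sketch (not from the original post) of how we could verify the imputed value for sepal_width; the test name and the fixed probability row are made up for illustration, and it assumes the final version of predict_api.py shown above:

from unittest import mock

import numpy as np

import predict_api

def test_sepal_width_mean_imputation():
    with mock.patch.object(predict_api, 'MODEL') as fake_model:
        # Return a fixed probability row so the rest of the view still works.
        fake_model.predict_proba.return_value = np.array([[0.0, 0.0, 1.0]])
        with predict_api.app.test_client() as client:
            # Omit sepal_width so the API has to fall back to its default.
            client.get('/predict', query_string={'sepal_length': 6.9,
                                                 'petal_length': 5.1,
                                                 'petal_width': 2.3})
        # Inspect the feature vector the API actually built.
        (features,), _ = fake_model.predict_proba.call_args
        assert features[0][1] == 3.0  # the imputed mean for sepal_width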

Footnotes

Let's suppose you have several "testdata" files that you want to include, but you don't want to list the files explicitly. Instead, you'd like to use every JSON file that matches a specific naming convention or that lives in a "testdata" directory. One easy way to implement this is to use glob from Python's standard library.

from glob import glob
@pytest.mark.parametrize('filename',
                         glob(DATA_DIR.joinpath('testdata_*.json')))