Last active September 22, 2022 08:54
Examples of things we can test for autograding and auto-feedback

Test Examples

Trying to identify testable things for use in autograders and automated code feedback tools...

Automated testing of computer code can be used to support learning in several ways.

For example, automated tests may be used to:

  • structure learning activities by providing simple tests that can check code development as it is produced;
  • provide feedback through static and dynamic code analysis;
  • support automated grading for marking large numbers of scripts quickly and exactly.

Automated tests may also be useful in "marker support", helping markers manually mark code-based assessments.

This document reviews several Python packages that may be used in support of automated testing for autograding and automatic feedback generation within a Jupyter notebook context.

Testing also complements use of debuggers and debugging tools. Debugging tools and approaches will be reviewed in a separate document, and then a third document will explore workflows that combine test and debug approaches, or at least, how they can be used in tandem to support teaching and learning.

To read up on: Evaluation of a tool for Java structural specification checking, A Dil & J Osunde: looks at autograding Java code along various dimensions, including "structural specification testing as compared to other kinds of support, including syntax error assistance, style checking and functionality testing". [A skim of the paper suggests a similar sentiment regarding testing... One thing we need to watch out for is not autograding against some criteria just because it's easy to automate those tests...]

Python Errors and Warnings

Automated tests often report back using errors and warning messages, so it's worth starting by seeing how they are handled in a notebook environment.

Python errors and warnings are raised in notebooks as highlighted report messages.
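As a minimal sketch of the difference between the two (the check_positive function is made up for illustration): a warning reports a problem and lets the code carry on, whereas an error stops execution unless it is caught. The standard library warnings.catch_warnings context manager lets a test capture warnings rather than just display them:

```python
import warnings

def check_positive(value):
    """Hypothetical helper: warn on zero, raise an error on negative values."""
    if value < 0:
        raise ValueError('value must not be negative')
    if value == 0:
        warnings.warn('value is zero', UserWarning)
    return value

# Capture warnings so that a test can inspect them
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    check_positive(0)
warning_messages = [str(w.message) for w in caught]

# Catch the error so that a test can inspect it
try:
    check_positive(-1)
    error_message = None
except ValueError as err:
    error_message = str(err)

print(warning_messages, error_message)
```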


If you have installed and enabled the jupyter_contrib_nbextensions package, the skip-traceback extension will style the error message with a collapsible heading that reports the error message, but hides the traceback detail.

inspect Module

The inspect module is a Python module that "provides several useful functions to help get information about live objects such as modules, classes, methods, functions, tracebacks, frame objects, and code objects".

We can use the inspect module to help us test a variety of code features, such as various properties of a function.

# An example

# A pointless function
# with a comment spread over two lines
def myFunction():
    ''' Here is a docstring,
        split over two lines.
    '''
    #This function doesn't do anything of substance
    print('Goodbye' \
          )

We can run a naive test for the docstring of a function by extracting it as follows:


We can also get the function name as a string:


Alternatively, the inspect.getdoc() method allows us to extract the docstring and clean up any indentation using inspect.cleandoc().
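A minimal sketch of these three lookups, using a cut-down copy of the example function so that the block runs on its own. The __doc__ and __name__ attributes are standard function attributes:

```python
import inspect

def myFunction():
    ''' Here is a docstring,
        split over two lines.
    '''
    print('Goodbye')

raw_doc = myFunction.__doc__            # docstring more or less as written
name = myFunction.__name__              # function name as a string
clean_doc = inspect.getdoc(myFunction)  # indentation and outer blank lines removed

print(repr(raw_doc))
print(name)
print(repr(clean_doc))
```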

For the function signature (i.e. the args passed into the function), we can use inspect.signature.

Before we start, an existence test could be useful to check that an expected object with a specific name already exists...

try:
    myFunction
except NameError as n:
    # report the missing name
    print(n)

import inspect

First, let's test that we do indeed have a function...


Now let's check for a docstring:
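One way of turning both checks into feedback is sketched below. The check_function helper and its messages are hypothetical; inspect.isfunction and inspect.getdoc are the standard library calls doing the work:

```python
import inspect

def check_function(obj):
    """Hypothetical helper: return a list of feedback messages about obj."""
    feedback = []
    if not inspect.isfunction(obj):
        feedback.append('Expected a function, found {}.'.format(type(obj).__name__))
        return feedback
    if not inspect.getdoc(obj):
        feedback.append('{}() has no docstring.'.format(obj.__name__))
    return feedback

def documented():
    '''Say hello.'''
    return 'hello'

def undocumented():
    return 'hello'

print(check_function(documented))
print(check_function(undocumented))
print(check_function(42))
```

An empty list means that no problems were found, which makes the result easy to use in an assertion or to display as feedback.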


We can also obtain any comments that are defined immediately before a function definition using the inspect.getcomments() method:


We can obtain the code used to define a function as a single string (inspect.getsource()) or one string per line.


The inspect.getsourcelines() method also returns the line number of the start line of the function relative to the start of the 'file' it is defined in:
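The source-related calls can be sketched as follows. In a notebook they can be applied directly to a function defined in a cell; outside a notebook, inspect needs a source file to read, so this sketch (an assumption about how you might run it, not part of the original notebook) first writes a cut-down copy of the example to a temporary .py file and imports it. The module name student_code is made up:

```python
import importlib.util
import inspect
import os
import tempfile

SOURCE = (
    "# A pointless function\n"
    "# with a comment spread over two lines\n"
    "def myFunction():\n"
    "    ''' Here is a docstring. '''\n"
    "    print('Goodbye')\n"
)

# Save the code to a real .py file so that inspect can find its source
with tempfile.NamedTemporaryFile('w', suffix='.py', delete=False) as f:
    f.write(SOURCE)
    path = f.name

spec = importlib.util.spec_from_file_location('student_code', path)
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)

comments = inspect.getcomments(module.myFunction)
source = inspect.getsource(module.myFunction)
lines, start_line = inspect.getsourcelines(module.myFunction)
os.remove(path)  # tidy up the temporary file

print(comments)
print(source)
print(len(lines), start_line)
```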


Note that the strings returned by the getsource* methods do not include the (redundant) backslash character from line 15.

We can use the inspect.signature() method to inspect the arguments passed into a function and any default values assigned to them:

def myFunctionWithArgs(x, y = 1):
    tmp = 0
    return tmp

The signature is returned as a Signature object, which holds the parameters in the order in which they were defined:

sig = inspect.signature(myFunctionWithArgs)

#Also view it as a string: str(sig)

We can also probe the signature a little further:


And further still...

str(sig.parameters['x'].kind), sig.parameters['x'].default, \
str(sig.parameters['y'].kind), sig.parameters['y'].default

So if we have asked a student to define a function documented with a docstring, with a given name and some specified function arguments, at least one of which takes a specified default value, we should be able to test and feedback on those elements.
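A sketch of such a test, under the assumption that we know the expected parameter names and default values in advance. The check_signature helper and its feedback messages are hypothetical:

```python
import inspect

def check_signature(func, expected_params, expected_defaults=None):
    """Hypothetical helper: compare parameter names and defaults with expectations."""
    expected_defaults = expected_defaults or {}
    feedback = []
    params = inspect.signature(func).parameters
    if list(params) != list(expected_params):
        feedback.append('Expected parameters {}, found {}.'.format(
            list(expected_params), list(params)))
    for name, default in expected_defaults.items():
        if name not in params:
            continue  # a missing parameter shows up in the name comparison
        actual = params[name].default
        if actual is inspect.Parameter.empty:
            feedback.append('Parameter {} should have a default value.'.format(name))
        elif actual != default:
            feedback.append('Parameter {} should default to {!r}, not {!r}.'.format(
                name, default, actual))
    return feedback

def myFunctionWithArgs(x, y = 1):
    tmp = 0
    return tmp

print(check_signature(myFunctionWithArgs, ['x', 'y'], {'y': 1}))
print(check_signature(myFunctionWithArgs, ['x', 'y'], {'y': 2}))
```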

Notes on testing functions

If we ask students to define a function and then, in separate cells, run that function with certain arguments and return the answer as the output to the cell, we can test whether the output values of those cells are as expected.

If we more generally ask a student to define a function with a given name so that we can run one or more tests against that function ourselves, what are we to do with students who have correctly implemented the function in all respects apart from the function name?

A human marker would typically look to award what marks they can, but what of the automated marker? If we don't provide a function template containing the function name required for testing purposes, how might we catch the mis-named function?

One thing we might do is look for mistakes in capitalisation and/or the inclusion/omission of underscores in a function name (depending in part on style guidance).

We can do this by inspecting the names of defined objects in the current namespace.

import types

functions = [f for f in vars().values() if type(f) == types.FunctionType]

We can then inspect these functions as before:


We could then take the function names and look to see if we get matches.

For example, if we were expecting to see my_function rather than myFunction we could test on that, perhaps around a signature:

desired = 'my_function'
actual = 'myFunction'

assert actual.replace('_','').lower() == desired.replace('_','').lower()

Having identified the actual name used by the student for a function, if it's close enough to the desired name to be identified, we can call the function within our test indirectly through the function's actual name:
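A minimal sketch of that lookup and indirect call. The find_function and normalise helpers are hypothetical, and a plain dict stands in for the namespace; in a notebook you might pass globals() instead:

```python
def normalise(name):
    """Ignore case and underscores when comparing names."""
    return name.replace('_', '').lower()

def find_function(namespace, desired):
    """Hypothetical helper: return (name, function) for a near-match name, or None."""
    if callable(namespace.get(desired)):
        return desired, namespace[desired]
    for name, obj in namespace.items():
        if callable(obj) and normalise(name) == normalise(desired):
            return name, obj
    return None

def myFunction():
    return 'Goodbye'

student_namespace = {'myFunction': myFunction, 'x': 1}  # stand-in for globals()
match = find_function(student_namespace, 'my_function')
if match is not None:
    actual_name, func = match
    result = func()  # call the student's function through the matched name
    print(actual_name, result)
```

Returning the matched name as well as the function means the feedback can tell the student which name was expected and which one was found.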


Another way of trying to spot whether a misnamed function is the one we want may be to compare its arguments or code features with those of the desired function via various inspect calls.


Linters are tools that can statically analyse code to look for stylistic features within the code.

Linters may be useful for assessing and/or providing feedback on code style, as well as identifying particular issues, errors or bugs within the code.

  • pycodestyle [docs], "formerly called pep8", is a simple Python style checker for checking the style of code contained within one or more python files;

  • flake8 [docs] is a more general tool that "glues together pep8, pyflakes, mccabe, and third-party plugins to check the style and quality of some python code". The jupyterlab-flake8 extension provides support for flake8 within JupyterLab. Additional plugins provide support for testing code against PEP8 naming conventions (pep8-naming) and checking for common bugs (flake8-bugbear).

  • pylint [docs] is another Python linter, a "static code analysis tool which looks for programming errors, helps enforcing a coding standard, sniffs for code smells and offers simple refactoring suggestions". Again, this is developed primarily as a command line tool that expects either a package name or a file path from which it will load the code to be analysed.

pycodestyle and flake8 are both most easily invoked in a notebook context using the pycodestyle_magic extension.

Other tools are capable of repairing non-style guide conforming code. For example, jupyter-autopep8 [docs] can be used to reformat/prettify code in a notebook code cell so that it conforms to PEP 8 style guide using the autopep8 Python module. isort (also available as a flake8 plugin, flake8-isort) will tidy and sort package import statements.

Several other Jupyter extensions also exist that can be used to automatically style code within a code cell:

  • the code_prettify/code_prettify extension [docs] provides a toolbar button that runs Google's yapf code formatter to style individual code cells or all code cells in a notebook;
  • the code_prettify/autopep8 extension [docs] can be used to restyle code within a single code cell or across all code cells in a notebook;
  • there are several extensions available for both Jupyter notebooks and JupyterLab that will automatically style code using the Black code formatter; for example:
    • blackbook and joli: run black formatter over a set of notebooks;
    • nb_black: automatically restyle code whenever a code cell is run; (this appears to use an IPython post_run_cell event handler);
    • jupyter-black: toolbar buttons and keyboard shortcuts for formatting a selected code cell or all code cells using black;
    • jupyterlab_code_formatter: a universal code formatter for JupyterLab (supports Black, YAPF, Autopep8, isort); this extension also supports R code styling with styler and formatR;
    • jupyterlab_black: JupyterLab extension to apply Black formatter to code within a code cell;
    • blackcellmagic: IPython magic (%%black) to format python code in cell using black;
    • jupyterlab_formatblack: uses blackcellmagic to provide a "hacky" JupyterLab cell formatter that adds a "Format cell with Black" command to JupyterLab.

Further extensions provide support for additional diagnostics:

  • execute_time/ExecuteTime [docs] will report the last time at which a cell was executed, along with how long it took to execute, as code cell output;
  • the Variable Inspector extension [docs] can display a pop-up window detailing the currently set variables and the values assigned to them.

For security related linting, see the bandit linter.

In addition to linters, there are also several packages for analysing code complexity, including mccabe and radon.
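To give a feel for what such a measure looks like, here is a deliberately crude sketch using only the standard library ast module. It simply counts branching constructs; it is not the algorithm used by mccabe or radon, and the rough_complexity name is made up:

```python
import ast

# Node types treated as decision points in this rough count
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                ast.BoolOp, ast.IfExp)

def rough_complexity(source):
    """Return 1 + the number of branching nodes found in the source code."""
    tree = ast.parse(source)
    return 1 + sum(isinstance(node, BRANCH_NODES) for node in ast.walk(tree))

sample = (
    "def f(x):\n"
    "    if x > 0:\n"
    "        return 1\n"
    "    for i in range(x):\n"
    "        pass\n"
    "    return 0\n"
)

print(rough_complexity(sample))
```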

For plagiarism detection, and comparing files in order to find similar code, see symilar.

Although linters may be used for feedback and/or assessment, we need to be clear about the context in which students may produce code. Whilst testing that students hand-craft PEP8 compliant code may provide for an easy automated assessment and feedback opportunity, we should also ask ourselves whether students would be better served by encouraging them to use environments that nudge them towards producing PEP8 compliant code anyway. This may take the form of providing students with tools in any supplied programming environment that:

  • automatically and silently restyle code;
  • display warnings about incorrectly styled code but make no attempt to improve it;
  • display warnings about incorrectly styled code and then allow students to automatically fix it with a provided tool;
  • display warnings about incorrectly styled code, allow students to automatically fix it and provide a "track changes" / diff view over the original code and the automatically restyled code.

There are several reasons for taking the latter reporting approach:

  • it provides a way of helping develop professional skills and practice in the use of professional coding support tools and seeing what they can detect;
  • providing linters that run over code automatically gives students a form of continuously provided, automated self-test and feedback whenever they run code fragments.


pycodestyle_magic is an IPython magic function for pycodestyle and the flake8 module.

Using the block magic allows you to run the linter over the contents of the code cell, generating temporary files that can be passed to the linters, and then creating code cell output reports wherever there is a divergence from the style guide.

The magic does not raise an exception, which could otherwise be detected, but the magic code is quite simple and should be easily modified to provide alternative ways of reporting errors for handling by nbgrader, for example.

#!pip3 install flake8 pycodestyle pycodestyle_magic pep8-naming flake8-bugbear
%load_ext pycodestyle_magic

print( 'hello' )

# A pointless function
# with a comment spread over two lines
def myFunction():
    ''' Here is a docstring,
        split over two lines.
    '''
    #This function doesn't do anything of substance
    print('Goodbye' \
          )

# A pointless function
# with a comment spread over two lines
def myFunction():
    ''' Here is a docstring,
        split over two lines.
    '''
    #This function doesn't do anything of substance
    print('Goodbye' \
          )

# A pointless function
# with a comment spread over two lines
def my_BrokenFunction():
    ''' Here is a broken function.
    '''
    # a is undefined...
    # though it could be a global so we won't know till runtime...
    b = a
    #The following statement is broken
    print( 'goodbye' - )

(Looks like we're missing a line number at the end there?)

For a list of flake8 warning codes, see here:

PyFlakes warning codes can be seen here:

If the pep8-naming package (a plugin for flake8) is installed, flake8 checks will additionally test code against PEP8 naming conventions.

By inspection of the magic for the flake8 library, we can start to pull out reports for each line of code:

def lineNumprint(txt, start=1, sep=':\t'):
    ''' Display line number followed by
        corresponding line of code. '''
    txt = txt.split('\n')
    out = []
    for i in range(start,len(txt)+start):
        out.append('{ln}{sep}{txt}'.format(ln=i, sep=sep,
                                           txt=txt[i-start]))
    return '\n'.join(out)

#Based on the pycodestyle_magic code
import tempfile
from flake8.api import legacy as flake8_module
import io
from contextlib import redirect_stdout

#Let's test the myFunction code...
cell = inspect.getsource(myFunction)

#Won't this start to clutter things with undeleted temporary files?
with tempfile.NamedTemporaryFile(mode='r+', delete=False) as f:
    # save to file
    f.write(cell)
    # make sure it's written
    f.flush()

flake = flake8_module.get_style_guide(extend_ignore=['W292'])

# collect one "line:col:error" report per warning
zz = []
with io.StringIO() as buf, redirect_stdout(buf):
    _ = flake.check_files([f.name])
    for line in buf.getvalue().splitlines():
        # on windows drive path also contains :
        temp_file, line, col, error = line.split(':')[-4:]
        zz.append('{}:{}:{}'.format(int(line), col, error))


Note that the backslash warning has dropped out... Should we escape things on the way in?

nose tests

nose is a Python testing framework containing a wide range of tools that support test creation.

A convenient list of assertion tests can be found here:

from nose.tools import assert_equal, assert_is_instance, assert_is_none, \
    assert_almost_equal

assert_equal allows you to test a wide range of datatypes for equality. For example:

assert_equal(1,1), assert_equal("a",'a'), assert_equal( (1,'a'), (1, 'a')), \
assert_equal([1,2],[1,2]), assert_equal({1,2},{2,1}),

assert_is_instance allows you to check that an object is of a specific type:

assert_is_instance([1,2], list), assert_is_instance({1,2}, set), \
assert_is_instance({'a':1}, dict), \
assert_is_instance(1, int), assert_is_instance(1.0, float),

assert_is_none will test if a variable is set to None.


Another useful null tester is the pandas isnull function:

from numpy import nan  # numpy.NaN was an alias of nan and was removed in NumPy 2.0
from pandas import isnull, NaT
from nose.tools import assert_true

#Hacky... should really raise a "is not null" error...
def assert_is_null(value):
    assert_true(isnull(value))

isnull(None), isnull(nan), isnull(NaT), \
assert_is_null(None), assert_is_null(nan), assert_is_null(NaT)

When testing numerical values, it can be useful to test if two numbers are "nearly" the same, to a particular degree of precision (eg to N decimal places):

assert_almost_equal(1.20,1.21,1), assert_almost_equal(1.19,1.21,1), \
assert_almost_equal(0.96,0.94,1), assert_almost_equal(-0.02,0.029,1), \
assert_almost_equal(10.1,1e1,-1), assert_almost_equal(1010,1e3,-2)

If a test fails, the exception raised by assert_almost_equal tells you how far out you are:
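nose is no longer maintained and may not import on recent Python versions, so the sketch below uses the standard library unittest.TestCase.assertAlmostEqual method, from which the nose.tools helper is derived, to show the kind of message produced when the test fails:

```python
import unittest

# nose.tools builds its assert_* helpers from unittest.TestCase methods
checker = unittest.TestCase()

# Passes: the difference rounds to zero at 1 decimal place
checker.assertAlmostEqual(1.20, 1.21, 1)

# Fails: capture the message rather than letting the error propagate
try:
    checker.assertAlmostEqual(1.2, 1.5, 1)
    message = None
except AssertionError as err:
    message = str(err)

print(message)
```

On recent Python versions the message names both values and the number of places, and also includes the size of the difference.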


mypy Static Type Checker

mypy is a static type checker for Python code that can be used to help identify some common code errors.

#!pip3 install mypy
# Example magic (needs installer, run locally) for now:
def broken_types(x: int, y: str):
    ''' This function is annotated with an incorrect static type argument. '''
    return x + y

We can call the mypy analyser as a module on code retrieved using the inspect utility:

import mypy.api

mypy.api.run(['-c', inspect.getsource(broken_types)])

It would be useful if we could use mypy for static typing of functions created by students that do not explicitly type function arguments but that we annotate with desired types.

For example, if we could pass some sort of arg_types definition:

mypy.api.run(['-c', inspect.getsource(broken_types)],
             arg_types({'a': int, 'b': str}) )

then we might be able to modify the function definition using a string replace in the function argument definition.

Alternatively, we might not trust students to correctly name arguments, but we might expect them to present them in a particular order, in which case rather than pass the desired types of named arguments via a dict, we might pass them as a list: [int, str]

It might also be useful if we could dynamically rewrite the function signature using a decorator if we wanted to analyse the code using dynamic typing:

@patched_vars({'a': int, 'b': str})
def example(a, b):
    return a + b
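One way such a decorator might work is sketched below. This is purely hypothetical: patched_vars is not something mypy provides, and the check happens at run time, when the function is called, rather than statically. A different example function (repeat) is used so that correctly typed arguments produce a result:

```python
import functools
import inspect

def patched_vars(arg_types):
    """Hypothetical decorator: annotate func with arg_types and check them on each call."""
    def decorator(func):
        # Attach the desired types as if the student had written the annotations
        func.__annotations__ = {**func.__annotations__, **arg_types}
        sig = inspect.signature(func)

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            bound = sig.bind(*args, **kwargs)
            for name, value in bound.arguments.items():
                expected = arg_types.get(name)
                if expected is not None and not isinstance(value, expected):
                    raise TypeError('{} should be {}, not {}'.format(
                        name, expected.__name__, type(value).__name__))
            return func(*args, **kwargs)
        return wrapper
    return decorator

@patched_vars({'a': int, 'b': str})
def repeat(a, b):
    return a * b

print(inspect.signature(repeat))
print(repeat(2, 'ab'))
```

Calling repeat('2', 'ab') raises a TypeError naming the offending argument, which could be caught and reported as feedback.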

pandas tests

The pandas package contains several utilities for testing one or more null values as well as more general tests on pandas Series and DataFrames.

For example, the isnull() function will test one or more null values:

import pandas as pd

pd.isnull(None), pd.isnull([None,True,2])

For Series and DataFrames, we can test for things like emptiness:

pd.Series().empty, pd.DataFrame().empty, pd.DataFrame(columns=['x','y']).empty

The pandas equals() method lets us test the equality of two DataFrames:

pd.DataFrame({'a':[1,2], 'b':[1,2]}).equals( pd.DataFrame({'a':[1,2], 'b':[1,2]}) )

We can also use the pandas.testing module to run assertion tests for equivalence of dataframes:

import pandas.testing as pdtest

pdtest.assert_frame_equal(pd.DataFrame({'a':[1,2], 'b':[1,2]}), 
                          pd.DataFrame({'a':[1,2], 'b':[1,2]}))

gprof2dot "converts profiling output to a dot graph"

#!pip3 install gprof2dot graphviz gprof2dot_magic
#Not required?
#%load_ext gprof2dot_magic
%gprof2dot print('hello world')

Verbose reporting

Switch modes for the exception handlers.

Valid modes: Plain, Context, Verbose, and Minimal.

If called without arguments, acts as a toggle.

%xmode Verbose

The pandas.testing module also supports assert_index_equal and assert_series_equal tests.

Testing charts

As well as running static and dynamic tests over code, we can also run tests and linting style tests over charts.

This may be useful when trying to provide feedback on charts created by students, as well as the possibly more brittle approach of trying to write tests against them.


The plotchecker package [docs] was created by Jess Hamrick to support the autograding of matplotlib chart objects using nbgrader.

The package defines a wide range of methods that can be used to inspect / extract the values of particular chart properties and run assertion tests against them.

As well as generic chart properties (such as chart title, axis tick labels and locations, axis limits, etc.), tools also exist for checking properties of line charts, scatterplots and bar charts.


The vislint_mpl package as described in Linting for Visualization: Towards a Practical Automated Visualization Guidance System, provides "a prototype visualization linter for matplotlib".

The linter has a dependency on the tesseract OCR package, which may complicate installation...


Berkeley seems to be a hive of activity for autograding solutions, although I'm not sure how the various pieces all fit together, whether they compete, or whether certain solutions have deprecated others.

Gofer-Grader [docs] is a small library for autograding Jupyter notebooks & python files. It can be used to:

  • write grading tests in multiple formats;
  • run grading tests (interactively or in batch) against a Jupyter Notebook or Python file;
  • provide multiple strategies for determining overall grade from pass / fail results of single tests.

"an effort to help autograding with Berkeley's offering of Data 8 online, Gofer also works with two other components that could be useful for other courses. The primary one, Gofer service, is a tornado service that receives notebook submissions and runs/grades them in docker containers. The second piece, Gofer submit is a Jupyter notebook extension that submits the current notebook to the service. Though they could be modified to work on your own setup, these are meant to play particularly nicely with Jupyterhub."

Example of tests in ok format: each test runs in turn and if one fails, the feedback / hint associated with that test is displayed, as well as the error. (If a test passes, it could be useful to also (optionally?) feed back on that: 'well done, your code correctly did the thing the test was testing', etc.)

DSEP (Data Science Education Program) Infrastructure: extension for assignment navigation and fetching assignments. (Based off nbgrader [], for integration with okPy and (old)

Hard to know what was then and what is now, also how things work (or don't work) with each other.

using ci-travis
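The "run each test in turn and show the hint for the first failure" behaviour can be sketched in plain Python. This illustrates the idea only; it is not the ok test format or the Gofer-Grader API, and run_tests, the (check, hint) pairs and the add function are all made up:

```python
def run_tests(tests, praise=False):
    """Run (check, hint) pairs in order; stop at the first failure and report its hint."""
    report = []
    for check, hint in tests:
        try:
            check()
        except AssertionError as err:
            report.append('FAILED {}: {} ({})'.format(check.__name__, hint, err))
            break
        if praise:
            report.append('PASSED {}'.format(check.__name__))
    return report

def add(a, b):    # stand-in for a student's function
    return a - b  # deliberate bug

def check_zero():
    assert add(0, 0) == 0, 'add(0, 0) should be 0'

def check_sum():
    assert add(2, 3) == 5, 'add(2, 3) should be 5'

report = run_tests([(check_zero, 'Check the zero case.'),
                    (check_sum, 'Are you adding or subtracting?')],
                   praise=True)
for line in report:
    print(line)
```

The optional praise flag covers the case raised above, where it might be useful to give feedback on tests that pass as well as on those that fail.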

A Custom Jupyter Widget Library For Providing Flexible Grading of Nbgrader Based Jupyter Notebooks: ?? looks like it might let you do things like manually(?) mark all answers in turn for a particular student, or each question in turn across all students. No docs / example of how to use it / what the intended workflow is.

okpy: "Automate Grading & Personalize Feedback". Hosted service for free if you teach a computer science or data science course at an accredited school. Open Source: the source code for OK is available on GitHub. You can run an instance of OK on your own servers. [repo]

Supports programming projects by running tests, tracking progress, and assisting in debugging. [docs]

A lightweight feedback extension for Jupyter [no real docs; works inside notebook; need a OUseful review? Compare with other tools that compare state of output etc - NII tool, Microsoft tool?]. Ordo allows users to add feedback messages in a cell's metadata. The feedback is appended to the cell's output as a success or failure message based on the result the cell produces.

A very lightweight python framework for creating auto-evaluated exercises inside a jupyter (python) notebook. Given a text that describes the expectations, students are invited to write their own code, and can then see the outcome on teacher-defined data samples, compared with the results obtained through a teacher-provided solution, with a visual feedback.

Format and tools for authoring and distributing Jupyter notebook assignments. The notebook format is not specific to a programming language or autograding framework, but was designed to be used with okpy, which is Python based. Contributions to support other testing frameworks, such as nbgrader, and other programming languages are welcome. "This format is designed for easy assignment authoring. A notebook in this format can be converted automatically to the OK format before it is distributed to students." ?? format doesn't seem to have feedback, just the test? ??? but can it also work with the nbgrader autograder? Can nbautograder handle okpy?

useful from R?

From 2016 and seems to have stalled since then: "an automatic short answer grading system written in Python. Given a question, a correct reference answer and a student response, it computes a real-valued score for the student response based on its semantic similarity with the correct answer"

Exercises - ordo?

Jupyter exercise extension.

A simple extension to hide solution cells in Jupyter Lab. Meant for teachers and students.

c.JupyterLabRmotrSolutions.is_enabled = True # True, False
c.JupyterLabRmotrSolutions.role = 'teacher' # 'teacher', 'student'

Teacher mode lets you set a code or markdown cell as an exercise answer (can you highlight multiple cells?); student mode displays an answer button that reveals answers.


