Asking for help with errors in statistical programming

No matter how experienced you are in statistical programming, you are certain to encounter errors in your code from time to time. You will recognize some of them and be able to solve them quickly, but you will never stop encountering unexpected new errors that you need help to resolve. Asking for help with statistical programming is hard. You need to give the person helping you enough information to reproduce the exact problem you have, without overwhelming them with extra details. These "extra details", unfortunately, are the rest of the code and data you are working with -- and if you knew exactly which parts were causing the problem, you might have been able to solve it yourself.

This guide is designed to teach you two processes that will help you narrow down the cause of an error in a disciplined way. First, it describes a process we will call "fault isolation". This process will often allow you to solve problems yourself by systematically removing potential causes of the error. Second, if fault isolation does not solve the problem, this guide describes how to prepare an "error report" and a "minimal reproducible example" (or "reprex") that you can share with others when asking for help. Unless you have a computer science background, it is likely that nobody has ever shown you these processes formally.

Whenever your code stops running with an error, perform your "due diligence" before asking someone else for help. You should always begin by explaining your problem to the rubber duck you keep on your desk. If that is not enough, do the following. First, read the error message before you begin these processes. Many error messages are helpful and may include the solution. Good programmers spend lots of time writing informative messages that check for common mistakes or misconceptions about their code and guide the user to the correct implementation. Next, check the documentation for any clues about the specific error message you are encountering. Finally, search the internet for solutions: often the combination of "your error message" + "programming language name" will return specialized results from programming forums with clues to what is going wrong.

If these steps don't work, try running the code with a different dataset. This will help determine whether the source of the error is the code or the data, and seeing whether the nature of the error changes (or doesn't) with different data will also help you diagnose it. If you still cannot figure out how to get your code to run, start to isolate the fault. Then, if you can isolate the fault but still cannot find a solution, prepare an error report and a reprex to send to your support person. The rest of this guide will lead you through that process.

Definitions:

  • A fault is a catchall term for something that is going wrong in code.
  • A break is the point where code stops executing instead of finishing execution.
  • An error is a code instruction to break the code and inform the user something is wrong.
  • A bug is a catchall term for unexpected behavior:
    • Breaking bugs are unknown causes for incomplete code execution.
    • Non-breaking bugs are unknown causes for code that runs, but produces results that are not desired.
    • Silent bugs are unknown causes for code that runs but, unknown to the user, produces incorrect results.

Fault isolation in statistical programming

Statistical programming is different from regular programming because it involves the interaction of code and data. Trying a new command on data you have had for a long time can lead you to encounter an error. Similarly, code that has always worked in the past may break when it is used on new data. The second type of error, in particular, may be "deep" in the code in the sense that this type or pattern of data was never anticipated by the code author, and it tends to pose the hardest challenges. In statistical programming, abstract data features such as zero variation in a subgroup, missing values, non-English or non-ASCII characters, data size or length, data storage types, and rounding precision can all cause code to fail in unexpected ways (to name just a few). Finding these requires both conceptual creativity and coding skill.
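
A few built-in Stata commands can surface many of these data features quickly. The sketch below is only illustrative; the dataset and variable names (mydata.dta, outcome, group) are placeholders for your own.

    * Quick checks for data features that commonly break code
    use "mydata.dta", clear
    describe                          // storage types, variable and value labels
    codebook outcome group, compact   // ranges, unique values, missing counts
    misstable summarize               // which variables contain missing values
    bysort group: summarize outcome   // spot subgroups with zero variation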

Each programming language or environment has its own suite of debugging tools. The DIME Wiki has a guide to debugging tools in Stata. Most debugging tools involve features that you don't usually want to invoke during ordinary execution, such as printing lots of output to the console, revealing subroutine calls and macro evaluations, or pausing and resuming execution interactively. Take some time to familiarize yourself with these tools.
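
In Stata, for example, a few of the most common of these tools look like the following (a sketch only; see the DIME Wiki guide for the full set):

    * Trace execution to see each line and subroutine call as it runs
    set trace on
    set tracedepth 2      // limit how many levels of subroutines are shown

    * Pause a do-file at a chosen point to inspect things interactively
    pause on
    pause                 // execution stops here until you resume it

    * Turn tracing back off once you have what you need
    set trace off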

In statistical programming, you should first try running your code on a different dataset with the same structure. Often this is enough to determine whether the problem arises from the code or the data. (If the error happens anyway, it's probably the code; if it doesn't, it's probably the data.) From there, fault isolation proceeds in three steps, in order:

  1. Systematically remove variables or inputs from the command returning the error. Stop if a single variable or input always causes the command to break, and investigate it further. Check all the characteristics of that input, including its meta-information such as naming, labelling, levels, value labels, and missing values. In Stata, commands like codebook are key here. Use simple tests to discern whether particular observations cause the failure, and test that a randomly generated input or variable with similar characteristics does not also cause it. (A sketch of this step appears after this list.)

  2. Systematically reduce code to isolate the error. Usually, this means creating another file and reducing the code as close as possible to simply loading the data and executing the failing command. You should be able to isolate a single command that fails, or whose removal allows the code to succeed. This narrows the problem to a small portion of the code for closer analysis, in particular using a trace command or similar to find which subcommand is causing the failure.

  3. Attempt to produce and then break a working example. Some problems can be "reverse-engineered". Create a minimal code file that works with your data, or create minimal data that will execute correctly with the code you already have. Using this, try to determine how the failing code and data differ. This process is helpful because, with intentionally created data or code, you should be able to predict the expected result at each step and catch more subtle failures than in the actual data you are using.
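
As referenced in step 1, a loop like the one below can automate removing one variable at a time from a failing command. This is only a sketch: the command (regress), the outcome y, and the variable list x1-x4 are placeholders for your own failing command and inputs.

    * Re-run the failing command, omitting one candidate variable at a time
    local allvars "x1 x2 x3 x4"
    foreach omit of local allvars {
        local keepvars : list allvars - omit
        capture noisily regress y `keepvars'
        if _rc == 0 {
            display as result "Runs once `omit' is removed -- investigate `omit'"
        }
        else {
            display as error "Still fails without `omit' (return code " _rc ")"
        }
    }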

If, after attempting fault isolation, you still get stuck, you will at least have greatly improved your ability to describe the problem. In particular, you might already have some examples of code or data in which your problem does not occur, using a setup that seems like it should be identical. When you get to this point, you should ask for help by creating two things: an error report and a minimal reproducible example (sometimes called a "reprex"). Together, these should give another person as complete an understanding as possible of the error and your attempted solutions, as well as the exact materials needed to make the same problem happen on their machine. They can then help you with their knowledge and expertise.

Preparing an error report and minimal reproducible example

Report the error precisely. For example, do not write "my regression errored out". When asked "when did it break", the answer is not "when I ran it"; and when asked "what did it do", the answer is not "it gave me an error". A good initial report answers these questions (what, where, when) precisely. For example, a report might read, in part, "the xtnbreg command on Line 89 of analysis.do quit after Iteration 12 with error code 430 and the message: cannot compute an improvement -- discontinuous region encountered". Note anything else that seemed unusual to you at the time.

Corollary: Locate the error precisely. In many statistical programming languages, functions or commands are not self-contained. They usually call subroutines, which are other functions or commands that already exist. Using the debugging tools to find out which subroutine or subcommand actually produced the error will be very helpful to someone who knows the general types of errors various commands can encounter. Additionally, if you are encountering a nonspecific "syntax error", it is especially important to trace the root source of the error, since this type of error often means there is a problem arising from the way a command or subcommand is literally written into the program (such as changes in structure between program or package versions).
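
In Stata, one way to do this is to turn on tracing around only the failing line and capture its return code. The command below (xtnbreg, from the example report above) stands in for whatever command is failing for you:

    * Trace a single failing command to see which subcommand errors
    set tracedepth 3
    set trace on
    capture noisily xtnbreg outcome x1 x2, fe   // placeholder failing command
    set trace off
    display "return code: " _rc                 // the numeric code for the report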

Keep only the variables needed to cause the problem. This will usually leave you with a very small dataset, as it is rare (but not impossible!) that a combination of two variables is causing your problem. During the fault isolation step, you should have followed a process such as systematically removing each variable from processes like regression until the code worked, allowing you to pinpoint exactly the minimum set of information needed to reproduce the error. Your reprex should contain only these variables.
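
In Stata, this usually amounts to a few lines like the following (the file and variable names are placeholders):

    * Keep only the minimum set of variables that reproduces the error
    use "full_dataset.dta", clear
    keep panel_id year outcome enrollment
    save "reprex_data.dta", replace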

Keep only the code needed to reproduce the problem. The person helping you should not have to run lots of unrelated code in order to get to the error you are encountering. In particular, the code you provide should quickly load the data (given that the user sets their directory path appropriately on the first line) and it should run through for the user without the need to, say, set up a directory structure to hold intermediate and output files. Test this yourself in a blank working space before sending.

If applicable, provide a working version. If you were able to get the code to run during the fault isolation exercise, provide the closest version of the code and data that successfully ran as part of the reprex and note the differences. The person reading them will find this useful for hypothesizing what is broken.

Make sure you can provide a de-identified dataset! Usually this is not hard -- it is unlikely that you are doing analysis on an identifying variable, such as names or geolocations, so simply restricting the reprex to the data points involved will usually satisfy this requirement. Think carefully about it, however, before sending data over the internet or to people not authorized to view confidential data.
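
A sketch of this step in Stata, assuming the identifying variables are things like names and coordinates (the names below are placeholders):

    * Remove any direct identifiers before the reprex data leave your machine
    use "reprex_data.dta", clear
    capture drop name phone_number latitude longitude   // placeholder identifiers
    describe                                             // review what remains
    save "reprex_data.dta", replace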

A simple example of an error report and reprex
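
Putting the pieces together, an error report and reprex might look something like the following. Everything here is illustrative -- the file names, variable names, and the error itself echo the xtnbreg example above rather than a real project.

The error report: "The xtnbreg command on Line 89 of analysis.do quits after Iteration 12 with error code 430 and the message: cannot compute an improvement -- discontinuous region encountered. The same specification ran without error on last quarter's extract of the data. Removing variables one at a time shows the error occurs only when enrollment is included, and codebook shows that enrollment has no variation within several panels. The attached reprex_data.dta contains only the de-identified variables needed to reproduce the error."

The accompanying reprex.do:

    * reprex.do -- reproduces error 430 from analysis.do, Line 89
    * Set this path to the folder where reprex_data.dta is saved
    cd "your/folder/here"
    use "reprex_data.dta", clear      // 4 variables, de-identified

    xtset panel_id year
    xtnbreg outcome enrollment, fe    // fails here with error 430 on my machine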
