Like most people hired at LEAR, you probably have long coding experience and already well-established coding habits. You typically have an idea of what language you will code in and with what libraries. You know common coding practices well. However, it turns out that code, especially when produced by brilliant coders, has annoying shortcomings. Therefore, please consider the points on this page to avoid common mistakes.
Code at LEAR is always linked to a paper, referred to as "the paper" in the following.
You should not need to talk about or show code to your advisors. Advisors think about the paper, which does not contain code, and scientific contributions. Only people who "have bugs" talk about code.
What is expected from LEAR research code, ranked by importance.
It should give good results in terms of precision, speed, etc. (whatever you claim in the paper). Random segfaults are not acceptable.
Refactoring should be easy. For example, you should be ready to replace parts of your code or extract parts from it to be used elsewhere.
Always assume that your advisor will ask you to re-run the experiments. You should know, and preferably state in LaTeX comments in the paper, what to run to produce each number and figure in the paper.
Seed random generators in a reproducible way.
Even if you have been developing your stuff alone for 1.5 years, you should assume that your code will be transferred: if the paper is successful, people will want to re-use your code. There have been several instances of PhDs leaving with code that was too complicated for followers to re-use.
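As an illustration of reproducible seeding, here is a minimal sketch using only the Python standard library. The names `SEED` and `sample_split` are hypothetical and chosen for this example; the point is that the seed is an explicit value that can be recorded next to the experiment, and that the generator is local rather than hidden global state.

```python
import random

SEED = 0  # hypothetical value; record it next to the number it produced


def sample_split(n_items, n_train, seed=SEED):
    """Return a reproducible (train, test) split of range(n_items)."""
    rng = random.Random(seed)  # private generator: no hidden global state
    indices = list(range(n_items))
    rng.shuffle(indices)
    return sorted(indices[:n_train]), sorted(indices[n_train:])


if __name__ == "__main__":
    train, test = sample_split(10, 7)
    print("train:", train)
    print("test:", test)
```

Calling `sample_split` twice with the same seed returns the same split, so re-running the experiment uses the same data partition.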
Please start from the good coding habits you already have (or take a look at [1,2]):
- use a versioning system (svn or git). Creating a git repository in a directory costs nothing.
- indent your code and be consistent in naming.
- do not optimize code unless needed.
But relax, there are typical software practices that are not so important:
- portability: all machines run 64-bit Linux. You can assume this will last.
- uniform coding style: indent 3 or 8 spaces, nobody will care.
- documentation: there should not be much more to document than what is written in the paper.
- helpful error messages: you can assume that the one running your code is a developer, so assertions are ok.
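A small sketch of what "assertions are ok" can look like in Python. The function `mean_descriptor` is a hypothetical example, not taken from any LEAR code: the assertions state what the code assumes about its input, and a developer hitting one gets a traceback pointing at the violated assumption.

```python
def mean_descriptor(descriptors):
    """Average a non-empty list of equal-length descriptors (lists of floats)."""
    assert len(descriptors) > 0
    dim = len(descriptors[0])
    assert all(len(d) == dim for d in descriptors), "inconsistent dimensions"
    n = float(len(descriptors))
    return [sum(d[i] for d in descriptors) / n for i in range(dim)]
```

Note that Python strips `assert` statements when run with the `-O` flag, so this style suits research code run by developers, not input validation in deployed software.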
Depending on the project, you may or may not be allowed to choose your programming language. If you choose a non-standard language, this will place a burden on followers, so there should be a good reason for this.
The main numerical languages used at LEAR are Matlab/Octave and Python/numpy. Python is a much richer language and does not have licensing problems, but Matlab is simpler.
The main low-level language is C. It is interfaced with mex for Matlab, and cython or SWIG for Python, or simply called as a subprocess.
Some languages (read: C++) require or encourage you to write a lot of code that does not actually translate to any machine instruction. When you end up writing a lot of get/set, public/private, virtual, namespace, or even comments, this should ring a bell.
It turns out that it is difficult to write simple code, because it is difficult/subjective to define what simple is. Here are a few tentative guidelines.
In the time you saved, think of smart research ideas.
Deep call stacks are hard to follow, especially if functions are scattered over several files in different directories. The person re-using your code has the paper in hand, and wants to see where equation (5) is applied to data X, not follow a 3-level call stack.
C is simpler than C++, Matlab is very simple, but it is possible to write complicated code in any language. Languages often have shiny "advanced" features. Here is a table with a few examples:
Language | Complicated Features (non-exhaustive!) |
---|---|
C++ | templates, operator overloading, boost, C++11 |
Python | operator overloading, dynamic addition of methods to instances |
Matlab | manipulation of caller's symbol table |
If you think you need advanced features, please think again.
If you still think so, please choose the smallest subset that you can live with.
Although portability is not a major concern, it is a good test for code simplicity. Does your Matlab code work with Octave? Does your Python code run on Python 2.6? Does your C++ code compile on gcc 3.x?
Do you need grayscale images with pixels other than 32-bit float? (or images in more than 2 dimensions!) Or matrices with elements other than double? Are you ever going to use something other than L2 normalization? Genericity comes at a cost in terms of lost focus and code bloat, so use it only if you really think that it will be useful. And remove it if it turns out that it was not necessary.
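To make the point concrete, here is a sketch of the focused alternative in plain Python. The function name `l2_normalize` is hypothetical. It handles one norm and one data type; there is no `norm_type` argument and no dispatch on element type, because nothing in this example calls for them.

```python
import math


def l2_normalize(v):
    """Return v (a list of floats) scaled to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    assert norm > 0  # the zero vector cannot be normalized
    return [x / norm for x in v]
```

If another normalization is ever needed, writing a second three-line function at that point costs less than carrying an unused generic interface from the start.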
If reinventing the wheel takes 10 lines of code, please do so. Dependencies always incur more work to understand. You can copy code from libraries if relevant (and allowed by the license). Corollary: avoid layers. For many useful libraries or programs there are wrappers to make them "cleaner" or "easier to use" (e.g. scikit-learn for libsvm, C++ mex interface above mex, Boost's interface above BLAS, Python's threading module above threads). Please evaluate whether the wrapper adds something significant to the original library, or whether you could write a more focused wrapper yourself.
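As an example of a wheel that fits in about 10 lines, here is a sketch of a brute-force nearest-neighbor search in plain Python. The function name `nearest_neighbor` and the choice of task are assumptions made for illustration. For small experiments, something like this may be easier for a follower to read than an extra dependency.

```python
def nearest_neighbor(query, database):
    """Index of the database vector closest to query (squared L2 distance)."""
    assert len(database) > 0
    best_index, best_dist = -1, float("inf")
    for i, v in enumerate(database):
        assert len(v) == len(query)
        dist = sum((a - b) ** 2 for a, b in zip(query, v))
        if dist < best_dist:  # strict '<' keeps the first of equal candidates
            best_index, best_dist = i, dist
    return best_index
```

If this ever becomes the bottleneck, it can be replaced then; until that happens, the whole behavior is visible in one place.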
Building "frameworks", "toolboxes" or "pipelines" is engineering work. LEAR does not sell this kind of thing.
Unused code is harmful. It is unlikely that more than 10% of the code you write will be in the execution path to a result in the paper. During development you will test many variants, most of which fail or are not optimal. If these variants are not worthwhile to put in the paper, they are not worthwhile to keep in the code. Remove them.
"I keep it just in case" is not an option: use a source code versioning system (svn or git) to keep track of old variants. Do not "comment out" code.
At LEAR, people are doing science, not technique. There is rarely a reason why bleeding-edge research must be built on new techniques. Old libraries have undergone Darwinian selection. If they survived, they are probably worth something.
Corollary: new = untrusted. New libraries or techniques should be used very cautiously. We are not the beta-testers of the latest machine learning package.
References
- [1] The Linux kernel coding style is full of useful remarks, see e.g. Section 8 about comments.
- [2] Google C++ Style Guide: a coding style with arguments (favorite: do not use iostream/fstream).