mdouze/LEAR coding guidelines.md Secret

    LEAR coding guidelines

Rationale

Like most people hired at LEAR, you probably have long coding experience and already well-established coding habits. You typically have an idea of which language you will code in and which libraries you will use. You know common coding practices well.
However, it turns out that code, especially when produced by brilliant coders, has annoying shortcomings. Therefore, please consider the points on this page to avoid common mistakes.
Code at LEAR is always linked to a paper, referred to as "the paper" in the following.
You should not need to talk about or show code to your advisors. Advisors think about the paper, which does not contain code, and scientific contributions. Only people who "have bugs" talk about code.
Objectives

What is expected from LEAR research code, ranked by importance.
Code should work

It should give good results in terms of precision, speed, etc. (whatever you claim in the paper).
Random segfaults are not acceptable.
Code should be flexible

Refactoring should be easy. For example, you should be ready to replace parts of your code or extract parts from it to be used elsewhere.
The paper should be reproducible

Always assume that your advisor will ask you to re-run the experiments.
You should know, and preferably state in LaTeX comments in the paper, what to run to produce each number and figure in the paper.
Seed random generators in a reproducible way.
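A minimal sketch of reproducible seeding, assuming Python and its standard `random` module. The function name `sample_training_subset` and the seed value are hypothetical and only serve the illustration. The idea is to give each experiment its own generator built from an explicit seed, so that a re-run draws exactly the same samples.

```python
import random

SEED = 1234  # hypothetical value; note it next to the numbers it produced


def sample_training_subset(n_items, n_samples, seed=SEED):
    """Pick n_samples distinct indices in [0, n_items), reproducibly."""
    # A private generator: other code calling random.random() cannot
    # change what this function returns.
    rng = random.Random(seed)
    return rng.sample(range(n_items), n_samples)


first_run = sample_training_subset(1000, 5)
second_run = sample_training_subset(1000, 5)
assert first_run == second_run  # same seed, same draw
```

Passing the seed as an argument rather than seeding the global generator keeps the result independent of the order in which other parts of the program consume random numbers.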
Code should be transferable

Even if you have been developing your stuff alone for 1.5 years, you should assume that your code will be transferred: if the paper is successful, people will want to re-use your code.
There have been several instances of PhDs leaving with code that was too complicated for followers to re-use.
Recommendations

Baseline

Please start from the good coding habits you already have (or take a look at [1,2]):


- use a versioning system (svn or git). Creating a git repository in a directory costs nothing.
- indent your code and be consistent in naming.
- do not optimize code unless needed.


But relax, there are typical software practices that are not so important:


- portability: all machines run 64-bit Linux. You can assume this will last.
- uniform coding style: indent 3 or 8 spaces, nobody will care.
- documentation: there should not be much more to document than what is written in the paper.
- helpful error messages: you can assume that the one running your code is a developer, so assertions are ok.
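As an illustration of the last point, here is a small sketch in Python where a plain assertion stands in for an elaborate error message. The function `l2_distance_squared` is a hypothetical example, not code from an actual LEAR project.

```python
def l2_distance_squared(x, y):
    """Squared Euclidean distance between two vectors given as lists."""
    # A developer reading the traceback sees immediately what went wrong;
    # no custom exception class or user-facing message is needed.
    assert len(x) == len(y), "dimension mismatch: %d vs %d" % (len(x), len(y))
    return sum((a - b) ** 2 for a, b in zip(x, y))


print(l2_distance_squared([0.0, 0.0], [3.0, 4.0]))
```

Calling it with vectors of different lengths stops the program with an `AssertionError` at the line where the assumption is violated, which is usually all a developer needs.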


Languages

Depending on the project, you may or may not be allowed to choose your programming language.
If you choose a non-standard language, this will place a burden on followers, so there should be a good reason for this.
Numerical languages

The main numerical languages used at LEAR are Matlab/Octave and Python/numpy. Python is a much richer language and does not have licensing problems, but Matlab is simpler.
Low-level languages

The main low-level language is C. It is interfaced with mex for Matlab, and cython or SWIG for Python, or simply called as a subprocess.
Write code not headers

Some languages (read: C++) require or encourage you to write a lot of code that does not actually translate to any machine instruction. When you end up writing a lot of get/set, public/private, virtual, namespace, or even comments, this should ring a bell.
Write simple code

It turns out that it is difficult to write simple code, because it is difficult/subjective to define what simple is. Here are a few tentative guidelines.
Do not write smart code

In the time you saved, think of smart research ideas.
Write shallow code

Deep call stacks are hard to follow, especially if functions are scattered over several files in different directories.
The guy re-using your code has the paper in his hands, and wants to see where equation (5) is applied to data X, not follow a 3-level call stack.
Use simple languages (and features)

C is simpler than C++, Matlab is very simple, but it is possible to write complicated code in any language. Languages often have shiny "advanced" features. Here is a table with a few examples:


| Language | Complicated features (non-exhaustive!) |
|----------|----------------------------------------|
| C++ | templates, operator overloading, boost, C++11 |
| Python | operator overloading, dynamic addition of methods to instances |
| Matlab | manipulation of caller's symbol table |


If you think you need advanced features, please think again.
If you still think so, please choose the smallest subset that you can live with.
Although portability is not a major concern, it is a good test for code simplicity. Does your Matlab code work with Octave? Does your Python code run on Python 2.6? Does your C++ code compile on gcc 3.x?
Avoid genericity

Do you need grayscale images with pixels other than 32-bit float (or images in more than 2 dimensions)? Or matrices with elements other than double? Are you ever going to use something other than L2 normalization?
Genericity comes at a cost in terms of lost focus and code bloat, so use it only if you really think that it will be useful. And remove it if it turns out that it was not necessary.
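To make the point concrete, here is a sketch of a deliberately non-generic function in Python. It handles one norm and one data layout (a flat list of floats) and nothing else. The name `l2_normalize` is hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a list of floats to unit L2 norm. No other norm is supported."""
    norm = math.sqrt(sum(x * x for x in v))
    assert norm > 0, "cannot normalize a zero vector"
    return [x / norm for x in v]


print(l2_normalize([3.0, 4.0]))
```

There is no `norm_type` argument, no dtype switch and no axis parameter. If another norm is ever needed, adding it at that point costs a few lines; until then the reader has a few lines less to understand.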
Avoid libraries

If reinventing the wheel takes 10 lines of code, please do so. Dependencies always incur more work to understand. You can copy code from libraries if relevant (and allowed by the license).
Corollary: avoid layers. For many useful libraries or programs there are wrappers to make them "cleaner" or "easier to use" (e.g. scikit-learn for libsvm, a C++ mex interface above mex, Boost's interface above BLAS, Python's threading module above threads). Please evaluate whether the wrapper adds something significant to the original library, or whether you could write a more focused wrapper yourself.
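As an example of a wheel that fits in about ten lines, here is a sketch of average precision for a ranked result list, written in plain Python. It follows one common definition (the mean of the precision values at the ranks of the relevant items). Evaluation protocols differ between benchmarks, so check it against the one used in your paper. The function name and input format are assumptions made for this illustration.

```python
def average_precision(ranked_labels):
    """ranked_labels: 0/1 relevance flags, best-scored result first."""
    n_relevant = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_labels, start=1):
        if is_relevant:
            n_relevant += 1
            # precision at this rank = relevant items so far / rank
            precision_sum += n_relevant / rank
    assert n_relevant > 0, "no relevant item in the ranking"
    return precision_sum / n_relevant


print(average_precision([1, 0, 1]))
```

A reader can check these lines against the definition in the paper in a minute, which is harder to do with a call into an external evaluation package.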
The F-word

Building "frameworks", "toolboxes" or "pipelines" is engineering work. LEAR does not sell this kind of thing.
Bury dead code

Unused code is harmful.
It is unlikely that more than 10% of the code you write will be in the execution path to a result in the paper. During development you will test many variants, most of which fail or are not optimal. If these variants are not worthwhile to put in the paper, they are not worthwhile to keep in the code. Remove them.
"I keep it just in case" is not an option: use a source code versioning system to keep track of them (svn or git). Do not "comment out" code.
Old = trusted

At LEAR, people are doing science, not technique. There is rarely a reason why bleeding-edge research must be built on new techniques.
Old libraries have undergone Darwinian selection. If they survived, they are probably worth something.
Corollary: new = untrusted. New libraries or techniques should be used very cautiously. We are not the beta-testers of the latest machine learning package.
References


[1] The Linux kernel coding style is full of useful remarks; see e.g. Section 8 about comments.

[2] Google C++ Style Guide: a coding style with arguments (favorite: do not use iostream/fstream).