kelly-sovacool/URSSI_winterschool_notes.md

## URSSI_winterschool_notes.md

      
          
    Notes from the URSSI Winter School

Update: these notes are now here: https://sovacool.dev/posts/2019/12/urssi-winterschool-notes
Slides & other resources: https://github.com/si2-urssi/winterschool
Contents:


Day 1

Intro to Software Design (Jeff Carver)
Think like a programmer (Andy Loftus)
Intro to design patterns (Jeff Carver)
Basics of packaging Python programs (Kyle Niemeyer)


Day 2

Collaboration with Git & GitHub (Karthik Ram)
Git Exercises (James Howison)
Software testing & continuous integration (Kyle Niemeyer)
Git Exercises ctd (James Howison)


Day 3

Code Review (Jeffrey Carver)
Open Science & Software Citation (Kyle Niemeyer)
Reproducibility
Documentation (Kyle Niemeyer)


Day 1

Intro to Software Design (Jeff Carver)


Whether you know it or not, you’re doing software design. Make those decisions with intent & purpose.
Characteristics of good design

Firmness: hard to write bugs accidentally
Suitable for intended purpose
Interesting & useful to users


Principles of design:

Traceability - easy to understand what the software is supposed to do.
Minimize intellectual distance - as close to the real-world as possible
Don’t reinvent the wheel. Re-use good design if it’s already a solved problem.
Accommodate change.
Fail gracefully.


Think like a programmer (Andy Loftus)


Solve easy problems; defer hard ones until they are easy.

Zen of Python excerpt: “If the implementation is hard to explain, it's a bad idea. If the implementation is easy to explain, it may be a good idea.”


Think about code before you write it

Identify use cases
Define goals from use cases
Split into small, easy pieces
Define one piece at a time


Thinking about use-cases before the goal helps you focus on the small, easy-to-solve pieces (exact problem at hand, limit the scope of the problem) without getting bogged down in any grandiose, big-picture ideas.
Encapsulation

Isolate unrelated concerns.
Hide changing things.
Python details:

Use the @property & @var.setter decorators for getters & setters.
@classmethod decorator for different constructors & other methods that work on the class but not instances of the class.
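
A minimal sketch of both decorators, assuming a hypothetical Temperature class (the class, attribute, and method names are made up for illustration, not from the talk):

```python
class Temperature:
    """Hypothetical class that stores a temperature in kelvin."""

    def __init__(self, kelvin):
        self.kelvin = kelvin  # assignment goes through the setter below

    @property
    def kelvin(self):
        """Getter: read access looks like a plain attribute."""
        return self._kelvin

    @kelvin.setter
    def kelvin(self, value):
        """Setter: validation stays hidden behind the attribute."""
        if value < 0:
            raise ValueError("temperature in kelvin cannot be negative")
        self._kelvin = value

    @classmethod
    def from_celsius(cls, celsius):
        """Alternate constructor: works on the class, not an instance."""
        return cls(celsius + 273.15)


t = Temperature.from_celsius(25.0)
t.kelvin = 300.0   # calls the setter
print(t.kelvin)    # calls the getter
```

Callers never see `_kelvin`, so the internal representation can change later without breaking them.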


Environment variables

collections.ChainMap: use it to prioritize program options.

os.environ to access shell environment variables.
Defaults = some dict
combined = ChainMap(cmdline_args, os.environ, defaults)

Equivalent of stringing together dictionary.update but in reverse
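
A runnable sketch of the lookup order described above. The option names (COLOR, EDITOR) and the combine_options helper are assumptions for illustration; in a real program cmdline_args would come from something like argparse.

```python
import os
from collections import ChainMap


def combine_options(cmdline_args, environ, defaults):
    """Return a view that searches the mappings left to right."""
    return ChainMap(cmdline_args, environ, defaults)


defaults = {"EDITOR": "nano", "COLOR": "blue"}  # hypothetical defaults
cmdline_args = {"COLOR": "red"}                 # hypothetical parsed arguments

combined = combine_options(cmdline_args, os.environ, defaults)
print(combined["COLOR"])  # red: the command line wins over everything else

# The same priority with dict.update, applied in reverse order.
merged = dict(defaults)
merged.update(os.environ)
merged.update(cmdline_args)
assert merged["COLOR"] == combined["COLOR"]
```

Unlike the update version, ChainMap does not copy anything: it looks keys up in the underlying mappings each time.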


Structuring code for readability: Trey Hunner blog post: craft your python like poetry.
Low barrier to entry. Make your code usable & accessible to lots of people.

Make a runnable sample

Keep it short; one command if possible.

slick example: curl URL/quickstart.sh | bash


Clean up after running
Run it multiple times in a row & it does the exact same thing every time


Intro to design patterns (Jeff Carver)


Chain of responsibility

Common interface to handle requests, but user doesn’t need to know which specific method handles the request.
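
One way this could look in Python; a sketch only, with hypothetical handler classes for parsing files by extension:

```python
class Handler:
    """Base handler: passes the request along if it cannot handle it."""

    def __init__(self, successor=None):
        self.successor = successor

    def handle(self, request):
        if self.can_handle(request):
            return self.process(request)
        if self.successor is not None:
            return self.successor.handle(request)
        raise ValueError(f"no handler for {request!r}")

    def can_handle(self, request):
        raise NotImplementedError

    def process(self, request):
        raise NotImplementedError


class CsvHandler(Handler):
    def can_handle(self, request):
        return request.endswith(".csv")

    def process(self, request):
        return f"parsed {request} as CSV"


class JsonHandler(Handler):
    def can_handle(self, request):
        return request.endswith(".json")

    def process(self, request):
        return f"parsed {request} as JSON"


# The caller only talks to the head of the chain.
chain = CsvHandler(JsonHandler())
print(chain.handle("data.json"))  # parsed data.json as JSON
```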


Creational pattern: Builder

Create various representations of the same object. Abstract construction steps with different implementations of methods for different object variants.
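
A small sketch of the idea, assuming a hypothetical report that can be built as Markdown or HTML. The construction steps live in build_report; each builder supplies its own implementation of those steps.

```python
from abc import ABC, abstractmethod


class ReportBuilder(ABC):
    """Abstract construction steps shared by every report variant."""

    def __init__(self):
        self.parts = []

    @abstractmethod
    def add_title(self, text):
        ...

    @abstractmethod
    def add_paragraph(self, text):
        ...

    def result(self):
        return "\n".join(self.parts)


class MarkdownBuilder(ReportBuilder):
    def add_title(self, text):
        self.parts.append(f"# {text}")

    def add_paragraph(self, text):
        self.parts.append(text)


class HtmlBuilder(ReportBuilder):
    def add_title(self, text):
        self.parts.append(f"<h1>{text}</h1>")

    def add_paragraph(self, text):
        self.parts.append(f"<p>{text}</p>")


def build_report(builder):
    """Same steps for every variant; only the builder differs."""
    builder.add_title("Results")
    builder.add_paragraph("All tests passed.")
    return builder.result()


print(build_report(MarkdownBuilder()))
print(build_report(HtmlBuilder()))
```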


Structural pattern: Proxy

Only load something when you actually need it if it takes a long time to load or is expensive to create. e.g. when loading webpage, it’ll display the text before images have finished loading, with blank placeholder where image will load.
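
A sketch of the lazy-loading flavor of the pattern. Image and ImageProxy are hypothetical classes; the "slow load" is faked with a string so the example stays runnable.

```python
class Image:
    """Stand-in for an object that is expensive to create."""

    load_count = 0  # counts how many times the expensive load happened

    def __init__(self, path):
        Image.load_count += 1
        self.pixels = f"<pixels of {path}>"  # pretend this is slow

    def render(self):
        return self.pixels


class ImageProxy:
    """Same interface as Image, but loads only on first use."""

    def __init__(self, path):
        self.path = path
        self._image = None  # nothing loaded yet

    def render(self):
        if self._image is None:
            self._image = Image(self.path)
        return self._image.render()


proxy = ImageProxy("plot.png")  # cheap: no load yet
print(proxy.render())           # first use triggers the load
print(proxy.render())           # later uses reuse the loaded image
```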


More resources

“Gang of four” original book on design patterns
toptal.com python design patterns book


Basics of packaging Python programs (Kyle Niemeyer)


Module: any python file that contains definitions & statements.
Package: a collection of modules in the same directory.

Must contain the __init__.py file. (Except for namespace packages...)

Often this file is empty.
Python executes this file before anything else when imported.


Can contain subdirectories with “submodules” containing more Python files and another __init__.py file.
Tests subdirectory for test files (more on pytest later).


Lots of different ways to import modules.

Kyle’s preferred way: explicit relative imports

Uses dot notation (. for current path, .. for one level up)


DON’T REINVENT THE WHEEL

Rely on the standard library, numpy, scipy, etc.


main() & __main__ (Bryan Weber, writes for RealPython)

Can use a module both as a module AND a script.
main() is the entry point to the program.
Import guard example: realpython.com/python-main-function
__main__.py: special use case to execute your package as a script. e.g. pip.
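
A minimal sketch of the import guard, so one file works both as an importable module and as a script. The greet function and the default name are made up for illustration.

```python
import sys


def greet(name):
    """Reusable piece: importable from other modules."""
    return f"Hello, {name}!"


def main(argv=None):
    """Entry point: uses the command-line arguments unless argv is given."""
    args = sys.argv[1:] if argv is None else argv
    name = args[0] if args else "world"
    print(greet(name))
    return 0


# True only when the file is run as a script, not when it is imported.
if __name__ == "__main__":
    main()
```

Passing argv explicitly makes main() easy to call from a test without touching sys.argv.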


Package management

pip to install packages on PyPI or from source.

-e flag for development version.


The setup.py file (at same level as source directory) tells pip how to install your package.

See slides for example use.


See Kyle’s “better example” slide for cool use of os.path.abspath & os.path.join with a here variable (kinda like R’s here pkg)
Changelog: keepachangelog.com
Semantic versioning: semver.org (Python's own version scheme is defined in PEP 440)

MAJOR.MINOR.PATCH
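
A rough sketch of what the three numbers mean in practice. parse_version and bump are hypothetical helpers that only handle plain MAJOR.MINOR.PATCH strings (no pre-release or build tags).

```python
def parse_version(version):
    """Split 'MAJOR.MINOR.PATCH' into a tuple of ints."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def bump(version, part):
    """Return the next version string for the given kind of change."""
    major, minor, patch = parse_version(version)
    if part == "major":   # incompatible API change
        return f"{major + 1}.0.0"
    if part == "minor":   # backward-compatible new functionality
        return f"{major}.{minor + 1}.0"
    if part == "patch":   # backward-compatible bug fix
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown part: {part!r}")


print(bump("1.4.2", "minor"))                            # 1.5.0
print(parse_version("1.10.0") > parse_version("1.9.3"))  # True
print("1.10.0" > "1.9.3")                                # False: strings sort wrong
```

Comparing tuples of ints rather than raw strings is the point of the last two lines.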


Problem with setup.py: could have malicious code. PyPA has come up with pyproject.toml & flit to get around that. Also easier than using setup.py.

Also look into cookiecutter templates.


Think about this at the very beginning so you don’t have to re-organize everything later.

Day 2

Collaboration with Git & GitHub (Karthik Ram)


Workflows

Centralized workflow

Only works for really small projects
Everyone just commits to master 😬


Feature branching workflow

Always work in a feature branch.
Start a pull request before merging to master.
Delete branches after they’re merged.


Forking workflow

Only reason to fork is if you don’t have write access to someone else’s project / when you’re not a core contributor.
Create a PR when ready to merge.


Alias git to hub.  https://hub.github.com/

Extensions to interface with GitHub from the command line.
Create a GitHub repo from a local git repo: git create username/reponame
Open up the repo in your browser: git browse
Open a new PR: git pull-request
Compare 2 branches: git compare master..feature-branch
If you clone a repo but realize you wanted to fork it: git fork


On branches:

A branch is just a pointer to a commit. Commits are linked nodes.
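
A toy model of that statement in Python. This is not how git is implemented (real commits can have several parents, for example); it only illustrates "commits are linked nodes, branches are names pointing at them".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Commit:
    """A node that links back to its parent commit."""
    message: str
    parent: Optional["Commit"] = None


def log(commit):
    """Walk the parent links, newest commit first."""
    messages = []
    while commit is not None:
        messages.append(commit.message)
        commit = commit.parent
    return messages


c1 = Commit("initial commit")
c2 = Commit("add README", parent=c1)

branches = {"master": c2}

# Creating a branch copies a pointer, not the history.
branches["feature"] = branches["master"]

# Committing on the branch moves only that pointer forward.
branches["feature"] = Commit("add tests", parent=branches["feature"])

print(log(branches["feature"]))  # ['add tests', 'add README', 'initial commit']
print(log(branches["master"]))   # ['add README', 'initial commit']
```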


Use pull requests as much as possible.

Fosters code review.
Facilitates discussion.
Can use continuous integration to run tests automatically.
Someone else should merge your code into master so two sets of eyeballs review each feature.

Pick one or two people to do a technical review & an end-user review.


Instead of creating a merge commit, could use rebase to squash all the commits from that branch into one.
NEVER SEND A PULL REQUEST FROM MASTER.

Master branches will become incompatible.
GitHub now warns you if you attempt to do this.


Never send a large pull request without notice.

Read the contributing doc.
Common practice is to ask whether the maintainers want the feature before you work on it.
Pull requests should be small, digestible changes.

Make each unit of code simple enough for someone to review & accept.


Tips:

Always git pull before you start new work.
Keep branch names descriptive.
Generously use branches, but delete them when you’re done.
Use the hub extension to make your life easier.


Git Exercises (James Howison)


Material: https://jameshowison.github.io/peer_production_course/docs/additional_git_exercises.html
Pull requests are communication; make them digestible.
Note: any time you edit files, that’s a feature, so you should always do that in a branch.
Maintainer as developer AND champion of the community.

Create a welcoming & active environment.
How long ago the last commit was made is really important. Is the project active?
“Turn the music on — make it feel like a party!"
Even when you’re working with people face-to-face, you should document discussions on GitHub.


How to split pull requests.

Software testing & continuous integration (Kyle Niemeyer)


How do you know your code gives the right answers? …what about after you make changes to the code?
When: ALWAYS
Where: external test suite

e.g. inside tests/ subdir in package repo.
Some tests are better than no tests. But a rigorous test suite is best!


Why: make sure our results are trustworthy.

It’s really easy to make subtle mistakes.
Helps us know that a PR won’t break anything.
Unit tests are good examples of how a package works.


What and how

Tests compare expected vs observed outputs for known inputs.
You don’t have to have a function written in order to write a test.
Use assertions (e.g. assert exp == obs).
Use math.isclose() or np.allclose() to get around floating point precision.
Use pytest package.

-s to keep standard output.
-k to run tests matching a substring.
-q for less verbose output.
Run a specific test file & test function by node ID, e.g. pytest path/to/test_file.py::test_function
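
A sketch of an expected-vs-observed test that pytest would discover (it collects functions named test_*). The mean function is a hypothetical function under test.

```python
import math


def mean(values):
    """Hypothetical function under test."""
    return sum(values) / len(values)


def test_mean():
    exp = 0.2
    obs = mean([0.1, 0.2, 0.3])
    # Compare with a tolerance instead of == to allow for rounding error.
    assert math.isclose(obs, exp)


def test_why_isclose_is_needed():
    # Exact comparison of floats fails for this classic case.
    assert 0.1 + 0.2 != 0.3
    assert math.isclose(0.1 + 0.2, 0.3)


# pytest would call these for you; calling them directly also works.
test_mean()
test_why_isclose_is_needed()
```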


What cases to test

Interior: precise values don’t matter (just test one of these).
Edge: beginning or end of range of inputs (test all of these).
Corner cases: 2 or more edge cases that intersect.
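
The three kinds of cases, sketched for a hypothetical in_grid function on a 5 x 5 grid (function and grid size are assumptions for illustration):

```python
def in_grid(row, col, n_rows, n_cols):
    """Hypothetical function under test: is (row, col) inside the grid?"""
    return 0 <= row < n_rows and 0 <= col < n_cols


def test_interior():
    # Any point well inside the grid will do; one is enough.
    assert in_grid(2, 3, 5, 5)


def test_edges():
    # First and last valid index, and just outside, for each axis.
    assert in_grid(0, 2, 5, 5)
    assert in_grid(4, 2, 5, 5)
    assert not in_grid(-1, 2, 5, 5)
    assert not in_grid(5, 2, 5, 5)
    assert in_grid(2, 0, 5, 5)
    assert in_grid(2, 4, 5, 5)
    assert not in_grid(2, -1, 5, 5)
    assert not in_grid(2, 5, 5, 5)


def test_corners():
    # Two edges at once: both indices at a boundary.
    assert in_grid(0, 0, 5, 5)
    assert in_grid(4, 4, 5, 5)
    assert not in_grid(5, 5, 5, 5)
    assert not in_grid(-1, -1, 5, 5)


test_interior()
test_edges()
test_corners()
```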


Pytest test generators

Decorator to feed lots of inputs to one test function: @pytest.mark.parametrize


Types of tests

Unit test: test individual functions & methods.

Have to break up your code into small functions.


Integration test: verify that multiple pieces of the code work together.
Regression test: confirm that new results match prior results (which are assumed correct).


Test-driven development (TDD): write your tests before you implement the functions.
More tips

Test for consistency with PEP8.

e.g. flake8: linter & style-checker.
Plugins for your favorite IDE to run it continuously.


Test that exceptions are raised: pytest.raises(ExceptionClass)
Mocking

Replace parts of the system with precisely controllable code to specify return values & throw exceptions.
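
A sketch using unittest.mock from the standard library. fetch_status and the client object are hypothetical; the mock stands in for a real HTTP client so the test needs no network.

```python
from unittest.mock import Mock


def fetch_status(client, url):
    """Hypothetical function under test: report 'ok' or 'down'."""
    try:
        response = client.get(url)
    except ConnectionError:
        return "down"
    return "ok" if response.status_code == 200 else "down"


def test_ok():
    client = Mock()
    client.get.return_value = Mock(status_code=200)  # controlled return value
    assert fetch_status(client, "https://example.org") == "ok"
    client.get.assert_called_once_with("https://example.org")


def test_connection_error():
    client = Mock()
    client.get.side_effect = ConnectionError("no network")  # controlled exception
    assert fetch_status(client, "https://example.org") == "down"


test_ok()
test_connection_error()
```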


Test coverage

Percentage of code (in number of lines) that is touched by tests.
100% test coverage doesn’t guarantee that you catch all potential errors; it means every line of code is run by at least one test.
pytest-cov creates a coverage report.
codecov.io integrates with GitHub.


Continuous integration

Ensure all changes to your project pass tests through automated test & build process.
Services: GitHub Actions, Travis CI, CircleCI, AppVeyor, Jenkins (used by mothur)
Add the CI badge to your readme: it signals that your tool is being actively maintained.
See PyTeCK repo as an example of useful badges.


Tests in the wild: PyTeCK

Git Exercises ctd (James Howison)


https://learngitbranching.js.org/?NODEMO
Note: git cherry-pick keeps the original author information. 😄
git rebase re-writes history to move the branch point. Obviates merge commits, instead makes them fast-forwards.
git rebase -i (interactive mode) is a good idea. Allows you to squash commits and clean things up.

Day 3

Code Review (Jeffrey Carver)


Code review augments testing, but doesn’t replace testing.

Efficiency, readability, etc. can’t be tested for.


The purpose is to make the code better. Everyone makes mistakes. There’s no expectation that you’ll do it exactly right the first time.
By doing code review, you save time down the road.
Goals:

Team cohesion.

Gain shared understanding of the project.
Get to know teammates’ skills better.


Code quality.

Find problems early.
Get different perspectives.
Consistency & readability.
Makes code easier to maintain.


Personal learning.

Reading other people’s code & having your code reviewed.


Mindset:

Developer:

Recognize that a code critique is not a personal attack. You are not your code.
Be ready & willing to learn new things.
Expect that there will be changes. Remove the fear of making mistakes.
Be humble.


Reviewer:

Don’t assume that your way is the best.
Make positive comments, not only negative ones.
Understand why the developer asked you to review the code.
Focus on the code, not on the author.
Pick your battles.


Techniques

Prioritize things that humans can spot that automated testing can't.

Readability
Algorithms


How we communicate matters (applies in all types of feedback-giving)

Ask questions where possible.

e.g. “Have you considered…” -- Maybe they have and there's a good reason for it.


No personal attacks. It's about the code, not the person!
Be as specific as possible about how the code could be improved instead of making general statements.
Put yourself in others' shoes.

If you wouldn't want to get the comment, you probably shouldn't give it to someone else.


Explain why you're making the suggestion.


Checklist

Before you ask someone to review your code:

Write tests.
Make sure the code runs & passes the tests.
Write comments & other documentation.

Document any weird edge cases & work-arounds


Follow the style guide.


When you review someone else's code:

Comments are understandable & appropriate.
DRY up repetitive code.
Code runs & passes tests.
Exceptions are handled appropriately.


Best practices

Communicate goals of code review.
Do it early & often.
Review a small amount of code.

If it takes longer than 60 minutes to review, that's too much.


Establish a process for what to do after reviews.

Is it a hard gate that you have to make the reviewer happy, or are they just suggestions you could choose not to follow?


Issues you might identify in code review

Inconsistent style
Inefficiency
Unvalidated inputs
Lack of exception handling


Why is code review important for research software specifically?

Just like peer-reviewing publications, we want to make sure the code underlying the science is sound.
Science depends on the correctness of your code.
Help spread best-practices & high-level understanding in the scientific community.
Results may not always be known. There's not always "ground truth" (e.g. in simulations).


GitHub-specific tips: using Pull Requests for code review (examples: pr-omethe-us/PyKED) (Kyle Niemeyer)

Use pull request templates.

Could enforce check boxes like which issue(s) it resolves, that test cases were added, etc.


Easily view file diffs & add comments right alongside the code. Facilitates conversation.

You can leave comments at multiple lines.
Make suggestions for small, easy changes. There's an "insert suggestion" button! (Don't do this for design changes.)


Under settings > branches, you can protect branches

e.g. require that a PR has to be reviewed before merging into master.
More on code owners: https://help.github.com/en/github/creating-cloning-and-archiving-repositories/about-code-owners


Tool: octobox.io for managing GitHub notifications.


Open Science & Software Citation (Kyle Niemeyer)


TLDR: if you make your code public, pick a license and put a LICENSE file in your repo.
Copyright

Facts & ideas are not copyrightable.
Expressions of ideas are copyrightable.
Right of first publication: goes to the first creator even if not explicitly specified.
You should include a license with all publicly available software code so people know how they can (or can't) use it.

Or, you can explicitly put work into the public domain, then it's free for anyone & everyone to use & modify.


Software Licenses

Types:

Proprietary
Free/open source (FOSS, FLOSS, OSS)

Permissive: BSD 3-clause, MIT
Copyleft: GPL (the license is "viral")


Pick an existing license; don't make your own!
Resource: https://choosealicense.com
Open Source Initiative (OSI) Licenses

To call your work "open-source", you have to release it under one of the OSI-approved licenses.


Non-software: Creative Commons

Codes:

BY: Attribution (similar to permissive)
SA: ShareAlike (similar to copyleft)
ND: NoDerivatives
NC: NonCommercial


e.g. CC BY, CC BY-SA
CC0: like the public domain version of creative commons.


More concepts

Patents: cover ideas & concepts (which copyright doesn't).
Trademarks: symbols that represent a business or organization.
Export control: gov't may forbid transfer of code/data/ideas to another country or foreign national.
HIPAA: cannot share human patient data.


Archiving & Citing Software

Services: Zenodo, figshare, something within your University's library (UMich has one)

Archives your stuff forever and makes it citable with a DOI.
figshare: company, for-profit...
Zenodo: run by CERN. Will be around as long as the EU exists.

Free! Good file size limits
Connects with GitHub! When you turn on Zenodo for your repo, it creates a new DOI when you cut a new release.


Without proper citations, your work is not reproducible.
Academia relies on citations for credit.
Paper: Software Citation Principles

Software should be "first-class" citations just like other publications.
How? name, author(s), DOI or other persistent identifier.

A GitHub link is not a persistent identifier, but it's better than nothing.


If there's a paper describing it, cite both the paper & the code DOI.


How can we make our software easily citable?

Create a DOI (e.g. via Zenodo)
Include a CITATION file in your GitHub repo.


Tool in development: https://citeas.org (James Howison)

Web scraper to find the right citation given a package name or website.


Reproducibility


repro-packs (Kyle Niemeyer)

Lorena Barba: "reproducibility packages (repro-pack)" -- packages associated with papers shared under CC-BY.
Produce a single repro-pack for an entire paper

containing:

Code, results, input data (if small enough)
Figures (vector format)
Config file, etc


Upload to FigShare/Zenodo under CC-BY license.
Cite using the resulting DOI in the associated papers.


Benefits

Improve reproducibility & impact of your work.
Reviewers love it.
Lets you reuse your figures without violating a journal copyright.

When published, the journal (one that isn't open access) owns the paper & everything in it that isn't licensed from somewhere else.


Can include an appendix with statement about the availability of material. Or put it in the methods section.
Research compendium: make your paper like a package so it's easily-installable. Uses lightweight packaging structure.


rOpenSci (Karthik Ram)

rOpenSci: Scientific software for R. Helping researchers write sustainable software tools.
software-review: rOpenSci Software Peer Review of community-contributed packages
JOSS got started when rOpenSci realized the need extends beyond R packages.
dev-guide: https://devguide.ropensci.org/
PyOpenSci recently got started as the Python version of rOpenSci. (David Nicholson)


JOSS: Journal of Open Source Software (Kyle Niemeyer)

Open, no fees.
If you've already licensed your code & have good documentation, it should take under an hour to submit to JOSS.
Very short paper to describe the software.
All the conversation happens on GitHub.
Uses same structure as JOSE (Journal of Open Source Education).
Questions from the audience: when to submit as a package (e.g. to JOSS) versus in a repro-pack (to your society journal)?

If anyone else would ever use it, it should probably be a package.
If the code is only used for creating a paper, it should just be in the repro-pack.
If your goal is to write a methods paper, it probably wouldn't go to JOSS.
If you have the option to submit to a domain journal, do that first instead of JOSS. (Karthik's take)

JOSS is meant to fill in the gap for people who don't have a place to publish their software.


This is for getting research credit. But you still need to cite the specific version you used (e.g. from Zenodo) for reproducibility purposes.


Sidney Bell at Chan Zuckerberg Initiative

CZI started funding scientific software.

foundational packages (e.g. scikit-learn, matplotlib, pandas).
biology domain-specific packages.


First cycle of funding awarded. Second round closes in Feb.
Funding awarded to organizations (e.g. NumFOCUS, universities), not people.


Documentation (Kyle Niemeyer)


Value of documentation.

The value & extent of your work increase if it's understandable by your colleagues.
Provides provenance for your scientific process.
Demonstrates your skill & professionalism.
"A love-letter that you write to your future self."


It's easier than you think!
Types:

user & developer guides

README file accompanied by LICENSE, CITATION, CHANGELOG, etc.


code comments

docstring

for functions & classes
available within Python via help() & easy to parse by Sphinx.


in-line

bad: polluting the code with unnecessary information that's already evident from reading the code.
good: use sparingly to explain reasons behind choices & complicated sections


self-documenting code

name things intelligently, so that each name tells you why the thing exists, what it does, and how it's used.
write really simple functions that do only one thing.

"A function should have a function, not multiple functions."


follow consistent style.


generated API documentation


Tools

Sphinx: automatically generate documentation

Set it up with CI to automatically build your documentation website when you make changes.
Writing docstrings that are compatible with Sphinx:

Styles: NumPy, Google, reStructuredText...
Specify parameters, returns, & include a short description
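
A sketch of a NumPy-style docstring on a hypothetical function (the function itself is made up; the section layout follows the NumPy docstring convention):

```python
def kinetic_energy(mass, velocity):
    """Compute the kinetic energy of a moving body.

    Parameters
    ----------
    mass : float
        Mass in kilograms.
    velocity : float
        Speed in meters per second.

    Returns
    -------
    float
        Kinetic energy in joules.
    """
    return 0.5 * mass * velocity ** 2


# The docstring travels with the function; help(kinetic_energy) shows it too.
print(kinetic_energy.__doc__)
```

To have Sphinx render this style, the sphinx.ext.napoleon extension is the usual route; treat that as a pointer to check against the slides rather than something stated in the talk.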


Easy to get started quickly. See slides for more details.
# at top-level of repo, same level as package dir
mkdir docs/
cd docs/
sphinx-quickstart
make html


doctr: auto-deploy docs to GitHub pages using TravisCI.
Read the Docs to host your documentation.
Example: https://github.com/kyleniemeyer/ME373