@NathanSkene
Last active September 2, 2020 11:24
Welcome pack for new starters

Welcome to Imperial's Neurogenomics lab!

To-do:

Once you have Imperial login ID

Send me your username so I can give you access to the computing cluster and shared project spaces

Other important things to do

Send me a profile photo and a short bio paragraph to go on the lab website

Create a Github account if you don't already have one: send me your username so I can add you to the group's project space (https://github.com/neurogenomics). Try to do all your work within a Github repository.

Join the journal club's mailing list: https://groups.google.com/forum/#!forum/london-genomics-journal-club

Join the lab's mailing list: https://groups.google.com/forum/#!forum/neurogenomics-lab

Join the group's Slack channel. Be aware that the Slack workspace includes many non-Imperial people (e.g. collaborators from UCL and KU Leuven), so private channels should be used for most things. Create a new channel if you don't think existing ones are relevant to your project. Note that we are in the process of migrating to the UKDRI's centralised Slack, so you should also join that once you have a DRI login.

Training

If you are new to computing/programming, register as soon as possible for:

If you are experienced with programming, learn about containerisation with docker/singularity and writing workflows with nextflow/WDL/snakemake. If you have never made an R package, try making a simple test one and then installing it from github. If you've never used packrat/renv, try using them.

If you are new to working with single cell, then try out the Broad Institute single cell tutorial or bioconductor's tutorial.

If you're new to using Git/Github then have a look through these course materials.

You should go through tutorials relating to the lab's core methods:

If you are new to statistics (or stats in R) then there are two fundamental methods that you want to get used to using:

  • Generalised linear models: go through this tutorial and google if it's not clear
  • Bootstrapping. Look into how EWCE works.

Suggested

I recommend using Evernote to keep track of your work / notes. Try it if you haven't before. Keep notes on all computer errors etc in here.

Lab organisational profiles online

We have a github repo and a docker hub repository. Please use these repositories for lab related work.

Creating your profile on the lab website

Please create a profile for yourself on the lab website. The website is managed through github. An individual's profile is created by pushing a config page into that person's folder, e.g. Roxy's folder. No matter how long you'll be in the lab, it would be great if you could create a profile. The easiest way is to send me a profile photo with a short bit of descriptive text, details of your degrees and links to your github/twitter/linkedin pages.

Learning to use the computing cluster:

To get access to the computing cluster, send me an email with your username and I'll add you.

The Imperial CX1 cluster uses PBS as a job manager. PBS has many versions, so you cannot necessarily assume that a command you find in an online manual will work exactly as described there. The best way to find out what works on the cluster is to read the manual pages (e.g. type man qstat) while logged in to an interactive session.

There is a weekly HPC clinic. I strongly recommend making use of this. You can turn up and experts will help you. Even if you are just unsure about something, go and speak to them. The clinics are held in South Kensington, but it's well worth going.

Imperial regularly runs a beginner's guide to high performance computing course. If you have not previously used HPC you'll want to register for this as soon as possible. This can be done through their website:

Imperial also runs a course on software carpentry. If you are unfamiliar with usage of Git and Linux then you should take this course.

Combiz wrote useful notes on using the HPC (how to login etc):

A version of RStudio is installed on the computing cluster and can be accessed through your browser. This gives you access to a 24 core machine and is probably better than programming on your laptop.

Create interactive jobs on the cluster

To create an interactive session on the cluster (to avoid overloading the login nodes) use the following command

qsub -I -l select=01:ncpus=8:mem=96gb -l walltime=08:00:00

That command requests the most resources that can be obtained for interactive jobs; decrease these if you can. On the main queue it will take a long time for the job to start. If I've given you access to the med-bio queue (ask), then you will be better off using the following:

qsub -I -l select=01:ncpus=2:mem=8gb -l walltime=01:00:00 -q med-bio
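For work that does not need an interactive session, the same resource requests can go into a batch script. The following is a minimal sketch, not the lab's standard setup: the function name write_pbs_job, its defaults and the placeholder payload are hypothetical, while the #PBS -l lines reuse the select/walltime syntax shown above and PBS_O_WORKDIR is the PBS variable holding the directory you submitted from.

```shell
#!/usr/bin/env bash
# Sketch: write a minimal PBS batch script that mirrors the resource
# requests used for interactive jobs. The function name, defaults and
# the placeholder payload are hypothetical.
write_pbs_job() {
    local out_file="$1"              # path of the job script to create
    local ncpus="${2:-2}"            # number of CPUs
    local mem="${3:-8gb}"            # memory request
    local walltime="${4:-01:00:00}"  # maximum run time (hh:mm:ss)

    # \$ keeps the variable literal so it is expanded when the job runs
    cat > "$out_file" <<EOF
#!/bin/bash
#PBS -l select=1:ncpus=${ncpus}:mem=${mem}
#PBS -l walltime=${walltime}
cd "\$PBS_O_WORKDIR"
echo "Running on \$(hostname)"
EOF
}

# Usage (on the cluster):
#   write_pbs_job my_job.pbs 2 8gb 01:00:00
#   qsub my_job.pbs              # add -q med-bio if you have access
```

Replace the echo line with the commands you actually want to run.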

Setup ssh for the cluster

To set up ssh on your computer for accessing the cluster, add the following to ~/.ssh/config (replace nskene with your own Imperial username):

Host *
 AddKeysToAgent yes
 IdentityFile ~/.ssh/id_rsa
Host imperial
   User nskene
   AddKeysToAgent yes
   HostName login.cx1.hpc.imperial.ac.uk
   ForwardX11Trusted yes
   ForwardX11 yes
   HostKeyAlgorithms=+ssh-dss
Host imperial-7
   User nskene
   AddKeysToAgent yes
   HostName login-7.cx1.hpc.imperial.ac.uk
   ForwardX11Trusted yes
   ForwardX11 yes
   HostKeyAlgorithms=+ssh-dss

If you use imperial-7 to login then you'll always connect to the same login node which makes using screen/tmux easier.
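If you would rather not edit the file by hand, the Host block above can be appended with your own username filled in. This is only a sketch: the function name add_imperial_host is hypothetical, and it writes just the imperial entry with the same options as the config shown above.

```shell
#!/usr/bin/env bash
# Sketch: append the "imperial" Host block to an ssh config file with
# your own username substituted. The function name is hypothetical.
add_imperial_host() {
    local username="$1"                          # your Imperial username
    local config_file="${2:-$HOME/.ssh/config}"  # ssh config to modify

    mkdir -p "$(dirname "$config_file")"

    # Do nothing if the block is already there, so the call is repeatable
    if [ -f "$config_file" ] && grep -q '^Host imperial$' "$config_file"; then
        echo "Host imperial already configured in $config_file"
        return 0
    fi

    cat >> "$config_file" <<EOF
Host imperial
   User ${username}
   AddKeysToAgent yes
   HostName login.cx1.hpc.imperial.ac.uk
   ForwardX11Trusted yes
   ForwardX11 yes
   HostKeyAlgorithms=+ssh-dss
EOF
    # Keep the config private to your user
    chmod 600 "$config_file"
}

# Usage:
#   add_imperial_host your_username
#   ssh imperial
```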

Joint workspaces on the Imperial cluster

We have two shared project spaces on the cluster. If you are involved in the DRI Multiomics Atlas project then use projects/ukdrimultiomicsprojects/. Otherwise, please use projects/neurogenomics-lab

You will not be able to write into the main directory of either of these. They have two folders: live and ephemeral. Read about the differences between these here:

The medbio cluster (for faster job submissions)

The MedBio cluster has additional computational resources and is accessed via a separate queue. We have access to it, but I do not have admin rights to grant access to individuals. To get access, email p.blakeley@imperial.ac.uk. Read about it here: https://www.imperial.ac.uk/bioinformatics-data-science-group/resources/uk-med-bio/

To run on the med bio cluster, just put this at the end of your submit commands: -q med-bio

Express (charged) access to the cluster

It can take a long time to get jobs submitted on the cluster. We can pay to get jobs submitted faster. Details are here: https://www.imperial.ac.uk/admin-services/ict/self-service/research-support/rcs/computing/express-access/. I would rather pay for faster results if this is slowing you down. Let me know if this would be useful and I'll add you to the list. Express jobs are submitted with qsub -q express -P exp-XXXXX, substituting your express account code.

Running docker containers on the HPC

You'll need to use Singularity to run Docker containers on the HPC. To run a rocker R container in interactive mode, run the following ($USER expands to your username automatically):

mkdir -p /rds/general/user/$USER/ephemeral/tmp/ /rds/general/user/$USER/ephemeral/rtmp/
singularity exec -B /rds/general/user/$USER/ephemeral/tmp/:/tmp,/rds/general/user/$USER/ephemeral/tmp/:/var/tmp,/rds/general/user/$USER/ephemeral/rtmp/:/usr/local/lib/R/site-library/ --writable-tmpfs docker://rocker/tidyverse:latest R

To create a Singularity image, first archive the Docker image into a tar file. Obtain the IMAGE_ID with docker images, then archive it with (substituting your IMAGE_ID):

docker save 409ad1cbd54c -o singlecell.tar

On a system running singularity-container (>v3) (e.g. on the HPC cluster), generate the Singularity Image File (SIF) from the local tar file with:

/usr/bin/singularity build singlecell.sif docker-archive://singlecell.tar

This singlecell.sif Singularity Image File is now ready to use.
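As a rough sketch of how the local image might then be used, the helper below creates the scratch directories that get bound into the container and prints a singularity exec command with the same bind mounts as the rocker example above. The function name singularity_r_cmd and its base_dir argument are hypothetical, and whether R is on the path inside your image depends on how the image was built.

```shell
#!/usr/bin/env bash
# Sketch: prepare the bind-mount directories and print the command for
# running R inside a local .sif file. Function name and the base_dir
# argument are hypothetical; the bind targets match the rocker example.
singularity_r_cmd() {
    local sif="$1"                                           # e.g. singlecell.sif
    local base_dir="${2:-/rds/general/user/$USER/ephemeral}" # scratch space

    # Bind sources must exist before singularity starts
    mkdir -p "$base_dir/tmp" "$base_dir/rtmp"

    local binds="$base_dir/tmp:/tmp,$base_dir/tmp:/var/tmp"
    binds="$binds,$base_dir/rtmp:/usr/local/lib/R/site-library"

    # Print rather than run, so the command can be checked first
    echo "singularity exec -B $binds --writable-tmpfs $sif R"
}

# Usage (on the cluster):
#   singularity_r_cmd singlecell.sif        # inspect the command
#   eval "$(singularity_r_cmd singlecell.sif)"   # then run it
```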

Learning NextFlow

A complete 10 hour workshop on learning NextFlow has been digitised by Seqera Labs: the videos and the online resources.

The lab's full set of tutorials for NextFlow is available here, but these remain a work in progress. We had a workshop in Feb 2020 and kept the discussion and logs in the Slack channel #nextflow-workshop. Take a look there for example scripts and relevant links.

This example script shows how to launch an Rscript from Nextflow in parallel on the cluster:

#!/usr/bin/env nextflow
// Uses DSL1 syntax ('from' / 'into' channel declarations)
params.datasets = ['iris', 'mtcars']

process writeDataset {
    // Process directives are written without '='
    executor 'pbspro'
    clusterOptions '-lselect=1:ncpus=1:mem=1Gb -l walltime=24:00:00 -V'
    // A process has a single script block, so load R before it runs
    beforeScript 'module load R'
    tag "${dataset}"
    publishDir "$baseDir/data/", mode: 'copy', overwrite: false, pattern: "*.tsv"

    input:
    each dataset from params.datasets

    output:
    file '*.tsv' into datasets_ch

    script:
    """
    #!/usr/bin/env Rscript
    data("${dataset}")
    write.table(${dataset}, file = "${dataset}.tsv", sep = "\t", col.names = TRUE, row.names = FALSE)
    """
}

Some tutorial information for Nextflow from a workshop at the Sanger is available here.

An active NextFlow chatroom where you can ask questions is on gitter.

NextFlow on Google Cloud Life Sciences Platform (GCP)

Follow the guide here: https://cloud.google.com/life-sciences/docs/tutorials/nextflow

Learning WDL / TERRA

Here are screen casts from Lynn Langit explaining how to use WDL: https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM. This should be your starting point. Here's Lynn's github repo with all the example code: https://github.com/openwdl/learn-wdl. The screencasts show how to take code from this repo and put them into TERRA to run them.

I've prepared a basic introduction to use of TERRA here but it doesn't really cover WDL yet. The best tutorial resources for learning WDL are here: https://support.terra.bio/hc/en-us/sections/360007274612.

There is a tutorial on how to use notebooks within TERRA.

Requesting tissue from the brain bank

Fill out the form here: https://www.imperial.ac.uk/media/imperial-college/medicine/multiple-sclerosis-and-parkinsons-brain-bank/PD-request-form-v3.pdf

Have a look at the database here to see what tissue they have: https://brainbanknetwork.cse.bris.ac.uk/. It may not have everything so you're probably best emailing Steve.

Commercialisation and startups

I strongly encourage you to consider how your research can be commercialised. There is more money available in the private sector than in the public if you really want to scale your ideas. There is plenty of money available in seed funding. This is easier to do than you may expect: you do not need patents. I'm always open to discussing how you can do this. A few things to look into:

  • Y-combinator (the world's most prestigious startup accelerator):
  • Indie bio is one of the leading biotech startup accelerators
  • Age1 a fund which specialises in biotech companies relating to old age

This guide to the London startup ecosystem explains local sources of funding etc: https://startupsoflondon.com/london-startup-ecosystem-ultimate-report-2020/

The Turing Institute's enrichment schemes & hackathons

The Turing Institute is the United Kingdom's national institute for data science and artificial intelligence. They have very good people there. They run a 12 month enrichment scheme, so if you're doing your PhD and it relates to machine learning, I strongly recommend applying to do this.

They also run regular data study groups, where you can apply to work with them on interesting datasets. These are great opportunities to test your skills in new ways. Your project will survive if you take some time off to do this; I recommend doing this.

Making Bioinformatics reproducible

Firstly, you should be aware that much of current science is not reproducible. Small sample sizes can explain a considerable part of bad science within neuroscience. Sloppy code and failure to ensure reproducibility are the equivalents for bioinformatics.

The Turing Institute has written an extensive article about making computational research reproducible, called The Turing Way. We should endeavour to follow it.

It is important that all lab code is stored on github, prepared as R packages as early as possible, and ideally written as workflows and shared via TERRA. Read about the benefits of workflow systems here.

Nature wrote a good article about best practices for working with large datasets. If we're not yet using anything from the article (such as Harvard Dataverse, Zenodo (https://zenodo.org/) or NextJournal), it would be great if you could try it out and let me know how you get on.

Read some selfish reasons for being reproducible here.

R

The lab's preferred programming language is R. Please try to use R unless you really must use another language. If you need to use a python function, consider using reticulate to call it from within R.

Learning to programme in R:

If you have never done any programming then I recommend starting with the Khan academy courses. Make sure that you are familiar with the following concepts: if-statements, for-loops, variables, arrays and functions. I have not done any of these courses myself so please let me know how you get on with them. If you find any better resources then please communicate them to me.

Imperial provides tutorials for command line, version control with git and python. While I don't use python much and would prefer that you learn R, the basic principles of programming are the same across languages.

Have a look to see if any workshops are being run by the Software Carpentries within the UK soon. They were given funding by CZI to help train people for bioinformatics so I assume the courses are good.

Here’s a tutorial written by a (quite famous) R developer called Hadley Wickham. It’s a good intro to data visualisation using R. The ggplot2 package is probably the best data vis tool in any programming language. The tutorial does teach a style of code (‘tidyverse’) that I don’t use much, but it is popular:

For learning R, many people recommend the tutorials by SwirlStats.

This cheatsheet explains many basic functions using the two main styles of R.

Consider installing Sublime Text. A good text editor is always useful.  

Memory issues in R

The default settings for R are not good for handling large datasets. Amend these as follows:

In bash:

cd ~
touch .Renviron
open .Renviron

Then add:

R_MAX_VSIZE=700Gb

Save

Restart R
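The same change can be scripted, which is handy on Linux machines where the open command is not available. This is a sketch under assumptions: the function name set_r_max_vsize is hypothetical, and the 700Gb default is simply the value given above.

```shell
#!/usr/bin/env bash
# Sketch: set R_MAX_VSIZE in ~/.Renviron without opening an editor.
# Safe to run repeatedly: an existing entry is replaced, not duplicated.
# The function name is hypothetical.
set_r_max_vsize() {
    local value="${1:-700Gb}"               # limit to write
    local renviron="${2:-$HOME/.Renviron}"  # file R reads at startup

    touch "$renviron"
    if grep -q '^R_MAX_VSIZE=' "$renviron"; then
        # Rewrite via a temp file (avoids non-portable 'sed -i')
        local tmp
        tmp="$(mktemp)"
        sed "s/^R_MAX_VSIZE=.*/R_MAX_VSIZE=${value}/" "$renviron" > "$tmp"
        cat "$tmp" > "$renviron"
        rm -f "$tmp"
    else
        echo "R_MAX_VSIZE=${value}" >> "$renviron"
    fi
}

# Usage:
#   set_r_max_vsize 700Gb
# then restart R.
```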

Learning core lab approaches:

Run through the tutorials for EWCE and MAGMA Celltyping:

Applying for postdoctoral fellowships

Imperial runs the Postdoc and Fellows Development Centre, which gives advice on how to apply for all the main fellowships. If you are at postdoc level, make sure you've signed up to their emails.

Open Science and reproducibility

A fundamental philosophy of the lab is that open science is reproducible science. You should familiarise yourself with:

Recommended background reading:

Recommended books and the lab library

I've bought copies of the following books so anyone in the lab can borrow them. They are all easy reading and intended to give you a general background in current understanding of molecular biology, evolution, and human genetics that forms a good background for the work the lab is doing.

  • "Arrival of the Fittest: Solving Evolution's Greatest Puzzle" by Andreas Wagner
    • This book details practical issues with how metabolic pathways could have evolved, i.e. how many possible pathways could result in production of a single metabolite? This is useful to understand to grasp how complex traits work.
  • "The Beak Of The Finch: Story of Evolution in Our Time" by Jonathan Weiner
    • Explains some of the most important practical studies on evolution, involving careful monitoring of finch colonies on islands in the Galapagos. Helps understand how evolution actually works day to day, year by year.
  • "Who We Are and How We Got Here: Ancient DNA and the new science of the human past" by David Reich
    • Explains the history of our species from a genetic perspective
  • "A Life Decoded: My Genome: My Life" by Craig Venter and "Avoid Boring People" by James Watson
    • Sequencing of the human genome was one of the greatest scientific achievements of man. These two books explain very different perspectives on how it was done: Venter tried to do it using private funding, Watson fought to make academia rise to the challenge.
  • “At the water’s edge” by Carl Zimmer
    • One of the most remarkable transitions in biology is the evolution of whales from land-dwelling mammals. This book explains what we know about how this happened.
  • "Born Together-Reared Apart" by Nancy Segal
    • This book explains the history of one of the most important twin studies. Gets a bit dry as the book goes on but the early parts of the book give a valuable introduction.

These papers are worth reading for an understanding of the state of genetics today:

  • Visscher, Peter M., and Michael E. Goddard. "From RA Fisher’s 1918 Paper to GWAS a Century Later." Genetics 211.4 (2019): 1125-1130.

  • Ashbrook, David G., et al. "The expanded BXD family of mice: A cohort for experimental systems genetics and precision medicine." bioRxiv (2019): 672097.

    • A good understanding of complex trait genetics in mice is important for really understanding the field of genetics
  • Boyle, Evan A., Yang I. Li, and Jonathan K. Pritchard. "An expanded view of complex traits: from polygenic to omnigenic."

  • "Common disease is more complex than implied by the core gene omnigenic model." Wray, Naomi R., et al.

  • van Rheenen, Wouter, et al. "Genetic correlations of polygenic disease traits: from theory to practice"

  • Watanabe, Kyoko, et al. "A global overview of pleiotropy and genetic architecture in complex traits."

  • Sullivan, Patrick F., and Daniel H. Geschwind. "Defining the genetic, genomic, cellular, and diagnostic architectures of psychiatric disorders."

  • Reshef, Y. A. et al. Detecting genome-wide directional effects of transcription factor binding on polygenic disease risk. Nat. Genet. 50, 1483–1493 (2018).  

  • Finucane, H. K. et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat. Genet. 47, 1228 (2015).  

  • Jansen, I. E. et al. Genome-wide meta-analysis identifies new loci and functional pathways influencing Alzheimer’s disease risk. Nat. Genet. (2019).  

  • Soskic, B. et al. Chromatin activity at GWAS loci identifies T cell states driving complex immune diseases. bioRxiv 566810 (2019). doi:10.1101/566810  

  • Colantuoni, C. et al. Temporal dynamics and genetic control of transcription in the human prefrontal cortex. Nature 478, 519–523 (2011).

These papers are worth reading to understand single cell methods in neuroscience:

  • Zeisel, Amit, et al. "Molecular architecture of the mouse nervous system."

  • Harris, Kenneth D., et al. "Classes and continua of hippocampal CA1 inhibitory neurons revealed by single-cell transcriptomics." PLoS biology 16.6 (2018): e2006387.

  • Hodge, Rebecca D., et al. "Conserved cell types with divergent features in human versus mouse cortex." Nature 573.7772 (2019): 61-68.

  • Codeluppi, Simone, et al. "Spatial organization of the somatosensory cortex revealed by osmFISH." Nature methods 15.11 (2018): 932.

  • Goldmann, Tobias, et al. "Origin, fate and dynamics of macrophages at central nervous system interfaces." Nature immunology 17.7 (2016): 797.

Papers worth reading about broader neuroscience:

  • Komiyama, Noboru H., et al. "Synaptic combinatorial molecular mechanisms generate repertoires of innate and learned behavior." BioRxiv (2018): 500389.

  • Kopanitsa, Maksym V., et al. "A combinatorial postsynaptic molecular mechanism converts patterns of nerve impulses into the behavioral repertoire." BioRxiv (2018): 500447.

  • Grant, Seth GN. "Synapse diversity and synaptome architecture in human genetic disorders." Human molecular genetics (2019).

  • Luo, Liqun, Edward M. Callaway, and Karel Svoboda. "Genetic dissection of neural circuits: a decade of progress." Neuron 98.2 (2018): 256-281.

Papers worth reading about broader neurogenomics:

  • Raj, Bushra, and Benjamin J. Blencowe. "Alternative splicing in the mammalian nervous system: recent insights into mechanisms and functional roles." Neuron 87.1 (2015): 14-27.

  • Vuong, Celine K., Douglas L. Black, and Sika Zheng. "The neurogenetics of alternative splicing." Nature Reviews Neuroscience 17.5 (2016): 265.

  • Kosik, Kenneth S. "Life at low copy number: how dendrites manage with so few mRNAs." Neuron 92.6 (2016): 1168-1180.

  • Holt, Christine E., and Erin M. Schuman. "The central dogma decentralized: new perspectives on RNA function and local translation in neurons." Neuron 80.3 (2013): 648-657.

Worth reading about the genetics of neurodegenerative disease:

  • Singleton, Andrew, and John Hardy. "The evolution of genetics: Alzheimer’s and Parkinson’s diseases." Neuron 90.6 (2016): 1154-1163.

Worth reading about statistics

Empirical Bayes methods are commonly used in genomics.

Interesting project going on elsewhere in neuroscience:

  • Abbott, Larry F., et al. "An international laboratory for systems and computational neuroscience." Neuron 96.6 (2017): 1213-1218.

You might find these lecture notes useful:

Documentaries

Assorted useful info

Ordering reagents, computers etc

The location to deliver to within ICIS is 'IC WCWL MOLECULAR SCI RES HUB'. Address is Molecular Sciences Research Hub, 80 Wood Lane, Imperial College, W12 0BZ.

Parallel processing in R

Don't worry about this if you are new to programming.

If you'll be using R on the cluster, you might want to get used to using the clustermq R package. This lets you submit jobs to the Imperial cluster from within R, much the same as with a normal loop. To enable it, create a .PBStemplate file in your home directory on the server containing the code below.

#PBS -N {{ job_name }}
#PBS -l select=1:ncpus={{ cores | 1 }}:mem=1gb
#PBS -l walltime={{ walltime | 0:05:00 }}

source activate monocle
ulimit -v $(( 1024 * {{ memory | 4096 }} ))
CMQ_AUTH={{ auth }} R --no-save --no-restore -e 'clustermq:::worker("{{ master }}")'

#NOTPBS -o {{ log_file | /rds/general/user/nskene/home/logs/ }}
#NOTPBS -j oe

How to get files from the sequencing facility

When a sample is given to the sequencing facility, they need to be provided with an email address. Please create a mailing list using Office 365, then add everyone relevant to your project (including me) and give that email address to the facility. That way everyone relevant gets the username and password for accessing the dataset.

Karen Davey wrote a guide to getting data off the iRODS system and onto RDS project folders: https://gist.github.com/NathanSkene/3889048dd42c3d054a1f9db2a6b2765f

The sequencing facility have a wiki page which then explains how to access the data and transfer it to the RDS project folder. Please always transfer the data to the RDS project folder (speak to me if you do not know what this is).

Booking meeting rooms

E519 Burlington Danes

It has a shared calendar available through outlook (rmneuro5flo@imperial.ac.uk) for checking availability. In Outlook, go to calendar mode --> Open shared calendar --> Search for person (rmneuro5flo@imperial.ac.uk). You can just email a meeting invite, but you're supposed to do so via Colin Rantle.

Activating network ports

You've sat at your desk and tried connecting to the wired ethernet network... and it doesn't work. Take a photo of the port, write down all the numbers on it, and create a ticket on the IT system asking for it to be activated.

Using the printers

All printers at Imperial are connected to the same print management system. You send something to the printer, then go to whatever printer you want, tap it with your card and pull your jobs. You should install the drivers for this system onto your computer. If you find something prints strangely, it may be because you are doing it from your web browser: download the file and then print.

Requesting a permanent IP address

Always useful to have if you have an Imperial desktop. Just send a ticket to IT through the ticket system and request one. Send them a photo of your computer's badge to speed it up.

Cloud computing

Google Cloud Platform (and $5000 free credits)

Research credits

Faculty researchers (should be the PI or lead researcher) can apply today for free credits for GCP to access the power and flexibility needed to advance their research and scale with ease. Awards are worth $5,000 (USD) in GCP credits and only one person per research proposal may apply. These expire 12 months after they are redeemed. To ensure that all of our programs are sustainable, they are not intended to 100% fund research, but are used to allow the researchers to get started and run a large initial amount of work loads on Google Cloud Platform. Here is the GCP research credits program Application Form.

Why use GCP?

They have huge machines (up to 160 vCPUs and nearly 4TB of memory) and many GPUs. Also, TERRA.bio promises to bring in a new generation of bioinformatics with less irritating hassle munging datasets.

Nextflow on the cloud

This guide explains how to setup NextFlow on Google Cloud: https://cloud.google.com/life-sciences/docs/tutorials/nextflow

Lynn's explanation of setting up NextFlow on Google Cloud: https://medium.com/@lynnlangit/cloud-native-hello-world-for-bioinformatics-7831aecc8d1a

Useful resources

List of major datasets relevant to enrichment analyses: https://amp.pharm.mssm.edu/Enrichr/#stats

Storing scientific (large) datasets online with permanent addresses (and DOIs): https://zenodo.org/
