Philipp Bayer philippbayer

R version 4.3.2 (2023-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
locale:
[1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C LC_TIME=en_AU.UTF-8 LC_COLLATE=en_AU.UTF-8 LC_MONETARY=en_AU.UTF-8
@philippbayer
philippbayer / changes.md
Last active November 9, 2023 15:20
createRepeatLandscape.pl changes for EDTA/TESorter classes

I changed the following in RepeatMasker/util/createRepeatLandscape.pl to make classes reported by EDTA and TESorter appear in the plot.

Around line 220 I added CACTA repeats as their own class:

              [ 'DNA/Transib',    '#FF9972' ],
              [ 'DNA/CACTA',      '#D45B2C' ],

I got the color by googling #FF9972 and then clicking around in the color picker Google shows for it until I found a similar-looking color.
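If you'd rather derive a related shade in code than click around, here is a minimal Python sketch. It assumes a darker variant of the neighbouring entry's color is good enough; the `darken` helper and the 0.8 factor are hypothetical and won't reproduce #D45B2C exactly.

```python
import colorsys


def darken(hex_color, factor=0.8):
    """Return hex_color with its HLS lightness scaled by factor (hypothetical helper)."""
    hex_color = hex_color.lstrip('#')
    # Split 'RRGGBB' into three floats in the range 0..1.
    r, g, b = (int(hex_color[i:i + 2], 16) / 255 for i in (0, 2, 4))
    h, l, s = colorsys.rgb_to_hls(r, g, b)
    # Scale the lightness and clamp it so the result stays a valid color.
    new_l = max(0.0, min(1.0, l * factor))
    r, g, b = colorsys.hls_to_rgb(h, new_l, s)
    return '#{:02X}{:02X}{:02X}'.format(*(round(c * 255) for c in (r, g, b)))


# Derive a darker shade from the DNA/Transib color used above.
print(darken('#FF9972'))
```

The output can be pasted into a new `[ 'DNA/CACTA', '...' ]` entry; tweak the factor until the shade is distinguishable in the plot.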

Then, around line 700, I added all these translations:

philippbayer / nextflow.config
Last active November 6, 2023 01:41
My current Pawsey nextflow.config
// Save this as nextflow.config in the folder of your run on Pawsey's Setonix
// I settled on this command for nf-core/mag:
// nextflow run nf-core/mag --input '*_R{1,2}.fastq.gz' --outdir results
// --skip_spades --cat_db https://tbb.bio.uu.nl/bastiaan/CAT_prepare/CAT_prepare_20210107.tar.gz
// --gtdb 'https://data.gtdb.ecogenomic.org/releases/release202/202.0/auxillary_files/gtdbtk_r202_data.tar.gz'
// -resume -profile singularity
// --refine_bins_dastool --postbinning_input both
// --busco_download_path /SOMEWHERE/busco-data.ezlab.org/v5/data
// -disable-jobs-cancellation
philippbayer / torch.md
Last active October 16, 2023 08:01
installing torch/transformers under ROCm on Pawsey

Here's my alias in .bashrc for getting a gpu-dev instance based on https://support.pawsey.org.au/documentation/display/US/Setonix+GPU+Partition+Quick+Start

alias getgpunode='salloc -p gpu-dev --nodes=1 --gpus-per-node=1 --account=${PAWSEY_PROJECT}-gpu'

First, to make a fresh environment:

mamba create -p `pwd`/transformers transformers python=3.10

Install the Torch build for the closest ROCm version: there is no build for 5.4.3 (the current 'new' version on Pawsey) and none for 5.2.3 (the default version). I also set pip's cache directory to somewhere on /scratch.
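Once Torch is installed, it's worth checking on the GPU node that the ROCm build actually sees the device. A minimal sketch, assuming only that ROCm builds of Torch expose the GPU through the usual `torch.cuda` API; the `describe_torch` helper is hypothetical:

```python
def describe_torch():
    """Report what the installed Torch build can see (hypothetical helper)."""
    try:
        import torch
    except ImportError:
        # Torch is not installed in this environment.
        return {'installed': False}
    return {
        'installed': True,
        'torch_version': torch.__version__,
        # torch.version.hip is a version string on ROCm builds, None otherwise.
        'hip_version': getattr(torch.version, 'hip', None),
        # ROCm builds report the GPU via the torch.cuda namespace.
        'gpu_available': torch.cuda.is_available(),
    }


print(describe_torch())
```

If `hip_version` is None, pip most likely pulled a CPU or CUDA wheel instead of a ROCm one.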

import os
import sys
import argparse
from statistics import mean
'''
INPUT: tab-delimited blastn output. Assuming that taxonomy ID is in this format:
-outfmt "6 qseqid sseqid staxids sscinames scomnames sskingdoms pident length qlen slen mismatch gapopen gaps qstart qend sstart send stitle evalue bitscore qcovs qcovhsp"
This script also assumes that input has been filtered by 90% identity.
$ awk -F'\t' '{if ($7 > 90) print}' all_results.tsv > all_results.90perc.tsv
#!/bin/bash -l
# SLURM directives
#
# This is an array job with four subtasks
#SBATCH --job-name=align
#SBATCH --time=12:00:00
#SBATCH --cpus-per-task=1
#SBATCH --partition=work

First, a tab-delimited file with genome sizes

Gm01	58711475	26	60	61
Gm03	52519505	59690052	60	61

etc.

Then, to plot the thing:

philippbayer / EDTA.md
Last active March 3, 2021 02:34
Running EDTA on Pawsey with Singularity

First, to download EDTA:

module load singularity
singularity pull EDTA.sif docker://quay.io/biocontainers/edta:1.9.4--0

That'll make a new file called EDTA.sif containing everything in the EDTA v1.9.4 container.

Then we have a problem: Pawsey allows only 1 million files per user and running several EDTA runs for several genomes at once will hit that limit.
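To see how close a run directory gets to that limit, you can count the files under it. A minimal Python sketch; the `count_files` helper is hypothetical, the 1 million figure is taken from the text above, and this is not an official Pawsey quota tool:

```python
import os
import sys

FILE_LIMIT = 1_000_000  # per-user file limit mentioned above


def count_files(root):
    """Count non-directory entries below root (directory symlinks are not followed)."""
    total = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        total += len(filenames)
    return total


if __name__ == '__main__':
    # Use the directory given on the command line, or the current one.
    root = sys.argv[1] if len(sys.argv) > 1 else '.'
    n = count_files(root)
    print(f'{root}: {n} files ({n / FILE_LIMIT:.1%} of the limit)')
```

Running it on one finished EDTA output folder gives a rough idea of how many genomes can be processed at once before cleaning up.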

philippbayer / similarity.py
Created January 3, 2014 05:35
My solution for "String Similarity" for HackerRank
def get_similarity(a, suffix):
    # Length of the common prefix shared by a and suffix.
    score = 0
    # Built-in zip replaces Python 2's itertools.izip on Python 3.
    for x, y in zip(a, suffix):
        if x != y:
            break
        score += 1
    return score
def stringSimilarity(a):
philippbayer / covid_vs_cash.Rmd
Created July 2, 2020 14:26
Plotting whether there's a correlation between I've Been Everywhere and Covid-19 cases
```{r setup}
library(tidyverse)
library(ggrepel)
```
```{r}
df <- readxl::read_xlsx('./Covid_vs_State.xlsx')
head(df)
```