Matthew Schechter mschecht

@mschecht
mschecht / Current programming workflow - Python
Last active June 5, 2024 10:33
Use Jupyter Lab on a remote server
# Python programming setup
- Desc: This is my current workflow for setting up my Python programming interface
with Jupyter Lab on a remote server. I set up Jupyter Lab on the remote server so
that it runs continuously. Once that is done, only steps 5-6 are needed to connect
my local computer to the running Jupyter Lab.
- Benefits:
- Now I get to develop locally using the Jupyter Lab interface but run
calculations remotely on the servers :)
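The connection steps are not shown in this preview. A common way to reach a Jupyter Lab instance that is already running on a remote machine is SSH local port forwarding, so here is a minimal sketch under that assumption. The function name `jupyter_tunnel_cmd`, the host `user@remote.example.org` and port 8888 are placeholders, not values from the original workflow.

```shell
# Hypothetical helper: print the ssh command that forwards a local port to
# Jupyter Lab listening on the remote machine's localhost.
# Usage: jupyter_tunnel_cmd REMOTE [LOCAL_PORT] [REMOTE_PORT]
jupyter_tunnel_cmd() {
    remote="$1"
    local_port="${2:-8888}"    # assumed default port
    remote_port="${3:-8888}"   # assumed default port
    # -N: do not run a remote command, only forward
    # -L: bind local_port and forward it to localhost:remote_port on the remote
    printf 'ssh -N -L %s:localhost:%s %s\n' "$local_port" "$remote_port" "$remote"
}

# Print the command for a placeholder host; run the printed line to open the tunnel.
jupyter_tunnel_cmd user@remote.example.org
```

With the tunnel open, the notebook would be reachable in a local browser at `http://localhost:8888`, assuming those ports.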
@mschecht
mschecht / extract_sequences
Created January 7, 2019 15:47
fastest way to extract sequences from fasta file
# Need to download seqkit
# fx2tab converts a fasta to tabular format
seqkit fx2tab allORFs.fasta | sort -k1,1 --parallel 32 -S20% > allORFs.sorted.tsv
# grep list of headers against tabular fasta then convert back to standard fasta
LC_ALL=C grep -w -F -f <(sort -k1,1 toextract.txt) allORFs.sorted.tsv | seqkit tab2fx > toextract.fasta
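If seqkit is not available, the same idea can be sketched in plain awk. This is a different technique from the grep approach above and is not benchmarked against it: it loads the wanted IDs into an array and prints each record whose ID (the first word of the header) is in that array. The function name `extract_by_id` is made up for this example.

```shell
# Hypothetical helper: extract_by_id IDS_FILE FASTA_FILE
# Prints every FASTA record whose ID (first word after ">") is listed in IDS_FILE.
extract_by_id() {
    awk -v ids="$1" '
        BEGIN {
            # Load one ID per line; only the first word of each line is used
            while ((getline line < ids) > 0) {
                split(line, f, " ")
                if (f[1] != "") wanted[f[1]] = 1
            }
        }
        # Header line: strip the leading ">" and decide whether to keep this record
        /^>/ { id = substr($1, 2); keep = (id in wanted) }
        # Print header and sequence lines while the current record is wanted
        keep
    ' "$2"
}
```

Usage would look like `extract_by_id toextract.txt allORFs.fasta > toextract.fasta`. Unlike the grep version, this matches IDs exactly and only against headers, never against sequence text.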
@mschecht
mschecht / remove_after
Created January 6, 2019 14:47
remove everything after last instance of character
# Removes everything after the last "."
# You could substitute "." with any other single character (e.g., "_", "/")
awk 'BEGIN{FS=OFS="."} NF--' FILE
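One thing to watch with the bare `NF--` pattern: a line that does not contain the delimiter ends up empty, and an empty line may make some awk implementations complain about a negative `NF`. Here is a small sketch that guards against both by only dropping the last field when there is more than one. The function name `strip_after_last` is made up for this example.

```shell
# Hypothetical helper: strip_after_last [DELIM]
# Reads lines on stdin and removes the last DELIM and everything after it.
# DELIM is a single character and defaults to ".".
strip_after_last() {
    awk -v d="${1:-.}" '
        BEGIN { FS = OFS = d }
        NF > 1 { NF-- }   # drop the last field only if the delimiter is present
        1                 # print every line, changed or not
    '
}

# For a single shell variable, parameter expansion does the same job:
f="sample.fastq.gz"
echo "${f%.*}"   # prints sample.fastq
```

Usage: `strip_after_last _ < FILE` to cut at the last underscore instead.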
seqkit fx2tab original.fasta | awk -F'\t' '{print "seq_"NR"\t"$2}' | seqkit tab2fx > renamed.fasta
# 1: convert fasta to tabular format
# 2: Here is where you can change the headers. In the example above, each sequence header will
# be changed to "seq_<NR>" (NR is the awk variable holding the number of records read so far,
# i.e. the current line number). -F'\t' splits on tabs so headers containing spaces stay intact.
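The renaming step can also be sketched without the round trip through tabular format. This is an alternative, not the method above: awk replaces each header line with a running counter and passes sequence lines through untouched, so the original line wrapping is kept. The function name `rename_fasta_headers` and the default prefix are assumptions for this example.

```shell
# Hypothetical helper: rename_fasta_headers [PREFIX]
# Reads FASTA on stdin and replaces each header with ">PREFIX<n>",
# where n counts records starting from 1. PREFIX defaults to "seq_".
rename_fasta_headers() {
    awk -v p="${1:-seq_}" '
        /^>/ { n++; print ">" p n; next }   # header line: emit the new name
        { print }                            # sequence line: pass through
    '
}
```

Usage: `rename_fasta_headers < original.fasta > renamed.fasta`.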
@mschecht
mschecht / cluster_methods
Created December 7, 2018 10:41
Test which clustering algorithm gives you the best cophenetic distance correlation.
pufm_cor <- cor(pufm_agg_v3_wide, method = "pearson")
pufm_cor <- as.dist(1 - pufm_cor)
hc_methods <- c("ward.D", "ward.D2", "single", "complete", "average", "mcquitty", "median", "centroid")
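coph <- function(hc_method, d, dist_method){
  hc <- hclust(d, method = hc_method)  # cluster with the given linkage method
  coph_cor <- cor(cophenetic(hc), d)   # correlation between cophenetic and original distances
  # one-row result; tibble() replaces the deprecated data_frame()
  tibble::tibble(hc_method = hc_method, dist_method = dist_method, coph = coph_cor)
}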
@mschecht
mschecht / compile
Created November 13, 2018 10:14
zlib
cmake -DCMAKE_BUILD_TYPE=RELEASE -DCMAKE_INSTALL_PREFIX=. -DZLIB_ROOT=/home/mschecht/.linuxbrew/Cellar/zlib/1.2.11 ..
@mschecht
mschecht / list_size
Last active September 19, 2018 10:42
# du: disk usage
# -s: print one summary line per argument instead of every subdirectory
# -h: human-readable units
du -sh ./* | sort -h
# sort -h: sort by human-readable size, smallest first
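To reuse this on any directory, it can be wrapped in a small function. This is only a sketch: the name `list_sizes` is made up, and it assumes a `du` and `sort` that support `-h` (GNU coreutils does).

```shell
# Hypothetical helper: list_sizes [DIR]
# Prints one size line per entry in DIR (default: current directory),
# sorted from smallest to largest.
list_sizes() {
    dir="${1:-.}"
    # -s: one summary line per argument; -h: human-readable units
    du -sh -- "$dir"/* | sort -h
}
```

Usage: `list_sizes /path/to/project | tail -n 5` shows the five largest entries.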
## 7. Permuted subsampling of larger matrices
Now let's subsample the "all" and the "knowns" matrices so that they have the same number of components as the "unknowns".
### Matrix subsampling function
```{r}
phyloseq_subsample <- function(phyloseq_obj) {
subsample_size <- taxa_names(unk_physeq) %>% length() # get number of variables in unk matrix
matrix <- phyloseq:::veganifyOTU(phyloseq_obj) # Pull out matrix from phyloseq
sub <- sample(x = seq_len(ncol(matrix)), size = subsample_size, replace = FALSE) # create vector of subsampled variables from larger matrix
---
title: "Single vs multidomain proteins in refseq"
output:
html_document:
df_print: paged
editor_options:
chunk_output_type: console
---
What is the number of single versus multidomain proteins in the non-redundant refseq database?
#!/usr/bin/env bash
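# Count rows per value of column 3; sort first so uniq -c sees identical values on adjacent lines
MULTI=$(awk '{print $3}' "$1" | sort | uniq -c | awk '$1 != 1 {print $2}' | wc -l)
SINGLE=$(awk '{print $3}' "$1" | sort | uniq -c | awk '$1 == 1 {print $2}' | wc -l)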
echo "No. of single domain proteins = $SINGLE"
echo "No. of multidomain proteins = $MULTI"
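The same counts can be sketched in a single awk pass, which reads the file once and does not depend on row order. As in the script above, this assumes column 3 holds the protein identifier and that each row is one domain hit. The function name `count_domains` and the output format are made up for this example.

```shell
# Hypothetical helper: count_domains TABLE
# Counts proteins (column 3) that appear on exactly one row versus more than one row.
count_domains() {
    awk '
        { hits[$3]++ }   # tally rows per protein ID
        END {
            for (id in hits) {
                if (hits[id] == 1) single++
                else multi++
            }
            printf "single=%d multi=%d\n", single, multi
        }
    ' "$1"
}
```

Usage: `count_domains domains.tsv`, where `domains.tsv` is a placeholder file name.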