Matthew Schechter mschecht

## Current programming workflow - Python
# Python programming setup
- Desc: This is my current workflow for setting up my Python programming interface
with Jupyter Lab on a remote server. I set up Jupyter Lab on the remote server so
that it runs continuously. After this has been done once, only steps 5-6 are
needed to connect my local computer to the running Jupyter Lab instance.

- Benefits:
  - I can develop locally using the Jupyter Lab interface while running
  calculations remotely on the servers :)
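
The connection steps themselves are not listed in these notes. One common way to reach a Jupyter Lab instance running on a server is an SSH tunnel; the sketch below only prints the two commands as a dry run. The user name, host name, port, and log file are hypothetical placeholders, and whether this matches steps 5-6 of the original workflow is an assumption.

```shell
# Hypothetical values: replace with your own user, host, and port.
REMOTE_USER="user"
REMOTE_HOST="remote.example.org"
PORT=8888

# On the remote server (once): start Jupyter Lab without a browser, detached from the terminal.
start_cmd="nohup jupyter lab --no-browser --port=${PORT} > jupyter.log 2>&1 &"

# On the local computer (each session): forward local PORT to the same port on the server.
tunnel_cmd="ssh -N -L ${PORT}:localhost:${PORT} ${REMOTE_USER}@${REMOTE_HOST}"

# Dry run: print the commands instead of executing them.
printf 'remote: %s\nlocal:  %s\n' "$start_cmd" "$tunnel_cmd"
```

With the tunnel open, the interface would be reachable in a local browser at `http://localhost:8888`, using the token that Jupyter Lab prints on startup.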

## extract_sequences
# Need to download seqkit
# fx2tab converts a FASTA file to tabular format
seqkit fx2tab allORFs.fasta | sort -k1,1 --parallel 32 -S20% > allORFs.sorted.tsv

# grep list of headers against tabular fasta then convert back to standard fasta
LC_ALL=C grep -w -F -f <(sort -k1,1 toextract.txt)  allORFs.sorted.tsv | seqkit tab2fx  > toextract.fasta
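
`grep -w -F -f` matches an ID anywhere on the line, so an ID that also appears as a word in another record's description would pull that record in too. The sketch below is a plain awk variant (not seqkit) that compares only the first word of each header, run on toy data. The helper names `fasta_to_tab` and `extract_by_id` and the toy sequences are made up for illustration, and the ID list is assumed to be non-empty.

```shell
workdir=$(mktemp -d)
printf '>orf1 desc\nATGC\nGGTA\n>orf2\nTTTT\n>orf3\nCCCC\n' > "$workdir/allORFs.fasta"
printf 'orf3\norf1\n' > "$workdir/toextract.txt"

# FASTA -> "header<TAB>sequence", joining wrapped sequence lines.
fasta_to_tab() {
  awk 'BEGIN { OFS = "\t" }
       /^>/ { if (h != "") print h, s; h = substr($0, 2); s = ""; next }
       { s = s $0 }
       END { if (h != "") print h, s }' "$1"
}

# Keep records whose ID (first word of the header) is listed in the ID file.
extract_by_id() {
  fasta_to_tab "$1" |
    awk -F'\t' 'NR == FNR { want[$1]; next }
                { split($1, id, " "); if (id[1] in want) print ">" $1 "\n" $2 }' "$2" -
}

extracted=$(extract_by_id "$workdir/allORFs.fasta" "$workdir/toextract.txt")
printf '%s\n' "$extracted"
rm -rf "$workdir"
```

Records come out in the order they appear in the FASTA file, not in the order of the ID list.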

## remove_after
# Removes the last "." and everything after it
# You could substitute "." with any other delimiter (e.g., "_", "/")
awk 'BEGIN{FS=OFS="."} NF--' FILE
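
For comparison, a minimal sketch of the same trimming with sed and with bash parameter expansion, run on toy names. These are alternatives to the awk one-liner, not a description of it; `strip_last_suffix` is a hypothetical helper name. The sed version leaves lines without a "." unchanged.

```shell
# Remove the last "." and everything after it, line by line.
strip_last_suffix() {
  sed 's/\.[^.]*$//'
}

trimmed=$(printf 'sample.contig1.fasta\nnodot\n' | strip_last_suffix)
printf '%s\n' "$trimmed"

# For a single value already held in a shell variable, no external tool is needed.
name="sample.contig1.fasta"
base="${name%.*}"
printf '%s\n' "$base"
```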

## rename-fasta-headers
seqkit fx2tab original.fasta | awk -F'\t' '{print "seq_"NR"\t"$2}' | seqkit tab2fx > renamed.fasta
# 1: convert FASTA to tabular format
# 2: Here is where you can change the headers. In the example above, each sequence header will
# be changed to "seq_<NR>" (NR is awk's built-in variable for the number of records read so far,
# i.e., the current line number). -F'\t' keeps headers that contain spaces in column 1.
# 3: convert the tabular format back to FASTA
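
The same renaming can be sketched without seqkit by numbering the header lines directly in awk. This is an alternative shown on toy data, not the pipeline above; `rename_headers` and the toy sequences are made up for illustration. It keeps the original line wrapping of the sequences.

```shell
# Replace every header with ">seq_<n>", where n counts headers in file order.
rename_headers() {
  awk '/^>/ { n++; print ">seq_" n; next } { print }' "$1"
}

tmp_fasta=$(mktemp)
printf '>contig_A some description\nATGC\nGGTA\n>contig_B\nTTTT\n' > "$tmp_fasta"

renamed=$(rename_headers "$tmp_fasta")
printf '%s\n' "$renamed"
rm -f "$tmp_fasta"
```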

## cluster_methods
pufm_cor <- cor(pufm_agg_v3_wide, method = "pearson")
pufm_cor <- as.dist(1 - pufm_cor)

hc_methods <- c("ward.D", "ward.D2", "single", "complete", "average", "mcquitty", "median", "centroid")

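# Cophenetic correlation between the distance matrix d and the tree built with hc_method.
# d has no default: `d = d` raises a recursive default argument error when d is omitted.
coph <- function(hc_method, d, dist_method){
  hc <- hclust(d, method = hc_method)
  coph_cor <- cor(cophenetic(hc), d)
  # tibble() replaces the deprecated data_frame(); the result is returned visibly
  tibble::tibble(hc_method = hc_method, dist_method = dist_method, coph = coph_cor)
}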

## compile
cmake -DCMAKE_BUILD_TYPE=RELEASE -DCMAKE_INSTALL_PREFIX=. -DZLIB_ROOT=/home/mschecht/.linuxbrew/Cellar/zlib/1.2.11 ..

## list_size
# du: disk usage
# -s: report one total per argument instead of listing every subdirectory
# -h: print sizes in human-readable units

du -sh ./* | sort -h

# sort -h: sort by human-readable size (smallest first)
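
A minimal sketch for trying the listing on a throwaway directory, assuming GNU coreutils (`sort -h` is not available in every sort implementation). The directory and file names are made up for illustration.

```shell
demo_dir=$(mktemp -d)
mkdir "$demo_dir/small" "$demo_dir/large"
printf 'x' > "$demo_dir/small/a.txt"
head -c 2097152 /dev/urandom > "$demo_dir/large/b.bin"   # 2 MiB of random bytes

# One summary line per entry, smallest first; the last line is the largest entry.
listing=$(cd "$demo_dir" && du -sh ./* | sort -h)
largest=$(printf '%s\n' "$listing" | tail -n 1 | cut -f2)

printf '%s\n' "$listing"
rm -rf "$demo_dir"
```

Piping the sorted listing through `tail -n 1` is a quick way to find the entry that takes the most space.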

## subsample_ord
## 7. Permuted subsampling of larger matrices

Now let's subsample the "all" and the "knowns" matrices so that they have the same number of components as the "unknowns".

### Matrix subsampling function
```{r}
phyloseq_subsample <- function(phyloseq_obj) {
  subsample_size <- length(taxa_names(unk_physeq)) # number of variables in the unknowns matrix
  otu_mat <- phyloseq:::veganifyOTU(phyloseq_obj) # samples-by-taxa matrix from the phyloseq object
  sub <- sample(x = seq_len(ncol(otu_mat)), size = subsample_size, replace = FALSE) # sampled column indices
  # Assumed completion (the source snippet is cut off here): return the subsampled columns
  otu_mat[, sub, drop = FALSE]
}
```

## single_multidomain_.Rmd
---
title: "Single vs multidomain proteins in refseq"
output:
  html_document:
    df_print: paged
editor_options:
  chunk_output_type: console
---

What is the number of single-domain versus multidomain proteins in the non-redundant RefSeq database?

## singlevsmulti.sh
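#!/usr/bin/env bash
# Usage: singlevsmulti.sh TABLE
# Counts how often each value in column 3 occurs (assumed to be the protein ID,
# one row per domain hit); a value seen once is a single-domain protein.

# sort before uniq -c, because uniq only collapses adjacent duplicates
MULTI=$(awk '{print $3}' "$1" | sort | uniq -c | awk '$1 != 1 {print $2}' | wc -l)

SINGLE=$(awk '{print $3}' "$1" | sort | uniq -c | awk '$1 == 1 {print $2}' | wc -l)

echo "No. of single domain proteins = $SINGLE"

echo "No. of multidomain proteins = $MULTI"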
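
The two counts can also be computed in a single awk pass, which avoids reading the table twice. This is a sketch on a toy table; the three-column layout, with the protein ID in column 3 and one row per domain hit, is an assumption, and the IDs are made up.

```shell
table=$(mktemp)
printf 'hit1\tdomA\tprotX\nhit2\tdomB\tprotY\nhit3\tdomC\tprotX\nhit4\tdomD\tprotZ\n' > "$table"

# Tally rows per protein, then count proteins with exactly one row vs more than one.
counts=$(awk '{ hits[$3]++ }
              END {
                for (p in hits) { if (hits[p] == 1) single++; else multi++ }
                printf "%d %d\n", single, multi
              }' "$table")

echo "single multi = $counts"
rm -f "$table"
```

Because the tally is kept in an array, the input does not need to be sorted or grouped by protein.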