Vinh Tran trvinh

  • Goethe University Frankfurt
  • Frankfurt am Main, Germany
@trvinh
trvinh / find_isoforms.py
Created August 28, 2025 08:57
Given a parsed gff3 file (in parquet format), find groups of isoforms for a list of protein IDs
#!/usr/bin/env python3
import argparse
import pandas as pd
def find_clusters(df: pd.DataFrame, protein_list: list) -> pd.DataFrame:
    """
    Given a dataframe and a list of protein IDs,
    return groups of proteins that belong to the same cluster (per chromosome).
    """
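The preview stops at the docstring, so the grouping logic itself is not shown here. As a minimal sketch of one way isoforms could be grouped, the example below assumes each record carries `protein_id`, `gene`, and `chrom` fields (hypothetical names) and uses plain dictionaries instead of the gist's pandas DataFrame. The helper `group_isoforms` is illustrative and is not the author's `find_clusters`.

```python
from collections import defaultdict

def group_isoforms(records, protein_list):
    """Group query protein IDs by (chrom, gene). Hypothetical helper."""
    wanted = set(protein_list)
    groups = defaultdict(list)
    for rec in records:
        if rec["protein_id"] in wanted:
            groups[(rec["chrom"], rec["gene"])].append(rec["protein_id"])
    # Keep only loci where more than one query protein was found
    return {locus: sorted(ids) for locus, ids in groups.items() if len(ids) > 1}

# Made-up records; field names are assumptions, not the gist's schema
records = [
    {"protein_id": "P1", "gene": "geneA", "chrom": "chr1"},
    {"protein_id": "P2", "gene": "geneA", "chrom": "chr1"},
    {"protein_id": "P3", "gene": "geneB", "chrom": "chr1"},
    {"protein_id": "P4", "gene": "geneA", "chrom": "chr2"},
]
groups = group_isoforms(records, ["P1", "P2", "P3"])
print(groups)
```

Keying on the chromosome as well as the gene keeps identically named loci on different chromosomes apart, in line with the "per chromosome" wording of the docstring.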
trvinh / gff_parser.py
Last active August 28, 2025 08:59
Parsing gff3 file to get isoforms and protein lengths
import pandas as pd
import re
import argparse
def parse_gff(gff_file):
    """
    Parse GFF file and extract CDS with protein ID, gene locus, mRNA, chrom, strand, positions.
    """
    records = []
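The rest of `parse_gff` is not visible in this preview. As a rough sketch of the general idea, the example below reads the nine tab-separated GFF3 columns and keeps only `CDS` features. It assumes NCBI-style `Parent` and `protein_id` attributes, which may not match the files the gist targets, and the helpers `parse_attributes` and `parse_cds_lines` are hypothetical names.

```python
def parse_attributes(field):
    """Turn a GFF3 attribute column ('k1=v1;k2=v2') into a dict."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs

def parse_cds_lines(lines):
    """Collect one record per CDS feature from an iterable of GFF3 lines."""
    records = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        attrs = parse_attributes(cols[8])
        records.append({
            "chrom": cols[0],
            "start": int(cols[3]),
            "end": int(cols[4]),
            "strand": cols[6],
            "parent": attrs.get("Parent"),          # assumed attribute name
            "protein_id": attrs.get("protein_id"),  # assumed attribute name
        })
    return records

# Made-up input lines for illustration
example = [
    "##gff-version 3\n",
    "chr1\tsrc\tgene\t1\t900\t.\t+\t.\tID=gene1\n",
    "chr1\tsrc\tCDS\t10\t300\t.\t+\t0\tID=cds1;Parent=mrna1;protein_id=P1\n",
]
print(parse_cds_lines(example))
```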
trvinh / create_core_hmm.py
Last active August 19, 2025 07:47
Create aln and hmm files for subfolders in core_orthologs directory
#!/usr/bin/env python
import os
import sys
import argparse
from pathlib import Path
import subprocess
import multiprocessing as mp
from tqdm import tqdm
trvinh / install_phyloprofile.txt
Last active November 4, 2024 12:25
Install PhyloProfile in a new Conda env
# create new conda env
mamba create -n phyloprofile_v1.20 r-base pkg-config pkgconfig fontconfig gsl lxml
# activate that env and start an R terminal
mamba activate phyloprofile_v1.20
R
# install phyloprofile from bioconductor
install.packages("BiocManager")
BiocManager::install("PhyloProfile")
# or install dev version from github
install.packages("devtools")
trvinh / split_multi_domains_file.R
Last active June 7, 2024 09:15
Split a multi-ortholog-pair domain file into single files for each ortholog pair
library(data.table)
library(dplyr)
#' Split a multi-ortholog group file into single files
splitDomainFile <- function(domainFile = NULL, outPath = NULL) {
    if (is.null(domainFile)) stop("Domain file cannot be NULL")
    if (is.null(outPath)) stop("Output path cannot be NULL")
    df <- fread(
        domainFile, header = TRUE, stringsAsFactors = FALSE, sep = "\t"
trvinh / update_ete_ncbi.py
Created February 8, 2024 08:48
Update NCBI database of python ETE3 library
from ete3 import NCBITaxa
ncbi = NCBITaxa()
ncbi.update_taxonomy_database()
trvinh / process_taxDB.R
Last active June 12, 2024 02:44
Process taxonomy DB for PhyloProfile from downloaded taxdmp.zip file
library(PhyloProfile)
processNcbiTaxonomy <- function(taxdmpfile = NULL) {
    if (is.null(taxdmpfile) || !file.exists(taxdmpfile)) {
        stop("taxdmp.zip file invalid!")
    } else temp <- taxdmpfile
    names <- utils::read.table(
        unz(temp, "names.dmp"), header = FALSE, fill = TRUE, sep = "\t",
        quote = "", comment.char = "", stringsAsFactors = FALSE
trvinh / combine_fasta.py
Last active March 8, 2023 14:51
Python script for concatenating two FASTA files without duplicated sequence headers
# -*- coding: utf-8 -*-
from Bio import SeqIO
import argparse
import shutil
def combine_fa(fa_1, fa_2, out_file):
    """ Combine 2 fasta files """
    new_fa_dict = SeqIO.to_dict(SeqIO.parse(open(fa_2),'fasta'))
    existing_seq = SeqIO.to_dict(SeqIO.parse(open(fa_1),'fasta'))
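The preview ends before the merge step. One way to combine two FASTA inputs without repeating headers might look like the sketch below. It swaps Biopython's `SeqIO` for a small standard-library parser so that it runs without extra dependencies, and it assumes that a sequence from the second file is added only when its header is absent from the first. Whether the gist resolves duplicates the same way is not visible here, and `read_fasta` and `combine_fasta` are hypothetical names.

```python
def read_fasta(text):
    """Parse FASTA text into an ordered {header: sequence} dict."""
    seqs = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()[0]  # ID is the first word of the header
            seqs[header] = ""
        elif header is not None:
            seqs[header] += line          # join wrapped sequence lines
    return seqs

def combine_fasta(text_1, text_2):
    """Append records from text_2 whose headers are not already in text_1."""
    combined = read_fasta(text_1)
    for header, seq in read_fasta(text_2).items():
        if header not in combined:
            combined[header] = seq
    return "".join(f">{h}\n{s}\n" for h, s in combined.items())

# Made-up sequences for illustration
fa_1 = ">a\nMKT\n>b\nGGA\n"
fa_2 = ">b\nGGA\n>c\nTTL\nAA\n"
result = combine_fasta(fa_1, fa_2)
print(result, end="")
```

Unlike `SeqIO.to_dict`, which raises an error on duplicate keys within a single file, this parser silently keeps the last record for a repeated header.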
trvinh / update_data_pp.txt
Created February 2, 2023 13:39
Update PhyloProfile predata (RData files in PhyloProfile/data folder)
library(PhyloProfile)
setwd('PhyloProfile/data')
# load data
data(taxonNamesReduced)
# modify the dataframe
# for example, rename Actinobacteria to Actinomycetota
taxonNamesReduced$fullName[
taxonNamesReduced$rank == "phylum" & taxonNamesReduced$ncbiID == 201174
trvinh / use_timeit.py
Created December 13, 2022 16:07
Use timeit to calculate runtime of a function
import random
import timeit

def test(st, en):
    return random.randint(st, en)

# Total time in seconds for 10 calls of test(10, 100)
t = timeit.Timer(lambda: test(10, 100))
print(t.timeit(10))