Marcus Fedarko fedarko
@fedarko
fedarko / read_stats.py
Last active December 17, 2023 22:43
Compute simple statistics (number of reads, total read length, average read length) for a set of (maybe gzipped) FASTA / FASTQ files
#! /usr/bin/env python3
#
# Computes the total number of reads, total read length, and average read
# length of a set of (maybe gzipped) FASTA / FASTQ files. Requires the pyfastx
# library (https://github.com/lmdu/pyfastx). I designed this in the context of
# computing read statistics, but if you have a set of other sequences (e.g.
# contigs) then I guess this would still work for that.
#
# USAGE:
# ./read_stats.py file1.fa [file2.fa ...]
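The three statistics named above can be sketched without any third-party dependency. This is a minimal illustration only, not the gist's code: the gist itself relies on pyfastx, whereas the hypothetical helpers `sequence_lengths` and `summarize` below use only the standard library and assume that FASTQ records span exactly four lines.

```python
import gzip


def sequence_lengths(lines):
    """Yield the length of each sequence in an iterable of FASTA/FASTQ lines.

    Assumption (not from the gist): FASTQ records are exactly four lines
    long; FASTA sequences may wrap across several lines.
    """
    it = iter(lines)
    first = next(it, "")
    if first.startswith("@"):
        header = first
        while header.strip():
            seq = next(it, "").strip()
            next(it, "")  # skip the "+" separator line
            next(it, "")  # skip the quality string
            yield len(seq)
            header = next(it, "")
    elif first.startswith(">"):
        length = 0
        for line in it:
            if line.startswith(">"):
                # A new header closes the previous record.
                yield length
                length = 0
            else:
                length += len(line.strip())
        yield length
    elif first.strip():
        raise ValueError("input does not look like FASTA or FASTQ")


def summarize(paths):
    """Return (number of reads, total length, average length) over all files."""
    num_reads = 0
    total_length = 0
    for path in paths:
        # Treat a ".gz" suffix as the sign of a gzipped file.
        opener = gzip.open if str(path).endswith(".gz") else open
        with opener(path, "rt") as handle:
            for length in sequence_lengths(handle):
                num_reads += 1
                total_length += length
    average = total_length / num_reads if num_reads else 0.0
    return num_reads, total_length, average
```

Calling `summarize(["file1.fa", "file2.fq.gz"])` would return the three values as a tuple; for large read sets a dedicated parser such as pyfastx will likely be much faster.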
@fedarko
fedarko / shorten_edge_labels.py
Last active August 17, 2023 21:08
Shortens each edge label in an LJA DOT file to just the first line
#! /usr/bin/env python
#
# Shortens edge labels in a DOT file output by LJA to just show the first line
# and then a count of how many other lines are omitted. (If an edge's label
# spans exactly one or two lines, then the entire label is preserved.)
#
# USAGE:
# ./shorten_edge_labels.py in.dot out.dot
import sys
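The shortening rule described in the header comment can be sketched as a single function. This is an illustration under assumptions, not the gist's implementation: the function name `shorten_label`, the literal `\n` line separator inside the label, and the wording of the "omitted" line are all hypothetical.

```python
def shorten_label(label, sep="\\n"):
    """Keep the first line of a label and summarize the rest.

    Assumptions (not from the gist): label lines are separated by the
    two-character sequence backslash + "n", as is common in DOT files, and
    the omitted lines are summarized as "[N more lines]".
    """
    lines = label.split(sep)
    # One- and two-line labels are preserved: replacing a single line with
    # an "omitted" note would not make the label any shorter.
    if len(lines) <= 2:
        return label
    return f"{lines[0]}{sep}[{len(lines) - 1} more lines]"
```

Applying this to every edge line of a DOT file additionally requires locating the `label="..."` attribute on each line, which depends on the exact output format of LJA.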
@fedarko
fedarko / check_for_conflicting_node_ids.py
Created August 16, 2023 05:55
Checks for "conflicting" node IDs defined multiple times in a DOT file
#! /usr/bin/env python
#
# Scans through a jumboDBG / LJA output DOT file; looks for cases where
# the same node is "defined" on multiple lines. This can be caused by the
# same truncated node ID being misused across lines.
#
# USAGE:
# ./check_for_conflicting_node_ids.py graph.dot
#
# Note that this assumes that the input graph was output by jumboDBG / LJA --
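A rough sketch of the check described above: record every line that declares a node, then report node IDs declared more than once. The regular expression and the function name `find_conflicting_node_ids` are assumptions made for illustration; the actual layout of jumboDBG / LJA DOT files may differ from what this pattern expects.

```python
import re
from collections import defaultdict

# Assumed shape of a node declaration: an optionally quoted ID followed by
# an attribute list, for example:   12345 [label="..."]
NODE_LINE = re.compile(r'^\s*"?([^"\s\[]+)"?\s*\[')

# DOT keywords that set default attributes rather than declare a node.
DEFAULT_STATEMENTS = {"node", "edge", "graph"}


def find_conflicting_node_ids(lines):
    """Map each node ID declared more than once to its 1-based line numbers."""
    declared_on = defaultdict(list)
    for lineno, line in enumerate(lines, start=1):
        if "->" in line:
            # Edge lines also end in an attribute list; skip them.
            continue
        match = NODE_LINE.match(line)
        if match and match.group(1) not in DEFAULT_STATEMENTS:
            declared_on[match.group(1)].append(lineno)
    return {node: nums for node, nums in declared_on.items() if len(nums) > 1}
```

An empty dictionary means that no node ID was declared on more than one line.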
@fedarko
fedarko / rm_seqs_from_gfa.py
Created August 6, 2023 02:42
Remove sequences from a GFA 1 file
#! /usr/bin/env python3
#
# SUMMARY
# =======
# Outputs a copy of a GFA 1 file with each segment (S) line that contains a
# sequence (not just a "*" character) altered as follows:
#
# - If an LN:i tag does not exist for this sequence:
# - We will add an LN:i tag describing the length of the sequence.
# - We will replace the sequence with a "*" character.
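The per-line transformation for the case listed above could look roughly like this. The function name `strip_segment_sequence` is hypothetical. The preview is cut off before it says what happens when an LN:i tag already exists, so this sketch simply assumes the existing tag is kept as-is while the sequence is still replaced.

```python
def strip_segment_sequence(line):
    """Return a GFA 1 line with any segment sequence replaced by "*".

    Assumption (the source is truncated here): if an LN:i tag is already
    present, it is left untouched and the sequence is still replaced.
    """
    fields = line.rstrip("\n").split("\t")
    # Only segment (S) lines with a real sequence need to change.
    if fields[0] != "S" or len(fields) < 3 or fields[2] == "*":
        return "\t".join(fields)
    sequence = fields[2]
    has_length_tag = any(field.startswith("LN:i:") for field in fields[3:])
    if not has_length_tag:
        # Record the length before the sequence itself is discarded.
        fields.append(f"LN:i:{len(sequence)}")
    fields[2] = "*"
    return "\t".join(fields)
```

All other line types (H, L, P, and so on) pass through unchanged, so the function can be mapped over every line of the input file.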
@fedarko
fedarko / sort-rmdup-bbl.py
Last active August 31, 2022 08:11
Sort and remove duplicate BBL (BibTeX output file) entries; useful when combining multiple BBL files (e.g. if using the multibib package) into a single one
#! /usr/bin/env python3
# NOTE: this is a hack, so it will probably break if you have BBL files that
# don't look like the natbib-generated ones I'm used to. It is also pretty
# unintelligent about *how* it sorts entries (it defers most of the work
# to python), so if you have cases where some of your references are by
# the same person or whatever then that might cause the output to not match
# your expectations.
import sys
@fedarko
fedarko / gfa-to-fasta.py
Created April 15, 2022 02:45
Convert GFA to FASTA
#! /usr/bin/env python3
# Converts a GFA assembly graph to a FASTA file of all sequences
# within the graph. Notably, this ignores connections between sequences
# in the graph.
#
# Depends on Python 3.6 or later.
#
# Usage:
# $ ./gfa-to-fasta.py mygraph.gfa contigs.fasta
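The conversion described above amounts to emitting one FASTA record per segment line and ignoring everything else. A minimal sketch follows; the function name `gfa_to_fasta_records` is hypothetical, and skipping segments whose sequence is "*" is an assumption rather than documented behavior of the gist.

```python
def gfa_to_fasta_records(lines):
    """Yield one FASTA record (header plus sequence) per GFA segment line.

    Link (L) lines and all other record types are ignored, so connections
    between sequences are dropped. Assumption: segments whose sequence is
    "*" (no sequence stored) are skipped.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S" and len(fields) >= 3 and fields[2] != "*":
            name, sequence = fields[1], fields[2]
            yield f">{name}\n{sequence}\n"
```

Writing the output is then a matter of opening both files and calling `out_handle.writelines(gfa_to_fasta_records(in_handle))`.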
@fedarko
fedarko / handle_duplicate_sample_ids.py
Last active December 16, 2019 22:12
Script to report on duplicate IDs in a plate map spreadsheet (and modify certain duplicate IDs, in a very specific case); also attempts to update Qiita prep files accordingly. As a warning, code is untested / pretty gross.
#! /usr/bin/env python3
import os
from collections import Counter
from math import ceil
import re
from numpy import argmax
import pandas as pd
from qiime2 import Metadata
# "Parameters" of this script
@fedarko
fedarko / find_missing_dates.py
Last active December 10, 2019 00:54
In a timeseries metadata file, finds all days that are not "represented" by at least one sample in the metadata
#! /usr/bin/env python3
from dateutil.parser import parse
import pandas as pd
df = pd.read_csv("20191209_metadata.txt", sep="\t", index_col=0)
# Subset to a certain host subject ID, if desired
df = df[df["host_subject_id"] == "M03"]
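The core of the "missing days" search can be sketched independently of pandas. This is an illustration under assumptions, not the gist's code: the function name `missing_days` is hypothetical, and the search range is assumed to run from the earliest to the latest sample date.

```python
from datetime import date, timedelta


def missing_days(sample_dates):
    """Return every day between the first and last sample date that has no sample.

    Assumption: the range of interest runs from the earliest to the latest
    date seen in the metadata, inclusive.
    """
    seen = set(sample_dates)
    if not seen:
        return []
    start, end = min(seen), max(seen)
    num_days = (end - start).days + 1
    all_days = (start + timedelta(days=offset) for offset in range(num_days))
    return [day for day in all_days if day not in seen]
```

With a metadata DataFrame, the input could be built by parsing the timestamp column and converting each value to a `date`; the name of that column depends on the metadata file.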
@fedarko
fedarko / negative_control_stats.py
Last active October 27, 2019 23:55
Search taxonomies of negative controls
#! /usr/bin/env python3
"""
This is a small script that looks through the annotated taxonomies of all
features present in a dataset's negative control samples. It's handy for
checking that certain features are (for the most part) absent from these
samples.
This obviously isn't a very formal way of accounting for contamination,
but it is useful for quickly verifying that certain taxa are probably not
the product of contamination. (Better approaches include e.g. the decontam
@fedarko
fedarko / convert_timestamp_to_ordinal_date.py
Created October 8, 2019 22:26
Adds an ordinal date field based on a timestamp to a QIIME 2 metadata file
#! /usr/bin/env python3
from qiime2 import Metadata
from dateutil.parser import parse
from dateutil.relativedelta import relativedelta
m = Metadata.load("metadata-with-age.tsv")
m_df = m.to_dataframe()
m_df["ordinal-timestamp"] = 0