A personal diary of DataFrame munging over the years.
Convert Series datatype to numeric (will error if column has non-numeric values)
(h/t @makmanalp)
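A minimal sketch of the conversion described above, assuming `pd.to_numeric` is the call in question (the original snippet is not shown here, and the example Series are made up for illustration). By default the call raises on values it cannot parse; `errors="coerce"` is an alternative that turns them into `NaN`.

```python
import pandas as pd

# Hypothetical example data: numbers stored as strings.
clean = pd.Series(["1", "2.5", "3"])
dirty = pd.Series(["1", "oops", "3"])

# Strict conversion: succeeds only when every value parses as a number.
clean_numeric = pd.to_numeric(clean)

# The same call raises ValueError when a value cannot be parsed.
try:
    pd.to_numeric(dirty)
    strict_failed = False
except ValueError:
    strict_failed = True

# errors="coerce" replaces unparseable values with NaN instead of raising.
coerced = pd.to_numeric(dirty, errors="coerce")
```

The strict form is useful when a non-numeric value signals a data problem that should stop the pipeline; the coercing form suits exploratory cleanup where bad cells are handled later.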
#!/usr/bin/env python
"""
GTF.py
Kamil Slowikowski
December 24, 2013
Read GFF/GTF files. Works with gzip compressed files and pandas.
http://useast.ensembl.org/info/website/upload/gff.html
import os
import matplotlib
from matplotlib.patches import Circle, Wedge, Polygon, Rectangle
from matplotlib.collections import PatchCollection
import matplotlib.pyplot as plt
def karyoplot(karyo_filename, metadata={}, part=1):
    '''
    To create a karyo_filename go to: http://genome.ucsc.edu/cgi-bin/hgTables
[kallisto][] is a new method for processing RNA-seq data. By pseudoaligning reads to a transcriptome instead of aligning reads to a genome, the quantification step is much faster. While the computational speedup will be huge for projects with many samples and/or with organisms with large genomes, I was curious how much time would be saved using [kallisto][] on a small RNA-seq project for an organism with a smaller genome. To perform this comparison, I downloaded 6 fastq files from a recent yeast RNA-seq study on GEO. I chose [Subread][subread] as the comparison method because it performs read alignment but is optimized for quickly obtaining gene counts (it soft clips reads instead of trying to map exact exon-exon boundaries).
The dplyr package in R makes data wrangling significantly easier.
The beauty of dplyr is that, by design, the options available are limited.
Specifically, a set of key verbs forms the core of the package.
Using these verbs you can solve a wide range of data problems effectively and in less time.
While transitioning to Python I have greatly missed the ease with which I can think through and solve problems using dplyr in R.
The purpose of this document is to demonstrate how to execute the key dplyr verbs when manipulating data using Python (with the pandas package).
dplyr is organised around six key verbs:
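As a minimal sketch, assuming the six verbs meant here are the ones usually cited for dplyr (`filter`, `select`, `arrange`, `mutate`, `summarise`, and `group_by`), each has a reasonably direct pandas counterpart. The DataFrame and column names below are hypothetical, and these are only one possible mapping, not necessarily the one this document goes on to use.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "x": [1, 2, 3, 4],
    "y": [10.0, 20.0, 30.0, 40.0],
})

# filter(): keep rows matching a condition.
filtered = df[df["x"] > 1]

# select(): keep columns by name.
selected = df[["group", "x"]]

# arrange(): reorder rows by a column.
arranged = df.sort_values("x", ascending=False)

# mutate(): add a column derived from existing ones.
mutated = df.assign(z=df["x"] * df["y"])

# summarise(): collapse a column to a single summary value.
summary = df.agg({"y": "mean"})

# group_by() followed by summarise(): one summary row per group.
by_group = df.groupby("group", as_index=False).agg(mean_y=("y", "mean"))
```

Each step returns a new object and leaves `df` unchanged, which keeps the style close to dplyr's, where verbs return a modified copy rather than editing in place.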
#!/usr/bin/env python
"""Sequence-based structural alignment of two proteins."""
import argparse
import pathlib
from Bio.PDB import FastMMCIFParser, MMCIFIO, PDBParser, PDBIO, Superimposer
from Bio.PDB.Polypeptide import is_aa
Click on the Binder package link on that page. The link is near the very bottom of the visible part of the page, just below Demo Programs.
A notebook will then launch. (The first launch sometimes hangs; if it does, just hit reload in your browser.)
After it loads fully it will look like below, with a URL similar to, but different from, the one you see here.
Go to my fork of the VPython Binder repository in your browser.
That will take you to a new page and trigger deployment of a version of the Jupyter notebook environment from the correct repository. You shouldn't need to do anything while this takes place; you can watch the progress bar roughly in the middle of the screen, just below the Launch button. It may take about a minute. After it boots up, it should bring you to a dashboard that will look like below.
%matplotlib notebook
# use `%matplotlib notebook` if you are using current JupyterLab
from vpython import *
import matplotlib.pyplot as plt
plt.style.use('ggplot')
# based on "AtomicSolid" by Bruce Sherwood
# adapted to include realtime matplotlib by Wayne Decatur