Skip to content

Instantly share code, notes, and snippets.

@alyssafrazee
Created December 9, 2014 02:57
Show Gist options
  • Save alyssafrazee/46ebca5d1c822618519a to your computer and use it in GitHub Desktop.
Save alyssafrazee/46ebca5d1c822618519a to your computer and use it in GitHub Desktop.
tutorial stuff: ballgown & polyester

Using Ballgown and Polyester

Ballgown

Reading data

  • ballgown creates a ballgown object from tablemaker output
  • ballgownrsem creates a ballgown object from RSEM output. (not yet well-tested).
  • gffRead and gffReadGR read GTF (annotation) files into R
    • gffRead gives you a data frame
    • gffReadGR gives you a GRanges object
    • gffReadGR with splitByTranscript=TRUE gives you a GRangesList of transcripts in the GTF file

statistical tests

The stattest function:

  • can give it either a full ballgown object, or a feature-by-sample matrix (see argument gowntable)
  • BE CAREFUL WITH PHENOTYPE DATA ORDER
  • Returns data frame: feature ID, p-values, and q-values. No nonsense. Optionally returns coefficients (fold changes) for a 2-group comparison.
  • can do timecourse tests automatically
  • you can specify your own model matrices
  • automatically library-adjusts and log transforms, but you can turn this off.

extracting expression estimates

  • functions that end in expr extract expression matrices. There are default units for each of them:
    • texpr(bg) gives you an FPKM transcript matrix
    • eexpr(bg) gives you an exon rcount matrix (read counts)
    • iexpr(bg) gives you an intron rcount matrix (# reads supporting the intron)
  • adding the all tag to a call to an expr function gives you genomic location data too.
    • texpr(bg, 'all') is particularly informative (contains gene IDs, transcript lengths, etc)

assembly structure

  • structure(bg)$trans, structure(bg)$exon, and structure(bg)$intron give assembly structures in GRangesList (transcripts) or GRanges (exon/intron) form
  • tGene finds the gene to which a transcript belongs. (such grammar wow)
  • transcriptsIDs(bg), transcriptNames(bg) give sets of transcript IDs for bg object.
  • geneIDs(bg), geneNames(bg) give sets of gene IDs for bg object.
  • seqnames(bg) gives the chromosome names included in the object

indexes

  • indexes(bg) is list, so use $
  • indexes(bg)$t2g, indexes(bg)$i2t, indexes(bg)$e2t connect exons, introns, transcripts, genes.
  • indexes(bg)$bamfiles can contain paths to bam files. Probably wise to have this in same order as pData, but it's not as crucial since no existing functions depend on this.

filtering

  • exprfilter subsets a ballgown object based mean transcript expression (FPKM or cov)
  • subset subsets a ballgown object based either on genomic location or on phenotype

phenotype data

pData(bg) is the phenotype component of a ballgown object.
SUPER IMPORTANT BUSINESS: make SURE the columns of pData are in the same order as the columns of texpr(bg). I've tried to implement some automatic checks, but I'm sure they are not perfect.

annotation

  • getGenes was designed to label assembled transcripts with annotated gene names
  • annotate_assembly matches two sets of transcripts based on overlap & calculates percents.
  • checkAssembledTx plots assembled & annotated transcripts together
  • contains checks whether one set of GRanges is fully contained within another. (I found this useful for determining whether a set of assembled transcripts contained coding sequences).

transcript clustering

  • clusterTranscripts does k-means or hierarchical clustering of the transcripts within a gene
  • pctOverlap calculates the percent overlap between two GRanges objects
  • collapseTranscripts actually calculates cluster-level expression

visualization

  • plotTranscripts plots all the isoforms of a specific gene
  • plotMeans plots the isoform structure for a gene but makes a multipanel plot, one for each phenotype group, and colors the isoforms by mean abundance.
  • plotLatentTranscripts does transcript clustering and plots the results

Polyester

Requirements:

  • One of:
    • fasta file containing annotated transcripts
    • GTF file containing transcript structures + chromosome sequences. (check out the igenomes indexes).
  • specify groups / fold changes.

That's all you need - Polyester can automate the rest.

Workhorse

Can do simple simulations with one function: simulate_experiment. It's fairly well-documented (in my completely biased opinion -- please let me know if it's confusing).

Internal functions

You can play around with different parts of the simulation using the internal functions. They're all exported.

  • generate_fragments
  • reverse_complement
  • get_reads
  • add_error and add_platform_error

You can also add a bunch of biases to your simulation:

  • GC bias (not on github yet, coming ASAP)
  • positional bias (in generate_fragments)
  • non-normal fragment length distribution (in generate_fragments)'
  • non-uniform error model (using add_platform_error instead of add_error). In my experience this is a bit slow.

I don't have all the features fully wrapped into simulate_experiment yet so usability isn't quite guaranteed for the new stuff yet. I promise I'll put them in soon (definitely before the spring Bioconductor release, and hopefully well before then).

Custom read matrix

The other major feature of Polyester is that you can just define a read count matrix (number of reads to simulate per transcript per sample) and simulate that exact number of reads per transcript. The function is simulate_experiment_countmat.

another fun secret

I have some code for ROC curves hidden in the ballgown package ballgown:::assessSim. (It's too specific to export, but code for calculating sensitivity/specificity for an annotated/assembled simulation is a huge pain to write, so I stuck it in Ballgown. The sim_tx_info.txt file output by Polyester is basically the exact input you need for this).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment