Greg Caporaso gregcaporaso

gregcaporaso / README.md
Created October 27, 2012 00:09
A quick and dirty script for generating "positive control" data for the PICRUST project

Notes by Greg Caporaso (gregcaporaso@gmail.com)

Analysis goals

From email to picrust-developers on 4 Oct 2012:

  1. Filter the HMP (not HMP-mock) data set to ~50-100k sequences at random to form a filtered dataset (for decreased run time).

  2. Select ~12 of the most abundant IMG-defined OTUs from the HMP, and slice the reference sequence to the amplified region in that dataset. "IMG-defined" here means that we have an IMG genome attached to the OTU, as opposed to the Greengenes-defined OTUs, where we don't have a genome for that specific OTU. "Most abundant" will be somewhat arbitrary - I'm thinking something like 12 random IMG-defined OTUs from the 25% most abundant OTUs in the dataset.
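
The two steps above could be sketched in Python roughly as follows. This is a minimal illustration and not the script from the gist: the function names, the `otu_counts` mapping, and the `img_otu_ids` set are hypothetical, and the defaults of 12 and 25% simply mirror the numbers floated in the email.

```python
import random


def subsample_ids(seq_ids, n, seed=None):
    """Return up to n sequence ids chosen uniformly at random, without replacement."""
    rng = random.Random(seed)
    seq_ids = list(seq_ids)
    return rng.sample(seq_ids, min(n, len(seq_ids)))


def pick_abundant_img_otus(otu_counts, img_otu_ids, n=12, top_fraction=0.25, seed=None):
    """Pick n random IMG-defined OTUs from the top_fraction most abundant OTUs.

    otu_counts  -- dict mapping OTU id -> total count (hypothetical input layout)
    img_otu_ids -- set of OTU ids that have an IMG genome attached (hypothetical)
    """
    rng = random.Random(seed)
    # rank OTU ids from most to least abundant
    ranked = sorted(otu_counts, key=otu_counts.get, reverse=True)
    n_top = max(1, int(len(ranked) * top_fraction))
    # keep only the abundant OTUs that are IMG-defined
    candidates = [otu for otu in ranked[:n_top] if otu in img_otu_ids]
    return rng.sample(candidates, min(n, len(candidates)))
```

Passing a `seed` makes the random selection repeatable, which may help when the same "positive control" set has to be regenerated later.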

@gregcaporaso
gregcaporaso / Lecture20.ipynb
Last active October 12, 2015 07:47
IPython Notebook files used in Greg Caporaso's Fall 2012 BIO599 Computational Biology course. See the included README.md file for more details and licensing information.
@gregcaporaso
gregcaporaso / README.md
Created November 17, 2012 03:28
quick and dirty script to create a barcode read fastq file from a sequence read fastq file with barcodes in the headers

USAGE: extract_fastq_barcodes_from_header.py input_reads.fastq barcode_reads.fastq
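
For illustration, the core of such a script might look like the sketch below. It is not the gist's code and rests on stated assumptions: each read is a strict four-line FASTQ record, the barcode is the text after the last `:` in the header (as in headers ending `1:N:0:TCCAGTCG`), and, because a header carries no quality scores for the barcode, a placeholder quality character is written. The function names and the `placeholder_qual` parameter are hypothetical.

```python
def barcode_records(lines, placeholder_qual="I"):
    """Yield FASTQ records whose sequence is the barcode parsed from each header.

    Assumes strict 4-line records and headers ending in ':<barcode>'.
    The quality string is a placeholder, not real per-base quality.
    """
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header.strip():
            continue  # tolerate blank lines between or after records
        # skip the sequence, '+', and quality lines of the read record
        for _ in range(3):
            next(it, None)
        barcode = header.rsplit(":", 1)[-1]
        yield "\n".join([header, barcode, "+", placeholder_qual * len(barcode)])


def write_barcode_fastq(input_fp, output_fp):
    """Read a sequence FASTQ file and write the matching barcode FASTQ file."""
    with open(input_fp) as reads, open(output_fp, "w") as out:
        for record in barcode_records(reads):
            out.write(record + "\n")
```

Calling `write_barcode_fastq("input_reads.fastq", "barcode_reads.fastq")` would correspond to the two positional arguments in the usage line above.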

@gregcaporaso
gregcaporaso / jgc53_coordinates.kml
Created November 20, 2012 23:25
A small example of the output for Programming Assignment 3 (Greg Caporaso's Fall 2012 BIO 299 course)
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
<Document>
<Placemark>
<name>HCanyon10R1</name>
<description>HCanyon10R1</description>
<Point>
<coordinates>-110.40590,+37.33006</coordinates>
</Point>
</Placemark>
@gregcaporaso
gregcaporaso / glob_example.py
Created November 24, 2012 17:00
Example of using glob to compile a list of filepaths
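from glob import glob

# collect every path in the current directory whose name ends in 'txt'
filepaths = glob('*txt')
for filepath in filepaths:
    # tip (Python 2): open files for reading with mode 'U' rather than 'r'.
    # In Python 3, universal newlines are the default and 'U' was removed in 3.11.
    with open(filepath) as f:
        pass  # do whatever with the open file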
@gregcaporaso
gregcaporaso / ucrss_fast_params.txt
Created December 2, 2012 17:17
Parameters file for running the subsampled OTU picking workflow in 'fast' mode
pick_otus:enable_rev_strand_match True
pick_otus:max_accepts 1
pick_otus:max_rejects 8
pick_otus:stepwords 8
pick_otus:word_length 8
@gregcaporaso
gregcaporaso / README.md
Created December 3, 2012 20:35
first pass at code for filtering OTUs that show up in negative control samples

Script for filtering OTUs that show up in negative control samples. This is a first pass at testing a process that was developed for the Student Microbiome Project. The effect of this filtering has not been investigated in detail, so use at your own risk.

This script works as follows:

  1. Filter input OTU table to contain only the control samples (as indicated by the -s parameter)
  2. Compute the median or mean (specified with --abundance_f) abundance of each OTU in the control samples. Generate a list of OTUs where this value is >= the minimum abundance (specified with --min_abundance).
  3. Filter the OTUs identified in Step 2 from the input OTU table.
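
The three steps above can be sketched in plain Python as follows. This is a hedged illustration rather than the script itself: the OTU table is assumed to be a nested dict (OTU id -> sample id -> count), and the function names are hypothetical. Only the roles of `abundance_f` and `min_abundance` are taken from the description.

```python
from statistics import mean, median


def otus_in_controls(otu_table, control_sample_ids, abundance_f, min_abundance):
    """Return the OTU ids whose abundance in the control samples is >= min_abundance.

    otu_table   -- dict of OTU id -> dict of sample id -> count (assumed layout)
    abundance_f -- summary function applied to the control counts, e.g. median or mean
    """
    flagged = set()
    for otu_id, counts in otu_table.items():
        # Step 1: restrict the OTU's counts to the control samples
        control_counts = [counts.get(s, 0) for s in control_sample_ids]
        # Step 2: flag the OTU if its summarized abundance reaches the threshold
        if control_counts and abundance_f(control_counts) >= min_abundance:
            flagged.add(otu_id)
    return flagged


def filter_control_otus(otu_table, control_sample_ids, abundance_f, min_abundance):
    """Step 3: drop the flagged OTUs from the full input table."""
    flagged = otus_in_controls(otu_table, control_sample_ids, abundance_f, min_abundance)
    return {otu: counts for otu, counts in otu_table.items() if otu not in flagged}
```

For example, `filter_control_otus(table, ["c1", "c2"], median, 1)` would remove every OTU whose median count across the two control samples is at least 1.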
@gregcaporaso
gregcaporaso / DemultiplexSummaryF1L1.txt
Created December 7, 2012 21:15
Very quick and dirty script to map some problematic barcodes from an EnGGen MiSeq run
### Most Popular Index Sequences
### Columns: Sequence ReverseComplement HitCount
.CCA.TCG CGA.TGG. 4190556 TCCAGTCG 2 TGTATGCG TCCAGTCG TACTTCGG TTCCTGCT TGCGATCT TTGACTCT TGCATAGT
.ACT.CGG CCG.AGT. 3867426 TACTTCGG 3
.CCAGTCG CGACTGG. 2761048 TCCAGTCG 2
.GTA.GCG CGC.TAC. 2595270 TGTATGCG 1
.ACTTCGG CCGAAGT. 2415896 TACTTCGG 3
.GTATGCG CGCATAC. 1570629 TGTATGCG 1
.CCA.TC. .GA.TGG. 589625 TCCAGTCG 2
TCCA.TCG CGA.TGGA 564313 TCCAGTCG 2
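
A minimal sketch of how index reads like these could be mapped back to known barcodes is shown below. It assumes that `.` marks a position with no base call and treats it as a wildcard, and that the barcodes listed at the end of the first data row are the expected barcode set. Both are assumptions, and the function names are hypothetical rather than taken from the gist's script.

```python
_COMPLEMENT = str.maketrans("ACGT.", "TGCA.")


def reverse_complement(seq):
    """Reverse-complement a sequence, leaving '.' (assumed no-call) unchanged."""
    return seq.translate(_COMPLEMENT)[::-1]


def matching_barcodes(observed, known_barcodes, wildcard="."):
    """Return the known barcodes consistent with an observed index read.

    A wildcard position in the observed read matches any base.
    """
    return [
        bc for bc in known_barcodes
        if len(bc) == len(observed)
        and all(o == wildcard or o == b for o, b in zip(observed, bc))
    ]


# barcodes copied from the first data row above (assumed to be the expected set)
KNOWN = ["TGTATGCG", "TCCAGTCG", "TACTTCGG", "TTCCTGCT",
         "TGCGATCT", "TTGACTCT", "TGCATAGT"]
```

An observed read is only safe to reassign when `matching_barcodes` returns exactly one candidate; zero or several matches would leave it ambiguous.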
@gregcaporaso
gregcaporaso / partition_sequences.py
Created January 2, 2013 15:33
Given an input sequence file, splits sequences randomly into n different files. This is useful for generating files to test computationally expensive analysis processes: the process can be run iteratively on each input sequence set, while also providing preliminary results based on random su…
#!/usr/bin/env python
# File created on 02 Jan 2013
from __future__ import division
__author__ = "Greg Caporaso"
__copyright__ = "Copyright 2011, The QIIME project"
__credits__ = ["Greg Caporaso"]
__license__ = "GPL"
__version__ = "1.6.0"
__maintainer__ = "Greg Caporaso"
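
The preview above is cut off before the partitioning logic. As a rough, self-contained sketch of the idea in the description (not the original QIIME-based code), the following assumes FASTA input and assigns each sequence to one of n partitions uniformly at random. `parse_fasta` and `partition_sequences` are hypothetical names.

```python
import random


def parse_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def partition_sequences(records, n, seed=None):
    """Assign each record to one of n partitions uniformly at random."""
    rng = random.Random(seed)
    partitions = [[] for _ in range(n)]
    for record in records:
        partitions[rng.randrange(n)].append(record)
    return partitions
```

Each partition could then be written to its own output file. Because assignment is independent per sequence, partition sizes are only approximately equal.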
@gregcaporaso
gregcaporaso / README.md
Last active December 11, 2015 04:18
Very small sequence collection for use in QIIME tests (under development).

Tiny test data sequence collection for use with QIIME.

This is being compiled to address #582.

Using this data

You can see the output of a few commands by downloading this data and running the cmds.sh shell script from inside the unzipped directory.

Desired properties of the test data