Skip to content

Instantly share code, notes, and snippets.

@iracooke
Last active March 9, 2021 00:04
Show Gist options
  • Star 2 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save iracooke/73c58d11c921fae4fa7c to your computer and use it in GitHub Desktop.
Save iracooke/73c58d11c921fae4fa7c to your computer and use it in GitHub Desktop.
TopGO

Enrichment analysis with topGO

Input data needed;

  • A file orf2go.map for your study organism. It should contain entries like this;

    cds.comp1000018_c0_seq1|m.205055 GO:0005618, GO:0016021, GO:0045330, GO:0004857, GO:0030599, GO:0042545 cds.comp1000154_c0_seq1|m.205060 GO:0030136, GO:0031901, GO:0005768

If you don't have an orf2go.map file but you have a Trinity sqlite file you can generate one using trinotate_proteomics as follows;

	orf_to_go.rb Trinotate.sqlite > orf2go.map
  • A file containing a list of IDs for a specific tissue, or (optionally) with quantitative expression info.

Step 1: Read mappings file

	library(topGO)
	geneID2GO <- readMappings(file = "orf2go.map")

Step2: Create the background set

Note that in this example the file orf2go.map serves 2 purposes. It contains all transcriptID -> GO mappings. But it also contains a complete list of all the transcriptID's. This complete list of transcript ID's serves as the background set for the enrichment analysis. This means that the code to generate the background set is trivial (see below).

	geneNames <- names(geneID2GO)

Step3: Create the sample set

This is the set of ID's in which we will look for enriched GO terms. This set can be any subset of the background, and the set that is chosen affects the interpretation of the analysis. In our proteomics analyses we might look at the top 200 most expressed proteins (by iBAQ). Alternatively the top hits from the differential expression analysis could be used.

	myInterestingGenes=... my code to generate these

It is essential that the items in the myInterestingGenes vector correspond to items within the background ie (geneNames).

	geneList <- factor(as.integer(geneNames %in% myInterestingGenes))
	names(geneList) <- geneNames

Step4: Create GOData object

	GOdata <- new("topGOdata", ontology = "BP", allGenes = geneList,annot = annFUN.gene2GO, gene2GO = geneID2GO)

You might also want to try the other two GO ontologies MF, Molecular Function and CC Cellular Component

Step5: Run analysis

This runs a Fisher's exact test

	resultFis <- runTest(GOdata, algorithm = "classic", statistic = "fisher")

Step6: Look at the results

This creates a plot, but the table allRes is perhaps more useful.

	allRes <- GenTable(GOdata, classic = resultFis,orderBy = "weight", ranksOf = "classic", topNodes = 20)
	showSigOfNodes(GOdata, score(resultFis), firstSigNodes = 5, useInfo = 'all')
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment