Regular expressions intro and examples

These examples of regular expressions are taken largely from our book Practical Computing for Biologists. More information is available at http://practicalcomputing.org.

This document can be accessed via the delightful GitHub URL shortener at https://git.io/dine

Given the following names:

Agalma elegans
Frillagalma vitiazi
Cordagalma tottoni
Shortia galacifolia
Mus musculus

Challenge: Shorten to A. elegans format

A literal search and replace of galma with . is not enough: only Agalma elegans comes out right.
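To see why the literal replacement falls short, here is a minimal Python sketch (Python is an assumption here; the notes themselves are editor-agnostic). It applies a plain string replace to the five names above.

```python
names = [
    "Agalma elegans",
    "Frillagalma vitiazi",
    "Cordagalma tottoni",
    "Shortia galacifolia",
    "Mus musculus",
]

# Literal search and replace: swap the text "galma" for "."
naive = [name.replace("galma", ".") for name in names]

for before, after in zip(names, naive):
    print(f"{before} -> {after}")
```

Only the first name ends up in the target format. The genus prefixes Frilla and Corda survive, and the last two names contain no galma at all, so they are left unchanged.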

###############################################################

Introduce the \w wildcard (any single letter, digit, or underscore) to delete compass directions

+40 46'N +014 15'E
+21 17'N -157 52'W

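Try searching for \w first: it matches the digits as well as the letters.

Then try searching for '\w and replacing it with ' so that only the letter after each apostrophe is removed.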
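The two attempts can be compared side by side. This is a sketch using Python's re module, chosen here only for illustration; the same patterns work in a text editor's regex search and replace.

```python
import re

coords = ["+40 46'N +014 15'E", "+21 17'N -157 52'W"]

# \w alone matches every letter, digit, and underscore, so the numbers vanish too
too_greedy = [re.sub(r"\w", "", line) for line in coords]

# Including the apostrophe restricts the match to the direction letter after it
cleaned = [re.sub(r"'\w", "'", line) for line in coords]

print(too_greedy)  # ["+ ' + '", "+ ' - '"]
print(cleaned)     # ["+40 46' +014 15'", "+21 17' -157 52'"]
```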

###############################################################

5th
3rd
2nd
4th

Introduce capture ()

Reduce to just numbers:

Search: (\w)\w\w
Replace: \1
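A short Python sketch of the capture step, assuming the re module as the regex engine. The parentheses capture the first character, and \1 in the replacement puts it back while the two uncaptured characters are dropped.

```python
import re

ordinals = ["5th", "3rd", "2nd", "4th"]

# (\w) captures the digit; the following \w\w matches the two-letter suffix
numbers = [re.sub(r"(\w)\w\w", r"\1", item) for item in ordinals]

print(numbers)  # ['5', '3', '2', '4']
```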

###############################################################

Revisit original challenge:

Agalma elegans
Frillagalma vitiazi
Cordagalma tottoni
Shortia galacifolia
Mus musculus

Challenge: Shorten to A. elegans format

Introduce the + quantifier (one or more of the preceding item)

Search: (\w)\w+ (\w+)
Replace: \1. \2
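Applied to the five names, the pattern can be checked with a small Python sketch (the re module is an assumption; any engine with \w, + and numbered captures should behave the same way). The first capture keeps the genus initial, \w+ consumes the rest of the genus, and the second capture keeps the species name.

```python
import re

names = [
    "Agalma elegans",
    "Frillagalma vitiazi",
    "Cordagalma tottoni",
    "Shortia galacifolia",
    "Mus musculus",
]

# \1 is the first letter of the genus, \2 is the whole species epithet
short = [re.sub(r"(\w)\w+ (\w+)", r"\1. \2", name) for name in names]

for name in short:
    print(name)
```

Unlike the literal galma replacement, this handles every name, including Shortia galacifolia and Mus musculus.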

###############################################################

Introduce the . wildcard (any character) and the \ escape: \. and \[ match a literal . and [
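The difference between an unescaped and an escaped period can be sketched in Python as follows. The accession string is borrowed from the exercise data; the underscore replacement is just an illustrative choice.

```python
import re

accession = "CAA58790.1"

# Unescaped . matches any character, so every character is replaced
wildcard = re.sub(r".", "_", accession)

# \. matches only a literal period
literal = re.sub(r"\.", "_", accession)

print(wildcard)  # __________
print(literal)   # CAA58790_1
```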

Exercise 1

>CAA58790.1= green fluorescent protein [Aequorea victoria]
MSKGEELFTGVVPILVELDGDVNGQKFSVRGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKQHDFLKSAMPEGYVQERTIFYKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKMEYNYNSHNVYIMGDKPKNGIKVNFKIRHNIKDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSQDPHGKRDHMVLLEFVTSAGITHGMDELYK
>AAZ67342.1= GFP-like red fluorescent protein [Corynactis californica]
MSLSKQVLPRDVKMRYHMDGCVNGHQFIIEGEGTGKPYEGKKILELRVTKGGPLPFAFDILSSVFTYGNRCFCEYPEDMPDYFKQSLPEGHSWERTLMFEDGGCGTASAHISLDKNCFVHKSTFHGVNFPANGPVMQKKTLNWEPSSELITAGDGILKGDVTMFLMLEGGHRLKCQFTTSYKAKKAVKMPPNHIIEHRLVRKEVADAVQIQEHAVAKHFIV
>ACX47247.1= green fluorescent protein [Haeckelia beehleri]
MEFEPEFFNKPVPLEMTLRGCVNGKEFMIFGKGEGDASKGNIKGKWILSHSEDGKCPMSWAVLAPTFAYGFKVFAKYPKDFAHFWQDCMPVGYSERRITRFGRLSGNDDIEQEGIMNTYHEVQMRERMVGDEITWIVESRVKLDATINENSPILMNDGLSEYRPNLERTVSFEDGLKNYSQFFYPIKDCETKDYIIANQMTHERPLSKCNKPGRLPPSHFKRTDLEQWKDSKEDKDHIVQEEITAFLLQAQDKDLQSLGIGM
>ABC68474.1= red fluorescent protein [Discosoma sp. RC-2004]
MRSSKNVIKEFMRFKVRMEGTVNGHEFEIEGEGEGRPYEGHNTVKLKVTKGGPLPFAWDILSPQFQYGSKVYVKHPADIPDYKKLSFPEGFKWERVMNFEDGGVVTVTQDPSLQDGCFIYKVKFIGVNFPSDGPVMQKKTMGWEASTERLYPRDGVLKGEIHKALKLKDGGHYLVEFKTIYMAKKPVQLPGYYYVDSKLDITSHNKDYTIVEQYERTEGRHHLFLKAELGSNVGER
>AAQ01183.1= green fluorescent protein 1 [Pontellina plumata]
MPAMKIECRISGTLNGVVFELVGGGEGIPEQGRMTNKMKSTKGALTFSPYLLSHVMGYGFYHFGTYPSGYENPFLHAANNGGYTNTRIEKYEDGGVLHVSFSYRYEAGRVIGDFKVVGTGFPEDSVIFTDKIIRSNATVEHLHPMGDNVLVGSFARTFSLRDGGYYSFVVDSHMHFKSAIHPSILQNGGSMFAFRRVEELHSNTELGIVEYQHAFKTPTAFA

Challenge: Convert the headers from the format:

>CAA58790.1= GFP [Aequorea victoria]

To:

>CAA58790_Aequorea

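Search: (>\w+).+\[(\w+) \w+\]
Replace: \1_\2

Note: the header ending in [Discosoma sp. RC-2004] is left unchanged by this pattern, because the text after the genus is not a single run of word characters. Using .+ in place of the final \w+, as in (>\w+).+\[(\w+) .+\], covers that header as well.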
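A minimal Python sketch of the header conversion, using the search pattern (>\w+).+\[(\w+) \w+\] and the replacement \1_\2 on two of the records. The sequences are truncated here for brevity, and the variable names are illustrative only.

```python
import re

# Two records from the exercise, sequences shortened for the example
fasta = """>CAA58790.1= green fluorescent protein [Aequorea victoria]
MSKGEELFTGVVPILVELDGDVNGQKF
>AAZ67342.1= GFP-like red fluorescent protein [Corynactis californica]
MSLSKQVLPRDVKMRYHMDGCVNGHQF"""

# (>\w+) keeps ">" plus the accession up to the period,
# .+ skips the description, \[ and \] are literal brackets,
# and (\w+) keeps the genus
pattern = re.compile(r"(>\w+).+\[(\w+) \w+\]")
converted = pattern.sub(r"\1_\2", fasta)

print(converted)
```

Because . does not match a newline by default, each match stays within a single header line, and the sequence lines pass through untouched since they contain no > or brackets.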