@averagehat
Last active October 6, 2015 17:01
FASTA parsing using the Clojure instaparse library
(ns exmaple-stuff.core
  (:require [instaparse.core :as insta]
            [clojure.string :as str] ; needed by alphabet->rule (str/join)
            [clojure.test :refer :all]))
; Wrapping an element in <> excludes it from the parse tree.
; A space represents concatenation; everything else reads much like regex syntax.
; This grammar allows a sequence to contain arbitrary newlines but no spaces.
; Because only the listed letters can match, it also validates the alphabet.
; A regex that does not capture the newline might be better than the sequence rule,
; because it wouldn't require joining the string later, but regexes are greedy.
(def raw-parse (insta/parser
"file : (record <'\n'>)* record <'\n'?>
record : id <'\n'> sequence
id : <'>'> #'[^\n]+'
sequence : ('A' | 'G' | 'C' | 'T' | <'\n'>)+"))
;; Rejoin the sequence fragments into one string but keep the :sequence tag.
;; If the tag is not needed, `(insta/transform {:sequence str} tree)` would suffice.
(def fix-seq (partial insta/transform {:sequence (comp #(vector :sequence %) str)}))
(def parse (comp fix-seq raw-parse))
;; extractors
(defn rule->vals [k t]
  (let [is-key #(= (first %) k)]
    (->> (tree-seq vector? next t)
         (filter is-key)
         (map second))))
(def get-seqs (partial rule->vals :sequence))
(def get-ids (partial rule->vals :id))
;; tests
(comment
(is (= (->> (parse
">foo|boo|roo
ACTGATG
ACGAGAGT
") get-seqs first) "ACTGATGACGAGAGT"))
(is (insta/failure?
(raw-parse
">foo|boo|roo
ACGT
>foo|boo|roo
junkCFGG" )))
(is (= (->> (parse
">foo
ACTGATG
>foo|roo2
ACTGATG" ) get-ids) '("foo" "foo|roo2"))))
;; helper for arbitrary alphabets
(defn alphabet->rule
  "Create the body of an instaparse rule from a seq of letters."
  [alphabet]
  (let [wrap #(str \( % ")+")
        ;; quote each letter, then add the hidden-newline alternative
        aas (-> (map #(str \' % \') alphabet) (conj "<'\\n'>"))]
    (wrap (str/join " | " aas))))
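;; A minimal sketch of how alphabet->rule might be plugged into the grammar above.
;; `make-parser` is a hypothetical helper, not part of the original gist. It assumes
;; the same file/record/id rules as raw-parse and replaces only the sequence rule.
;; Newlines are written as \\n here so that instaparse sees the escape itself,
;; matching the "<'\\n'>" literal that alphabet->rule emits.
(defn make-parser
  "Build a FASTA parser whose sequence rule accepts only letters from `alphabet`."
  [alphabet]
  (insta/parser
    (str "file : (record <'\\n'>)* record <'\\n'?>\n"
         "record : id <'\\n'> sequence\n"
         "id : <'>'> #'[^\\n]+'\n"
         "sequence : " (alphabet->rule alphabet))))
;; Example use (RNA alphabet); the result can be passed through fix-seq as before:
;; (def parse-rna (comp fix-seq (make-parser "ACGU")))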