@lh3
Last active September 10, 2016 02:26

Introduction

GFA stands for Graphical Fragment Assembly format. It is a TAB-delimited text format for describing the relationships between sequences. Although initially designed for sequence assembly, GFA may also represent variations in genomes and splice graphs in genes.

For example, in the picture below, each line is a nucleotide sequence, with the arrow indicating its orientation:

image

Assuming each sequence is 200bp in length and each overlap is 20bp in length, we can encode this graph in the GFA format as:

S  A  200  *
S  B  200  *
S  C  200  *
S  D  200  *
L  A  +  B  +  20  20
L  A  +  C  -  20  20
L  B  +  D  +  20  20
L  C  -  D  +  20  20

where each S-line, or segment line, gives the properties of a sequence, including its length and actual nucleotide sequence, and each L-line, or link line, describes the relationship between two segments. There are two common ways to interpret an L-line. Take L A + C - as an example. First, in the overlap graph view, the L-line indicates that sequence A on the forward strand is ahead of C on the reverse strand. Second, in the string graph view, the end of A transits to the start of C (i.e. + denotes the end of a sequence and - the start). In the following, we usually take the overlap graph view for convenience.

Notably, if L A + C - is a link, L C + A - is also a link because the two are equivalent:

image

Terminology

  • Segment: a sequence. An oriented segment is a 2-tuple (segment,strand). Each oriented segment has a complement oriented segment (segment,¬strand), where operator ¬ gives the opposite strand.

  • Link: a (full) dovetail overlap between two oriented segments. A link is directed. Each link has a complement link, derived by swapping the order of the segments and flipping their orientations. For the definition of a dovetail overlap, see the documentation from GRC or from wgs-assembler.

  • Gap: an unknown sequence connecting the ends of two oriented segments.

  • Match: a local alignment between two oriented segments.

Mathematically, GFA models a skew-symmetric graph, where each vertex is an oriented segment (in the overlap graph view) or the 5'- or 3'-end of a segment (in the string graph view), and each directed edge is a link in GFA.
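The overlap graph view can be sketched in a few lines of Python. This is a minimal illustration, assuming the four links from the introductory example; the names `complement`, `build_graph`, and `is_skew_symmetric` are hypothetical helpers, not part of GFA. Each vertex is an oriented segment `(sid, ori)`, and storing the complement of every link makes the edge set skew-symmetric.

```python
# Illustrative sketch only: helper names are not defined by GFA.
FLIP = {"+": "-", "-": "+"}

def complement(edge):
    """Complement of a directed edge: swap the segments and flip both strands."""
    (sid1, ori1), (sid2, ori2) = edge
    return ((sid2, FLIP[ori2]), (sid1, FLIP[ori1]))

def build_graph(links):
    """Return the edge set containing every link together with its complement."""
    edges = set()
    for edge in links:
        edges.add(edge)
        edges.add(complement(edge))
    return edges

def is_skew_symmetric(edges):
    """True if the complement of every edge is also present."""
    return all(complement(e) in edges for e in edges)

# The four L-lines of the introductory example, as oriented-segment pairs.
links = [
    (("A", "+"), ("B", "+")),
    (("A", "+"), ("C", "-")),
    (("B", "+"), ("D", "+")),
    (("C", "-"), ("D", "+")),
]
edges = build_graph(links)
```

Here the link A+ → C- yields the complement edge C+ → A-, matching the equivalence noted in the introduction.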

GFA: Mandatory Fields

In GFA, each line is TAB-delimited and describes only one type of data. On each line, the leading letter indicates the data type and defines the mandatory fields on that line. The following table gives an overview of different line types in GFA:

Line  Type     col1  col2  col3  col4  col5   col6   col7
H     Header
S     Segment  sid   slen  seq
L     Link     sid1  ori1  sid2  ori2  olen1  olen2
G     Gap      sid1  ori1  sid2  ori2  dist
M     Match    sid1  ori   sid2  beg1  end1   beg2   end2

In the table, sid* are strings, slen, olen*, and dist are 32-bit non-negative integers, and ori* take values of + or -; beg* may be an integer or ^ for the start of a segment; end* may be an integer or $ for the end of a segment.
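The field rules above could be checked with a small parser along the following lines. This is a sketch under stated assumptions, not a reference implementation: `parse_line` and the `MANDATORY` table are hypothetical names, columns beyond the mandatory ones are ignored (optional fields are not yet specified), and the 32-bit upper bound is not enforced.

```python
import re

# Mandatory fields per line type, following the table above.
MANDATORY = {
    "H": [],
    "S": ["sid", "slen", "seq"],
    "L": ["sid1", "ori1", "sid2", "ori2", "olen1", "olen2"],
    "G": ["sid1", "ori1", "sid2", "ori2", "dist"],
    "M": ["sid1", "ori", "sid2", "beg1", "end1", "beg2", "end2"],
}
INT_FIELDS = {"slen", "olen1", "olen2", "dist"}

def parse_line(line):
    """Parse one TAB-delimited GFA line into a dict of its mandatory fields."""
    cols = line.rstrip("\n").split("\t")
    kind = cols[0]
    if kind not in MANDATORY:
        raise ValueError(f"unknown line type: {kind!r}")
    names = MANDATORY[kind]
    values = cols[1:1 + len(names)]  # assumption: extra columns are ignored
    if len(values) < len(names):
        raise ValueError(f"{kind}-line needs {len(names)} fields")
    record = {"type": kind}
    for name, value in zip(names, values):
        if name.startswith("ori"):
            if value not in ("+", "-"):
                raise ValueError(f"{name} must be + or -, got {value!r}")
        elif name in INT_FIELDS or name.startswith(("beg", "end")):
            # beg* may be ^ and end* may be $; other integer fields have no symbol.
            symbol = {"beg": "^", "end": "$"}.get(name[:3])
            if value != symbol:
                if not re.fullmatch(r"[0-9]+", value):
                    raise ValueError(f"{name} must be a non-negative integer")
                value = int(value)
        record[name] = value
    return record
```

For instance, `parse_line("S\tA\t200\t*")` returns a record with `slen` converted to the integer 200 and `seq` left as `*`.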

Segment line

A segment line or S-line takes the following format:

S	<sid>	<slen>	<seq>

where sid is the segment name, slen is the length of the segment, and seq is the sequence, which can be * if not available.

Link line

A link line or L-line is

L	<sid1>	<ori1>	<sid2>	<ori2>	<olen1>	<olen2>

where sid1/sid2 are the names of the segments involved in the link, ori1/ori2 are their orientations (either + or -), and olen1/olen2 are the lengths of the overlap on the first and second segment, as shown in the following figure (o1 and o2 in the figure correspond to olen1 and olen2, respectively):

image

In GFA, each L-line has a complement L-line, which is

L	<sid2>	¬<ori2>	<sid1>	¬<ori1>	<olen2>	<olen1>
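The complement rule above is mechanical enough to express directly. The following Python sketch assumes the seven-column L-line layout given in this section; `complement_link` is a hypothetical helper name, and any columns after the seventh are dropped because optional fields are not yet defined.

```python
# Illustrative sketch: derive the complement of an L-line as text.
FLIP = {"+": "-", "-": "+"}

def complement_link(line):
    """Return the complement L-line: swap segments, flip strands, swap overlaps."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 7 or cols[0] != "L":
        raise ValueError("expected an L-line with six mandatory fields")
    _, sid1, ori1, sid2, ori2, olen1, olen2 = cols[:7]
    return "\t".join(["L", sid2, FLIP[ori2], sid1, FLIP[ori1], olen2, olen1])
```

Applying the function twice returns the original line, which reflects that a link and its complement describe the same overlap.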

Gap line

A gap line or G-line is

G	<sid1>	<ori1>	<sid2>	<ori2>	<dist>

where dist is the best estimate of the distance between the ends of the two segments. A G-line is effectively an L-line with negative overlap length.

Match line

A match line or M-line is

M	<sid1>	<ori>	<sid2>	<beg1>	<end1>	<beg2>	<end2>

where sid1/sid2 are the names of segments, [beg1,end1) gives the interval on the forward strand of sid1 and [beg2,end2) gives the interval on the ori strand of sid2.

Each M-line also has a complement M-line.
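Because beg* and end* may be the symbols ^ and $, a reader of M-lines needs to resolve them against the segment length before using the interval. The sketch below is one possible reading, not a settled rule: it assumes 0-based half-open coordinates, with ^ meaning 0 and $ meaning slen, and `resolve_interval` is a hypothetical helper name. The M-line format itself is still listed as open in the TODO section.

```python
def resolve_interval(beg, end, slen):
    """Resolve an M-line [beg, end) pair to integers.

    Assumption: coordinates are 0-based, ^ stands for 0 and $ for slen.
    """
    b = 0 if beg == "^" else int(beg)
    e = slen if end == "$" else int(end)
    if not 0 <= b <= e <= slen:
        raise ValueError(f"invalid interval [{b},{e}) on a segment of length {slen}")
    return b, e
```

For a 200bp segment, `resolve_interval("^", "$", 200)` covers the whole segment and `resolve_interval("150", "$", 200)` covers its last 50 bases.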

TODO

  1. Optional fields
  2. Paths
  3. Whether to store both a link and its complement link
  4. Nail down the format of M-line