ilevantis/bedtools_cheatsheet.md

## bedtools_cheatsheet.md

      
    Bedtools Cheatsheet

General:


| Tools | Description |
| --- | --- |
| flank | Create new intervals from the flanks of existing intervals. |
| slop | Adjust the size of intervals. |
| shift | Adjust the position of intervals. |
| subtract | Remove intervals based on overlaps b/w two files. |
| complement | Extract intervals not represented by an interval file. |
| closest | Find the closest, potentially non-overlapping interval. |
| intersect | Find overlapping intervals in various ways. |
| window | Find overlapping intervals within a window around an interval. |
| cluster | Cluster (but don't merge) overlapping/nearby intervals. |
| merge | Combine overlapping/nearby intervals into a single interval. |
| map | Apply a function to a column for each overlapping interval. |
| groupby | Group by common cols. & summarize oth. cols. (~ SQL "groupBy") |


Formatting:
Notes: BED file format, GFF vs BED indexing


| Tools | Description |
| --- | --- |
| getfasta | Use intervals to extract sequences from a FASTA file. |
| maskfasta | Use intervals to mask sequences from a FASTA file. |
| sort | Order the intervals in a file. |
| bed12tobed6 | Breaks BED12 intervals into discrete BED6 intervals. |
| bamtofastq | Convert BAM records to FASTQ records. |
| bamtobed | Convert BAM alignments to BED (& other) formats. |
| bedpetobam | Convert BEDPE intervals to BAM records. |
| bedtobam | Convert intervals to BAM records. |


Statistics:


| Tools | Description |
| --- | --- |
| jaccard | Calculate the Jaccard statistic b/w two sets of intervals. |
| random | Generate random intervals in a genome. |
| reldist | Calculate the distribution of relative distances b/w two files. |
| shuffle | Randomly redistribute intervals in a genome. |
| makewindows | Makes adjacent or sliding windows across a genome or BED file. |
| nuc | Profile the nucleotide content of intervals in a FASTA file. |


Coverage:


| Tools | Description |
| --- | --- |
| annotate | Annotate coverage of features from multiple files. |
| coverage | Compute the coverage over defined intervals. |
| genomecov | Compute the coverage over an entire genome. |
| multicov | Counts coverage from multiple BAMs at specific intervals. |
| unionbedg | Combines coverage intervals from multiple BEDGRAPH files. |


common flags:


-s, -S : Require same strandedness or opposite strandedness, respectively.
-f, -F : Minimum overlap required as a fraction of A or a fraction of B respectively.
-r, -e : Require that the minimum overlap be satisfied for A AND B, or A OR B respectively.
-split     : Treat "split" BAM or BED12 entries as distinct BED intervals.
-abam      : A is a BAM file.


General

flank, slop

Create new intervals from the flanks of existing intervals. (flank Docs)
Adjust the size of intervals. (slop Docs)
IN           ▓▓▓▓▓       ▓▓▓
Flank      ██     ██   ██   ██
Slop       █████████   ███████

$ bedtools flank [OPTIONS] -i <BED/GFF/VCF> -g <GENOME> [-b or (-l and -r)]
$ bedtools slop [OPTIONS] -i <BED/GFF/VCF> -g <GENOME> [-b or (-l and -r)]


| OPTIONS | . |
| --- | --- |
| -b, -l, -r | Flank/extend regions by x bp on both sides, on the left, or on the right respectively. |
| -s | Define -l and -r based on strand. |
| -pct | Define -l and -r as a fraction of the feature's length. |
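
As a rough illustration of what the two tools compute, here is a minimal Python sketch on 0-based, half-open BED coordinates. The function names `flank` and `slop` and the `chrom_size` argument (standing in for the `-g` genome file) are illustrative, not part of bedtools; `-s` and `-pct` are not modeled, and edge handling in the real tool may differ.

```python
def slop(start, end, left, right, chrom_size):
    """Extend an interval by `left`/`right` bp, clipped to the chromosome."""
    return max(0, start - left), min(chrom_size, end + right)


def flank(start, end, left, right, chrom_size):
    """Return the new intervals lying immediately left and right of the input."""
    out = []
    if left > 0 and start > 0:
        out.append((max(0, start - left), start))
    if right > 0 and end < chrom_size:
        out.append((end, min(chrom_size, end + right)))
    return out


print(slop(100, 200, 10, 10, 1000))   # (90, 210)
print(flank(100, 200, 10, 10, 1000))  # [(90, 100), (200, 210)]
print(slop(5, 200, 10, 10, 205))      # (0, 205), clipped at both chromosome ends
```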


shift

Adjust the position of intervals, while respecting chromosome edges. (Docs).
IN      ██   ██      ████
OUT        ██   ██      ████

$ bedtools shift [OPTIONS] -i <BED/GFF/VCF> -g <GENOME> [-s or (-m and -p)]


| OPTIONS | . |
| --- | --- |
| -s | Number of BPs to shift the features. |
| -m, -p | Number of BPs to shift the features on the - strand or + strand, respectively. |
| -pct | Define -s, -m and -p as a fraction of the feature's length. |


subtract

Remove intervals based on overlaps b/w two files. (Docs)
A        ▓▓▓▓▓▓▓▓▓▓   ▓▓▓     ▓▓▓▓▓▓
B          ▓▓▓▓           ▓▓▓▓▓▓▓  
A sub B  ██    ████   ███        ███

$ bedtools subtract [OPTIONS] -a <BED/GFF/VCF> -b <BED/GFF/VCF>


| OPTIONS | . |
| --- | --- |
| -A | Remove entire feature if any overlap. |
| common | strandedness: -s, -S; overlap: -f, -F; overlap mode: -r, -e |
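
The default behaviour in the diagram (keep whatever part of A is not covered by B) can be sketched in Python as below. This is only a model of the idea on half-open `(start, end)` tuples for a single chromosome; the function name `subtract` is illustrative and none of the flags are modeled.

```python
def subtract(a, b_intervals):
    """Return the parts of interval `a` not covered by any interval in `b_intervals`."""
    start, end = a
    pieces = []
    for b_start, b_end in sorted(b_intervals):
        if b_end <= start or b_start >= end:
            continue  # this B feature does not touch what is left of A
        if b_start > start:
            pieces.append((start, b_start))  # uncovered stretch before B
        start = max(start, b_end)            # resume after B
        if start >= end:
            break
    if start < end:
        pieces.append((start, end))
    return pieces


print(subtract((0, 10), [(2, 6)]))     # [(0, 2), (6, 10)]
print(subtract((21, 27), [(17, 24)]))  # [(24, 27)]
print(subtract((13, 16), [(17, 24)]))  # [(13, 16)]
```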


complement

Extract intervals not represented by an interval file. (Docs)
IN           ▓▓▓▓▓     ▓▓▓     ▓▓▓▓▓▓
          ▓▓▓▓            ▓▓▓  
OUT  █████        █████      ██

$ bedtools complement -i <BED/GFF/VCF> -g <GENOME>

closest

Find the closest, potentially non-overlapping interval. (Docs)
A            █████   ✓
B   ████            ███   

$ bedtools closest [OPTIONS] -a <FILE> -b <FILE1, FILE2, ..., FILEN>


| OPTIONS | . |
| --- | --- |
| -d | Also report distance from A to the closest feature. |
| -k | Report the k closest hits. Default: 1. |
| -io | Ignore features in B that overlap A. |
| -iu, -id | Ignore features in B that are upstream or downstream, respectively, of features in A. |
| common | strandedness: -s, -S |


intersect

Find overlapping intervals in various ways. (Docs)
A           ██████████
B         ▓▓▓▓    ▓▓        ▓▓▓  
A int B     ▓▓    ▓▓

$ bedtools intersect [OPTIONS] -a <BAM/BED/GFF/VCF> -b <FILE1, FILE2, ..., FILEN>


| OPTIONS | . |
| --- | --- |
| -wa, -wb | Write the original entry in A/original entry in B, respectively, for each overlap. |
| -loj | For each feature in A report each overlap with B. Report a NULL feature for B if no overlap. |
| -wao | Report A and B features and no. of bp overlap between them. |
| -u | Only report each overlapping A feature once. |
| -c | For each entry in A, report count of overlapping B features. |
| -v | Only report features in A not overlapping B. |
| common | strandedness: -s, -S; overlap: -f, -F; overlap mode: -r, -e; bam/bed12: -abam, -split |
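
To make the overlap test concrete, here is a small Python sketch using the coordinates from the diagram (half-open intervals, one chromosome). The function name `intersect` is illustrative; the comments relating the result to `-c` and `-v` are a simplified reading of those flags, not a reimplementation.

```python
def intersect(a, b_intervals):
    """Default-style output: the overlapping portion of A for each B hit."""
    a_start, a_end = a
    hits = []
    for b_start, b_end in b_intervals:
        start, end = max(a_start, b_start), min(a_end, b_end)
        if start < end:  # half-open intervals share at least 1 bp
            hits.append((start, end))
    return hits


a = (12, 22)
b = [(10, 14), (18, 20), (28, 31)]
overlaps = intersect(a, b)
print(overlaps)            # [(12, 14), (18, 20)]
print(len(overlaps))       # 2, the count that -c would report for A
print(len(overlaps) == 0)  # False, so -v would not report A
```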


window

Find overlapping intervals within a window around an interval. (Docs)
A           ┌────█████────┐
B         ▓▓▓▓    ▓▓▓        ▓▓▓  
A win B   ▓▓▓▓    ▓▓▓

$ bedtools window [OPTIONS] [-a|-abam] -b <BED/GFF/VCF>


| OPTIONS | . |
| --- | --- |
| -w, -l, -r | Flank length of overlap window in each direction, upstream or downstream, respectively. |
| -sw | Define -l and -r based on strand. |
| -u | Only report each overlapping A feature once. |
| -c | For each entry in A, report count of overlapping B features. |
| -v | Only report features in A not overlapping B. |
| common | strandedness: -sm, -Sm; bam: -abam |


cluster

Cluster (but don't merge) overlapping/nearby intervals. (Docs)
BED        ████     █████  ███  
clustID   └─#1─┘   └────#2────┘

$ bedtools cluster [OPTIONS] -i <BED/GFF/VCF>


| OPTIONS | . |
| --- | --- |
| -d | Max distance between features in cluster. |
| common | strandedness: -s |


Aggregation Tools

For merge, groupby, and map the following* aggregation functions (specified by -o) can be applied to a column/columns specified by -c:
sum, count, count_distinct, min, max, mean, median, mode, antimode, stdev, sstdev, collapse, distinct, first, last
*Other functions are available.

merge

Combine overlapping/nearby intervals into a single interval. (Docs)
IN       ▓▓▓      ▓        ▓▓··d··▓▓▓
      ▓▓▓▓         ▓▓        
OUT   ██████      ███      ██████████

$ bedtools merge [OPTIONS] -i <BED/GFF/VCF/BAM>


| OPTIONS | . |
| --- | --- |
| -s | Require same strandedness. |
| -S | Force merge for one specific strand only. Options: `<+/->`. |
| -d | Maximum distance between features to be merged. |
| common | aggregation: -o, -c |
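
A minimal Python sketch of the merge idea, including the `-d` distance, on half-open `(start, end)` tuples for one chromosome. The function name `merge` and the parameter `d` are illustrative; book-ended intervals are merged here even with `d=0`, which is my understanding of the tool's default but is an assumption of this sketch.

```python
def merge(intervals, d=0):
    """Merge intervals that overlap or lie within `d` bp of each other."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] <= d:
            # Close enough to the previous block: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


regions = [(3, 8), (1, 5), (20, 25), (28, 30)]
print(merge(regions))       # [(1, 8), (20, 25), (28, 30)]
print(merge(regions, d=3))  # [(1, 8), (20, 30)]
```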


map

Apply a function to a column for each overlapping interval.(Docs)
        score = 3  1     5                 4      6
B              ▓▓▓ ▓   ▓▓▓▓▓             ▓▓▓▓▓▓ ▓▓▓▓
A               ██████████                 ███████
B map(mean) A   ██████████ mean(3,1,5)=3   ███████ mean(4,6)=5

$ bedtools map [OPTIONS] -a <BED/GFF/VCF> -b <BED/GFF/VCF>


| OPTIONS | . |
| --- | --- |
| common | aggregation: -o, -c; strandedness: -s, -S; overlap: -f, -F; overlap mode: -r, -e; bed12: -split |
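
The aggregation step can be pictured with a short Python sketch: collect the score of every B feature overlapping an A interval, then apply the function (here the mean, roughly what `-c 5 -o mean` would ask for on BED6 input). The name `map_mean` and the `(start, end, score)` tuple layout are illustrative assumptions.

```python
from statistics import mean


def map_mean(a, b_features):
    """Mean score of the B features overlapping interval `a`, or None if none do."""
    a_start, a_end = a
    scores = [score for b_start, b_end, score in b_features
              if b_start < a_end and b_end > a_start]
    return mean(scores) if scores else None


b = [(0, 3, 3), (4, 5, 1), (8, 13, 5), (30, 36, 4), (37, 41, 6)]
print(map_mean((1, 11), b))    # 3, i.e. mean(3, 1, 5)
print(map_mean((32, 39), b))   # 5, i.e. mean(4, 6)
print(map_mean((50, 60), b))   # None, no overlapping B feature
```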


groupby

Group by common cols & summarize other cols (~ SQL "groupBy"). (Docs)
$ bedtools groupby [OPTIONS] -i <BED> -g <groupby columns> -c <op. column> -o <operation>


| OPTIONS | . |
| --- | --- |
| common | aggregation: -o, -c |


Formatting

BED file format


| Column | e.g. | Definition |
| --- | --- | --- |
| chrom | Sc112.1 | `<STR>` name of chromosome/scaffold |
| start | 2134 | `<INT>` start position of feature |
| end | 2565 | `<INT>` end position of feature |
| name | gene123 | `<STR>` name of feature |
| score | 544 | `<NUM>` score for the feature e.g. bit score |
| strand | + | `<+/-/.>` strand on which feature is located |
| thickStart | 2235 | |
| thickEnd | 2489 | |
| itemRgb | 255,0,0 | |
| blockCount | 2 | `<INT>` number of blocks (exons) in the feature |
| blockSizes | 150,80 | `<INT>,<INT>,...` list of block sizes |
| blockStarts | 0,351 | `<INT>,<INT>,...` list of block start positions relative to start position of feature |


GFF vs BED indexing

GFF    ┌─1   2   3─┐ 4   ...
         G---A---T   C   ...
BED    └─0   1   2 └─3   ...


| . | gff -> bed | bed -> gff |
| --- | --- | --- |
| new_start = | gff_start - 1 | bed_start + 1 |
| new_end = | gff_end | bed_end |
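
The conversion rules above are small enough to write out directly. The sketch below is plain Python with illustrative function names (`gff_to_bed`, `bed_to_gff`); it only touches the two coordinate columns and assumes GFF is 1-based inclusive and BED is 0-based half-open, as in the diagram.

```python
def gff_to_bed(gff_start, gff_end):
    """1-based inclusive (GFF) -> 0-based half-open (BED)."""
    return gff_start - 1, gff_end


def bed_to_gff(bed_start, bed_end):
    """0-based half-open (BED) -> 1-based inclusive (GFF)."""
    return bed_start + 1, bed_end


# "GAT" in the diagram: bases 1..3 in GFF, interval [0, 3) in BED.
print(gff_to_bed(1, 3))               # (0, 3)
print(bed_to_gff(0, 3))               # (1, 3)
print(bed_to_gff(*gff_to_bed(1, 3)))  # (1, 3), the round trip is lossless
```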


getfasta

Use intervals to extract sequences from a FASTA file. (Docs)
FASTA   ACTGATCATGATACATGATACCATTAGGATACAATA
BED         ████       █████      ████
OUTFA       ATCA       TGATA      GGAT      

$ bedtools getfasta [OPTIONS] -fi <input FASTA> -bed <BED/GFF/VCF>


| OPTIONS | . |
| --- | --- |
| -name | Use "name" column in BED file for FASTA headers in the output. |
| -s | Reverse complement features on "-" strand. Default: strand information ignored. |
| -split | Given BED12 input, concatenate the sequences from BED blocks (e.g., exons). |
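
Since BED intervals are 0-based and half-open, extracting a feature corresponds to an ordinary Python slice. The sketch below reproduces the diagram's example; `get_fasta` is an illustrative name, and the reverse-complement branch is only a simplified picture of what `-s` does for "-" strand features.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def get_fasta(sequence, start, end, strand="+"):
    """Slice `sequence` with BED coordinates; reverse complement on '-' strand."""
    sub = sequence[start:end]
    if strand == "-":
        sub = sub.translate(COMPLEMENT)[::-1]
    return sub


seq = "ACTGATCATGATACATGATACCATTAGGATACAATA"
print(get_fasta(seq, 4, 8))        # ATCA
print(get_fasta(seq, 15, 20))      # TGATA
print(get_fasta(seq, 26, 30))      # GGAT
print(get_fasta(seq, 4, 8, "-"))   # TGAT
```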


maskfasta

Use intervals to mask sequences from a FASTA file. (Docs)
FASTA   ACTGATCATGATACATGATACCATTAGGATACAATA
BED           ████       █████      ████
FASTA'  ACTGATNNNNATACATGNNNNNATTAGGNNNNAATA

$ bedtools maskfasta [OPTIONS] -fi <input FASTA> -bed <BED/GFF/VCF> -fo <output FASTA>


| OPTIONS | . |
| --- | --- |
| -soft | Soft-mask (convert to lower-case bases) instead of masking with "N". |
| -mc | Specify masking character. |
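
A small Python sketch of hard versus soft masking, using the sequence and intervals from the diagram. The function name `mask_fasta` and its keyword arguments `soft` and `mask_char` are illustrative stand-ins for `-soft` and `-mc`, not the tool's interface.

```python
def mask_fasta(sequence, intervals, soft=False, mask_char="N"):
    """Mask the half-open BED intervals in `sequence`."""
    chars = list(sequence)
    for start, end in intervals:
        for i in range(start, min(end, len(chars))):
            chars[i] = chars[i].lower() if soft else mask_char
    return "".join(chars)


seq = "ACTGATCATGATACATGATACCATTAGGATACAATA"
bed = [(6, 10), (17, 22), (28, 32)]
print(mask_fasta(seq, bed))             # ACTGATNNNNATACATGNNNNNATTAGGNNNNAATA
print(mask_fasta(seq, bed, soft=True))  # ACTGATcatgATACATGataccATTAGGatacAATA
```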


sort

Order the intervals in a file. (Docs)
$ bedtools sort [OPTIONS] -i <BED/GFF/VCF>


| OPTIONS | . |
| --- | --- |
| -sizeA | Sort by feature size (asc). |
| -sizeD | Sort by feature size (desc). |
| -chrThenSizeA | Sort by chromosome (asc), then by feature size (asc). |
| -chrThenSizeD | Sort by chromosome (asc), then by feature size (desc). |
| -chrThenScoreA | Sort by chromosome (asc), then by score (asc). |
| -chrThenScoreD | Sort by chromosome (asc), then by score (desc). |


Statistics

jaccard

Calculate the Jaccard statistic b/w two sets of intervals. (Docs)
A                 ███████████  15bp
B               ▓▓▓▓ 10bp ▓▓ 4bp       ▓▓▓ 8bp
A int B           ▓▓ 6bp  ▓▓ 4bp
Jaccard(A,B)     (6+4)/((15+10+4+8)-(6+4)) =  0.37     

$ bedtools jaccard [OPTIONS] -a <BED/GFF/VCF> -b <BED/GFF/VCF>


| OPTIONS | . |
| --- | --- |
| common | strandedness: -s, -S; overlap: -f, -F; overlap mode: -r, -e; bed12: -split |
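
The arithmetic in the diagram (intersection bp divided by union bp) can be checked with a few lines of Python. This sketch enumerates covered positions in sets, which is only sensible for toy examples; the function names and the coordinates are made up to match the 15/10/4/8 bp features shown above.

```python
def covered_positions(intervals):
    """Set of 0-based positions covered by half-open (start, end) intervals."""
    return {pos for start, end in intervals for pos in range(start, end)}


def jaccard(a, b):
    """Intersection bp / union bp for two interval sets on one chromosome."""
    a_bp, b_bp = covered_positions(a), covered_positions(b)
    union = len(a_bp | b_bp)
    return len(a_bp & b_bp) / union if union else 0.0


a = [(10, 25)]                       # 15 bp
b = [(6, 16), (20, 24), (40, 48)]    # 10 bp, 4 bp, 8 bp
print(round(jaccard(a, b), 2))       # 0.37, i.e. 10 / 27
```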


random

Generate random intervals in a genome. (Docs)
$ bedtools random [OPTIONS] -g <GENOME>


| OPTIONS | . |
| --- | --- |
| -l | The length of the intervals to generate. Default: 100 |
| -n | The number of intervals to generate. Default: 1,000,000 |
| -seed | Supply an integer seed for the shuffling. |


reldist

Calculate the distribution of relative distances b/w two files. (Docs)
                ───────r──────
A            ▓▓▓▓▓▓         ▓▓▓▓
B                      ███
                ───d1─── ──d2──
reldist = min(d1,d2)/r

$ bedtools reldist [OPTIONS] -a <BED/GFF/VCF> -b <BED/GFF/VCF>


| OPTIONS | . |
| --- | --- |
| -detail | Instead of a summary, report relative distance for each region in A. |
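
The formula `min(d1,d2)/r` can be sketched in Python as below. For simplicity each feature is reduced to a single position, which is an assumption of this sketch (how bedtools chooses that position is not covered here); `reldist` and its arguments are illustrative names, and `refs` must be sorted.

```python
from bisect import bisect_right


def reldist(query, refs):
    """Relative distance of `query` to its two flanking positions in sorted `refs`."""
    i = bisect_right(refs, query)
    if i == 0 or i == len(refs):
        return None  # not flanked on both sides, so r is undefined
    left, right = refs[i - 1], refs[i]
    return min(query - left, right - query) / (right - left)


refs = [100, 200, 500]
print(reldist(130, refs))  # 0.3  (d1=30, d2=70, r=100)
print(reldist(350, refs))  # 0.5  (exactly midway, the maximum possible value)
print(reldist(50, refs))   # None
```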


shuffle

Randomly redistribute intervals in a genome. (Docs)
$ bedtools shuffle [OPTIONS] -i <BED/GFF/VCF> -g <GENOME>


| OPTIONS | . |
| --- | --- |
| -excl | BED file with regions into which features won't be shuffled. |
| -incl | BED file with regions into which features will be shuffled. |
| -chrom | Keep features on the same chromosome. |
| -chromFirst | Distribute features ~uniformly across chroms, not across total sequence. |
| -noOverlapping | Don't allow shuffled intervals to overlap. |


makewindows

Makes adjacent or sliding windows across a genome or BED file.
$ bedtools makewindows [OPTIONS] [-g <GENOME>|-b <BED>] [-w <window size> | -n <n windows>]


| OPTIONS | . |
| --- | --- |
| -s | Number of bases to step before creating a new window. Default: equal to -w |
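
The relationship between window size (`-w`) and step (`-s`) can be sketched in Python as follows. `make_windows` is an illustrative name; keeping the truncated windows at the end of the region is an assumption of this sketch rather than a statement about the tool's exact edge behaviour.

```python
def make_windows(start, end, size, step=None):
    """Windows of `size` bp every `step` bp across [start, end)."""
    step = size if step is None else step  # default: adjacent windows
    return [(s, min(s + size, end)) for s in range(start, end, step)]


print(make_windows(0, 10, 4))          # [(0, 4), (4, 8), (8, 10)]
print(make_windows(0, 10, 4, step=2))  # [(0, 4), (2, 6), (4, 8), (6, 10), (8, 10)]
```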


Coverage

annotate

Annotate coverage of features from multiple files. (Docs)
$ bedtools annotate -i variants.bed -files genes.bed conserve.bed known_var.bed
chr1  100 200 nasty 1 - 0.500000  1.000000  0.300000
chr2  500 1000  ugly  2 + 0.000000  0.600000  1.000000

$ bedtools annotate [OPTIONS] -i <BED/GFF/VCF> -files FILE1 FILE2 FILE3 ... FILEn


| OPTIONS | . |
| --- | --- |
| -counts | Report count of features that overlap -i in each file. Default: report fraction of -i covered by each file. |
| -both | Report counts & fractions for each file. |
| common | strandedness: -s, -S. |


coverage

Compute the coverage over defined intervals. (Docs)
BED FILE A  ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓     ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓     ▓▓▓▓▓▓  
BED File B  ████ ████              ██             █████████
              ████████                                      
Result      [  N=3, 10/15 ]     [  N=1, 2/15  ]    [N=1,6/6]

$ bedtools coverage [OPTIONS] -a <BAM/BED/GFF/VCF> -b <FILE1, FILE2, ..., FILEN>


| OPTIONS | . |
| --- | --- |
| -d | Report the depth at each position in each A feature. |
| common | strandedness: -s, -S; overlap: -f, -F; overlap mode: -r, -e; bam/bed12: -split, -abam |
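
The `N` and fraction values in the diagram can be reproduced with a short Python sketch: count the B features that overlap an A interval and measure how many of A's bases they cover. The function name `coverage` and the coordinates are illustrative, chosen to match the first A interval above (3 hits, 10 of 15 bp covered).

```python
def coverage(a, b_intervals):
    """Return (hit count, covered bp, length of A, covered fraction) for interval `a`."""
    a_start, a_end = a
    count = 0
    covered = set()
    for b_start, b_end in b_intervals:
        start, end = max(a_start, b_start), min(a_end, b_end)
        if start < end:
            count += 1
            covered.update(range(start, end))  # overlapping hits count once per base
    length = a_end - a_start
    return count, len(covered), length, len(covered) / length


b = [(12, 16), (17, 21), (14, 22), (35, 37)]
print(coverage((12, 27), b)[:3])  # (3, 10, 15)
print(coverage((32, 47), b)[:3])  # (1, 2, 15)
```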