Skip to content

Instantly share code, notes, and snippets.

@cnithin7
Forked from chrisamiller/copy_number_sv.md
Last active November 13, 2025 15:19
Show Gist options
  • Select an option

  • Save cnithin7/b7ac47112337d23a02094fc3017ed5be to your computer and use it in GitHub Desktop.

Select an option

Save cnithin7/b7ac47112337d23a02094fc3017ed5be to your computer and use it in GitHub Desktop.

Copy number and Structural Variants

Before we start, we will be using R studio in this exercise to load our objects and visualize the plots. Lets start by setting up Rstudio now, so we can use it later in our exercise.

For setting up the Rstudio:

SSH into your instance

ssh -i cshl_2025_student.pem ubuntu@[public-ip-address]

Set up a password for the RStudio user

Run the following command to set a password for the username ubuntu:

sudo passwd ubuntu

You will be prompted to enter a new password.
For simplicity, you can set it as ubuntu.

Access RStudio in your browser

Open your browser and go to:

http://<public.ip.address>:8787

Log in

When prompted for credentials, enter:

Username: ubuntu
Password: ubuntu

Then click Sign In.

Now you should be able to see,

  • Console: where R commands actually run.
  • Environment/History: shows variables and command history.
  • Plots/Files/Packages/Help: shows plots and file browser.

This exercise walks through some basic steps needed to identify, review, and interpret copy number and structural variants in genomes.

We'll start by creating a folder and pulling down some exome sequencing data from chromosome 17 of the HCC1395 cell line. HCC1395 is an epithelial cell line derived from the same individual's breast tumor (ductal carcinoma) and their normal cells.

mkdir ~/workspace/copynumber
cd ~/workspace/copynumber
wget https://storage.googleapis.com/bfx_workshop_tmp/tum_chr17.bam https://storage.googleapis.com/bfx_workshop_tmp/nrm_chr17.bam

These are aligned bam files, so they're ready for downstream analysis after we index them:

samtools index tum_chr17.bam
samtools index nrm_chr17.bam

We'll also need a reference fasta, which in this case is the human GRCh38 assembly, subset to just chromosome 17.

wget https://storage.googleapis.com/bfx_workshop_tmp/chr17.fa.gz
gunzip chr17.fa.gz

Copy Number

To make copy number calls, we're going to use a tool called CNVkit. It works by comparing a sample of interest to one or more control samples. This is especially important for exome sequencing, where probe affinities may differ, giving dramatically different coverage even over areas with the same true copy number.

In order to do this comparison well, CNVkit needs to know where the targeted regions of the genome are. We have those in a bed file, which you can pull down like this:

wget https://storage.googleapis.com/bfx_workshop_tmp/targets.bed

With all the pieces in place, we could make use of CNVkit's "batch" command to run it's whole workflow on these samples: Counting reads both in and out of target regions, matching them up, and performing segmentation to identify regions of contiguous change.

docker run -v /workspace:/workspace -it etal/cnvkit:0.9.8
cd /workspace/copynumber/
python3 /usr/local/bin/cnvkit.py batch tum_chr17.bam --normal nrm_chr17.bam --fasta chr17.fa --output-dir /workspace/copynumber/cnvkit_outputs --targets targets.bed -p 8

Once that is done , lets exit from the container

exit

Now, chmod the folder so it's accessible to our web browser

cd /workspace/copynumber/
chmod ugo+rx cnvkit_outputs

if you dont wanna do all that, feel free to pull the results.

#grab pre-computed results
wget https://storage.googleapis.com/bfx_workshop_tmp/cnvkit_outputs.tar.gz
tar -xzvf cnvkit_outputs.tar.gz
#chmod the folder so it's accessible to our web browser
chmod ugo+rx cnvkit_outputs

Let's look at what we have here:

  • targets.bed and antitargets.bed, defining the regions of the genome that it is considering
  • .cnn files for each input sample with coverage depth and gc-content
  • tum_chr17.cnr contains the granular results - point estimates of copy number across the genome
  • tum_chr17.cns contains the segmented results where regions of contiguous similar copy number are merged
  • tum_chr17.call.cns an attempt to assign each segment’s absolute integer copy number
  • tum_chr17.bintest.cns an attempt to go back and look for focal copy number events

For more details on these files, you can see CNVkit's excellent documentation

Exercises - Part 1

  • Use less to open the output files and see what they contain.

  • Boy, that gene column really gets in the way, especially in the segments file. How could you exclude that and just view all the other columns?

  • How many copy number segments were called?

  • How many abnormal copy number segments were called?

  • Look at the first few lines of the cns and the call.cns files. How are they similar/different - what happened here?

  • Pull down a bed file of all protein coding genes from here: https://storage.googleapis.com/bfx_workshop_tmp/genes.blocked.protein_coding.ens95.bed. Use bedtools to determine how many genes are amplified in this tumor.

Hint 1

This awk command will print any line where the 3rd column is greater than 100. Modify it to help you find amplified regions: `awk -v OFS="\t" '$3>100 {print $0}' myfile.txt`

Hint 2

`bedtools intersect` is your friend here, but it really hates that header line on the CN data. `tail -n +2` will return every line of a file, starting with the second.

Plotting copy number

Visualizing this data might be nice. There are many ways to do this, but let's explore two of them.

The first is that CNVkit has some basic plotting built in. To run this, we'll fire up the CNVkit docker image:

docker run -v /workspace:/workspace -it etal/cnvkit:0.9.8
cd /workspace/copynumber/cnvkit_outputs

Then run the scatter and diagram commands

python3 /usr/local/bin/cnvkit.py scatter tum_chr17.cnr -s tum_chr17.cns -c chr17 -o scatter.pdf
python3 /usr/local/bin/cnvkit.py diagram -s tum_chr17.cns -o ideogram.pdf

Check the outputs - how do they look? That ideogram with the probe labelling is certainly not great. How could you use command line tools to clean up the input file before plotting it again?

One way could be with awk, outputting all fields except for the 4th:

#keep the header line intact
head -n 1 tum_chr17.cns > tum_chr17.cns.noprobes
#now use awk to make the 4th field blank
tail -n +2 tum_chr17.cns | awk 'OFS="\t" {print $1,$2,$3,"",$5,$6,$7,$8,$9,$10}' >> tum_chr17.cns.noprobes

(There are lots of other ways to do it too!)

Now try making your ideogram again.

These are great for a quick look, but what if you wanted more fine-grained control over the graphics? This might be a good case for using ggplot in R.

Fire up R or RStudio, set the working directory and load in the ggplot library

setwd("~/workspace/copynumber/cnvkit_outputs")
library(ggplot2)

Then, read in the data:

bins = read.table("tum_chr17.cnr",sep="\t",header=T)
segs = read.table("tum_chr17.cns",sep="\t",header=T)

If you try to use head(segs) to see what's going on though, that pesky 4th column is kind of in the way again. Let's drop it:

segs = segs[,-4]

Now let's start with a scatterplot:

ggplot(data=bins, aes(x=start, y=log2, color=depth)) + 
    geom_point()

Not half bad, but a few tweaks might make it better. Let's zoom in to ignore that outlier point:

ggplot(data=bins, aes(x=start, y=log2, color=depth)) +
    geom_point() + 
    ylim(-5,5) 

And then let's add in lines showing the segments

ggplot(data=bins, aes(x=start, y=log2, color=depth)) +
   geom_point() + 
   ylim(-5,5) + 
   geom_segment(data=segs, aes(x=start, xend=end, y=log2, yend=log2), color="red")

Note the use of a secondary data set and set of aes mappings. That's a useful trick for getting multiple data sets onto the same plot.

One important thing to note is that since we only ran this on a small portion of the genome, the "centering" is sometimes not as robust. Here we can assume that that first large chunk is close enough to zero to be considered copy-number neutral (2 copies), while the rest of the chromosome is affected by amplifications or deletions.

Exercises

  • Why is there a big gap in coverage in the middle?

  • Do you think the segmentation was accurate, or would you have called more or fewer segments?

Answer

It's probably a bit hyper-segmented. There are options in CNVkit to tweak parameters or segmentation methods, or you could use the `call.cns` file which has additional merging

  • Representation as log2 is nice for some things, but maybe we'd like to know approximately how many copies of each segment there are.

  • The formula to get log2 copy number was: log2(tum_cn/nrm_cn). If we assume that the normal was "clean" with a CN of 2, to reverse it we can use (2^log2_cn)*2.

We might also want to know the location of particular genes of interest - do they fall within CN-altered regions? Pretend that you're really interested in the ERBB2 (aka HER2) gene - it's a known driver of breast cancer. Can you find its location on GRCh38, then mark it on this plot, using geom_vline? Is it involved in a CN alteration? Let's try

First, load the required library and read in your CNVkit output files.

setwd("~/workspace/copynumber/cnvkit_outputs")
library(ggplot2)

# Read data
bins <- read.table("tum_chr17.cnr", sep = "\t", header = TRUE)
segs <- read.table("tum_chr17.cns", sep = "\t", header = TRUE)

# Drop the 4th column from segs (unnecessary extra info)
segs <- segs[, -4]

Convert log2 ratios to absolute copy number

Copy number is often expressed in log2 space (relative to diploid = 2).
To get absolute copy number estimates, convert as follows:

# Convert to absolute copy number scale
bins$abs_cn <- 2^(bins$log2) * 2
segs$abs_cn <- 2^(segs$log2) * 2

Define coordinates for the ERBB2 gene (GRCh38)

We’ll mark this region on the plot using vertical dashed lines.

# ERBB2 coordinates (GRCh38)
erbb2_start <- 39700064
erbb2_end   <- 39728658

Now let's start with a scatterplot:

ggplot(data = bins, aes(x = start, y = abs_cn, color = depth)) +
  geom_point()

Red horizontal segments indicate CNVkit’s smoothed copy number calls.

ggplot(data = bins, aes(x = start, y = abs_cn, color = depth)) +
  geom_point() +
  geom_segment(
    data = segs,
    aes(x = start, xend = end, y = abs_cn, yend = abs_cn),
    color = "red"
  )

Highlight the ERBB2 gene region

ggplot(data = bins, aes(x = start, y = abs_cn, color = depth)) +
  geom_point() +
  geom_segment(
    data = segs,
    aes(x = start, xend = end, y = abs_cn, yend = abs_cn),
    color = "red"
  ) +
  geom_vline(
    xintercept = c(erbb2_start, erbb2_end),
    color = "red",
    linetype = "dashed"
  ) +
  ylim(-1, 5)

Is ERBB2 involved in a CN event?

Structural Variation

Unfortunately, calling SVs on exome data is suboptimal. Resolving SVs relies in large part on finding and connecting breakpoints, and in exome data, most breakpoints will fall in between the exons and be invisible to us. So let's take a look at a WGS data set from the same cell line.

Start by pulling down a pair of heavily subset bams:

mkdir ~/workspace/sv
cd ~/workspace/sv
wget https://storage.googleapis.com/bfx_workshop_tmp/wgs_nrm.bam https://storage.googleapis.com/bfx_workshop_tmp/wgs_tum.bam
  • Just for fun, you could run samtools flagstat to get the number of reads in this normal bam, then do some quick back-of-the-envelope math to estimate what percentage of the total 30x WGS bam we have here, assuming they're all 100bp reads and the human genome is 3.2e9 bp long. (yeah, WGS data is big!)

Manta is one tool for calling structural variants, and running it is a several step process. (See the PMBio guide for some more details, or the Manta repository) To save time, we'll pull down a completed output file from Manta:

wget http://genomedata.org/pmbio-workshop/results/chr6_and_chr17/somatic/manta_wgs/results/variants/somaticSV.vcf.gz

This is a VCF, and you should be familiar with these, from looking at variant calls. Examine the VCF and identify some differences between this VCF and those from SNPs.

Hints

A few examples

  • The ALT field often doesn't contain actual sequence.
  • There are weird alleles that look like G[chr17:40376708[
  • The FORMAT field doesn't contain a genotype (GT) entry

This is fairly complicated and if you want to know all the hairy details, you can read the full VCF spec or take a look at guides like this one from the GATK folks.

With a few basic facts, though, you'll know enough to do some analyses:

  • Every SV has a chromosome and position, just like other variants

  • Different types of variants will have different information in other fields:

    • Smallish INS and DEL events will list the basepairs added/removed, as with the few-basepair indels we've seen before.

    • Larger ones may only have the start coordinate (in the POS field) and an end in the info field: END=123456

    • Breakends (BND) are another SV type, which just means that we can see that two particular portions of the genome are now connected, but the full event couldn't be resolved.

    • Breakends have a start position and then in the ALT field, a string that includes the location of the other end of the break. G[chr17:40376708[ The brackets indicate the orientation of the connection. see page 25 of the VCF spec for an illustrated example

    • SVLEN info field tells us the size of the event. By convention, INS has positive length and DEL has negative length

  • Different SV callers may output different information, but Manta bases its calls on support from Paired Reads (PR) and Split Reads (SR). Counts of these reads for the ref and alt alleles are listed in the sample field

Whew, that's a lot, but how do we get down to brass tacks and figure out whether these are real, and which ones are important?

If we want to slice and dice these VCFs, we could try to cobble together some really hairy cut/sort/awk commands, but trust me - that road leads to madness (and lots of mistakes)! There are already tools that allow us to query VCFs efficiently, like bcftools

Let's start by extracting all of the variants that pass the variant filters:

bcftools view --include 'FILTER=="PASS"' somaticSV.vcf.gz  | less

We can chain together conditional operators:

  • && for AND
  • || for OR

So what if we wanted to also look at the SOMATICSCORE field, which is a measure of confidence given to the call by Manta:

bcftools view --include 'FILTER=="PASS" && SOMATICSCORE>150' somaticSV.vcf.gz  | less

We can of course then do things like pipe these results to additional commands or to a file. In this case we'll filter to passing variants with a score of 150, then keep header lines or those that contain tandem dups, and save them to a new vcf:

bcftools view --include 'FILTER=="PASS" && SOMATICSCORE>150' somaticSV.vcf.gz  | egrep "^#|TANDEM" >tandem_hq.vcf

###Exercises

  • How many passing SVs are in the VCF provided?
  • How many passing SVs are at least 1Mb in size?
  • How many passing SVs are within 10000 bp of a gene? This one is tricky, but super-useful!

When thinking about intersects with SVs, remember that the type of the SV matters. For example: a gene in the middle of a huge inversion may be relatively unaffected, while one that gets cut in half at a breakpoint might be completely non-functional.

Visualization/Review

Finally, let's load these up in IGV and have a look at a few of them. Load the tumor bam, the normal bam, and the original VCF.

  • Start by navigating to chr6:665964-668234.
    • Notice the coverage drop in this deletion, and the softclipping at the breakpoints.
    • Look at the first few bases of the softclipping and notice how the softclipped sequence matches up perfectly with the other side of the breakpoint. These are examples of split reads providing evidence. Most will also have a supplementary alignment to the other side of the deletion.
    • Color your reads by insert size, then Right-click and choose "View as Pairs". The reads that span the deletion are examples of paired reads providing evidence for the SV.
  • Navigate to chr6:73159273-73161543. This inversion looks solid enough, but the coverage level changes probably mean that it's part of some more complex rearrangement.
  • Navigate to chr6:111296756-111314925. Look at how the coverage clearly shows the presence of multiple copies of this region.
  • Navigate to region chr6:115429512. It's pretty easy to see the start of the event, with softclipping and a coverage change. The other end is listed as being at chr6:122171274, but you can't zoom out enough to see both ends at once. Instead, try this:
    • Make sure your reads are still colored by insert size
    • click on colored reads until you find one whose mate pair is near the other end of this event.
    • right click on the read and choose "View mate region in split screen"
    • Now you can see both ends. Page around until you find the breakpoint at the other end and confirm that the reads have supplementary mappings back to the start.

Final thoughts:

These were all pretty easy to detect SVs, from a high tumor purity cell line, with very deep whole genome coverage (~75x). It's often much harder to identify events in a 20x or 30x genome, where we might only have 3 or 4 reads of support. Deeper genomes help a lot!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment