Skip to content

Instantly share code, notes, and snippets.

@williamrowell
Last active February 13, 2024 16:49
Show Gist options
  • Save williamrowell/cceb718298a05426591c7497707d1696 to your computer and use it in GitHub Desktop.
Save williamrowell/cceb718298a05426591c7497707d1696 to your computer and use it in GitHub Desktop.
Calculate the mean and standard deviation of a coverage depth distribution as if it were normally distributed.
#!/usr/bin/env python3
"""Coverage mean and standard deviation of autosomes
Estimate the mean and standard deviation from a mosdepth coverage BED for
positions with coverage in the range (0, 2 * non-zero mode). This estimate
behaves well for PacBio HiFi WGS of human germline aligned to either hs37d5 and
GRCh38, and may be useful for other situations as well.
$ bash mosdepth --threads 3 --no-per-base --by 500 -m "${BAM%.*}.median" "${BAM}"
$ tabix ${BAM%.*}.median.regions.bed.gz {1..22} | python depth_mean_stddev.py
"""
__author__ = "William Rowell"
__version__ = "0.2.0"
__license__ = "MIT"
import sys
import numpy as np
from scipy.stats import norm, mode
def nonzero_mode(values):
"""Return mode of values > 0"""
return mode(values[values > 0])[0]
def mu_sigma(values):
"""Return mu, sigma of values from 1 to 2*mode"""
return norm.fit(values[(values > 0) & (values < (2*nonzero_mode(values)))])
def main():
"""For a mosdepth bed file provided to stdin, determine the mean and stddev as
if the data were normal, and report these values.
Usage: tabix ${BAM%.*}.median.regions.bed.gz `seq 1 22` | python depth_mean_stddev.py
"""
# pull coverage depth into an array
values = np.loadtxt(sys.stdin, delimiter='\t', usecols=3, dtype=np.float64)
print(mu_sigma(values))
if __name__ == "__main__":
"""$ tabix ${BAM%.*}.median.regions.bed.gz {1..22} | python depth_mean_stddev.py
> (mean, stddev)
"""
main()
@williamrowell
Copy link
Author

williamrowell commented Jan 18, 2020

DESCRIPTION

Calculate coverage mean and standard deviation of autosomes from read alignments.

Estimate the mean and standard deviation from a mosdepth coverage BED for positions with coverage in the range (0, 2 * non-zero mode). This estimate behaves well for WGS of human germline aligned to either hs37d5 and GRCh38, and may be useful for other situations as well.

APPROACH

  • read in the integer array of median coverage of non-zero blocks (use tabix to only provide autosomes)
  • find the modal value from non-zero elements
  • find the mean and standard deviation for values (0, 2 * non-zero mode)

DEPENDENCY INSTALLATION

  • Install tabix
  • Install mosdepth
  • Install python modules. pip install numpy scipy

STEP-BY-STEP

  • bash mosdepth --threads 3 --no-per-base --by 500 -m "${BAM%.*}.median" "${BAM}"
  • tabix ${BAM%.*}.median.regions.bed.gz <space delimited list of autosomes> | python depth_mean_stddev.py

Output:

  • <mean> <standard deviation>

DISCLAIMER

THIS WEBSITE AND CONTENT AND ALL SITE-RELATED SERVICES, INCLUDING ANY DATA, ARE PROVIDED "AS IS," WITH ALL FAULTS, WITH NO REPRESENTATIONS OR WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, ANY WARRANTIES OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR PURPOSE. YOU ASSUME TOTAL RESPONSIBILITY AND RISK FOR YOUR USE OF THIS SITE, ALL SITE-RELATED SERVICES, AND ANY THIRD PARTY WEBSITES OR APPLICATIONS. NO ORAL OR WRITTEN INFORMATION OR ADVICE SHALL CREATE A WARRANTY OF ANY KIND. ANY REFERENCES TO SPECIFIC PRODUCTS OR SERVICES ON THE WEBSITES DO NOT CONSTITUTE OR IMPLY A RECOMMENDATION OR ENDORSEMENT BY PACIFIC BIOSCIENCES.

@williamrowell
Copy link
Author

zcat ${BAM%.*}.median.regions.bed.gz \
    |  awk -v OFS='\t' -v MEAN=$MEAN -v STDEV=$STDEV '{print $1, $2, $3, ($4 - MEAN) / STDEV}' \
    > ${BAM%.*}.median.regions.zscore.bed.gz

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment