Skip to content

Instantly share code, notes, and snippets.

@Wytamma
Last active February 24, 2020 03:25
Show Gist options
  • Save Wytamma/18b76543857b328d4fa25ca377252728 to your computer and use it in GitHub Desktop.
Save Wytamma/18b76543857b328d4fa25ca377252728 to your computer and use it in GitHub Desktop.
Hamming distances used to replace erroneous barcodes in qiime2 - moving-pictures - emp single end sequences.
Display the source blob
Display the rendered blob
Raw
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"kernelspec": {
"display_name": "Py3-wytamma",
"language": "python",
"name": "py3-wytamma"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.6"
},
"colab": {
"name": "18b76543857b328d4fa25ca377252728#file-hamming-barcode-correction-ipynb",
"provenance": []
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "0j9B4RT2lH4W",
"colab_type": "text"
},
"source": [
"Note: this code is redundant in newer versions of qiime2 (if you are using golay barcodes)."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "8_EOw-7xW1sF",
"colab_type": "text"
},
"source": [
"# Metagenomics barcode correction using hamming distance\n",
"\n",
"The `barcodes.fastq.gz` file contains the barcode read associated with each sequence in the `sequences.fastq.gz`\n",
"file. The barcodes are used to identify individual samples; this allows samples to be multiplexed in a sequencing\n",
"run. The `barcodes.fastq.gz` file should contain the same number of unique barcodes as the number of\n",
"samples in the multiplex, i.e. in the sample-metadata.tsv file. The expected number of unique barcodes is 34\n",
"(i.e. the same as the number of samples); however, there are 2137 unique barcodes in the `barcodes.fastq.gz`\n",
"file.\n",
"\n",
"There are several possible causes of the mismatch of barcode codes in these sequencing files; the mismatches\n",
"may be intentional or in error. It is possible that these additional barcodes have been included intentionally\n",
"as a part of a larger experiment, we might be working on a subset of samples that were included in the\n",
"sequencing run (see [Caporaso et al. 2011](https://www.pnas.org/content/108/Supplement_1/4516)). A short sequence of DNA bases can hold a tremendous number of\n",
"combinations; each could be used as unique identifiers for many billions (pick your number) of samples. For\n",
"example, the barcoding sequences used in this experiment are 12 bases long. As each base can be one of 4\n",
"nucleotides (A, T, C, or G) the total number of possible combinations is 4ˆ12 or 16,777,216. Theoretically, if\n",
"sequencing were 100% accurate, it would be possible to use all of these barcode combinations to conducted\n",
"massively multiplexed sequencing runs. If this were the case for our experiment, we would expect a high\n",
"number of sequences associated with each unique barcode. However, the density plot below shows that the\n",
"majority of barcodes have very few sequences associated with them (median=1). It is therefore unlikely that\n",
"these barcodes have been included purposefully and more likely that they were introduced though errors in\n",
"the sequencing.\n",
"\n",
"Sequencing is not 100% accurate. Errors will be introduced, and if those errors are in the barcoding region of\n",
"the sequence, then additional erroneous barcodes will be reported. The original `split_libraries_fastq.py`\n",
"module of `qiime` had an option for supplying a user-defined barcode correction function. However, the `qiime2`\n",
"plugin `demux` does not implement this option, and erroneous barcodes (that are not golay corrected) are simply [ignored](https://github.com/qiime2/q2-demux/blob/master/q2_demux/_demux.py#L411). A simple solution to this problem might be to pre-correct the barcodes\n",
"using a sequence distance algorithm. The erroneous barcode could be updated to the closest correct barcode\n",
"(if it is below a given distance threshold). This Python script implements the solution using Hamming\n",
"distances and the principle of parsimony.\n",
"\n",
"**Reading**\n",
"- [Error-correcting barcoded primers for pyrosequencing hundreds of samples in multiplex](https://www.nature.com/articles/nmeth.1184)\n",
"- [Microbial community profiling for human microbiome projects: Tools, techniques, and challenges](https://genome.cshlp.org/content/19/7/1141)"
]
},
{
"cell_type": "code",
"metadata": {
"id": "nOzFe41QWzFY",
"colab_type": "code",
"colab": {}
},
"source": [
"import pandas as pd\n",
"import gzip"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "hRfMeBTpWzFd",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 1000
},
"outputId": "1c2ea0e8-68c7-41ed-b9fd-c6ab619f4b97"
},
"source": [
"%%bash\n",
"mkdir -p raw_data/emp-single-sequences\n",
"wget -O \"raw_data/sample-metadata.tsv\" \"https://data.qiime2.org/2017.12/tutorials/moving-pictures/sample_metadata.tsv\"\n",
"wget -O \"raw_data/emp-single-sequences/barcodes.fastq.gz\" \"https://data.qiime2.org/2017.12/tutorials/moving-pictures/emp-single-end-sequences/barcodes.fastq.gz\""
],
"execution_count": 12,
"outputs": [
{
"output_type": "stream",
"text": [
"--2020-02-24 00:28:05-- https://data.qiime2.org/2017.12/tutorials/moving-pictures/sample_metadata.tsv\n",
"Resolving data.qiime2.org (data.qiime2.org)... 52.35.38.247\n",
"Connecting to data.qiime2.org (data.qiime2.org)|52.35.38.247|:443... connected.\n",
"HTTP request sent, awaiting response... 302 FOUND\n",
"Location: https://docs.google.com/spreadsheets/d/149386JULm1MlQWf_E1ONSRiTHuDDb65ISp6UQwCzIzM/export?gid=0&format=tsv [following]\n",
"--2020-02-24 00:28:05-- https://docs.google.com/spreadsheets/d/149386JULm1MlQWf_E1ONSRiTHuDDb65ISp6UQwCzIzM/export?gid=0&format=tsv\n",
"Resolving docs.google.com (docs.google.com)... 74.125.31.102, 74.125.31.113, 74.125.31.101, ...\n",
"Connecting to docs.google.com (docs.google.com)|74.125.31.102|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: unspecified [text/tab-separated-values]\n",
"Saving to: ‘raw_data/sample-metadata.tsv’\n",
"\n",
" 0K ... 36.3M=0s\n",
"\n",
"2020-02-24 00:28:06 (36.3 MB/s) - ‘raw_data/sample-metadata.tsv’ saved [3670]\n",
"\n",
"--2020-02-24 00:28:06-- https://data.qiime2.org/2017.12/tutorials/moving-pictures/emp-single-end-sequences/barcodes.fastq.gz\n",
"Resolving data.qiime2.org (data.qiime2.org)... 52.35.38.247\n",
"Connecting to data.qiime2.org (data.qiime2.org)|52.35.38.247|:443... connected.\n",
"HTTP request sent, awaiting response... 302 FOUND\n",
"Location: https://s3-us-west-2.amazonaws.com/qiime2-data/2017.12/tutorials/moving-pictures/emp-single-end-sequences/barcodes.fastq.gz [following]\n",
"--2020-02-24 00:28:06-- https://s3-us-west-2.amazonaws.com/qiime2-data/2017.12/tutorials/moving-pictures/emp-single-end-sequences/barcodes.fastq.gz\n",
"Resolving s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)... 52.218.204.224\n",
"Connecting to s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)|52.218.204.224|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 3783785 (3.6M) [application/x-gzip]\n",
"Saving to: ‘raw_data/emp-single-sequences/barcodes.fastq.gz’\n",
"\n",
" 0K .......... .......... .......... .......... .......... 1% 609K 6s\n",
" 50K .......... .......... .......... .......... .......... 2% 607K 6s\n",
" 100K .......... .......... .......... .......... .......... 4% 615K 6s\n",
" 150K .......... .......... .......... .......... .......... 5% 39.6M 4s\n",
" 200K .......... .......... .......... .......... .......... 6% 50.7M 3s\n",
" 250K .......... .......... .......... .......... .......... 8% 629K 4s\n",
" 300K .......... .......... .......... .......... .......... 9% 35.5M 3s\n",
" 350K .......... .......... .......... .......... .......... 10% 44.1M 3s\n",
" 400K .......... .......... .......... .......... .......... 12% 57.9M 2s\n",
" 450K .......... .......... .......... .......... .......... 13% 67.5M 2s\n",
" 500K .......... .......... .......... .......... .......... 14% 81.3M 2s\n",
" 550K .......... .......... .......... .......... .......... 16% 650K 2s\n",
" 600K .......... .......... .......... .......... .......... 17% 57.7M 2s\n",
" 650K .......... .......... .......... .......... .......... 18% 52.1M 2s\n",
" 700K .......... .......... .......... .......... .......... 20% 61.9M 2s\n",
" 750K .......... .......... .......... .......... .......... 21% 54.7M 1s\n",
" 800K .......... .......... .......... .......... .......... 23% 61.9M 1s\n",
" 850K .......... .......... .......... .......... .......... 24% 63.2M 1s\n",
" 900K .......... .......... .......... .......... .......... 25% 89.1M 1s\n",
" 950K .......... .......... .......... .......... .......... 27% 91.6M 1s\n",
" 1000K .......... .......... .......... .......... .......... 28% 658K 1s\n",
" 1050K .......... .......... .......... .......... .......... 29% 39.1M 1s\n",
" 1100K .......... .......... .......... .......... .......... 31% 66.3M 1s\n",
" 1150K .......... .......... .......... .......... .......... 32% 76.7M 1s\n",
" 1200K .......... .......... .......... .......... .......... 33% 62.5M 1s\n",
" 1250K .......... .......... .......... .......... .......... 35% 61.8M 1s\n",
" 1300K .......... .......... .......... .......... .......... 36% 67.3M 1s\n",
" 1350K .......... .......... .......... .......... .......... 37% 59.3M 1s\n",
" 1400K .......... .......... .......... .......... .......... 39% 83.6M 1s\n",
" 1450K .......... .......... .......... .......... .......... 40% 85.3M 1s\n",
" 1500K .......... .......... .......... .......... .......... 41% 82.3M 1s\n",
" 1550K .......... .......... .......... .......... .......... 43% 87.7M 1s\n",
" 1600K .......... .......... .......... .......... .......... 44% 92.5M 1s\n",
" 1650K .......... .......... .......... .......... .......... 46% 394M 1s\n",
" 1700K .......... .......... .......... .......... .......... 47% 406M 1s\n",
" 1750K .......... .......... .......... .......... .......... 48% 427M 1s\n",
" 1800K .......... .......... .......... .......... .......... 50% 385M 0s\n",
" 1850K .......... .......... .......... .......... .......... 51% 337M 0s\n",
" 1900K .......... .......... .......... .......... .......... 52% 686K 1s\n",
" 1950K .......... .......... .......... .......... .......... 54% 62.4M 0s\n",
" 2000K .......... .......... .......... .......... .......... 55% 56.9M 0s\n",
" 2050K .......... .......... .......... .......... .......... 56% 65.1M 0s\n",
" 2100K .......... .......... .......... .......... .......... 58% 37.0M 0s\n",
" 2150K .......... .......... .......... .......... .......... 59% 59.5M 0s\n",
" 2200K .......... .......... .......... .......... .......... 60% 244M 0s\n",
" 2250K .......... .......... .......... .......... .......... 62% 274M 0s\n",
" 2300K .......... .......... .......... .......... .......... 63% 257M 0s\n",
" 2350K .......... .......... .......... .......... .......... 64% 232M 0s\n",
" 2400K .......... .......... .......... .......... .......... 66% 213M 0s\n",
" 2450K .......... .......... .......... .......... .......... 67% 319M 0s\n",
" 2500K .......... .......... .......... .......... .......... 69% 276M 0s\n",
" 2550K .......... .......... .......... .......... .......... 70% 242M 0s\n",
" 2600K .......... .......... .......... .......... .......... 71% 251M 0s\n",
" 2650K .......... .......... .......... .......... .......... 73% 252M 0s\n",
" 2700K .......... .......... .......... .......... .......... 74% 356M 0s\n",
" 2750K .......... .......... .......... .......... .......... 75% 328M 0s\n",
" 2800K .......... .......... .......... .......... .......... 77% 296M 0s\n",
" 2850K .......... .......... .......... .......... .......... 78% 288M 0s\n",
" 2900K .......... .......... .......... .......... .......... 79% 378M 0s\n",
" 2950K .......... .......... .......... .......... .......... 81% 420M 0s\n",
" 3000K .......... .......... .......... .......... .......... 82% 379M 0s\n",
" 3050K .......... .......... .......... .......... .......... 83% 399M 0s\n",
" 3100K .......... .......... .......... .......... .......... 85% 412M 0s\n",
" 3150K .......... .......... .......... .......... .......... 86% 405M 0s\n",
" 3200K .......... .......... .......... .......... .......... 87% 333M 0s\n",
" 3250K .......... .......... .......... .......... .......... 89% 342M 0s\n",
" 3300K .......... .......... .......... .......... .......... 90% 420M 0s\n",
" 3350K .......... .......... .......... .......... .......... 92% 419M 0s\n",
" 3400K .......... .......... .......... .......... .......... 93% 388M 0s\n",
" 3450K .......... .......... .......... .......... .......... 94% 684K 0s\n",
" 3500K .......... .......... .......... .......... .......... 96% 77.5M 0s\n",
" 3550K .......... .......... .......... .......... .......... 97% 96.2M 0s\n",
" 3600K .......... .......... .......... .......... .......... 98% 75.6M 0s\n",
" 3650K .......... .......... .......... .......... ..... 100% 102M=0.7s\n",
"\n",
"2020-02-24 00:28:07 (5.49 MB/s) - ‘raw_data/emp-single-sequences/barcodes.fastq.gz’ saved [3783785/3783785]\n",
"\n"
],
"name": "stderr"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "wvNVoWxGWzFh",
"colab_type": "code",
"colab": {}
},
"source": [
"def hamming_distance(s1, s2):\n",
" \"\"\"\n",
" Return the Hamming distance between equal-length sequences\n",
" https://en.wikipedia.org/wiki/Hamming_distance#Algorithm_example\n",
" \"\"\"\n",
" if len(s1) != len(s2):\n",
" raise ValueError(\"Undefined for sequences of unequal length\")\n",
" return sum(el1 != el2 for el1, el2 in zip(s1, s2)) "
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "AdiDnKNTWzFj",
"colab_type": "code",
"colab": {}
},
"source": [
"data = pd.read_csv('raw_data/sample-metadata.tsv', delimiter=\"\\t\")\n",
"correct_barcodes = data.BarcodeSequence.tolist()\n",
"\n",
"barcodes_file_path = \"raw_data/emp-single-sequences/barcodes.fastq.gz\"\n",
"with gzip.open(barcodes_file_path, 'rt') as f:\n",
" # Barcode seqs are fonud on every 4th line starting with the 2nd line\n",
" barcodes = set([v.strip() for i, v in enumerate(f, start=3) if i % 4 == 0])"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "8UkHRagQWzFm",
"colab_type": "code",
"colab": {}
},
"source": [
"ham_thresh = 3 # 5 is too high\n",
"corrected_barcodes_dict = {}\n",
"\n",
"for barcode in barcodes:\n",
" if barcode in correct_barcodes:\n",
" # Skip correct barcodes\n",
" continue\n",
" hamming_barcodes_dict = {cb:hamming_distance(barcode, cb) for cb in correct_barcodes}\n",
" min_ham_barcode = min(hamming_barcodes_dict, key=hamming_barcodes_dict.get)\n",
" if hamming_barcodes_dict[min_ham_barcode] > ham_thresh:\n",
" # Skip barcodes above the threshold value\n",
" continue\n",
" if list(hamming_barcodes_dict.values()).count(hamming_barcodes_dict[min_ham_barcode]) > 1:\n",
" # Parsimony is used to assign the correct barcode,\n",
" # if 2 or more seqs have the same minimum hamming \n",
" # distance then the correct barcode is ambigious.\n",
" continue \n",
" corrected_barcodes_dict[barcode] = min_ham_barcode"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "G54VZhbzWzFp",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"outputId": "d2ff78b7-4e12-4de1-ab57-976796b57be0"
},
"source": [
"print(f\"Corrected {len(corrected_barcodes_dict)} barcodes\")"
],
"execution_count": 16,
"outputs": [
{
"output_type": "stream",
"text": [
"Corrected 76 barcodes\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "MpH_7P3_WzFs",
"colab_type": "code",
"colab": {}
},
"source": [
"with gzip.open(barcodes_file_path, \"rt\") as fin:\n",
" with gzip.open(barcodes_file_path[:-2]+\"corrected.gz\", \"wt\") as fout:\n",
" c = 0\n",
" for i, line in enumerate(fin, start=3):\n",
" if i % 4 == 0 and line.strip() in corrected_barcodes_dict:\n",
" fout.write(line.replace(\n",
" line.strip(), \n",
" corrected_barcodes_dict[line.strip()]\n",
" ))\n",
" c += 1\n",
" else:\n",
" fout.write(line)"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "3CZYKs7PWzFv",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"outputId": "b017c4b8-3f0b-4041-fd83-03a5290cc492"
},
"source": [
"print(f\"Replaced barcodes for {c} sequences\")"
],
"execution_count": 18,
"outputs": [
{
"output_type": "stream",
"text": [
"Replaced barcodes for 80 sequences\n"
],
"name": "stdout"
}
]
}
]
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment