Last active
January 19, 2023 08:27
kSpider Demo
name: kspider
channels:
  - conda-forge
  - bioconda
dependencies:
  - python=3.9
  - pip
  - sourmash
  - pip:
    - kSpider
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"id": "ad8278f7-7eeb-47b9-98f3-c5f93b43f4c5", | |
"metadata": {}, | |
"source": [ | |
"<div align=\"center\">\n", | |
"<h1 style=\"color:darkred;\"> kSpider Demo </h1>\n", | |
"\n", | |
"\n", | |
"\n", | |
"\n", | |
"<p style=\"text-align:center;\"><img src=\"https://camo.githubusercontent.com/1f6c503d8d682eeec44fdddcde6c0882b289736b2258e8631b7ad3d81c9cd186/68747470733a2f2f692e6962622e636f2f723636566859632f363337333035393034382d30303161626536312d316133632d343863372d616635312d3066643332376239633138612e706e67\" alt=\"Logo\" width=\"150\" height=\"200\"></p>\n", | |
" \n", | |
"<h2><u>Introduction</u></h2>\n", | |
"\n", | |
"<div align=\"left\" style=\"\n", | |
"border: 2px solid black;\n", | |
"padding: 10px;\n", | |
"width: 600px;\">\n", | |
"\n", | |
"<b>kSpider</b> is a command-line tool for sequence clustering. Given a directory of datasets, kSpider generates a\n", | |
"pairwise [containment/ANI] matrix that can be clustered at any cutoff threshold.\n", | |
"\n", | |
"<hr>\n", | |
"\n", | |
"<ol>\n", | |
"\n", | |
"<li><b>Creating an inverted index</b>\n", | |
" Here we use a modified version of <a href=\"https://github.com/dib-lab/kProcessor/\">kProcessor</a> to create an inverted index. Each k-mer in the index has\n", | |
" a corresponding color, and each color refers to a specific combination of source IDs.</li>\n", | |
"\n", | |
"<li><b>Pairwise matrix construction</b>\n", | |
" We iterate over each k-mer in the index; for every pair of source IDs in its color, we increment that pair's\n", | |
" count of shared k-mers by the color count. Because only pairs that appear together in a color are visited,\n", | |
" two sources that share no k-mers are never compared.</li>\n", | |
"\n", | |
"<li>After building the sparse pairwise matrix of shared k-mer counts, we convert each count\n", | |
" into a distance metric (containment, ANI, etc.).</li>\n", | |
"\n", | |
"<li><b>Clustering</b>\n", | |
" Here we perform graph-based clustering. Each node is a source and each edge is a distance. An edge is\n", | |
" created only if the distance is >= the user-defined threshold. Once the graph is constructed, each\n", | |
" connected component represents a cluster.</li>\n", | |
"</ol>\n", | |
"</div>\n", | |
"\n", | |
"</div>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "2a9e7912-d763-43b5-bf48-e8735591bbcd", | |
"metadata": {}, | |
"source": [ | |
"### Download the dataset (a few samples from the Tara Oceans project)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "d5c82661-57f0-4e5d-a476-73d6dcc045d9", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"!wget https://farm.cse.ucdavis.edu/~mhussien/kSpider_workshop/tara_oceans.zip" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "6ac7ae09-edb8-48c7-82a6-1d0216b72ded", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"!unzip tara_oceans.zip" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "a5f09d98-a8bd-4c6c-967d-0af1b1bb86aa", | |
"metadata": { | |
"tags": [] | |
}, | |
"source": [ | |
"### Exploring the dataset" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "cdc24e79-babf-4005-b509-acfd44c398ce", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"%%sh\n", | |
"ls sigs" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "a6e743e8-3143-499a-ab07-6961c7b633aa", | |
"metadata": {}, | |
"source": [ | |
"### Sourmash sig describe" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "80dd5e08-df58-4b06-aa19-2332af5e9a55", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"%%sh\n", | |
"sourmash sig describe sigs/84S.sig" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "3d6e6586-007f-401c-a040-3f4b80da57c6", | |
"metadata": {}, | |
"source": [ | |
"<hr>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "89f8db25-14a4-497b-8b6f-0b83f23b02c8", | |
"metadata": {}, | |
"source": [ | |
"### Indexing" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "366bc9fe-7b4c-4556-8ed8-d5df85faba40", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"%%sh\n", | |
"kSpider --help" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "5f34c360-4905-410e-9aa5-3f785a07d9e4", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"%%sh\n", | |
"kSpider index --help" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "ba203196-58f3-414d-b929-e1c0d4b59589", | |
"metadata": {}, | |
"source": [ | |
"#### Change directory" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "bee945df-a3a1-4b1d-b878-abd1308e6d4c", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"!mkdir idx_k21\n", | |
"%cd idx_k21" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "67b5fc5c-13d2-4130-a94c-ec4578bb43aa", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"%%sh\n", | |
"kSpider index --dir ../sigs -k 21 --sourmash" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "43a9555b-00f9-4d93-83e2-192836ec7942", | |
"metadata": {}, | |
"source": [ | |
"<hr>" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "b9bf8c48-c8d5-4087-ae53-21874128fcc0", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"!ls" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "9c3f8f66-fbc9-4f22-86eb-2edf3eb8a5fb", | |
"metadata": {}, | |
"source": [ | |
"### Perform pairwise comparisons" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "b0cf64b9-ad24-4d61-8b58-a143eb88aa6f", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"!kSpider pairwise --help" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "10c10b6d-53c4-48a2-93bc-ab0097f6516d", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"%%bash\n", | |
"kSpider pairwise -i sigs" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "238553b7-3725-4632-97d8-62bb5e20e1bc", | |
"metadata": {}, | |
"source": [ | |
"#### Explore" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "1a7c5834-a617-4202-bd5e-4d8a256e306a", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"!ls" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "41689104-9b94-4ec6-8723-cdf697775156", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"%%bash\n", | |
"head sigs_kSpider_pairwise.tsv" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "59ae02cc-f25d-4e94-a5ac-664e4a0f228f", | |
"metadata": {}, | |
"source": [ | |
"### Estimate ANI" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "84d418fb-e6ae-44a1-9dc0-a37796945a67", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"%%bash\n", | |
"kSpider pairwise -i sigs --estimate-ani -s 10000" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "f55ac074-d2ad-4ce2-9cd1-6f27b2703843", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"!ls" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "53b316f9-19cc-4863-9dd0-d12f210a3688", | |
"metadata": {}, | |
"source": [ | |
"<hr>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "889702d0-04e9-4e13-a48b-cdb0ce6d2b10", | |
"metadata": {}, | |
"source": [ | |
"### Comparison exporting" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "d92f8103-41d1-4125-8e2d-45e48abeb624", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"!kSpider export --help" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "3c1cc88c-cf61-4350-9c61-74dd8064ca26", | |
"metadata": {}, | |
"source": [ | |
"#### max_containment" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "68dcea1a-df96-4285-bde7-49e6c4a62f5d", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"!kSpider export -i sigs --newick --dist-type max_cont -o kSpider_maxCont  # the --dist-type value is max_cont, not max_containment" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "12172eda-a9c2-4397-a8ae-c52b4207e38a", | |
"metadata": {}, | |
"source": [ | |
"#### avg_ani" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "e4acd647-8c33-48b1-80c4-0649d73c5c46", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"!kSpider export -i sigs --newick --dist-type ani -o kSpider_avgANI" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "a8d84b35-2710-4183-9a6a-7b855be2aed2", | |
"metadata": {}, | |
"source": [ | |
"### Visualize on https://itol.embl.de/" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "9224f962-c864-403d-96a4-47a1636beadd", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"!cat kSpider_maxCont.newick" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "a36e4746-e80c-4e2b-ae3f-52aef7f78c4a", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"!cat kSpider_avgANI.newick" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "0f155bb1-8338-477b-8ef4-4b6a8f69cfd3", | |
"metadata": {}, | |
"source": [ | |
"### Distance (dissimilarity) matrix" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "47a89565-9699-4684-99ce-13eebe2625cc", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"!head -n2 kSpider_maxCont_distmat.tsv" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "8f228e0c-2ebc-42fa-8fcc-a6c5700d9361", | |
"metadata": {}, | |
"source": [ | |
"<hr> " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "1a12dce9-3e36-41af-a6c8-668043ba79d4", | |
"metadata": {}, | |
"source": [ | |
"### Graph-based Clustering" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "c81e71dc-5eb5-4af4-8d73-e8e3c3e43126", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"!kSpider cluster --help" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "df0337e4-20b9-4b92-b162-5239e208a73a", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"!kSpider cluster -c 0.98 -i sigs -d max_cont" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "6940bf1d-18cb-4447-8d38-8082e7703f05", | |
"metadata": {}, | |
"source": [ | |
"## Thank you!" | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3 (ipykernel)", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.8.15" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 5 | |
} |
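The pipeline described in the notebook's introduction (colored inverted index, pairwise shared k-mer counts, conversion to a similarity value, connected-component clustering) can be illustrated with a small pure-Python sketch. This is not kSpider's implementation: the function names, the toy index, and the choice of max containment (shared k-mers divided by the smaller source's k-mer total) are assumptions made for illustration only.

```python
from collections import defaultdict
from itertools import combinations


def kmer_totals(color_to_sources, color_counts):
    """Total k-mers per source: sum the counts of every color it belongs to."""
    totals = defaultdict(int)
    for color, sources in color_to_sources.items():
        for source in sources:
            totals[source] += color_counts[color]
    return dict(totals)


def shared_kmer_counts(color_to_sources, color_counts):
    """For each color, add its k-mer count to every pair of sources in it.

    Pairs that never appear together in a color are never visited,
    so the result is sparse.
    """
    shared = defaultdict(int)
    for color, sources in color_to_sources.items():
        for a, b in combinations(sorted(sources), 2):
            shared[(a, b)] += color_counts[color]
    return dict(shared)


def max_containment(shared, totals):
    """Assumed metric: shared k-mers over the smaller source's k-mer total."""
    return {(a, b): n / min(totals[a], totals[b]) for (a, b), n in shared.items()}


def cluster(sources, similarity, cutoff):
    """Connected components (union-find) over edges with value >= cutoff."""
    parent = {s: s for s in sources}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), value in similarity.items():
        if value >= cutoff:
            parent[find(a)] = find(b)

    groups = defaultdict(list)
    for s in sources:
        groups[find(s)].append(s)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical toy index: color -> set of source IDs, color -> number of k-mers.
index = {1: {"A"}, 2: {"A", "B"}, 3: {"A", "B", "C"}, 4: {"D"}, 5: {"C"}}
counts = {1: 50, 2: 40, 3: 10, 4: 30, 5: 30}

totals = kmer_totals(index, counts)
shared = shared_kmer_counts(index, counts)
similarity = max_containment(shared, totals)
clusters = cluster(sorted(totals), similarity, cutoff=0.98)
print(clusters)  # [['A', 'B'], ['C'], ['D']]
```

In this toy data only A and B pass the 0.98 cutoff, so they form one cluster, while C and D remain singletons. D never appears in any pair because it shares no color with another source.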