@mr-eyes
Last active January 19, 2023 08:27
kSpider Demo
name: kspider
channels:
  - conda-forge
  - bioconda
dependencies:
  - python=3.9
  - pip
  - sourmash
  - pip:
    - kSpider
{
"cells": [
{
"cell_type": "markdown",
"id": "ad8278f7-7eeb-47b9-98f3-c5f93b43f4c5",
"metadata": {},
"source": [
"<div align=\"center\">\n",
"<h1 style=\"color:darkred;\"> kSpider Demo </h1>\n",
"\n",
"\n",
"\n",
"\n",
"<p style=\"text-align:center;\"><img src=\"https://camo.githubusercontent.com/1f6c503d8d682eeec44fdddcde6c0882b289736b2258e8631b7ad3d81c9cd186/68747470733a2f2f692e6962622e636f2f723636566859632f363337333035393034382d30303161626536312d316133632d343863372d616635312d3066643332376239633138612e706e67\" alt=\"Logo\" width=\"150\" height=\"200\"></p>\n",
" \n",
"<h2><u>Introduction</u></h2>\n",
"\n",
"<div align=\"left\" style=\"\n",
"border: 2px solid black;\n",
"padding: 10px;\n",
"width: 600px;\">\n",
"\n",
"<b>kSpider</b> is sequence clustering command-line tool. Given a directory of datasets, kSpider generate a\n",
"pairwise [containment/ani] matrix, that can be clustered with any cutoff threshold.\n",
"\n",
"<hr>\n",
"\n",
"<ol>\n",
"\n",
"<li><b>Creating an inverted index</b>\n",
" Here we use a modified version of <a href=\"https://github.com/dib-lab/kProcessor/\">kProcessor</a> to create an inverted index. Each k-mer in the index will\n",
" have a corresponding color. Each color will refer to a specific combination of source IDs</li>\n",
"\n",
"<li><b>Pairwise matrix construction</b>\n",
" We iterate over each k-mer in the index, then for each combination of source IDs pair, we increment the\n",
" count of the number of shared k-mers between them by the color count. So when we iterate over all the\n",
" k-mers, we don't pairwise compare any two sources that are not sharing any k-mers.</li>\n",
"\n",
"<li>After getting the sparse pairwise matrix of the number of shared kmers, we convert the shared kmers\n",
" number into distance (containment, ANI, etc ...).</li>\n",
"\n",
"<li><b>Clustering</b>\n",
" Here we perform graph-based clustering. Each node is a source, each edge is a distance. An edge would\n",
" only be created if the distance is >= the user predefined threshold. After constructing the graph, each\n",
" connected component will represent a cluster.</li>\n",
"</ol>\n",
"</div>\n",
"\n",
"</div>"
]
},
{
"cell_type": "markdown",
"id": "2a9e7912-d763-43b5-bf48-e8735591bbcd",
"metadata": {},
"source": [
"### Download the dataset (some samples from a Tara Ocean's project)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d5c82661-57f0-4e5d-a476-73d6dcc045d9",
"metadata": {},
"outputs": [],
"source": [
"!wget https://farm.cse.ucdavis.edu/~mhussien/kSpider_workshop/tara_oceans.zip"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6ac7ae09-edb8-48c7-82a6-1d0216b72ded",
"metadata": {},
"outputs": [],
"source": [
"!unzip tara_oceans.zip"
]
},
{
"cell_type": "markdown",
"id": "a5f09d98-a8bd-4c6c-967d-0af1b1bb86aa",
"metadata": {
"tags": []
},
"source": [
"### Exploring the dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cdc24e79-babf-4005-b509-acfd44c398ce",
"metadata": {},
"outputs": [],
"source": [
"%%sh\n",
"ls sigs"
]
},
{
"cell_type": "markdown",
"id": "a6e743e8-3143-499a-ab07-6961c7b633aa",
"metadata": {},
"source": [
"### Sourmash sig describe"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "80dd5e08-df58-4b06-aa19-2332af5e9a55",
"metadata": {},
"outputs": [],
"source": [
"%%sh\n",
"sourmash sig describe sigs/84S.sig"
]
},
{
"cell_type": "markdown",
"id": "3d6e6586-007f-401c-a040-3f4b80da57c6",
"metadata": {},
"source": [
"<hr>"
]
},
{
"cell_type": "markdown",
"id": "89f8db25-14a4-497b-8b6f-0b83f23b02c8",
"metadata": {},
"source": [
"### Indexing"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "366bc9fe-7b4c-4556-8ed8-d5df85faba40",
"metadata": {},
"outputs": [],
"source": [
"%%sh\n",
"kSpider --help"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5f34c360-4905-410e-9aa5-3f785a07d9e4",
"metadata": {},
"outputs": [],
"source": [
"%%sh\n",
"kSpider index --help"
]
},
{
"cell_type": "markdown",
"id": "ba203196-58f3-414d-b929-e1c0d4b59589",
"metadata": {},
"source": [
"#### Change directory"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bee945df-a3a1-4b1d-b878-abd1308e6d4c",
"metadata": {},
"outputs": [],
"source": [
"!mkdir idx_k21\n",
"%cd idx_k21"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "67b5fc5c-13d2-4130-a94c-ec4578bb43aa",
"metadata": {},
"outputs": [],
"source": [
"%%sh\n",
"kSpider index --dir ../sigs -k 21 --sourmash"
]
},
{
"cell_type": "markdown",
"id": "43a9555b-00f9-4d93-83e2-192836ec7942",
"metadata": {},
"source": [
"<hr>"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b9bf8c48-c8d5-4087-ae53-21874128fcc0",
"metadata": {},
"outputs": [],
"source": [
"!ls"
]
},
{
"cell_type": "markdown",
"id": "9c3f8f66-fbc9-4f22-86eb-2edf3eb8a5fb",
"metadata": {},
"source": [
"### Perform pairwise comparisons"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b0cf64b9-ad24-4d61-8b58-a143eb88aa6f",
"metadata": {},
"outputs": [],
"source": [
"!kSpider pairwise --help"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "10c10b6d-53c4-48a2-93bc-ab0097f6516d",
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"kSpider pairwise -i sigs"
]
},
{
"cell_type": "markdown",
"id": "238553b7-3725-4632-97d8-62bb5e20e1bc",
"metadata": {},
"source": [
"#### Explore"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1a7c5834-a617-4202-bd5e-4d8a256e306a",
"metadata": {},
"outputs": [],
"source": [
"!ls"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "41689104-9b94-4ec6-8723-cdf697775156",
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"head sigs_kSpider_pairwise.tsv"
]
},
{
"cell_type": "markdown",
"id": "59ae02cc-f25d-4e94-a5ac-664e4a0f228f",
"metadata": {},
"source": [
"### Estimate ani?"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "84d418fb-e6ae-44a1-9dc0-a37796945a67",
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"kSpider pairwise -i sigs --estimate-ani -s 10000"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f55ac074-d2ad-4ce2-9cd1-6f27b2703843",
"metadata": {},
"outputs": [],
"source": [
"!ls"
]
},
{
"cell_type": "markdown",
"id": "53b316f9-19cc-4863-9dd0-d12f210a3688",
"metadata": {},
"source": [
"<hr>"
]
},
{
"cell_type": "markdown",
"id": "889702d0-04e9-4e13-a48b-cdb0ce6d2b10",
"metadata": {},
"source": [
"### Comparison exporting"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d92f8103-41d1-4125-8e2d-45e48abeb624",
"metadata": {},
"outputs": [],
"source": [
"!kSpider export --help"
]
},
{
"cell_type": "markdown",
"id": "3c1cc88c-cf61-4350-9c61-74dd8064ca26",
"metadata": {},
"source": [
"#### max_containment"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "68dcea1a-df96-4285-bde7-49e6c4a62f5d",
"metadata": {},
"outputs": [],
"source": [
"!kSpider export -i sigs --newick --dist-type max_cont -o kSpider_maxCont ## Sorry it should be max_cont, not max_containment"
]
},
{
"cell_type": "markdown",
"id": "12172eda-a9c2-4397-a8ae-c52b4207e38a",
"metadata": {},
"source": [
"#### avg_ani"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e4acd647-8c33-48b1-80c4-0649d73c5c46",
"metadata": {},
"outputs": [],
"source": [
"!kSpider export -i sigs --newick --dist-type ani -o kSpider_avgANI"
]
},
{
"cell_type": "markdown",
"id": "a8d84b35-2710-4183-9a6a-7b855be2aed2",
"metadata": {},
"source": [
"### Visualize on https://itol.embl.de/"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9224f962-c864-403d-96a4-47a1636beadd",
"metadata": {},
"outputs": [],
"source": [
"!cat kSpider_maxCont.newick"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a36e4746-e80c-4e2b-ae3f-52aef7f78c4a",
"metadata": {},
"outputs": [],
"source": [
"!cat kSpider_avgANI.newick"
]
},
{
"cell_type": "markdown",
"id": "0f155bb1-8338-477b-8ef4-4b6a8f69cfd3",
"metadata": {},
"source": [
"### Distance dissimilarity matrix "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "47a89565-9699-4684-99ce-13eebe2625cc",
"metadata": {},
"outputs": [],
"source": [
"!head -n2 kSpider_maxCont_distmat.tsv"
]
},
{
"cell_type": "markdown",
"id": "8f228e0c-2ebc-42fa-8fcc-a6c5700d9361",
"metadata": {},
"source": [
"<hr> "
]
},
{
"cell_type": "markdown",
"id": "1a12dce9-3e36-41af-a6c8-668043ba79d4",
"metadata": {},
"source": [
"### Graph-based Clustering"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c81e71dc-5eb5-4af4-8d73-e8e3c3e43126",
"metadata": {},
"outputs": [],
"source": [
"!kSpider cluster --help"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "df0337e4-20b9-4b92-b162-5239e208a73a",
"metadata": {},
"outputs": [],
"source": [
"!kSpider cluster -c 0.98 -i sigs -d max_cont"
]
},
{
"cell_type": "markdown",
"id": "6940bf1d-18cb-4447-8d38-8082e7703f05",
"metadata": {},
"source": [
"## Thank you!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.15"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
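
The four steps outlined in the notebook's introduction (inverted index, pairwise shared k-mer counts, distance conversion, connected-component clustering) can be sketched in plain Python as follows. This is only an illustrative sketch, not kSpider's actual implementation: the function names, the toy k-mer sets, and the choice of max containment computed as shared k-mers divided by the smaller set size are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def build_inverted_index(sources):
    """Map each k-mer to its 'color': the sorted tuple of source IDs containing it."""
    kmer_to_sources = defaultdict(set)
    for source_id, kmers in sources.items():
        for kmer in kmers:
            kmer_to_sources[kmer].add(source_id)
    return {kmer: tuple(sorted(ids)) for kmer, ids in kmer_to_sources.items()}


def shared_kmer_counts(index):
    """Count shared k-mers per source pair, visiting only pairs that co-occur in a color."""
    color_counts = defaultdict(int)
    for color in index.values():
        color_counts[color] += 1  # number of k-mers carrying this color
    shared = defaultdict(int)
    for color, count in color_counts.items():
        for a, b in combinations(color, 2):
            shared[(a, b)] += count  # increment by the color count
    return dict(shared)


def max_containment(shared, sources):
    """Assumed metric: shared k-mers divided by the size of the smaller k-mer set."""
    return {
        (a, b): n / min(len(sources[a]), len(sources[b]))
        for (a, b), n in shared.items()
    }


def cluster(similarity, nodes, cutoff):
    """Connect pairs scoring >= cutoff; each connected component is one cluster."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), score in similarity.items():
        if score >= cutoff:
            parent[find(a)] = find(b)

    groups = defaultdict(list)
    for n in nodes:
        groups[find(n)].append(n)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical toy input: three sources with tiny k-mer sets.
sources = {
    "A": {"AAA", "AAC", "ACG"},
    "B": {"AAA", "AAC", "CGT"},
    "C": {"TTT", "TTG"},
}
index = build_inverted_index(sources)
shared = shared_kmer_counts(index)
similarity = max_containment(shared, sources)
clusters = cluster(similarity, sources.keys(), cutoff=0.5)
```

In this toy input only the pair (A, B) ever appears in `shared`, because C shares no k-mers with the other sources and is therefore never compared against them. With a cutoff of 0.5, A and B end up in one cluster and C in its own.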