@ivirshup
Last active July 20, 2024 03:32
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook explores an API for exposing a few CellGuide artifacts via cellxgene_census.\n",
"\n",
"It contains some preliminary functions and some comments on the data they return."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Retrieving data"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"CELL_GUIDE_BASE_URI = \"https://cellguide.cellxgene.cziscience.com\"\n",
"LATEST_SNAPSHOT = requests.get(f\"{CELL_GUIDE_BASE_URI}/latest_snapshot_identifier\").text"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"def _get_cellguide_file(relpth: str, snapshot: str = LATEST_SNAPSHOT) -> requests.Response:\n",
"    # Fetch one file from a CellGuide snapshot; fail on HTTP errors or an empty body.\n",
"    req = requests.get(f\"{CELL_GUIDE_BASE_URI}/{snapshot}/{relpth}\")\n",
"    req.raise_for_status()\n",
"    if req.text == \"\":\n",
"        raise ValueError(f\"No record found for {snapshot}/{relpth}\")\n",
"    return req"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"def get_computational_marker_genes(ontology_id: str, *, snapshot=LATEST_SNAPSHOT) -> pd.DataFrame:\n",
"    resp = _get_cellguide_file(f\"computational_marker_genes/{ontology_id}.json\", snapshot=snapshot)\n",
"    return pd.DataFrame.from_records(resp.json(), exclude=[\"groupby_dims\"])"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>me</th>\n",
" <th>pc</th>\n",
" <th>marker_score</th>\n",
" <th>specificity</th>\n",
" <th>gene_ontology_term_id</th>\n",
" <th>symbol</th>\n",
" <th>name</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>4.113497</td>\n",
" <td>0.962963</td>\n",
" <td>3.052858</td>\n",
" <td>0.991119</td>\n",
" <td>ENSG00000011465</td>\n",
" <td>DCN</td>\n",
" <td>decorin</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2.918067</td>\n",
" <td>0.949495</td>\n",
" <td>2.992588</td>\n",
" <td>0.994485</td>\n",
" <td>ENSG00000077942</td>\n",
" <td>FBLN1</td>\n",
" <td>fibulin 1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2.510601</td>\n",
" <td>0.932660</td>\n",
" <td>2.825345</td>\n",
" <td>1.000000</td>\n",
" <td>ENSG00000182326</td>\n",
" <td>C1S</td>\n",
" <td>complement C1s</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2.639954</td>\n",
" <td>0.929293</td>\n",
" <td>2.765793</td>\n",
" <td>0.998127</td>\n",
" <td>ENSG00000159403</td>\n",
" <td>C1R</td>\n",
" <td>complement C1r</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2.802653</td>\n",
" <td>0.978114</td>\n",
" <td>2.678388</td>\n",
" <td>0.998390</td>\n",
" <td>ENSG00000149131</td>\n",
" <td>SERPING1</td>\n",
" <td>serpin family G member 1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>195</th>\n",
" <td>2.229963</td>\n",
" <td>0.316498</td>\n",
" <td>0.675911</td>\n",
" <td>1.000000</td>\n",
" <td>ENSG00000064205</td>\n",
" <td>CCN5</td>\n",
" <td>cellular communication network factor 5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>196</th>\n",
" <td>2.594995</td>\n",
" <td>0.961279</td>\n",
" <td>0.675779</td>\n",
" <td>0.975155</td>\n",
" <td>ENSG00000136156</td>\n",
" <td>ITM2B</td>\n",
" <td>integral membrane protein 2B</td>\n",
" </tr>\n",
" <tr>\n",
" <th>197</th>\n",
" <td>1.997101</td>\n",
" <td>0.313131</td>\n",
" <td>0.674976</td>\n",
" <td>1.000000</td>\n",
" <td>ENSG00000157227</td>\n",
" <td>MMP14</td>\n",
" <td>matrix metallopeptidase 14</td>\n",
" </tr>\n",
" <tr>\n",
" <th>198</th>\n",
" <td>1.874685</td>\n",
" <td>0.306397</td>\n",
" <td>0.670845</td>\n",
" <td>0.975000</td>\n",
" <td>ENSG00000174059</td>\n",
" <td>CD34</td>\n",
" <td>CD34 molecule</td>\n",
" </tr>\n",
" <tr>\n",
" <th>199</th>\n",
" <td>1.832210</td>\n",
" <td>0.407407</td>\n",
" <td>0.664746</td>\n",
" <td>0.955556</td>\n",
" <td>ENSG00000233913</td>\n",
" <td>RPL10P9</td>\n",
" <td>ribosomal protein L10 pseudogene 9</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>200 rows × 7 columns</p>\n",
"</div>"
],
"text/plain": [
" me pc marker_score specificity gene_ontology_term_id \\\n",
"0 4.113497 0.962963 3.052858 0.991119 ENSG00000011465 \n",
"1 2.918067 0.949495 2.992588 0.994485 ENSG00000077942 \n",
"2 2.510601 0.932660 2.825345 1.000000 ENSG00000182326 \n",
"3 2.639954 0.929293 2.765793 0.998127 ENSG00000159403 \n",
"4 2.802653 0.978114 2.678388 0.998390 ENSG00000149131 \n",
".. ... ... ... ... ... \n",
"195 2.229963 0.316498 0.675911 1.000000 ENSG00000064205 \n",
"196 2.594995 0.961279 0.675779 0.975155 ENSG00000136156 \n",
"197 1.997101 0.313131 0.674976 1.000000 ENSG00000157227 \n",
"198 1.874685 0.306397 0.670845 0.975000 ENSG00000174059 \n",
"199 1.832210 0.407407 0.664746 0.955556 ENSG00000233913 \n",
"\n",
" symbol name \n",
"0 DCN decorin \n",
"1 FBLN1 fibulin 1 \n",
"2 C1S complement C1s \n",
"3 C1R complement C1r \n",
"4 SERPING1 serpin family G member 1 \n",
".. ... ... \n",
"195 CCN5 cellular communication network factor 5 \n",
"196 ITM2B integral membrane protein 2B \n",
"197 MMP14 matrix metallopeptidase 14 \n",
"198 CD34 CD34 molecule \n",
"199 RPL10P9 ribosomal protein L10 pseudogene 9 \n",
"\n",
"[200 rows x 7 columns]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"res = get_computational_marker_genes(\"CL_0000005\")\n",
"res"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"def get_canonical_marker_genes(ontology_id: str, *, snapshot=LATEST_SNAPSHOT) -> pd.DataFrame:\n",
"    resp = _get_cellguide_file(f\"canonical_marker_genes/{ontology_id}.json\", snapshot=snapshot)\n",
"    return pd.DataFrame.from_records(resp.json())\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This JSON doesn't include the Ensembl gene ID, which is a problem. It should include the gene ID that CZI maps each symbol to internally."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>tissue</th>\n",
" <th>symbol</th>\n",
" <th>name</th>\n",
" <th>publication</th>\n",
" <th>publication_titles</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>large intestine</td>\n",
" <td>CD4</td>\n",
" <td>CD4</td>\n",
" <td></td>\n",
" <td></td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>lymph node</td>\n",
" <td>CD3D</td>\n",
" <td>CD3D</td>\n",
" <td>10.4049/jimmunol.1701025</td>\n",
" <td>Identity and Diversity of Human Peripheral Th ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>lymph node</td>\n",
" <td>ID2</td>\n",
" <td>ID2</td>\n",
" <td>10.4049/jimmunol.1701025</td>\n",
" <td>Identity and Diversity of Human Peripheral Th ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>lymph node</td>\n",
" <td>IL7R</td>\n",
" <td>IL7R</td>\n",
" <td>10.4049/jimmunol.1701025</td>\n",
" <td>Identity and Diversity of Human Peripheral Th ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>All Tissues</td>\n",
" <td>CD3D</td>\n",
" <td>CD3D</td>\n",
" <td>10.4049/jimmunol.1701025</td>\n",
" <td>Identity and Diversity of Human Peripheral Th ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>All Tissues</td>\n",
" <td>CD4</td>\n",
" <td>CD4</td>\n",
" <td></td>\n",
" <td></td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>All Tissues</td>\n",
" <td>ID2</td>\n",
" <td>ID2</td>\n",
" <td>10.4049/jimmunol.1701025</td>\n",
" <td>Identity and Diversity of Human Peripheral Th ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>All Tissues</td>\n",
" <td>IL7R</td>\n",
" <td>IL7R</td>\n",
" <td>10.4049/jimmunol.1701025</td>\n",
" <td>Identity and Diversity of Human Peripheral Th ...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" tissue symbol name publication \\\n",
"0 large intestine CD4 CD4 \n",
"1 lymph node CD3D CD3D 10.4049/jimmunol.1701025 \n",
"2 lymph node ID2 ID2 10.4049/jimmunol.1701025 \n",
"3 lymph node IL7R IL7R 10.4049/jimmunol.1701025 \n",
"4 All Tissues CD3D CD3D 10.4049/jimmunol.1701025 \n",
"5 All Tissues CD4 CD4 \n",
"6 All Tissues ID2 ID2 10.4049/jimmunol.1701025 \n",
"7 All Tissues IL7R IL7R 10.4049/jimmunol.1701025 \n",
"\n",
" publication_titles \n",
"0 \n",
"1 Identity and Diversity of Human Peripheral Th ... \n",
"2 Identity and Diversity of Human Peripheral Th ... \n",
"3 Identity and Diversity of Human Peripheral Th ... \n",
"4 Identity and Diversity of Human Peripheral Th ... \n",
"5 \n",
"6 Identity and Diversity of Human Peripheral Th ... \n",
"7 Identity and Diversity of Human Peripheral Th ... "
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"res = get_canonical_marker_genes(\"CL_0000545\")\n",
"res"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The description files fetched below are not stored under a snapshot identifier, so there is no versioning on them, which would be a problem if they are updated."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"def get_description(ontology_id: str, validated=True) -> dict | str:\n",
"    if validated:\n",
"        subdir = \"validated_descriptions\"\n",
"    else:\n",
"        subdir = \"gpt_descriptions\"\n",
"\n",
"    req = requests.get(f\"{CELL_GUIDE_BASE_URI}/{subdir}/{ontology_id}.json\")\n",
"    if req.text == \"\":\n",
"        raise ValueError(f\"No record found for {subdir}/{ontology_id}\")\n",
"    return req.json()"
]
},
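{
"cell_type": "markdown",
"metadata": {},
"source": [
"The formats seem to differ depending on whether the description is validated: validated descriptions come back as a dict with `description` and `references` keys, while the GPT descriptions come back as a plain string. I expect we'd want to hide that from the user."
]
},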
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'description': 'Stratified epithelial cells form layers of cells stacked on top of one another, and they primarily function as a protective barrier. Their appearance can be cuboidal, columnar or squamous, and they offer a layer of defense against foreign bodies, harmful substances, pathogens, and prevent water loss and electrolyte imbalance within the body. Depending on the organ and its specific function, the topmost layer of these cells may be specialized as keratinized or non-keratinized. The skin, being constantly exposed to external physical stress, is composed of keratinized stratified epithelial cells that prevent desiccation and provide additional protection. In contrast, non-keratinized stratified epithelial cells line internally exposed surfaces like the oral cavity, esophagus, and vagina, where lubrication is required. \\n\\nIn addition to protection, stratified epithelial cells also contribute to tissue repair processes, referred to as epithelialization. Basal epithelial cells are capable of rapid division and migration to heal wounds and maintain the protective barrier function of the tissues they line. Despite having a limited lifespan, these cells ensure a constant replacement process, maintaining their population and the integrity of epithelial layers.',\n",
" 'references': ['https://wires.onlinelibrary.wiley.com/doi/10.1002/wdev.146',\n",
" 'https://www.ncbi.nlm.nih.gov/books/NBK534261/']}"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"get_description(\"CL_0000079\")"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Stratified epithelial cells are unique types of cells that majorly constitute the layer of cells stacked on top of each other, typically found lining the surfaces of many organs throughout the body. These cells possess the ability to withstand physical and chemical stresses, making them incredible in protecting the underlying tissues from damage. Their structure offers rigidity, helping maintain the structural integrity of the organs they envelope. \\n\\nStratified epithelial cells primarily function as a protective barrier. They offer a layer of defense against foreign bodies, harmful substances, pathogens and prevent water loss and electrolyte imbalance within the body. Depending on the organ and its specific function, the topmost layer of these cells may be specialized as keratinized or non-keratinized. For instance, the skin, being constantly exposed to external physical stress, has keratinized stratified epithelial cells that prevent desiccation and provide additional protection. On the other hand, non-keratinized stratified epithelial cells line internally exposed surfaces like the oral cavity, esophagus, and vagina, where lubrication is required. \\n\\nIn addition to protection, stratified epithelial cells also play essential roles in sensory perception such as touch and pressure, particularly in the skin where they work together with other types of cells in the epidermis. Furthermore, they contribute to tissue repair processes, demonstrated by their robust regenerative capacity. Stratified epithelial cells are capable of rapid division and migration to heal wounds and maintain the protective barrier function of the tissues they line. Despite having a limited lifespan, these cells ensure a constant replacement process, maintaining their population and the integrity of epithelial layers. The ability of these cells to fulfill such varied functions is a testament to their indispensable role in maintaining homeostasis and overall body health.'"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"get_description(\"CL_0000079\", validated=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Listing available data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Just accessing https://cellguide.cellxgene.cziscience.com gives a \"list bucket\" result from the S3 bucket. I don't know how that is set up, but it would be nice to be able to list subdirectories and get a lower-level look.\n",
"\n",
"However, the S3 bucket itself isn't public (though it does have \"public\" in the name...), so I don't think I can just use s3fs here."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "cellxgene-census-dev",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.9"
}
},
"nbformat": 4,
"nbformat_minor": 2
}