@leiterenato
Created April 11, 2022 14:27
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# AlphaFold Metadata\n",
"\n",
"The AlphaFold inference pipeline uses specialized components to perform tasks such as searching for MSAs (Multiple Sequence Alignments), aggregating features, generating predictions, and relaxing the predicted protein structures.\n",
"Each component can generate different metadata relevant to the task at hand.\n",
"\n",
"In this notebook, you will explore how to retrieve metadata from an AlphaFold inference pipeline.\n",
"\n",
"Before diving into the sample code, let's review some important concepts of the Vertex AI Metadata services."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Overview - Vertex AI Metadata Services\n",
"\n",
"A critical part of the scientific method is recording both your observations and the parameters of an experiment. In data science, it is also critical to track the parameters, artifacts, and metrics used in a machine learning (ML) experiment. \n",
"\n",
"This metadata helps you: \n",
" - Analyze runs of a production ML system to understand changes in the quality of predictions.\n",
" - Analyze ML experiments to compare the effectiveness of different sets of hyperparameters.\n",
" - Track the lineage of ML artifacts, for example datasets and models, to understand just what contributed to the creation of an artifact or how that artifact was used to create descendant artifacts.\n",
" - Rerun an ML workflow with the same artifacts and parameters.\n",
" - Track the downstream usage of ML artifacts for governance purposes.\n",
"\n",
"Vertex ML Metadata lets you record the metadata and artifacts produced by your ML system and query that metadata to help analyze, debug, and audit the performance of your ML system or the artifacts that it produces.\n",
"\n",
"### Overview and terminology\n",
"Vertex ML Metadata captures your ML system's metadata as a graph. In the metadata graph, artifacts and executions are nodes, and events are edges that link artifacts as inputs or outputs of executions. Contexts represent subgraphs that are used to logically group sets of artifacts and executions.\n",
"\n",
"The following introduces the data model and terminology that is used to describe Vertex ML Metadata resources and components.\n",
"\n",
"##### **Context**\n",
"A Context is used to group Artifacts and Executions together under a single, queryable, and typed category. Contexts can be used to represent sets of metadata. An example of a Context would be a run of a machine learning pipeline.\n",
"\n",
"##### **Event**\n",
"An Event describes the relationship between Artifacts and Executions. Each Artifact can be produced by an Execution and consumed by other Executions. Events help you determine the provenance of artifacts in your ML workflows by chaining together Artifacts and Executions.\n",
"\n",
"##### **Execution**\n",
"An Execution is a record of an individual machine learning workflow step, typically annotated with its runtime parameters. Examples of Executions include data ingestion, data validation, model training, model evaluation, and model deployment.\n",
"\n",
"##### **Artifact**\n",
"An Artifact is a discrete entity or piece of data produced and consumed by a machine learning workflow. Examples of Artifacts include input files, transformed datasets, trained models, training logs, and deployed model endpoints.\n",
"\n",
"Here is a sample diagram showing the relationships between contexts, executions, events, and artifacts.\n",
"\n",
"![Metadata graph diagram](/images/metadata.png)"
]
},
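{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of how these concepts map onto the API, every Vertex ML Metadata resource lives under a metadata store and follows a predictable resource-name pattern. The project and location values below are illustrative placeholders, not values from this pipeline:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Resource-name patterns for Vertex ML Metadata (illustrative values)\n",
"store = 'projects/{project}/locations/{location}/metadataStores/default'.format(\n",
"    project='my-project', location='us-central1')\n",
"\n",
"# Contexts, executions, and artifacts are all children of the metadata store\n",
"print(store + '/contexts/{context_id}')\n",
"print(store + '/executions/{execution_id}')\n",
"print(store + '/artifacts/{artifact_id}')"
]
},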
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Environment definition and client instantiation"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from google.cloud import aiplatform_v1 as vertex_ai\n",
"from collections import namedtuple"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"REGION = 'us-central1'\n",
"API_ENDPOINT = \"{}-aiplatform.googleapis.com\".format(REGION)\n",
"PROJECT_ID = 'alphafold-dev-clean'\n",
"PROJECT_NUMBER = 633510463570"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# Create Metadata service client\n",
"client = vertex_ai.MetadataServiceClient(\n",
" client_options={\n",
" \"api_endpoint\": API_ENDPOINT\n",
" }\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Retrieve context ID for Pipeline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a request to list all the contexts in your project (pipeline runs, in this case)."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"ctx_request = vertex_ai.ListContextsRequest(\n",
" parent=\"projects/{0}/locations/{1}/metadataStores/default\".format(PROJECT_ID, REGION)\n",
")\n",
"ctx_list = client.list_contexts(ctx_request)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Choose one of the following context IDs to retrieve information about a pipeline run. Each context ID represents a single pipeline execution."
]
},
{
"cell_type": "code",
"execution_count": 108,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Context ID: projects/633510463570/locations/us-central1/metadataStores/default/contexts/alphafold-inference-one-sequence-20220401135617\n",
"Context ID: projects/633510463570/locations/us-central1/metadataStores/default/contexts/alphafold-inference-one-sequence\n",
"Context ID: projects/633510463570/locations/us-central1/metadataStores/default/contexts/alphafold-inference-mutated-sequences-20220401130734\n",
"Context ID: projects/633510463570/locations/us-central1/metadataStores/default/contexts/alphafold-inference-mutated-sequences\n",
"Context ID: projects/633510463570/locations/us-central1/metadataStores/default/contexts/alphafold-inference-20220401125122\n"
]
}
],
"source": [
"# List the first 5 pipeline executions (contexts) in the project\n",
"for pipeline in list(ctx_list)[:5]:\n",
" print(f'Context ID: {pipeline.name}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### List inference pipeline Artifacts (input sequence and databases)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"# Copy and paste the context ID from the previous step\n",
"ctx_name = 'projects/633510463570/locations/us-central1/metadataStores/default/contexts/alphafold-inference-20220324203328'\n",
"\n",
"# Filter which artifacts to present\n",
"FILTER = f'in_context(\"{ctx_name}\") AND ' \\\n",
" f'(display_name=\"importer.artifact\" OR ' \\\n",
" f'display_name=\"importer-3.artifact\")'\n",
"\n",
"artifact_request = vertex_ai.ListArtifactsRequest(\n",
" parent=\"projects/{0}/locations/{1}/metadataStores/default\".format(PROJECT_ID, REGION),\n",
" filter=FILTER\n",
")\n",
"\n",
"artifacts = client.list_artifacts(request=artifact_request)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Protein sequence file name: gs://af-dataset/fasta/T1050.fasta\n",
"Reference databases.\n",
" - Database: pdb_obsolete\n",
" - Database: pdb70\n",
" - Database: uniclust30\n",
" - Database: uniref90\n",
" - Database: mgnify\n",
" - Database: pdb_mmcif\n",
" - Database: pdb_seqres\n",
" - Database: bfd\n",
" - Database: uniprot\n"
]
}
],
"source": [
"# Print the protein sequence file name and reference databases\n",
"for artifact in artifacts:\n",
" if artifact.display_name == 'importer.artifact':\n",
" print(f'Protein sequence file name: {artifact.uri}')\n",
" if artifact.display_name == 'importer-3.artifact':\n",
" print(f'Reference databases.')\n",
" for db in artifact.metadata:\n",
" print(f' - Database: {db}')"
]
},
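{
"cell_type": "markdown",
"metadata": {},
"source": [
"The loop above prints only the metadata keys. Each Artifact also carries values in its `metadata` map, so you can dump both. The snippet below is a sketch that reuses `client` and `artifact_request` from the previous cells and assumes each metadata value is printable:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Print every metadata key/value pair for the listed artifacts\n",
"for artifact in client.list_artifacts(request=artifact_request):\n",
" for key, value in artifact.metadata.items():\n",
" print(f'{artifact.display_name} - {key}: {value}')"
]
},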
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### List container executions from pipeline (context)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, let's retrieve the IDs of the executions whose metadata we want to query."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"# Define context ID (pipeline run)\n",
"ctx_name = 'projects/633510463570/locations/us-central1/metadataStores/default/contexts/alphafold-inference-20220324203328'\n",
"\n",
"# Filter which executions to retrieve\n",
"FILTER = f'in_context(\"{ctx_name}\") AND ' \\\n",
" '(display_name=\"hhblits\" OR ' \\\n",
" 'display_name=\"hhblits-2\" OR ' \\\n",
" 'display_name=\"jackhmmer\" OR ' \\\n",
" 'display_name=\"jackhmmer-2\" OR ' \\\n",
" 'display_name=\"hhsearch\" OR ' \\\n",
" 'display_name=\"predict\")'\n",
"\n",
"executions_request = vertex_ai.ListExecutionsRequest(\n",
" parent=\"projects/{0}/locations/{1}/metadataStores/default\".format(PROJECT_ID, REGION),\n",
" filter=FILTER\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {},
"outputs": [],
"source": [
"# Retrieve execution ID and display name\n",
"Execution = namedtuple('Execution', 'id display_name')\n",
"executions = []\n",
"\n",
"for e in client.list_executions(executions_request):\n",
" executions.append(\n",
" Execution(e.name, e.display_name)\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that you have the execution IDs, you can retrieve information about their upstream and downstream artifacts."
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Execution(id='projects/633510463570/locations/us-central1/metadataStores/default/executions/8365590470963419817', display_name='predict'),\n",
" Execution(id='projects/633510463570/locations/us-central1/metadataStores/default/executions/12409355193973730037', display_name='predict'),\n",
" Execution(id='projects/633510463570/locations/us-central1/metadataStores/default/executions/13277478563318258109', display_name='hhsearch'),\n",
" Execution(id='projects/633510463570/locations/us-central1/metadataStores/default/executions/18308270953372368416', display_name='jackhmmer-2'),\n",
" Execution(id='projects/633510463570/locations/us-central1/metadataStores/default/executions/5507708187653079700', display_name='hhblits'),\n",
" Execution(id='projects/633510463570/locations/us-central1/metadataStores/default/executions/9643879299856786334', display_name='hhblits-2'),\n",
" Execution(id='projects/633510463570/locations/us-central1/metadataStores/default/executions/16638260124073673449', display_name='jackhmmer')]"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"executions"
]
},
{
"cell_type": "code",
"execution_count": 107,
"metadata": {},
"outputs": [],
"source": [
"# Collect metrics from each execution into a single-row DataFrame\n",
"import pandas as pd\n",
"df = pd.DataFrame()\n",
"\n",
"for e in executions:\n",
" exec_input_output_request = vertex_ai.QueryExecutionInputsAndOutputsRequest(\n",
" execution=e.id\n",
" )\n",
" query = client.query_execution_inputs_and_outputs(request=exec_input_output_request)\n",
"\n",
" if e.display_name == 'predict':\n",
" for artifact in query.artifacts:\n",
" if artifact.display_name == 'raw_prediction':\n",
" df[query.executions[0].metadata['input:model_name']] = \\\n",
" [artifact.metadata['ranking_confidence']]\n",
"\n",
" elif e.display_name == 'hhsearch':\n",
" for artifact in query.artifacts:\n",
" if artifact.display_name == 'template_hits':\n",
" df[e.display_name + ': template hits'] = [artifact.metadata['num of hits']]\n",
"\n",
" else:\n",
" for artifact in query.artifacts:\n",
" if artifact.display_name == 'msa':\n",
" df[e.display_name + ': msa'] = [artifact.metadata['num of sequences']]"
]
},
{
"cell_type": "code",
"execution_count": 106,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>model_1</th>\n",
" <th>model_2</th>\n",
" <th>hhsearch: template hits</th>\n",
" <th>jackhmmer-2: msa</th>\n",
" <th>hhblits: msa</th>\n",
" <th>hhblits-2: msa</th>\n",
" <th>jackhmmer: msa</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>92.197389</td>\n",
" <td>92.324984</td>\n",
" <td>500.0</td>\n",
" <td>10000.0</td>\n",
" <td>2710.0</td>\n",
" <td>5304.0</td>\n",
" <td>10000.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" model_1 model_2 hhsearch: template hits jackhmmer-2: msa \\\n",
"0 92.197389 92.324984 500.0 10000.0 \n",
"\n",
" hhblits: msa hhblits-2: msa jackhmmer: msa \n",
"0 2710.0 5304.0 10000.0 "
]
},
"execution_count": 106,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
}
],
"metadata": {
"interpreter": {
"hash": "d4d1e4263499bec80672ea0156c357c1ee493ec2b1c70f0acce89fc37c4a6abe"
},
"kernelspec": {
"display_name": "Python 3.7.12 ('base')",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.12"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}