{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Quick Start: AlphaFold Inference with Vertex Pipelines\n",
"\n",
"This quick start notebook demonstrates how to configure and run the inference pipeline using a monomer protein."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Install and import required packages"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%pip install -U kfp google-cloud-aiplatform google-cloud-pipeline-components biopython google-cloud-filestore"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%load_ext autoreload\n",
"%autoreload 2"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from google.cloud import aiplatform as vertex_ai\n",
"from kfp.v2 import compiler\n",
"\n",
"from src.utils import compile_utils\n",
"from src.utils import fasta_utils"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Configure environment settings"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Change the values of the following parameters to reflect your environment.\n",
"\n",
"- `PROJECT_ID` - Project ID of your environment\n",
"- `ZONE` - GCP zone where your resources are located and will be deployed\n",
"- `BUCKET_NAME` - GCS bucket to use for Vertex AI staging. Must be in the same region as `ZONE`.\n",
"- `FILESTORE_ID` - Instance ID of your Filestore instance\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"PROJECT_ID = 'jk-af-final' # Change to your project ID\n",
"ZONE = 'us-central1-a' # Change to the zone where your Filestore instance is deployed (example: us-central1-c)\n",
"BUCKET_NAME = 'rl-aff-bucket' # Change to your bucket name\n",
"FILESTORE_ID = 'jk-aff-fs' # Change to your Filestore ID"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"REGION = '-'.join(ZONE.split(sep='-')[:-1])\n",
"FILESTORE_IP, FILESTORE_NETWORK = compile_utils.get_filestore_info(\n",
" project_id=PROJECT_ID, instance_id=FILESTORE_ID, location=ZONE)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you set up the sandbox environment using the provided Terraform configuration, you do not need to change the settings below. Otherwise, make sure that they are consistent with your environment.\n",
"\n",
"- `FILESTORE_SHARE` - Filestore share with AlphaFold reference databases\n",
"- `FILESTORE_MOUNT_PATH` - Mount path for Filestore fileshare\n",
"- `MODEL_PARAMS` - GCS location of AlphaFold model parameters. The pipelines are configured to retrieve the parameters from the `<MODEL_PARAMS>/params` folder.\n",
"- `IMAGE_URI` - URI of the Docker image that packages the custom KFP components\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"FILESTORE_SHARE = '/datasets'\n",
"FILESTORE_MOUNT_PATH = '/mnt/nfs/alphafold'\n",
"MODEL_PARAMS = f'gs://{BUCKET_NAME}'\n",
"IMAGE_URI = f'gcr.io/{PROJECT_ID}/alphafold-components'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Configure and run the Universal pipeline with monomer presets"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are two types of parameters that can be used to customize Vertex Pipelines: compile time and runtime. The compile time parameters must be set before compiling the pipeline code. The runtime parameters can be supplied when starting a pipeline run.\n",
"\n",
"In the AlphaFold inference pipelines, the compile time parameters are used to control settings like CPU/GPU configuration of compute nodes and the Filestore instance settings. The runtime parameters include a sequence to fold, model presets, the maximum date for template searches and more. \n",
"\n",
"The pipelines have been designed to retrieve compile time parameters from environment variables. This makes it easy to integrate a pipeline compilation step with CI/CD systems.\n",
"\n",
"By default, the pipeline uses a `c2-standard-16` node to run the feature engineering step and `n1-standard-8` nodes with NVIDIA T4 GPUs to run prediction and relaxation. For now, you will use the default settings. This hardware configuration is optimal for folding smaller proteins, roughly 1000 residues or fewer. \n",
"\n",
"In other notebooks we will demonstrate how to change the default hardware configuration."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Set compile time parameters\n",
"\n",
"At a minimum, you have to configure:\n",
"- the settings of the Filestore instance that hosts the genetic databases, \n",
"- the URI of the Docker image that packages the custom KFP components, and \n",
"- the GCS location of the AlphaFold parameters."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"os.environ['ALPHAFOLD_COMPONENTS_IMAGE'] = IMAGE_URI\n",
"os.environ['NFS_SERVER'] = FILESTORE_IP\n",
"os.environ['NFS_PATH'] = FILESTORE_SHARE\n",
"os.environ['NETWORK'] = FILESTORE_NETWORK\n",
"os.environ['MODEL_PARAMS_GCS_LOCATION'] = MODEL_PARAMS"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you are working with larger proteins that demand GPUs with more memory or processing power, you can override the default settings and reconfigure the pipeline to use nodes with, for example, NVIDIA A100 GPUs for prediction and relaxation.\n",
"\n",
"If that's the case, run the following cell and adjust the values as needed. Skip it to keep the default T4 GPU configuration.\n",
"\n",
"To review how to properly set these parameters, refer to the following documentation: \n",
"https://cloud.google.com/vertex-ai/docs/pipelines/machine-types"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Instance (VM) configuration to run model prediction\n",
"os.environ['MEMORY_LIMIT'] = '170' # Amount of host memory (RAM)\n",
"os.environ['CPU_LIMIT'] = '24' # Number of CPUs\n",
"os.environ['GPU_LIMIT'] = '2' # Number of GPUs\n",
"os.environ['GPU_TYPE'] = 'nvidia-tesla-a100' # GPU type\n",
"\n",
"# Instance (VM) configuration to run protein relaxation\n",
"os.environ['RELAX_MEMORY_LIMIT'] = '170' # Amount of host memory (RAM)\n",
"os.environ['RELAX_CPU_LIMIT'] = '24' # Number of CPUs\n",
"os.environ['RELAX_GPU_LIMIT'] = '2' # Number of GPUs\n",
"os.environ['RELAX_GPU_TYPE'] = 'nvidia-tesla-a100' # GPU type"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Compile the pipeline"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from src.pipelines.alphafold_inference_pipeline import alphafold_inference_pipeline as pipeline\n",
"\n",
"pipeline_name = 'universal-pipeline'\n",
"compiler.Compiler().compile(\n",
" pipeline_func=pipeline,\n",
" package_path=f'{pipeline_name}.json')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Configure runtime parameters\n",
"\n",
"At a minimum, you need to configure the GCS location of your sequence, the maximum date for template searches, and the project and region in which to run the pipeline. With the default settings, the pipeline will run monomer inference using the small version of BFD.\n",
"\n",
"**Note about multimer sequences**: When processing multimer sequences, the `num_multimer_predictions_per_model` parameter controls how many predictions are run for each model. The default value has been set to 5, which is the same as in the [run_alphafold.py](https://github.com/deepmind/alphafold/blob/main/run_alphafold.py) script.\n",
"\n",
"#### Copy the sample sequence to a GCS location\n",
"\n",
"You can find a few sample sequences in the `sequences` folder."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sequence = 'BRCA2.reference.fa'\n",
"\n",
"is_monomer, sequences = fasta_utils.validate_fasta_file(\n",
" os.path.join('sequences', sequence))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"gcs_sequence_path = f'gs://{BUCKET_NAME}/fasta/{sequence}'\n",
"! gsutil cp sequences/{sequence} {gcs_sequence_path}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Define AlphaFold inference parameters"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"max_template_date = '2020-05-14'\n",
"use_small_bfd = False # True uses only a reduced portion of the BFD database\n",
"num_multimer_predictions_per_model = 5 # Number of predictions per model for the multimer model preset\n",
"is_run_relax = 'relax' # Whether to run the relaxation step. To skip relaxation, pass an empty string ''."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"params = {\n",
" 'sequence_path': gcs_sequence_path,\n",
" 'max_template_date': max_template_date,\n",
" 'model_preset': 'monomer' if is_monomer else 'multimer',\n",
" 'project': PROJECT_ID,\n",
" 'region': REGION,\n",
" 'use_small_bfd': use_small_bfd,\n",
" 'num_multimer_predictions_per_model': num_multimer_predictions_per_model,\n",
" 'is_run_relax': is_run_relax\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"params"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Submit a pipeline run"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We recommend annotating pipeline runs with at least two labels. The first label groups multiple pipeline runs into a single experiment. The second label identifies a given run within the experiment. Labels make it easier to discover and analyze pipeline runs in large-scale settings. The third notebook, which demonstrates how to analyze pipeline runs, depends on these labels. \n",
"\n",
"You can monitor the run using the link printed when you submit the pipeline job."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"vertex_ai.init(\n",
" project=PROJECT_ID,\n",
" location=REGION,\n",
" staging_bucket=f'gs://{BUCKET_NAME}/staging'\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"experiment_id = 'BRCA2-experiment'\n",
"labels = {'experiment_id': experiment_id.lower(), 'sequence_id': sequence.split(sep='.')[0].lower()}\n",
"\n",
"pipeline_job = vertex_ai.PipelineJob(\n",
" display_name=pipeline_name,\n",
" template_path=f'{pipeline_name}.json',\n",
" pipeline_root=f'gs://{BUCKET_NAME}/pipeline_runs/{pipeline_name}',\n",
" parameter_values=params,\n",
" enable_caching=True,\n",
" labels=labels\n",
")\n",
"\n",
"pipeline_job.run(sync=False)\n",
"pipeline_job.wait_for_resource_creation()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Check the state of the pipeline\n",
"pipeline_job.state"
]
}
],
"metadata": {
"environment": {
"kernel": "python3",
"name": "common-cpu.m92",
"type": "gcloud",
"uri": "gcr.io/deeplearning-platform-release/base-cpu:m92"
},
"kernelspec": {
"display_name": "Python 3.8.13 ('alphafold')",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.13"
},
"vscode": {
"interpreter": {
"hash": "5d4ba5cd9e03f1df519c5604fda4f32e7a6119f61e07e901bf30b393ad1ef277"
}
}
},
"nbformat": 4,
"nbformat_minor": 4
}