@BryanFauble
Last active February 9, 2026 23:27
Curator MVP Setup via Python

⚠️ ⚠️ ⚠️

This setup guide has been deprecated in favor of https://python-docs.synapse.org/en/stable/guides/extensions/curator/metadata_curation/ - it is recommended that you use the curator extension instead of following this guide.

⚠️ ⚠️ ⚠️

Synapse Curator MVP Setup Guide

This guide helps you set up and run programmatic curation tools for managing metadata in Synapse. These tools support two different metadata workflows depending on your data organization needs.

Overview: Choose Your Metadata Workflow

File-based Metadata

When to use: Metadata describes individual data files and is stored as annotations directly on each file.

Example: Each sequencing data file has annotations like sample type, sequencing method, and sample identifier attached to the file itself.

Record-based Metadata

When to use: Metadata is normalized in structured records to eliminate duplication and ensure consistency.

Example: Sample information is stored once in a CSV record, and multiple data files reference that sample by its identifier instead of duplicating the sample metadata. A project might have multiple record-based metadata types that would each receive separate treatment.
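
The difference between the two workflows can be pictured with plain Python data structures. The sketch below is illustrative only: the file names, the specimenID key, and the field values are made up, and nothing here calls Synapse.

```python
# File-based: every file carries its own copy of the sample metadata.
file_based = {
    "sample1_R1.fastq": {"specimenID": "S1", "sampleType": "tissue", "assay": "rnaSeq"},
    "sample1_R2.fastq": {"specimenID": "S1", "sampleType": "tissue", "assay": "rnaSeq"},
}

# Record-based: the sample is described once; files only hold a reference to it.
sample_records = {"S1": {"sampleType": "tissue", "assay": "rnaSeq"}}
file_refs = {"sample1_R1.fastq": "S1", "sample1_R2.fastq": "S1"}


def resolve_metadata(file_name: str) -> dict:
    """Look up the full metadata for a file through its sample reference."""
    specimen_id = file_refs[file_name]
    return {"specimenID": specimen_id, **sample_records[specimen_id]}
```

In the record-based layout, correcting the sample type means editing one record rather than every file that belongs to the sample.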


Prerequisites

For All Workflows

  • Python 3.9 or later
  • Synapse account with appropriate permissions
  • Synapse authentication configured (login credentials)

Installation Requirements

pip install --upgrade synapseclient[pandas]

Step 1: Find and Select Your JSON Schema

Option A:

For JSON Schemas that were generated and stored in Synapse outside of the DPE team, you will need to find the correct schema URI based on the values that were used to register the schema.

Option B: Browse manually

Many registered JSON Schemas are available at: https://synapse.org/Synapse:syn69735275/tables/

  1. Browse the schemas at the table URL and find the one that matches your data type
  2. Note the schema URI (e.g., sage.schemas.v2571-ad.IndividualAnimalMetadataTemplate.schema-0.1.0)

Option C: Use the schema query script

Use the query_schema_registry.py script to programmatically find schemas by DCC and datatype:

python query_schema_registry.py --dcc ad --datatype IndividualAnimalMetadataTemplate

This will output something like:

Found 2 matching schema(s):
  1. URI: sage.schemas.v2571-ad.IndividualAnimalMetadataTemplate.schema-0.1.0
     Version: 0.1.0
     DCC: ad
     DataType: IndividualAnimalMetadataTemplate

  2. URI: sage.schemas.v2571-ad.IndividualAnimalMetadataTemplate.schema-0.0.0
     Version: 0.0.0
     DCC: ad
     DataType: IndividualAnimalMetadataTemplate

Latest schema URI: sage.schemas.v2571-ad.IndividualAnimalMetadataTemplate.schema-0.1.0

Use this URI in your scripts: sage.schemas.v2571-ad.IndividualAnimalMetadataTemplate.schema-0.1.0
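
To make the structure of these URIs concrete, here is a minimal sketch that splits a URI into its parts and picks the highest version. It assumes the URI has the form organization-schemaName-version with hyphens as separators (the record-based script in this gist splits on hyphens the same way) and that versions are dot-separated integers. The helper names parse_schema_uri and latest_schema_uri are hypothetical and are not part of query_schema_registry.py.

```python
from typing import NamedTuple


class SchemaUri(NamedTuple):
    organization: str
    name: str
    version: tuple[int, ...]


def parse_schema_uri(uri: str) -> SchemaUri:
    """Split '<organization>-<schema name>-<version>' into its parts."""
    parts = uri.split("-")
    if len(parts) != 3:
        raise ValueError(f"Unexpected schema URI format: {uri}")
    organization, name, version = parts
    return SchemaUri(organization, name, tuple(int(p) for p in version.split(".")))


def latest_schema_uri(uris: list[str]) -> str:
    """Return the URI with the highest numeric version."""
    return max(uris, key=lambda u: parse_schema_uri(u).version)
```

Comparing versions as integer tuples avoids the string-ordering pitfall where "0.10.0" would sort before "0.9.0".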

Alternative script options:

  • Edit the script constants and run without arguments: python query_schema_registry.py

Alternative: Edit script constants

Update these variables in query_schema_registry.py:

DCC = ""  # Data Coordination Center (e.g., 'ad', 'amp', 'mc2')
DATATYPE = ""  # Data type name from schema (must be unique; may include prefixes/suffixes like `IndividualAnimalMetadataTemplate.studyXYZ`, `AssayRNAseqTemplate.studyXYZ`)

Step 2: Bind Schema to Your Synapse Folder or RecordSet

Before creating curation tasks, you must bind a JSON schema to your Synapse folder or recordset. The script supports binding to either entity type, but only one can be specified at a time.

Option A: Use the dedicated script for folders

python bind_json_schema.py --folder-id syn12345678 --uri sage.schemas.full.schema.uri

Option B: Use the dedicated script for recordsets

python bind_json_schema.py --recordset-id syn87654321 --uri sage.schemas.full.schema.uri

Option C: Edit the script constants and run

Open bind_json_schema.py and update these variables:

URI = "" # Your Schema URI
FOLDER_ID = ""  # Your Synapse folder ID (leave empty if using recordset)
RECORDSET_ID = ""  # Your Synapse recordset ID (leave empty if using folder)

Then run: python bind_json_schema.py

Note: The script enforces mutual exclusion - you can only bind to either a folder OR a recordset, not both simultaneously.

Step 3A: File-based Metadata Workflow

Use this when you want metadata stored as annotations directly on data files.

Run the File-based Metadata Script

python create_file_based_metadata_task.py --folder-id syn12345678 --datatype DATATYPE_NAME --instructions "Your curation instructions"

Required parameters:

  • --folder-id: Your Synapse folder ID
  • --datatype: Data type name for the CurationTask (must be unique)
  • --instructions: Instructions for the curation task

Optional parameters:

  • --no-wiki: Skip creating a wiki page for the file view

Examples:

# Basic usage with required datatype
python create_file_based_metadata_task.py --folder-id syn12345678 --datatype IndividualAnimalMetadataTemplate.studyXYZ --instructions "Please curate the metadata for this dataset according to the schema requirements"

# Skip wiki creation
python create_file_based_metadata_task.py --folder-id syn12345678 --datatype IndividualAnimalMetadataTemplate.studyXYZ --instructions "Custom curation instructions" --no-wiki

What this creates:

  • EntityView: Shows all files in your folder with their metadata
  • CurationTask: Manages the curation process
  • Wiki page: Documents the file view (unless --no-wiki is used)

Alternative: Edit script constants

Update these variables in create_file_based_metadata_task.py:

FOLDER_ID = "syn12345678"  # Your folder ID
DATATYPE = "YourDatatype.studyXYZ"  # Data type name (required)
INSTRUCTIONS = "Your custom curation instructions"  # Instructions (required)
ATTACH_WIKI = True        # Set to False to skip wiki creation
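
How the --no-wiki flag and the ATTACH_WIKI constant combine can be surprising, so here is a small sketch of the precedence, assuming the same rule the script's main() applies: the flag always wins, then the constant, then a default of True. The function resolve_attach_wiki and the use of dest="attach_wiki" are choices made for this example, not the script's own code.

```python
import argparse
from typing import Optional


def resolve_attach_wiki(
    argv: list[str], attach_wiki_constant: Optional[bool] = None
) -> bool:
    """Combine the --no-wiki flag with an ATTACH_WIKI-style constant."""
    parser = argparse.ArgumentParser()
    # store_false: the attribute defaults to True and becomes False when the flag is given
    parser.add_argument("--no-wiki", dest="attach_wiki", action="store_false")
    args = parser.parse_args(argv)
    if not args.attach_wiki:
        return False  # the flag on the command line always wins
    if attach_wiki_constant is not None:
        return attach_wiki_constant  # fall back to the script constant
    return True  # default: attach the wiki
```

Note that with this rule the command line can only turn the wiki off; turning it back on when the constant is False requires editing the constant.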

Step 3B: Record-based Metadata Workflow

Use this when you want structured metadata records that files can reference.

Run the Record-based Metadata Script

python create_record_based_metadata_task.py \
    --project_id syn1234 \
    --folder_id syn12345678 \
    --dcc DCC_VALUE \
    --datatype DATATYPE_NAME \
    --schema_uri SCHEMA_URI \
    --upsert_keys ONE_OR_MORE_KEYS \
    --instructions "Your curation instructions"

Required parameters:

  • --project_id: Your Synapse project ID
  • --folder_id: Your Synapse folder ID
  • --dcc: Data Coordination Center (e.g., ad, amp, mc2)
  • --datatype: Data type name from your schema (must be unique; can include prefixes/suffixes like IndividualAnimalMetadataTemplate.studyXYZ, AssayRNAseqTemplate.studyXYZ)
  • --upsert_keys: One or more column names used as unique identifiers (e.g., specimenID, or specimenID participantID sampleDate)
  • --instructions: Instructions for the curation task

Choose one of these parameters:

  • --schema_uri: Schema URI (e.g., sage.schemas.v2571-ad.IndividualAnimalMetadataTemplate.schema-0.1.0, sage.schemas.v2571-el.AssayRNAseqTemplate.schema-0.0.1)
  • --schema_path: Absolute local path to schema file (alternative to URI)
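
As a rough picture of what upsert keys mean, the sketch below merges incoming rows into existing records: rows whose key columns match an existing record update it, all others are appended. This is the general upsert idea only. How Synapse applies --upsert_keys internally is not described in this guide, and the upsert function and sample rows are hypothetical.

```python
def upsert(records: list[dict], incoming: list[dict], keys: list[str]) -> list[dict]:
    """Update records whose key values match an incoming row; append the rest."""
    # Index existing records by the tuple of their key-column values
    index = {tuple(r[k] for k in keys): dict(r) for r in records}
    for row in incoming:
        key = tuple(row[k] for k in keys)
        # Merge onto the existing record if present, otherwise start a new one
        index[key] = {**index.get(key, {}), **row}
    return list(index.values())
```

With several keys, for example specimenID and participantID together, a row only replaces a record when every key column matches.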

What this creates:

  • RecordSet: CSV-based structured metadata storage
  • CurationTask: Manages the curation process
  • Grid: Interactive view for editing metadata records. In this script it is only used to export its results back to the RecordSet in order to "bootstrap" it
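
The RecordSet starts out as a header-only CSV whose columns come from the schema's properties. A minimal standard-library sketch of that step is shown here. The script in this gist does the equivalent with pandas, and template_csv_from_schema is a hypothetical name used only for this example.

```python
import csv
import io


def template_csv_from_schema(schema: dict) -> str:
    """Return a header-only CSV whose columns are the schema's property names."""
    columns = list(schema.get("properties", {}))
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerow(columns)
    return buffer.getvalue()
```

A schema with the properties specimenID and age would therefore yield a single header line with those two column names and no data rows.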

Note: After creating the RecordSet, you can optionally bind a schema directly to it using:

python bind_json_schema.py --recordset-id <RECORDSET_ID> --uri <SCHEMA_URI>

Alternative: Edit script constants

Update these variables in create_record_based_metadata_task.py:

PROJECT_ID = ""
FOLDER_ID = ""
DCC = "" 
DATATYPE = ""  # Must be unique; can include prefixes/suffixes for uniqueness
SCHEMA_URI = ""
UPSERT_KEYS = []
INSTRUCTIONS = ""  # Instructions (required)

Validation and Troubleshooting

Validate Your Schema Binding

validate_json_schema.py

This script validates the files within a particular folder against your schema and returns a summary of valid and invalid entities.

Update the URI and FOLDER_ID constants in the script before running it.
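
To illustrate what a valid/invalid summary looks like, here is a deliberately simplified stand-in that checks only the required and enum keywords of a schema against annotation dictionaries. It is not how validate_json_schema.py works (that script validates through Synapse), and the function names and Synapse IDs are hypothetical.

```python
def validate_annotations(annotations: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the entity is valid."""
    problems = []
    properties = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in annotations:
            problems.append(f"missing required property '{name}'")
    for name, value in annotations.items():
        allowed = properties.get(name, {}).get("enum")
        if allowed is not None and value not in allowed:
            problems.append(f"'{name}' has value {value!r}, expected one of {allowed}")
    return problems


def summarize(entities: dict, schema: dict) -> dict:
    """Group entity IDs into valid and invalid according to the checks above."""
    summary = {"valid": [], "invalid": []}
    for entity_id, annotations in entities.items():
        bucket = "invalid" if validate_annotations(annotations, schema) else "valid"
        summary[bucket].append(entity_id)
    return summary
```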

List Existing Curation Tasks

list_curation_task.py

Update the PROJECT_ID in the script to match your project.


Next Steps

After running these scripts successfully:

  1. File-based workflow: Upload data files to your folder. The EntityView will automatically show them with their metadata annotations.

  2. Record-based workflow: Use the Grid interface in Synapse to edit your metadata records, then upload data files that reference these records.

  3. Monitor curation progress through the CurationTask interface in Synapse. (UI Elements are still being developed)

For additional features and web UI integration, refer to the full Synapse documentation (To be created).

import argparse

from synapseclient import Synapse
from synapseclient.models import Folder, RecordSet

# The URI of the JSON Schema you want to bind
URI = ""
# The Synapse ID of the folder you want to bind the JSON Schema to
FOLDER_ID = ""
# The Synapse ID of the recordset you want to bind the JSON Schema to
RECORDSET_ID = ""


def main():
    """Main function for command-line usage."""
    parser = argparse.ArgumentParser(
        description="Bind JSON schema to a Synapse folder or recordset"
    )
    # A schema is bound to exactly one entity per run, so the two ID flags are mutually exclusive
    target = parser.add_mutually_exclusive_group()
    target.add_argument(
        '--folder-id',
        type=str,
        help='Synapse folder ID'
    )
    target.add_argument(
        '--recordset-id',
        type=str,
        help='Synapse recordset ID'
    )
    parser.add_argument(
        '--uri',
        type=str,
        help='The URI of the JSON Schema to bind'
    )
    args = parser.parse_args()

    # Determine which entity type and ID to use; CLI arguments take precedence over constants
    if args.folder_id is not None:
        entity_id = args.folder_id
        entity_type = "folder"
    elif args.recordset_id is not None:
        entity_id = args.recordset_id
        entity_type = "recordset"
    elif FOLDER_ID and RECORDSET_ID:
        raise ValueError("Set only one of FOLDER_ID or RECORDSET_ID in the script, not both")
    elif FOLDER_ID:
        entity_id = FOLDER_ID
        entity_type = "folder"
    elif RECORDSET_ID:
        entity_id = RECORDSET_ID
        entity_type = "recordset"
    else:
        raise ValueError(
            "Either folder-id or recordset-id must be provided via CLI or set as global constants in script"
        )

    # Fail early instead of attempting to bind an empty URI
    uri = args.uri if args.uri is not None else URI
    if not uri:
        raise ValueError("uri must be provided via CLI argument --uri or set in global variable URI")

    syn = Synapse()
    syn.login()

    # Bind schema to the appropriate entity type
    if entity_type == "folder":
        entity = Folder(id=entity_id).get()
        entity.bind_schema(json_schema_uri=uri)
        print(f"Bound JSON schema {uri} to folder {entity_id}")
    elif entity_type == "recordset":
        entity = RecordSet(id=entity_id, download_file=False).get()
        entity.bind_schema(json_schema_uri=uri)
        print(f"Bound JSON schema {uri} to recordset {entity_id}")


if __name__ == "__main__":
    main()
"""
Create a file view and CurationTask for schema-bound folders following the file-based metadata workflow.

Pre-Requisites:
    Requires conflicting versions of schematicpy and synapseclient.
    Install schematicpy dependencies first, then uninstall synapseclient and reinstall with
    pip install git+https://github.com/Sage-Bionetworks/synapsePythonClient.git@synpy-1653-metadata-tasks-and-recordsets

Usage:
    python create_file_view.py --folder-id syn12345678 --datatype MyDatatype.studyXYZ
    python create_file_view.py --folder-id syn12345678 --datatype MyDatatype.studyXYZ \\
        --instructions "Custom curation instructions"
    python create_file_view.py --folder-id syn12345678 --datatype MyDatatype.studyXYZ --no-wiki

Users can also set arguments using the global variables below,
but CLI arguments are used first.
"""
import argparse
import warnings
from typing import Any, Optional

from synapseclient import Synapse  # type: ignore
from synapseclient import Wiki  # type: ignore
from synapseclient.core.exceptions import SynapseHTTPError  # type: ignore
from synapseclient.models import (  # type: ignore
    Column,
    ColumnType,
    EntityView,
    Folder,
    ViewTypeMask,
)
from synapseclient.models.curation import CurationTask, FileBasedMetadataTaskProperties
from synapseclient.services.json_schema import JsonSchemaVersion

FOLDER_ID = ""  # The Synapse ID of the entity you want to create the file view and CurationTask for
ATTACH_WIKI = None  # Whether or not to attach the file view to the folder wiki. True or False
DATATYPE = ""  # Data type name for the CurationTask (required)
# Instructions for the curation task (required)
INSTRUCTIONS = ""

TYPE_DICT = {
    "string": ColumnType.STRING,
    "number": ColumnType.DOUBLE,
    "integer": ColumnType.INTEGER,
    "boolean": ColumnType.BOOLEAN,
}
LIST_TYPE_DICT = {
    "string": ColumnType.STRING_LIST,
    "integer": ColumnType.INTEGER_LIST,
    "boolean": ColumnType.BOOLEAN_LIST,
}
def create_json_schema_entity_view(
    syn: Synapse,
    synapse_entity_id: str,
    entity_view_name: str = "JSON Schema view",
) -> str:
    """
    Creates a Synapse entity view based on a JSON Schema that is bound to a Synapse entity.
    This functionality is needed only temporarily. See note at top of module.

    Args:
        syn: A Synapse object that has been logged in
        synapse_entity_id: The ID of the Synapse entity that the JSON Schema is bound to
        entity_view_name: The name the created entity view will have

    Returns:
        The Synapse ID of the created entity view
    """
    warnings.warn(
        "This function is a prototype, and could change or be removed at any point."
    )
    js_service = syn.service("json_schema")
    json_schema = js_service.get_json_schema(synapse_entity_id)
    org = js_service.JsonSchemaOrganization(
        json_schema["jsonSchemaVersionInfo"]["organizationName"]
    )
    schema_version = JsonSchemaVersion.from_response(
        org,
        json_schema["jsonSchemaVersionInfo"],
    )
    columns = _create_columns_from_json_schema(schema_version.body)
    view = EntityView(
        name=entity_view_name,
        parent_id=synapse_entity_id,
        scope_ids=[synapse_entity_id],
        view_type_mask=ViewTypeMask.FILE,
        columns=columns,
    ).store(synapse_client=syn)
    # Each column is moved to index 0 in turn, so the final order is id, name, createdBy
    view.reorder_column(name="createdBy", index=0)
    view.reorder_column(name="name", index=0)
    view.reorder_column(name="id", index=0)
    view.store(synapse_client=syn)
    return view.id
def create_or_update_wiki_with_entity_view(
    syn: Synapse,
    entity_view_id: str,
    owner_id: str,
    title: Optional[str] = None,
) -> Wiki:
    """
    Creates a Wiki for an entity, or updates it if one already exists.
    An EntityView query is added to the wiki markdown.
    This functionality is needed only temporarily. See note at top of module.

    Args:
        syn: A Synapse object that has been logged in
        entity_view_id: The Synapse ID of the EntityView for the query
        owner_id: The ID of the entity in Synapse whose wiki will be created/updated
        title: The (new) title of the wiki to be created/updated

    Returns:
        The created or updated Wiki object
    """
    warnings.warn(
        "This function is a prototype, and could change or be removed at any point."
    )
    entity = syn.get(owner_id)
    try:
        wiki = syn.getWiki(entity)
    except SynapseHTTPError:
        # No wiki exists yet for this entity
        wiki = None
    if wiki:
        return update_wiki_with_entity_view(syn, entity_view_id, owner_id, title)
    return create_entity_view_wiki(syn, entity_view_id, owner_id, title)


def create_entity_view_wiki(
    syn: Synapse,
    entity_view_id: str,
    owner_id: str,
    title: Optional[str] = None,
) -> Wiki:
    """
    Creates a wiki with a query of an entity view.
    This functionality is needed only temporarily. See note at top of module.

    Args:
        syn: A Synapse object that has been logged in
        entity_view_id: The Synapse ID of the entity view to make the wiki for
        owner_id: The ID of the entity in Synapse to put as owner of the wiki
        title: The title of the wiki to be created

    Returns:
        The created wiki object
    """
    warnings.warn(
        "This function is a prototype, and could change or be removed at any point."
    )
    content = (
        "${synapsetable?query=select %2A from "
        f"{entity_view_id}"
        "&showquery=false&tableonly=false}"
    )
    if title is None:
        title = "Entity View"
    wiki = Wiki(title=title, owner=owner_id, markdown=content)
    wiki = syn.store(wiki)
    return wiki


def update_wiki_with_entity_view(
    syn: Synapse, entity_view_id: str, owner_id: str, title: Optional[str] = None
) -> Wiki:
    """
    Updates a wiki to include a query of an entity view.
    This functionality is needed only temporarily. See note at top of module.

    Args:
        syn: A Synapse object that has been logged in
        entity_view_id: The Synapse ID of the entity view to make the query for
        owner_id: The ID of the entity in Synapse that owns the wiki
        title: The title of the wiki to be updated

    Returns:
        The updated wiki object
    """
    warnings.warn(
        "This function is a prototype, and could change or be removed at any point."
    )
    entity = syn.get(owner_id)
    wiki = syn.getWiki(entity)
    new_content = (
        "\n"
        "${synapsetable?query=select %2A from "
        f"{entity_view_id}"
        "&showquery=false&tableonly=false}"
    )
    # Append the query to the existing markdown rather than replacing it
    wiki.markdown = wiki.markdown + new_content
    if title:
        wiki.title = title
    syn.store(wiki)
    return wiki
def _create_columns_from_json_schema(json_schema: dict[str, Any]) -> list[Column]:
    """Creates a list of Synapse Columns based on the JSON Schema type

    Arguments:
        json_schema: The JSON Schema in dict form

    Raises:
        ValueError: If the JSON Schema has no properties
        ValueError: If the JSON Schema properties is not a dict

    Returns:
        A list of Synapse columns based on the JSON Schema
    """
    properties = json_schema.get("properties")
    if properties is None:
        raise ValueError("The JSON Schema is missing a 'properties' field.")
    if not isinstance(properties, dict):
        raise ValueError(
            "The 'properties' field in the JSON Schema must be a dictionary."
        )
    columns = []
    for name, prop_schema in properties.items():
        column_type = _get_column_type_from_js_property(prop_schema)
        maximum_size = None
        # Compare against the enum member rather than a bare string
        if column_type == ColumnType.STRING:
            maximum_size = 100
        if column_type in LIST_TYPE_DICT.values():
            maximum_size = 5
        column = Column(
            name=name,
            column_type=column_type,
            maximum_size=maximum_size,
            default_value=None,
        )
        columns.append(column)
    return columns


def _get_column_type_from_js_property(js_property: dict[str, Any]) -> ColumnType:
    """
    Gets the Synapse column type from a JSON Schema property.
    The JSON Schema should be valid but that should not be assumed.
    If the type cannot be determined, ColumnType.STRING will be returned.

    Args:
        js_property: A JSON Schema property in dict form.

    Returns:
        A Synapse ColumnType based on the JSON Schema type
    """
    # Enums are always strings in Synapse tables
    if "enum" in js_property:
        return ColumnType.STRING
    if "type" in js_property:
        if js_property["type"] == "array":
            return _get_list_column_type_from_js_property(js_property)
        return TYPE_DICT.get(js_property["type"], ColumnType.STRING)
    # A oneOf list usually indicates that the type could be one or more different things
    if "oneOf" in js_property and isinstance(js_property["oneOf"], list):
        return _get_column_type_from_js_one_of_list(js_property["oneOf"])
    return ColumnType.STRING


def _get_column_type_from_js_one_of_list(js_one_of_list: list[Any]) -> ColumnType:
    """
    Gets the Synapse column type from a JSON Schema oneOf list.
    Items in the oneOf list should be dicts, but that should not be assumed.

    Args:
        js_one_of_list: A list of items to check for type

    Returns:
        A Synapse ColumnType based on the JSON Schema type
    """
    # Items in a oneOf list should be dicts
    items = [item for item in js_one_of_list if isinstance(item, dict)]
    # Enums are always strings in Synapse tables
    if [item for item in items if "enum" in item]:
        return ColumnType.STRING
    # For Synapse ColumnType we can ignore null types in JSON Schemas
    type_items = [item for item in items if "type" in item and item["type"] != "null"]
    if len(type_items) == 1:
        type_item = type_items[0]
        if type_item["type"] == "array":
            return _get_list_column_type_from_js_property(type_item)
        return TYPE_DICT.get(type_item["type"], ColumnType.STRING)
    return ColumnType.STRING


def _get_list_column_type_from_js_property(js_property: dict[str, Any]) -> ColumnType:
    """
    Gets the Synapse column type from a JSON Schema array property

    Args:
        js_property: A JSON Schema property in dict form.

    Returns:
        A Synapse ColumnType based on the JSON Schema type
    """
    if "items" in js_property and isinstance(js_property["items"], dict):
        # Enums are always strings in Synapse tables
        if "enum" in js_property["items"]:
            return ColumnType.STRING_LIST
        if "type" in js_property["items"]:
            return LIST_TYPE_DICT.get(
                js_property["items"]["type"], ColumnType.STRING_LIST
            )
    return ColumnType.STRING_LIST
def create_file_view(
    folder_id: str,
    attach_wiki: bool,
    datatype: str,
    instructions: str
) -> tuple[str, str]:
    """
    Create a file view for a schema-bound folder using schematic.

    Args:
        folder_id: The Synapse Folder ID to create the file view for
        attach_wiki (bool): Whether or not to attach a Synapse Wiki
        datatype (str): Data type name for the CurationTask (required)
        instructions (str): Instructions for the curation task (required)

    Returns:
        A tuple:
            The first item is the Synapse ID of the entity view created
            The second item is the task ID of the curation task created
    """
    syn = Synapse()
    syn.login()

    syn.logger.info("Attempting to create entity view.")
    try:
        entity_view_id = create_json_schema_entity_view(
            syn=syn,
            synapse_entity_id=folder_id
        )
    except Exception as e:
        msg = f"Error creating entity view: {str(e)}"
        syn.logger.error(msg)
        raise e
    syn.logger.info("Created entity view.")

    if attach_wiki:
        syn.logger.info("Attempting to attach wiki.")
        try:
            create_or_update_wiki_with_entity_view(
                syn=syn,
                entity_view_id=entity_view_id,
                owner_id=folder_id
            )
        except Exception as e:
            msg = f"Error creating wiki: {str(e)}"
            syn.logger.error(msg)
            raise e
        syn.logger.info("Wiki attached.")

    # Validate that the folder has an attached JSON schema
    # The datatype parameter is now required and used directly for the CurationTask.
    js = syn.service("json_schema")
    syn.logger.info("Attempting to get the attached schema.")
    try:
        js.get_json_schema_from_entity(folder_id)
    except Exception as e:
        msg = "Error getting the attached schema."
        syn.logger.exception(msg)
        raise e
    syn.logger.info("Schema retrieval successful")

    # Use the provided datatype (required parameter)
    task_datatype = datatype

    syn.logger.info("Attempting to get the Synapse ID of the provided folder's project.")
    try:
        entity = Folder(folder_id).get(synapse_client=syn)
        parent = syn.get(entity.parent_id)
        project = None
        # Walk up the hierarchy until the enclosing project is reached
        while not project:
            if parent.concreteType == "org.sagebionetworks.repo.model.Project":
                project = parent
                break
            parent = syn.get(parent.parentId)
    except Exception as e:
        msg = "Error getting the Synapse ID of the provided folder's project"
        syn.logger.exception(msg)
        raise e
    syn.logger.info("Got the Synapse ID of the provided folder's project.")

    syn.logger.info("Attempting to create the CurationTask.")
    try:
        task = CurationTask(
            data_type=task_datatype,
            project_id=project.id,
            instructions=instructions,
            task_properties=FileBasedMetadataTaskProperties(
                upload_folder_id=folder_id,
                file_view_id=entity_view_id,
            )
        ).store(synapse_client=syn)
    except Exception as e:
        msg = f"Error creating the CurationTask: {str(e)}"
        syn.logger.error(msg)
        raise e
    syn.logger.info("Created the CurationTask.")
    return (entity_view_id, task.task_id)
def main():
    """Main function for command-line usage."""
    parser = argparse.ArgumentParser(
        description="Create file views for schema-bound folders"
    )
    parser.add_argument(
        '--folder-id',
        type=str,
        # required=True,
        help='Synapse folder ID'
    )
    parser.add_argument(
        '--datatype',
        type=str,
        help='Data type name for the CurationTask (required)'
    )
    parser.add_argument(
        '--instructions',
        type=str,
        help='Instructions for the curation task (required)'
    )
    # store_false: args.no_wiki is True by default and becomes False when --no-wiki is passed
    parser.add_argument(
        '--no-wiki',
        action='store_false',
        help='Do not attach view to folder wiki'
    )
    args = parser.parse_args()

    if args.folder_id is not None:
        folder_id = args.folder_id
    elif FOLDER_ID:
        folder_id = FOLDER_ID
    else:
        raise ValueError("folder_id must be provided via CLI or global in script")

    if args.datatype is not None:
        datatype = args.datatype
    elif DATATYPE:
        datatype = DATATYPE
    else:
        raise ValueError("datatype must be provided via CLI argument --datatype or set in global variable DATATYPE")

    if args.instructions is not None:
        instructions = args.instructions
    elif INSTRUCTIONS:
        instructions = INSTRUCTIONS
    else:
        raise ValueError(
            "instructions must be provided via CLI argument --instructions or set in global variable INSTRUCTIONS"
        )

    # The --no-wiki flag wins, then the ATTACH_WIKI constant, then the default of True
    if not args.no_wiki:
        attach_wiki = False
    elif ATTACH_WIKI is not None:
        attach_wiki = ATTACH_WIKI
    else:
        attach_wiki = True

    entity_view_id, curation_task_id = create_file_view(
        folder_id=folder_id,
        attach_wiki=attach_wiki,
        datatype=datatype,
        instructions=instructions
    )
    print(f"Wiki attached: {attach_wiki}")
    print(f"View ID: {entity_view_id}")
    print(f"Task ID: {curation_task_id}")


if __name__ == "__main__":
    main()
"""
Generate and upload CSV templates as a RecordSet for record-based metadata, create a
CurationTask, and also create a Grid to bootstrap the ValidationStatistics.

Usage:
    python create_record_based_metadata_task.py --project_id syn12345678 --folder_id syn12345678 --dcc AD \\
        --datatype BiospecimenMetadataTemplate --schema_path path/to/schema.json \\
        --schema_uri schema_uri --upsert_keys specimenID \\
        --instructions "Please curate this metadata according to the schema requirements"

    # Multiple upsert keys:
    python create_record_based_metadata_task.py --project_id syn12345678 --folder_id syn12345678 --dcc AD \\
        --datatype BiospecimenMetadataTemplate --schema_uri schema_uri \\
        --upsert_keys specimenID participantID sampleDate

Users can also set arguments using the global variables below,
but CLI arguments are used first.
"""
import argparse
import json
import tempfile
from pprint import pprint
from typing import Any, Dict, List, Optional

import pandas as pd
import synapseclient
from synapseclient import Synapse
from synapseclient.models import RecordSet, CurationTask, RecordBasedMetadataTaskProperties, Grid
from synapseclient.services.json_schema import JsonSchemaService

PROJECT_ID = ""  # The Synapse ID of the project where the folder exists
FOLDER_ID = ""  # The Synapse ID of the folder to upload to
DCC = ""  # Data Coordination Center
DATATYPE = ""  # Data type name
SCHEMA_URI = ""  # JSON schema URI
SCHEMA_PATH = None  # Path to JSON schema file located on your machine, alternative to SCHEMA_URI
UPSERT_KEYS = []  # List of column names to use as upsert keys, e.g., ['specimenID', 'participantID']
# Instructions for the curation task (required)
INSTRUCTIONS = "These are my custom instructions to tell someone what to do"
def extract_property_titles(schema_data: Dict[str, Any]) -> List[str]:
    """
    Extract the names of all properties in a JSON schema.
    The property names (the keys of 'properties') are used as column titles.

    Args:
        schema_data: The parsed JSON schema data

    Returns:
        List of property names, or an empty list if the schema has no 'properties'
    """
    titles = []
    # Check if 'properties' exists in the schema
    if 'properties' not in schema_data:
        return titles
    properties = schema_data['properties']
    for property_name in properties.keys():
        titles.append(property_name)
    return titles


def create_dataframe_from_titles(titles: List[str]) -> pd.DataFrame:
    """
    Create an empty DataFrame with the extracted titles as column names.

    Args:
        titles: List of title strings to use as column names

    Returns:
        Empty DataFrame with titles as columns
    """
    if not titles:
        return pd.DataFrame()
    df = pd.DataFrame(columns=titles)
    return df


def extract_schema_properties_from_dict(schema_data: Dict[str, Any]) -> pd.DataFrame:
    """
    Process a JSON schema dictionary and return a DataFrame with property names as columns.

    Args:
        schema_data: The parsed JSON schema data as a dictionary

    Returns:
        DataFrame with property names as columns
    """
    titles = extract_property_titles(schema_data)
    df = create_dataframe_from_titles(titles)
    return df
def extract_schema_properties_from_file(json_file_path: str) -> pd.DataFrame:
    """
    Process a JSON schema file and return a DataFrame with property names as columns.

    Args:
        json_file_path: Path to the JSON schema file

    Returns:
        DataFrame with property names as columns

    Raises:
        FileNotFoundError: If the JSON file doesn't exist
        json.JSONDecodeError: If the JSON file is malformed
    """
    try:
        with open(json_file_path, 'r', encoding='utf-8') as file:
            schema_data = json.load(file)
        return extract_schema_properties_from_dict(schema_data)
    except FileNotFoundError as e:
        raise FileNotFoundError(f"JSON schema file not found: {json_file_path}") from e
    except json.JSONDecodeError as e:
        raise json.JSONDecodeError(f"Invalid JSON in file '{json_file_path}': {e}", e.doc, e.pos)


def extract_schema_properties_from_web(syn: Synapse, schema_uri: str) -> pd.DataFrame:
    """
    Extract schema properties from a web-based JSON schema URI using Synapse.
    This function retrieves a JSON schema from a web URI through the Synapse platform
    and extracts property names to create a DataFrame with those names as columns.

    Args:
        syn: Authenticated Synapse client instance
        schema_uri: URI pointing to the JSON schema resource

    Returns:
        DataFrame with property names from the schema as column names
    """
    try:
        # The URI is expected to contain exactly two hyphens separating its three parts
        org_name, schema_name, version = schema_uri.split("-")
    except ValueError as e:
        raise ValueError(
            f"Invalid schema URI format: {schema_uri}. "
            "Expected format '<organization>-<schema name>-<version>'.") from e
    js = JsonSchemaService(synapse=syn)
    schemas_list = js.list_json_schemas(organization_name=org_name)
    if not any(schema_name == s["schemaName"] for s in schemas_list):
        raise ValueError(f"Schema URI '{schema_uri}' not found in Synapse JSON schemas.")
    schema = js.get_json_schema_body(json_schema_uri=schema_uri)
    return extract_schema_properties_from_dict(schema)


def extract_schema(syn: Synapse, schema_path: Optional[str] = None, schema_uri: Optional[str] = None) -> pd.DataFrame:
    """
    Extract schema properties from either a local file or web URI.
    This function provides a unified interface for extracting JSON schema properties
    from different sources. It accepts either a local file path or a web URI and
    delegates to the appropriate extraction function.

    Args:
        syn: Authenticated Synapse client instance (required for web URI extraction)
        schema_path: Optional path to a local JSON schema file
        schema_uri: Optional URI pointing to a web-based JSON schema resource

    Returns:
        DataFrame with property names from the schema as column names

    Raises:
        ValueError: If neither schema_path nor schema_uri is provided
        FileNotFoundError: If schema_path is provided but the file doesn't exist
        json.JSONDecodeError: If the local schema file contains invalid JSON
        SynapseError: If there are issues retrieving the web-based schema

    Note:
        At least one of schema_path or schema_uri must be provided. If both are given, the URI is used.
    """
    if schema_uri:
        return extract_schema_properties_from_web(syn, schema_uri)
    elif schema_path:
        return extract_schema_properties_from_file(schema_path)
    else:
        raise ValueError("Either schema_path or schema_uri must be provided.")
def main():
    """Main function for command-line usage."""
    parser = argparse.ArgumentParser(
        description="Generate and upload CSV templates for record-based metadata"
    )
    parser.add_argument('--project_id', type=str, required=False,
                        help='Synapse project ID where the folder exists')
    parser.add_argument('--folder_id', type=str, required=False,
                        help='Synapse folder ID for upload')
    parser.add_argument('--dcc', type=str, required=False,
                        help='Data Coordination Center')
    parser.add_argument('--datatype', type=str, required=False,
                        help='Data type name')
    parser.add_argument('--schema_uri', type=str, required=False, default=None,
                        help='JSON schema URI')
    parser.add_argument('--schema_path', type=str, required=False, default=None,
                        help='path to JSON schema')
    parser.add_argument('--upsert_keys', type=str, nargs='+', required=False,
                        help='Column names to use as upsert keys (one or more)')
    parser.add_argument('--instructions', type=str, required=False,
                        help='Instructions for the curation task (required via CLI or INSTRUCTIONS)')
    args = parser.parse_args()

    # Use CLI arguments first, then fall back to constants
    project_id = args.project_id if args.project_id is not None else PROJECT_ID
    folder_id = args.folder_id if args.folder_id is not None else FOLDER_ID
    dcc = args.dcc if args.dcc is not None else DCC
    datatype = args.datatype if args.datatype is not None else DATATYPE
    schema_uri = args.schema_uri if args.schema_uri is not None else SCHEMA_URI
    schema_path = args.schema_path if args.schema_path is not None else SCHEMA_PATH
    upsert_keys = args.upsert_keys if args.upsert_keys is not None else UPSERT_KEYS
    instructions = args.instructions if args.instructions is not None else INSTRUCTIONS

    # Validate required parameters
    if project_id is None:
        raise ValueError("project_id must be provided via CLI or global variable PROJECT_ID")
    if folder_id is None:
        raise ValueError("folder_id must be provided via CLI or global variable FOLDER_ID")
    if dcc is None:
        raise ValueError("dcc must be provided via CLI or global variable DCC")
    if datatype is None:
        raise ValueError("datatype must be provided via CLI or global variable DATATYPE")
    if upsert_keys is None:
        raise ValueError("upsert_keys must be provided via CLI or global variable UPSERT_KEYS")
    if instructions is None:
        raise ValueError("instructions must be provided via CLI or global variable INSTRUCTIONS")

    syn = synapseclient.Synapse()
    syn.login()

    # Build an empty template whose columns come from the schema properties
    template_df = extract_schema(syn=syn, schema_path=schema_path, schema_uri=schema_uri)
    syn.logger.info(f"Extracted schema properties and created template: {template_df.columns.tolist()}")

    # Write the template to a temporary CSV file that the RecordSet upload can read
    tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".csv")
    tmp.close()  # Release the handle so the file can be reopened by name on any platform
    try:
        with open(tmp.name, 'w') as f:
            template_df.to_csv(f, index=False)
    except Exception as e:
        syn.logger.error(f"Error writing template to temporary CSV file: {e}")
        raise

    # Upload the CSV as a RecordSet in the target folder
    try:
        with open(tmp.name, 'r') as f:
            recordset_with_data = RecordSet(
                name=f"{dcc}_{datatype}_RecordSet",
                parent_id=folder_id,
                description=f"RecordSet for {dcc} {datatype}",
                path=f.name,
                upsert_keys=upsert_keys
            ).store(synapse_client=syn)
            recordset_id = recordset_with_data.id
            syn.logger.info(f"Created RecordSet with ID: {recordset_id}")
            pprint(recordset_with_data)
    except Exception as e:
        syn.logger.error(f"Error creating RecordSet in Synapse: {e}")
        raise

    # Create the curation task that points at the new RecordSet
    try:
        curation_task = CurationTask(
            data_type=datatype,
            project_id=project_id,
            instructions=instructions,
            task_properties=RecordBasedMetadataTaskProperties(
                record_set_id=recordset_id,
            )
        ).store(synapse_client=syn)
        syn.logger.info(
            f"Created CurationTask ({curation_task.task_id}) in folder {folder_id} for data type {datatype}")
        pprint(curation_task)
    except Exception as e:
        syn.logger.error(f"Error creating CurationTask in Synapse: {e}")
        raise

    # Create a Grid session for the RecordSet and export it back to the RecordSet
    try:
        curation_grid: Grid = Grid(
            record_set_id=recordset_id,
        )
        curation_grid.create(synapse_client=syn)
        curation_grid = curation_grid.export_to_record_set(synapse_client=syn)
        syn.logger.info(f"Created Grid view for RecordSet ID: {recordset_id} for data type {datatype}")
        pprint(curation_grid)
    except Exception as e:
        syn.logger.error(f"Error creating Grid view in Synapse: {e}")
        raise


if __name__ == "__main__":
    main()
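The CLI-then-constant fallback in `main()` repeats the same pattern for every setting. Below is a minimal sketch of one way to fold it into a helper. `resolve_setting` is a hypothetical name, not part of the script or of synapseclient. It also treats empty strings and empty lists as missing, on the assumption that the module-level constants may default to `""` (as `DCC` and `DATATYPE` do in `query_schema_registry.py`), in which case an `is None` check alone would never fire.

```python
from typing import Optional, TypeVar

T = TypeVar("T")


def resolve_setting(
    name: str,
    cli_value: Optional[T],
    default: Optional[T],
    required: bool = True,
) -> Optional[T]:
    """Return the CLI value if one was given, otherwise the module-level default.

    Hypothetical helper: raises ValueError when a required setting is
    missing or empty in both places.
    """
    value = cli_value if cli_value is not None else default
    # Treat "" and [] the same as None so blank constants are caught too
    if required and (value is None or value == "" or value == []):
        raise ValueError(
            f"{name} must be provided via CLI or global variable {name.upper()}"
        )
    return value


# "syn123" is a placeholder ID used only for illustration
project_id = resolve_setting("project_id", None, "syn123")
schema_uri = resolve_setting("schema_uri", None, None, required=False)
print(project_id, schema_uri)
```

With a helper like this, each setting in `main()` would take one line, and the error message stays consistent with the ones the script already raises.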
from pprint import pprint

from synapseclient import Synapse
from synapseclient.models.curation import CurationTask

PROJECT_ID = ""  # The Synapse ID of the project to list tasks from

syn = Synapse()
syn.login()

for curation_task in CurationTask.list(
    project_id=PROJECT_ID
):
    pprint(curation_task)
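A listing like the one above could also serve as a pre-check before creating a task, so that reusing a data type in the same project produces a readable message. This is a rough sketch under assumptions: `ensure_data_type_unused` is a hypothetical helper, it assumes each listed task exposes a `data_type` attribute matching the `data_type` argument passed to `CurationTask`, and it uses stand-in objects instead of a live Synapse call.

```python
from types import SimpleNamespace
from typing import Iterable


def ensure_data_type_unused(tasks: Iterable[object], data_type: str) -> None:
    """Raise ValueError if any task in `tasks` already uses `data_type`.

    Hypothetical helper: only reads a `data_type` attribute from each task.
    """
    for task in tasks:
        if getattr(task, "data_type", None) == data_type:
            raise ValueError(
                f"A curation task for data type '{data_type}' already exists "
                "in this project; choose a different data type or reuse the "
                "existing task."
            )


# Stand-ins for listed tasks; the data type names are made up for illustration
existing_tasks = [
    SimpleNamespace(data_type="ExampleTypeA"),
    SimpleNamespace(data_type="ExampleTypeB"),
]

ensure_data_type_unused(existing_tasks, "ExampleTypeC")  # no error
try:
    ensure_data_type_unused(existing_tasks, "ExampleTypeA")
except ValueError as err:
    print(err)
```

With the real client, the `tasks` argument could be the iterable returned by `CurationTask.list(project_id=PROJECT_ID)`.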
"""
Query the Synapse schema registry table to retrieve Schema URIs based on DCC and datatype.

This script queries the schema registry table at syn69735275 to find matching schemas
based on the provided DCC (Data Coordination Center) and datatype parameters.
Results are sorted by version and the URI is returned.

Usage:
    python query_schema_registry.py --dcc ad --datatype IndividualAnimalMetadataTemplate

    # Or use the global variables in the script
    python query_schema_registry.py

Users can also set arguments using the global variables below,
but CLI arguments take precedence.
"""
import argparse
from typing import List, Optional

from synapseclient import Synapse
from synapseclient.models import Table

# Global variables - set these if you don't want to use command line arguments
DCC = ""  # Data Coordination Center (e.g., 'ad', 'amp', 'mc2')
DATATYPE = ""  # Data type name from schema

# The Synapse ID of the schema registry table
SCHEMA_REGISTRY_TABLE_ID = "syn69735275"
def query_schema_registry(
    dcc: str,
    datatype: str,
    synapse_client: Optional[Synapse] = None
) -> List[dict]:
    """
    Query the schema registry table to find schemas matching DCC and datatype.

    Arguments:
        dcc: Data Coordination Center identifier (e.g., 'ad', 'amp', 'mc2')
        datatype: Data type name from the schema
        synapse_client: Authenticated Synapse client instance

    Returns:
        List of dictionaries containing schema information, sorted by version
    """
    if synapse_client is None:
        syn = Synapse()
        syn.login()
    else:
        syn = synapse_client

    # Construct SQL query to search for schemas matching DCC and datatype
    # The query looks for exact matches in DCC and contains match for datatype
    # Results are sorted by version in descending order (newest first)
    query = f"""
    SELECT * FROM {SCHEMA_REGISTRY_TABLE_ID}
    WHERE dcc = '{dcc}'
    AND datatype LIKE '%{datatype}%'
    ORDER BY version DESC
    """

    print(f"Querying schema registry with DCC='{dcc}' and datatype='{datatype}'...")
    print(f"SQL Query: {query}")

    # Query the table and get results as a pandas DataFrame
    table = Table(id=SCHEMA_REGISTRY_TABLE_ID)
    results_df = table.query(query=query)

    if results_df.empty:
        print(f"No schemas found for DCC='{dcc}' and datatype='{datatype}'")
        return []

    # Convert DataFrame to list of dictionaries for easier handling
    results = results_df.to_dict('records')

    print(f"Found {len(results)} matching schema(s):")
    for i, result in enumerate(results, 1):
        print(f" {i}. URI: {result.get('uri', 'N/A')}")
        print(f" Version: {result.get('version', 'N/A')}")
        print(f" DCC: {result.get('dcc', 'N/A')}")
        print(f" DataType: {result.get('datatype', 'N/A')}")
        if i < len(results):
            print()

    return results
def get_latest_schema_uri(dcc: str, datatype: str, synapse_client: Optional[Synapse] = None) -> Optional[str]:
    """
    Get the URI of the latest schema version for the given DCC and datatype.

    Arguments:
        dcc: Data Coordination Center identifier
        datatype: Data type name from the schema
        synapse_client: Authenticated Synapse client instance

    Returns:
        URI string of the latest schema version, or None if not found
    """
    results = query_schema_registry(dcc, datatype, synapse_client)

    if results:
        latest_schema = results[0]  # Results are sorted by version DESC, so first is latest
        uri = latest_schema.get('uri')
        print(f"\nLatest schema URI: {uri}")
        return uri
    else:
        print(f"\nNo schema found for DCC='{dcc}' and datatype='{datatype}'")
        return None
def main():
    """Main function for command-line usage."""
    parser = argparse.ArgumentParser(
        description="Query the Synapse schema registry to find Schema URIs by DCC and datatype"
    )
    parser.add_argument(
        '--dcc',
        type=str,
        help='Data Coordination Center identifier (e.g., ad, amp, mc2)'
    )
    parser.add_argument(
        '--datatype',
        type=str,
        help='Data type name from the schema (e.g., IndividualAnimalMetadataTemplate)'
    )
    args = parser.parse_args()

    # Use command line arguments if provided, otherwise use global variables
    if args.dcc is not None:
        dcc = args.dcc
    elif DCC:
        dcc = DCC
    else:
        raise ValueError("DCC must be provided via CLI argument --dcc or set in global variable DCC")

    if args.datatype is not None:
        datatype = args.datatype
    elif DATATYPE:
        datatype = DATATYPE
    else:
        raise ValueError("datatype must be provided via CLI argument --datatype or set in global variable DATATYPE")

    # Initialize Synapse client
    syn = Synapse()
    syn.login()

    # Get just the latest schema URI
    latest_uri = get_latest_schema_uri(dcc, datatype, syn)
    if latest_uri:
        print(f"\nUse this URI in your scripts: {latest_uri}")


if __name__ == "__main__":
    main()
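The script above relies on `ORDER BY version DESC` to put the newest schema first. If the `version` column is stored as text (an assumption; the column type is not stated here), that ordering is lexicographic, so `0.9.0` would sort ahead of `0.10.0`. The sketch below shows one way to pick the latest row numerically on the client side. `version_key` and `pick_latest` are hypothetical helpers, and the rows are made-up examples, not real registry entries.

```python
from typing import Dict, List, Optional, Tuple


def version_key(version: str) -> Tuple[int, ...]:
    """Turn a dotted version such as '0.10.0' into (0, 10, 0).

    Assumes every part is numeric; a suffix like '-rc1' raises ValueError.
    """
    return tuple(int(part) for part in str(version).split("."))


def pick_latest(rows: List[Dict[str, str]]) -> Optional[Dict[str, str]]:
    """Return the row with the highest numeric version, or None if empty."""
    if not rows:
        return None
    return max(rows, key=lambda row: version_key(row["version"]))


# Made-up rows shaped like the dictionaries returned by query_schema_registry
rows = [
    {"uri": "example-schema-0.9.0", "version": "0.9.0"},
    {"uri": "example-schema-0.10.0", "version": "0.10.0"},
    {"uri": "example-schema-0.2.0", "version": "0.2.0"},
]

latest = pick_latest(rows)
print(latest["uri"])  # example-schema-0.10.0
```

Applied to the output of `query_schema_registry`, this would replace the `results[0]` lookup in `get_latest_schema_uri` and would not depend on how the table sorts version strings.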
from synapseclient import Synapse
from synapseclient.models import Folder
# Data from: https://synapse.org/Synapse:syn69735275/tables/
# The URI of the JSON Schema you want to bind, for example: `sage.schemas.v2571-ad.IndividualAnimalMetadataTemplate.schema-0.1.0`
URI = ""
# The Synapse ID of the entity you want to bind the JSON Schema to. This should be the ID of a Folder where you want to enforce the schema.
FOLDER_ID = ""
syn = Synapse()
syn.login()
folder = Folder(id=FOLDER_ID).get()
schema_validation = folder.validate_schema()
print(f"Schema validation result for folder {FOLDER_ID}: {schema_validation}")
@andrewelamb

andrewelamb commented Oct 1, 2025

I would add something about the registered schemas:

  • The ones in the Synapse tables were created by DPE using Schematic
  • Ones not created by Schematic (for example, from a LinkML workflow) may not work with the scripts we created.

@andrewelamb

I put something about trying to use the same datatype twice in the same project. That could be a note here, or some error handling in the script/python client that gives the user a better error message.

@rxu17

rxu17 commented Oct 2, 2025
