This setup guide has been deprecated in favor of https://python-docs.synapse.org/en/stable/guides/extensions/curator/metadata_curation/. It is recommended that you use the curator extension instead of following this guide.
This guide helps you set up and run programmatic curation tools for managing metadata in Synapse. These tools support two different metadata workflows depending on your data organization needs.
When to use the file-based workflow: Metadata describes individual data files and is stored as annotations directly on each file.
Example: Each sequencing data file has annotations like sample type, sequencing method, and sample identifier attached to the file itself.
When to use the record-based workflow: Metadata is normalized into structured records to eliminate duplication and ensure consistency.
Example: Sample information is stored once in a CSV record, and multiple data files reference that sample by its identifier instead of duplicating the sample metadata. A project may have several record-based metadata types, each of which is curated separately.
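The difference between the two workflows can be pictured with plain Python dictionaries. This is only an illustration of the data layout, not Synapse code; the Synapse IDs and field names (sampleID, sampleType, sequencingMethod) are hypothetical.

```python
# Hypothetical example data; IDs and field names are illustrative only.

# File-based: every file carries its own copy of the sample metadata.
file_annotations = {
    "syn0000001": {"sampleID": "S1", "sampleType": "tissue", "sequencingMethod": "rnaSeq"},
    "syn0000002": {"sampleID": "S1", "sampleType": "tissue", "sequencingMethod": "rnaSeq"},
}

# Record-based: the sample is described once, and files point to it by identifier.
sample_records = {"S1": {"sampleType": "tissue", "sequencingMethod": "rnaSeq"}}
file_references = {"syn0000001": "S1", "syn0000002": "S1"}


def resolve_metadata(file_id):
    """Return the shared sample record that a file references."""
    return sample_records[file_references[file_id]]


print(resolve_metadata("syn0000002"))
```

In the record-based layout a correction to sample S1 is made in one place, whereas the file-based layout requires updating every file that repeats the value.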
- Python 3.9 or later
- Synapse account with appropriate permissions
- Synapse authentication configured (login credentials)
pip install --upgrade "synapseclient[pandas]"
For JSON Schemas that were generated and stored in Synapse outside of the DPE team, you will need to find the correct schema URI based on the values used when the schema was registered.
Many registered JSON Schemas are available at: https://synapse.org/Synapse:syn69735275/tables/
- Browse the schemas at the table URL and find the one that matches your data type
- Note the schema URI (e.g., sage.schemas.v2571-ad.IndividualAnimalMetadataTemplate.schema-0.1.0)
Use the query_schema_registry.py script to programmatically find schemas by DCC and datatype:
python query_schema_registry.py --dcc ad --datatype IndividualAnimalMetadataTemplate
This will output something like:
Found 2 matching schema(s):
1. URI: sage.schemas.v2571-ad.IndividualAnimalMetadataTemplate.schema-0.1.0
Version: 0.1.0
DCC: ad
DataType: IndividualAnimalMetadataTemplate
2. URI: sage.schemas.v2571-ad.IndividualAnimalMetadataTemplate.schema-0.0.0
Version: 0.0.0
DCC: ad
DataType: IndividualAnimalMetadataTemplate
Latest schema URI: sage.schemas.v2571-ad.IndividualAnimalMetadataTemplate.schema-0.1.0
Use this URI in your scripts: sage.schemas.v2571-ad.IndividualAnimalMetadataTemplate.schema-0.1.0
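If you collect several matching URIs yourself, the newest one can be picked by comparing the version suffix numerically. The sketch below assumes every URI ends in schema-X.Y.Z, as in the output above; the helper names are hypothetical, and this is not necessarily how query_schema_registry.py decides which schema is the latest.

```python
import re


def schema_version(uri):
    """Extract the trailing 'schema-X.Y.Z' version as a tuple of integers."""
    match = re.search(r"schema-(\d+)\.(\d+)\.(\d+)$", uri)
    if match is None:
        raise ValueError(f"No version suffix found in URI: {uri}")
    return tuple(int(part) for part in match.groups())


def latest_schema_uri(uris):
    """Return the URI with the highest version, compared numerically."""
    return max(uris, key=schema_version)


uris = [
    "sage.schemas.v2571-ad.IndividualAnimalMetadataTemplate.schema-0.0.0",
    "sage.schemas.v2571-ad.IndividualAnimalMetadataTemplate.schema-0.1.0",
]
print(latest_schema_uri(uris))
```

Comparing integer tuples rather than raw strings keeps 0.10.0 ordered after 0.9.0.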
Alternative script options:
- Edit the script constants and run without arguments:
python query_schema_registry.py
Update these variables in query_schema_registry.py:
DCC = "" # Data Coordination Center (e.g., 'ad', 'amp', 'mc2')
DATATYPE = "" # Data type name from schema (must be unique; may include prefixes/suffixes like `IndividualAnimalMetadataTemplate.studyXYZ`, `AssayRNAseqTemplate.studyXYZ`)
Before creating curation tasks, you must bind a JSON schema to your Synapse folder or recordset. The script supports binding to either entity type, but only one can be specified at a time.
To bind to a folder:
python bind_json_schema.py --folder-id syn12345678 --uri sage.schemas.full.schema.uri
To bind to a recordset:
python bind_json_schema.py --recordset-id syn87654321 --uri sage.schemas.full.schema.uri
Alternatively, open bind_json_schema.py and update these variables:
URI = "" # Your Schema URI
FOLDER_ID = "" # Your Synapse folder ID (leave empty if using recordset)
RECORDSET_ID = "" # Your Synapse recordset ID (leave empty if using folder)
Then run:
python bind_json_schema.py
Note: The script enforces mutual exclusion - you can only bind to either a folder OR a recordset, not both simultaneously.
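One common way for a command-line script to enforce this kind of either/or rule is an argparse mutually exclusive group. The sketch below reuses the flag names shown above, but it is an assumption about how such a check could be written, not the actual contents of bind_json_schema.py.

```python
import argparse


def build_parser():
    """Build a parser that accepts exactly one of --folder-id / --recordset-id."""
    parser = argparse.ArgumentParser(description="Bind a JSON schema (sketch)")
    parser.add_argument("--uri", required=True, help="Schema URI to bind")
    # required=True means one of the two targets must be given;
    # the group itself rejects the case where both are given.
    target = parser.add_mutually_exclusive_group(required=True)
    target.add_argument("--folder-id", help="Synapse folder ID")
    target.add_argument("--recordset-id", help="Synapse recordset ID")
    return parser


args = build_parser().parse_args(
    ["--folder-id", "syn12345678", "--uri", "sage.schemas.full.schema.uri"]
)
print(args.folder_id, args.recordset_id)
```

Passing both --folder-id and --recordset-id to this parser makes argparse print an error and exit instead of continuing.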
Use this when you want metadata stored as annotations directly on data files.
python create_file_based_metadata_task.py --folder-id syn12345678 --datatype DATATYPE_NAME --instructions "Your curation instructions"
Required parameters:
- --folder-id: Your Synapse folder ID
- --datatype: Data type name for the CurationTask (must be unique)
- --instructions: Instructions for the curation task
Optional parameters:
- --no-wiki: Skip creating a wiki page for the file view
Examples:
# Basic usage with required datatype
python create_file_based_metadata_task.py --folder-id syn12345678 --datatype IndividualAnimalMetadataTemplate.studyXYZ --instructions "Please curate the metadata for this dataset according to the schema requirements"
# Skip wiki creation
python create_file_based_metadata_task.py --folder-id syn12345678 --datatype IndividualAnimalMetadataTemplate.studyXYZ --instructions "Custom curation instructions" --no-wiki
The script creates:
- EntityView: Shows all files in your folder with their metadata
- CurationTask: Manages the curation process
- Wiki page: Documents the file view (unless --no-wiki is used)
Update these variables in create_file_based_metadata_task.py:
FOLDER_ID = "syn12345678" # Your folder ID
DATATYPE = "YourDatatype.studyXYZ" # Data type name (required)
INSTRUCTIONS = "Your custom curation instructions" # Instructions (required)
ATTACH_WIKI = True # Set to False to skip wiki creation
Use this when you want structured metadata records that files can reference.
python create_record_based_metadata_task.py \
--project_id syn1234 \
--folder_id syn12345678 \
--dcc DCC_VALUE \
--datatype DATATYPE_PYTHON \
--schema_uri SCHEMA_URI \
--upsert_keys ONE_OR_MORE_KEYS \
--instructions "Your curation instructions"
Required parameters:
- --project_id: Your Synapse project ID
- --folder_id: Your Synapse folder ID
- --dcc: Data Coordination Center (e.g., ad, amp, mc2)
- --datatype: Data type name from your schema (must be unique; can include prefixes/suffixes like IndividualAnimalMetadataTemplate.studyXYZ, AssayRNAseqTemplate.studyXYZ)
- --upsert_keys: One or more column names used as unique identifiers (e.g., specimenID, or specimenID participantID sampleDate)
- --instructions: Instructions for the curation task
Choose one of these parameters:
- --schema_uri: Schema URI (e.g., sage.schemas.v2571-ad.IndividualAnimalMetadataTemplate.schema-0.1.0, sage.schemas.v2571-el.AssayRNAseqTemplate.schema-0.0.1)
- --schema_path: Absolute local path to schema file (alternative to URI)
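The role of the upsert keys can be illustrated with the general meaning of "upsert": a row whose key columns match an existing row updates it, and any other row is added. The function below is a plain-Python sketch of that idea with hypothetical data; how Synapse applies the keys internally is not described in this guide, so treat the details as an assumption.

```python
def upsert(existing, incoming, upsert_keys):
    """Merge incoming rows into existing rows, matching on the upsert key columns."""

    def key_of(row):
        # A composite key is the tuple of values in the key columns.
        return tuple(row[column] for column in upsert_keys)

    merged = {key_of(row): dict(row) for row in existing}
    for row in incoming:
        # Update the matching row if present, otherwise start a new one.
        merged.setdefault(key_of(row), {}).update(row)
    return list(merged.values())


existing = [
    {"specimenID": "S1", "tissue": "brain"},
    {"specimenID": "S2", "tissue": "liver"},
]
incoming = [
    {"specimenID": "S2", "tissue": "kidney"},  # matches S2, so it updates that row
    {"specimenID": "S3", "tissue": "heart"},   # no match, so it is added
]
result = upsert(existing, incoming, ["specimenID"])
print(result)
```

With several key columns (for example specimenID and participantID), two rows are treated as the same record only when all key values match.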
The script creates:
- RecordSet: CSV-based structured metadata storage
- CurationTask: Manages the curation process
- Grid: Interactive view for editing metadata records; here it is used only to export its results back to the RecordSet to "bootstrap" it
Note: After creating the RecordSet, you can optionally bind a schema directly to it using:
python bind_json_schema.py --recordset-id <RECORDSET_ID> --uri <SCHEMA_URI>
Update these variables in create_record_based_metadata_task.py:
PROJECT_ID = ""
FOLDER_ID = ""
DCC = ""
DATATYPE = "" # Must be unique; can include prefixes/suffixes for uniqueness
SCHEMA_URI = ""
UPSERT_KEYS = []
INSTRUCTIONS = "" # Instructions (required)
validate_json_schema.py
This script validates the files in a particular folder and returns a summary of the valid and invalid entities according to your schema.
Update the URI and FOLDER_ID variables in the script before running it.
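To give a feel for what a valid/invalid summary looks like, here is a deliberately simplified, offline sketch. It checks only the "required" and "enum" keywords of a schema against hypothetical annotation dictionaries; it is not a full JSON Schema validator and does not reproduce what validate_json_schema.py or Synapse actually do.

```python
def check_annotations(annotations, schema):
    """Return a list of problems; an empty list means this simplified check passed."""
    problems = []
    for name in schema.get("required", []):
        if name not in annotations:
            problems.append(f"missing required property '{name}'")
    for name, rules in schema.get("properties", {}).items():
        allowed = rules.get("enum")
        if allowed is not None and name in annotations and annotations[name] not in allowed:
            problems.append(f"'{name}' has a value outside the allowed list")
    return problems


def summarize(entities, schema):
    """Group entity IDs into valid and invalid, keeping the problems for invalid ones."""
    summary = {"valid": [], "invalid": {}}
    for entity_id, annotations in entities.items():
        problems = check_annotations(annotations, schema)
        if problems:
            summary["invalid"][entity_id] = problems
        else:
            summary["valid"].append(entity_id)
    return summary


# Hypothetical schema fragment and file annotations.
schema = {
    "required": ["specimenID", "sampleType"],
    "properties": {"sampleType": {"enum": ["tissue", "blood"]}},
}
entities = {
    "syn0000001": {"specimenID": "S1", "sampleType": "tissue"},
    "syn0000002": {"sampleType": "plasma"},
}
summary = summarize(entities, schema)
print(summary)
```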
list_curation_task.py
Update the PROJECT_ID in the script to match your project.
After running these scripts successfully:
- File-based workflow: Upload data files to your folder. The EntityView will automatically show them with their metadata annotations.
- Record-based workflow: Use the Grid interface in Synapse to edit your metadata records, then upload data files that reference these records.
- Monitor curation progress through the CurationTask interface in Synapse. (UI elements are still being developed.)
For additional features and web UI integration, refer to the full Synapse documentation (To be created).