@xbrianh
Last active May 12, 2020 17:00
Jean meeting questions and answers

Meeting with Fellow: Jean Monlong

  1. Jean is interested in tooling to help clean up his Terra workspace, including deleting intermediate files from completed workflows and deleting files from workflows that did not complete successfully. Brian, I believe some of your tooling may be useful here, or you could work with Jean to develop your tooling further?

    • The terra-notebook-utils package contains tooling to remove workflow intermediate files from your workspace bucket. Code is available now on the master branch, and will be released soon through the pip package.
    • From Python:
      from terra_notebook_utils import workspace
      workspace.remove_workflow_logs()
      
      Pass in a submission ID to delete intermediate logs from a specific submission:
      from terra_notebook_utils import workspace
      workspace.remove_workflow_logs(submission_id=my_submission_id)
      
    • Via CLI:
      tnu workspace delete-workflow-logs
      
    • The development version of terra-notebook-utils can be installed with
      pip install git+https://github.com/DataBiosphere/terra-notebook-utils
      
    • Terra document on controlling costs: https://support.terra.bio/hc/en-us/articles/360029772212-Controlling-Cloud-costs-sample-use-cases#storingdata

    Action Items (Brian):

  2. Better approaches than a single notebook that includes functions that take a long time to run, specifically to prevent the notebook from disconnecting and no longer updating the user. Options might include saving outputs to your bucket along the way, converting these tasks into workflows, or running multiple notebooks.

    • What sort of operations are being performed in the long-running notebooks?
    • If the operation can be interrupted, we can explore functionality to save state.
    • Use workflows instead?
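
One way the "save state" idea could look in a notebook is sketched below. This is a minimal illustration, not part of terra-notebook-utils: the function name `run_with_checkpoints`, the `process` callable, and the JSON checkpoint file are all assumptions made for the example. It assumes the long operation can be split into independent items whose results are JSON-serializable.

```python
import json
import os


def run_with_checkpoints(items, process, checkpoint_path):
    """Process items in order, saving results after each one so that an
    interrupted run can resume where it left off.

    `process` is a hypothetical user-supplied function; results are keyed
    by str(item) because JSON object keys are always strings.
    """
    # Load results from a previous, interrupted run if a checkpoint exists.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as fh:
            results = json.load(fh)
    else:
        results = {}

    for item in items:
        key = str(item)
        if key in results:
            continue  # finished in an earlier run, skip it
        results[key] = process(item)
        # Write to a temporary file and rename it, so a crash in the middle
        # of a write does not leave a corrupt checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as fh:
            json.dump(results, fh)
        os.replace(tmp_path, checkpoint_path)
    return results
```

If the notebook disconnects, re-running the same cell picks up after the last completed item. Copying the checkpoint file to the workspace bucket periodically would also protect it if the notebook runtime itself is lost.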

    Action Items (Brian):

    • How does autopause in Terra work?
    • Look into saving state in Jean's notebooks
  3. Better ways to set up batch jobs that may encounter Cromwell limits. Jean is currently doing this manually. Maybe this is a Terra dev issue.

    • Which limits/quotas?
    • Avoid hog limits (2400 concurrent jobs per user). According to the Broad, this limit is likely to be increased in the future.

    Action Items (Brian):

    • Write job batching and monitoring utility?
    • Ping Terra about removing the hog limit.
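
A job batching utility of the kind discussed here might follow the pattern sketched below. This is only an outline under stated assumptions: `submit` and `count_running` are hypothetical callables that the user would have to implement against whatever submission and monitoring interface is available. They are not real Terra or Cromwell API calls. The default of 2400 is the per-user limit mentioned above.

```python
import time


def run_in_batches(jobs, submit, count_running,
                   max_concurrent=2400, poll_seconds=60, sleep=time.sleep):
    """Submit jobs while keeping the number of running jobs at or below
    max_concurrent.

    submit(job) and count_running() are hypothetical, user-supplied
    callables; `sleep` is injectable so the loop can be tested quickly.
    """
    pending = list(jobs)
    submitted = []
    while pending:
        free_slots = max_concurrent - count_running()
        if free_slots <= 0:
            # At the limit: wait for running jobs to finish, then re-check.
            sleep(poll_seconds)
            continue
        # Submit only as many jobs as there are free slots.
        batch, pending = pending[:free_slots], pending[free_slots:]
        for job in batch:
            submitted.append(submit(job))
    return submitted
```

Polling the running count before each batch, rather than submitting fixed-size chunks on a timer, keeps the queue full without ever exceeding the limit.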
  4. Are there best practices for workflow setup that make things run cheaper or faster? This may include topics around localization.

    • Localization can take a long time, which makes it expensive. Streaming is a good option if your software permits.
    • Use Cromwell call caching if possible. This allows Cromwell to immediately return results for jobs that have previously run with exactly the same input.
    • Consider tradeoffs of larger machines and fewer workflows.

    Action Items (Brian):

    • Look at workflows to understand potential for streaming
    • Contact Terra about the correct ratio of preemptible to on-demand instances.
    • Contact Terra about a rule of thumb for preemptible workflow duration.

VCF Discussion:

VCFs are huge and contain all mutations and variants.

  • Query for most frequent variants
    • Select only those with value higher than x
    • Download portion of VCF matching query
  • Useful for other people
  • Want variants around genes
  • How much of this is already available in Samtools/tabix?
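
The query described above can be sketched in plain Python as follows. This is an illustration under assumptions, not an existing tool: the function name `select_variants` is made up, and it assumes allele frequency is stored in the `AF` key of the INFO column, which depends on how the VCF was produced. For bgzipped and indexed files, tabix region queries and bcftools filter expressions likely cover much of this already.

```python
def select_variants(vcf_lines, chrom, start, end, min_af):
    """Yield VCF data lines within chrom:start-end (1-based, inclusive)
    where at least one alternate allele has INFO/AF greater than min_af.

    Assumes an uncompressed, tab-separated VCF with an AF entry in INFO.
    """
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip meta-information and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8 or fields[0] != chrom:
            continue
        pos = int(fields[1])
        if not (start <= pos <= end):
            continue
        # INFO is a semicolon-separated list of KEY=VALUE pairs and flags.
        info = dict(kv.split("=", 1) for kv in fields[7].split(";") if "=" in kv)
        if "AF" not in info:
            continue
        # AF holds one value per alternate allele; "." marks a missing value.
        afs = [float(v) for v in info["AF"].split(",") if v != "."]
        if any(af > min_af for af in afs):
            yield line
```

For example, passing an open file handle with a gene's coordinates as `start` and `end` (padded as needed) would return the frequent variants around that gene.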