Sean Davis seandavi

@seandavi
seandavi / setup_wandb.md
Created March 22, 2024 02:14
Set up Weights & Biases using Docker Compose

Sure! Here's the converted Docker Compose YAML file with a MySQL server as a separate container and a Docker volume for storage:

version: '3'
services:
  wandb-local:
    image: wandb/local
    container_name: wandb-local
    environment:
      - HOST=https://YOUR_DNS_NAME
@seandavi
seandavi / prompt_example.txt
Created March 7, 2024 23:20
Example prompt for automating the evaluation of job candidate applications against a set of minimum and preferred qualifications, with output in YAML
You are an HR specialist and are evaluating the qualifications of job applicants
for a high-performance computing (HPC) specialist position.
You have been given a set of criteria to evaluate each candidate.
The candidate materials are in the attached PDF.
For each job applicant, fill in the following YAML-format criteria document. You
may use the "comment" field to provide additional context or justification for
your evaluation.
---
# candidate name
@seandavi
seandavi / cmgd_se_to_csv.R
Created February 25, 2024 23:31
convert all CMGD SummarizedExperiments to CSV files
# convert all CMGD SummarizedExperiments to CSV files
# Should run more-or-less directly as a script
# Requires more than 128GB RAM to complete
# Generates about 200GB of files
# BiocManager::install('curatedMetagenomicData')
# BiocManager::install(c('arrow','data.table','dplyr', 'readr'))
library(curatedMetagenomicData)
@seandavi
seandavi / gist:dd7052951a199e5ea5ce584b01c5e0f2
Created January 31, 2024 20:57
Common Fund Data Ecosystem funding from NIH RePORTER
#!/bin/bash
# results in json format
# Actual data in "results" array
#
# Opportunity numbers taken from https://commonfund.nih.gov/dataecosystem/FundedResearch
curl \
-X POST \
https://api.reporter.nih.gov/v2/projects/search \
-d '{"criteria":{"opportunity_numbers": ["RFA-RM-23-003", "PA20-185", "OTA-23-004", "RFA-RM-22-007", "OTA-23-005", "RFA-RM-17-026", "RFA-RM-21-007", "RFA-RM-19-012"]}}' \
-H 'Content-Type: application/json'
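The comments in the script note that the response is JSON with the actual data in a "results" array. A minimal Python sketch of building the same request body and pulling records out of a response might look like the following. It assumes only that top-level "results" key; the sample response and its field name are hypothetical placeholders, not real RePORTER output, and no network call is made.

```python
import json

# Opportunity numbers copied from the curl command above.
OPPORTUNITY_NUMBERS = [
    "RFA-RM-23-003", "PA20-185", "OTA-23-004", "RFA-RM-22-007",
    "OTA-23-005", "RFA-RM-17-026", "RFA-RM-21-007", "RFA-RM-19-012",
]


def build_payload(opportunity_numbers):
    """Return the JSON request body in the same shape as the curl -d argument."""
    return json.dumps({"criteria": {"opportunity_numbers": list(opportunity_numbers)}})


def extract_results(response_text):
    """Return the list under the top-level "results" key (empty list if absent)."""
    return json.loads(response_text).get("results", [])


# Hypothetical response for illustration only; "example_field" is made up.
sample_response = '{"results": [{"example_field": "a"}, {"example_field": "b"}]}'
records = extract_results(sample_response)
print(len(records))
```

The payload string from `build_payload(OPPORTUNITY_NUMBERS)` could be sent with any HTTP client to the endpoint used in the script, with the `Content-Type: application/json` header set as in the curl call.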
@seandavi
seandavi / sentence_embeddings_for_metadata_curation.ipynb
Created January 25, 2024 18:11
A quick demonstration of using sentence embeddings for semantic similarity search of metadata terms against "ontology" terms
@seandavi
seandavi / TCGAtranslateID.R
Last active January 8, 2024 21:12
Translate GDC file_ids to TCGA barcodes
library(GenomicDataCommons)
library(magrittr)
TCGAtranslateID = function(file_ids) {
  info = files() %>%
    GenomicDataCommons::filter( ~ file_id %in% file_ids) %>%
    GenomicDataCommons::select('cases.samples.submitter_id') %>%
    results_all()
  # The mess of code below is to extract TCGA barcodes
  # id_list will contain a list (one item for each file_id)
@seandavi
seandavi / bioconductor_bibliometrix_summary.txt
Created January 5, 2024 20:26
Basic bibliometrix analysis based on DOIs available from CITATION files in Bioconductor, searched through OpenAlex.
MAIN INFORMATION ABOUT DATA
Timespan 2004 : 2023
Sources (Journals, Books, etc) 100
Documents 586
Annual Growth Rate % 11.38
Document Average Age 6.51
Average citations per doc 626.4
Average citations per year per doc 56.14
References 10872
@seandavi
seandavi / platformMap.txt
Created November 13, 2014 20:49
Bioconductor/GEO platform mapping
"title" "gpl" "bioc_package" "manufacturer" "organism" "data_row_count"
"Illumina Sentrix Array Matrix (SAM) - GoldenGate Methylation Cancer Panel I" "GPL15380" "GGHumanMethCancerPanelv1" "Illumina" "Homo sapiens" 1536
"Illumina HumanMethylation27 BeadChip (HumanMethylation27_270596_v.1.2)" "GPL8490" "IlluminaHumanMethylation27k" "Illumina, Inc." "Homo sapiens" 27578
"Illumina HumanMethylation450 BeadChip (HumanMethylation450_15017482)" "GPL13534" "IlluminaHumanMethylation450k" "Illumina, Inc." "Homo sapiens" 485577
"GE Healthcare/Amersham Biosciences CodeLink™ ADME Rat 16-Assay Bioarray" "GPL2898" "adme16cod" "GE Healthcare" "Rattus norvegicus" 1280
"[AG] Affymetrix Arabidopsis Genome Array" "GPL71" "ag" "Affymetrix" "Arabidopsis thaliana" 8297
"[ATH1-121501] Affymetrix Arabidopsis ATH1 Genome Array" "GPL198" "ath1121501" "Affymetrix" "Arabidopsis thaliana" 22810
"[Bovine] Affymetrix Bovine Genome Array" "GPL2112" "bovine" "Affymetrix" "Bos taurus" 24128
"[Canine] Affymetrix Canine Genome 1.0 Array" "GPL39
@seandavi
seandavi / README.md
Last active December 28, 2023 02:46
snpEff on the NIH Biowulf cluster

Usage

To use these scripts:

  • Clone this repository: git clone https://gist.github.com/95a4b2ab3b90f6f0bfd9.git snpEffScript
  • cd snpEffScript
  • make appropriate changes to setup.sh
  • call snpEff.sh like so:
@seandavi
seandavi / file_metadata.json
Last active November 29, 2023 17:29
Proposal for an available-files metadata JSON format for easier and more robust client parsing [note that data are fake]
{
  "accession": "GSE000123",
  "files": [
    {
      "filetype": "Series SOFT file",
      "name": "GSE227465_family.soft.gz",
      "size": 23413,
      "md5sum": "....",
      "created_at": "DATE",
      "updated_at": "DATE"
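
The proposal is aimed at easier client parsing, so a rough Python sketch of how a client might consume a document of this shape is given here. The sample reuses the fake values from the preview; the closing brackets are an assumption because the preview is cut off, and `files_by_type` is a hypothetical helper, not part of any existing client.

```python
import json

# Sample mirroring the proposed fields. Values are fake, as in the proposal,
# and the closing brackets are assumed because the preview is truncated.
SAMPLE = """
{
  "accession": "GSE000123",
  "files": [
    {
      "filetype": "Series SOFT file",
      "name": "GSE227465_family.soft.gz",
      "size": 23413,
      "md5sum": "....",
      "created_at": "DATE",
      "updated_at": "DATE"
    }
  ]
}
"""


def files_by_type(doc, filetype):
    """Return the file entries whose "filetype" matches exactly (hypothetical helper)."""
    return [f for f in doc.get("files", []) if f.get("filetype") == filetype]


doc = json.loads(SAMPLE)
soft_files = files_by_type(doc, "Series SOFT file")
print([f["name"] for f in soft_files])
```

With explicit keys per file, a client can select files by a field lookup rather than by pattern-matching file names.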