Sean Davis seandavi

@seandavi
seandavi / main.tf
Last active June 4, 2023 01:19
Terraform for setting up OpenAI APIs on Microsoft Azure
# <seandavi@gmail.com>, 2023-06-02
#
# Terraform for setting up GPT-4, GPT-3.5-turbo, and text-embedding-ada-002
# endpoints. Not all models are available in all regions, so check
# availability before changing the region, which is currently set to
# "southcentralus".
#
# Assumes the Azure CLI (az) is authenticated (requires an Azure
# subscription) and that Terraform is installed.
#
terraform {
---
title: "data-engineering-R"
---
## Background^[From https://www.stitchdata.com/columnardatabase/]
Suppose you're a retailer maintaining a web-based storefront. An ecommerce site generates a lot of data. Consider product purchase transactions:
![Purchase table](https://www.stitchdata.com/static/purchase-table-69d1c4b69867e15fda5daf0005e9b81d.png)
{"title":"Mesothelioma_52413","status":"Public on Dec 23 2022","submission_date":"2022-08-05","last_update_date":"2022-12-23","type":"genomic","anchor":null,"contact":{"city":"Nagoya","name":{"first":"Shinya","middle":"","last":"Toyokuni"},"email":"akatsuka@med.nagoya-u.ac.jp","state":"Aichi","address":"65 Tsuruma-Cho, Showa-Ku","department":"Pathology","country":"Japan","web_link":null,"institute":"Nagoya University","zip_postal_code":null,"phone":null},"description":null,"accession":"GSM6433302","biosample":null,"tag_count":null,"tag_length":null,"platform_id":"GPL10451","hyb_protocol":"The labeled DNA was hybridized with Agilent SurePrint G3 Mouse CGH 4x180k microarray at 67°C for 24 hours according to the manufacturer's protocol (Version 8.0).","channel_count":2,"scan_protocol":"The slides were scanned in an Agilent DNA microarray scanner with SureScan High-Resolution Technology (G2565CA).","data_row_count":174012,"library_source":null,"overall_design":null,"sra_experiment":null,"data_processing":"The sc
## -----------------------------------------------------------------------------
## GEOquery
## -----------------------------------------------------------------------------
library(GEOquery)
gse = getGEO("GSE103512")[[1]]
## -----------------------------------------------------------------------------
library(SummarizedExperiment)
se = as(gse, "SummarizedExperiment")
seandavi / bio361-geoquery-walkthrough.R
Last active April 4, 2022 16:07
GEOquery simple example for four cancer/normal sample sets
## ----message=FALSE,warning=FALSE----------------------------------------------
pkgs = c(
  "ggplot2",
  "GEOquery",
  "SummarizedExperiment"
)
# Installed package names are the row names of this matrix
ins = installed.packages()
for(pkg in pkgs) {
  if(!(pkg %in% rownames(ins)))
    BiocManager::install(pkg)
seandavi / semanticscholar_to_bigquery.sh
Created January 29, 2022 19:08
Load Semantic Scholar JSON into BigQuery
#!/bin/bash
# Requires about 200 GB of disk space.
# Steps:
#   1. download the open-corpus files
#   2. create a disposable bucket
#   3. upload
#   4. bq load
#   5. remove the bucket
mkdir -p ss
cd ss
wget https://s3-us-west-2.amazonaws.com/ai2-s2-research-public/open-corpus/2022-01-01/manifest.txt
seandavi / start_rstudio.sh
Created January 4, 2022 03:30
Start and stop an RStudio-based container on a GCP VM
#!/bin/bash
CONTAINER=ghcr.io/seandavi/buildabiocworkshop
ZONE=us-central1-a
PASSWORD=rstudio
INSTANCE=rs-2
gcloud compute instances create-with-container "$INSTANCE" \
    --container-image "$CONTAINER" \
    --container-env PASSWORD="$PASSWORD" \
    --tags rstudio
seandavi / start_instance.sh
Created December 17, 2021 21:35
A self-deleting GCP instance
#!/bin/bash
gcloud compute instances create myinstance \
    --metadata-from-file=startup-script=startup.sh \
    --scopes=compute-rw
seandavi / download_biosample.R
Created July 22, 2021 16:52
Download all of EBI BioSamples as JSON
start_date = '2000-01-01'
end_date = '2021-12-31'
datefilter = function(date) {
  startdate = format(date,'%Y-%m-%d')
  return(sprintf("dt:release:from=%suntil=%s",startdate,startdate))
}
download_biosample = function(date) {
  require(httr)

output:
  rmarkdown::html_document:
    highlight: pygments
    toc: true
    toc_depth: 3
    fig_width: 5
bibliography: "`r system.file(package='dummychapter1', 'vignettes', 'bibliography.bib')`"
vignette: >
  %\VignetteIndexEntry{dummychapter1}