
Mediation model in GenomicSEM

As part of a genome-wide association study (GWAS) it has become common practice to identify traits genetically correlated with your trait of interest. The goal of these analyses is generally to better understand the etiology of a trait. This is done, for example, in this GWAS of social outcomes by [Hill et al.](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5130721/)

Their paper reports GWAS of social deprivation and of income, and uses LD score regression to correlate these traits with several other traits. High genetic correlations exist between income and educational attainment, and between income and ADHD.

A very obvious follow-up question is whether ADHD affects income directly, or whether the relation between income and ADHD is mediated by education. Perhaps the effect of ADHD on income is entirely attributable to a reduction in educational attainment caused by ADHD. We are going to fit a model to try and answer this question.

**You can use GenomicSEM to answer this question: you can fit a mediation model in which educational attainment mediates the effect of ADHD on income.**
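A minimal sketch of such a model, using lavaan-style syntax passed to GenomicSEM's `usermodel()`: here `LDSCoutput` is assumed to be the output of `ldsc()` run on munged summary statistics for ADHD, educational attainment, and income, and the trait names (`ADHD`, `EA`, `Income`) are placeholders for whatever names your own `ldsc()` call produced.

```r
library(GenomicSEM)

# mediation model: ADHD -> EA -> Income, plus a direct ADHD -> Income path
mediation.model <- "
  EA ~ a*ADHD
  Income ~ b*EA + c*ADHD

  indirect := a*b      # effect of ADHD on income running through education
  total    := c + a*b  # total effect of ADHD on income
"

fit <- usermodel(LDSCoutput, estimation = "DWLS", model = mediation.model)
fit$results
```

If the direct path `c` shrinks towards zero while `indirect` remains substantial, that pattern is consistent with the effect of ADHD on income running largely through educational attainment.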

```r
# I used this online tool to extract the data from the scatter plot:
# https://apps.automeris.io/wpd/
# need Hmisc:
install.packages("Hmisc")
library(Hmisc)
# read the data from my mac into R:
ssgac <- read.csv("ssgac.csv", header = FALSE)
```

Minor update: Genetic correlations and Genomic Control (GC) in GenomicSEM

This document describes a minor update to GenomicSEM that gives the user the option to control how the LD score regression intercept is used to apply genomic control to GenomicSEM GWAS, and provides code to get quick initial genetic correlations, and the standard errors of those genetic correlations, from the `ldsc()` function.
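As a minimal sketch of that second piece, assuming `LDSCoutput` is the list returned by `ldsc()` (with `S` the genetic covariance matrix and `V` its sampling covariance matrix), the genetic correlation and a rough standard error can be pulled out as follows; the rescaling used for the SE is a quick approximation that ignores uncertainty in the heritabilities:

```r
S <- LDSCoutput$S  # genetic covariance matrix from ldsc()
V <- LDSCoutput$V  # sampling covariance matrix of the elements of S

# genetic correlation between the first two traits
rg <- S[1, 2] / sqrt(S[1, 1] * S[2, 2])

# the diagonal of V holds the sampling variances of the elements of S,
# in the order of the lower triangle of S (including the diagonal)
k  <- nrow(S)
SE <- matrix(0, k, k)
SE[lower.tri(SE, diag = TRUE)] <- sqrt(diag(V))

# quick-and-dirty SE of rg: rescale the covariance SE like the covariance itself
rg_SE <- SE[2, 1] / sqrt(S[1, 1] * S[2, 2])
```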

Better documentation and options for Genomic Control.

Behind the scenes, and poorly documented (there were some comments in the code, that's it), GenomicSEM was applying genomic control. The LD score regression intercept provides an expectation for the mean chi-square statistic under the null. As a chi-square distribution with 1 degree of freedom has a mean of 1.0, an LDSC intercept greater than 1.0 can be used as an index of inflation of the test statistic attributable to uncontrolled confounding (Bulik-Sullivan et al. 2015). Specifically, we estimate the univariate LD score intercept and inflate the SE of the estimated SNP-trait covariance by the square root of that intercept.
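As a minimal sketch of what that correction amounts to (the helper name `apply_gc` and the floor at 1 are illustrative assumptions, not GenomicSEM internals):

```r
# hypothetical helper, not a GenomicSEM function: inflating SNP-effect SEs by the
# square root of the univariate LDSC intercept deflates the chi-square statistic
# (beta / SE)^2 by the intercept itself
apply_gc <- function(se, ldsc_intercept) {
  se * sqrt(max(ldsc_intercept, 1))  # only inflate when the intercept exceeds 1
}
```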

```csv
ArticleId,Text,Category
1833,worldcom ex-boss launches defence lawyers defending former worldcom chief bernie ebbers against a battery of fraud charges have called a company whistleblower as their first witness. cynthia cooper worldcom s ex-head of internal accounting alerted directors to irregular accounting practices at the us telecoms giant in 2002. her warnings led to the collapse of the firm following the discovery of an $11bn (£5.7bn) accounting fraud. mr ebbers has pleaded not guilty to charges of fraud and conspiracy. prosecution lawyers have argued that mr ebbers orchestrated a series of accounting tricks at worldcom ordering employees to hide expenses and inflate revenues to meet wall street earnings estimates. but ms cooper who now runs her own consulting business told a jury in new york on wednesday that external auditors arthur andersen had approved worldcom s accounting in early 2001 and 2002. she said andersen had given a green light to the procedures and practices used by worldcom. mr
```
```r
Sys.setenv( # get an API key here: https://platform.openai.com/account/api-keys
  OPENAI_API_KEY = 'YOUR_API_KEY_HERE'
)

### Make a text "database" to search:
library(tm)
library(dplyr)
library(corpus)
library(rjson)
```
```python
import tkinter
import customtkinter
from bs4 import BeautifulSoup

# Langchain loads:
from langchain.document_loaders import DirectoryLoader, PagedPDFSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import ElasticVectorSearch, Pinecone, Weaviate, FAISS, Qdrant
from langchain.chains.qa_with_sources import load_qa_with_sources_chain
```
```python
import torch
from transformers import pipeline, GPTJForCausalLM, AutoTokenizer

# load GPT-J-6B in half precision
model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")

prompt = (
    "Monday Diary Entry: Had a busy weekend and started the day tired and empty, "
    "the good weather helps lift the mood. Worked from home and spent (too much) time "
    "on learning about language models. Had 2 or 3 productive calls, tired and prob "
    "still a bit sick today, which put me in a somewhat somber mood. Had a long bath "
    "which maybe helped?"
)
```
```sh
cat author_manuscript_txt.incr.2022-12-19/*/*.txt > merged-file.txt
```
```python
from datasets import load_dataset

dataset = load_dataset('text', data_files="merged-file.txt")
print(dataset)

# keep only lines longer than 500 characters
dataset2 = dataset.filter(lambda x: len(x["text"]) > 500)
print(dataset2)
```
Example long training data

Speaker 0:

You wrote a piece a follow-up piece to your oral history titled, there is no replacement for black Twitter. I think back in November, What do you think we lose if we lose black Twitter? Tell

Speaker 1:

me not to meet your Mac, but we lose everything. I'm John Favreau. Welcome to offline.

Speaker 0:

```python
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import (
    GPT2Tokenizer, GPT2LMHeadModel, TextDataset,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)

# Set the path to the text file to fine-tune on
path_to_file = "path/to/text/file.txt"

# Load the tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
```