
Russell Jurney (rjurney)

@rjurney
rjurney / instructions.txt
Last active May 30, 2024 19:58
Address label multiplication data augmentation strategy
System: I need your help with a data science, data augmentation task. I am fine-tuning a sentence transformer paraphrase model to match pairs of addresses. I tried several embedding models and none of them performed well; they need fine-tuning for this task. I have created 27 example pairs of addresses to serve as training data for fine-tuning a SentenceTransformer model. Each record has the fields Address1, Address2, a Description of the semantic they express (e.g. 'different street number') and a Label (1.0 for a positive match, 0.0 for a negative).
The training data covers two categories of corner cases. The first is when addresses that are similar in string distance are not the same. The second is the opposite: when addresses that are dissimilar in string distance are the same. Your task is to read a pair of Addresses, their Description and their Label and generate 100 different examples that express a similar semantic. Your job is to create variations of these records. For some of the records, implement the logic in the Description
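As a sketch of the multiplication idea outside the LLM prompt, rule-based variants can also be generated in plain Python. The substitution rules and helper names below are hypothetical illustrations, not part of the gist:

```python
import random

# Hypothetical substitution rules for generating positive (Label 1.0) variants
SUBSTITUTIONS = [
    ("Street", "St"),
    ("Avenue", "Ave"),
    ("Road", "Rd"),
    ("Suite", "Ste"),
]

def positive_variant(address: str) -> str:
    """Same address, different surface form: abbreviate one component."""
    for long_form, short_form in SUBSTITUTIONS:
        if long_form in address:
            return address.replace(long_form, short_form)
    return address

def negative_variant(address: str, rng: random.Random) -> str:
    """Different address, similar string: perturb the street number."""
    tokens = address.split()
    for i, tok in enumerate(tokens):
        if tok.isdigit():
            tokens[i] = str(int(tok) + rng.randint(1, 9))
            break
    return " ".join(tokens)

seed = "123 Main Street"
pairs = [
    (seed, positive_variant(seed), 1.0),  # same place, abbreviated
    (seed, negative_variant(seed, random.Random(42)), 0.0),  # near-identical string, different place
]
```

Each seed record can be run through many such rules to multiply 27 hand-written pairs into hundreds of training examples.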
@rjurney
rjurney / conversion.py
Created January 1, 2024 06:32
Converting a 5-day drug schedule to a matching weekly drug schedule
import numpy as np
import pk
import seaborn as sns

drug = pk.Drug(hl=8, t_max=1)  # hl: half-life in hours; t_max: time of peak concentration

# 5 day simulation
conc = drug.concentration(
    60,
    1,
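The pk call above is cut off by the gist preview. For the underlying model, here is a dependency-free sketch of repeated dosing with first-order elimination, assuming a one-compartment model with instantaneous absorption (an assumption of this illustration, not necessarily what pk computes):

```python
import numpy as np

def concentration(hours, dose, interval, half_life):
    """Superpose first-order exponential decay from each scheduled dose."""
    t = np.arange(hours, dtype=float)  # hourly time grid
    k = np.log(2) / half_life          # elimination rate constant
    conc = np.zeros_like(t)
    for dose_time in np.arange(0.0, hours, interval):
        elapsed = t - dose_time
        # Each dose contributes only after it is taken
        conc += np.where(elapsed >= 0, dose * np.exp(-k * elapsed), 0.0)
    return conc

# 60 units once a day for five days, 8-hour half-life (illustrative numbers)
five_day = concentration(hours=120, dose=60, interval=24, half_life=8)
```

Converting a 5-day schedule to a weekly one then amounts to re-running the superposition with a 7-day dosing pattern and comparing the curves.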
@rjurney
rjurney / AREADME.md
Last active December 14, 2023 17:42
Excellent name similarity results between sentence encoders 'sentence-transformers/all-MiniLM-L12-v2' and 'paraphrase-multilingual-MiniLM-L12-v2'

All vs Paraphrase Mini-LM Model Comparisons

This experiment compares multiple methods of sentence encoding on people's names - including across character sets - using the following models:

Notes

Compared to bare names, JSON records tend to compress similarity scores together owing to the overlapping formatting text: field names, quotes, and brackets. In the name pairs you can see that name length is a source of error. Dates behave well inside the JSON records.
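To see why shared formatting compresses scores, here is a toy illustration using character-trigram Jaccard overlap in place of sentence embeddings (a stand-in metric, not the models compared above; the names and record layout are made up):

```python
def trigrams(s):
    """Set of overlapping character trigrams of a string."""
    return {s[i:i + 3] for i in range(len(s) - 2)}

def jaccard(a, b):
    """Jaccard similarity of two strings' trigram sets."""
    A, B = trigrams(a), trigrams(b)
    return len(A & B) / len(A | B)

name_a, name_b = "Russell Jurney", "Randall Turner"
json_a = f'{{"name": "{name_a}", "date": "2023-12-14"}}'
json_b = f'{{"name": "{name_b}", "date": "2023-12-14"}}'

bare = jaccard(name_a, name_b)     # the names alone
wrapped = jaccard(json_a, json_b)  # same names inside identical JSON scaffolding
```

The wrapped score exceeds the bare score because the field names, quotes, and brackets are identical in both records, so every pair gains the same shared material.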

import gspread
from gspread_dataframe import set_with_dataframe
import pandas as pd

# Assume df_users and df_companies are your DataFrames
df_users = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Profile': ['alice123', 'bob456']})
df_companies = pd.DataFrame({'Company': ['TechCorp', 'BizInc'], 'Industry': ['Tech', 'Finance']})

# Step 1: Authenticate to Google Sheets API
# (You'll need to follow the gspread authentication steps, which involve creating a
# service account and obtaining a JSON credentials file)
gc = gspread.service_account(filename='credentials.json')  # path is a placeholder
@rjurney
rjurney / make_graphframes_nodes.py
Created October 9, 2023 10:46
GraphFrames scales very well; however, it requires that all nodes and edges share a single pyspark.sql.DataFrame schema :(
from pyspark.sql import functions as F
from pyspark.sql.types import StructField, IntegerType, LongType, StringType, TimestampType

def add_missing_columns(df, all_columns):
    """Add any missing columns from any DataFrame among several we want to merge."""
    for col_name, schema_field in all_columns:
        if col_name not in df.columns:
            df = df.withColumn(col_name, F.lit(None).cast(schema_field.dataType))
    return df
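The same column-alignment idea, shown with pandas so the illustration is self-contained and runnable without a Spark session (the frame contents here are hypothetical):

```python
import pandas as pd

def align_columns(df, all_columns):
    """Pandas analogue of the PySpark helper: add missing columns as nulls
    so several frames can share one schema before concatenation."""
    for col_name in all_columns:
        if col_name not in df.columns:
            df[col_name] = pd.NA
    return df[list(all_columns)]  # enforce one shared column order

papers = pd.DataFrame({"id": [1], "title": ["GraphFrames"]})
people = pd.DataFrame({"id": [2], "name": ["Russell"]})
shared = ["id", "title", "name"]

# Every node frame now has the same schema, as GraphFrames requires of its input
nodes = pd.concat(
    [align_columns(papers, shared), align_columns(people, shared)],
    ignore_index=True,
)
```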
@rjurney
rjurney / docker-compose.yml
Last active December 31, 2023 16:56
Still trying to do RAG Q&A on all my academic papers... Chroma couldn't ingest 900 PDFs. I bet OpenSearch can...
version: "3.8"
services:
  opensearch-node1: # This is also the hostname of the container within the Docker network (i.e. https://opensearch-node1/)
    image: opensearchproject/opensearch:latest # Specifying the latest available image - modify if you want a specific version
    container_name: opensearch-node1
    environment:
      - cluster.name=opensearch-cluster # Name the cluster
      - node.name=opensearch-node1 # Name the node that will run in this container
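The preview stops partway through the environment settings. A single-node development setup typically continues along these lines (the values below are illustrative, not the gist's originals):

```yaml
      - discovery.type=single-node # Single node: skip cluster discovery
      - bootstrap.memory_lock=true # Lock the JVM heap in memory; avoid swapping
      - "OPENSEARCH_JAVA_OPTS=-Xms512m -Xmx512m" # JVM heap size
    ulimits:
      memlock:
        soft: -1
        hard: -1
    ports:
      - "9200:9200" # REST API, used by the ingestion client
```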
@rjurney
rjurney / ChatGPT-4-prompt.md
Created October 3, 2023 18:14
Seeking feedback on my ChatGPT prompting. What can I do to improve this result?

I have run the following code: unsupervised UMAP for dimension reduction, then DBSCAN for clustering, to group differing surface forms of the same academic journal names into one cluster per journal.

The UMAP code is:

import umap

# Step 2: Dimension Reduction with UMAP
reducer = umap.UMAP()
reduced_embeddings = reducer.fit_transform(scaled_embeddings)
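A self-contained sketch of the reduce-then-cluster pipeline, with scikit-learn's PCA standing in for umap.UMAP() so the example runs without the umap-learn dependency, and synthetic vectors standing in for the journal-name embeddings:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA  # stand-in for umap.UMAP() in this sketch

rng = np.random.default_rng(0)
# Synthetic "embeddings": two tight groups standing in for two journals' name vectors
scaled_embeddings = np.vstack([
    rng.normal(0.0, 0.05, size=(20, 16)),
    rng.normal(1.0, 0.05, size=(20, 16)),
])

# Step 2: Dimension Reduction (PCA here; UMAP in the original)
reducer = PCA(n_components=2)
reduced_embeddings = reducer.fit_transform(scaled_embeddings)

# Step 3: Density-based clustering; each cluster is one journal
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(reduced_embeddings)
```

With well-separated groups, DBSCAN recovers one label per journal without being told the number of clusters, which is the property that makes it a fit for an unknown number of journals.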
@rjurney
rjurney / keyboardshortcuts.json
Last active October 2, 2023 23:27
VSCode Keyboard Shortcuts: How do you focus the 2nd-8th editor tab in the first group?
// Place your key bindings in this file to override the defaults
[
  {
    "key": "cmd+l",
    "command": "workbench.action.gotoLine"
  },
  {
    "key": "ctrl+g",
    "command": "-workbench.action.gotoLine"
  },
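For the question in the title: VS Code ships `workbench.action.openEditorAtIndexN` commands for exactly this. Bindings along these lines focus the Nth tab of the active group (the key choices here are illustrative):

```json
{
  "key": "cmd+2",
  "command": "workbench.action.openEditorAtIndex2"
},
{
  "key": "cmd+8",
  "command": "workbench.action.openEditorAtIndex8"
}
```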
@rjurney
rjurney / cluster_to_label.py
Created October 1, 2023 03:39
Code that clusters the dirty journal name property of an arXiv citation graph to create clean journal names as labels for classification
#
# Create a pd.DataFrame of the nodes for analysis in a notebook
#

# Extract nodes and their attributes into a list of dictionaries
node_data = [{"node": node, **attr} for node, attr in G.nodes(data=True)]

# Convert the list of dictionaries into a DataFrame
node_df = pd.DataFrame(node_data)
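A minimal run of the same extraction, with a hand-built stand-in for G.nodes(data=True) so the snippet is self-contained (the node IDs and journal strings are made up):

```python
import pandas as pd

# Stand-in for G.nodes(data=True): (node_id, attribute-dict) pairs
nodes_with_attrs = [
    ("2301.00001", {"journal": "J. Mach. Learn. Res."}),
    ("2301.00002", {"journal": "JMLR"}),
]

# Merge each node ID with its attribute dict into one flat record
node_data = [{"node": node, **attr} for node, attr in nodes_with_attrs]
node_df = pd.DataFrame(node_data)
```

The two dirty journal strings above are exactly the kind of pair the clustering step then maps to one clean label.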
@rjurney
rjurney / networkx_matches_dgl.py
Created September 29, 2023 03:30
I made a networkx network and a DGL network. They match :) I am pleased.
# The DGL network has sentence encoded JSON for node features.
In [26]: g
Out[26]:
[Graph(num_nodes=27770, num_edges=352807,
ndata_schemes={'x': Scheme(shape=(384,), dtype=torch.float64)}
edata_schemes={})]
# The networkx network parsed the records and has individual fields for analysis
In [27]: G
Out[27]: <networkx.classes.digraph.DiGraph at 0x14fe5c4c0>