
Russell Jurney rjurney

@rjurney
rjurney / Dockerfile.cli
Last active July 22, 2024 01:49
Senzing Dockerfile for Python environment setup
# Dockerfile for a Spark environment with Python 3.10. The image is based on the miniconda3 image
# and installs OpenJDK 17, Spark 3.5.1 with Hadoop 3 and Scala 2.13, and Poetry. Poetry then
# installs the Python packages specified in the pyproject.toml file.
FROM continuumio/miniconda3
RUN apt-get update && \
    apt-get install -y curl apt-transport-https openjdk-17-jdk-headless wget build-essential git \
        autoconf automake libtool pkg-config libpq5 libpq-dev && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*
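
As a quick sanity check of the resulting environment, a minimal sketch (not part of the gist), assuming pyspark 3.5.1 is installed via Poetry as the comment describes:

from pyspark.sql import SparkSession

# Smoke test for the container's Spark + Python 3.10 setup.
spark = (
    SparkSession.builder
    .appName("senzing-env-check")
    .master("local[*]")  # run Spark locally inside the container
    .getOrCreate()
)
df = spark.createDataFrame([(1, "KIM SOO IN")], ["id", "name"])
df.show()
print(spark.version)  # expected to be 3.5.1 per the Dockerfile comment
spark.stop()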
@rjurney
rjurney / complete.json
Created July 4, 2024 22:39
6 JSON Lines records in Senzing format
{
  "DATA_SOURCE": "TEST",
  "RECORD_ID": "1",
  "RECORD_TYPE": "PERSON",
  "NAME_LIST": [
    {
      "NAME_TYPE": "PRIMARY",
      "NAME_FULL": "KIM SOO IN"
    }
  ],
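
A minimal sketch (not part of the gist) of reading records like the one above from a JSON Lines file; the file name complete.jsonl is an assumption:

import json

# Print the identifying fields of each Senzing-style record, one JSON object per line.
with open("complete.jsonl") as f:
    for line in f:
        record = json.loads(line)
        names = [n.get("NAME_FULL") or n.get("NAME_ORG") for n in record.get("NAME_LIST", [])]
        print(record["DATA_SOURCE"], record["RECORD_ID"], record.get("RECORD_TYPE"), names)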
@rjurney
rjurney / company.json
Created July 4, 2024 21:44
Example of a valid Senzing record that is an edge with only source metadata. How could I encode a second out-link without copying the source metadata?
{
  "DATA_SOURCE": "TEST",
  "RECORD_ID": "6",
  "RECORD_TYPE": "ORGANIZATION",
  "NAME_LIST": [
    {
      "NAME_TYPE": "PRIMARY",
      "NAME_ORG": "Random Company, LTD."
    }
  ],
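
One possible answer, sketched as an assumption rather than a confirmed Senzing recipe: keep a single record and attach multiple disclosed-relationship pointers to it, so the source metadata is stated once. The REL_POINTER attribute names and the RELATIONSHIP_LIST key should be verified against the Senzing entity specification.

import json

# Hypothetical record with two out-links expressed as relationship pointers.
record = {
    "DATA_SOURCE": "TEST",
    "RECORD_ID": "6",
    "RECORD_TYPE": "ORGANIZATION",
    "NAME_LIST": [{"NAME_TYPE": "PRIMARY", "NAME_ORG": "Random Company, LTD."}],
    "RELATIONSHIP_LIST": [
        {"REL_POINTER_DOMAIN": "TEST", "REL_POINTER_KEY": "1", "REL_POINTER_ROLE": "OWNS"},
        {"REL_POINTER_DOMAIN": "TEST", "REL_POINTER_KEY": "2", "REL_POINTER_ROLE": "OWNS"},
    ],
}
print(json.dumps(record))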
@rjurney
rjurney / download.cmd
Created July 1, 2024 21:12
Download and unzip the International Consortium of Investigative Journalists (ICIJ) knowledge graph dataset
#!/usr/bin/env bash
: '
@echo off
powershell -ExecutionPolicy Bypass -Command "$ErrorActionPreference='Stop'; $ProgressPreference='SilentlyContinue';
$output_file = 'data/full-oldb.LATEST.zip'
$extract_dir = 'data'
Write-Host "`nDownloading the ICIJ Offshore Leaks Database to $output_file`n"
Invoke-WebRequest -Uri 'https://offshoreleaks-data.icij.org/offshoreleaks/csv/full-oldb.LATEST.zip' -OutFile $output_file
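
A cross-platform alternative (not part of the gist), sketched in Python, that downloads and extracts the same archive:

import urllib.request
import zipfile
from pathlib import Path

url = "https://offshoreleaks-data.icij.org/offshoreleaks/csv/full-oldb.LATEST.zip"
output_file = Path("data/full-oldb.LATEST.zip")
extract_dir = Path("data")

extract_dir.mkdir(parents=True, exist_ok=True)
print(f"Downloading the ICIJ Offshore Leaks Database to {output_file}")
urllib.request.urlretrieve(url, output_file)

# Unzip into the data directory.
with zipfile.ZipFile(output_file) as zf:
    zf.extractall(extract_dir)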
@rjurney
rjurney / cosine_sentence_bert.py
Created July 1, 2024 01:03
Cosine similarity adaptation of Sentence-BERT
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer
class CosineSentenceBERT(nn.Module):
    def __init__(self, model_name=SBERT_MODEL, dim=384):
        super().__init__()
        self.model_name = model_name
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
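
The preview cuts off before the forward pass. A hedged sketch of how the cosine head of such a class is commonly completed: mean-pool the token embeddings, then take the cosine similarity of the two pooled sentence vectors. SBERT_MODEL is assumed to be a sentence-transformers checkpoint name defined elsewhere in the gist.

import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

SBERT_MODEL = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"  # assumption

class CosineSentenceBERTSketch(nn.Module):
    def __init__(self, model_name=SBERT_MODEL, dim=384):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name)

    @staticmethod
    def mean_pool(token_embeddings, attention_mask):
        # Average token embeddings, ignoring padding positions.
        mask = attention_mask.unsqueeze(-1).float()
        return (token_embeddings * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

    def encode(self, sentences):
        batch = self.tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
        output = self.model(**batch)
        return self.mean_pool(output.last_hidden_state, batch["attention_mask"])

    def forward(self, sentences_a, sentences_b):
        # Cosine similarity between the two pooled sentence embeddings.
        u, v = self.encode(sentences_a), self.encode(sentences_b)
        return F.cosine_similarity(u, v)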
@rjurney
rjurney / open_sanctions_pairs.sh
Created June 26, 2024 22:52
Script to extract addresses, names and company names from OpenSanctions Pairs
#!/bin/bash
#
# Quickly extract all unique address, person and company name records from pairs.json: https://www.opensanctions.org/docs/pairs/
# Note: non-commercial use only, affordable licenses available at https://www.opensanctions.org/licensing/
#
# Get the data
wget https://data.opensanctions.org/contrib/training/pairs.json -O data/pairs.json
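
A hedged Python sketch (not from the gist) of the extraction step. The field names are assumptions based on the OpenSanctions pairs format (FollowTheMoney entities under "left" and "right", each with "schema" and "properties"); check the format docs at the URL above.

import json

addresses, people, companies = set(), set(), set()

with open("data/pairs.json") as f:
    for line in f:
        pair = json.loads(line)
        for side in ("left", "right"):
            entity = pair.get(side, {})
            schema = entity.get("schema")
            properties = entity.get("properties", {})
            if schema == "Address":
                addresses.update(properties.get("full", []))
            elif schema == "Person":
                people.update(properties.get("name", []))
            elif schema == "Company":
                companies.update(properties.get("name", []))

print(len(addresses), len(people), len(companies))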
@rjurney
rjurney / sentence_bert.py
Created June 24, 2024 21:19
An address matching SentenceBERT class Claude helped me write
class SentenceBERT(torch.nn.Module):
    def __init__(self, model_name=SBERT_MODEL, dim=384):
        super().__init__()
        self.model_name = model_name
        self.tokenizer = AutoTokenizer.from_pretrained("data/fine-tuned-sbert-paraphrase-multilingual-MiniLM-L12-v2-original/checkpoint-2400/")
        self.model = AutoModel.from_pretrained("data/fine-tuned-sbert-paraphrase-multilingual-MiniLM-L12-v2-original/checkpoint-2400/")
        self.ffnn = torch.nn.Linear(dim*3, 1)
        # Freeze the weights of the pre-trained model
        for param in self.model.parameters():
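
The preview stops inside the weight-freezing loop. A hedged sketch of how such an SBERT-style matcher is typically completed: encode both addresses, concatenate [u, v, |u - v|] (hence the dim*3 input), and pass the result through the linear layer to get a match logit. The checkpoint path comes from the gist; the rest is an assumption.

import torch
from transformers import AutoModel, AutoTokenizer

class SentenceBERTSketch(torch.nn.Module):
    def __init__(self, model_name="data/fine-tuned-sbert-paraphrase-multilingual-MiniLM-L12-v2-original/checkpoint-2400/", dim=384):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name)
        self.ffnn = torch.nn.Linear(dim * 3, 1)
        # Freeze the pre-trained encoder; only the classification head trains.
        for param in self.model.parameters():
            param.requires_grad = False

    def encode(self, sentences):
        batch = self.tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
        output = self.model(**batch)
        mask = batch["attention_mask"].unsqueeze(-1).float()
        # Mean-pool the token embeddings, ignoring padding.
        return (output.last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

    def forward(self, address_a, address_b):
        u, v = self.encode(address_a), self.encode(address_b)
        features = torch.cat([u, v, torch.abs(u - v)], dim=1)  # matches Linear(dim * 3, 1)
        return self.ffnn(features).squeeze(-1)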
@rjurney
rjurney / instructions.txt
Last active May 30, 2024 19:58
Address label multiplication data augmentation strategy
System: I need your help with a data science, data augmentation task. I am fine-tuning a sentence transformer paraphrase model to match pairs of addresses. I tried several embedding models and none of them perform well. They need fine-tuning for this task. I have created 27 example pairs of addresses to serve as training data for fine-tuning a SentenceTransformer model. Each record has the fields Address1, Address2, a Description of the semantic they express (ex. 'different street number') and a Label (1.0 for positive match, 0.0 for negative).
The training data covers two categories of corner cases. The first is when similar addresses in string distance aren't the same. The second is the opposite: when dissimilar addresses in string distance are the same. Your task is to read a pair of Addresses, their Description and their Label and generate 100 different examples that express a similar semantic. Your job is to create variations of these records. For some of the records, implement the logic in the Descript
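
Not from the gist: a hedged sketch of how the LLM-augmented pairs could feed SentenceTransformer fine-tuning, assuming they are saved to a CSV named augmented_pairs.csv with Address1, Address2 and Label columns (both names are assumptions).

import csv
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

# Load the augmented address pairs as training examples.
examples = []
with open("augmented_pairs.csv") as f:
    for row in csv.DictReader(f):
        examples.append(
            InputExample(texts=[row["Address1"], row["Address2"]], label=float(row["Label"]))
        )

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
loader = DataLoader(examples, shuffle=True, batch_size=16)
loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)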
@rjurney
rjurney / conversion.py
Created January 1, 2024 06:32
Converting a 5-day drug schedule to a matching weekly drug schedule
import numpy as np
import pk
import seaborn as sns

drug = pk.Drug(hl=8, t_max=1)

# 5 day simulation
conc = drug.concentration(
    60,
    1,
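
Not the pk library: a hedged, plain-numpy sketch of the same idea under my reading of the conversion, comparing a 5-day-on/2-day-off schedule against an every-day schedule carrying the same weekly total, using a one-compartment exponential-decay model with the gist's 8-hour half-life. The dose sizes and schedule interpretation are assumptions.

import numpy as np

half_life = 8.0  # hours, per the gist
k = np.log(2) / half_life  # first-order elimination rate constant
hours = np.arange(0, 7 * 24, 0.5)  # one week in half-hour steps

def concentration(dose_times, dose, t):
    # Superpose a single-dose exponential decay for each administered dose.
    c = np.zeros_like(t)
    for t0 in dose_times:
        c += np.where(t >= t0, dose * np.exp(-k * (t - t0)), 0.0)
    return c

weekday_doses = [d * 24 for d in range(5)]   # one dose each weekday morning
daily_doses = [d * 24 for d in range(7)]     # one dose every day

c5 = concentration(weekday_doses, 60.0, hours)          # 60 taken from the gist's call
c7 = concentration(daily_doses, 60.0 * 5 / 7, hours)    # same weekly total, spread over 7 days
print(c5.mean(), c7.mean())  # weekly average exposure should be comparable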
@rjurney
rjurney / AREADME.md
Last active December 14, 2023 17:42
Excellent name similarity results between sentence encoders 'sentence-transformers/all-MiniLM-L12-v2' and 'paraphrase-multilingual-MiniLM-L12-v2'

All vs Paraphrase MiniLM Model Comparisons

This experiment compares multiple methods of sentence encoding on people's names - including across character sets - using the following models: 'sentence-transformers/all-MiniLM-L12-v2' and 'paraphrase-multilingual-MiniLM-L12-v2'.

Notes

Compared to the names, JSON tends to compress scores together owing to overlapping text in formatting: field names, quotes and brackets. You can see in the name pairs that name length is a source of error. The dates behave well in the JSON records.
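
Not part of the gist: a hedged sketch of the comparison described above, encoding a few name pairs with both models and printing cosine similarities. The example names are placeholders.

from sentence_transformers import SentenceTransformer, util

models = [
    "sentence-transformers/all-MiniLM-L12-v2",
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
]
# Placeholder name pairs, including a cross-character-set pair.
pairs = [
    ("Russell Jurney", "Russell H. Jurney"),
    ("KIM SOO IN", "김수인"),
]

for name in models:
    model = SentenceTransformer(name)
    for a, b in pairs:
        u, v = model.encode([a, b], convert_to_tensor=True)
        print(f"{name}: {a!r} vs {b!r} -> {util.cos_sim(u, v).item():.3f}")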