Russell Jurney rjurney

## Dockerfile.cli
# Dockerfile for a Spark environment with Python 3.10. The image is based on the miniconda3 image
# and installs OpenJDK 17, Spark 3.5.1 with Hadoop 3 and Scala 2.13 and Poetry. The image then
# installs the OpenJDK 17 and the Python packages specified in the pyproject.toml file.
FROM continuumio/miniconda3

RUN apt update && \
    apt-get install -y curl apt-transport-https openjdk-17-jdk-headless wget build-essential git \
    autoconf automake libtool pkg-config libpq5 libpq-dev && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

## complete.json
{
    "DATA_SOURCE": "TEST",
    "RECORD_ID": "1",
    "RECORD_TYPE": "PERSON",
    "NAME_LIST": [
        {
            "NAME_TYPE": "PRIMARY",
            "NAME_FULL": "KIM SOO IN"
        }
    ],

## company.json
{
    "DATA_SOURCE": "TEST",
    "RECORD_ID": "6",
    "RECORD_TYPE": "ORGANIZATION",
    "NAME_LIST": [
        {
            "NAME_TYPE": "PRIMARY",
            "NAME_ORG": "Random Company, LTD."
        }
    ],

## download.cmd
#!/usr/bin/env bash
: '
@echo off
powershell -ExecutionPolicy Bypass -Command "$ErrorActionPreference='Stop'; $ProgressPreference='SilentlyContinue';
$output_file = 'data/full-oldb.LATEST.zip'
$extract_dir = 'data'

Write-Host "`nDownloading the ICIJ Offshore Leaks Database to $output_file`n"
Invoke-WebRequest -Uri 'https://offshoreleaks-data.icij.org/offshoreleaks/csv/full-oldb.LATEST.zip' -OutFile $output_file

## cosine_sentence_bert.py
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

class CosineSentenceBERT(nn.Module):
    def __init__(self, model_name=SBERT_MODEL, dim=384):
        super().__init__()
        self.model_name = model_name
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)

## open_sanctions_pairs.sh
#!/bin/bash

#
# Quickly extract all unique address, person and company name records from pairs.json: https://www.opensanctions.org/docs/pairs/
# Note: non-commercial use only, affordable licenses available at https://www.opensanctions.org/licensing/
#

# Get the data
wget https://data.opensanctions.org/contrib/training/pairs.json -O data/pairs.json

## sentence_bert.py
class SentenceBERT(torch.nn.Module):
    def __init__(self, model_name=SBERT_MODEL, dim=384):
        super().__init__()
        self.model_name = model_name
        self.tokenizer = AutoTokenizer.from_pretrained("data/fine-tuned-sbert-paraphrase-multilingual-MiniLM-L12-v2-original/checkpoint-2400/")
        self.model = AutoModel.from_pretrained("data/fine-tuned-sbert-paraphrase-multilingual-MiniLM-L12-v2-original/checkpoint-2400/")
        self.ffnn = torch.nn.Linear(dim*3, 1)

        # Freeze the weights of the pre-trained model
        for param in self.model.parameters():

## instructions.txt
System: I need your help with a data science, data augmentation task. I am fine-tuning a sentence transformer paraphrase model to match pairs of addresses. I tried several embedding models and none of them perform well. They need fine-tuning for this task. I have created 27 example pairs of addresses to serve as training data for fine-tuning a SentenceTransformer model. Each record has the fields Address1, Address2, a Description of the semantic they express (ex. 'different street number') and a Label (1.0 for positive match, 0.0 for negative).

The training data covers two categories of corner cases. The first is when similar addresses in string distance aren't the same. The second is the opposite: when dissimilar addresses in string distance are the same. Your task is to read a pair of Addresses, their Description and their Label and generate 100 different examples that express a similar semantic. Your job is to create variations of these records. For some of the records, implement the logic in the Descript

## conversion.py
import numpy as np
import pk
import seaborn as sns


drug = pk.Drug(hl=8, t_max=1)
# 5 day simulation
conc = drug.concentration(
    60,
    1,

## AREADME.md

      
              4 files
            
          
              0 forks
            
          
              0 comments
            
          
              1 star
            
          
                rjurney
                / AREADME.md
            
            
              Last active
              December 14, 2023 17:42
            
              
                Excellent name similarity results between sentence encoders 'sentence-transformers/all-MiniLM-L12-v2' and 'paraphrase-multilingual-MiniLM-L12-v2'
              
          
    All vs Paraphrase Mini-LM Model Comparisons

This experiment compares multiple methods of sentence encoding on people's names - including across character sets - using the following models:

sentence-transformers/all-MiniLM-L12-v2
sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2

Notes

Compared to the names, JSON tends to compress scores together owing to overlapping text in formatting: field names, quotes and brackets. You can see in the name pairs name length is a source of error. The dates behave well in the JSON records.
	# Dockerfile for a Spark environment with Python 3.10. The image is based on the miniconda3 image
	# and installs OpenJDK 17, Spark 3.5.1 with Hadoop 3 and Scala 2.13 and Poetry. The image then
	# installs the OpenJDK 17 and the Python packages specified in the pyproject.toml file.
	FROM continuumio/miniconda3

	RUN apt update && \
	apt-get install -y curl apt-transport-https openjdk-17-jdk-headless wget build-essential git \
	autoconf automake libtool pkg-config libpq5 libpq-dev && \
	apt-get clean && \
	rm -rf /var/lib/apt/lists/*
	{
	"DATA_SOURCE": "TEST",
	"RECORD_ID": "1",
	"RECORD_TYPE": "PERSON",
	"NAME_LIST": [
	{
	"NAME_TYPE": "PRIMARY",
	"NAME_FULL": "KIM SOO IN"
	}
	],
	{
	"DATA_SOURCE": "TEST",
	"RECORD_ID": "6",
	"RECORD_TYPE": "ORGANIZATION",
	"NAME_LIST": [
	{
	"NAME_TYPE": "PRIMARY",
	"NAME_ORG": "Random Company, LTD."
	}
	],
	#!/usr/bin/env bash
	: '
	@echo off
	powershell -ExecutionPolicy Bypass -Command "$ErrorActionPreference='Stop'; $ProgressPreference='SilentlyContinue';
	$output_file = 'data/full-oldb.LATEST.zip'
	$extract_dir = 'data'

	Write-Host "`nDownloading the ICIJ Offshore Leaks Database to $output_file`n"
	Invoke-WebRequest -Uri 'https://offshoreleaks-data.icij.org/offshoreleaks/csv/full-oldb.LATEST.zip' -OutFile $output_file
	import torch
	import torch.nn as nn
	import torch.nn.functional as F
	from transformers import AutoModel, AutoTokenizer

	class CosineSentenceBERT(nn.Module):
	def __init__(self, model_name=SBERT_MODEL, dim=384):
	super().__init__()
	self.model_name = model_name
	self.tokenizer = AutoTokenizer.from_pretrained(model_name)
	#!/bin/bash

	#
	# Quickly extract all unique address, person and company name records from pairs.json: https://www.opensanctions.org/docs/pairs/
	# Note: non-commercial use only, affordable licenses available at https://www.opensanctions.org/licensing/
	#

	# Get the data
	wget https://data.opensanctions.org/contrib/training/pairs.json -O data/pairs.json
	class SentenceBERT(torch.nn.Module):
	def __init__(self, model_name=SBERT_MODEL, dim=384):
	super().__init__()
	self.model_name = model_name
	self.tokenizer = AutoTokenizer.from_pretrained("data/fine-tuned-sbert-paraphrase-multilingual-MiniLM-L12-v2-original/checkpoint-2400/")
	self.model = AutoModel.from_pretrained("data/fine-tuned-sbert-paraphrase-multilingual-MiniLM-L12-v2-original/checkpoint-2400/")
	self.ffnn = torch.nn.Linear(dim*3, 1)

	# Freeze the weights of the pre-trained model
	for param in self.model.parameters():
	System: I need your help with a data science, data augmentation task. I am fine-tuning a sentence transformer paraphrase model to match pairs of addresses. I tried several embedding models and none of them perform well. They need fine-tuning for this task. I have created 27 example pairs of addresses to serve as training data for fine-tuning a SentenceTransformer model. Each record has the fields Address1, Address2, a Description of the semantic they express (ex. 'different street number') and a Label (1.0 for positive match, 0.0 for negative).

	The training data covers two categories of corner cases. The first is when similar addresses in string distance aren't the same. The second is the opposite: when dissimilar addresses in string distance are the same. Your task is to read a pair of Addresses, their Description and their Label and generate 100 different examples that express a similar semantic. Your job is to create variations of these records. For some of the records, implement the logic in the Descript
	import numpy as np
	import pk
	import seaborn as sns


	drug = pk.Drug(hl=8, t_max=1)
	# 5 day simulation
	conc = drug.concentration(
	60,
	1,