
@rtkclouds
Created November 25, 2023 02:38
The analogy between hierarchical clustering using Word2Vec and a heuristic system like the one used in the A* (pronounced "A-star") algorithm rests on the way both approaches use prior knowledge or assumptions to optimize a process. Word2Vec provides a representation of words in a feature space, and that representation acts as the prior knowledge guiding each clustering step below.
import os

# Assuming 'word2vec_function' is the Python equivalent of the original 'word2vec'
# call and 'gpt_encode' encodes a string into tokens, like 'gpt.encode' in the
# JavaScript version.

def word2vec_function(input_path, output_path, options, callback):
    # This function should run word2vec on 'input_path', write the class
    # assignments to 'output_path', and then call 'callback' with the result.
    pass

def gpt_encode(text):
    # Replace this with the actual GPT encoding function.
    return text

def word2cluster(vec, n=0):
    temp = []
    if n == 0:
        # First pass: 'vec' is a list of token lists; flatten it into one
        # sequence of token strings.
        for s in vec:
            temp.extend(str(t) for t in s)
    else:
        # Later passes: 'vec' is a flat sequence of class ids / symbols.
        # Group adjacent elements into pairs and merge each pair into a
        # single "a:b" symbol for the next clustering level.
        vec_chunked = [vec[i:i + 2] for i in range(0, len(vec), 2)]
        vec_filtered = [chunk for chunk in vec_chunked if len(chunk) == 2]
        temp = [':'.join(str(x) for x in pair) for pair in vec_filtered]

    # Write the current sequence to a temporary corpus file.
    with open('/tmp/w2v', 'w') as f:
        f.write(' '.join(temp))

    # Run word2vec on the temporary corpus and continue once it finishes.
    os.makedirs('./data', exist_ok=True)
    word2vec_function('/tmp/w2v', f'./data/class{n}', {'classes': 256},
                      callback=lambda x: process_result(x, n, temp))

def process_result(x, n, temp):
    # Map each symbol to its word2vec class id, keeping the symbol itself
    # when no class was assigned.
    with open(f'./data/class{n}') as f:
        lines = f.read().split('\n')
    y = {line.split(' ')[0]: int(line.split(' ')[1]) for line in lines if line}
    j = [y.get(s, s) for s in temp]

    # Recurse up to 8 levels, building a hierarchy of coarser clusters.
    n += 1
    if n < 8:
        word2cluster(j, n)

# Read the corpus, encode each line, and start the clustering process.
with open('./data/f.txt') as f:
    d = f.read().split('\n')

word2cluster([gpt_encode(line) for line in d])
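
The two stubs above are left unimplemented. Below is a minimal sketch of how they could be filled in, assuming the gensim, scikit-learn, and tiktoken packages (none of which are named in the original gist): gensim trains the word vectors, K-means reproduces the "word class_id" output format that process_result() parses (mirroring the -classes option of the original word2vec tool), and tiktoken stands in for the GPT-style byte-pair encoder.

from gensim.models import Word2Vec
from sklearn.cluster import KMeans
import tiktoken

_enc = tiktoken.get_encoding("gpt2")  # assumption: GPT-2 BPE stands in for gpt.encode

def gpt_encode(text):
    # Return BPE token ids as strings so word2cluster() can join them.
    return [str(t) for t in _enc.encode(text)]

def word2vec_function(input_path, output_path, options, callback):
    # Read the space-separated symbol stream written by word2cluster().
    with open(input_path) as f:
        tokens = f.read().split()

    # Train a small Word2Vec model on the corpus (treated as one long sentence).
    model = Word2Vec([tokens], vector_size=100, min_count=1, workers=1)

    # Cluster the learned vectors into the requested number of classes,
    # approximating word2vec's own '-classes' k-means step.
    n_classes = min(options.get('classes', 256), len(model.wv.index_to_key))
    labels = KMeans(n_clusters=n_classes, n_init=10).fit_predict(model.wv.vectors)

    # Write "symbol class_id" lines, the format process_result() expects.
    with open(output_path, 'w') as f:
        for word, label in zip(model.wv.index_to_key, labels):
            f.write(f'{word} {label}\n')

    callback(output_path)

With these stand-ins the script runs end to end; the callback-style signature is kept only to match the structure of the original JavaScript, and a synchronous return value would work just as well here.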