From word2vec to doc2vec --- similarity driven CRP by Yingjie Miao
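import numpy as np

# assumed helper: cosine of the angle between two non-zero vectors
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# vecs: an array of real vectors
def crp(vecs):
    clusterVec = []  # tracks sum of vectors in a cluster
    clusterIdx = []  # array of index arrays. e.g. [[1, 3, 5], [2, 4, 6]]
    ncluster = 0
    # probability to create a new table if new customer
    # is not strongly "similar" to any existing table
    pnew = 1.0 / (1 + ncluster)
    N = len(vecs)
    rands = np.random.rand(N)  # N random variables sampled from U(0, 1)
    for i in range(N):
        maxSim = -np.inf
        maxIdx = 0
        v = np.asarray(vecs[i], dtype=float)
        # find the existing cluster most similar to v
        for j in range(ncluster):
            sim = cosine_similarity(v, clusterVec[j])
            if sim > maxSim:
                maxIdx = j
                maxSim = sim
        # checked once per customer, after every cluster has been compared
        if maxSim < pnew:
            if rands[i] < pnew:
                # open a new table for this customer
                clusterVec.append(v)
                clusterIdx.append([i])
                ncluster += 1
                pnew = 1.0 / (1 + ncluster)
                continue
        # otherwise join the most similar existing table
        clusterVec[maxIdx] = clusterVec[maxIdx] + v
        clusterIdx[maxIdx].append(i)
    return clusterIdx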
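The inner loop over clusters is meant to pick the table whose summed vector has the highest cosine similarity to the new customer. As a rough illustration of that single step, here is a minimal vectorized sketch. The helper name `most_similar_cluster` and the toy vectors are hypothetical and not part of the gist; it assumes NumPy and non-zero vectors.

```python
import numpy as np

def most_similar_cluster(v, cluster_sums):
    """Return (index, cosine similarity) of the cluster sum closest to v.

    Hypothetical helper for illustration only. Returns (None, -inf)
    when no clusters exist yet. Assumes all vectors are non-zero.
    """
    if len(cluster_sums) == 0:
        return None, -np.inf
    sums = np.asarray(cluster_sums, dtype=float)
    v = np.asarray(v, dtype=float)
    # cosine similarity of v against every cluster sum at once
    sims = sums @ v / (np.linalg.norm(sums, axis=1) * np.linalg.norm(v))
    best = int(np.argmax(sims))
    return best, float(sims[best])

# toy data: two existing tables and one new customer
cluster_sums = [[1.0, 0.0], [0.0, 2.0]]
idx, sim = most_similar_cluster([0.1, 1.0], cluster_sums)
print(idx, sim)
```

Because the maximum is taken over all clusters before anything else happens, the comparison against `pnew` only makes sense after this step has finished, once per customer.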
@Ramlinbird

Shouldn't the code below be put outside the "for j in ..." loop?

if maxSim < pnew:
....
continue
