"Clustering via hypergraph modularity" - companion source code and files

Below you will find the companion files for the paper "Clustering via hypergraph modularity", submitted to the PLOS ONE journal by Bogumił Kamiński, Valérie Poulin, Paweł Prałat, Przemysław Szufel and François Théberge.

In the paper, a hypergraph modularity function is proposed that generalizes its well-established and widely used graph counterpart, a measure of how clustered a network is.
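For reference, the strict hypergraph modularity computed by the supplied code (see EdgeContribution and DegreeTax in strictModularity.py below) can be written, for a partition A = {A_1, ..., A_k} of the vertex set, as

	q_H(A) = |{e in E : e ⊆ A_i for some i}| / |E|  −  Σ_d (|E_d| / |E|) Σ_i (vol(A_i) / vol(V))^d

where E_d denotes the hyperedges of size d and vol(A_i) is the sum of the degrees of the vertices of A_i: the first term rewards hyperedges fully contained in a single part, and the second is the expected value of that reward under a degree-based null model. (This formula is reconstructed from the supplied code; see the paper for the formal definition.)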

The provided code files contain an implementation of the heuristic algorithm presented in the paper, along with two numerical examples: a test on synthetic hypergraphs and a test on a small DBLP dataset.

The code that was used for the experiments is accompanied by the data files required to run our examples. To ensure full replicability of our research, a fixed random number seed is used in the source code.

Installation and configuration

To run the examples we recommend Python 3, which can be downloaded via the Anaconda distribution. Once Anaconda is installed, you need to install the igraph package; simply execute the bash command below:

conda install -c vtraag python-igraph
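To verify that the installation succeeded, a quick import check (our addition, not part of the original instructions) can be run from Python:

	import igraph as ig
	print(ig.__version__)  # should print the installed python-igraph version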

Test on synthetic hypergraphs

In this numerical example, we generate hyperedges following the process described in M. Leordeanu et al. (2012).

We consider 150 points in the square [-0.5, 0.5] × [-0.5, 0.5] such that:

  • Nodes 0-29: noisy points on a line A with slope -1
  • Nodes 30-59: noisy points on a line B with slope 0.02
  • Nodes 60-89: noisy points on a line C with slope 0.8
  • Nodes 90-149: random points

We sample points in sets of 3 and 4, respectively, keeping groups of points that are well aligned (w.r.t. the residuals of the numpy.polyfit() function).
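This generation step can be sketched as follows. This is an illustrative reconstruction, not the original script: the point cloud, the residual threshold, and the number of candidate draws are all assumptions.

	import numpy as np

	rng = np.random.default_rng(0)
	pts = rng.uniform(-0.5, 0.5, size=(150, 2))    # placeholder point cloud (assumption)

	def residual(subset):
	    ## numpy.polyfit returns the residual sum of squares when full=True
	    x, y = pts[subset, 0], pts[subset, 1]
	    coeffs, res, *rest = np.polyfit(x, y, 1, full=True)
	    return res[0] if len(res) else 0.0

	edges3 = []
	for trial in range(200000):                    # number of candidate draws (assumption)
	    cand = rng.choice(len(pts), size=3, replace=False)
	    if residual(cand) < 1e-3:                  # keep well-aligned triples (threshold assumed)
	        edges3.append(sorted(int(i) for i in cand))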

The hyperedges are listed in the supplied files:

  • hypergraph_3uniform.csv: 159,816 3-edges
  • hypergraph_4uniform.csv: 160,000 4-edges
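Each row of these CSV files lists the vertex ids of one hyperedge, comma-separated, as can be seen from the loading code in hypergraph.py:

	import numpy as np
	x3 = np.loadtxt('hypergraph_3uniform.csv', delimiter=',', dtype='int')  # shape (159816, 3)
	print(x3[0])  # vertex ids of the first 3-edge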

To generate our plots, we randomly sampled from those sets under three different regimes:

  • the same proportion of 3-edges and 4-edges,
  • 2/3 3-edges, or
  • 2/3 4-edges,

and we also sample so that we expect twice as many edges coming entirely from the same line (A, B or C) as sets with mixed lines and/or random points.

To generate the plots, just run:

	python hypergraph.py

This requires the functions supplied in the strictModularity.py file. Hyperedges are read from the two CSV files. Two plots are produced, lines_qG.eps and lines_qH.eps; they constitute Figure 1 in our paper.
To ensure replicability of our research we use a fixed random number seed. You are free to change it to see that the results vary slightly with each sampling, while the conclusions remain stable.

Implementation and test of the CM-like algorithm

The Julia code below works with the hypergraph file generated in the previous step. It loads the hypergraph with SimpleHypergraphs.jl and repeatedly merges communities along randomly chosen hyperedges, keeping each merge only when it improves modularity.

using SimpleHypergraphs, StatsBase

## read a hypergraph from a tab-separated file:
## the first line holds the vertex metadata, each following line one hyperedge
function do_file(name::String)
    f = open(name)
    line = readline(f)
    h = Hypergraph{Bool,Int}(0,0)
    for v_meta in parse.(Int, split(line, "\t"))
        add_vertex!(h, vertex_meta=v_meta)
    end
    for line in eachline(f)
        x = parse.(Int, split(line, "\t"))
        inds = x .+ 1          ## the file uses 0-based vertex ids
        add_hyperedge!(h; vertices=Dict(inds .=> true))
    end
    close(f)
    h
end

## index of the first community that contains any of the given vertices
function find_first(c::Array{Set{Int}}, vals)
    for i in 1:length(c)
        for v in vals
            v in c[i] && return i
        end
    end
end

## CM-like heuristic: repeatedly pick a random hyperedge, merge the
## communities it touches, and keep the merge only if modularity improves
function find_comms(h::Hypergraph, nreps::Int=1000; ha = SimpleHypergraphs.HypergraphAggs(h))
    best_modularity = 0
    comms = [Set(i) for i in 1:nhv(h)]
    mod_history = Vector{Float64}(undef, nreps)
    for rep in 1:nreps
        he = rand(1:nhe(h))
        vers = collect(keys(getvertices(h, he)))
        c = deepcopy(comms)
        i0 = find_first(c, vers)
        max_i = length(c)
        i_cur = i0
        ## merge every later community intersecting the hyperedge into c[i0]
        while i_cur < max_i
            i_cur += 1
            if length(intersect(c[i_cur], vers)) > 0
                union!(c[i0], c[i_cur])
                c[i_cur] = c[max_i]   ## swap-and-shrink removal
                max_i -= 1
            end
        end
        resize!(c, max_i)
        m = modularity(h, c, ha)
        if m > best_modularity    ## accept the merge only if modularity improves
            best_modularity = m
            comms = c
        end
        mod_history[rep] = best_modularity
    end
    return (bm=best_modularity, bp=comms, mod_history=mod_history)
end

Now let us use the code to perform the experiments:

const h = do_file("h2_100");
const ha = SimpleHypergraphs.HypergraphAggs(h)
res = Matrix{Float64}(undef, 500, 0)
for i in 1:1920  # in the production code we use distributed computing here
    res = hcat(res, find_comms(h, 500, ha=ha).mod_history)
end
using DelimitedFiles
writedlm("res.txt", res)

Once the above code has been run, we used the resulting data to create Figure 2.
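As an illustration only (this plotting step is not part of the supplied sources; the file name matches the writedlm call above), the saved modularity histories could be inspected from Python along these lines:

	import numpy as np
	import matplotlib.pyplot as plt

	res = np.loadtxt('res.txt')        # 500 x 1920 matrix written by writedlm above
	plt.plot(res.mean(axis=1))         # average best-modularity-so-far trajectory
	plt.xlabel('iteration')
	plt.ylabel('best modularity so far')
	plt.show()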

Test on a small DBLP dataset

The DBLP computer science bibliography database contains open bibliographic information on major computer science journals and proceedings. The DBLP database is operated jointly by the University of Trier and Schloss Dagstuhl. We consider a hypergraph built from this bibliographic data, where each vertex represents an author and each hyperedge corresponds to a paper (its set of authors).

The data files for this experiment are available here and contain preprocessed DBLP data. The file dblp_edges contains the hyperedges, one per line, in the form {2, 3, 4, 5}; the numbers are unique author identifiers. For each author, we have identified a field of academic interest; these are listed in the file dblp_authors-fields.
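A minimal sketch of parsing such lines into Python sets of author ids (the function name is ours; the exact file layout beyond the shown example is assumed):

	def read_edges(path):
	    edges = []
	    with open(path) as f:
	        for line in f:
	            inner = line.strip().strip('{}')          # drop the surrounding braces
	            if inner:
	                edges.append({int(tok) for tok in inner.split(',')})
	    return edges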

As reported in the paper, we pruned all edges containing at least one author with an unknown field of interest, as well as all edges of size 1; we also randomly selected 1/3 of the edges and kept only the (unique) large connected component.
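A minimal sketch of this pruning pipeline under stated assumptions (the helper names, the field dictionary, and the use of a union-find to extract the large component are ours; the original preprocessing script is not included here):

	import random

	def largest_component(edges):
	    ## union-find over vertices; returns the edges of the largest component
	    parent = {}
	    def find(u):
	        parent.setdefault(u, u)
	        while parent[u] != u:
	            parent[u] = parent[parent[u]]   # path halving
	            u = parent[u]
	        return u
	    for e in edges:
	        vs = list(e)
	        for v in vs[1:]:
	            parent[find(vs[0])] = find(v)
	    comps = {}
	    for v in parent:
	        comps.setdefault(find(v), set()).add(v)
	    giant = max(comps.values(), key=len)
	    return [e for e in edges if e <= giant]

	def prune(edges, field):   # field: dict mapping author id -> field of interest (or None)
	    keep = [e for e in edges
	            if len(e) > 1 and all(field.get(a) is not None for a in e)]
	    keep = [e for e in keep if random.random() < 1/3]   # keep roughly 1/3 of the edges
	    return largest_component(keep)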

We ended up with 1637 nodes and 865 edges of size 2, 470 of size 3, 152 of size 4, and 37 of size 5 to 7.

Those are included in the file dblp.pkl along with author fields and the partitions we obtained.

To re-create the results from the paper, run

	python dblp.py

By default, the partitions are read from the dblp.pkl file; uncomment the lines in the Python code to re-run the Louvain and hypergraph-CNM algorithms (the latter is slow).

This code also uses the basic functions loaded from the strictModularity.py file. Output is sent to stdout.

Licensing

This file and all supplied datasets are licensed under the terms of the Creative Commons Attribution 4.0 License (CC BY 4.0).

All source code in this archive is licensed under the terms of the MIT License.

dblp.py

# coding: utf-8

# # Strict and 2-section Hypergraph Clustering

# In[1]:
import pandas as pd
import numpy as np
import igraph as ig
from sklearn.metrics import adjusted_rand_score as ARI
from strictModularity import *

# In[2]:
## proportion of hyperedges cut w.r.t. partition P
def hcut(H,P):
    s = 0
    l = 0
    for i in range(len(H)):
        l = l + len(H[i])
        for j in range(len(P)):
            s = s + sum([x < set(P[j]) for x in H[i]])
    return (l-s)/l

# In[3]:
import pickle
H, truth, partitions = pickle.load(open('dblp.pkl','rb'))
PL = partitions[0] ## 2-section Louvain
PC = partitions[1] ## hypergraph-CNM

# In[4]:
## make igraph object
n, m = H_size(H)
e, w = TwoSecEdges(H,m)
g = ig.Graph()
g.add_vertices(n)
g.add_edges([tuple(x) for x in e])
g.es["weight"] = w
g = g.simplify(combine_edges=sum)

# In[5]:
## uncomment to re-compute the Louvain partition
#ml = g.community_multilevel(weights="weight")
#PL = [x for x in ml]

# In[6]:
## Table 1 with Louvain
print('Table 1 with Louvain')
print('H-modularity ',modularityH(H,PL))
print('G-modularity ',modularityG(H,PL))
print('hcut ',hcut(H,PL))
print('number of parts ',len(PL))

# In[7]:
## uncomment to re-compute the hyper-CNM partition -- NB: this is slow!
#qC, PC = cnmAlgo(H, verbose=True)

# In[8]:
## Table 1 with hypergraph-CNM
print('Table 1 with hypergraph-CNM')
print('H-modularity ',modularityH(H,PC))
print('G-modularity ',modularityG(H,PC))
print('hcut ',hcut(H,PC))
print('number of parts ',len(PC))

# In[9]:
## map each vertex to the 0-based index of its part in P
def member(P):
    s = sum([len(x) for x in P])
    M = [-1]*s
    for i in range(len(P)):
        for j in list(P[i]):
            M[j] = i
    return M

## for every edge of g: True iff both endpoints fall in the same community
def edgeLabels(g, gcomm):
    x = [(gcomm[x.tuple[0]]==gcomm[x.tuple[1]]) for x in g.es]
    return x

## adjusted graph-aware Rand index between membership vectors u and v
def AGRI(g, u, v):
    bu = edgeLabels(g, u)
    bv = edgeLabels(g, v)
    su = sum(bu)
    sv = sum(bv)
    suv = sum(np.array(bu)*np.array(bv))
    m = len(bu)
    return((suv - su*sv/m) / (0.5*(su+sv) - su*sv/m))

## cluster similarity
print('Partition similarity measures')
print('ARI ',ARI(member(PL),member(PC)))
print('AGRI ',AGRI(g,member(PL),member(PC)))

# In[10]:
## proportion of hyperedges cut, broken down by edge size
def hcutPlus(H,P):
    hc = [-1]*len(H)
    S = 0
    L = 0
    for i in range(len(H)):
        l = len(H[i])
        L = L + l
        s = 0
        for j in range(len(P)):
            s = s + sum([x < set(P[j]) for x in H[i]])
        S = S + s
        if l>0:
            hc[i] = (l-s)/l
    return hc

# In[11]:
## Louvain partition
hc = hcutPlus(H,PL)
print('With Louvain partition:')
for i in [2,3,4]:
    print('prop. of edges of size',i,'cut:',hc[i])

# In[12]:
## hyper-CNM partition
hc = hcutPlus(H,PC)
print('With hyper-CNM partition:')
for i in [2,3,4]:
    print('prop. of edges of size',i,'cut:',hc[i])
hypergraph.py

# coding: utf-8

# # Strict and 2-section Hypergraph Clustering

# In[1]:
import pandas as pd
import numpy as np
import copy
import matplotlib.pyplot as plt
from sklearn.metrics import adjusted_rand_score as Rand
import igraph as ig
from scipy import stats
## All hypergraph functions are here
from strictModularity import *
np.random.seed(0)
import random
random.seed(0)

# In[2]:
## proportion of hyperedges cut w.r.t. partition P
def hcut(H,P):
    s = 0
    l = 0
    for i in range(len(H)):
        l = l + len(H[i])
        for j in range(len(P)):
            s = s + sum([x < set(P[j]) for x in H[i]])
    return (l-s)/l

# In[3]:
## number of points of each type (lines A, B, C and outliers)
pointsPerLine = 30
numOutliers = 60
REP = 100
nl = [0]*pointsPerLine + [1]*pointsPerLine + [2]*pointsPerLine + [3]*numOutliers

## is this edge from the same line?
def lineEdge(e):
    s = set([nl[i] for i in e])
    return (len(s)==1 and 3 not in s)

L = []
mu = 0.33 ## proportion of noisy edges to keep
for z in [0.5,2,1]: ## ratios of 3-edges and 4-edges
    fn = './hypergraph_3uniform.csv'
    x3 = np.loadtxt(fn, delimiter=',', dtype='int')
    fn = './hypergraph_4uniform.csv'
    x4 = np.loadtxt(fn, delimiter=',', dtype='int')
    ## Downsample
    p3 = .1 * z
    base3 = [set(i) for i in x3 if np.random.random()<p3]
    p4 = .035 / z
    base4 = [set(i) for i in x4 if np.random.random()<p4]
    base = base3+base4
    for rep in range(REP):
        ## sample down False cases to get mu
        x = [lineEdge(i) for i in base]
        t = x.count(True)
        f = x.count(False)
        p = mu*t/((1-mu)*f)
        h = [base[i] for i in range(len(base)) if x[i] or np.random.sample()<p]
        H = list2H(h)
        ## 2-section edges and weights
        n, m = H_size(H)
        e, w = TwoSecEdges(H,m)
        ## make weighted igraph object
        g = ig.Graph()
        g.add_vertices(n)
        g.add_edges([tuple(x) for x in e])
        g.es["weight"] = w
        g = g.simplify(combine_edges=sum)
        ## Louvain partition
        ml = g.community_multilevel(weights="weight")
        P = [x for x in ml]
        l = [len(x) for x in H]
        ## modularities and hcut
        L.append([l[3]/(l[3]+l[4]),modularityH(H,P),modularityG(H,P),hcut(H,P)])

# In[5]:
D = pd.DataFrame(L, columns=['ratio_3','qH','qG','hcut'])
x = D['ratio_3']>.6
y = D['ratio_3']>.4
D['regime'] = [int(x[i])+int(y[i]) for i in range(len(x))]

# In[9]:
X = D[D['regime']==0]
plt.plot(X['hcut'],X['qG'],'*r',label='majority of 4-edges')
X = D[D['regime']==2]
plt.plot(X['hcut'],X['qG'],'*b',label='majority of 3-edges')
X = D[D['regime']==1]
plt.plot(X['hcut'],X['qG'],'*g',label='balanced')
plt.legend()
plt.xlabel('Hcut value',fontsize=14)
plt.ylabel('Graph modularity', fontsize=14)
slope, intercept, r, p, stderr = stats.linregress(X['hcut'],X['qG'])
print('Balanced case -- Slope:', slope,' R_squared:',r*r)
plt.savefig('lines_qG.eps')

# In[10]:
X = D[D['regime']==0]
plt.plot(X['hcut'],X['qH'],'*r',label='majority of 4-edges')
X = D[D['regime']==2]
plt.plot(X['hcut'],X['qH'],'*b',label='majority of 3-edges')
X = D[D['regime']==1]
plt.plot(X['hcut'],X['qH'],'*g',label='balanced')
plt.legend()
plt.xlabel('Hcut value',fontsize=14)
plt.ylabel('Hypergraph modularity', fontsize=14)
slope, intercept, r, p, stderr = stats.linregress(X['hcut'],X['qH'])
print('Balanced case -- Slope:', slope,' R_squared:',r*r)
plt.savefig('lines_qH.eps')
strictModularity.py

##########################################################
## Hypergraph algorithms
## Quick Python implementation - small datasets only!
##
## Data representation
##
## h: list of sets (the hyperedges), with 0-based integer vertices
##    example: h = [{0,1},{1,2,3},{2,3,4},{4,6},{5,6}]
## H = list2H(h) (hypergraph)
##    example: H = [[], [], [{0, 1}, {4, 6}, {5, 6}], [{1, 2, 3}, {2, 3, 4}]]
##
## qH, PH = randomAlgo(H, steps=10, verbose=True, ddeg=False)
## qH, PH = cnmAlgo(H, verbose=True)
## qG, PG = randomAlgoTwoSec(H, steps=10, verbose=True)
##
##########################################################
from scipy.special import comb as choose
import itertools
from random import shuffle, randint, sample

##########################################################
## return: number of nodes (recall: 0-based), and
## number of edges of each cardinality
def H_size(H):
    M = len(H)
    m = []
    n = 0
    for i in range(M):
        m.append(len(H[i]))
        if(len(H[i])>0):
            j = max(set.union(*H[i]))
            if j>n:
                n = j
    return n+1, m

## vertex d-degrees for each d
def d_Degrees(H, n, m):
    M = len(H)
    d = [[]]*M
    for i in range(M):
        if(m[i]>0):
            x = [y for x in H[i] for y in list(x)]
            y = [x.count(i) for i in range(n)]
            d[i] = y
    return d

## vertex total degrees
def Degrees(H,n,m,d):
    M = len(H)
    D = [0]*n
    for i in range(M):
        if(m[i]>0):
            for j in range(n):
                D[j] = D[j] + d[i][j]
    return D

##########################################################
## edge contribution: given (H,A,m)
def EdgeContribution(H,A,m):
    ec = 0
    for i in range(len(H)):
        for j in range(len(H[i])):
            for k in range(len(A)):
                if(H[i][j].issubset(A[k])):
                    ec = ec + 1
                    break
    ec = ec / sum(m)
    return ec

##########################################################
## degree tax - with d-degrees as null model
def d_DegreeTax(A,m,d):
    dt = 0
    for i in range(len(m)):
        if (m[i]>0):
            S = 0
            for j in range(len(A)):
                s = 0
                for k in A[j]:
                    s = s + d[i][k]
                s = s ** i
                S = S + s
            S = S / (i**i * m[i]**(i-1) * sum(m))
            dt = dt + S
    return dt

## degree tax - with degrees as null model
def DegreeTax(A,m,D):
    dt = 0
    vol = sum(D)
    M = sum(m)
    ## vol(A_i)'s
    volA = [0]*len(A)
    for i in range(len(A)):
        for j in A[i]:
            volA[i] = volA[i] + D[j]
        volA[i] = volA[i] / vol
    ## sum over d
    S = 0
    for i in range(len(m)):
        if (m[i]>0):
            x = sum([a**i for a in volA]) * m[i] / M
            S = S + x
    return S

##########################################################
## 2-section: return extended list of edges and edge weights
def TwoSecEdges(H,m):
    e = []
    w = []
    for i in range(len(m)):
        if(m[i]>0 and i>1):
            den = choose(i,2)
            for j in range(len(H[i])):
                s = [set(k) for k in itertools.combinations(H[i][j],2)]
                x = [1/den]*len(s)
                e.extend(s)
                w.extend(x)
    return e,w

def TwoSecEdgeContribution(A,e,w):
    ec = 0
    for i in range(len(A)):
        for j in range(len(e)):
            if(e[j].issubset(A[i])):
                ec = ec + w[j]
    ec = ec / sum(w)
    return ec

def TwoSecDegrees(n,e,w):
    d = [0]*n
    for i in range(len(e)):
        d[list(e[i])[0]] = d[list(e[i])[0]] + w[i]
        d[list(e[i])[1]] = d[list(e[i])[1]] + w[i]
    return d

def TwoSecDegreeTax(A,d):
    dt = 0
    for i in range(len(A)):
        s = 0
        for j in list(A[i]):
            s = s + d[j]
        s = s**2
        dt = dt + s
    dt = dt / (4*(sum(d)/2)**2)
    return dt

##########################################################
## take a partition and an edge (set)
## return new partition with new edge "active"
def newPart(A,s):
    P = []
    for i in range(len(A)):
        if(len(s.intersection(A[i])) == 0):
            P.append(A[i])
        else:
            s = s.union(A[i])
    P.append(s)
    return P

##########################################################
def cnmAlgo(H, verbose=False, ddeg=False):
    ## get degrees from H
    n, m = H_size(H)
    d = d_Degrees(H,n,m)
    D = Degrees(H,n,m,d)
    ## get all edges in a list
    e = []
    for i in range(len(H)):
        e.extend(H[i])
    ## initialize modularity, partition
    A_opt = []
    for i in range(n):
        A_opt.extend([{i}])
    if ddeg:
        q_opt = EdgeContribution(H,A_opt,m) - d_DegreeTax(A_opt,m,d)
    else:
        q_opt = EdgeContribution(H,A_opt,m) - DegreeTax(A_opt,m,D)
    ## e contains the edges NOT yet in a part
    while len(e)>0:
        q0 = -1
        e0 = -1
        if verbose:
            print('best overall:',q_opt, 'edges left: ',len(e))
        ## pick best edge to add .. this is slow as is!
        for i in range(len(e)):
            P = newPart(A_opt,e[i])
            if ddeg:
                q = EdgeContribution(H,P,m) - d_DegreeTax(P,m,d)
            else:
                q = EdgeContribution(H,P,m) - DegreeTax(P,m,D)
            if q>q0:
                e0 = i
                q0 = q
        ## add best edge found if any
        if(q0 > q_opt):
            q_opt = q0
            A_opt = newPart(A_opt,e[e0])
            ## remove all 'active' edges
            r = []
            for i in range(len(e)):
                for j in range(len(A_opt)):
                    if(e[i].issubset(A_opt[j])):
                        r.append(e[i])
                        break
            for i in range(len(r)):
                e.remove(r[i])
        ## early stop if no immediate improvement
        else:
            break
    return q_opt, A_opt

##########################################################
## random algorithm - start from singletons, add edges w.r.t. permutation IF q improves
def randomAlgoTwoSec(H, steps=10, verbose=False):
    ## get degrees from H
    n, m = H_size(H)
    ed, w = TwoSecEdges(H, m)
    d = TwoSecDegrees(n, ed, w)
    ## get all edges in H
    e = []
    for i in range(len(H)):
        e.extend(H[i])
    ## initialize modularity, partition
    q_opt = -1
    A_opt = []
    ## algorithm - go through random permutations
    for ctr in range(steps):
        ## Loop here
        shuffle(e)
        ## list of singletons
        A = []
        for i in range(n):
            A.extend([{i}])
        ## starting (degree) modularity
        q0 = TwoSecEdgeContribution(A, ed, w) - TwoSecDegreeTax(A, d)
        for i in range(len(e)):
            P = newPart(A,e[i])
            q = TwoSecEdgeContribution(P,ed, w) - TwoSecDegreeTax(P, d)
            if q > q0:
                A = P
                q0 = q
        if q0 > q_opt:
            q_opt = q0
            A_opt = A
        if verbose:
            print('step',ctr,':',q_opt)
    return q_opt, A_opt

## random algorithm - start from singletons, add edges w.r.t. permutation IF q improves
def randomAlgo(H, steps=10, verbose=False, ddeg=False):
    ## get degrees from H
    n, m = H_size(H)
    d = d_Degrees(H,n,m)
    D = Degrees(H,n,m,d)
    ## get all edges in H
    e = []
    for i in range(len(H)):
        e.extend(H[i])
    ## initialize modularity, partition
    q_opt = -1
    A_opt = []
    ## algorithm - go through random permutations
    for ctr in range(steps):
        ## Loop here
        shuffle(e)
        ## list of singletons
        A = []
        for i in range(n):
            A.extend([{i}])
        ## starting (degree) modularity
        if ddeg:
            q0 = EdgeContribution(H,A,m) - d_DegreeTax(A,m,d)
        else:
            q0 = EdgeContribution(H,A,m) - DegreeTax(A,m,D)
        for i in range(len(e)):
            P = newPart(A,e[i])
            if ddeg:
                q = EdgeContribution(H,P,m) - d_DegreeTax(P,m,d)
            else:
                q = EdgeContribution(H,P,m) - DegreeTax(P,m,D)
            if q > q0:
                A = P
                q0 = q
        if q0 > q_opt:
            q_opt = q0
            A_opt = A
        if verbose:
            print('step',ctr,':',q_opt)
    return q_opt, A_opt

##########################################################
## Map vertices 0 .. n-1 to their respective 0-based part number
def PartitionLabels(P):
    n = 0
    for i in range(len(P)):
        n = n + len(P[i])
    label = [-1]*n
    for i in range(len(P)):
        l = list(P[i])
        for j in range(len(l)):
            label[l[j]] = i
    return label

##########################################################
## generate m edges between [idx1,idx2] inclusively
## of size between [size1,size2] inclusively
## Store in a list of lists of sets
def generateEdges(m,idx1,idx2,size1,size2):
    ## init
    L = [[]]*(size2+1)
    for i in range(size2+1):
        L[i] = []
    v = list(range(idx1,idx2+1))
    if size2>len(v):
        size2 = len(v)
    ## generate - never mind repeats for now
    for i in range(m):
        size = randint(size1,size2)
        L[size].append(set(sample(v,size)))
    return L

## merge two lists of lists of sets
def mergeEdges(L1,L2):
    l = max(len(L1),len(L2))
    L = [[]]*l
    for i in range(len(L1)):
        L[i] = L1[i]
    for i in range(len(L2)):
        L[i] = L[i] + L2[i]
    ## uniquify
    for i in range(l):
        L[i] = [set(j) for j in set(frozenset(i) for i in L[i])]
    return L

##########################################################
## format Hypergraph given list of hyperedges (list of sets of 0-based integers)
def list2H(h):
    ml = max([len(x) for x in h])
    H = [[]]*(ml+1)
    for i in range(ml+1):
        H[i] = []
    for i in range(len(h)):
        l = len(h[i])
        H[l].append(h[i])
    return H

## two section modularity
def modularityG(H,A):
    n, m = H_size(H)
    ed, w = TwoSecEdges(H, m)
    d = TwoSecDegrees(n, ed, w)
    return(TwoSecEdgeContribution(A, ed, w) - TwoSecDegreeTax(A, d))

## strict H-modularity
def modularityH(H,A,ddeg=False):
    n, m = H_size(H)
    d = d_Degrees(H,n,m)
    D = Degrees(H,n,m,d)
    if ddeg:
        return(EdgeContribution(H,A,m) - d_DegreeTax(A,m,d))
    else:
        return(EdgeContribution(H,A,m) - DegreeTax(A,m,D))
##########################################################
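As a quick sanity check (our addition), the helpers above can be exercised on the toy hypergraph from the header comment of strictModularity.py; the partition below is chosen just for illustration:

	from strictModularity import *

	h = [{0,1},{1,2,3},{2,3,4},{4,6},{5,6}]   # hyperedges over vertices 0..6
	H = list2H(h)                             # group edges by cardinality
	P = [{0,1},{2,3,4},{5,6}]                 # an illustrative partition of the vertices
	print('qH:', modularityH(H, P))           # strict hypergraph modularity
	print('qG:', modularityG(H, P))           # 2-section graph modularity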