"Clustering via hypergraph modularity" - companion source code and files

Below you will find the companion files for the paper "Clustering via hypergraph modularity", submitted to the PLOS ONE journal by Bogumił Kamiński, Valérie Poulin, Paweł Prałat, Przemysław Szufel and François Théberge.

In the paper, a hypergraph modularity function is proposed that generalizes its well-established and widely used graph counterpart, a measure of how clustered a network is.
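For reference, the strict hypergraph modularity computed by the supplied code (see EdgeContribution and DegreeTax in strictModularity.py below) can be written, for a partition A = {A_1, ..., A_k} of the vertex set, as

	q_H(A) = |{e in E : e ⊆ A_i for some i}| / |E|  −  Σ_d (|E_d| / |E|) Σ_i (vol(A_i) / vol(V))^d

where E_d denotes the hyperedges of size d and vol(A_i) is the sum of the degrees of the vertices of A_i: the first term rewards hyperedges fully contained in a single part, and the second is the expected value of that reward under a degree-based null model. (This formula is reconstructed from the supplied code; see the paper for the formal definition.)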

The provided code files contain an implementation of the heuristic algorithm presented in the paper, along with two numerical examples: a test on synthetic hypergraphs and a test on a small DBLP dataset.

The code that was used for the experiments is accompanied by the data files required to run our examples. To ensure full replicability of our research, a fixed random number seed is used in the source code.

Installation and configuration

To run the examples we recommend Python 3, which can be downloaded via the Anaconda distribution. Once Anaconda is installed, you need to install the igraph package; simply execute the bash command below:

conda install -c vtraag python-igraph
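To verify that the installation succeeded, a quick import check (our addition, not part of the original instructions) can be run from Python:

	import igraph as ig
	print(ig.__version__)  # should print the installed python-igraph version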

Test on synthetic hypergraphs

In this numerical example, we generate hyperedges following the process described in M. Leordeanu et al. (2012).

We consider 150 points in the square [-0.5, 0.5] × [-0.5, 0.5] such that:

  • Nodes 0-29: noisy points on a line A with slope -1
  • Nodes 30-59: noisy points on a line B with slope 0.02
  • Nodes 60-89: noisy points on a line C with slope 0.8
  • Nodes 90-149: random points

We sample points in sets of 3 and 4, respectively, keeping groups of points that are well aligned (w.r.t. the residuals of the numpy.polyfit() function).
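This generation step can be sketched as follows. This is an illustrative reconstruction, not the original script: the point cloud, the residual threshold, and the number of candidate draws are all assumptions.

	import numpy as np

	rng = np.random.default_rng(0)
	pts = rng.uniform(-0.5, 0.5, size=(150, 2))    # placeholder point cloud (assumption)

	def residual(subset):
	    ## numpy.polyfit returns the residual sum of squares when full=True
	    x, y = pts[subset, 0], pts[subset, 1]
	    coeffs, res, *rest = np.polyfit(x, y, 1, full=True)
	    return res[0] if len(res) else 0.0

	edges3 = []
	for trial in range(200000):                    # number of candidate draws (assumption)
	    cand = rng.choice(len(pts), size=3, replace=False)
	    if residual(cand) < 1e-3:                  # keep well-aligned triples (threshold assumed)
	        edges3.append(sorted(int(i) for i in cand))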

The hyperedges are listed in the supplied files:

  • hypergraph_3uniform.csv: 159,816 3-edges
  • hypergraph_4uniform.csv: 160,000 4-edges
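Each row of these CSV files lists the vertex ids of one hyperedge, comma-separated, as can be seen from the loading code in hypergraph.py:

	import numpy as np
	x3 = np.loadtxt('hypergraph_3uniform.csv', delimiter=',', dtype='int')  # shape (159816, 3)
	print(x3[0])  # vertex ids of the first 3-edge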

To generate our plots, we randomly sampled from those sets under three different regimes:

  • the same proportion of 3-edges and 4-edges,
  • 2/3 3-edges, or
  • 2/3 4-edges,

and we also sample so that we expect twice as many edges coming entirely from the same line (A, B or C) as sets with mixed lines and/or random points.

To generate the plots, just run:

	python hypergraph.py

This requires the functions supplied in the strictModularity.py file. Hyperedges are read from the two CSV files. Two plots are produced, lines_qG.eps and lines_qH.eps; they constitute Figure 1 in our paper.
To ensure replicability of our research we use a fixed random number seed. You are free to change it to see that the results vary slightly with each sampling, while the conclusions remain stable.

Implementation and test of the CM-like algorithm

The Julia code below works with the hypergraph file generated in the previous step. It loads the hypergraph with SimpleHypergraphs.jl and repeatedly merges communities along randomly chosen hyperedges, keeping each merge only when it improves modularity.

using SimpleHypergraphs, StatsBase

## read a hypergraph from a tab-separated file:
## the first line holds the vertex metadata, each following line one hyperedge
function do_file(name::String)
    f = open(name)
    line = readline(f)
    h = Hypergraph{Bool,Int}(0,0)
    for v_meta in parse.(Int, split(line, "\t"))
        add_vertex!(h, vertex_meta=v_meta)
    end
    for line in eachline(f)
        x = parse.(Int, split(line, "\t"))
        inds = x .+ 1          ## the file uses 0-based vertex ids
        add_hyperedge!(h; vertices=Dict(inds .=> true))
    end
    close(f)
    h
end

## index of the first community that contains any of the given vertices
function find_first(c::Array{Set{Int}}, vals)
    for i in 1:length(c)
        for v in vals
            v in c[i] && return i
        end
    end
end

## CM-like heuristic: repeatedly pick a random hyperedge, merge the
## communities it touches, and keep the merge only if modularity improves
function find_comms(h::Hypergraph, nreps::Int=1000; ha = SimpleHypergraphs.HypergraphAggs(h))
    best_modularity = 0
    comms = [Set(i) for i in 1:nhv(h)]
    mod_history = Vector{Float64}(undef, nreps)
    for rep in 1:nreps
        he = rand(1:nhe(h))
        vers = collect(keys(getvertices(h, he)))
        c = deepcopy(comms)
        i0 = find_first(c, vers)
        max_i = length(c)
        i_cur = i0
        ## merge every later community intersecting the hyperedge into c[i0]
        while i_cur < max_i
            i_cur += 1
            if length(intersect(c[i_cur], vers)) > 0
                union!(c[i0], c[i_cur])
                c[i_cur] = c[max_i]   ## swap-and-shrink removal
                max_i -= 1
            end
        end
        resize!(c, max_i)
        m = modularity(h, c, ha)
        if m > best_modularity    ## accept the merge only if modularity improves
            best_modularity = m
            comms = c
        end
        mod_history[rep] = best_modularity
    end
    return (bm=best_modularity, bp=comms, mod_history=mod_history)
end

Now let us use the code to perform the experiments:

const h = do_file("h2_100");
const ha = SimpleHypergraphs.HypergraphAggs(h)
res = Matrix{Float64}(undef, 500, 0)
for i in 1:1920  # in the production code we use distributed computing here
    res = hcat(res, find_comms(h, 500, ha=ha).mod_history)
end
using DelimitedFiles
writedlm("res.txt", res)

Once the above code has been run, we used the resulting data to create Figure 2.
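As an illustration only (this plotting step is not part of the supplied sources; the file name matches the writedlm call above), the saved modularity histories could be inspected from Python along these lines:

	import numpy as np
	import matplotlib.pyplot as plt

	res = np.loadtxt('res.txt')        # 500 x 1920 matrix written by writedlm above
	plt.plot(res.mean(axis=1))         # average best-modularity-so-far trajectory
	plt.xlabel('iteration')
	plt.ylabel('best modularity so far')
	plt.show()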

Test on a small DBLP dataset

The DBLP computer science bibliography database contains open bibliographic information on major computer science journals and proceedings. The DBLP database is operated jointly by the University of Trier and Schloss Dagstuhl. We consider a hypergraph built from this bibliographic data, where each vertex represents an author and each hyperedge corresponds to a paper (its set of authors).

The data files for this experiment are available here and contain preprocessed DBLP data. The file dblp_edges contains the hyperedges, one per line, in the form {2, 3, 4, 5}; the numbers are unique author identifiers. For each author, we have identified a field of academic interest; these are listed in the file dblp_authors-fields.
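A minimal sketch of parsing such lines into Python sets of author ids (the function name is ours; the exact file layout beyond the shown example is assumed):

	def read_edges(path):
	    edges = []
	    with open(path) as f:
	        for line in f:
	            inner = line.strip().strip('{}')          # drop the surrounding braces
	            if inner:
	                edges.append({int(tok) for tok in inner.split(',')})
	    return edges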

As reported in the paper, we pruned all edges containing at least one author with an unknown field of interest, as well as all edges of size 1; we also randomly selected 1/3 of the edges and kept only the (unique) large connected component.
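A minimal sketch of this pruning pipeline under stated assumptions (the helper names, the field dictionary, and the use of a union-find to extract the large component are ours; the original preprocessing script is not included here):

	import random

	def largest_component(edges):
	    ## union-find over vertices; returns the edges of the largest component
	    parent = {}
	    def find(u):
	        parent.setdefault(u, u)
	        while parent[u] != u:
	            parent[u] = parent[parent[u]]   # path halving
	            u = parent[u]
	        return u
	    for e in edges:
	        vs = list(e)
	        for v in vs[1:]:
	            parent[find(vs[0])] = find(v)
	    comps = {}
	    for v in parent:
	        comps.setdefault(find(v), set()).add(v)
	    giant = max(comps.values(), key=len)
	    return [e for e in edges if e <= giant]

	def prune(edges, field):   # field: dict mapping author id -> field of interest (or None)
	    keep = [e for e in edges
	            if len(e) > 1 and all(field.get(a) is not None for a in e)]
	    keep = [e for e in keep if random.random() < 1/3]   # keep roughly 1/3 of the edges
	    return largest_component(keep)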

We ended up with 1637 nodes and 865 edges of size 2, 470 of size 3, 152 of size 4, and 37 of size 5 to 7.

Those are included in the file dblp.pkl along with author fields and the partitions we obtained.

To re-create the results from the paper, run

	python dblp.py

By default, the partitions are read from the dblp.pkl file; uncomment the lines in the Python code to re-run the Louvain and hypergraph-CNM algorithms (the latter is slow).

This code also uses the basic functions loaded from the strictModularity.py file. Output is sent to stdout.

Licensing

This file and all supplied datasets are licensed under the terms of the Creative Commons Attribution 4.0 License (CC BY 4.0).

All source code in this archive is licensed under the terms of the MIT License.

dblp.py

# coding: utf-8

# # Strict and 2-section Hypergraph Clustering

# In[1]:
import pandas as pd
import numpy as np
import igraph as ig
from sklearn.metrics import adjusted_rand_score as ARI
from strictModularity import *

# In[2]:
## proportion of hyperedges cut w.r.t. partition P
def hcut(H,P):
    s = 0
    l = 0
    for i in range(len(H)):
        l = l + len(H[i])
        for j in range(len(P)):
            s = s + sum([x < set(P[j]) for x in H[i]])
    return (l-s)/l

# In[3]:
import pickle
H, truth, partitions = pickle.load(open('dblp.pkl','rb'))
PL = partitions[0] ## 2-section Louvain
PC = partitions[1] ## hypergraph-CNM

# In[4]:
## make igraph object
n, m = H_size(H)
e, w = TwoSecEdges(H,m)
g = ig.Graph()
g.add_vertices(n)
g.add_edges([tuple(x) for x in e])
g.es["weight"] = w
g = g.simplify(combine_edges=sum)

# In[5]:
## uncomment to re-compute the Louvain partition
#ml = g.community_multilevel(weights="weight")
#PL = [x for x in ml]

# In[6]:
## Table 1 with Louvain
print('Table 1 with Louvain')
print('H-modularity ',modularityH(H,PL))
print('G-modularity ',modularityG(H,PL))
print('hcut ',hcut(H,PL))
print('number of parts ',len(PL))

# In[7]:
## uncomment to re-compute the hyper-CNM partition -- NB: this is slow!
#qC, PC = cnmAlgo(H, verbose=True)

# In[8]:
## Table 1 with hypergraph-CNM
print('Table 1 with hypergraph-CNM')
print('H-modularity ',modularityH(H,PC))
print('G-modularity ',modularityG(H,PC))
print('hcut ',hcut(H,PC))
print('number of parts ',len(PC))

# In[9]:
## map each vertex to the 0-based index of its part in P
def member(P):
    s = sum([len(x) for x in P])
    M = [-1]*s
    for i in range(len(P)):
        for j in list(P[i]):
            M[j] = i
    return M

## for every edge of g: True iff both endpoints fall in the same community
def edgeLabels(g, gcomm):
    x = [(gcomm[x.tuple[0]]==gcomm[x.tuple[1]]) for x in g.es]
    return x

## adjusted graph-aware Rand index between membership vectors u and v
def AGRI(g, u, v):
    bu = edgeLabels(g, u)
    bv = edgeLabels(g, v)
    su = sum(bu)
    sv = sum(bv)
    suv = sum(np.array(bu)*np.array(bv))
    m = len(bu)
    return((suv - su*sv/m) / (0.5*(su+sv) - su*sv/m))

## cluster similarity
print('Partition similarity measures')
print('ARI ',ARI(member(PL),member(PC)))
print('AGRI ',AGRI(g,member(PL),member(PC)))

# In[10]:
## proportion of hyperedges cut, broken down by edge size
def hcutPlus(H,P):
    hc = [-1]*len(H)
    S = 0
    L = 0
    for i in range(len(H)):
        l = len(H[i])
        L = L + l
        s = 0
        for j in range(len(P)):
            s = s + sum([x < set(P[j]) for x in H[i]])
        S = S + s
        if l>0:
            hc[i] = (l-s)/l
    return hc

# In[11]:
## Louvain partition
hc = hcutPlus(H,PL)
print('With Louvain partition:')
for i in [2,3,4]:
    print('prop. of edges of size',i,'cut:',hc[i])

# In[12]:
## hyper-CNM partition
hc = hcutPlus(H,PC)
print('With hyper-CNM partition:')
for i in [2,3,4]:
    print('prop. of edges of size',i,'cut:',hc[i])
hypergraph.py

# coding: utf-8

# # Strict and 2-section Hypergraph Clustering

# In[1]:
import pandas as pd
import numpy as np
import copy
import matplotlib.pyplot as plt
from sklearn.metrics import adjusted_rand_score as Rand
import igraph as ig
from scipy import stats
## All hypergraph functions are here
from strictModularity import *
np.random.seed(0)
import random
random.seed(0)

# In[2]:
## proportion of hyperedges cut w.r.t. partition P
def hcut(H,P):
    s = 0
    l = 0
    for i in range(len(H)):
        l = l + len(H[i])
        for j in range(len(P)):
            s = s + sum([x < set(P[j]) for x in H[i]])
    return (l-s)/l

# In[3]:
## number of points of each type (lines A, B, C and outliers)
pointsPerLine = 30
numOutliers = 60
REP = 100
nl = [0]*pointsPerLine + [1]*pointsPerLine + [2]*pointsPerLine + [3]*numOutliers

## is this edge from the same line?
def lineEdge(e):
    s = set([nl[i] for i in e])
    return (len(s)==1 and 3 not in s)

L = []
mu = 0.33 ## proportion of noisy edges to keep
for z in [0.5,2,1]: ## ratios of 3-edges and 4-edges
    fn = './hypergraph_3uniform.csv'
    x3 = np.loadtxt(fn, delimiter=',', dtype='int')
    fn = './hypergraph_4uniform.csv'
    x4 = np.loadtxt(fn, delimiter=',', dtype='int')
    ## Downsample
    p3 = .1 * z
    base3 = [set(i) for i in x3 if np.random.random()<p3]
    p4 = .035 / z
    base4 = [set(i) for i in x4 if np.random.random()<p4]
    base = base3+base4
    for rep in range(REP):
        ## sample down False cases to get mu
        x = [lineEdge(i) for i in base]
        t = x.count(True)
        f = x.count(False)
        p = mu*t/((1-mu)*f)
        h = [base[i] for i in range(len(base)) if x[i] or np.random.sample()<p]
        H = list2H(h)
        ## 2-section edges and weights
        n, m = H_size(H)
        e, w = TwoSecEdges(H,m)
        ## make weighted igraph object
        g = ig.Graph()
        g.add_vertices(n)
        g.add_edges([tuple(x) for x in e])
        g.es["weight"] = w
        g = g.simplify(combine_edges=sum)
        ## Louvain partition
        ml = g.community_multilevel(weights="weight")
        P = [x for x in ml]
        l = [len(x) for x in H]
        ## modularities and hcut
        L.append([l[3]/(l[3]+l[4]),modularityH(H,P),modularityG(H,P),hcut(H,P)])

# In[5]:
D = pd.DataFrame(L, columns=['ratio_3','qH','qG','hcut'])
x = D['ratio_3']>.6
y = D['ratio_3']>.4
D['regime'] = [int(x[i])+int(y[i]) for i in range(len(x))]

# In[9]:
X = D[D['regime']==0]
plt.plot(X['hcut'],X['qG'],'*r',label='majority of 4-edges')
X = D[D['regime']==2]
plt.plot(X['hcut'],X['qG'],'*b',label='majority of 3-edges')
X = D[D['regime']==1]
plt.plot(X['hcut'],X['qG'],'*g',label='balanced')
plt.legend()
plt.xlabel('Hcut value',fontsize=14)
plt.ylabel('Graph modularity', fontsize=14)
slope, intercept, r, p, stderr = stats.linregress(X['hcut'],X['qG'])
print('Balanced case -- Slope:', slope,' R_squared:',r*r)
plt.savefig('lines_qG.eps')

# In[10]:
X = D[D['regime']==0]
plt.plot(X['hcut'],X['qH'],'*r',label='majority of 4-edges')
X = D[D['regime']==2]
plt.plot(X['hcut'],X['qH'],'*b',label='majority of 3-edges')
X = D[D['regime']==1]
plt.plot(X['hcut'],X['qH'],'*g',label='balanced')
plt.legend()
plt.xlabel('Hcut value',fontsize=14)
plt.ylabel('Hypergraph modularity', fontsize=14)
slope, intercept, r, p, stderr = stats.linregress(X['hcut'],X['qH'])
print('Balanced case -- Slope:', slope,' R_squared:',r*r)
plt.savefig('lines_qH.eps')
strictModularity.py

##########################################################
## Hypergraph algorithms
## Quick Python implementation - small datasets only!
##
## Data representation
##
## h: list of sets (the hyperedges), with 0-based integer vertices
##    example: h = [{0,1},{1,2,3},{2,3,4},{4,6},{5,6}]
## H = list2H(h) (hypergraph)
##    example: H = [[], [], [{0, 1}, {4, 6}, {5, 6}], [{1, 2, 3}, {2, 3, 4}]]
##
## qH, PH = randomAlgo(H, steps=10, verbose=True, ddeg=False)
## qH, PH = cnmAlgo(H, verbose=True)
## qG, PG = randomAlgoTwoSec(H, steps=10, verbose=True)
##
##########################################################
from scipy.special import comb as choose
import itertools
from random import shuffle, randint, sample

##########################################################
## return: number of nodes (recall: 0-based), and
## number of edges of each cardinality
def H_size(H):
    M = len(H)
    m = []
    n = 0
    for i in range(M):
        m.append(len(H[i]))
        if(len(H[i])>0):
            j = max(set.union(*H[i]))
            if j>n:
                n = j
    return n+1, m

## vertex d-degrees for each d
def d_Degrees(H, n, m):
    M = len(H)
    d = [[]]*M
    for i in range(M):
        if(m[i]>0):
            x = [y for x in H[i] for y in list(x)]
            y = [x.count(i) for i in range(n)]
            d[i] = y
    return d

## vertex total degrees
def Degrees(H,n,m,d):
    M = len(H)
    D = [0]*n
    for i in range(M):
        if(m[i]>0):
            for j in range(n):
                D[j] = D[j] + d[i][j]
    return D

##########################################################
## edge contribution: given (H,A,m)
def EdgeContribution(H,A,m):
    ec = 0
    for i in range(len(H)):
        for j in range(len(H[i])):
            for k in range(len(A)):
                if(H[i][j].issubset(A[k])):
                    ec = ec + 1
                    break
    ec = ec / sum(m)
    return ec

##########################################################
## degree tax - with d-degrees as null model
def d_DegreeTax(A,m,d):
    dt = 0
    for i in range(len(m)):
        if (m[i]>0):
            S = 0
            for j in range(len(A)):
                s = 0
                for k in A[j]:
                    s = s + d[i][k]
                s = s ** i
                S = S + s
            S = S / (i**i * m[i]**(i-1) * sum(m))
            dt = dt + S
    return dt

## degree tax - with degrees as null model
def DegreeTax(A,m,D):
    dt = 0
    vol = sum(D)
    M = sum(m)
    ## vol(A_i)'s
    volA = [0]*len(A)
    for i in range(len(A)):
        for j in A[i]:
            volA[i] = volA[i] + D[j]
        volA[i] = volA[i] / vol
    ## sum over d
    S = 0
    for i in range(len(m)):
        if (m[i]>0):
            x = sum([a**i for a in volA]) * m[i] / M
            S = S + x
    return S

##########################################################
## 2-section: return extended list of edges and edge weights
def TwoSecEdges(H,m):
    e = []
    w = []
    for i in range(len(m)):
        if(m[i]>0 and i>1):
            den = choose(i,2)
            for j in range(len(H[i])):
                s = [set(k) for k in itertools.combinations(H[i][j],2)]
                x = [1/den]*len(s)
                e.extend(s)
                w.extend(x)
    return e,w

def TwoSecEdgeContribution(A,e,w):
    ec = 0
    for i in range(len(A)):
        for j in range(len(e)):
            if(e[j].issubset(A[i])):
                ec = ec + w[j]
    ec = ec / sum(w)
    return ec

def TwoSecDegrees(n,e,w):
    d = [0]*n
    for i in range(len(e)):
        d[list(e[i])[0]] = d[list(e[i])[0]] + w[i]
        d[list(e[i])[1]] = d[list(e[i])[1]] + w[i]
    return d

def TwoSecDegreeTax(A,d):
    dt = 0
    for i in range(len(A)):
        s = 0
        for j in list(A[i]):
            s = s + d[j]
        s = s**2
        dt = dt + s
    dt = dt / (4*(sum(d)/2)**2)
    return dt

##########################################################
## take a partition and an edge (set)
## return new partition with new edge "active"
def newPart(A,s):
    P = []
    for i in range(len(A)):
        if(len(s.intersection(A[i])) == 0):
            P.append(A[i])
        else:
            s = s.union(A[i])
    P.append(s)
    return P

##########################################################
def cnmAlgo(H, verbose=False, ddeg=False):
    ## get degrees from H
    n, m = H_size(H)
    d = d_Degrees(H,n,m)
    D = Degrees(H,n,m,d)
    ## get all edges in a list
    e = []
    for i in range(len(H)):
        e.extend(H[i])
    ## initialize modularity, partition
    A_opt = []
    for i in range(n):
        A_opt.extend([{i}])
    if ddeg:
        q_opt = EdgeContribution(H,A_opt,m) - d_DegreeTax(A_opt,m,d)
    else:
        q_opt = EdgeContribution(H,A_opt,m) - DegreeTax(A_opt,m,D)
    ## e contains the edges NOT yet in a part
    while len(e)>0:
        q0 = -1
        e0 = -1
        if verbose:
            print('best overall:',q_opt, 'edges left: ',len(e))
        ## pick best edge to add .. this is slow as is!
        for i in range(len(e)):
            P = newPart(A_opt,e[i])
            if ddeg:
                q = EdgeContribution(H,P,m) - d_DegreeTax(P,m,d)
            else:
                q = EdgeContribution(H,P,m) - DegreeTax(P,m,D)
            if q>q0:
                e0 = i
                q0 = q
        ## add best edge found if any
        if(q0 > q_opt):
            q_opt = q0
            A_opt = newPart(A_opt,e[e0])
            ## remove all 'active' edges
            r = []
            for i in range(len(e)):
                for j in range(len(A_opt)):
                    if(e[i].issubset(A_opt[j])):
                        r.append(e[i])
                        break
            for i in range(len(r)):
                e.remove(r[i])
        ## early stop if no immediate improvement
        else:
            break
    return q_opt, A_opt

##########################################################
## random algorithm - start from singletons, add edges w.r.t. permutation IF q improves
def randomAlgoTwoSec(H, steps=10, verbose=False):
    ## get degrees from H
    n, m = H_size(H)
    ed, w = TwoSecEdges(H, m)
    d = TwoSecDegrees(n, ed, w)
    ## get all edges in H
    e = []
    for i in range(len(H)):
        e.extend(H[i])
    ## initialize modularity, partition
    q_opt = -1
    A_opt = []
    ## algorithm - go through random permutations
    for ctr in range(steps):
        ## Loop here
        shuffle(e)
        ## list of singletons
        A = []
        for i in range(n):
            A.extend([{i}])
        ## starting (degree) modularity
        q0 = TwoSecEdgeContribution(A, ed, w) - TwoSecDegreeTax(A, d)
        for i in range(len(e)):
            P = newPart(A,e[i])
            q = TwoSecEdgeContribution(P,ed, w) - TwoSecDegreeTax(P, d)
            if q > q0:
                A = P
                q0 = q
        if q0 > q_opt:
            q_opt = q0
            A_opt = A
        if verbose:
            print('step',ctr,':',q_opt)
    return q_opt, A_opt

## random algorithm - start from singletons, add edges w.r.t. permutation IF q improves
def randomAlgo(H, steps=10, verbose=False, ddeg=False):
    ## get degrees from H
    n, m = H_size(H)
    d = d_Degrees(H,n,m)
    D = Degrees(H,n,m,d)
    ## get all edges in H
    e = []
    for i in range(len(H)):
        e.extend(H[i])
    ## initialize modularity, partition
    q_opt = -1
    A_opt = []
    ## algorithm - go through random permutations
    for ctr in range(steps):
        ## Loop here
        shuffle(e)
        ## list of singletons
        A = []
        for i in range(n):
            A.extend([{i}])
        ## starting (degree) modularity
        if ddeg:
            q0 = EdgeContribution(H,A,m) - d_DegreeTax(A,m,d)
        else:
            q0 = EdgeContribution(H,A,m) - DegreeTax(A,m,D)
        for i in range(len(e)):
            P = newPart(A,e[i])
            if ddeg:
                q = EdgeContribution(H,P,m) - d_DegreeTax(P,m,d)
            else:
                q = EdgeContribution(H,P,m) - DegreeTax(P,m,D)
            if q > q0:
                A = P
                q0 = q
        if q0 > q_opt:
            q_opt = q0
            A_opt = A
        if verbose:
            print('step',ctr,':',q_opt)
    return q_opt, A_opt

##########################################################
## Map vertices 0 .. n-1 to their respective 0-based part number
def PartitionLabels(P):
    n = 0
    for i in range(len(P)):
        n = n + len(P[i])
    label = [-1]*n
    for i in range(len(P)):
        l = list(P[i])
        for j in range(len(l)):
            label[l[j]] = i
    return label

##########################################################
## generate m edges between [idx1,idx2] inclusively
## of size between [size1,size2] inclusively
## Store in a list of lists of sets
def generateEdges(m,idx1,idx2,size1,size2):
    ## init
    L = [[]]*(size2+1)
    for i in range(size2+1):
        L[i] = []
    v = list(range(idx1,idx2+1))
    if size2>len(v):
        size2 = len(v)
    ## generate - never mind repeats for now
    for i in range(m):
        size = randint(size1,size2)
        L[size].append(set(sample(v,size)))
    return L

## merge two lists of lists of sets
def mergeEdges(L1,L2):
    l = max(len(L1),len(L2))
    L = [[]]*l
    for i in range(len(L1)):
        L[i] = L1[i]
    for i in range(len(L2)):
        L[i] = L[i] + L2[i]
    ## uniquify
    for i in range(l):
        L[i] = [set(j) for j in set(frozenset(i) for i in L[i])]
    return L

##########################################################
## format Hypergraph given list of hyperedges (list of sets of 0-based integers)
def list2H(h):
    ml = max([len(x) for x in h])
    H = [[]]*(ml+1)
    for i in range(ml+1):
        H[i] = []
    for i in range(len(h)):
        l = len(h[i])
        H[l].append(h[i])
    return H

## two section modularity
def modularityG(H,A):
    n, m = H_size(H)
    ed, w = TwoSecEdges(H, m)
    d = TwoSecDegrees(n, ed, w)
    return(TwoSecEdgeContribution(A, ed, w) - TwoSecDegreeTax(A, d))

## strict H-modularity
def modularityH(H,A,ddeg=False):
    n, m = H_size(H)
    d = d_Degrees(H,n,m)
    D = Degrees(H,n,m,d)
    if ddeg:
        return(EdgeContribution(H,A,m) - d_DegreeTax(A,m,d))
    else:
        return(EdgeContribution(H,A,m) - DegreeTax(A,m,D))
##########################################################
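As a quick sanity check (our addition), the helpers above can be exercised on the toy hypergraph from the header comment of strictModularity.py; the partition below is chosen just for illustration:

	from strictModularity import *

	h = [{0,1},{1,2,3},{2,3,4},{4,6},{5,6}]   # hyperedges over vertices 0..6
	H = list2H(h)                             # group edges by cardinality
	P = [{0,1},{2,3,4},{5,6}]                 # an illustrative partition of the vertices
	print('qH:', modularityH(H, P))           # strict hypergraph modularity
	print('qG:', modularityG(H, P))           # 2-section graph modularity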