It seems DRIFT search uses vector similarity to select the top-k communities at the top-most level of the hierarchy, and then drills down into the lower levels. They seem to use follow-up questions to drive this drill-down. The answers are ranked by relevance to the query. Need to check the paper for more details.
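To make sure I understand the flow, here is a minimal sketch of how I picture the drill-down (hypothetical names and structure; the real DRIFT generates LLM follow-up questions at each level, which I approximate here by reusing the original query embedding):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <string>
#include <vector>

struct Community {
  std::string summary;
  std::vector<float> embedding;     // embedding of the community summary
  std::vector<Community> children;  // sub-communities at the next level down
};

float cosine(const std::vector<float>& a, const std::vector<float>& b) {
  float dot = 0, na = 0, nb = 0;
  for (size_t i = 0; i < a.size(); ++i) {
    dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i];
  }
  return dot / (std::sqrt(na) * std::sqrt(nb) + 1e-9f);
}

// Rank communities at this level by similarity to the query, keep the top-k,
// and drill down into each kept community's children.
void drift(const std::vector<Community>& level, const std::vector<float>& query,
           size_t k, std::vector<const Community*>& hits) {
  std::vector<std::pair<float, const Community*>> scored;
  for (const auto& c : level) scored.emplace_back(cosine(c.embedding, query), &c);
  std::sort(scored.begin(), scored.end(),
            [](const auto& x, const auto& y) { return x.first > y.first; });
  for (size_t i = 0; i < std::min(k, scored.size()); ++i) {
    hits.push_back(scored[i].second);
    // DRIFT proper would generate a follow-up question (a new query
    // embedding) here before descending; we just reuse the original query.
    drift(scored[i].second->children, query, k, hits);
  }
}

int main() {
  std::vector<Community> top{
      {"community A", {1.0f, 0.0f}, {{"sub-community A1", {0.9f, 0.1f}, {}}}},
      {"community B", {0.0f, 1.0f}, {}}};
  std::vector<const Community*> hits;
  drift(top, {1.0f, 0.0f}, 1, hits);
  for (const auto* c : hits) std::printf("%s\n", c->summary.c_str());
}
```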
Unlike Baseline RAG, which uses embedding search over a vector database to find chunks of the source text that match the query, GraphRAG builds a knowledge graph from the text and summarizes it hierarchically based on community clusters.
From what I understand, the knowledge graph is built by extracting entities and relations from the text, and the community clusters are formed based on the similarity of the entities and relations. The hierarchical summarization then condenses the clusters at successive levels of abstraction.
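A rough sketch of the bottom-up summarization as I picture it (llm_summarize is a hypothetical stand-in for the actual GPT-4-turbo call, not the real API):

```cpp
#include <cstdio>
#include <string>
#include <vector>

struct Cluster {
  std::string summary;            // leaf: entity/relation descriptions
  std::vector<Cluster> children;  // sub-communities at the next level down
};

// Stub in place of the real LLM summarization call (assumption, not the API).
std::string llm_summarize(const std::string& text) {
  return "summary of: " + text.substr(0, 40);
}

// Each cluster's summary condenses its children's summaries, so higher levels
// of the hierarchy give coarser abstractions of the same subgraph.
std::string summarize(Cluster& c) {
  if (c.children.empty()) return c.summary;
  std::string combined;
  for (auto& child : c.children) combined += summarize(child) + "\n";
  c.summary = llm_summarize(combined);
  return c.summary;
}

int main() {
  Cluster root{"", {{"entity X supplies Y", {}}, {"entity Z acquires X", {}}}};
  std::printf("%s\n", summarize(root).c_str());
}
```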
GraphRAG seems to use GPT-4-turbo to build the knowledge graph. However, how are the edge weights calculated? And do the generated summaries affect how the weights are calculated at the next hierarchical level?
I was trying out the Misra-Gries algorithm for finding heavy hitters in a list of numbers, using vector instructions. But let me fill in some context first. I am trying to minimize the memory needed by the Louvain/Leiden algorithms for community detection, and hopefully gain a bit of performance too. Currently the algorithms use full-size per-thread hashtables, with each thread using |V| space to store the associated weights. Instead, we could keep a small hashtable in the cache, using the Misra-Gries algorithm - obviously this might come at a cost in the quality of the returned communities.
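For reference, a minimal sketch of plain (unweighted) Misra-Gries; the per-thread table in Louvain/Leiden would use community ids as keys and accumulate edge weights rather than counts:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

struct MisraGries {
  std::vector<uint32_t> keys;    // tracked candidate keys
  std::vector<uint32_t> counts;  // their (under-)counts
  explicit MisraGries(int k) : keys(k - 1, 0), counts(k - 1, 0) {}

  void add(uint32_t key) {
    // Case 1: key already tracked -> increment its counter.
    for (size_t i = 0; i < keys.size(); ++i)
      if (counts[i] > 0 && keys[i] == key) { ++counts[i]; return; }
    // Case 2: a free slot exists -> start tracking key.
    for (size_t i = 0; i < keys.size(); ++i)
      if (counts[i] == 0) { keys[i] = key; counts[i] = 1; return; }
    // Case 3: table full -> decrement every counter (key itself is dropped).
    for (auto& c : counts) --c;
  }
};

int main() {
  // With k = 4, any key with frequency > n/4 is guaranteed to be tracked.
  MisraGries mg(4);
  for (uint32_t x : {1u, 2u, 1u, 3u, 1u, 4u, 1u, 5u, 1u}) mg.add(x);
  for (size_t i = 0; i < mg.keys.size(); ++i)
    if (mg.counts[i] > 0)
      std::printf("candidate %u (count %u)\n", mg.keys[i], mg.counts[i]);
}
```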
It was cool to go through this step by step and minimize the number of lines of generated machine code (with minimal conditional jumps). Once you understand the vector instructions, you can express the logic at a higher level without writing the instructions yourself, as these can be quite complicated. We let the compiler do its thing, but guide it with a short function.
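A reconstruction in the style I mean (not the actual code): a branch-free lookup over a small fixed-size slot array, written so the compiler can auto-vectorize the comparisons (e.g., into AVX2 compares plus a select) instead of emitting one conditional jump per slot:

```cpp
#include <cstdint>

constexpr int SLOTS = 8;  // small enough to fit in one or two vector registers

// Returns the index of `key` in `keys`, or -1 if absent. There is no early
// exit: scanning all slots keeps the loop branch-free and vectorizable.
int find_slot(const uint32_t keys[SLOTS], uint32_t key) {
  int found = -1;
  for (int i = 0; i < SLOTS; ++i)
    found = (keys[i] == key) ? i : found;  // compiles to a masked select
  return found;
}

int main() {
  uint32_t keys[SLOTS] = {4, 9, 2, 7, 5, 1, 8, 3};
  return find_slot(keys, 7);  // returns 3
}
```

Inspecting the assembly at -O3 -mavx2 is a quick way to confirm the compiler did vectorize it rather than unrolling into eight compare-and-branch pairs.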
Notes on the paper Joint Partitioning and Sampling Algorithm for Scaling Graph Neural Network.
See below.
DotHash: Estimating Set Similarity Metrics for Link Prediction and Document Deduplication; Nunes et al. (2023)
Metrics for set similarity are a core aspect of several data mining tasks. To remove duplicate results in a Web search, for example, a common approach looks at the Jaccard index between all pairs of pages. In social network analysis, a much-celebrated metric is the Adamic-Adar index, widely used to compare node neighborhood sets in the important problem of predicting links. However, with the increasing amount of data to be processed, calculating the exact similarity between all pairs can be intractable. The challenge of working at this scale has motivated research into efficient estimators for set similarity metrics. The two most popular estimators, MinHash and SimHash, are indeed used in applications such as document deduplication and recommender systems where large volumes of data need to be processed. Given the importance of these tasks, the demand for advancing estimators is evident. We propose DotHash, an unbiased estimator for the intersection size of two sets.
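My takeaway of the core idea (my own sketch, not the paper's code): map each element to a pseudo-random ±1/√d vector, sum the vectors per set, and the dot product of the two sums is an unbiased estimate of the intersection size (cross terms cancel in expectation), from which the Jaccard index follows:

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <random>
#include <vector>

// Sum of pseudo-random +/-1/sqrt(d) vectors, one per element of the set.
std::vector<float> sketch(const std::vector<uint64_t>& set, int d) {
  std::vector<float> sum(d, 0.0f);
  const float scale = 1.0f / std::sqrt((float)d);
  for (uint64_t e : set) {
    std::mt19937_64 rng(e);  // seed by element id: same element -> same vector
    for (int i = 0; i < d; ++i) sum[i] += (rng() & 1) ? scale : -scale;
  }
  return sum;
}

// E[dot] = |A intersect B|: same-element terms contribute 1, others 0.
float intersection_estimate(const std::vector<float>& a, const std::vector<float>& b) {
  float dot = 0;
  for (size_t i = 0; i < a.size(); ++i) dot += a[i] * b[i];
  return dot;
}

int main() {
  std::vector<uint64_t> A{1, 2, 3, 4, 5}, B{3, 4, 5, 6, 7};
  auto sa = sketch(A, 1024), sb = sketch(B, 1024);
  float inter = intersection_estimate(sa, sb);
  std::printf("estimated |A n B| = %.2f (true 3), Jaccard ~ %.2f (true 0.43)\n",
              inter, inter / ((float)A.size() + B.size() - inter));
}
```

If I read the paper right, the Adamic-Adar variant falls out of the same trick by scaling each neighbor's vector before summing, but I would need to re-check the exact weighting.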