BURG3R5/GSoC 2022 Project Report.md

## GSoC 2022 Project Report.md

      
    Raw
  

              GSoC 2022 Project Report.md
            
          
    GSoC'22 Project Report

Aditya Rajput | Matrix.org

The following report summarizes the work done by me during Google Summer of Code 2021 along with the results, scope for improvements, and future work.
This also serves as the final project report with all the contributions.
Outline


Personal Details


The Project


Links


Idea


Goals


Library


The Journey


The Future


Acknowledgments


Parting Words


Personal Details


Name
Gender
GitHub
Email
Matrix
LinkedIn


Aditya Rajput
Male (he/him)
BURG3R5
adiraj20072002
@burgers:matrix.org
aditya-rajput-2072


The Project


Repository
Project Blog
GSoC Page


matrix-encrypted-search
burgers-gsoc
projects/xjCmlvMW


Matrix Room
Mentor


#encrypted-search:matrix.org
Charles V. Wright (@cvwright:matrix.org)


Idea

Currently, keeping privacy in mind, Matrix clients use seshat to store a searchable database on the user’s device itself. So when a new device is added, it has to recreate the searchable index. Additionally, as the number of messages increases, creating, updating and searching through an index on the client side becomes tougher.
The solution to these problems is a Searchable Symmetric Encryption (SSE) scheme that lets the client safely store an encrypted database on an untrusted server and search through it without leaking any data. This project implements, in particular, the keyword search scheme outlined in a
Demertzis et al. paper titled “Fast Searchable Encryption with Tunable Locality”. This paper is chosen because, as evident in the name, it allows full control over the locality of the index, which means users of the database can determine the speed/privacy tradeoff themselves.
Goals


Implement the keyword search SSE scheme outlined in the aforementioned Demertzis et al. paper.


Create encrypted indices of a few public Matrix rooms.


Implement an updation procedure for the above SSE i.e. provide a way to add new messages/documents to the encrypted index.


Document the library and write exhaustive unit and integration tests.


Integrate the project into a suitable client.


Perform performance evaluation of the created library.


Of these goals, number 5 was changed to be "Create a demonstration that showcases how to use the library's various features". Further, number 6 could not be completed due to the previous goals being more complex than was predicted.
Library

The library has four main modules:
Index

index.py contains the EncryptedIndex class, which is used to convert a list of room events into a structurally encrypted index according to the aforementioned SSE scheme. The index created thusly consists of two data structures: a lookup table and a datastore.
The lookup table is a mapping from keywords to locations in the datastore where their results can be found. Meanwhile, the datastore is a multi-dimensional array which contains random batches of event ids for events in which the keywords occur.
Storage

storage.py contains the IndexStorage class, which is used to transform the single massive datastore into smaller blobs that can be stored as files on a Matrix homeserver. It exposes an iterator for the client to store the blobs on the server and then updates the lookup table of the index to point to remote locations using the MXC URIs instead of local array positions.
Merge

merge.py contains the IndexMerge class, which is used to merge two or more remote encrypted indices. It exposes an iterator for the client to fetch data blobs from one or more homeservers and merges them into a new encrypted index.
Search

search.py contains the EncryptedSearch class, which is used to search for a query across one or more encrypted indices at once. It performs this operation in two steps: First identifying the files to be fetched, then locating the relevant chunks within those files.
An in-depth technical explanation of the project will be uploaded as a Wiki on the repository soon.
The Journey

I had already implemented an unpolished, mathematically identical version of this project while writing my GSoC proposal, so I had a vague idea of the structure and the time-consuming parts of the project (or so I thought).
I was pretty confident that the data-wrangling portion of the SSE scheme (i.e. the Index module) would be the most difficult task, so I dedicated much of my pre-coding period and two whole coding weeks to it. It went swimmingly and was my rock-solid foundation for this journey.
Next up, I took up the search module. This was when I fully realized the problems that dealing with remote files would bring. I started to expand models and plan how I wanted the client to interact with these methods. One of this library's cornerstones was that it wouldn't directly communicate with the homeservers. That's why I had to be clever with how the Search module was designed.
Also with Search came the question of tokenization: how and when to break down words and character-sequences. After a little discussion with my mentor in the project room (link above), I finalized that too, and proceeded to the next step.
The natural next step was to deal with the issue of remote files. Now, the paper deals with an ideal, mathematical world. Its only goal is to make searching fast and secure.
The real world, I came to realize, has file limits.
(Also, it isn't efficient to fetch the entire datastore just to read one bucket.)
So I had to break up the huge datastore object into file-blobs that could fit in a homeserver with 64MB limits. I discuss in this blog post how I had to be careful with this operation. Basically, we don't want to break the scheme entirely by revealing storage location and search patterns. Plus, if we try to break down the datastore structure to its atoms, there's still a huge variance in the size of those atoms. They can be tiny like Hydrogen or huge like Plutonium.
But I had a discussion with my mentor and successfully implemented a storage scheme in the fifth week. And I found a bug. It took two weeks to fix that one.
Now we were deep in uncharted territory: the guidance of the SSE paper had run out. But thankfully, my earlier work from this project saved me. This class was an amalgamation of the Search and Storage modules, with some tweaked internal details. Which felt great!
And immediately I found the King of Bugs. Details are here because I don't want to relive that.
By the way, debugging any portion of this project meant stepping through 9000-line-long JSON files that would change their order on every re-run because of the inherent randomness of the SSE scheme.
In the final week of the coding period, I worked on integration tests that double as a reference implementation and a tour of the library's capabilities.
For more details, weekly progress blogs can be found here. Plus, project PRs made by me can be found here.
This project was a huge period of learning and growth for me. I polished old skills and learned some new ones. I found bugs and fixed them and found insecurities in the scheme and got help from my mentor and overall just... achieved more than
The Future


Integrating the library into matrix-nio


Increasing test coverage and testing edge cases.


Making a JavaScript implementation using this library as a reference


Post-GSoC, I intend to keep working on this project and doing my part in improving the vast ecosystem that is Matrix.
Acknowledgments

I would like to thank my mentor Charles V. Wright for helping and guiding me in the project throughout the GSoC journey.
I am thankful to Google Summer Of Code for providing me with an opportunity to work with Matrix.org.
Parting Words

Encrypted server-stored search is an important tool that the Matrix ecosystem needs in order to support the ever-growing community that relies on it. With this project, I hope that I have helped provide not only a reference for how that can be achieved, but also a very real tool that can be used by existing clients directly.