GSoC 2021 Summary Report - Implement Thread support for JMAP for Apache James
Date of report: 19/08/2021
My name is Tran Hong Quan. I’m a final-year student at Hanoi University of Science and Technology, majoring in Computer Engineering.
If you are interested in this project or in Google Summer of Code, you can contact me by email: firstname.lastname@example.org
JMAP is an email application protocol to modernise IMAP, on top of HTTP using a JSON format. JMAP is designed to make efficient use of limited network resources and to be horizontally scalable to a very large number of users. Apache James is one of the first implementations of this new standard.
Mail user agents generally allow displaying emails grouped by conversation (replies, forwards, etc.). The JMAP RFC-8621 specification has a dedicated concept for this: the thread. JMAP Threads are already implemented in Apache James, but in a rather naive way: each email is a thread of its own. This naive implementation is specification compliant but defeats the overall purpose of threads: emails related to the same topic should belong to the same thread.
James’s data models, storage APIs, and some JMAP methods at HTTP level need to be changed to make sure the purpose of the thread is reached.
How did I do it?
First, we need to know that messages belong to the same thread if they have an identical thread identifier (threadId). My work revolves around this threadId.
We need a dedicated module to guess the threadId of a new message. I call it ThreadIdGuessingAlgorithm.
My first step is to add a threadId property to James's Message model so that I can query the threadId of a message.
When a new message arrives, I query the old messages of that user to see whether any of them is related to the new message. If the new message is related to an old message, it gets the same threadId as the old message. Otherwise, it gets a newly generated threadId. I implemented two ways to query old messages:
First way: use a search engine (ElasticSearch, Lucene).
This is mostly for experimental purposes. Every time a new message arrives, we need to query the search engine, which is expensive. That is why I needed the second way for production environments.
Second way: implement a dedicated Cassandra table that stores the thread data of old messages, and use that data to decide whether a new message is related.
Cassandra tables designed around known queries are very fast to read. That is why this approach is good enough for production.
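Whichever storage is used for the lookup, the decision itself stays the same: reuse the threadId of a related old message, or generate a new one. Below is a minimal sketch of that decision. It is not the actual James code: `ThreadGuessSketch`, `StoredMessage`, `guessThreadId`, and the `isRelated` predicate are hypothetical names introduced only for illustration, and the new threadId is assumed to be a random UUID string.

```java
import java.util.List;
import java.util.UUID;
import java.util.function.BiPredicate;

// Illustrative sketch only: these types are hypothetical, not Apache James classes.
final class ThreadGuessSketch {

    // Minimal stand-in for an old message that already has a threadId.
    static final class StoredMessage {
        final String mimeMessageId;
        final String threadId;

        StoredMessage(String mimeMessageId, String threadId) {
            this.mimeMessageId = mimeMessageId;
            this.threadId = threadId;
        }
    }

    // Reuse the threadId of the first related old message,
    // otherwise generate a fresh one (assumed here to be a random UUID).
    static String guessThreadId(String newMimeMessageId,
                                List<StoredMessage> oldMessages,
                                BiPredicate<String, StoredMessage> isRelated) {
        return oldMessages.stream()
            .filter(old -> isRelated.test(newMimeMessageId, old))
            .map(old -> old.threadId)
            .findFirst()
            .orElseGet(() -> UUID.randomUUID().toString());
    }
}
```

Passing the relation check in as a predicate keeps the sketch independent of how old messages are looked up (search engine or Cassandra table).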
Next, I need to decide whether two messages are related to each other. For this I read and followed the JMAP specification carefully. The specification suggests the following conditions:
The exact algorithm for determining whether two Emails belong to the same Thread is not mandated in this spec to allow for compatibility with different existing systems. For new implementations, it is suggested that two messages belong in the same Thread if both of the following conditions apply: 1. An identical message id [RFC5322] appears in both messages in any of the Message-Id, In-Reply-To, and References header fields. 2. After stripping automatically added prefixes such as “Fwd:”, “Re:”, “[List-Tag]”, etc., and ignoring white space, the subjects are the same. This avoids the situation where a person replies to an old message as a convenient way of finding the right recipient to send to but changes the subject and starts a new conversation.
My idea is to take the header fields above from the old message and the new message, strip the subjects, and then check whether they satisfy both conditions. James already has a piece of code that strips subjects this way, so I can leverage it.
With the threadId guessing in place, we can build on that work to implement further thread-related tasks.
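The two conditions can be sketched as follows. This is a simplified illustration, not the code used in James: `ThreadRelationSketch`, `baseSubject`, and `related` are hypothetical names, and the regular expression only handles leading "Re:", "Fwd:", "Fw:", and "[List-Tag]" prefixes, which is an assumption that is narrower than the existing subject stripping code in James. Each message is represented by the set of message ids found in its Message-Id, In-Reply-To, and References header fields, plus its subject.

```java
import java.util.Collections;
import java.util.Set;
import java.util.regex.Pattern;

// Illustrative sketch only: a simplified stand-in, not the James implementation.
final class ThreadRelationSketch {

    // One or more leading "Re:", "Fwd:", "Fw:" or "[List-Tag]" prefixes (assumed set).
    private static final Pattern PREFIXES = Pattern.compile(
        "^(\\s*(re|fwd|fw)\\s*:|\\s*\\[[^\\]]*\\])+", Pattern.CASE_INSENSITIVE);

    // Strip the leading prefixes, then ignore all white space.
    static String baseSubject(String subject) {
        String stripped = PREFIXES.matcher(subject).replaceFirst("");
        return stripped.replaceAll("\\s+", "");
    }

    // Condition 1: both messages share at least one message id.
    // Condition 2: the stripped subjects are the same.
    static boolean related(Set<String> idsA, String subjectA,
                           Set<String> idsB, String subjectB) {
        boolean sharesMessageId = !Collections.disjoint(idsA, idsB);
        return sharesMessageId && baseSubject(subjectA).equals(baseSubject(subjectB));
    }
}
```

For example, a reply with subject "Re: Hello world" that references the original message id is related to the original "Hello world" message, while a reply that keeps the reference but changes the subject is not.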
What work was done?
By the end of Google Summer of Code, I had finished the "Implement Thread/get method" task - the last task in my proposal's schedule. All of my code has been carefully reviewed by my mentors and merged into the Apache James master branch.
Here is a summary of my work with the related Pull Requests:
Change James data models
- JAMES-3516 Add threadId property to MessageResult POJO, MailboxMessage POJO
- JAMES-3516 Using the MessageResult::getThreadId property at the JMAP level
- JAMES-3516 Exposing AppendResult::getThreadId allow advertising the threadId as part of Email/set and Email/Import results
Implement ThreadId guessing logic
- JAMES-3516 Implement naive ThreadIdGuessingAlgorithm
- JAMES-3516 Add threadId column to cassandra tables
- JAMES-3516 Allow setting threadId through MessageFactory::createMessage params
- JAMES-3516 Plug threadIdGuessingAlgorithm to MessageManager
- JAMES-3516 Implement SearchThreadIdGuessingAlgorithm
- JAMES-3516 Implement MailboxManager::getThread API and plug it into JMAP Thread/get
- JAMES-3516 Enable threadId search query for MessageSearchIndex
- JAMES-3516 Update upgrade instructions follow adding threadId to ElasticSearch breaking change
- JAMES-3516 Implement Thread Cassandra table
- JAMES-3516 Thread related data should be cleaned when MailboxDeletion or Expunged event happens
Implement non-naive Thread/get method
What is left to do?
The original objective of the GSoC project is fully achieved (use of Thread/get).
While GSoC enabled a basic user experience with threads on top of JMAP, several enhancements could not be covered because they were outside the GSoC schedule. There is some work related to threads that I will continue to contribute to Apache James after Google Summer of Code:
- Implement Thread/changes based on previous table
- Push state changes for threads
- Investigate ElasticSearch aggregation for collapse thread
- Implement collapse thread option on top of the message search index
- Implement collapse thread for memory
- Email/query should expose thread options
Based on the work described above, I consider my Google Summer of Code 2021 project successfully completed.
First, I want to thank my mentors (Rene Cordier and Benoit Tellier), Google, and The Apache Software Foundation for offering me this opportunity. I want to thank my mentors once more: they were very enthusiastic in guiding me throughout the project, and I learned a lot from them.
Of course, I met a lot of challenges throughout the project. Here are the main ones:
- Solving unknown and complicated problems
- Working with a large codebase and a complicated system (sometimes I had to take a bottom-up approach: reading code and tests first to understand how things work)
- Write clean and working code
- Learning new system design, new technology stacks in a short time
But to me, facing these challenges was a great opportunity to learn and grow quickly. Here are the main things I learned from doing this project:
- Problem solving
- Improve my system design view
- Develop a good mindset about writing clean and performance-oriented code
- Learn new technology stacks: Cassandra, ElasticSearch, functional reactive programming
- Learn how to interact with open source community
In conclusion, I learned a lot from my GSoC project. I am looking forward to learning and working on more challenging things in the future.