markpapadakis/Trinity intro.md

## Trinity intro.md

      
Trinity is a modern C++ information-retrieval library for building queries, indexing documents and other content, running queries, and scoring the documents that match them. It facilitates the development of search engines and other systems and applications that depend on that functionality, and has been designed with simplicity, performance, modularity, extensibility, and elegance in mind.
This is the initial release, so new features and API extensions may come in later releases. It has been under development for 2 weeks, and will be updated frequently with improvements, enhancements, and fixes for any reported issues.
It's named after Trinity in the Matrix movies, and also after the trinity of (query, index, search), the core functions the library supports.

Trinity makes it easy to index documents and access all documents that match a query. It takes care of recall, that is, finding every document that matches, and it gives you all the information your score/ranking function needs to compute a score for each matched document (precision).
You can ask very elaborate, intricate questions (queries), and Trinity will return all documents matching them as fast as possible.
The effectiveness of a search engine is a function of precision and recall. With Trinity, a query can be very complex (or as simple as a single word/token). It is up to you to ask the right question, that is, to construct the right queries. It is also up to you to compute a rank/score for each document.
So, with Trinity handling the low-level problems, your job is to build the right queries (e.g. construct queries, or rewrite input queries) and come up with a great score function, although depending on your application you may not want to score documents at all, only access some or all of the matched ones.
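As an illustration of query rewriting, here is a minimal sketch that expands each input token with known variants, e.g. turning [lord of the ring] into [lord OR lords of the ring OR rings]. The `expand_query` helper and its variants table are hypothetical and not part of Trinity; the sketch only builds a query string, and how Trinity parses that string should be checked against the library itself.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper, NOT part of Trinity: appends " OR variant" for every
// known variant of a token and joins the tokens with single spaces.
std::string expand_query(
    const std::vector<std::string> &tokens,
    const std::unordered_map<std::string, std::vector<std::string>> &variants) {
    std::string out;

    for (const auto &token : tokens) {
        // Empty tokens would produce stray spaces in the output.
        assert(!token.empty());

        if (!out.empty())
            out += ' ';
        out += token;

        const auto it = variants.find(token);
        if (it == variants.end())
            continue; // no known variants, keep the token as-is

        for (const auto &alt : it->second) {
            out += " OR ";
            out += alt;
        }
    }
    return out;
}
```

In a real application the variants might come from a stemmer, a synonyms dictionary, or query logs; a fixed table is used here only to keep the sketch short.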
The final score of a matched document should depend on the Trinity match score, and potentially on multiple other scores specific to the document, the user (personalisation), and other contexts. You should build a model that takes into account all dynamic and static information about the query, the documents, and other state.
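One simple way to combine such scores is a weighted linear combination. The sketch below assumes three signals (match score, popularity, personalisation); the `DocumentSignals` and `FusionWeights` types and the `fuse_scores` function are hypothetical names chosen for illustration, not part of Trinity, and a real model may well be non-linear or learned.

```cpp
#include <cassert>

// All names here are hypothetical; none of them come from Trinity.
struct DocumentSignals {
    double match_score;     // computed from the matched terms for this query
    double popularity;      // static, per-document signal
    double personalisation; // per-user signal
};

struct FusionWeights {
    double match;
    double popularity;
    double personalisation;
};

// Weighted linear combination of independent scores.
double fuse_scores(const DocumentSignals &s, const FusionWeights &w) {
    // Negative weights would invert a signal; treat that as a caller bug here.
    assert(w.match >= 0 && w.popularity >= 0 && w.personalisation >= 0);

    return w.match * s.match_score
         + w.popularity * s.popularity
         + w.personalisation * s.personalisation;
}
```

Keeping the weights separate from the signals makes it easy to tune them per application, or to replace this function with a more elaborate model later.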
Consider, for example, an index of 1 billion documents and a query such as [lord OR lords of the ring OR rings ("the return of the king" OR "the fellowship of the ring") NOT "bilbo baggins" gandalf AND (legolas OR aragorn)]. A Trinity query can be much longer and more intricate than this example, but suppose a user either issues such a query directly, or you end up rewriting an original query into it.
Trinity will return all, say, 5200 documents that match this query by invoking your consider() function and providing information about the matched terms and the document ID. You may decide to, for example, compute the match score based on word proximity and whether the hits are in title or body content (hint: check the various queries.h functions for some advanced functionality and ideas). You may also want to take into account the document itself, for example its popularity and a personalisation score, and then fuse all those scores together to compute the final score for that document. You may then want to keep the top-K documents, or maybe all of them.
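To make the proximity and top-K ideas concrete, here is a self-contained sketch of the kind of logic you might call from your consider() function. It does not use Trinity's headers or types: `TermHits`, `min_distance`, `match_score`, and `TopK` are hypothetical stand-ins, and the scoring formula is a toy chosen only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical stand-in for per-term match information; NOT a Trinity type.
struct TermHits {
    std::vector<uint32_t> positions; // sorted token positions in the document
    bool in_title;                   // true if the hits are in the title field
};

// Smallest distance between any hit of `a` and any hit of `b`.
// Both position lists must be sorted and non-empty.
uint32_t min_distance(const TermHits &a, const TermHits &b) {
    assert(!a.positions.empty() && !b.positions.empty());
    std::size_t i = 0, j = 0;
    uint32_t best = UINT32_MAX;

    // Two-pointer walk: always advance the list with the smaller position.
    while (i < a.positions.size() && j < b.positions.size()) {
        const uint32_t pa = a.positions[i], pb = b.positions[j];
        best = std::min(best, pa > pb ? pa - pb : pb - pa);
        if (pa < pb)
            ++i;
        else
            ++j;
    }
    return best;
}

// Toy match score: closer terms score higher, title hits get a fixed boost.
double match_score(const TermHits &a, const TermHits &b) {
    const double proximity = 1.0 / (1.0 + min_distance(a, b));
    const double title_boost = (a.in_title && b.in_title) ? 2.0 : 1.0;
    return proximity * title_boost;
}

// Keeps the K best (score, document ID) pairs seen so far using a min-heap.
class TopK {
public:
    explicit TopK(std::size_t k) : k_(k) {}

    void add(uint32_t doc_id, double score) {
        if (k_ == 0)
            return;
        if (heap_.size() < k_) {
            heap_.emplace(score, doc_id);
        } else if (score > heap_.top().first) {
            heap_.pop(); // evict the current worst retained document
            heap_.emplace(score, doc_id);
        }
    }

    // Returns the retained documents, best first.
    std::vector<std::pair<double, uint32_t>> results() const {
        auto copy = heap_;
        std::vector<std::pair<double, uint32_t>> out;
        while (!copy.empty()) {
            out.push_back(copy.top());
            copy.pop();
        }
        std::reverse(out.begin(), out.end());
        return out;
    }

private:
    using Entry = std::pair<double, uint32_t>;
    std::size_t k_;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap_;
};
```

The min-heap keeps memory bounded by K regardless of how many documents match, which matters when a query matches thousands of documents in a very large index. If you want all matched documents instead, you can simply append each (score, document ID) pair to a vector.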