Centralized search is great, but it is an avenue for censorship
Problems to solve:
- Have indexing of data be performed by everyone to avoid centralization
- Map keywords to hashes of data that contain those keywords
- Ability to sort results based on some criteria
  - Who viewed this page before
  - How the page uses the keywords
    - Number of times they appear
    - If in a webpage, which section they appear in
  - What links to this data
  - Size of data
- Ability to search for multiple keywords
- Automatically correct common spelling errors and handle word synonyms
- Prevent spam from crippling the network
- Prevent malicious nodes from serving invalid content
- Every result should include data proving that the responding node actually has the content
The plugin should passively listen to all the websites a person visits and queue up jobs for indexing the pages.
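A minimal sketch of that passive listener, assuming a WebExtension background script with the `webNavigation` permission; the `IndexJob`/`indexQueue`/`indexPage` names and the drain interval are placeholders, not part of any existing code:

```ts
// background.ts — passive capture of visited pages (WebExtension APIs)
// Assumes the "webNavigation" permission in the manifest; queue/indexer names are placeholders.

type IndexJob = { url: string; tabId: number; queuedAt: number };

const indexQueue: IndexJob[] = [];

// Queue every completed top-level navigation for indexing.
chrome.webNavigation.onCompleted.addListener((details) => {
  if (details.frameId !== 0) return;            // ignore iframes
  if (!details.url.startsWith("http")) return;  // ignore chrome://, about:, etc.
  indexQueue.push({ url: details.url, tabId: details.tabId, queuedAt: Date.now() });
});

// Drain the queue in the background so indexing never blocks browsing.
// (A Manifest V3 service worker would use chrome.alarms rather than setInterval.)
setInterval(async () => {
  const job = indexQueue.shift();
  if (job) await indexPage(job);
}, 5_000);

async function indexPage(job: IndexJob): Promise<void> {
  // Placeholder: runs the "When seeing a new web page" steps listed below.
}
```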
When seeing a new web page:
- Query the DHT to see if somebody has indexed it before
  - If it has been seen before, have a percent chance of ignoring it
- Process it to get the main textual content out using semantic HTML rules
- Filter out common words that don't add to the meaning
  - Use NLP libraries for English as a start; the community should contribute other languages
  - Maybe ignore numbers?
- Build up a map of word counts and a map of word-pair counts
  - unique word : occurrence count
  - unique word pair : occurrence count
  - Word pairs should be alphanumerically sorted
- Iterate through the words and word pairs and save them to the DB (see the sketch after this list)
  - word hash : count as 32-bit hex : content hash
  - content hash : count as 32-bit hex : word hash
  - This allows searching for pages by word, or for words by page (and then sorting by count)
- Iterate through all links in the document
  - Save a key of target:source
  - Publish to the DHT that this source links to that page
- Publish to the DHT that this page has been indexed (don't include the source of who indexed it, for privacy)
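A rough sketch of the tokenize/count/key-building steps above, using Node's crypto module for hashing. The stopword list is a tiny stand-in for a real NLP library, word pairs are taken from adjacent words (the notes leave this open), and the key layout is only an illustration of the word-hash/content-hash scheme:

```ts
import { createHash } from "crypto";

// Hex SHA-256 helper (the real system would likely use multihashes/CIDs).
const sha256 = (s: string) => createHash("sha256").update(s).digest("hex");

// Tiny illustrative stopword list; a real build would use an NLP library per language.
const STOPWORDS = new Set(["the", "a", "an", "and", "or", "of", "to", "in", "is", "it"]);

function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((w) => w.length > 1 && !STOPWORDS.has(w) && !/^\d+$/.test(w)); // maybe ignore numbers
}

// unique word -> occurrence count
function wordCounts(words: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const w of words) counts.set(w, (counts.get(w) ?? 0) + 1);
  return counts;
}

// unique word pair (adjacent words, alphanumerically sorted) -> occurrence count
function pairCounts(words: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (let i = 0; i + 1 < words.length; i++) {
    const pair = [words[i], words[i + 1]].sort().join(" ");
    counts.set(pair, (counts.get(pair) ?? 0) + 1);
  }
  return counts;
}

// Build the two key layouts from the notes as (key, value) pairs:
//   <word hash>:<count as 32-bit hex>    -> content hash   (find pages by word)
//   <content hash>:<count as 32-bit hex> -> word hash      (list words for a page)
function dbEntries(contentHash: string, counts: Map<string, number>): Array<[string, string]> {
  const entries: Array<[string, string]> = [];
  for (const [word, count] of counts) {
    const hex = count.toString(16).padStart(8, "0"); // zero-padded so keys sort by count
    entries.push([`${sha256(word)}:${hex}`, contentHash]);
    entries.push([`${contentHash}:${hex}`, sha256(word)]);
  }
  return entries;
}

// Example usage on extracted page text:
const text = "Distributed search over a DHT maps keywords to content hashes";
const words = tokenize(text);
const contentHash = sha256(text);
console.log(wordCounts(words));
console.log(pairCounts(words));
console.log(dbEntries(contentHash, wordCounts(words)).slice(0, 4));
```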
Periodically rank the confidence in which pages match which words:
- Search through the locally seen words and order them by the number of pages that contain them
- Iterate through the words with the fewest pages first
- Sort the pages by which one is more relevant
  - Number of times the word shows up relative to other words
    - Huge keyword counts should be discarded
  - Number of pages that link to this page
    - A huge number of outgoing links should be discarded
- Publish to the DHT pointing the word to the hash
  - Include the number of pages that reference this
  - Send the actual hashes if possible
  - Otherwise, publish a block with the list and link to that
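A sketch of that per-word page scoring; the spam thresholds and the logarithmic boost for inbound links are arbitrary placeholders rather than tuned values:

```ts
// Relevance scoring sketch for one (word, page) pair; thresholds are placeholders.
interface PageStats {
  contentHash: string;
  wordCount: number;       // occurrences of the word on this page
  totalWords: number;      // all indexed words on the page
  inboundLinks: number;    // pages that link to this page
  outboundLinks: number;   // links this page makes
}

const MAX_KEYWORD_RATIO = 0.2;   // discard pages stuffed with the keyword
const MAX_OUTBOUND_LINKS = 500;  // discard link farms

function score(p: PageStats): number | null {
  const ratio = p.wordCount / p.totalWords;
  if (ratio > MAX_KEYWORD_RATIO) return null;            // keyword spam
  if (p.outboundLinks > MAX_OUTBOUND_LINKS) return null;  // link spam
  // Frequency of the word relative to the rest of the page, boosted by inbound links.
  return ratio * Math.log2(2 + p.inboundLinks);
}

// Rank the candidate pages for a word, dropping the discarded ones.
function rankPages(pages: PageStats[]): PageStats[] {
  return pages
    .map((p) => ({ p, s: score(p) }))
    .filter((x): x is { p: PageStats; s: number } => x.s !== null)
    .sort((a, b) => b.s - a.s)
    .map((x) => x.p);
}
```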
Performing a search:
- Process the search terms using the same NLP steps as before
- Build up pairs
- Query for results locally and display confident matches right away
- Query the DHT for matches for keyword pairs
  - If no results for pairs exist, search for individual words and take the union
- Build up results in a list, taking a second or two to gather a batch of results
- Rank the initial results by confidence (links to pages, number of times the hash appeared in the DHT)
- Start fetching results using streams
  - Parse enough to read the <title> of the page before showing the result
  - Otherwise take the first n characters of text and display that
- Add a verification step to confirm that the links are valid and that the keywords actually exist (not for local results, of course)
  - Cache verified results locally
  - Invalid items should be filtered out
  - Might make sense to keep a DB of blacklists that will ignore results for keywords (prevents having to verify the page again)
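A sketch of the query path, reusing `tokenize()` and `sha256()` from the indexing sketch; the `dht.get()` interface and the tallying heuristic are assumptions:

```ts
// Search flow sketch; the dht parameter is a hypothetical key -> content-hash lookup.
interface SearchResult { contentHash: string; confidence: number; title?: string }

async function search(
  query: string,
  dht: { get(key: string): Promise<string[]> }
): Promise<SearchResult[]> {
  const words = tokenize(query);                 // same NLP step as indexing
  const pairs: string[] = [];
  for (let i = 0; i + 1 < words.length; i++) {
    pairs.push([words[i], words[i + 1]].sort().join(" "));
  }

  // Prefer pair matches; fall back to a union of single-word matches.
  const hashes: string[] = [];
  for (const p of pairs) hashes.push(...(await dht.get(sha256(p))));
  if (hashes.length === 0) {
    for (const w of words) hashes.push(...(await dht.get(sha256(w))));
  }

  // Crude confidence: how many keys returned the same content hash.
  const tally = new Map<string, number>();
  for (const h of hashes) tally.set(h, (tally.get(h) ?? 0) + 1);
  return [...tally.entries()]
    .map(([contentHash, confidence]) => ({ contentHash, confidence }))
    .sort((a, b) => b.confidence - a.confidence);
  // Not shown: stream-fetch each result, parse the <title>, verify keywords, cache or blacklist.
}
```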
By the way, to have some indication of how difficult the distributed search problem really is - and possibly also as a point of collaboration - I recommend you have a look at YaCy, a really serious effort at building a distributed search engine. https://yacy.net/
Perhaps it would make sense to extend this project with IPFS integration, both as a store for the index as well as with an IPFS crawler.