Centralized search is great, but it is an avenue for censorship
Problems to solve:
- Have indexing of data be performed by everyone to avoid centralization
- Map keywords to hashes of data that contains those keywords
- Ability to sort results based on some criteria:
  - Who viewed this page before
  - How the page uses the keywords
    - Number of times
    - If in a webpage, which section it is in
  - What links to this data
  - Size of data
- Ability to search for multiple keywords
- Automatically correct common spelling errors and expand word synonyms
- Prevent spam from crippling the network
- Prevent malicious nodes from serving invalid content
- Every result should carry data proving that the node actually has the content
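The record shapes the problem list implies could be sketched roughly as below. All names here (`WordEntry`, `ResultProof`, `makeProof`) are hypothetical illustrations, not an existing IPFS API; the proof idea is just one way a node could demonstrate it actually holds content matching a keyword.

```typescript
// Hypothetical shape of a keyword-to-content record in the DHT.
interface WordEntry {
  word: string;          // normalized keyword
  contentHash: string;   // hash of the data containing the word
  count: number;         // occurrences of the word in that data
  inboundLinks: number;  // pages known to link to this data
  sizeBytes: number;     // size of the data
}

// A result carries enough context to show the publisher actually holds
// the content: e.g. the byte offsets where the keyword occurs.
interface ResultProof {
  offsets: number[];     // positions of the keyword in the raw content
  excerpt: string;       // short snippet around the first occurrence
}

// Minimal helper: derive a proof from content the node claims to hold.
function makeProof(content: string, word: string): ResultProof | null {
  const haystack = content.toLowerCase();
  const needle = word.toLowerCase();
  const offsets: number[] = [];
  let i = haystack.indexOf(needle);
  while (i !== -1) {
    offsets.push(i);
    i = haystack.indexOf(needle, i + 1);
  }
  if (offsets.length === 0) return null; // cannot prove what isn't there
  const start = Math.max(0, offsets[0] - 20);
  return { offsets, excerpt: content.slice(start, offsets[0] + word.length + 20) };
}
```

A verifier on the other end would fetch the content hash and check that the claimed offsets really contain the keyword.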
Plugin should passively listen to all the websites a person visits, and queue up jobs for indexing the pages.
When seeing a new web page:
- Query the DHT to see if somebody has indexed it before
  - If it has been seen before, have a percent chance of ignoring it
- Process it to get the main textual content out using semantic HTML rules
- Filter out common words that don't add to the meaning
  - Use NLP libraries for English as a start; the community should contribute other languages
  - Maybe ignore numbers?
- Build up two maps:
  - unique word : occurrence count
  - unique word pair : occurrence count
    - Word pairs should be alphanumerically sorted
- Iterate through words and word pairs and save them to the DB
  - word hash:count as 32-bit hex:content hash
  - content hash:count as 32-bit hex:word hash
  - This allows searching for pages by word, or searching for words by page (and then sorting by count)
- Iterate through all links in the document
  - Save a key of target:source
  - Publish to the DHT that this source links to that page
- Publish to the DHT that this page has been indexed (don't include who did it, for privacy)
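The indexing steps above could be sketched as follows. This is a minimal sketch under two assumptions: `STOP_WORDS` stands in for a real NLP stop-word list, and the word/content hashes are hex strings computed elsewhere (e.g. the content's IPFS CID).

```typescript
// Placeholder stop-word list; a real NLP library would supply this.
const STOP_WORDS = new Set(["the", "a", "an", "and", "of", "to", "is", "in"]);

function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((w) => w.length > 0 && !STOP_WORDS.has(w) && !/^\d+$/.test(w)); // maybe ignore numbers
}

// unique word -> occurrence count
function wordCounts(words: string[]): Map<string, number> {
  const m = new Map<string, number>();
  for (const w of words) m.set(w, (m.get(w) ?? 0) + 1);
  return m;
}

// unique word pair -> occurrence count; each pair is alphanumerically
// sorted so ("b","a") and ("a","b") collapse into one key
function pairCounts(words: string[]): Map<string, number> {
  const m = new Map<string, number>();
  for (let i = 0; i + 1 < words.length; i++) {
    const key = [words[i], words[i + 1]].sort().join("|");
    m.set(key, (m.get(key) ?? 0) + 1);
  }
  return m;
}

// The two DB keys described above: word hash : count as 32-bit hex :
// content hash, plus the inverse for looking words up by page. Zero-padded
// hex counts sort lexicographically in count order.
function dbKeys(wordHash: string, contentHash: string, count: number): [string, string] {
  const hex = count.toString(16).padStart(8, "0");
  return [`${wordHash}:${hex}:${contentHash}`, `${contentHash}:${hex}:${wordHash}`];
}
```

The fixed-width hex count is what makes "sort pages by count" a cheap prefix scan over the key space rather than a full read.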
Periodically rank confidence in pages:
- Search through the locally seen words and order them by the number of pages that contain them
- Iterate through the words with the fewest pages
- Sort the pages by which one is more relevant
  - Number of times the word shows up relative to other words
  - Huge keyword counts should be discarded
  - Number of pages that link to this page
  - Pages with a huge number of outgoing links should be discarded
- Publish to the DHT, pointing the word to the hash
  - Include the number of pages that reference this
  - Send the actual hashes if possible
  - Otherwise, publish a block with the list and link to that
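One way the ranking pass above could score and filter pages is sketched below. The thresholds (`MAX_TERM_RATIO`, `MAX_OUTLINKS`) and the log-damped scoring formula are made-up illustrations, not values from the notes.

```typescript
interface PageStats {
  hash: string;        // content hash of the page
  termCount: number;   // occurrences of the word on this page
  totalWords: number;  // total word count of the page
  inLinks: number;     // pages known to link to this page
  outLinks: number;    // links going out of this page
}

const MAX_TERM_RATIO = 0.5; // discard pages stuffed with the keyword
const MAX_OUTLINKS = 500;   // discard link farms

function rankPages(pages: PageStats[]): PageStats[] {
  return pages
    .filter((p) => p.termCount / p.totalWords <= MAX_TERM_RATIO) // huge keyword counts discarded
    .filter((p) => p.outLinks <= MAX_OUTLINKS)                   // huge outgoing-link counts discarded
    .sort((a, b) => score(b) - score(a));
}

function score(p: PageStats): number {
  // Term frequency, boosted by inbound links (log-damped so a few
  // thousand links don't completely dominate the term signal).
  return (p.termCount / p.totalWords) * (1 + Math.log(1 + p.inLinks));
}
```

The top entries of `rankPages` per word would then be published to the DHT, either inline or as a linked block when the list is large.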
Performing a search:
- Process search terms using NLP from before
- Build up pairs
- Query for results locally and display confident matches right away
- Query the DHT for matches for keyword pairs
  - If no results for pairs exist, search for individual words and take the union
- Build up results in a list; take a second or two to gather a batch of results
- Rank the initial results by confidence (links to pages, number of times they appeared in the DHT)
- Start fetching results using streams
  - Parse enough to read the <title> of the page before showing the result
  - Otherwise, take the first n characters of text and display that
- Add a verification step to confirm that the links are valid and that the keywords actually exist (not for local results, of course)
  - Cache verified results locally
  - Filter out invalid items
  - It might make sense to keep a DB of blacklists that ignores results for certain keywords (prevents having to verify the page again)
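The pair-first query strategy above could be sketched like this. `Dht` here is a hypothetical key-value lookup standing in for a real DHT query, and the pair keys reuse the alphanumeric sorting convention from the indexing step.

```typescript
type Dht = (key: string) => Promise<string[]>; // key -> content hashes

async function search(dht: Dht, terms: string[]): Promise<string[]> {
  const words = [...new Set(terms.map((t) => t.toLowerCase()))];

  // Build alphanumerically sorted pairs, matching how they were indexed.
  const pairs: string[] = [];
  for (let i = 0; i < words.length; i++)
    for (let j = i + 1; j < words.length; j++)
      pairs.push([words[i], words[j]].sort().join("|"));

  // Prefer pair matches: they are more specific, so fewer false hits.
  const pairHits = (await Promise.all(pairs.map(dht))).flat();
  if (pairHits.length > 0) return [...new Set(pairHits)];

  // Fall back to the union of single-word matches.
  const wordHits = (await Promise.all(words.map(dht))).flat();
  return [...new Set(wordHits)];
}
```

Results would then be ranked, streamed, and verified as described; verified hashes go into the local cache so repeat searches skip the DHT entirely.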
One big problem that I see: a file lives on only one node, or the client node gets disconnected from the part of the network where the file is, and the search ends in a "file could not be found" error.
This problem will diminish once the network has grown to a certain size. But in the beginning, this will be the most critical error: it will frustrate non-technical people and drive them away from the system. IPFS promises "no more 404 errors, no more broken links", but there are going to be broken links all the time, broken because the file can't be reached.
Another problem: if you decentralize the centralized, everybody has to take on the weight of the "center nodes", i.e. the servers of Google, Facebook, etc.
That means everybody will have to store all the data, or at least a part of it, on their PCs.
Possible solution ideas:
- A meta-distributed structure: have dedicated search nodes running that do the indexing and other heavy work.
- Make the search database part of an IPFS node, with a protocol so the nodes can communicate and share the weight.
As I wrote ... just ideas.
You talk a lot about HTML; what about other types of files?