Centralized search is great, but it is an avenue for censorship
Problems to solve:
- Have indexing of data be performed by everyone to avoid centralization
- Map keywords to hashes of data that contains those keywords
- Ability to sort results based on some criteria:
  - Who viewed this page before
  - How the page uses the keywords
    - Number of times
    - If in a webpage, which section it is in
  - What links to this data
  - Size of data
- Ability to search for multiple keywords
- Automatically correct common spelling errors and expand word synonyms
- Prevent spam from crippling the network
- Prevent malicious nodes from serving invalid content
- Every result should carry data proving that the node actually has the content
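The record shapes the problem list implies could be sketched roughly as below. All names here (`WordEntry`, `ResultProof`, `makeProof`) are hypothetical illustrations, not an existing IPFS API; the proof idea is just one way a node could demonstrate it actually holds content matching a keyword.

```typescript
// Hypothetical shape of a keyword-to-content record in the DHT.
interface WordEntry {
  word: string;          // normalized keyword
  contentHash: string;   // hash of the data containing the word
  count: number;         // occurrences of the word in that data
  inboundLinks: number;  // pages known to link to this data
  sizeBytes: number;     // size of the data
}

// A result carries enough context to show the publisher actually holds
// the content: e.g. the byte offsets where the keyword occurs.
interface ResultProof {
  offsets: number[];     // positions of the keyword in the raw content
  excerpt: string;       // short snippet around the first occurrence
}

// Minimal helper: derive a proof from content the node claims to hold.
function makeProof(content: string, word: string): ResultProof | null {
  const haystack = content.toLowerCase();
  const needle = word.toLowerCase();
  const offsets: number[] = [];
  let i = haystack.indexOf(needle);
  while (i !== -1) {
    offsets.push(i);
    i = haystack.indexOf(needle, i + 1);
  }
  if (offsets.length === 0) return null; // cannot prove what isn't there
  const start = Math.max(0, offsets[0] - 20);
  return { offsets, excerpt: content.slice(start, offsets[0] + word.length + 20) };
}
```

A verifier on the other end would fetch the content hash and check that the claimed offsets really contain the keyword.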
Plugin should passively listen to all the websites a person visits, and queue up jobs for indexing the pages.
When seeing a new web page:
- Query the DHT to see if somebody has indexed it before
  - If it has been seen before, have a percent chance of ignoring it
- Process it to get the main textual content out using semantic HTML rules
- Filter out common words that don't add to the meaning
  - Use NLP libraries for English as a start; the community should contribute other languages
  - Maybe ignore numbers?
- Build up two maps:
  - unique word : occurrence count
  - unique word pair : occurrence count
    - Word pairs should be alphanumerically sorted
- Iterate through words and word pairs and save them to the DB
  - word hash:count as 32-bit hex:content hash
  - content hash:count as 32-bit hex:word hash
  - This allows searching for pages by word, or searching for words by page (and then sorting by count)
- Iterate through all links in the document
  - Save a key of target:source
  - Publish to the DHT that this source links to that page
- Publish to the DHT that this page has been indexed (don't include who did it, for privacy)
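The indexing steps above could be sketched as follows. This is a minimal sketch under two assumptions: `STOP_WORDS` stands in for a real NLP stop-word list, and the word/content hashes are hex strings computed elsewhere (e.g. the content's IPFS CID).

```typescript
// Placeholder stop-word list; a real NLP library would supply this.
const STOP_WORDS = new Set(["the", "a", "an", "and", "of", "to", "is", "in"]);

function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((w) => w.length > 0 && !STOP_WORDS.has(w) && !/^\d+$/.test(w)); // maybe ignore numbers
}

// unique word -> occurrence count
function wordCounts(words: string[]): Map<string, number> {
  const m = new Map<string, number>();
  for (const w of words) m.set(w, (m.get(w) ?? 0) + 1);
  return m;
}

// unique word pair -> occurrence count; each pair is alphanumerically
// sorted so ("b","a") and ("a","b") collapse into one key
function pairCounts(words: string[]): Map<string, number> {
  const m = new Map<string, number>();
  for (let i = 0; i + 1 < words.length; i++) {
    const key = [words[i], words[i + 1]].sort().join("|");
    m.set(key, (m.get(key) ?? 0) + 1);
  }
  return m;
}

// The two DB keys described above: word hash : count as 32-bit hex :
// content hash, plus the inverse for looking words up by page. Zero-padded
// hex counts sort lexicographically in count order.
function dbKeys(wordHash: string, contentHash: string, count: number): [string, string] {
  const hex = count.toString(16).padStart(8, "0");
  return [`${wordHash}:${hex}:${contentHash}`, `${contentHash}:${hex}:${wordHash}`];
}
```

The fixed-width hex count is what makes "sort pages by count" a cheap prefix scan over the key space rather than a full read.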
Periodically rank confidence in pages:
- Search through the locally seen words and order them by the number of pages that contain them
- Iterate through the words with the fewest pages
- Sort the pages by which one is more relevant
  - Number of times the word shows up relative to other words
  - Huge keyword counts should be discarded
  - Number of pages that link to this page
  - Pages with a huge number of outgoing links should be discarded
- Publish to the DHT, pointing the word to the hash
  - Include the number of pages that reference this
  - Send the actual hashes if possible
  - Otherwise, publish a block with the list and link to that
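One way the ranking pass above could score and filter pages is sketched below. The thresholds (`MAX_TERM_RATIO`, `MAX_OUTLINKS`) and the log-damped scoring formula are made-up illustrations, not values from the notes.

```typescript
interface PageStats {
  hash: string;        // content hash of the page
  termCount: number;   // occurrences of the word on this page
  totalWords: number;  // total word count of the page
  inLinks: number;     // pages known to link to this page
  outLinks: number;    // links going out of this page
}

const MAX_TERM_RATIO = 0.5; // discard pages stuffed with the keyword
const MAX_OUTLINKS = 500;   // discard link farms

function rankPages(pages: PageStats[]): PageStats[] {
  return pages
    .filter((p) => p.termCount / p.totalWords <= MAX_TERM_RATIO) // huge keyword counts discarded
    .filter((p) => p.outLinks <= MAX_OUTLINKS)                   // huge outgoing-link counts discarded
    .sort((a, b) => score(b) - score(a));
}

function score(p: PageStats): number {
  // Term frequency, boosted by inbound links (log-damped so a few
  // thousand links don't completely dominate the term signal).
  return (p.termCount / p.totalWords) * (1 + Math.log(1 + p.inLinks));
}
```

The top entries of `rankPages` per word would then be published to the DHT, either inline or as a linked block when the list is large.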
Performing a search:
- Process search terms using NLP from before
- Build up pairs
- Query for results locally and display confident matches right away
- Query the DHT for matches for keyword pairs
  - If no results for pairs exist, search for individual words and take the union
- Build up results in a list; take a second or two to gather a batch of results
- Rank the initial results by confidence (links to pages, number of times they appeared in the DHT)
- Start fetching results using streams
  - Parse enough to read the <title> of the page before showing the result
  - Otherwise, take the first n characters of text and display that
- Add a verification step to confirm that the links are valid and that the keywords actually exist (not for local results, of course)
  - Cache verified results locally
  - Filter out invalid items
  - It might make sense to keep a DB of blacklists that ignores results for certain keywords (prevents having to verify the page again)
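The pair-first query strategy above could be sketched like this. `Dht` here is a hypothetical key-value lookup standing in for a real DHT query, and the pair keys reuse the alphanumeric sorting convention from the indexing step.

```typescript
type Dht = (key: string) => Promise<string[]>; // key -> content hashes

async function search(dht: Dht, terms: string[]): Promise<string[]> {
  const words = [...new Set(terms.map((t) => t.toLowerCase()))];

  // Build alphanumerically sorted pairs, matching how they were indexed.
  const pairs: string[] = [];
  for (let i = 0; i < words.length; i++)
    for (let j = i + 1; j < words.length; j++)
      pairs.push([words[i], words[j]].sort().join("|"));

  // Prefer pair matches: they are more specific, so fewer false hits.
  const pairHits = (await Promise.all(pairs.map(dht))).flat();
  if (pairHits.length > 0) return [...new Set(pairHits)];

  // Fall back to the union of single-word matches.
  const wordHits = (await Promise.all(words.map(dht))).flat();
  return [...new Set(wordHits)];
}
```

Results would then be ranked, streamed, and verified as described; verified hashes go into the local cache so repeat searches skip the DHT entirely.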
One big problem that I see: a file lives on only one node, or the client node gets disconnected from the part of the network where the file is, and the search ends in a "file could not be found" error.
This problem will diminish once the network has grown to a certain size. But in the beginning, this will be the most critical error: it will frustrate non-technical people and drive them away from the system. IPFS promises "no more 404 errors, no more broken links", but there are going to be broken links all the time, broken because the file can't be reached.
Another problem: if you decentralize the centralized, everybody has to take on the weight of the "center nodes", i.e. the servers of Google, Facebook, etc.
That means everybody will have to store all the data, or at least a part of it, on their PCs.
Possible solution ideas:
- A meta-distributed structure: have dedicated search nodes running that do the indexing and other heavy work.
- Make the search database part of an IPFS node, with a protocol so the nodes can communicate and share the weight.
As I wrote ... just ideas.
You talk a lot about HTML; what about other types of files?