@RangerMauve
Created January 22, 2018 15:14
Distributed search indexing over IPFS

Distributed search on IPFS

Centralized search is great, but it is an avenue for censorship.

Problems to solve:

  • Have indexing of data be performed by everyone to avoid centralization
  • Map key words to hashes of data that contains those keywords
  • Ability to sort results based on some criteria
    • Who viewed this page before
    • How does it use the keywords
      • Number of times
      • If in a webpage, which section is it in
    • What links to this data
    • Size of data
  • Ability to search for multiple keywords
  • Automatically correct common spelling errors and account for word synonyms
  • Prevent spam from crippling the network
  • Prevent malicious nodes from serving invalid content
    • Every result should include data proving that the node actually has the content

The plugin should passively listen to all the websites a person visits and queue up jobs for indexing the pages. When seeing a new web page (see the sketch after this list):

  • Query the DHT to see if somebody has indexed it before
    • If it has been seen before, have a percent chance of ignoring it
  • Process it to get the main textual content out using semantic HTML rules
  • Filter out common words that don't add to the meaning
    • Use NLP libraries for English as a start; the community should contribute other languages
    • Maybe ignore numbers?
  • Build up a map of word counts (unique word : occurrence count) and a map of word pair counts (unique word pair : occurrence count)
    • Word pairs should be alphanumerically sorted
  • Iterate through the words and word pairs and save them to the DB
    • word hash : count as 32-bit hex : content hash
    • content hash : count as 32-bit hex : word hash
    • This allows searching for pages by word, or searching for words by page (and then sorting by count)
  • Iterate through all links in the document
    • Save a key of target : source
    • Publish to the DHT that this source links to that page
  • Publish to the DHT that this page has been indexed (don't include the source of who did it, for privacy)
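
A minimal sketch of what this per-page indexing step could look like, assuming a browser-extension context where the Web Crypto API is available. The stop-word list, the count-per-word ratio of keys, and the `db.put` interface are illustrative placeholders, not an existing API:

```ts
// Hypothetical stop-word list; a real NLP library would supply this.
const STOP_WORDS = new Set(['the', 'a', 'an', 'and', 'of', 'to', 'in', 'is', 'it']);

type Counts = Map<string, number>;

async function sha256(text: string): Promise<string> {
  const buf = await crypto.subtle.digest('SHA-256', new TextEncoder().encode(text));
  return [...new Uint8Array(buf)].map((b) => b.toString(16).padStart(2, '0')).join('');
}

function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9']+/)                      // crude English-only tokenizer
    .filter((w) => w.length > 1 && !STOP_WORDS.has(w));
}

function wordCounts(tokens: string[]): Counts {
  const counts: Counts = new Map();
  for (const w of tokens) counts.set(w, (counts.get(w) ?? 0) + 1);
  return counts;
}

// Word pairs are adjacent tokens, alphanumerically sorted so "ipfs search"
// and "search ipfs" collapse to the same key.
function pairCounts(tokens: string[]): Counts {
  const counts: Counts = new Map();
  for (let i = 0; i < tokens.length - 1; i++) {
    const pair = [tokens[i], tokens[i + 1]].sort().join(' ');
    counts.set(pair, (counts.get(pair) ?? 0) + 1);
  }
  return counts;
}

// Save both directions: word -> page and page -> word, with the count encoded
// as zero-padded 32-bit hex so keys sort by frequency.
async function saveIndex(
  contentHash: string,
  counts: Counts,
  db: { put(key: string, value: string): Promise<void> }  // placeholder DB interface
): Promise<void> {
  for (const [word, count] of counts) {
    const hex = count.toString(16).padStart(8, '0');
    const wordHash = await sha256(word);
    await db.put(`${wordHash}:${hex}:${contentHash}`, '');
    await db.put(`${contentHash}:${hex}:${wordHash}`, '');
  }
}
```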

Periodically rank confidence in pages (a rough sketch follows this list):

  • Search through the locally seen words and order them by the number of pages that contain them
  • Iterate through the words with the fewest pages
  • Sort the pages by relevance
    • Number of times the word shows up relative to other words
      • Huge counts of keywords should be discarded
    • Number of pages that link to this page
      • A huge number of outgoing links should be discarded
  • Publish to the DHT, pointing the word to the hash
    • Include the number of pages that reference it
    • Send the actual hashes if possible
    • Otherwise, publish a block with the list and link to that
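
A rough sketch of that periodic ranking pass. The per-page stats shape, the spam thresholds, and `publishToDHT` are all assumptions for illustration, not a real IPFS API:

```ts
interface PageStats {
  contentHash: string;
  wordCount: number;      // occurrences of the word on this page
  totalWords: number;     // total tokens on the page
  inboundLinks: number;   // pages known to link here
  outboundLinks: number;  // links going out of this page
}

function rankPages(pages: PageStats[]): PageStats[] {
  return pages
    // Discard keyword stuffing and obvious link farms (thresholds are arbitrary).
    .filter((p) => p.wordCount / p.totalWords < 0.2 && p.outboundLinks < 500)
    .sort((a, b) => {
      const scoreA = a.wordCount / a.totalWords + Math.log1p(a.inboundLinks);
      const scoreB = b.wordCount / b.totalWords + Math.log1p(b.inboundLinks);
      return scoreB - scoreA;
    });
}

async function publishWord(
  word: string,
  pages: PageStats[],
  publishToDHT: (key: string, value: string) => Promise<void>  // placeholder DHT put
): Promise<void> {
  const ranked = rankPages(pages).slice(0, 20);
  await publishToDHT(word, JSON.stringify(
    ranked.map((p) => ({ hash: p.contentHash, refs: p.inboundLinks }))
  ));
}
```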

Performing a search (see the query-flow sketch after this list):

  • Process search terms using the NLP from before
  • Build up pairs
  • Query for results locally and display confident matches right away
  • Query the DHT for matches for keyword pairs
    • If no results for pairs exist, search for individual words and take the union
  • Build up results in a list; take a second or two to gather a bunch of results
  • Rank the initial results by confidence (links to pages, number of times it appeared in the DHT)
  • Start fetching results using streams
    • Parse enough to read the <title> of the page before showing the result
    • Otherwise take the first n characters of text and display that
  • Add a verification step to results in order to verify that the links are valid and that the keywords actually exist (not for local results, of course)
    • Cache verified results locally
    • Invalid items should be filtered out
    • It might make sense to have a DB of blacklists which will ignore results for keywords (to avoid having to verify the page again)
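
A sketch of the query flow, reusing the hypothetical `tokenize` and `pairCounts` helpers from the indexing sketch; `queryLocal` and `queryDHT` are assumed lookup functions, and verification/streaming of results is left out:

```ts
interface SearchResult {
  contentHash: string;
  confidence: number;  // e.g. derived from inbound links and DHT occurrence count
}

async function search(
  query: string,
  queryLocal: (key: string) => Promise<SearchResult[]>,  // placeholder local DB lookup
  queryDHT: (key: string) => Promise<SearchResult[]>     // placeholder DHT lookup
): Promise<SearchResult[]> {
  const tokens = tokenize(query);
  const pairs = [...pairCounts(tokens).keys()];

  // Confident local matches can be displayed right away.
  const local = (await Promise.all(tokens.map(queryLocal))).flat();

  // Prefer pair matches from the DHT; fall back to a union of single-word matches.
  let remote = (await Promise.all(pairs.map(queryDHT))).flat();
  if (remote.length === 0) {
    remote = (await Promise.all(tokens.map(queryDHT))).flat();
  }

  // Merge and rank by confidence; fetching, <title> parsing, and verification
  // would happen as results stream in.
  return [...local, ...remote].sort((a, b) => b.confidence - a.confidence);
}
```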

@RangerMauve
Author

Hi @garyee and @dokterbob, I didn't get notifications for your posts so I totally missed them until now.

A file is just on one node or the client-node gets disconnected from the part of the net where the file is

This is definitely a problem with IPFS at the moment, but I think that popular search results are more likely to have been viewed by other people and are more likely to be in the cache of other nodes.

if you decentralize the centralized everybody has to take on the weight of the "Center-nodes"

This isn't entirely true: since the load is distributed across the system, individual nodes won't have too much load on them. Plus, whenever somebody loads a portion of the search index, the nodes hosting it will see less and less load for that information.

what about other types of files?

I'm focusing on HTML first because it's easy to crawl and index, and that's what current search engines focus on, but I don't think it would be hard to support txt files and pdfs in the future.

@dokterbob

Would you be interested in chatting some time to discuss ideas? I've been looking into how the elasticsearch portion of your system works, and I think it wouldn't be too hard to replace it with IPFS and IPNS.

@Wiretrip

Hi Guys,
Are you still looking into this? If so I would be very interested in collaborating! I have been banging on about distributed search for years now and have been following the progress of YaCY and Faroo (now defunct it seems). I got as far as a Java PoC strapping together Kademlia and Lucene but didn't get much further...
