
@RangerMauve
Created January 22, 2018 15:14
Distributed search indexing over IPFS

Distributed search on IPFS

Centralized search is great, but it is an avenue for censorship.

Problems to solve:

  • Have indexing of data be performed by everyone to avoid centralization
  • Map key words to hashes of data that contains those keywords
  • Ability to sort results based on some criteria
    • Who viewed this page before
    • How does it use the keywords
      • Number of times
      • If in a webpage, which section is it in
    • What links to this data
    • Size of data
  • Ability to search for multiple keywords
  • Automatically correct common spelling errors and account for word synonyms
  • Prevent spam from crippling the network
  • Prevent malicious nodes from serving invalid content
    • Every result should include data proving that the node actually has the content

The plugin should passively listen to all the websites a person visits and queue up jobs for indexing the pages. When seeing a new web page (a sketch of the core of this step follows the list):

  • Query the DHT to see if somebody has indexed it before
    • If it has been seen before, have a percent chance of ignoring it
  • Process it to get the main textual content out using semantic HTML rules
  • Filter out common words that don't add to the meaning
    • Use NLP libraries for English as a start; the community should contribute other languages
    • Maybe ignore numbers?
  • Build up a map of word counts (unique word : occurrence count) and a map of word-pair counts (unique word pair : occurrence count)
    • Word pairs should be alphanumerically sorted
  • Iterate through words and word pairs and save them to the DB
    • word hash : count as 32-bit hex : content hash
    • content hash : count as 32-bit hex : word hash
    • This allows searching for pages by word, or searching for words by page (and then sorting by count)
  • Iterate through all links in the document
    • Save a key of target : source
    • Publish to the DHT that this source links to that page
  • Publish to the DHT that this page has been indexed (don't include the source of who did it, for privacy)
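For illustration, a minimal sketch of the word/pair counting and key layout, in TypeScript. `buildIndexKeys`, the stopword list, and the use of SHA-256 are all assumptions made for the example, not part of the design above (IPFS itself would use multihashes/CIDs):

```typescript
// Sketch of the per-page indexing step. The key layout follows the list above;
// sha256Hex is a stand-in for real content addressing, and the stopword list
// is illustrative (a real build would use an NLP library).
import { createHash } from "crypto";

const sha256Hex = (s: string): string =>
  createHash("sha256").update(s).digest("hex");

// Common words that don't add to the meaning.
const STOPWORDS = new Set(["the", "a", "an", "and", "of", "to", "in", "is"]);

function buildIndexKeys(contentHash: string, text: string): string[] {
  const words = text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((w) => w.length > 1 && !STOPWORDS.has(w));

  const wordCounts = new Map<string, number>(); // unique word -> occurrence count
  const pairCounts = new Map<string, number>(); // sorted word pair -> occurrence count

  for (let i = 0; i < words.length; i++) {
    wordCounts.set(words[i], (wordCounts.get(words[i]) ?? 0) + 1);
    if (i + 1 < words.length) {
      // Pairs are alphanumerically sorted so "foo bar" and "bar foo" collide.
      const pair = [words[i], words[i + 1]].sort().join(" ");
      pairCounts.set(pair, (pairCounts.get(pair) ?? 0) + 1);
    }
  }

  // Emit both directions: pages by word, and words by page. Encoding the
  // count as fixed-width 32-bit hex makes keys for one term sort by count.
  const keys: string[] = [];
  for (const [term, count] of [...wordCounts, ...pairCounts]) {
    const countHex = count.toString(16).padStart(8, "0");
    const termHash = sha256Hex(term);
    keys.push(`${termHash}:${countHex}:${contentHash}`);
    keys.push(`${contentHash}:${countHex}:${termHash}`);
  }
  return keys; // link extraction and DHT publishing are omitted here
}
```

The fixed-width count encoding is what makes "sort by count" cheap: a range scan over one term's keys comes back already ordered.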

Periodically rank the confidence in which pages are relevant to which words (a ranking sketch follows this list):

  • Search through the locally seen words and order them by the number of pages that contain them
  • Iterate through the words with the fewest pages
  • Sort the pages by relevance:
    • Number of times the word shows up relative to other words
      • Huge keyword counts should be discarded
    • Number of pages that link to this page
      • A huge number of outgoing links should be discarded
  • Publish to the DHT, pointing the word to the hash
    • Include the number of pages that reference this
    • Send the actual hashes if possible
    • Otherwise, publish a block with the list and link to that
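A hedged sketch of that filtering and sorting; `PageStats` and the two thresholds are invented for illustration, not tuned values:

```typescript
// Sketch of the periodic ranking pass under assumed thresholds.
interface PageStats {
  contentHash: string;
  wordCount: number;     // occurrences of the word on this page
  totalWords: number;    // total words on the page
  incomingLinks: number; // pages seen linking to this one
  outgoingLinks: number; // links this page makes
}

const MAX_TERM_RATIO = 0.2;     // >20% one keyword smells like keyword stuffing
const MAX_OUTGOING_LINKS = 500; // likely a link farm; discard

function rankPagesForWord(pages: PageStats[]): PageStats[] {
  return pages
    .filter((p) => p.wordCount / p.totalWords <= MAX_TERM_RATIO)
    .filter((p) => p.outgoingLinks <= MAX_OUTGOING_LINKS)
    .sort((a, b) => {
      // Relevance: term frequency first, incoming links as a tiebreaker.
      const tf = b.wordCount / b.totalWords - a.wordCount / a.totalWords;
      return tf !== 0 ? tf : b.incomingLinks - a.incomingLinks;
    });
}
```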

Performing a search (a query sketch follows this list):

  • Process the search terms using the NLP step from before
  • Build up keyword pairs (alphanumerically sorted, as during indexing)
  • Query for results locally and display confident matches right away
  • Query the DHT for matches on keyword pairs
    • If no results for pairs exist, search for individual words and take the union
  • Build up results in a list, taking a second or two to gather a batch of results
  • Rank the initial results by confidence (links to pages, number of times they appeared in the DHT)
  • Start fetching results using streams
    • Parse enough to read the <title> of the page before showing the result
    • Otherwise, take the first n characters of text and display that
  • Add a verification step to confirm that links are valid and that the keywords actually exist (not for local results, of course)
    • Cache verified results locally
    • Invalid items should be filtered out
    • It might make sense to keep a DB of blacklists that ignores results for given keywords (preventing having to verify the page again)
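As a sketch of the query path, assuming a hypothetical `Dht` interface; the ranking, streaming, and verification layers are left out:

```typescript
// Minimal search flow against an assumed DHT lookup interface.
interface Dht {
  get(key: string): Promise<string[]>; // content hashes indexed under a keyword key
}

async function search(dht: Dht, terms: string[]): Promise<string[]> {
  // Build alphanumerically sorted pairs, matching the indexing step.
  const pairs: string[] = [];
  for (let i = 0; i + 1 < terms.length; i++) {
    pairs.push([terms[i], terms[i + 1]].sort().join(" "));
  }

  // Prefer pair matches; fall back to the union of single-word matches.
  let hashes = (await Promise.all(pairs.map((p) => dht.get(p)))).flat();
  if (hashes.length === 0) {
    hashes = (await Promise.all(terms.map((t) => dht.get(t)))).flat();
  }

  // De-duplicate; a fuller version would rank by confidence, stream page
  // contents to extract <title>, and verify keywords before caching.
  return [...new Set(hashes)];
}
```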

@garyee

garyee commented Jan 22, 2018

One big problem that I see: a file may exist on only one node, or the client node may get disconnected from the part of the network where the file is. The result is a "file could not be found" error.

This problem will diminish once the network has grown to a certain size, but in the beginning it will be the most critical error: it will frustrate non-technical people and drive them away from using the system. IPFS promises no more 404 errors and no broken links, but there are going to be broken links all the time, broken because the file can't be reached.

Another problem: if you decentralize the centralized, everybody has to take on the weight of the "center nodes". Today those center nodes are the servers of Google, Facebook, etc.
That means everybody will have to store all the data, or at least a part of it, on their own PCs.

Possible solution ideas:
A meta-distributed structure: have dedicated search nodes running that do the indexing and other heavy work.
Make the search database part of an IPFS node, with a protocol so that nodes can communicate and share the weight.
As I wrote ... just ideas.

You talk a lot about HTML; what about other types of files?

@dokterbob

I have quickly skimmed through it. At the moment it is already quite hard to keep ipfs-search.com functioning properly, even without dealing with the very complicated issue of censorship - or implementing the actual store or indexing.

At this point in time, our strategy for preventing censorship would be to provide daily uploads of index snapshots on IPFS, for everyone to download and/or fork. This should make any censorship easy to detect, and it would allow anyone to continue our efforts should ipfs-search.com go down for any reason. Plus, it is manifold easier than building a fully decentralised infrastructure.
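A minimal sketch of such a snapshot upload, assuming a local IPFS daemon and the `ipfs-http-client` JS library; the snapshot format itself is left open, and this is not how ipfs-search.com publishes anything today:

```typescript
// Publish a daily index snapshot to IPFS and point an IPNS name at it.
import { create } from "ipfs-http-client";

async function publishSnapshot(indexDump: Uint8Array): Promise<string> {
  const ipfs = create({ url: "http://127.0.0.1:5001" });

  // Add the day's index snapshot to IPFS...
  const { cid } = await ipfs.add(indexDump);

  // ...and update a stable IPNS name, so followers can always fetch the
  // latest snapshot (or pin/fork any historical one by its CID).
  await ipfs.name.publish(`/ipfs/${cid.toString()}`);
  return cid.toString();
}
```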

However, the middle to long term future most definitely is: Decentralise All The Things.

I feel that the current ipfs-search.com would be redundant the moment a similarly functional decentralised version of the service becomes available, and I hereby officially endorse such an effort. However, at this stage I think it would be a parallel effort to ours, in that the technical differences between centralised (ES-based) search and decentralised search are so immense that I cannot see a smooth transition between the two.

Mathijs de Bruin
Initiator of ipfs-search.com
https://github.com/ipfs-search/ipfs-search

@dokterbob

By the way, to have some indication of how difficult the distributed search problem really is - and possibly also as a point of collaboration - I recommend you have a look at YaCy, a really serious effort at building a distributed search engine. https://yacy.net/

Perhaps it would make sense to extend this project with IPFS integration, both as a store for the index as well as with an IPFS crawler.

@RangerMauve

Hi @garyee and @dokterbob, I didn't get notifications for your posts so I totally missed them until now.

A file is just on one node or the client-node gets disconnected from the part of the net where the file is

This is definitely a problem with IPFS at the moment, but I think that popular search results are more likely to have been viewed by other people and are more likely to be in the cache of other nodes.

if you decentralize the centralized everybody has to take on the weight of the "Center-nodes"

This isn't entirely true: since the load is distributed across the system, individual nodes won't have too much load on them. Plus, whenever somebody loads a portion of the search index, they start hosting it too, so the nodes already hosting it see less and less load for that information.

what about other types of files?

I'm focusing on HTML first because it's easy to crawl and index, and it's what current search engines focus on, but I don't think it would be hard to support txt files and PDFs in the future.

@dokterbob

Would you be interested in chatting some time to discuss ideas? I've been looking into how the Elasticsearch portion of your system works, and I think it wouldn't be too hard to replace it with IPFS and IPNS.

@Wiretrip

Hi Guys,
Are you still looking into this? If so, I would be very interested in collaborating! I have been banging on about distributed search for years now and have been following the progress of YaCy and Faroo (now defunct, it seems). I got as far as a Java PoC strapping together Kademlia and Lucene, but didn't get much further...
