The following is the report of my work on Ahmia (https://ahmia.fi) during Google Summer of Code 2017.
Ahmia is a search engine that indexes, searches, and catalogs content published on Tor Hidden Services. First of all I would like to thank my mentor, Juha Nurmi, and the Tor community for their constant guidance. It has been a productive summer for me and a great experience working with the Tor community.
The codebase is divided into three parts:
- Index: The data that Ahmia has collected.
- Crawler: The part that crawls onions on the Tor network and feeds them to the index.
- Site: The backbone of Ahmia, which includes the design of the website and makes the search engine work.
My work was distributed among these three parts. Below is the list of tasks I accomplished during GSoC, followed by ideas for future work.
Ahmia blacklists onions containing child abuse media. I wrote a script[1] that fetches these websites from the Hidden Wiki and blacklists them in our database. I also created a cron job that runs automatically once a week and blacklists these onions.
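To illustrate the extraction step, here is a minimal sketch in Python. It is not the actual script[1]: the function names and the in-memory set used as the blacklist are assumptions for illustration, and fetching the page is left out. It only shows how onion addresses can be pulled out of page text and merged into a blacklist.

```python
import re

# v2 onion addresses are 16 base32 characters, v3 addresses are 56.
ONION_RE = re.compile(r"\b([a-z2-7]{56}|[a-z2-7]{16})\.onion\b")


def extract_onions(page_text):
    """Return the set of distinct onion domains mentioned in page_text."""
    return {m.group(0) for m in ONION_RE.finditer(page_text.lower())}


def update_blacklist(blacklist, page_text):
    """Add every onion found in page_text to blacklist (a set).

    Returns only the domains that were not blacklisted before, so a
    weekly run can report what changed.
    """
    found = extract_onions(page_text)
    new = found - blacklist
    blacklist |= new
    return new
```

Running this from a weekly cron job would then amount to fetching the page text and calling update_blacklist with the stored blacklist.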
The Add page now allows users and hidden service operators to add onions to the Ahmia index. It now supports adding subdomains too. My work included adding a model for the add service and saving the added domains to Django's local database. Domains are periodically added to the Ahmia index once they pass our blacklist filter. These domains are then removed from the local database periodically by a cron job.
- Add page model
- Add page data
- Add page updates and cronjob
- Fixed redirections
- Sub-domain support in 'add'
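As a rough illustration of the sub-domain support, the sketch below validates a submitted address and keeps any subdomain labels. It is an assumption of how such a check could look, not the code used on the Add page; normalize_onion and ONION_HOST_RE are hypothetical names.

```python
import re
from urllib.parse import urlsplit

ONION_HOST_RE = re.compile(
    r"^(?:[a-z0-9-]+\.)*"              # optional subdomain labels
    r"(?:[a-z2-7]{16}|[a-z2-7]{56})"   # v2 (16 chars) or v3 (56 chars) address
    r"\.onion$"
)


def normalize_onion(submitted):
    """Return the lower-cased onion host (subdomains kept) or None if invalid."""
    value = submitted.strip().lower()
    if "://" not in value:
        value = "http://" + value  # urlsplit needs a scheme to find the host
    host = urlsplit(value).hostname
    if host and ONION_HOST_RE.match(host):
        return host
    return None
```

A submission that returns None would be rejected; anything else could be saved to the local database and later checked against the blacklist filter.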
Ahmia now supports the latest version of Elasticsearch, i.e. 5.4, for indexing.
The statistics page was broken, and several redirections were required in the codebase to fix it. I also added the linking structure of the domains in the Ahmia database to the statistics page (https://ahmia.fi/stats/link_graph/). I used tools such as Gephi and sigma.js for the popularity graph.
- Linking structure
- Fixed broken 'Stats' page and linking graph structure
- Working link graph at '/stats/link_graph'
- Stats page upgrades
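For readers curious how a linking structure can be prepared for drawing, here is a small sketch that turns (source, target) domain pairs into a nodes/edges dictionary. The field names follow the JSON shape that sigma.js is commonly fed, but treat them, the circular layout, and the function names as assumptions; this is not the code behind the stats page.

```python
import json
import math


def build_link_graph(links):
    """Turn (source_domain, target_domain) pairs into a nodes/edges dict."""
    # Drop duplicate links and self-links so each edge appears once.
    unique_links = sorted({(s, t) for s, t in links if s != t})
    domains = sorted({d for pair in unique_links for d in pair})

    in_degree = {d: 0 for d in domains}
    for _, target in unique_links:
        in_degree[target] += 1

    nodes = []
    for i, domain in enumerate(domains):
        # Place nodes on a unit circle; a layout tool can rearrange them later.
        angle = 2 * math.pi * i / len(domains)
        nodes.append({
            "id": domain,
            "label": domain,
            "x": math.cos(angle),
            "y": math.sin(angle),
            "size": 1 + in_degree[domain],  # more inbound links, bigger node
        })

    edges = [
        {"id": "e%d" % i, "source": s, "target": t}
        for i, (s, t) in enumerate(unique_links)
    ]
    return {"nodes": nodes, "edges": edges}


def link_graph_json(links):
    """Serialize the graph so a front-end script can load it."""
    return json.dumps(build_link_graph(links))
```

Sizing nodes by inbound links is only one simple choice; any popularity measure could be used instead.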
Ahmia now has detailed documentation on the website as well as on its GitHub page. The 'about' page has also been updated, and several outdated and expired pages have been removed.
- Advanced search options
In my initial proposal I mentioned implementing advanced search features like "", || and && as a secondary task if time permitted. I started working on support for these features but could not complete it. I plan to complete them in my free time after GSoC.
- Replacing polipo with torsocks
Initially I planned to replace polipo with torsocks for crawling, since polipo has been discontinued. But when we tested torsocks, it did not provide any significant improvement over polipo. After discussion with my mentor we decided to drop this idea for now.
- Scoring algorithm for better search results
Tweaking the scoring algorithm has been one of the most frequently discussed tasks and is a high priority for better search results. As of now Ahmia ranks results by authority, a value computed by the PageRank algorithm. High authority means a page is linked to a lot by other high-authority pages. However, this approach prioritizes well-known domains like the Hidden Wiki over individual forums or guides, even though the latter may better match the search query.
For instance, for a query like “how to operate a tor hidden service”, a relevant result with low authority (like an unknown forum post called “how to operate a hidden service”) will be at the bottom of the results list, while a less relevant high-authority page (like “Tor hidden wiki”) will be at the top. This needs to be handled.
One possible way is to use Elasticsearch boosting, which works on '_score'. Refer to https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html and https://www.elastic.co/guide/en/elasticsearch/guide/current/boosting-by-popularity.html
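A possible shape for such a query, following the boosting-by-popularity approach from the Elasticsearch guide, is sketched below. The field names "title", "content" and "authority" and the weight value are assumptions about the index, not the current Ahmia mapping. The idea is that the text relevance score dominates and a dampened authority value is only added on top.

```python
def build_search_body(query_text, authority_weight=0.1):
    """Build a search request body that adds a dampened authority bonus.

    The full-text match produces _score; field_value_factor then adds
    log(1 + authority_weight * authority) instead of ranking by
    authority alone.
    """
    return {
        "query": {
            "function_score": {
                "query": {
                    "multi_match": {
                        "query": query_text,
                        # Assumed field names; title matches count more.
                        "fields": ["title^3", "content"],
                    }
                },
                "field_value_factor": {
                    "field": "authority",       # assumed PageRank field
                    "modifier": "log1p",        # dampen very large values
                    "factor": authority_weight,
                    "missing": 0,               # pages without authority get no bonus
                },
                "boost_mode": "sum",            # add the bonus to _score
            }
        }
    }
```

With "boost_mode" set to "sum", a well-matching forum post is no longer pushed down just because a popular page has a much larger authority; tuning authority_weight would decide how much popularity still counts.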
- More statistics and better API
Apart from the linking structure, we could add more visualizations and statistics. I made a popularity graph earlier, but since the API I used is no longer available we had to remove it. We can create similar APIs to visualize stats such as the popularity and uptime of onions.
- Preview of the onion when pointer is hovered upon it
Users are often hesitant about opening an unknown link on the Tor network. A preview of onions on the search page can be very handy, as users can have a look at the webpage and decide whether they want to open it. This could be implemented using jQuery. Have a look at https://github.com/lonekorean/mini-preview for instance.
I plan to contribute to Ahmia in my free time even after GSoC. Ahmia is already considered one of the best search engines in its domain. If more time and work are put into its development, it can achieve much greater heights.