@mdhash
Last active August 28, 2017 09:49
The Tor Project - Ahmia GSoC 2017

The following is a report of the work I did on Ahmia (https://ahmia.fi) during Google Summer of Code 2017.

Ahmia is a search engine that indexes, searches, and catalogs content published on Tor Hidden Services. First of all, I would like to thank my mentor, Juha Nurmi, and the Tor community for their constant guidance. It has been a productive summer for me and a great experience working with the Tor community.

Overview

The codebase is divided into three parts:

  • Index: the data that Ahmia has collected.
  • Crawler: the part that crawls onions on the Tor network and feeds them to the index.
  • Site: the backbone of Ahmia, which includes the design of the website and makes the search engine work.

My work was distributed among these three parts. Below is the list of tasks I accomplished during GSoC, followed by ideas for future work.

Tasks completed

Automating blacklisting of child abuse onions

Ahmia blacklists onions containing child abuse media. I wrote a script[1] that fetches the addresses of these websites from the Hidden Wiki and adds them to the blacklist in our database. I also created a cron job that runs automatically once a week and blacklists these onions.
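The core of such a blacklisting step can be sketched roughly as follows. This is only an illustration, not the script referenced above: the function names `extract_onions` and `update_blacklist` are hypothetical, the blacklist is modelled as a plain Python set rather than Ahmia's database, and the page text is assumed to have been fetched already. The regular expression assumes onion names are 16 or 56 base32 characters long.

```python
import re

# Assumption: onion names are 16 (v2) or 56 (v3) base32 characters.
ONION_RE = re.compile(r"\b([a-z2-7]{56}|[a-z2-7]{16})\.onion\b")


def extract_onions(page_text):
    """Return the set of onion domains mentioned in a block of text."""
    return {match.group(0) for match in ONION_RE.finditer(page_text.lower())}


def update_blacklist(blacklist, page_text):
    """Add every onion found in page_text to blacklist; return the new ones."""
    found = extract_onions(page_text)
    new_entries = found - blacklist
    blacklist |= new_entries
    return new_entries
```

Returning only the newly added entries makes it easy for a weekly job to log what changed on each run.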

Major commits

Remodeled 'Add' page

The 'Add' page now allows users and hidden service operators to add onions to the Ahmia index. It also supports adding subdomains. My work included adding a model for the add service and saving the submitted domains to Django's local database. Domains that pass our blacklist filter are periodically added to the Ahmia index, and a cron job periodically removes them from the local database.
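A minimal sketch of the filtering step described above, under stated assumptions: the helper names `base_onion` and `accept_for_index` are hypothetical, the blacklist is a set of base onion domains, and a subdomain is assumed to be rejected whenever its base domain is blacklisted. The real implementation uses a Django model, which is left out here to keep the example self-contained.

```python
import re

# Assumption: optional subdomain labels, then a 16- or 56-character
# base32 onion name, then ".onion".
SUBMISSION_RE = re.compile(
    r"^(?:[a-z0-9-]+\.)*([a-z2-7]{56}|[a-z2-7]{16})\.onion$"
)


def base_onion(address):
    """Return the base onion domain of a submission, or None if invalid."""
    match = SUBMISSION_RE.match(address.strip().lower())
    return match.group(1) + ".onion" if match else None


def accept_for_index(pending, blacklist):
    """Keep valid submissions whose base domain is not blacklisted."""
    accepted = []
    for address in pending:
        base = base_onion(address)
        if base is not None and base not in blacklist:
            accepted.append(address.strip().lower())
    return accepted
```

Checking the base domain rather than the full address means a blacklisted service cannot be resubmitted under a new subdomain.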

Major commits

Upgraded support from Elasticsearch 2.4.0 to 5.4

Ahmia now supports the latest version of Elasticsearch, i.e. 5.4, for indexing.

Major commits

Data visualization

The statistics page was broken, and fixing it required several redirections in the codebase. I also added a visualization of the linking structure of the domains in the Ahmia database to the statistics page (https://ahmia.fi/stats/link_graph/). I used tools such as Gephi and sigma.js for the popularity graph.
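As a rough illustration of how link data can be prepared for such a graph, the sketch below turns (source, target) domain pairs into a nodes-and-edges dictionary of the general shape that graph libraries such as sigma.js consume. The function name `build_link_graph`, the use of inbound link count as node size, and the omission of layout coordinates are all assumptions made for this example; the exact fields a renderer needs depend on the library and its version.

```python
from collections import Counter


def build_link_graph(links):
    """Turn (source, target) domain pairs into a nodes/edges dictionary."""
    # Duplicate links between the same pair of domains count only once.
    unique_links = sorted(set(links))
    inbound = Counter(target for _, target in unique_links)
    domains = sorted({domain for link in unique_links for domain in link})
    nodes = [
        # Assumption: node size grows with the number of inbound links.
        {"id": domain, "label": domain, "size": 1 + inbound[domain]}
        for domain in domains
    ]
    edges = [
        {"id": "e%d" % index, "source": source, "target": target}
        for index, (source, target) in enumerate(unique_links)
    ]
    return {"nodes": nodes, "edges": edges}
```

The resulting dictionary can be serialized with `json.dumps` and served to the front end, with node positions assigned separately by a layout step.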

Major commits

Detailed Documentation and updated software dependencies

Ahmia now has detailed documentation on the website as well as on its GitHub page. The 'about' page has also been updated, and several outdated and expired pages have been removed.

Major commits

What work is missing?

  • Advanced search options

In my initial proposal I mentioned implementing advanced search operators such as "", || and && as a secondary task, time permitting. I started working on support for these operators but could not complete it. I plan to finish them in my free time after GSoC.
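One possible direction, sketched here under assumptions and not taken from the unfinished work, is to map these operators onto Elasticsearch's `simple_query_string` query, which uses `+` for AND, `|` for OR and double quotes for phrases. The function name `build_advanced_query` and the field names `title` and `content` are hypothetical.

```python
def build_advanced_query(user_input, fields=("title", "content")):
    """Build a simple_query_string query from input using "", || and &&."""
    # Translate the user-facing operators into simple_query_string syntax.
    translated = user_input.replace("&&", "+").replace("||", "|")
    return {
        "query": {
            "simple_query_string": {
                "query": translated,
                "fields": list(fields),
                "default_operator": "and",
            }
        }
    }
```

`simple_query_string` is used in this sketch because it ignores invalid syntax instead of raising an error, which suits free-form user input.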

  • Replacing polipo with torsocks

Initially I planned to replace polipo with torsocks for crawling, since polipo has been discontinued. However, when we tested torsocks, it did not provide any significant improvement over polipo. After discussing this with my mentor, we decided to drop the idea for now.

Future ideas

  • Scoring algorithm for better search results

Tweaking the scoring algorithm has been one of the most frequently discussed tasks and is a high priority for better search results. As of now, Ahmia ranks results by authority, a value computed by the PageRank algorithm. High authority means a page is linked to by many other high-authority pages. However, this approach prioritizes well-known domains like the Hidden Wiki over individual forums or guides, even when the latter match the search query better.

For instance, for a query like “how to operate a tor hidden service”, a relevant result with low authority (like an unknown forum post called “how to operate a hidden service”) will be at the bottom of the results list, while a less relevant page with high authority (like “Tor hidden wiki”) will be at the top. This needs to be handled.

One possible approach is to use Elasticsearch boosting, which operates on '_score'. See https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html and https://www.elastic.co/guide/en/elasticsearch/guide/current/boosting-by-popularity.html
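A minimal sketch of the boosting idea, assuming the index stores authority in a numeric field: a `function_score` query lets text relevance drive the ranking while authority only adds a logarithmically dampened bonus. The function names, the field names `title`, `content` and `authority`, and the sample numbers in the usage note are assumptions for illustration, not Ahmia's actual mapping or scoring.

```python
import math


def build_boosted_query(text, fields=("title", "content"),
                        authority_field="authority"):
    """Relevance query whose score is nudged, not replaced, by authority."""
    return {
        "query": {
            "function_score": {
                "query": {
                    "multi_match": {"query": text, "fields": list(fields)}
                },
                # log1p computes log10(1 + field value) in Elasticsearch.
                "field_value_factor": {
                    "field": authority_field,
                    "modifier": "log1p",
                    "missing": 0,
                },
                # Add the bonus to _score instead of multiplying by it.
                "boost_mode": "sum",
            }
        }
    }


def combined_score(relevance, authority):
    """Local mirror of the arithmetic: _score + log10(1 + authority)."""
    return relevance + math.log10(1 + authority)
```

With made-up numbers, a forum post with relevance 8.0 and authority 0.5 now outscores a wiki page with relevance 2.0 and authority 50.0, whereas ranking by authority alone would put the wiki page first.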

  • More statistics and better API

Apart from the linking structure, we could add more visualizations and statistics. I made a popularity graph earlier, but we had to remove it because the API I used is no longer available. We can create similar APIs to visualize stats such as the popularity and uptime of onions.

  • Preview of the onion when pointer is hovered upon it

Users are often hesitant to open an unknown link on the Tor network. A preview of each onion on the search page would be handy, since users could look at the page before deciding whether to open it. This can be implemented with jQuery; see https://github.com/lonekorean/mini-preview for an example.

What now?

I plan to keep contributing to Ahmia in my free time after GSoC. Ahmia is already considered one of the best search engines in its domain. With more time and work put into its development, it can achieve much more.
