@k4u5h1k
Last active February 17, 2023 14:47
GSoC '21 report

Google Summer of Code 2021 Final Report

Name: Kaushik Sivashankar • Project: OWASP Maryam

Proposal Topic

Dark Web Exploration (for Cyber Threat Analysis) And Expansion of Data Sources

Milestones Achieved

  • Designed and implemented a text document clustering module using KMeans and FP-growth. It is intended for use within Iris, Maryam's metasearch engine. (Commit) (Personal repo link)

    • Generate text data using:

      ./maryam.py -e google -q 'Marvel' -l 10 --api --format > test.json

    • Pass it to the cluster module with:

      ./maryam.py -e cluster --json test.json

      Screenshot 2021-08-17 at 12.49.58 PM

      Screenshot 2021-08-17 at 12.50.48 PM

      Screenshot 2021-08-17 at 12.51.08 PM

      Screenshot 2021-08-17 at 12.51.23 PM

      Screenshot 2021-08-17 at 12.51.43 PM
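To give a feel for the KMeans half of the clustering step described above, here is a minimal sketch. It is not the code of Maryam's cluster module: it uses plain term-count vectors and Lloyd's algorithm from the standard library only, it leaves out the FP-growth part entirely, and the helper names (`vectorize`, `kmeans`) are made up for illustration.

```python
import math
import random
from collections import Counter


def vectorize(docs):
    """Turn each document into a term-count vector over a shared vocabulary."""
    vocab = sorted({word for doc in docs for word in doc.lower().split()})
    vectors = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        vectors.append([float(counts[word]) for word in vocab])
    return vectors


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(vectors, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm: returns one cluster label per vector."""
    rng = random.Random(seed)
    # Illustrative initialisation: pick k input vectors as starting centroids.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


docs = [
    "marvel iron man movie",
    "marvel iron man comic",
    "python snake reptile",
    "python snake venom",
]
labels = kmeans(vectorize(docs), k=2)
```

With these four toy documents, the two Marvel titles end up sharing one label and the two snake titles the other. A real module would more likely work on TF-IDF weights and the titles or snippets found in the JSON output, which this sketch does not try to parse.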

  • Designed and implemented a smart dark web crawler module, with a custom TF-IDF text retriever class that uses cosine similarity to rank the best pages to crawl in each snowball sampling iteration. (Commit) (Note: explicit results are not shown unless searched for)

    In progress:

    Screenshot 2021-08-17 at 12.55.42 PM

    Results after reaching target depth:

    Screenshot 2021-08-17 at 12.58.04 PM
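The ranking idea behind the crawler's retriever (TF-IDF weights compared by cosine similarity) can be sketched as follows. This is an illustration under assumptions, not the retriever class from the commit: the function names, the whitespace tokenizer, and the smoothed idf formula are all choices made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    """Very small tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, plus the idf table."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed idf (an assumption) so common terms keep a small weight.
    idf = {t: math.log((1 + n) / (1 + count)) + 1 for t, count in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: (c / len(tokens)) * idf[t] for t, c in tf.items()})
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_pages(query, pages):
    """Return page indices sorted from most to least similar to the query."""
    vectors, idf = tfidf_vectors(pages)
    q_tf = Counter(tokenize(query))
    q_len = sum(q_tf.values())
    # Query terms unseen in the pages carry no weight.
    query_vec = {t: (c / q_len) * idf[t] for t, c in q_tf.items() if t in idf}
    scores = [cosine(query_vec, v) for v in vectors]
    return sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)


pages = [
    "marvel comics news and weekly reviews",
    "gardening tips for spring",
    "marvel comics forum",
]
order = rank_pages("marvel comics", pages)
```

In a snowball sampling loop, the top-ranked indices would be the candidates to crawl in the next iteration. How many pages are kept per iteration and how the text of each page is extracted are left out here.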

  • Implemented search modules for a range of sources, namely:

    • Phone Number Search using NumVerify (PR).

      Screenshot 2021-08-17 at 12.16.23 PM

    • Dictionary module using Google Dictionary (PR).

      Screenshot 2021-08-17 at 12.14.48 PM

    • SanctionSearch (PR)

      Screenshot 2021-08-17 at 12.16.02 PM

    • Gigablast (PR)

      Screenshot 2021-08-17 at 12.17.08 PM

    • Reddit Search (without official API or scraping) (Commit).

      Screenshot 2021-08-17 at 12.18.26 PM

    • Twitter Tweet Search (without official API or scraping) with sentiment analysis (Commit).

      Screenshot 2021-08-17 at 12.20.22 PM

      Screenshot 2021-08-17 at 12.20.52 PM

    • ActiveSearchResults (PR)

      Screenshot 2021-08-17 at 12.21.13 PM

    • PirateBay (PR) (later updated to use an undocumented backend API)

      Screenshot 2021-08-17 at 12.21.55 PM

    • Google Scholar (PR)

      Screenshot 2021-08-17 at 12.22.45 PM

    • ArXiv (PR)

      Screenshot 2021-08-17 at 12.23.27 PM

    • PubMed (PR)

      Screenshot 2021-08-17 at 12.25.07 PM

    • Core.ac.uk Search (PR)

      Screenshot 2021-08-17 at 12.26.27 PM

    • Famous Person Search (Commit)

      Screenshot 2021-08-17 at 12.28.36 PM

    • Article Search (Commit)

      Screenshot 2021-08-17 at 12.29.30 PM

  • Added standalone utility classes, namely:

    • Web Page Term Frequency Histogram class. (Brought to Maryam from an extension repo (PR))

      Jen

      Image taken from famous person module output for Jennifer Aniston.

    • Safe Search Class (manages captchas and evades engine-specific errors using rotation). (Commit (previously named CaptchaManager))
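As a rough illustration of what a web page term frequency histogram computes, the sketch below counts terms with `collections.Counter` and renders text bars in place of a plotted chart. The stopword list, the minimum word length, and the function names are assumptions for this example and are not taken from Maryam's class.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "and", "for", "with"}


def term_frequencies(text, top=5):
    """Return the `top` most frequent terms as (term, count) pairs."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return counts.most_common(top)


def render_bars(pairs):
    """Render (term, count) pairs as left-aligned text bars."""
    width = max((len(term) for term, _ in pairs), default=0)
    return [f"{term:<{width}} {'#' * count}" for term, count in pairs]


text = (
    "Maryam is an OSINT framework. The framework collects data; "
    "Maryam modules search data sources."
)
top_terms = term_frequencies(text, top=3)
bars = render_bars(top_terms)
```

Here the three most frequent terms are "maryam", "framework", and "data", each with a count of 2.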

  • Traced startup lag to heavy imports such as matplotlib, then optimized and cleaned up the imports, significantly reducing startup time. (Commit 1, Commit 2)
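One common way to cut startup time caused by heavy imports is to defer them until the code path that needs them actually runs. The sketch below shows that general pattern only; whether the linked commits use exactly this approach is not stated in the report. The standard-library module `colorsys` stands in for a heavy dependency such as matplotlib so the example runs anywhere, and both function names are hypothetical.

```python
import sys


def to_hsv(red, green, blue):
    """Convert an RGB colour to HSV using a lazily imported module."""
    # Deferred import: `colorsys` (a stand-in for a heavy dependency such as
    # matplotlib) is loaded on the first call instead of at program startup.
    import colorsys

    return colorsys.rgb_to_hsv(red, green, blue)


def is_loaded(module_name):
    """Report whether a module has already been imported in this process."""
    return module_name in sys.modules


hsv = to_hsv(1.0, 0.0, 0.0)
loaded_after_call = is_loaded("colorsys")
```

The trade-off is that the import cost moves to the first call of the function, which suits modules that only some commands need.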

  • Restructured and cleaned up Maryam's file tree to make it suitable for packaging and distribution. (PR (closed, but later rechecked and committed manually by mentor saeeddhqan))

  • Packaged and deployed Maryam to PyPI. (link)

  • Fixed a critical bug affecting macOS on Python 3.8 and 3.9. (Issue)

  • Made numerous bug fixes, all of which can be accessed from the list of my commits.

To Continue My Work

  • Implement a frontend for the Web API.
  • Add a way to test module utils (at least engines) without module_api or module_run.
  • Iris is key. The ultimate goal of Maryam is to improve Iris until it can intelligently combine the capabilities of all modules and present its output intuitively.
    • This requires us to classify an input query into the module that (we think) can handle it best.
    • Output could be formatted as an accordion of the most suitable module outputs.
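One simple starting point for the query classification idea is a rule-based router that scores a query against hint words per module. The sketch below is purely hypothetical: the module labels, hint words, phone number pattern, and the `websearch` fallback are invented for illustration and do not correspond to Maryam's actual module names or to any existing Iris code.

```python
import re

# Hypothetical routing table: module label -> hint words. These labels and
# hints are illustrative and are not Maryam's actual module names.
MODULE_HINTS = {
    "scholar": {"paper", "research", "study", "journal"},
    "dictionary": {"define", "meaning", "definition", "synonym"},
    "social": {"tweet", "tweets", "reddit", "post", "thread"},
}

# Loose pattern for queries that look like a phone number (an assumption).
PHONE_PATTERN = re.compile(r"^\+?[\d\s\-()]{7,}$")


def route_query(query, default="websearch"):
    """Pick the module label whose hint words overlap the query the most."""
    if PHONE_PATTERN.match(query.strip()):
        return "phone"
    words = set(re.findall(r"[a-z]+", query.lower()))
    scores = {name: len(words & hints) for name, hints in MODULE_HINTS.items()}
    best = max(scores, key=scores.get)
    # Fall back to a general search when no hint word matches.
    return best if scores[best] > 0 else default


examples = {
    "+1 415 555 2671": route_query("+1 415 555 2671"),
    "define serendipity": route_query("define serendipity"),
    "research paper on clustering": route_query("research paper on clustering"),
    "weather tomorrow": route_query("weather tomorrow"),
}
```

A learned classifier could later replace the keyword table, and returning the top few labels instead of one would fit the accordion-style output suggested above.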