
@0xHamy
Last active April 28, 2025 14:33
A structure for an upcoming threat hunting course by Cyber Mounties Canada and 0xHamy (Hamed Kohi).

Course Title: Cyber Threat Intelligence - From Forums to Frontlines

  • Subtitle: Build a Cyber Threat Intelligence Pipeline with Scraping, AI, and Real-Time Alerts
  • Target Audience: Intermediate learners with basic coding skills (Python preferred) and curiosity about cybersecurity
  • Prereqs: Python basics and a willingness to work through hands-on exercises
  • Duration: 80 hours
  • Languages: English | French | Spanish | German | Portuguese
  • Format: Text & images
  • Cost: Free

This course teaches cyber threat intelligence by scraping data from a simulated hacking forum. Learners will be introduced to all major types of forums: clearnet forums, Tor forums, and the beloved I2P network.

This course is still under development and its structure may change. If you have any suggestions, please contact me on Discord: @0xHamy.

Module 1: Laying the Groundwork

Goal: Get learners excited and equipped with the threat hunting mindset and core tools.
Duration: 1 week

  • Intro to Threat Intel

    • What is threat intel? (Proactive vs. reactive security)
    • Real-world example: Tracking initial access sales on dark web forums
    • Why it matters for Canada (e.g., protecting local businesses)
    • Exercise: Analyze a fake forum post (provided as text) and spot a threat keyword (e.g., “RDP sale”).
  • Intro to AI, Machine Learning & LLMs

    • Quick rundown: AI vs. ML vs. LLMs (keep it simple)
    • How LLMs can spot patterns humans miss (e.g., threat lingo)
    • Exercise: Run a pre-trained LLM (e.g., via Hugging Face) on sample text and watch it flag something (see the classification sketch at the end of this module).
  • Intro to Web Scraping

    • Basics: What’s scraping, why use it for intel?
    • Explain legal issues regarding scraping
    • Tools: Requests, BeautifulSoup, Selenium, Playwright, Puppeteer, Electron, headless browsers, and browser extensions (quick comparison)
    • Anti-scraping tech: IP bans, rate-limiting, captchas, account lockouts
    • Exercise: Scrape a dummy webpage (e.g., a mock forum you provide) and extract a post title (see the scraping sketch at the end of this module).
    • Intro to proxies: Proxychains & SmartProxy
    • Exercise: Scrape the sim site using proxies and confirm the logs show requests coming from different IPs
  • Intro to Data Analytics

    • Basics: What’s big data?
    • Strategy: scrape data continuously by creating batches based on date ranges (e.g., Jan-May 2024)
    • Categorizing data
    • Exercise: Scrape a site and categorize the data
  • Malware Forum Simulation

    • Set up a safe, local forum (e.g., a Docker container)
    • To make it harder to scrape, set up bot protection that differentiates between a real browser and a bot
    • Also make the site JS-based so that data is retrieved via APIs and the DOM is updated dynamically; the page source must not contain the actual page's data
    • Set a custom captcha for login, something bypassable
    • Add scripts so the forum's chat stays active with people talking in it; new posts are added every 6 hours
    • Exercise: Scrape one post from the sim using Python—feel the thrill.
  • Cracking Forum Simulation

    • Set up another forum, but make it JavaScript-heavy to mimic modern sites, making scraping harder
    • Exercise: Scrape posts using JS-friendly scrapers
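As a taste of the LLM exercise above, here is a minimal sketch that runs a pre-trained zero-shot classifier from Hugging Face on a sample forum post. The model name, candidate labels, and post text are illustrative placeholders, not course material:

```python
# Minimal sketch: flag a sample forum post with a pre-trained model.
# Assumes `pip install transformers` plus a backend such as PyTorch.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

post = "Selling RDP access to a logistics company, 10k employees, price negotiable."
labels = ["initial access sale", "data leak", "general discussion"]

result = classifier(post, candidate_labels=labels)
for label, score in zip(result["labels"], result["scores"]):
    print(f"{label}: {score:.2f}")  # highest score should land on "initial access sale"
```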
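And for the scraping exercise, a minimal sketch using Requests and BeautifulSoup against a hypothetical local mock forum; the URL and CSS selector are assumptions about the sim's markup:

```python
# Minimal sketch: fetch a mock forum thread and extract the post title.
import requests
from bs4 import BeautifulSoup

resp = requests.get("http://localhost:8080/forum/thread/1", timeout=10)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
title = soup.select_one("h1.post-title")  # selector depends on the mock forum's markup
print(title.get_text(strip=True) if title else "No post title found")
```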

Milestone: Learners scrape a fake forum post and understand the threat hunting mission.


Module 2: Building the Web Interface

Goal: Create a simple UI to display threat data later.
Duration: 1 week

  • Picking Frameworks & Libraries for Threat Intel

    • Flask vs. Django vs. FastAPI
    • PostgreSQL vs. MongoDB (start with PostgreSQL)
    • Exercise: Install FastAPI and create a simple website scraper
  • Databases

    • Why store threat data? (e.g., tracking over time)
    • Set up a basic PostgreSQL DB for forum posts
    • Exercise: Save data from the website scanner to the database (see the database sketch at the end of this module)
    • Exercise: Showcase database CRUD operations
  • External & Internal APIs

    • What’s an API? Quick example (e.g., fetching IP geolocation)
    • Turn the website scanner into an API that can be called from JS
    • Exercise: Call your API and show a post in JSON (see the FastAPI sketch at the end of this module).
  • Task managers

    • Intro to background tasks: threading, Celery, in-memory task queues
    • Exercise: Run concurrent scans on thousands of sites simultaneously in the background; use Celery to start a scan at a specific time (see the Celery sketch at the end of this module)
  • Bonus: Local Deployment

    • Run it locally with FastAPI’s dev server
    • Exercise: See your UI live at localhost:8000.
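A minimal sketch of what the internal API exercise could look like with FastAPI; the in-memory POSTS list stands in for the PostgreSQL table built in this module, and the route path is an assumption:

```python
# Minimal sketch: expose a scraped post as JSON.
# Run with: uvicorn main:app --reload  (serves at localhost:8000)
from fastapi import FastAPI, HTTPException

app = FastAPI()

POSTS = [
    {"id": 1, "title": "Got 2 RDPs for sale", "author": "dummy_user"},
]

@app.get("/api/posts/{post_id}")
def get_post(post_id: int):
    for post in POSTS:
        if post["id"] == post_id:
            return post  # FastAPI serializes dicts to JSON automatically
    raise HTTPException(status_code=404, detail="Post not found")
```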
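For the database exercise, a minimal sketch of saving a forum post to PostgreSQL with SQLAlchemy; the connection string, table name, and columns are placeholders:

```python
# Minimal sketch: persist one scraped post to PostgreSQL.
# Assumes `pip install sqlalchemy psycopg2-binary` and a local database.
from sqlalchemy import create_engine, Column, Integer, String, Text
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class ForumPost(Base):
    __tablename__ = "forum_posts"
    id = Column(Integer, primary_key=True)
    title = Column(String(255), nullable=False)
    content = Column(Text)
    author = Column(String(100))

engine = create_engine("postgresql://user:password@localhost/threat_intel")
Base.metadata.create_all(engine)  # create the table if it doesn't exist

with Session(engine) as session:
    session.add(ForumPost(title="Got 2 RDPs for sale",
                          content="Access to two hosts, details in DM...",
                          author="dummy_user"))
    session.commit()
```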
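And for the task-manager exercise, a minimal Celery sketch that schedules a scan for a specific time; the broker URL assumes a local Redis instance (`pip install celery[redis]`):

```python
# Minimal sketch: queue a background scan to run 10 minutes from now.
from datetime import datetime, timedelta, timezone
from celery import Celery

app = Celery("scans", broker="redis://localhost:6379/0")

@app.task
def scan_site(url: str) -> str:
    # Real scan logic goes here; this stub just records the work item.
    return f"scanned {url}"

if __name__ == "__main__":
    scan_site.apply_async(
        args=["http://localhost:8080"],
        eta=datetime.now(timezone.utc) + timedelta(minutes=10),  # start time
    )
```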

Milestone: Learners have a working web app that displays threat intel from the Tor forum sim.


Module 3: Building the Data Scraper

Goal: Craft a stealthy scraper to gather forum data.
Duration: 1-2 weeks

  • Traffic Anonymizing Through Proxies

    • Why? (Avoid bans, stay hidden)
    • Tools: Data center proxies, residential proxies (e.g., Luminati)
    • Exercise: Route a simple request through a free proxy and scrape a test page (see the proxy rotation sketch at the end of this module).
  • Polling Between Accounts, IPs, and Captcha Bypass

    • Rotate IPs with datacenter proxies, add sleep delays for rate-limits
    • Exercise: Scrape the Tor sim, rotating between two IPs with a 5-second sleep, and rotate between accounts when one gets locked out
  • Creating Account Profiles

    • An interface to add a forum account's credentials, pin a fixed datacenter IP to it, and set its user-agent and browser info to make it look like a distinct user
    • This will be used with Playwright to maintain long-term sessions for scraping until a batch is completed
    • Exercise: Scrape the site to determine the total number of posts in two categories using two distinct accounts
  • Collecting Data

    • Target: Post titles, post content, post comments, usernames, timestamps from the forum sim (or real target later).

    • Tiered Scanning:

      • Use UIDs to keep track of posts
      • Surface Scan: Check titles for keywords (e.g., “RDP,” “leak”) to flag posts worth digging into.
      • Content Scan: Scrape full post text for flagged posts.
      • Deep Scan: Pull comments/replies only for high-priority hits (e.g., confirmed threats).
      • Why: Saves time—don’t deep-scan irrelevant rants about crypto scams.
      • Exercise: Write a function to filter titles (regex: r"RDP|exploit"), then scrape deeper if it matches (see the filtering sketch at the end of this module).
    • Scalability Basics:

      • Determine the date of the first post in a category and get a count of all posts in that category
      • Divide the count into small batches and add them to the list of tasks to be processed (scraped, with useful data extracted)
      • Every scraping run gets a start timestamp: a post added after or during a scan will be skipped by the current post counter, and the next run will start from where this one left off
      • Timestamps also help us avoid re-enumerating posts
      • There are two types of scraping bots: the first determines the number of posts in a category and creates batches, with a maximum number of posts per batch
      • The second type scans posts in the tiered-scanning style
      • Add threading to run multiple scraping operations simultaneously
      • Rate limit: Random sleep (2-5 sec) between requests
      • Exercise: Spin up two threads to scrape data from the site, then add more posts mid-scan to confirm the bots grab the newly added data.
    • Resuming & Tracking

      • Keep proper track of gathered data so you can resume from where you left off, especially within batches
      • Exercise: Close the web app mid-scan and confirm it resumes from where it left off
    • Error Resilience:

      • Add retries (e.g., 3 attempts) if a page fails (use try/except with requests).
      • Log errors to a file (e.g., scrape_errors.log).
      • Why: Forums break—teach learners to handle it.
      • Exercise: Simulate a 404, retry, and log it (see the retry sketch at the end of this module).
  • Data Priority Queuing:

    • Surface-level scanning of post titles is sometimes useful, but thorough data gathering means looking into the content of every post; when a post's title isn't clear but it sits in a seller's category, add it to a queue to be scanned later
    • Prioritize posts with clear titles; queue the unclear ones rather than skipping them, so no data is lost
    • Exercise: Create a post with a generic title like “Got 2 RDPs for sale” whose body actually discusses initial access sales, and confirm it gets queued
  • Categorizing Data

    • Basics: Regex for keywords (e.g., “malware,” “exploit”)
    • Teaser: ML can do this better: categorize data by the content it discusses (e.g., “selling access,” “data leak”) and add a short description so we can search for it later
    • Exercise: Sort CSV data into “malware” vs. “misc” using regex.
  • Translating posts

    • Demonstrate how to identify Russian posts using the DeepL translation API's language detection
    • Exercise: Identify all Russian posts on the forum
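A minimal sketch of the proxy-rotation exercises: requests alternate between two placeholder proxy addresses with a random 2-5 second sleep between them:

```python
# Minimal sketch: rotate proxies and rate-limit requests to the sim forum.
import random
import time
import requests

PROXIES = [
    "http://proxy1.example.com:8080",  # placeholder datacenter proxy
    "http://proxy2.example.com:8080",  # placeholder datacenter proxy
]

for page in range(1, 4):
    proxy = PROXIES[page % len(PROXIES)]  # alternate between the two proxies
    resp = requests.get(
        f"http://localhost:8080/forum/page/{page}",
        proxies={"http": proxy, "https": proxy},
        timeout=15,
    )
    print(f"page {page} via {proxy}: HTTP {resp.status_code}")
    time.sleep(random.uniform(2, 5))  # rate-limit friendly random sleep
```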
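A minimal sketch of the surface-scan filter from the tiered-scanning exercise, using the regex from the exercise; the sample titles are made up:

```python
# Minimal sketch: flag titles worth a deeper content scan.
import re

THREAT_PATTERN = re.compile(r"RDP|exploit", re.IGNORECASE)

def surface_scan(title: str) -> bool:
    """Return True when a title matches a threat keyword and deserves a content scan."""
    return bool(THREAT_PATTERN.search(title))

titles = [
    "Got 2 RDPs for sale",
    "Best crypto wallets in 2024",
    "0day exploit for popular CMS",
]

flagged = [t for t in titles if surface_scan(t)]
print(flagged)  # ['Got 2 RDPs for sale', '0day exploit for popular CMS']
```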
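And a minimal sketch of the error-resilience exercise: three retry attempts around a failing request, with errors logged to scrape_errors.log; the URL is a placeholder:

```python
# Minimal sketch: retry a failing page and log every failed attempt.
import logging
import time
import requests

logging.basicConfig(filename="scrape_errors.log", level=logging.ERROR)

def fetch_with_retries(url: str, attempts: int = 3) -> str | None:
    for attempt in range(1, attempts + 1):
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()  # raises on 404 and other HTTP errors
            return resp.text
        except requests.RequestException as exc:
            logging.error("attempt %d failed for %s: %s", attempt, url, exc)
            time.sleep(2)  # brief pause before retrying
    return None

# Simulate the 404 exercise against a page that doesn't exist.
fetch_with_retries("http://localhost:8080/forum/missing-page")
```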

Milestone: Learners scrape the Tor sim, anonymize traffic, and categorize posts into a CSV.


Module 4: Fine-Tuning an LLM

Goal: Train an LLM to spot threat patterns like a pro.
Duration: 1-2 weeks

  • Translating posts, content or comments

    • Using LLMs & DeepL to translate data into English
    • Exercise: Run it on Russian posts and translate them into English to extract actionable intel (see the DeepL sketch at the end of this module)
  • Positive, Negative & Neutral Labels

    • What’s labeling? Why do it? (e.g., “RDP sale” = positive threat)
    • Exercise: Label 5 fake forum posts manually (e.g., in a spreadsheet).
  • Labeling Data Manually

    • Tool: Label Studio (free, easy)
    • Exercise: Import your CSV, label 10 more posts in Label Studio.
  • Labeling Data Automatically with Advanced LLMs

    • Use a pre-trained model (e.g., BERT via Hugging Face) to auto-label
    • Exercise: Run BERT on your CSV and compare to manual labels.
  • Training the Model to Detect Specific Patterns

    • Focus: Initial access sales (e.g., “RDP,” “VPN creds”)
    • Use a lightweight model (e.g., DistilBERT) for simplicity
    • Exercise: Fine-tune it on your labeled data (provide a Colab notebook; see the fine-tuning sketch at the end of this module).
  • Testing the Model with the Scraper

    • Feed live scraped data into the model
    • Exercise: Scrape the forum sim, run the model, and see it flag a “threat.”
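A minimal sketch of the translation exercise with the official deepl Python package (`pip install deepl`); the API key and posts are placeholders:

```python
# Minimal sketch: detect and translate Russian posts via DeepL.
import deepl

translator = deepl.Translator("YOUR_DEEPL_API_KEY")  # placeholder key

posts = [
    "Продаю доступ RDP к канадской компании",  # "Selling RDP access to a Canadian company"
    "Selling fresh combo lists, hit me up",
]

for post in posts:
    result = translator.translate_text(post, target_lang="EN-US")
    if result.detected_source_lang == "RU":
        print(f"Russian post found: {result.text}")
```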
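And a minimal sketch of the fine-tuning exercise with DistilBERT via Hugging Face's Trainer; the CSV layout (`text` and `label` columns, 0 = benign, 1 = threat) and the hyperparameters are assumptions:

```python
# Minimal sketch: fine-tune DistilBERT on labeled forum posts.
# Assumes `pip install transformers datasets` and a labeled_posts.csv file.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

dataset = load_dataset("csv", data_files="labeled_posts.csv")["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="threat-model", num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=dataset,
)
trainer.train()
```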

Milestone: Learners have an LLM that flags threat posts from scraped data.


Module 5: Cyber Threat Intelligence Gathering

Goal: Turn raw data into actionable Canadian-focused intel.
Duration: 1 week

  • Scraping the Forum with Data Scraper

    • Scale up: Scrape 20+ posts from the forum sim
    • Add a sanity check (e.g., skip duplicates)
    • Exercise: Save to the PostgreSQL DB via the web app.
  • Identifying Initial Access Sales with Fine-Tuned Model

    • Run the LLM on the new data
    • Exercise: Display flagged posts in the web UI.
  • Cross-Referencing Hacked Businesses with Canada

    • Use a mock Canadian business list (provide as CSV)
    • Match emails/domains from forum leaks
    • Exercise: Find one “breached” Canadian company in the sim data.
  • Creating Breach Reporting Templates

    • Simple format: “Company X, your data was leaked on [date]”
    • Exercise: Draft a template in Python.
  • One-Click Breach Reporting via Email

    • Use Python’s smtplib for email
    • Exercise: Send a test breach email to yourself (e.g., Gmail); see the smtplib sketch below.
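A minimal sketch of the one-click breach report with Python's smtplib; the credentials, addresses, and report text are placeholders (Gmail requires an app password):

```python
# Minimal sketch: send a breach notification email to yourself.
import smtplib
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "Breach notification: leaked credentials found"
msg["From"] = "you@example.com"
msg["To"] = "you@example.com"  # send the test to yourself
msg.set_content("Company X, your data was leaked on 2024-05-01. Details attached.")

with smtplib.SMTP_SSL("smtp.gmail.com", 465) as server:
    server.login("you@example.com", "APP_PASSWORD")  # placeholder credentials
    server.send_message(msg)
```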

Milestone: Learners scrape, categorize, and report a fake Canadian breach via email.


Module 6: Intelligence Watchlists

Goal: Create a watchlist to keep track of threat actors.
Duration: 1 week

Operators can use this functionality to flag a threat actor.

  • Create a new database for keeping track of threat actors

    • Keep track of threat actor activity such as all posts, comments, and profile information
    • Specify how often we should extract info about the target; introduce priorities:
      • mission critical, high, medium, low, custom
  • Cross-reference activities

    • Cross-reference a user's activity on this forum against all other forums
    • Create a scraper for a secondary site where user info is stored as JSON
    • Scrape the data from that dummy site, cross-check activity against the current data, and perform linguistic analysis to find similar patterns and overlapping activity based on timestamps (see the cross-referencing sketch below)
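A minimal sketch of the cross-referencing step against a dummy secondary site that serves user info as JSON; the URL, watchlisted username, and field names are assumptions:

```python
# Minimal sketch: check a watchlisted actor against a second forum's user data.
import requests

WATCHLIST = {"dark_seller42"}  # hypothetical flagged threat actor

resp = requests.get("http://localhost:9090/api/users.json", timeout=10)
resp.raise_for_status()

for user in resp.json():
    if user["username"] in WATCHLIST:
        # Timestamps from both forums can then be compared for overlapping
        # activity windows, alongside linguistic analysis of the posts.
        print(f"{user['username']} also active at {user['last_seen']}")
```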

Milestone: Learners build a watchlist and cross-reference a threat actor's activity across two forums.


Module 7: Custom Alert System

Goal: Build real-time alerts for critical threats.
Duration: 1 week

  • Configuring Email Notifications

    • Set up smtplib with a Gmail account (or similar)
    • Exercise: Send an alert when a post discusses selling access to a Canadian business
  • Bonus: SMS Alerts (Optional)

    • Intro to Twilio (free tier)
    • Exercise: Send a test SMS to yourself (if they opt in); see the Twilio sketch below.
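A minimal sketch of the optional Twilio alert (`pip install twilio`); the account SID, auth token, and phone numbers are placeholders:

```python
# Minimal sketch: fire an SMS alert for a flagged post.
from twilio.rest import Client

client = Client("ACCOUNT_SID", "AUTH_TOKEN")  # placeholder credentials

client.messages.create(
    body="ALERT: post flagged as selling access to a Canadian business",
    from_="+15005550006",   # your Twilio number
    to="+15551234567",      # your own phone, if you opt in
)
```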

Milestone: Learners get an email alert when the scraper finds a prioritized threat.


Module 8: Conclusion & Next Steps

Goal: Wrap up with a bang and point them forward.
Duration: Half a week

  • Course Recap

    • “You scraped a forum, built a UI, fine-tuned an LLM, and alerted a Canadian company—all in 8 weeks.”
    • Show off their final product: a running threat hunting pipeline.
  • Real-World Application

    • “Try this on a safe, legal target (e.g., bug bounty forums).”
    • Canada tie-in: “Protect local businesses with your skills.”
  • What’s Next?

    • Join Mission Cyber Sentinel
    • Take an advanced course (e.g., “Malware Analysis 101”)
    • Exercise: Share their UI screenshot on your platform’s community (if you build one).
  • Reward:

    • Digital badge: “Cyber Mounties Threat Hunter”
    • Course-specific badges: Five Eyes badges, NATO badge, neutral badges

Milestone: Learners finish with a working tool and a sense of purpose.
