- Subtitle: Build a Cyber Threat Intelligence Pipeline with Scraping, AI, and Real-Time Alerts
- Target Audience: Intermediate learners with basic coding skills (Python preferred) and curiosity about cybersecurity (among other interests)
- Prereqs: Python basics, willingness to do hands-on exercises
- Duration: 80 hours
- Languages: English | French | Spanish | German | Portuguese
- Format: Text & images
- Cost: Free
This course covers cyber threat intelligence by scraping data from a simulated hacking forum. Learners will be introduced to all types of forums: clearnet forums, Tor forums, and the beloved I2P network.
This course is still under development and its structure may change. If you have any suggestions, please contact me on Discord @0xHamy.
Goal: Get learners excited and equipped with the threat hunting mindset and core tools.
Duration: 1 week
-
Intro to Threat Intel
- What is threat intel? (Proactive vs. reactive security)
- Real-world example: Tracking initial access sales on dark web forums
- Why it matters for Canada (e.g., protecting local businesses)
- Exercise: Analyze a fake forum post (provided as text) and spot a threat keyword (e.g., “RDP sale”).
-
Intro to AI, Machine Learning & LLMs
- Quick rundown: AI vs. ML vs. LLMs (keep it simple)
- How LLMs can spot patterns humans miss (e.g., threat lingo)
- Exercise: Run a pre-trained LLM (e.g., via Hugging Face) on sample text and see it flag something.
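A minimal sketch of what this exercise could look like, assuming the Hugging Face `transformers` library is installed; the default zero-shot model, sample post, and candidate labels are illustrative, not part of the course materials:

```python
# pip install transformers torch
from transformers import pipeline

# Zero-shot classification lets a pre-trained LLM score text
# against labels it was never explicitly trained on.
classifier = pipeline("zero-shot-classification")

post = "Selling RDP access to a Canadian logistics company, $500, escrow accepted."
labels = ["initial access sale", "data leak", "harmless chatter"]

result = classifier(post, candidate_labels=labels)
# The highest-scoring label comes first.
print(result["labels"][0], round(result["scores"][0], 3))
```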
-
Intro to Web Scraping
- Basics: What’s scraping, why use it for intel?
- Explain legal issues regarding scraping
- Tools: Requests, BeautifulSoup, Selenium, Playwright, Puppeteer, Electron, headless browsers, browser-based extensions (quick compare)
- Anti-scraping tech: IP bans, rate-limiting, captchas, account lockouts
- Exercise: Scrape a dummy webpage (e.g., a mock forum you provide) and extract a post title.
- Intro to proxies; Proxychains & SmartProxy
- Exercise: Scrape the sim site using proxies & show logs coming from different IPs
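A rough sketch of the two exercises above, assuming `requests` and `beautifulsoup4` are installed; the mock-forum URL, proxy address, and `post-title` selector are placeholders you would swap for your own:

```python
# pip install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup

# Hypothetical local mock forum and proxy; replace with your own setup.
URL = "http://localhost:8080/forum/thread/1"
PROXIES = {"http": "http://203.0.113.10:8080", "https": "http://203.0.113.10:8080"}

resp = requests.get(URL, proxies=PROXIES, timeout=10)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
# Assumes post titles are rendered as <h2 class="post-title">...</h2>
title = soup.find("h2", class_="post-title")
print(title.get_text(strip=True) if title else "No post title found")
```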
-
Intro to Data Analytics
- Basics: What’s big data?
- Strategy: scrape data continuously by creating batches based on date ranges (e.g., Jan-May 2024); see the sketch after this list
- Categorizing data
- Exercise: Scrape a site & categorize
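A possible sketch of the date-range batching strategy mentioned above, using only the standard library; the helper name and month-sized batches are illustrative choices:

```python
from datetime import date, timedelta

def monthly_batches(start: date, end: date):
    """Split [start, end] into (batch_start, batch_end) tuples, one per month."""
    batches = []
    batch_start = start
    while batch_start <= end:
        # Jump to the first day of the next month.
        next_month = (batch_start.replace(day=1) + timedelta(days=32)).replace(day=1)
        batch_end = min(next_month - timedelta(days=1), end)
        batches.append((batch_start, batch_end))
        batch_start = next_month
    return batches

# e.g. Jan-May 2024 becomes five month-sized scraping tasks
for batch in monthly_batches(date(2024, 1, 1), date(2024, 5, 31)):
    print(batch)
```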
-
Malware Forum Simulation
- Set up a safe, local forum (e.g., a Docker container)
- To make it harder to scrape data, add bot protection that differentiates between a real browser and a bot
- Also make the site JS-based so that data is retrieved through APIs and the DOM is updated dynamically; the page source must not contain the actual page data
- Set a custom CAPTCHA for login, something bypassable
- Add scripts so the forum's chat stays active with people talking in it; new posts are added every 6 hours
- Exercise: Scrape one post from the sim using Python—feel the thrill.
-
Cracking Forum Simulation
- Set up another forum but make it JavaScript-heavy to mimic modern sites, making scraping harder
- Exercise: Scrape posts using JS-friendly scrapers
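One way this exercise might look with Playwright's sync API (other JS-capable scrapers work too); the sim URL and `.post-title` selector are assumptions:

```python
# pip install playwright && playwright install chromium
from playwright.sync_api import sync_playwright

# Hypothetical local cracking-forum sim; selectors are placeholders.
URL = "http://localhost:8081/forum"

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto(URL)
    # Wait until the JS front end has fetched posts from its API
    # and rendered them into the DOM.
    page.wait_for_selector(".post-title")
    for title in page.locator(".post-title").all_inner_texts():
        print(title)
    browser.close()
```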
Milestone: Learners scrape a fake forum post and understand the threat hunting mission.
Goal: Create a simple UI to display threat data later.
Duration: 1 week
-
Picking Frameworks & Libraries for Threat Intel
- Flask vs. Django vs. FastAPI
- PostgreSQL vs. MongoDB (start with PostgreSQL)
- Exercise: Install FastAPI and create a simple website scraper
-
Databases
- Why store threat data? (e.g., tracking over time)
- Set up a basic PostgreSQL DB for forum posts
- Exercise: Save data from the website scraper to the database
- Exercise: Showcase database CRUD operations
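A compact sketch of the CRUD exercise, assuming a local PostgreSQL instance and the `psycopg2` driver; the database name, credentials, and `posts` schema are placeholders:

```python
# pip install psycopg2-binary
import psycopg2

# Connection details are placeholders for your local PostgreSQL instance.
conn = psycopg2.connect(dbname="threat_intel", user="postgres",
                        password="postgres", host="localhost")
cur = conn.cursor()

# Create: a minimal table for scraped forum posts
cur.execute("""
    CREATE TABLE IF NOT EXISTS posts (
        id SERIAL PRIMARY KEY,
        title TEXT,
        content TEXT,
        scraped_at TIMESTAMP DEFAULT NOW()
    )
""")

# Insert, read, update, delete (CRUD)
cur.execute("INSERT INTO posts (title, content) VALUES (%s, %s) RETURNING id",
            ("Got 2 RDPs for sale", "Selling access, escrow only"))
post_id = cur.fetchone()[0]

cur.execute("SELECT title FROM posts WHERE id = %s", (post_id,))
print(cur.fetchone())

cur.execute("UPDATE posts SET title = %s WHERE id = %s", ("[flagged] RDP sale", post_id))
cur.execute("DELETE FROM posts WHERE id = %s", (post_id,))

conn.commit()
cur.close()
conn.close()
```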
-
External & Internal APIs
- What’s an API? Quick example (e.g., fetching IP geolocation)
- Turn the website scraper into an API that can be called from JS
- Exercise: Call your API and show a post in JSON.
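A minimal sketch of wrapping the scraper in a FastAPI endpoint that JS can call; the `/scan` route and the selector are illustrative, not a prescribed design:

```python
# pip install fastapi uvicorn requests beautifulsoup4
import requests
from bs4 import BeautifulSoup
from fastapi import FastAPI

app = FastAPI()

@app.get("/scan")
def scan(url: str):
    """Scrape a page and return the first post title as JSON."""
    resp = requests.get(url, timeout=10)
    soup = BeautifulSoup(resp.text, "html.parser")
    title = soup.find("h2", class_="post-title")  # placeholder selector
    return {"url": url, "title": title.get_text(strip=True) if title else None}

# Run with: uvicorn main:app --reload
# Then call from JS: fetch("http://localhost:8000/scan?url=http://localhost:8080/forum")
```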
-
Task managers
- Intro to background tasks, threading, Celery, in-memory tasks
- Exercise: Run concurrent scans on thousands of sites simultaneously in the background; use Celery to start at a specific time
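A hedged sketch of the Celery part of this exercise, assuming a local Redis broker; the task name, URL list, and 10-minute `eta` are placeholders:

```python
# pip install celery redis
from datetime import datetime, timedelta, timezone
from celery import Celery

# Assumes a Redis broker running locally.
app = Celery("tasks", broker="redis://localhost:6379/0")

@app.task
def scan_site(url: str) -> str:
    # Placeholder for the real scraping logic.
    return f"scanned {url}"

if __name__ == "__main__":
    urls = [f"http://localhost:8080/forum/thread/{i}" for i in range(1000)]
    # Queue thousands of scans; Celery workers run them concurrently.
    for url in urls:
        scan_site.delay(url)
    # Schedule one scan to start at a specific time (10 minutes from now).
    scan_site.apply_async(args=["http://localhost:8080/forum"],
                          eta=datetime.now(timezone.utc) + timedelta(minutes=10))
```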
-
Bonus: Local Deployment
- Run it locally with FastAPI’s dev server
- Exercise: See your UI live at `localhost:8000`.
Milestone: Learners have a working web app that displays threat intel from the Tor forum.
Goal: Craft a stealthy scraper to gather forum data.
Duration: 1-2 weeks
-
Traffic Anonymizing Through Proxies
- Why? (Avoid bans, stay hidden)
- Tools: Data center proxies, residential proxies (e.g., Luminati)
- Exercise: Route a simple request through a free proxy and scrape a test page.
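Besides commercial proxies, a related option the course touches on is routing through Tor's local SOCKS proxy; this sketch assumes the Tor service is running on port 9050 and `requests[socks]` is installed:

```python
# pip install requests[socks]   (and have the Tor service running locally)
import requests

# Route traffic through Tor's SOCKS proxy; socks5h resolves DNS inside Tor too.
TOR_PROXIES = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}

resp = requests.get("https://check.torproject.org/api/ip",
                    proxies=TOR_PROXIES, timeout=30)
print(resp.json())  # reports whether the request actually exited via Tor
```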
-
Polling Between Accounts, IPs, and Captcha Bypass
- Rotate IPs with datacenter proxies, add sleep delays for rate-limits
- Exercise: Scrape the Tor sim, rotating between two IPs with a 5-second sleep & rotate between accounts when one gets locked out
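A simplified sketch of the rotation idea, assuming two placeholder proxies and two accounts; real forum logins would use sessions, and HTTP basic auth only stands in for them here:

```python
import itertools
import time
import requests

# Placeholder proxies and accounts for the Tor sim.
PROXIES = itertools.cycle([
    {"https": "http://203.0.113.10:8080"},
    {"https": "http://203.0.113.11:8080"},
])
ACCOUNTS = [("hunter01", "pass1"), ("hunter02", "pass2")]
account_idx = 0

def fetch(url: str) -> requests.Response:
    global account_idx
    resp = requests.get(url, proxies=next(PROXIES),
                        auth=ACCOUNTS[account_idx], timeout=10)
    if resp.status_code in (401, 403):          # treated here as a lockout
        account_idx = (account_idx + 1) % len(ACCOUNTS)
        resp = requests.get(url, proxies=next(PROXIES),
                            auth=ACCOUNTS[account_idx], timeout=10)
    time.sleep(5)                               # 5-second delay for rate limits
    return resp

for page in range(1, 4):
    print(fetch(f"http://localhost:8080/forum?page={page}").status_code)
```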
-
Creating Account Profiles
- An interface to add a forum account's credentials, assign it a fixed datacenter IP, and set its user-agent and browser info so it appears as a distinct user
- This will be used with Playwright to create long-term sessions for scraping until a batch is completed
- Exercise: Scrape the site to determine the total number of posts in two categories using two distinct accounts
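A possible shape for an account profile, modeled as a Playwright browser context with a fixed proxy, user-agent, and saved session state; every field and selector below is a placeholder:

```python
# pip install playwright && playwright install chromium
from playwright.sync_api import sync_playwright

# One "account profile": fixed datacenter IP, user-agent, and stored session.
PROFILE = {
    "username": "hunter01",
    "password": "pass1",
    "proxy": {"server": "http://203.0.113.10:8080"},
    "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0 Safari/537.36",
    "state_file": "hunter01_state.json",
}

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context(proxy=PROFILE["proxy"],
                                  user_agent=PROFILE["user_agent"])
    page = context.new_page()
    page.goto("http://localhost:8080/login")       # placeholder login page
    page.fill("#username", PROFILE["username"])
    page.fill("#password", PROFILE["password"])
    page.click("button[type=submit]")
    # Persist cookies/session so the same profile can scrape a whole batch later.
    context.storage_state(path=PROFILE["state_file"])
    browser.close()
```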
-
Collecting Data
-
Target: Post titles, post content, post comments, usernames, timestamps from the forum sim (or real target later).
-
Tiered Scanning:
- Use UIDs to keep track of posts
- Surface Scan: Check titles for keywords (e.g., “RDP,” “leak”) to flag posts worth digging into.
- Content Scan: Scrape full post text for flagged posts.
- Deep Scan: Pull comments/replies only for high-priority hits (e.g., confirmed threats).
- Why: Saves time—don’t deep-scan irrelevant rants about crypto scams.
- Exercise: Write a function to filter titles (regex: `r"RDP|exploit"`), then scrape deeper if it matches.
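A sketch of the tiered-scanning filter, assuming posts arrive as dictionaries with `title` and `content` fields; the content check is a stand-in for the real deep-scan logic:

```python
import re

SURFACE_PATTERN = re.compile(r"RDP|exploit", re.IGNORECASE)

def surface_scan(posts: list[dict]) -> list[dict]:
    """Tier 1: keep only posts whose titles match threat keywords."""
    return [p for p in posts if SURFACE_PATTERN.search(p["title"])]

def content_scan(post: dict) -> bool:
    """Tier 2: placeholder check on the full post text for flagged posts."""
    return "sale" in post.get("content", "").lower()

posts = [
    {"uid": 1, "title": "Got 2 RDPs for sale", "content": "Fresh RDP sale, escrow only"},
    {"uid": 2, "title": "Crypto rant", "content": "not relevant"},
]

for post in surface_scan(posts):
    if content_scan(post):
        print(f"Deep-scan candidate: UID {post['uid']} - {post['title']}")
```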
-
Scalability Basics:
- Determine the date of the first post in a category and get a count of all posts in that category
- Divide the count into small batches and add them to the list of tasks to be processed (scraped and useful data extracted)
- Every scrape gets a start timestamp: posts added after or during a scan are skipped by the current post counter, and the next counter starts from where the last one left off
- Using timestamps also helps us avoid re-enumerating posts
- There should be two types of scraping bots: the first determines the number of posts in a category and creates batches (with a max number of posts per batch)
- The second group is responsible for scanning posts in the "Tiered Scanning" style
- Add threading to run multiple scraping operations simultaneously
- Rate limit: Random sleep (2-5 sec) between requests
- Exercise: Spin up two threads to scrape data from the site and then add more posts during the scanning to ensure that the bots will grab the newly added data.
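A minimal threading sketch for running batches concurrently with the 2-5 second random delay; the batch ranges and the scrape step are placeholders:

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor

def scrape_batch(batch: range) -> int:
    """Placeholder worker: scrape every post ID in the batch."""
    scraped = 0
    for post_id in batch:
        # requests.get(f"http://localhost:8080/post/{post_id}") would go here
        scraped += 1
        time.sleep(random.uniform(2, 5))   # rate limit: random 2-5 s delay
    return scraped

# Batches produced by the first group of bots (e.g., 50 posts per batch).
batches = [range(1, 51), range(51, 101)]

with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(scrape_batch, batches))

print(f"Scraped {sum(results)} posts across {len(batches)} batches")
```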
-
Resuming, tracking
- Keep track of gathered data properly so that you can resume from where you left off, especially within batches
- Exercise: Close the web app while performing a scan to ensure it resumes from where it left off
-
Error Resilience:
- Add retries (e.g., 3 attempts) if a page fails (use `try/except` with `requests`)
- Log errors to a file (e.g., `scrape_errors.log`)
- Why: Forums break—teach learners to handle it.
- Exercise: Simulate a 404, retry, and log it.
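A short sketch of the retry-and-log pattern above, using `requests` and the standard `logging` module; the mock-forum URL is a placeholder:

```python
import logging
import time
import requests

logging.basicConfig(filename="scrape_errors.log", level=logging.ERROR)

def fetch_with_retries(url, attempts=3):
    for attempt in range(1, attempts + 1):
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()          # turns a 404 into an exception
            return resp.text
        except requests.RequestException as exc:
            logging.error("attempt %d failed for %s: %s", attempt, url, exc)
            time.sleep(2)
    return None

# Simulate a 404: this path presumably doesn't exist on the mock forum.
fetch_with_retries("http://localhost:8080/forum/does-not-exist")
```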
-
Data priority queuing:
- Surface-level scanning of posts is sometimes useful, but thorough data gathering means looking into the content of all posts; when a post's title isn't clear but it sits in a seller's category, add it to a queue to be scanned later
- Prioritize posts with clear titles, and add the ones without clear titles to a queue; don't skip data
- Exercise: Create a post with the generic title "Got 2 RDPs for sale"; this would be discussing initial access sales
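One way to sketch the queue, using the standard library's `PriorityQueue`; the keyword pattern, category names, and priority levels are illustrative:

```python
import re
from queue import PriorityQueue

CLEAR_TITLE = re.compile(r"RDP|access|leak", re.IGNORECASE)

# Lower number = higher priority in PriorityQueue.
scan_queue: PriorityQueue = PriorityQueue()

posts = [
    {"uid": 101, "title": "Got 2 RDPs for sale", "category": "sellers"},
    {"uid": 102, "title": "fresh stuff, DM me", "category": "sellers"},
    {"uid": 103, "title": "forum rules", "category": "general"},
]

for post in posts:
    if CLEAR_TITLE.search(post["title"]):
        scan_queue.put((0, post["uid"]))     # clear threat title: scan first
    elif post["category"] == "sellers":
        scan_queue.put((1, post["uid"]))     # vague title in a seller category: scan later
    else:
        scan_queue.put((2, post["uid"]))     # still queued, so no data is skipped

while not scan_queue.empty():
    print(scan_queue.get())
```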
-
Categorizing Data
- Basics: Regex for keywords (e.g., “malware,” “exploit”)
- Teaser: ML can do this better; categorize data by the content it discusses ("selling access", "data leak") and add a short description so we can search for it later
- Exercise: Sort CSV data into “malware” vs. “misc” using regex.
-
Translating posts
- Demonstrate how to identify Russian posts using the DeepL translation API
- Exercise: Identify all Russian posts on the forum
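A hedged sketch of language identification via translation, assuming the official `deepl` Python package and a valid API key; the sample posts are made up:

```python
# pip install deepl
import deepl

translator = deepl.Translator("YOUR_DEEPL_AUTH_KEY")   # placeholder key

posts = [
    "Продам доступ RDP к канадской компании",
    "Selling combo lists, HQ only",
]

for text in posts:
    result = translator.translate_text(text, target_lang="EN-US")
    # DeepL reports the language it detected in the source text.
    if result.detected_source_lang == "RU":
        print(f"Russian post found, translated: {result.text}")
```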
Milestone: Learners scrape the Tor sim, anonymize traffic, and categorize posts into a CSV.
Goal: Train an LLM to spot threat patterns like a pro.
Duration: 1-2 weeks
-
Translating posts, content or comments
- Using LLMs & DeepL to translate data into English
- Exercise: Run on Russian posts and translate them into English to extract actionable intel
-
Positive, Negative & Neutral Labels
- What’s labeling? Why do it? (e.g., “RDP sale” = positive threat)
- Exercise: Label 5 fake forum posts manually (e.g., in a spreadsheet).
-
Labeling Data Manually
- Tool: Label Studio (free, easy)
- Exercise: Import your CSV, label 10 more posts in Label Studio.
-
Labeling Data Automatically with Advanced LLMs
- Use a pre-trained model (e.g., BERT via Hugging Face) to auto-label
- Exercise: Run BERT on your CSV and compare to manual labels.
-
Training the Model to Detect Specific Patterns
- Focus: Initial access sales (e.g., “RDP,” “VPN creds”)
- Use a lightweight model (e.g., DistilBERT) for simplicity
- Exercise: Fine-tune it on your labeled data (provide a Colab notebook).
-
Testing the Model with the Scraper
- Feed live scraped data into the model
- Exercise: Scrape the forum sim, run the model, and see it flag a “threat.”
Milestone: Learners have an LLM that flags threat posts from scraped data.
Goal: Turn raw data into actionable Canadian-focused intel.
Duration: 1 week
-
Scraping the Forum with Data Scraper
- Scale up: Scrape 20+ posts from the forum sim
- Add a sanity check (e.g., skip duplicates)
- Exercise: Save to the PostgreSQL DB via the web app.
-
Identifying Initial Access Sales with Fine-Tuned Model
- Run the LLM on the new data
- Exercise: Display flagged posts in the web UI.
-
Cross-Referencing Hacked Businesses with Canada
- Use a mock Canadian business list (provide as CSV)
- Match emails/domains from forum leaks
- Exercise: Find one “breached” Canadian company in the sim data.
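A sketch of the cross-referencing step, assuming the mock CSV has a `domain` column and that leaked emails have already been extracted from forum posts; the file name and sample data are placeholders:

```python
import csv

# Mock Canadian business list with a "domain" column, e.g. maplewidgets.ca
with open("canadian_businesses.csv", newline="") as f:
    canadian_domains = {row["domain"].lower() for row in csv.DictReader(f)}

# Emails pulled from flagged forum leaks (placeholder data).
leaked_emails = ["admin@maplewidgets.ca", "user@example.com"]

for email in leaked_emails:
    domain = email.split("@")[-1].lower()
    if domain in canadian_domains:
        print(f"Possible Canadian breach: {email} ({domain})")
```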
-
Creating Breach Reporting Templates
- Simple format: “Company X, your data was leaked on [date]”
- Exercise: Draft a template in Python.
-
One-Click Breach Reporting via Email
- Use Python's `smtplib` for email
- Exercise: Send a test breach email to yourself (e.g., Gmail).
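A minimal sketch of the one-click email, assuming a Gmail account with an app password; the addresses and breach details are placeholders:

```python
import smtplib
from email.message import EmailMessage

# Placeholder credentials; Gmail requires an app password for SMTP.
SENDER, APP_PASSWORD = "you@gmail.com", "app-password"

msg = EmailMessage()
msg["Subject"] = "Breach notification: Company X"
msg["From"] = SENDER
msg["To"] = "you@gmail.com"          # send the test to yourself
msg.set_content("Company X, your data was leaked on 2024-05-01. "
                "Details: initial access listed for sale on a monitored forum.")

with smtplib.SMTP_SSL("smtp.gmail.com", 465) as server:
    server.login(SENDER, APP_PASSWORD)
    server.send_message(msg)
```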
Milestone: Learners scrape, categorize, and report a fake Canadian breach via email.
Goal: Create a watchlist to keep track of threat actors.
Duration: 1 week
Operators can use this functionality to flag a threat actor.
-
Create a new database for keeping track of threat actors
- Keep track of threat actor activity such as all posts, comments, and profile information
- Specify how often we should extract info about the target; introduce priorities
- mission critical, high, medium, low, custom
-
Cross-reference activities
- Cross-reference a user's activity on this forum vs. all other forums
- Create a scraper for a secondary site where user info is stored in a JSON
- Scrape the data from that dummy site, cross-check activity with the current data, and perform linguistic analysis to find similarities in patterns and activity based on timestamps
Milestone: Learners build a threat-actor watchlist and cross-reference an actor's activity across forums.
Goal: Build real-time alerts for critical threats.
Duration: 1 week
-
Configuring Email Notifications
- Set up `smtplib` with a Gmail account (or similar)
- Exercise: Send an alert when a post discusses selling access to a Canadian business
-
Bonus: SMS Alerts (Optional)
- Intro to Twilio (free tier)
- Exercise: Send a test SMS to yourself (if they opt in).
Milestone: Learners get an email alert when the scraper finds a prioritized threat.
Goal: Wrap up with a bang and point them forward.
Duration: Half a week
-
Course Recap
- “You scraped a forum, built a UI, fine-tuned an LLM, and alerted a Canadian company—all in 8 weeks.”
- Show off their final product: a running threat hunting pipeline.
-
Real-World Application
- “Try this on a safe, legal target (e.g., bug bounty forums).”
- Canada tie-in: “Protect local businesses with your skills.”
-
What’s Next?
- Join Mission Cyber Sentinel
- Take an advanced course (e.g., “Malware Analysis 101”)
- Exercise: Share their UI screenshot on your platform’s community (if you build one).
-
Reward:
- Digital badge: “Cyber Mounties Threat Hunter”
- Course specific badges: 5 Eyes badges; NATO badge; neutral badges
Milestone: Learners finish with a working tool and a sense of purpose.