@eliquious
Last active March 5, 2024 21:08
Virus Total Clone

Virus Total Design Project

VirusTotal is an online tool for cybersecurity professionals that scans files and URLs for malware. It also manages and displays the metadata once the analysis has completed.

This document outlines a possible design for implementation.

Architecture

TODO: Add a diagram

Technologies

The proposed architecture consists of several managed systems using Google's Cloud Platform.

  • Load Balancer: Nginx or API Gateway
  • API: Google Cloud Functions
  • File Storage: Google Cloud Storage
  • Event Stream: Google PubSub
  • Metadata Storage: Google BigTable
  • Session Storage: Managed Redis for auth sessions
  • Cluster Management: Kubernetes
    • Run web UI containers
    • Run scripts in Docker containers where possible, and/or on Windows
  • Container: Docker
  • Monitor/Metrics
  • Real-time logs: StackDriver

High-level step-by-step

  1. A load balancer handles the initial handshake and routes requests to the cloud functions.
  2. The serverless functions handle file uploads to cloud storage.
  3. After an upload, the function publishes a message to pub/sub.
  4. A Kubernetes cluster runs containers that read from pub/sub and run the user's scripts.
  5. The containers/virtual machines read the file, run the scripts, and then store the metadata in BigTable.

Scripts

Scripts are written in Python and bundled as Python packages. This allows self-contained scripts to be versioned and bundled with any resources they need. It also allows scripts to be self-hosted in a Python package repository and lets all the VT scripts comply with standard Python distribution practices.

When a file is uploaded, a message is sent over pub/sub. The message contains file info (or hash) and a user id at a minimum. Machines/containers subscribed to the topic then download the file and fetch the user data.
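As a rough illustration of the message described above, the following sketch serializes a scan request to bytes and back. The class name `ScanRequest` and its field names are hypothetical, and including a scan ID alongside the hash and user ID is an assumption rather than something the design specifies.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ScanRequest:
    """Hypothetical pub/sub payload; the field names are illustrative only."""

    user_id: str
    sha256: str
    scan_id: str  # assumption: lets subscribers address a specific scan

    def to_bytes(self) -> bytes:
        # Pub/sub payloads are raw bytes, so serialize to UTF-8 encoded JSON.
        return json.dumps(asdict(self), sort_keys=True).encode("utf-8")

    @classmethod
    def from_bytes(cls, data: bytes) -> "ScanRequest":
        # Inverse of to_bytes(); raises if a required field is missing.
        return cls(**json.loads(data.decode("utf-8")))


msg = ScanRequest(
    user_id="21ad54b0601455aa3ee1d3088d33cb88",
    sha256="0bf01094f5c699046d8228a8f5d5754ea454e5e58b4e6151fef37f32c83f6497",
    scan_id="16902907443452761750",
)
payload = msg.to_bytes()  # what a publisher would send
restored = ScanRequest.from_bytes(payload)  # what a subscriber would decode
```

Keeping the payload small (identifiers only) means subscribers fetch the file and user data themselves, as the paragraph above describes.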

Based on the user's preferences, the scripts are run against the file. If the scripts do not exist locally, they are downloaded into the local repository.

All scripts adhere to an API and implement an interface for processing the file and saving results. All script logs are uploaded for the end user to view.
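One possible shape for such an interface is sketched below. The names `ScanScript`, `process`, `run` and the `save` callback are all hypothetical; the design only states that scripts process the file and save results through a common API.

```python
import os
from abc import ABC, abstractmethod
from typing import Callable, Dict

# Hypothetical signature of the result writer: (script name, version, results).
SaveFn = Callable[[str, str, Dict[str, str]], None]


class ScanScript(ABC):
    """Hypothetical base class that every scan script would subclass."""

    name: str = "unnamed"
    version: str = "0.0.0"

    @abstractmethod
    def process(self, path: str) -> Dict[str, str]:
        """Analyze the file at `path` and return key/value results."""

    def run(self, path: str, save: SaveFn) -> Dict[str, str]:
        # Process the file, then hand the results to the storage callback.
        results = self.process(path)
        save(self.name, self.version, results)
        return results


class FileSizeScript(ScanScript):
    """Toy script that reports the file size in bytes."""

    name = "file-size"
    version = "1.0.0"

    def process(self, path: str) -> Dict[str, str]:
        return {"Size": str(os.path.getsize(path))}
```

A worker could then call `FileSizeScript().run(path, save)` where `save` writes the key/value pairs to the metadata store.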

Note: Some scripts may require running on a particular environment. If this is the case then perhaps more than one pub/sub topic is required. Perhaps one for each OS.

Metrics / System Status

Many metrics are included in the chosen cloud tools, but additional metrics and monitoring will need to be added.

For instance, each container or virtual machine that runs the scripts will need to be monitored for uptime.

Individual scripts could also be reported on for crashes and errors. Script writers could be notified of script failures.

Web UI

The web interface allows for users to easily manage and scan files as well as view the metadata for completed scans.

The interface could function much like a build system: real-time progress of a scan could be monitored by watching the logs as each script is run. If a script fails, processing of the file continues, but the error is reported and logged.
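The continue-on-failure behaviour could look something like the sketch below. The function name `run_all` and the idea of modelling each script as a plain callable are assumptions made for illustration.

```python
import logging
from typing import Callable, Dict, Tuple

log = logging.getLogger("scan")


def run_all(
    path: str, scripts: Dict[str, Callable[[str], dict]]
) -> Tuple[Dict[str, dict], Dict[str, str]]:
    """Run every script against `path`, collecting results and errors separately."""
    results: Dict[str, dict] = {}
    errors: Dict[str, str] = {}
    for name, script in scripts.items():
        try:
            results[name] = script(path)
        except Exception as exc:
            # One failing script must not abort the rest of the scan:
            # log the traceback and record the error for the end user.
            log.exception("script %s failed", name)
            errors[name] = repr(exc)
    return results, errors
```

The `errors` mapping is what a UI could surface next to the per-script logs, while `results` holds the output of the scripts that succeeded.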

Users should be able to browse all their files and explore the history of each upload. They should be able to click on a file and see a list of all the scans performed on the file. They should also be able to access the results of a specific scan from the list.

Each scan can outline all the script results as well as the historical logs for each script.

Data model

The data layer consists of a managed Redis instance for quick session authentication and BigTable for storing scan results and file metadata.

BigTable: FileScanMetadata

This table contains all the metadata for each file upload and its scan results. The RowID consists of the user ID, the file SHA256 and the reverse upload timestamp (2^64 - unixtime). This key layout allows metadata to be accessed quickly per user, per file and per scan; because the timestamp is reversed, the most recent scan of a file sorts first.
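A minimal sketch of building such a row key follows. The text gives the reverse timestamp as 2^64 - unixtime; the example IDs used in this document are consistent with (2^64 - 1) minus the Unix time in nanoseconds, so the sketch assumes that variant. The function names are hypothetical.

```python
MAX_UINT64 = 2**64 - 1


def reverse_timestamp(unix_nanos: int) -> int:
    """Map a Unix time in nanoseconds so that later times give smaller values."""
    return MAX_UINT64 - unix_nanos


def row_key(user_id: str, sha256: str, unix_nanos: int) -> str:
    # Zero-pad to 20 digits (the width of 2^64 - 1) so that the
    # lexicographic order of keys matches the numeric order of timestamps.
    return f"{user_id}:{sha256}:{reverse_timestamp(unix_nanos):020d}"


key = row_key(
    "21ad54b0601455aa3ee1d3088d33cb88",
    "0bf01094f5c699046d8228a8f5d5754ea454e5e58b4e6151fef37f32c83f6497",
    1543836630256789865,  # 2018-12-03T11:30:30.256789865Z in nanoseconds
)
```

Because row keys are sorted lexicographically, a prefix scan on `<UserID>:<SHA256>:` would return the scans of one file newest first.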

Individual column families exist for each section of VirusTotal metadata. The key/value relations vary based on the metadata contained. This allows all the results for a single scan to be returned together.

Ideally column families represent a single script and its results in key-value format. Scripts are written with Python and use a custom module for writing results for each scan.

  • RowID: <UserID>:<SHA256>:<ReverseTimestamp>
    • Example: 21ad54b0601455aa3ee1d3088d33cb88:0bf01094f5c699046d8228a8f5d5754ea454e5e58b4e6151fef37f32c83f6497:16902907443452761750
  • Column Families:
    • General File Info
      • SHA256
      • MD5
      • Filename(s)
      • Size
      • Tags
      • FileType
      • SHA1
      • TrID
    • Analysis
      • Key: Antivirus + Version
      • Value: Result:Date
    • Packers
      • Key: Packer
      • Value: Data
    • PE header basic information
    • PE Sections
      • Key: MD5
      • Value: JSON
    • PE Imports
      • Key: DLL
      • Value: CSV
    • Number of PE resources by type
      • Key: Type
      • Value: Count
    • PE Resources
    • ExifTool file metadata
    • Files
      • Opened Files
        • CSV of filenames
      • Read Files
        • CSV of filenames
      • Written Files
        • CSV of filenames
      • Moved Files
        • CSV of filenames
    • Created Processes
    • Shell Commands
    • DLLs
    • Service Managers
    • Opened / Created Mutexes
    • DNS Requests
    • TCP Connections
    • Comments
    • Scripts

BigTable: Users

The user table holds all user account data, including the user ID, API keys, scripts used by the user, etc.

Users have a unique username similar to VirusTotal. The row key is a hash of their username.

  • RowID: UserID (SHA256 of username)
  • Column Families:
    • User Info
      • Username
      • Email(s)
      • CreatedAt
      • LastLogin
    • API Keys
      • Key: Public Access Token
      • Value: Secret Access Token:Enabled
    • Scripts
      • Key: Script Hash or Name
      • Value: Version/Latest:Enabled/Disabled

Including scripts in the user table allows the user to select which scripts to run, pin a specific version of a script, or disable a particular script.
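To make the `Version/Latest:Enabled/Disabled` value concrete, here is a small sketch that resolves which script versions to run. The exact spellings `latest`, `enabled` and `disabled`, the function name, and the `latest_versions` lookup are assumptions.

```python
from typing import Dict


def scripts_to_run(
    prefs: Dict[str, str], latest_versions: Dict[str, str]
) -> Dict[str, str]:
    """Return {script name: version} for every enabled script.

    `prefs` maps a script name to "<version or latest>:<enabled or disabled>".
    `latest_versions` maps a script name to its newest published version.
    """
    selected: Dict[str, str] = {}
    for name, value in prefs.items():
        # Split on the last colon so the version part may itself contain colons.
        version, _, state = value.rpartition(":")
        if state.lower() != "enabled":
            continue  # disabled scripts are skipped entirely
        if version.lower() == "latest":
            version = latest_versions[name]
        selected[name] = version
    return selected


prefs = {
    "general-info": "latest:enabled",
    "packers": "v0.2.3:enabled",  # pinned to a specific version
    "exif": "latest:disabled",
}
latest = {"general-info": "v1.1.0", "packers": "v0.3.0", "exif": "v2.0.0"}
plan = scripts_to_run(prefs, latest)
```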

BigTable: API Tokens

This table stores API tokens for quick lookup. Storing all of a user's API keys in the user table is convenient, but it does not allow fast lookup by token.

So the API keys are duplicated in a table that is better suited for the query.

  • RowID: Public Access Token
  • Column Families:
    • Token Info
      • Private Access Token
      • User ID
      • Enabled

Redis: APITokens

API tokens are stored in Redis for fast access during API authentication. Access tokens need to return the user ID and secret access token.

Authentication / Authorization

JSON Web Tokens (JWTs) are used for authentication and are sent as a bearer token in the HTTP Authorization header. Each user has a user ID, a public API access token and a private access key used to sign each JWT with HMAC-SHA256. For each request, the public API access token is included as a claim inside the JWT.

Using the token ID as a JWT claim allows for API keys to be revoked and/or rotated for each user as well as allowing for more than one token for each user account.

When the user makes a request to the API, the cloud function handler reads the access token from the JWT claims, looks up the user ID and private key for that token in Redis, and verifies the JWT signature with the private key.
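The verification flow can be sketched with the standard library alone. The claim name `token_id`, the layout of the `token_store` dict (standing in for Redis) and the function names are hypothetical. A real deployment would more likely use a vetted JWT library and also validate expiry claims.

```python
import base64
import hashlib
import hmac
import json


def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def _b64url_decode(part: str) -> bytes:
    # JWTs strip base64 padding; restore it before decoding.
    return base64.urlsafe_b64decode(part + "=" * (-len(part) % 4))


def sign(claims: dict, secret: bytes) -> str:
    """Build an HS256-signed JWT (client side, shown for the usage example)."""
    header = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url_encode(json.dumps(claims).encode())
    mac = hmac.new(secret, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64url_encode(mac)}"


def verify(token: str, token_store: dict) -> str:
    """Return the user ID if the signature is valid, else raise ValueError."""
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        claims = json.loads(_b64url_decode(payload_b64))
        signature = _b64url_decode(sig_b64)
        if not isinstance(header, dict) or not isinstance(claims, dict):
            raise ValueError("header and claims must be JSON objects")
    except (ValueError, TypeError) as exc:
        raise ValueError("malformed token") from exc
    if header.get("alg") != "HS256":
        raise ValueError("unsupported algorithm")
    # Look up the signing key by the public access token named in the claims.
    entry = token_store.get(claims.get("token_id"))
    if entry is None or not entry["enabled"]:
        raise ValueError("unknown or disabled access token")
    expected = hmac.new(
        entry["secret"], f"{header_b64}.{payload_b64}".encode(), hashlib.sha256
    ).digest()
    # Constant-time comparison avoids leaking signature bytes through timing.
    if not hmac.compare_digest(signature, expected):
        raise ValueError("bad signature")
    return entry["user_id"]


store = {"pub-1": {"secret": b"s3cret", "user_id": "u1", "enabled": True}}
user_id = verify(sign({"token_id": "pub-1"}, b"s3cret"), store)
```

Disabling or deleting the entry for a public token revokes every JWT that names it, which is the rotation property described above.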

API

All API routes are outlined in this section.

File upload

Uploading files for scanning is the primary function for Virus Total. This route handles the file uploads and starts a scan.

Step-by-step

  1. The cloud function verifies the auth JWT for each request, reading the user ID and access key from Redis.
  2. It writes the file to cloud storage.
    • The MD5 and SHA256 are calculated while the file is uploading.
    • The file is initially stored under a random temporary file ID, because the MD5 is not known until the upload finishes and the file cannot yet be stored in its proper place.
    • If the file already exists (by MD5), the temporary file is deleted and a scan of the existing file is started.
    • If the upload fails, the temporary file is deleted.
  3. Store the initial metadata for the file after a successful upload.
    • General information includes the filename, MD5, SHA256 and the file size.
  4. Rename the file to <UserID>:<MD5> on success.
  5. Publish a pub/sub message after the upload has completed.
  6. Respond with the file metadata, scan ID and status. If the upload failed, respond with an error.
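The storage part of these steps might be sketched as follows, with a plain dict standing in for the cloud storage bucket. The function name `store_upload`, the `tmp/` prefix and the returned field names are hypothetical; the final object name follows the `<UserID>:<MD5>` form from step 2.

```python
import hashlib
import io
import uuid
from typing import BinaryIO, Dict


def store_upload(
    stream: BinaryIO, user_id: str, bucket: Dict[str, bytes], chunk_size: int = 64 * 1024
) -> dict:
    """Stream an upload into `bucket`, hashing it chunk by chunk."""
    md5, sha256 = hashlib.md5(), hashlib.sha256()
    # The digest is unknown until the last chunk, so write to a temporary name.
    temp_name = f"tmp/{uuid.uuid4().hex}"
    bucket[temp_name] = bytearray()
    try:
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            md5.update(chunk)
            sha256.update(chunk)
            bucket[temp_name] += chunk
    except Exception:
        # Failed upload: remove the partial temporary object.
        bucket.pop(temp_name, None)
        raise
    data = bytes(bucket.pop(temp_name))
    final_name = f"{user_id}:{md5.hexdigest()}"
    duplicate = final_name in bucket
    if not duplicate:
        bucket[final_name] = data  # "rename" the temporary object
    return {
        "name": final_name,
        "md5": md5.hexdigest(),
        "sha256": sha256.hexdigest(),
        "size": len(data),
        "duplicate": duplicate,  # True means only a new scan is needed
    }


bucket: Dict[str, bytes] = {}
info = store_upload(io.BytesIO(b"hello"), "u1", bucket, chunk_size=2)
```

Hashing inside the read loop avoids a second pass over the file once the upload has finished.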
Request
PUT /api/v1/files

Authorization: Bearer <JWT>
Request Body: <Multi-part upload>
Response
{
	"status": "in-progress",
	"scanid": "16902907443452761750",
	"md5": "950848fa87f4b1e5e7b633c7f7973f59",
	"sha256": "0bf01094f5c699046d8228a8f5d5754ea454e5e58b4e6151fef37f32c83f6497",
	"sha1": "7915d76633663b86143a4250e291576a74adb15c",
	"size": 1024
}

File download

Users may occasionally want to download a file they have scanned. This route allows them to do so.

Request
GET /api/v1/files/0bf01094f5c699046d8228a8f5d5754ea454e5e58b4e6151fef37f32c83f6497/download

Authorization: Bearer <JWT>
Response
Response Body: <Data>

Scan status

Returns the status of a particular scan.

Request
GET /api/v1/files/0bf01094f5c699046d8228a8f5d5754ea454e5e58b4e6151fef37f32c83f6497/scans/16902907443452761750

Authorization: Bearer <JWT>
Response
{
	"scanid": "16902907443452761750",
	"timestamp": "2018-12-03T11:30:30.256789865Z",
	"status": "complete"
}

List Scans

Users may want to get a list of all the scans for a particular file. This also allows for the UI to show a list of all the scans for a file.

Request
GET /api/v1/files/0bf01094f5c699046d8228a8f5d5754ea454e5e58b4e6151fef37f32c83f6497/scans

Authorization: Bearer <JWT>
Response
[
	{
		"scanid": "16902907443452761750",
		"timestamp": "2018-12-03T11:30:30.256789865Z",
		"status": "in-progress"
	},
	{
		"scanid": "16908935832071616417",
		"timestamp": "2018-09-24T16:57:21.637935198Z",
		"status": "complete"
	}
]
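The example scan IDs in this document appear to be reverse timestamps, (2^64 - 1) minus the Unix time in nanoseconds. This is inferred from the sample values rather than stated in the design. Under that assumption, a scan ID can be turned back into a timestamp, and sorting IDs in ascending order lists the newest scan first. The function name `scan_time` is hypothetical.

```python
from datetime import datetime, timezone

MAX_UINT64 = 2**64 - 1


def scan_time(scan_id: str) -> datetime:
    """Recover the UTC upload time from a reverse-timestamp scan ID."""
    unix_nanos = MAX_UINT64 - int(scan_id)
    seconds, nanos = divmod(unix_nanos, 1_000_000_000)
    # datetime only has microsecond resolution, so the last three digits drop.
    return datetime.fromtimestamp(seconds, tz=timezone.utc).replace(
        microsecond=nanos // 1000
    )


ids = ["16908935832071616417", "16902907443452761750"]
newest_first = sorted(ids)  # equal-width digit strings sort numerically
times = [scan_time(i).isoformat() for i in newest_first]
```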

Get Scan Results

This gets all the results for a given scan. The output for every script is in this response. This allows for efficient queries for API users and for the web interface.

Request
GET /api/v1/files/0bf01094f5c699046d8228a8f5d5754ea454e5e58b4e6151fef37f32c83f6497/scans/16902907443452761750/results

Authorization: Bearer <JWT>
Response
{
	"scanid": "16902907443452761750",
	"scantime": "2018-12-03T11:30:30.256789865Z",
	"status": "complete",
	"results": {
		"General File Info": {
			"Filename(s)": "1.exe",
			"File Type": "Win32 EXE",
			"File Size": 6611762,
			"Tags": "nsis,peexe,upx",
			"MD5": "950848fa87f4b1e5e7b633c7f7973f59",
			"SHA1": "7915d76633663b86143a4250e291576a74adb15c",
			"SHA256": "0bf01094f5c699046d8228a8f5d5754ea454e5e58b4e6151fef37f32c83f6497",
			"TrID": "NSIS - Nullsoft Scriptable Install System (94.8%),Win32 Executable MS Visual C++ (generic) (3.4%),Win32 Dynamic Link Library (generic) (0.7%),Win32 Executable (generic) (0.5%),Generic Win/DOS Executable (0.2%)"
		},
		"Additional Info": {
			"Scripts": "general-info:v1.1.0,packers:v0.2.3"
		},
		"Analysis": [
			{
				"antivirus": "Ad-Aware", 
				"update": "20140530",
				"result": "Gen:Variant.Kazy.361122"
			},
			{
				"antivirus": "Yandex", 
				"update": "20140530",
				"result": "Trojan.Zapchast!21QsdueCu4I"
			},
			{
				"antivirus": "AntiVir",	
				"update": "20140530",
				"result": "TR/Kazy.361122.18"
			},
			{
				"antivirus": "AVG", 
				"update": "20140530",
				"result": "MSIL3.BKMY"
			},
			{
				"antivirus": "DrWeb", 
				"update": "20140530",
				"result": "Trojan.DownLoader11.12097"
			},
			{
				"antivirus": "ESET-NOD32",	
				"update": "20140530",
				"result": "a variant of MSIL/Kryptik.VS"
			},
			{
				"antivirus": "F-Secure", 
				"update": "20140530",
				"result": "Gen:Variant.Kazy.338069"
			}
		],
		"Packers": {
			"F-PROT":"NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, UPX, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, Unicode, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, Unicode, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, UPX, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, Unicode, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, Unicode, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, Unicode, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, Unicode, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, Unicode, NSIS, 
NSIS, NSIS, NSIS, NSIS, NSIS, Unicode, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, Unicode, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, Unicode, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, Unicode, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, Unicode, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, Unicode, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS, NSIS"
		},
		"PE header basic information": {
			"Target machine": "Intel 386 or later processors and compatible processors",
			"Compilation timestamp": "2007-03-31 15:09:46",
			"Entry Point": "0x0000312E",
			"Number of sections": 5
		},
		"PE Sections": {
			".text": "4096,22590,23040,6.38,d0113efab792d21a17b8a72aa38325df",
			".rdata": "28672,4324,4608,5.04,9a4c5d765a28fb9f7efb6896024d70dd",
			".data": "36864,111572,1024,4.99,b2a6f118512f7708eee73c9b4cb2c653",
			".ndata": "151552,32768,0,0.00,d41d8cd98f00b204e9800998ecf8427e",
			".rsrc": "184320,100832,100864,2.14,d4fd3087cfeceaf70618b97eced96912"
		},
		"PE Imports": {
			"VERSION.dll": "GetFileVersionInfoSizeA, GetFileVersionInfoA, VerQueryValueA",
			"ole32.dll": "OleUninitialize, CoTaskMemFree, OleInitialize, CoCreateInstance"
		},
		"Number of PE resources by type": {
			"RT_ICON": 5,
			"RT_DIALOG": 3,
			"RT_GROUP_ICON": 1
		},
		"PE Resources": {
			"1266caf3e91abe03b4e7e484df39644129a21204a1ea8946321705fbe5ba2360": "data",
			"338e659d045c0514763f9f4a86714ebda05d5f30045e73c057452f61e1c4a4d9": "data",
			"69897c784f1491eb3024b0d52c2897196a2e245974497fda1915db5fefcf8729": "data",
			"85025c8556952f6a651c2468c8a0d58853b0ba482be9ad5cd3060f216540dfc0": "data",
			"9799bb1967cd2e7b516017844b2b12ddc56c959804445fb058de5ea230e892b7": "data",
			"a15214b528e54ce70d9e0f767bdfea097b66d20e12698be32d162b7e1edca68e": "data",
			"c355dcdfa3e1dce90450ceb7d436472e54341c92b6d014ab90b8a1a95228b210": "data",
			"f4764d2d9673399ab75524314a3ba694597cc6a3e13c945cd64c9e9377ba0b86": "data",
			"fecdb955f8d7f1c219ff8167f90b64f3cb52e53337494577ff73c0ac1dafcd96": "data"
		},
		"ExifTool file metadata": {
			"MIMEType": "application/octet-stream",
			"Subsystem": "Windows GUI",
			"MachineType": "Intel 386 or later, and compatibles",
			"TimeStamp": "2007:03:31 16:09:46+01:00",
			"FileType": "Win32 EXE",
			"PEType": "PE32",
			"CodeSize": "23040",
			"LinkerVersion": "6.0",
			"FileAccessDate": "2014:05:30 20:32:50+01:00",
			"EntryPoint": "0x312e",
			"InitializedDataSize": "120832",
			"SubsystemVersion": "4.0",
			"ImageVersion": "0.0",
			"OSVersion": "4.0",
			"FileCreateDate": "2014:05:30 20:32:50+01:00",
			"UninitializedDataSize": "1024"
		},
		"Opened Files": {
			"C:\\DOCUME~1\\<USER>~1\\LOCALS~1\\Temp\\nsv1.tmp" : "successful",
			"C:\\0bf01094f5c699046d8228a8f5d5754ea454e5e58b4e6151fef37f32c83f6497" : "successful",
			"C:\\DOCUME~1\\<USER>~1\\LOCALS~1\\Temp\\TeamViewer_Setup_ar.exe" : "successful",
			"C:\\DOCUME~1\\<USER>~1\\LOCALS~1\\Temp\\2.exe" : "successful"
		},
		"Read Files": {
			"C:\\0bf01094f5c699046d8228a8f5d5754ea454e5e58b4e6151fef37f32c83f6497": "successful",
			"C:\\WINDOWS\\Registration\\R000000000007.clb": "successful",
			"C:\\DOCUME~1\\<USER>~1\\LOCALS~1\\Temp\\TeamViewer_Setup_ar.exe": "successful"
		},
		"Written Files": {
			"C:\\DOCUME~1\\<USER>~1\\LOCALS~1\\Temp\\TeamViewer_Setup_ar.exe": "successful",
			"C:\\DOCUME~1\\<USER>~1\\LOCALS~1\\Temp\\2.exe": "successful",
			"C:\\DOCUME~1\\<USER>~1\\LOCALS~1\\Temp\\nsn3.tmp\\TvGetVersion.dll": "successful"
		},
		"Moved Files": {
			"SRC:C:\\WINDOWS\\Microsoft.NET\\Framework\\v2.0.50727\\config\\security.config.cch": "failed:DST:C:\\WINDOWS\\Microsoft.NET\\Framework\\v2.0.50727\\config\\security.config.cch.1152.76069"
		},
		"Deleted Files": {
		},
		"Created Processes": {
			"C:\\DOCUME~1\\<USER>~1\\LOCALS~1\\Temp\\TeamViewer_Setup_ar.exe C:\\DOCUME~1\\<USER>~1\\LOCALS~1\\Temp\\TeamViewer_Setup_ar.exe": "successful"
		},
		"Shell Commands": {
		},
		"Code Injections": {
			"TeamViewer_.exe": "successful",
			"netsh.exe": "successful"
		},
		"DLLs": {
			"ole32.dll": "successful",
			"setupapi.dll": "successful",
			"rpcrt4.dll": "successful",
			"shell32.dll": "successful",
			"netapi32": "successful"
		},
		"DNS Requests": {
			"info123.no-ip.biz": "41.44.85.156"
		},
		"TCP Connections": {
			"41.44.85.156:5552": ""
		}
	}
}

Rescan file

Rescanning the file re-processes the file using the user's list of scripts. This allows the user to upload additional scripts or modify which scripts to use when scanning.

Request
POST /api/v1/files/0bf01094f5c699046d8228a8f5d5754ea454e5e58b4e6151fef37f32c83f6497/rescan
Response
{
    "scanid": "16902807650510813342",
    "timestamp": "2018-12-04T15:13:43.198738273Z",
    "status": "in-progress"
}
@selekkala

Can you please add Architecture diagram also?

@orenbenya1

Any idea how they run Yara on millions of files? and how they do it live?

@powerdai

Can you please add Architecture diagram also? thanks
