This gist explains how to install and set up the Hashcat brain on a Raspberry Pi-based cluster.
I initially tried to use VC4CL instead of POCL, but I could not compile it on Ubuntu Server 18.04.5.
Even after compiling CMake as requested, the compilation still failed...
sudo apt install build-essential cmake
# Clone the repo
git clone https://github.com/hashcat/hashcat.git
# Move to the project folder
cd hashcat
# Compile the code
make -j `nproc`
# Install everything
sudo make install
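This is not part of the original steps, but as a quick sanity check you can confirm the binary is installed. Note that device detection will only return something once POCL is set up further down in this gist:
# Check that the binary is installed and print its version
hashcat --version
# List detected compute devices (will only find a CPU device once POCL is installed)
hashcat -I --force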
# Clone the repo
git clone https://github.com/raspberrypi/userland.git rpi-userland
# Move to the project folder
cd rpi-userland
# Check your kernel version
uname -a
# For 32bit ARM
./buildme
# Create required symlinks (32bit)
cd /opt/vc/lib
for F in $(ls -1) ; do sudo ln -sfvn $PWD/$F /usr/lib/arm-linux-gnueabihf/$F ; done
# Remove the stray symlink created inside the existing pkgconfig folder and copy the .pc files instead
sudo rm -fv /usr/lib/arm-linux-gnueabihf/pkgconfig/pkgconfig
sudo cp -rv pkgconfig/* /usr/lib/arm-linux-gnueabihf/pkgconfig/
# For 64bit ARM
./buildme --aarch64
# Create required symlinks (64bit)
cd /opt/vc/lib
for F in $(ls -1) ; do sudo ln -sfvn $PWD/$F /usr/lib/aarch64-linux-gnu/$F ; done
sudo rm -fv /usr/lib/aarch64-linux-gnu/pkgconfig/pkgconfig
sudo cp -rv pkgconfig/* /usr/lib/aarch64-linux-gnu/pkgconfig/
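As an optional check (not in the original instructions), you can verify that the symlinks landed in the right multiarch folder and that pkg-config sees the userland libraries. The `aarch64-linux-gnu` triplet and the `bcm_host` module below are assumptions based on a 64-bit build; adapt them to your system:
# List the created symlinks pointing to /opt/vc (64bit example)
ls -l /usr/lib/aarch64-linux-gnu | grep '/opt/vc'
# Check that pkg-config can resolve one of the userland libraries
pkg-config --exists bcm_host && echo "bcm_host found"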
Even though the Raspberry Pi 4B has enough RAM to be a good cluster node, increasing the available memory still helps to run more work units per node.
On the Raspberry Pi 3B+, enabling zram memory compression is necessary to get enough usable memory.
Now, let's get technical! 😁
Create the loading script:
sudo nano /usr/bin/zram.sh
And place this content:
#!/bin/bash
echo -e "\nExpanding available memory with zRAM...\n"

# Create one zram device per CPU core
cores=$(nproc --all)
modprobe zram num_devices=$cores

# Load the compression algorithms (zstd preferred, lz4hc as fallback)
modprobe zstd
modprobe lz4hc_compress

# Disable existing swap before reconfiguring it
swapoff -a

# Size each zram device so the total adds up to ~4/3 of the physical memory
totalmem=`free | grep -e "^Mem:" | awk '{print $2}'`
#mem=$(( ($totalmem / $cores) * 1024 ))
mem=$(( ($totalmem * 4 / 3 / $cores) * 1024 ))

core=0
while [ $core -lt $cores ]; do
    # Pick the best available compression algorithm
    echo zstd > /sys/block/zram$core/comp_algorithm 2>/dev/null ||
    echo lz4hc > /sys/block/zram$core/comp_algorithm 2>/dev/null ||
    echo lz4 > /sys/block/zram$core/comp_algorithm 2>/dev/null
    # Set the device size, format it as swap and enable it with high priority
    echo $mem > /sys/block/zram$core/disksize
    mkswap /dev/zram$core
    swapon -p 5 /dev/zram$core
    let core=core+1
done
The zstd compression algorithm is used first for better performance.
It might not be supported on all systems, which is why other compression algorithms are tried as fallbacks.
Then save it with [Ctrl+O] and exit with [Ctrl+X].
Make it executable:
sudo chmod -v +x /usr/bin/zram.sh
Then create the boot script:
sudo nano /etc/rc.local
And place this content:
#!/bin/bash
/usr/bin/zram.sh &
exit 0
Then save it with [Ctrl+O] and exit with [Ctrl+X].
Make it executable:
sudo chmod -v +x /etc/rc.local
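On recent Ubuntu releases, `/etc/rc.local` is executed through systemd's rc-local compatibility service. If the script does not seem to run at boot, this optional check (assuming a systemd-based system) shows whether the service picked it up:
# Verify that the rc-local compatibility unit is active and ran the script
sudo systemctl status rc-local.service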
To finish, run the script to create the additional memory. To see the available memory and the compression stats, run the following commands:
# Manual start
sudo /usr/bin/zram.sh
# Show memory compression stats
zramctl
# Show available memory
free -mlht
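You can also check which compression algorithm the kernel actually selected for each zram device; the active one is shown in brackets (the list of available algorithms varies between kernels, the output below is only an example):
# Show the compression algorithm selected for the first zram device
cat /sys/block/zram0/comp_algorithm
# Example output, the algorithm in brackets is the active one:
# lzo lz4 lz4hc [zstd]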
If you don't increase the memory with zram, the POCL compilation will simply fail.
POCL is required for running Hashcat.
# Install required packages
sudo apt install -y build-essential ocl-icd-libopencl1 cmake git pkg-config libclang-dev clang llvm make ninja-build ocl-icd-dev ocl-icd-opencl-dev libhwloc-dev zlib1g zlib1g-dev clinfo dialog apt-utils
# Clone the repo
git clone https://github.com/pocl/pocl.git
# Move to the project folder
cd pocl
# Create build folder
mkdir -v build
# Move to the build folder
cd build
# Get / set configuration (the default one worked for me)
cmake ..
# Compile the code
make -j `nproc`
# Install everything
sudo make install
# Load new installed libraries
sudo ldconfig
# Verify loaded libraries
ldconfig --print | grep local
# Create required symlink to /etc/OpenCL
sudo ln -sfvn /usr/local/etc/OpenCL /etc/OpenCL
If you don't create the symlink, the OpenCL ICD driver will not be found and `clinfo` or `hashcat` will detect nothing.
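If you want to double-check what the ICD loader will pick up, you can inspect the vendor files; on my setup they are installed by POCL under `/usr/local/etc/OpenCL/vendors/` and reached through the symlink created above (the exact file name may differ between POCL versions):
# List the installed ICD vendor files and show their content
ls -l /etc/OpenCL/vendors/
cat /etc/OpenCL/vendors/*.icd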
Now edit your Raspberry Pi `config.txt` file, or `usercfg.txt` on later Ubuntu Server versions.
- SD card path: `/boot/config.txt` or `/boot/usercfg.txt`
- Mounted path: `/boot/firmware/config.txt` or `/boot/firmware/usercfg.txt`
For Raspberry Pi 3B / 3B+:
dtoverlay=vc4-fkms-v3d
max_framebuffers=2
gpu_mem=512
If you run into trouble with the `vc4-fkms-v3d` driver, use the `vc4-kms-v3d` driver instead.
For Raspberry Pi 4B:
dtoverlay=vc4-kms-v3d-pi4
max_framebuffers=2
gpu_mem=1024
hdmi_enable_4kp60=1
You can also use memory splitting and CMA allocation if you need it:
# Replace this line:
dtoverlay=vc4-fkms-v3d
# With:
dtoverlay=vc4-fkms-v3d,cma-128
The same applies to the other drivers.
Now you have to reboot to apply your changes.
Once your Raspberry Pi has restarted, you can verify the result of your work by running `clinfo`; in case of success, you should get an output similar to this:
Number of platforms 1
Platform Name Portable Computing Language
Platform Vendor The pocl project
Platform Version OpenCL 1.2 pocl 1.6-pre master-0-g984525e1, Debug+Asserts, LLVM 6.0.0, RELOC, SLEEF, FP16, POCL_DEBUG
Platform Profile FULL_PROFILE
Platform Extensions cl_khr_icd
Platform Extensions function suffix POCL
Platform Name Portable Computing Language
Number of devices 1
Device Name pthread-cortex-a53
Device Vendor ARM
Device Vendor ID 0x13b5
Device Version OpenCL 1.2 pocl HSTR: pthread-aarch64-unknown-linux-gnu-cortex-a53
Driver Version 1.6-pre master-0-g984525e1
Device OpenCL C Version OpenCL C 1.2 pocl
Device Type CPU
Device Profile FULL_PROFILE
Device Available Yes
Compiler Available Yes
Linker Available Yes
Max compute units 4
Max clock frequency 1200MHz
Device Partition (core)
Max number of sub-devices 4
Supported partition types equally, by counts
Max work item dimensions 3
Max work item sizes 4096x4096x4096
Max work group size 4096
Preferred work group size multiple 8
Preferred / native vector sizes
char 16 / 16
short 8 / 8
int 4 / 4
long 2 / 2
half 8 / 8 (cl_khr_fp16)
float 4 / 4
double 2 / 2 (cl_khr_fp64)
Half-precision Floating-point support (cl_khr_fp16)
Denormals No
Infinity and NANs No
Round to nearest No
Round to zero No
Round to infinity No
IEEE754-2008 fused multiply-add No
Support is emulated in software No
Single-precision Floating-point support (core)
Denormals No
Infinity and NANs Yes
Round to nearest Yes
Round to zero No
Round to infinity No
IEEE754-2008 fused multiply-add No
Support is emulated in software No
Correctly-rounded divide and sqrt operations No
Double-precision Floating-point support (cl_khr_fp64)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Address bits 64, Little-Endian
Global memory size 671941632 (640.8MiB)
Error Correction support No
Max memory allocation 268435456 (256MiB)
Unified memory for Host and Device Yes
Minimum alignment for any data type 128 bytes
Alignment of base address 1024 bits (128 bytes)
Global Memory cache type None
Image support Yes
Max number of samplers per kernel 16
Max size for 1D images from buffer 16777216 pixels
Max 1D or 2D image array size 2048 images
Max 2D image size 8192x8192 pixels
Max 3D image size 2048x2048x2048 pixels
Max number of read image args 128
Max number of write image args 128
Local memory type Global
Local memory size 4194304 (4MiB)
Max number of constant args 8
Max constant buffer size 4194304 (4MiB)
Max size of kernel argument 1024
Queue properties
Out-of-order execution Yes
Profiling Yes
Prefer user sync for interop Yes
Profiling timer resolution 1ns
Execution capabilities
Run OpenCL kernels Yes
Run native kernels Yes
printf() buffer size 16777216 (16MiB)
Built-in kernels
Device Extensions cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_fp16 cl_khr_fp64
NULL platform behavior
clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...) Portable Computing Language
clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...) Success [POCL]
clCreateContext(NULL, ...) [default] Success [POCL]
clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT) Success (1)
Platform Name Portable Computing Language
Device Name pthread-cortex-a53
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU) Success (1)
Platform Name Portable Computing Language
Device Name pthread-cortex-a53
clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL) Success (1)
Platform Name Portable Computing Language
Device Name pthread-cortex-a53
ICD loader properties
ICD loader Name OpenCL ICD Loader
ICD loader Vendor OCL Icd free software
ICD loader Version 2.2.11
ICD loader Profile OpenCL 2.1
I've made my tests with the Ubuntu Server 64-bit version 18.04.5.
If the installation has failed, `clinfo` will return:
Number of platforms 0
Now let's see if Hashcat is able to see our CPU/GPU by running `hashcat -I --force`.
hashcat (v6.1.1-47-gb8a09615) starting...
You have enabled --force to bypass dangerous warnings and errors!
This can hide serious problems and should only be done when debugging.
Do not report hashcat issues encountered when using --force.
OpenCL Info:
============
OpenCL Platform ID #1
Vendor..: The pocl project
Name....: Portable Computing Language
Version.: OpenCL 1.2 pocl 1.6-pre master-0-g984525e1, Debug+Asserts, LLVM 6.0.0, RELOC, SLEEF, FP16, POCL_DEBUG
Backend Device ID #1
Type...........: CPU
Vendor.ID......: 2147483648
Vendor.........: ARM
Name...........: pthread-cortex-a53
Version........: OpenCL 1.2 pocl HSTR: pthread-aarch64-unknown-linux-gnu-cortex-a53
Processor(s)...: 4
Clock..........: 1200
Memory.Total...: 640 MB (limited to 256 MB allocatable in one block)
Memory.Free....: 576 MB
OpenCL.Version.: OpenCL C 1.2 pocl
Driver.Version.: 1.6-pre master-0-g984525e1
The `--force` argument is required, otherwise Hashcat will stop and complain about the outdated driver...
Without the `--force` argument:
hashcat (v6.1.1-47-gb8a09615) starting...
* Device #1: Outdated POCL OpenCL driver detected!
No devices found/left.
To make sure that your current OpenCL installation runs correctly, you can download and compile `trivial_opencl_program.c`:
# Download the test code
wget https://raw.githubusercontent.com/wimvanderbauwhede/limited-systems/master/OpenCL/trivial_opencl_program.c
# Compile the code
gcc -Wno-deprecated-declarations -o trivial_opencl_program trivial_opencl_program.c -lOpenCL
# Run the test
./trivial_opencl_program
It should return `Success`. If not, you might have some compilation issues...
You can also run the clpeak synthetic benchmark.
It only measures the peak metrics that can be achieved using vector operations and does not represent a real-world use case.
# Clone the repo
git clone https://github.com/krrishnarraj/clpeak.git
# Move to the project folder
cd clpeak
# Create build folder
mkdir -v build
# Move to the build folder
cd build
# Create makefiles
cmake ..
# Compile the code
make -j `nproc`
# Install everything
sudo make install
Now to run the benchmark, simply execute `clpeak`:
ubuntu@rpi-3b-01:~/clpeak/build$ clpeak
Platform: Portable Computing Language
Device: pthread-cortex-a53
Driver version : 1.6-pre master-0-g984525e1 (Linux ARM64)
Compute units : 4
Clock frequency : 1200 MHz
Global memory bandwidth (GBPS)
float : 1.13
float2 : 1.03
float4 : 1.19
float8 : 1.12
float16 : 1.40
Single-precision compute (GFLOPS)
float : 1.19
float2 : 2.37
float4 : 4.72
float8 : 9.31
float16 : 18.23
Half-precision compute (GFLOPS)
half : 0.40
half2 : 0.79
half4 : 1.58
half8 : 2.59
half16 : 1.41
Double-precision compute (GFLOPS)
double : 1.19
double2 : 2.37
double4 : 4.72
double8 : 9.27
double16 : 9.36
Integer compute (GIOPS)
int : 3.15
int2 : 3.78
int4 : 7.51
int8 : 12.46
int16 : 18.07
Integer compute Fast 24bit (GIOPS)
int : 3.15
int2 : 3.78
int4 : 7.50
int8 : 12.46
int16 : 18.07
Transfer bandwidth (GBPS)
enqueueWriteBuffer : 1.25
enqueueReadBuffer : 1.25
enqueueWriteBuffer non-blocking : 1.25
enqueueReadBuffer non-blocking : 1.25
enqueueMapBuffer(for read) : 717.17
memcpy from mapped ptr : 1.25
enqueueUnmap(after write) : 1870.63
memcpy to mapped ptr : 1.24
Kernel launch latency : 40.28 us
These results come from a Raspberry Pi 3 Model B:
- with 512MB allocated to the GPU memory,
- and a total of 1.9GB of global memory with zram.
Run `clpeak --help` for more options.
Now we are entering the most interesting part of this gist 😁.
[TODO]
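Until this part is written, here is a minimal sketch of how the hashcat brain is typically started, based on the upstream hashcat-brain documentation. The IP address, port, password and file names below are placeholder values to adapt to your cluster, and on the Raspberry Pi you may also need `--force` as shown earlier:
# On the node acting as brain server (password and port are example values)
hashcat --brain-server --brain-port=6863 --brain-password=MySecretPassword
# On each cluster node, attach the attack to the brain (example MD5 wordlist attack)
hashcat -m 0 -a 0 hashes.txt wordlist.txt --brain-client --brain-host=192.168.1.10 --brain-port=6863 --brain-password=MySecretPassword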
- https://github.com/hashcat/hashcat
- https://github.com/hashcat/hashcat/blob/master/docs/hashcat-brain.md
- https://github.com/hashcat/hashcat/blob/master/BUILD.md
- hashcat/hashcat#2398
- https://gist.github.com/Jiab77/4dc1f8bed339336e02b70c7b0b135a11#increase-ram
- https://github.com/pocl/pocl
- http://portablecl.org/docs/html/install.html
- https://github.com/wimvanderbauwhede/limited-systems/wiki/Installing-OpenCL--on-the-Raspberry-Pi-3
- https://github.com/doe300/VC4CL
- https://github.com/raspberrypi/userland
- https://www.dedoimedo.com/computers/rpi4-ubuntu-mate-hw-video-acceleration.html
- https://www.raspberrypi.org/documentation/configuration/config-txt/README.md
- https://www.raspberrypi.org/documentation/configuration/config-txt/memory.md
- https://www.raspberrypi.org/documentation/configuration/config-txt/video.md