Hashcat brain on Raspberry Pi 3B / 3B+ and 4B

This gist explains how to install and set up Hashcat brain on a Raspberry Pi based cluster.

I initially tried to use VC4CL instead of POCL, but I could not compile it on Ubuntu Server 18.04.5.

Even after compiling CMake myself, as requested, the compilation still failed.

Install build dependencies

sudo apt install build-essential cmake

Compile hashcat

# Clone the repo
git clone https://github.com/hashcat/hashcat.git

# Move to the project folder
cd hashcat

# Compile the code
make -j `nproc`

# Install everything
sudo make install

Get Raspberry Pi Userland (skip if you are using Raspbian)

# Clone the repo
git clone https://github.com/raspberrypi/userland.git rpi-userland

# Move to the project folder
cd rpi-userland

# Check your kernel architecture (armv7l is 32bit, aarch64 is 64bit)
uname -a

# For 32bit ARM
./buildme

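# Create required symlinks (32bit)
# The 32bit multiarch directory on Debian/Ubuntu is arm-linux-gnueabihf
cd /opt/vc/lib
for F in * ; do sudo ln -sfvn "$PWD/$F" "/usr/lib/arm-linux-gnueabihf/$F" ; done
# The loop leaves a stray link inside an existing pkgconfig directory: remove it and copy the files instead
sudo rm -fv /usr/lib/arm-linux-gnueabihf/pkgconfig/pkgconfig
sudo cp -rv pkgconfig/* /usr/lib/arm-linux-gnueabihf/pkgconfig/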

# For 64bit ARM
./buildme --aarch64

# Create required symlinks (64bit)
cd /opt/vc/lib
for F in * ; do sudo ln -sfvn "$PWD/$F" "/usr/lib/aarch64-linux-gnu/$F" ; done
# The loop leaves a stray link inside an existing pkgconfig directory: remove it and copy the files instead
sudo rm -fv /usr/lib/aarch64-linux-gnu/pkgconfig/pkgconfig
sudo cp -rv pkgconfig/* /usr/lib/aarch64-linux-gnu/pkgconfig/
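Choosing between the 32bit and 64bit branches can also be scripted. This is only a minimal sketch and not part of the procedure I tested: the helper names userland_flag and userland_libdir are hypothetical, and the 32bit directory assumes the Debian/Ubuntu multiarch name arm-linux-gnueabihf.

```shell
#!/bin/bash
# Map the machine name printed by `uname -m` to the buildme flag and to the
# library directory used for the symlinks. Both helpers are hypothetical.

userland_flag() {
    case "$1" in
        aarch64|arm64) echo "--aarch64" ;;
        armv6l|armv7l) echo "" ;;          # 32bit build takes no flag
        *) return 1 ;;
    esac
}

userland_libdir() {
    case "$1" in
        aarch64|arm64) echo "/usr/lib/aarch64-linux-gnu" ;;
        armv6l|armv7l) echo "/usr/lib/arm-linux-gnueabihf" ;;   # assumed multiarch name
        *) return 1 ;;
    esac
}

# Show what would be used on the current machine
arch=$(uname -m)
if flag=$(userland_flag "$arch") && libdir=$(userland_libdir "$arch"); then
    echo "buildme flag: '${flag}'  library dir: ${libdir}"
else
    echo "Unsupported architecture: ${arch}" >&2
fi
```

Keep in mind that a 64bit kernel can run a 32bit userland, so it is worth double-checking that the reported directory actually exists before creating links in it.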

Increase RAM

Although the Raspberry Pi 4B has enough RAM to be a good cluster node, extra memory still lets each node handle more work units.

On the Raspberry Pi 3B+, enabling zram memory compression is necessary to increase the available memory.

Now, let's go technical! 😁

Create the loading script:

sudo nano /usr/bin/zram.sh

And place this content:

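#!/bin/bash
# Creates one zram swap device per CPU core. Must run as root.

echo -e "\nExpanding available memory with zRAM...\n"
cores=$(nproc --all)
modprobe zram num_devices="$cores"
# Compression modules; an error here is harmless if a module is built in or missing
modprobe zstd
modprobe lz4hc_compress

# Turn off any swap that is already active
swapoff -a

# Total RAM in KiB, taken from the "Mem:" line of free
totalmem=$(free | awk '/^Mem:/ {print $2}')
# Size of each device in bytes: 4/3 of the total RAM, split across the cores
#mem=$(( (totalmem / cores) * 1024 ))
mem=$(( (totalmem * 4 / 3 / cores) * 1024 ))

core=0
while [ "$core" -lt "$cores" ]; do
    # Prefer zstd, then fall back to lz4hc and lz4
    echo zstd > "/sys/block/zram$core/comp_algorithm" 2>/dev/null ||
    echo lz4hc > "/sys/block/zram$core/comp_algorithm" 2>/dev/null ||
    echo lz4 > "/sys/block/zram$core/comp_algorithm" 2>/dev/null
    echo "$mem" > "/sys/block/zram$core/disksize"
    mkswap "/dev/zram$core"
    swapon -p 5 "/dev/zram$core"
    core=$((core + 1))
done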

The script prefers the zstd compression algorithm because it gave better performance results.

zstd is not supported on every system, which is why the script falls back to lz4hc and then lz4.
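To see what the sizing line in the script works out to, here is a small sketch of the same integer arithmetic. The function name zram_size_bytes and the 1 GiB / 4 core board are hypothetical values picked for illustration only.

```shell
#!/bin/bash
# Same integer arithmetic as the `mem=` line in zram.sh.
# $1 = total RAM in KiB (second column of the "Mem:" line of `free`)
# $2 = number of CPU cores (one zram device per core)
zram_size_bytes() {
    local total_kib=$1 cores=$2
    echo $(( (total_kib * 4 / 3 / cores) * 1024 ))
}

# Hypothetical board with exactly 1 GiB of RAM and 4 cores
per_device=$(zram_size_bytes 1048576 4)
echo "per device:  ${per_device} bytes"
echo "all devices: $(( per_device * 4 / 1024 / 1024 )) MiB"
```

With these example numbers each device gets 357913600 bytes, which adds up to about 1365 MiB across the four devices. That figure is the uncompressed size of the swap space; how much real RAM it ends up using depends on how well the data compresses.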

Then save it with [Ctrl+O] and [Ctrl+X].

Make it executable:

sudo chmod -v +x /usr/bin/zram.sh

Then create the boot script:

sudo nano /etc/rc.local

And place this content:

#!/bin/bash

/usr/bin/zram.sh &

exit 0

Then save it with [Ctrl+O] and [Ctrl+X].

Make it executable:

sudo chmod -v +x /etc/rc.local

To finish, run the script once by hand to create the additional memory, then check the compression stats and the available memory:

# Manual start
sudo /usr/bin/zram.sh

# Show memory compression stats
zramctl

# Show available memory
free -mlht

If you don't increase the memory with Zram, the POCL compilation will simply fail.
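Before starting a long build, it can be worth confirming that every zram device is really active as swap. The following is a rough sketch that counts the zram entries in /proc/swaps; the helper name zram_swap_count is hypothetical, and the comparison assumes one device per core, as created by zram.sh.

```shell
#!/bin/bash
# Count the active swap entries that live on a zram device.
# $1 = swaps table to read (defaults to /proc/swaps)
zram_swap_count() {
    local swaps_file=${1:-/proc/swaps}
    # Skip the header line, then count device names such as /dev/zram0
    awk 'NR > 1 && $1 ~ /^\/dev\/zram[0-9]+$/ { n++ } END { print n + 0 }' "$swaps_file"
}

if [ -r /proc/swaps ]; then
    active=$(zram_swap_count)
    expected=$(nproc --all)
    if [ "$active" -eq "$expected" ]; then
        echo "zram is ready: ${active} swap device(s) active"
    else
        echo "Only ${active} of ${expected} zram swap device(s) active" >&2
    fi
fi
```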

Compile POCL

Hashcat needs an OpenCL runtime to run, and POCL provides one that executes on the CPU.

# Install required packages
sudo apt install -y build-essential cmake git pkg-config libclang-dev clang llvm make ninja-build ocl-icd-libopencl1 ocl-icd-dev ocl-icd-opencl-dev libhwloc-dev zlib1g zlib1g-dev clinfo dialog apt-utils

# Clone the repo
git clone https://github.com/pocl/pocl.git

# Move to the project folder
cd pocl

# Create build folder
mkdir -v build

# Move to the build folder
cd build

# Get / set configuration (the default one worked for me)
cmake ..

# Compile the code
make -j `nproc`

# Install everything
sudo make install

# Load new installed libraries
sudo ldconfig

# Verify that the new libraries are in the linker cache
ldconfig -p | grep local

# Create required symlink to /etc/OpenCL
sudo ln -sfvn /usr/local/etc/OpenCL /etc/OpenCL

Without this symlink, the ICD loader cannot find the POCL driver, so neither clinfo nor hashcat will detect any platform.
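A quick way to check the symlink is to list the ICD files that the loader can see. This is a minimal sketch: list_icd_files is a hypothetical helper, and it assumes that the POCL install placed a pocl.icd file under /usr/local/etc/OpenCL/vendors, which becomes visible as /etc/OpenCL/vendors through the symlink.

```shell
#!/bin/bash
# List the ICD files in a vendors directory together with the library that
# each one points to (the first line of the file).
# $1 = vendors directory (defaults to /etc/OpenCL/vendors)
list_icd_files() {
    local dir=${1:-/etc/OpenCL/vendors}
    local f
    for f in "$dir"/*.icd; do
        [ -e "$f" ] || continue       # the glob matched nothing
        printf '%s -> %s\n' "$f" "$(head -n 1 "$f")"
    done
}

list_icd_files
```

If nothing is printed, the loader has no driver to offer, and clinfo will most likely report zero platforms.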

Enable GPU driver (skip if it is already loaded)

Now edit your Raspberry Pi's config.txt file, or usercfg.txt on later Ubuntu Server versions.

  • SDCARD Path: /boot/config.txt or /boot/usercfg.txt
  • Mounted Path: /boot/firmware/config.txt or /boot/firmware/usercfg.txt

For the Raspberry Pi 3B / 3B+:

dtoverlay=vc4-fkms-v3d
max_framebuffers=2
gpu_mem=512

If you run into trouble with the vc4-fkms-v3d driver, use the vc4-kms-v3d driver instead.

For the Raspberry Pi 4B:

dtoverlay=vc4-kms-v3d-pi4
max_framebuffers=2
gpu_mem=1024
hdmi_enable_4kp60=1

You can also use memory splitting and CMA allocation if you need it:

# Replace this line:
dtoverlay=vc4-fkms-v3d

# By:
dtoverlay=vc4-fkms-v3d,cma-128

The same applies to the other drivers.

Now you have to reboot to apply your changes.
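Since the file to edit depends on the system, a small helper can locate it and show the GPU settings that are currently active. This is only a sketch: find_boot_config is a hypothetical name, and the search order (usercfg.txt before config.txt, /boot/firmware before /boot) is my assumption based on the paths listed above.

```shell
#!/bin/bash
# Print the first boot config file that exists.
# $1 = optional root prefix: empty on the Pi itself, or a mount point when
#      the SD card is inspected from another machine.
find_boot_config() {
    local root=${1:-}
    local candidate
    # Assumed search order, see the note above
    for candidate in /boot/firmware/usercfg.txt /boot/firmware/config.txt \
                     /boot/usercfg.txt /boot/config.txt; do
        if [ -f "${root}${candidate}" ]; then
            echo "${root}${candidate}"
            return 0
        fi
    done
    return 1
}

# Show the GPU-related settings that are not commented out
if cfg=$(find_boot_config); then
    grep -E '^(dtoverlay|max_framebuffers|gpu_mem)=' "$cfg" ||
        echo "No GPU settings found in ${cfg}"
else
    echo "No Raspberry Pi boot config found" >&2
fi
```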

Platform detection

Once you have restarted your Raspberry Pi, verify the result of your work by running clinfo. On success, the output should look similar to this:

Number of platforms                               1
  Platform Name                                   Portable Computing Language
  Platform Vendor                                 The pocl project
  Platform Version                                OpenCL 1.2 pocl 1.6-pre master-0-g984525e1, Debug+Asserts, LLVM 6.0.0, RELOC, SLEEF, FP16, POCL_DEBUG
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_icd
  Platform Extensions function suffix             POCL

  Platform Name                                   Portable Computing Language
Number of devices                                 1
  Device Name                                     pthread-cortex-a53
  Device Vendor                                   ARM
  Device Vendor ID                                0x13b5
  Device Version                                  OpenCL 1.2 pocl HSTR: pthread-aarch64-unknown-linux-gnu-cortex-a53
  Driver Version                                  1.6-pre master-0-g984525e1
  Device OpenCL C Version                         OpenCL C 1.2 pocl
  Device Type                                     CPU
  Device Profile                                  FULL_PROFILE
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Max compute units                               4
  Max clock frequency                             1200MHz
  Device Partition                                (core)
    Max number of sub-devices                     4
    Supported partition types                     equally, by counts
  Max work item dimensions                        3
  Max work item sizes                             4096x4096x4096
  Max work group size                             4096
  Preferred work group size multiple              8
  Preferred / native vector sizes
    char                                                16 / 16
    short                                                8 / 8
    int                                                  4 / 4
    long                                                 2 / 2
    half                                                 8 / 8        (cl_khr_fp16)
    float                                                4 / 4
    double                                               2 / 2        (cl_khr_fp64)
  Half-precision Floating-point support           (cl_khr_fp16)
    Denormals                                     No
    Infinity and NANs                             No
    Round to nearest                              No
    Round to zero                                 No
    Round to infinity                             No
    IEEE754-2008 fused multiply-add               No
    Support is emulated in software               No
  Single-precision Floating-point support         (core)
    Denormals                                     No
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 No
    Round to infinity                             No
    IEEE754-2008 fused multiply-add               No
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  No
  Double-precision Floating-point support         (cl_khr_fp64)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
  Address bits                                    64, Little-Endian
  Global memory size                              671941632 (640.8MiB)
  Error Correction support                        No
  Max memory allocation                           268435456 (256MiB)
  Unified memory for Host and Device              Yes
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       1024 bits (128 bytes)
  Global Memory cache type                        None
  Image support                                   Yes
    Max number of samplers per kernel             16
    Max size for 1D images from buffer            16777216 pixels
    Max 1D or 2D image array size                 2048 images
    Max 2D image size                             8192x8192 pixels
    Max 3D image size                             2048x2048x2048 pixels
    Max number of read image args                 128
    Max number of write image args                128
  Local memory type                               Global
  Local memory size                               4194304 (4MiB)
  Max number of constant args                     8
  Max constant buffer size                        4194304 (4MiB)
  Max size of kernel argument                     1024
  Queue properties
    Out-of-order execution                        Yes
    Profiling                                     Yes
  Prefer user sync for interop                    Yes
  Profiling timer resolution                      1ns
  Execution capabilities
    Run OpenCL kernels                            Yes
    Run native kernels                            Yes
  printf() buffer size                            16777216 (16MiB)
  Built-in kernels
  Device Extensions                               cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_fp16 cl_khr_fp64

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  Portable Computing Language
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   Success [POCL]
  clCreateContext(NULL, ...) [default]            Success [POCL]
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  Success (1)
    Platform Name                                 Portable Computing Language
    Device Name                                   pthread-cortex-a53
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  Success (1)
    Platform Name                                 Portable Computing Language
    Device Name                                   pthread-cortex-a53
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  Success (1)
    Platform Name                                 Portable Computing Language
    Device Name                                   pthread-cortex-a53

ICD loader properties
  ICD loader Name                                 OpenCL ICD Loader
  ICD loader Vendor                               OCL Icd free software
  ICD loader Version                              2.2.11
  ICD loader Profile                              OpenCL 2.1

I ran my tests on Ubuntu Server 18.04.5 (64bit).

If the detection has failed, clinfo returns:

Number of platforms                               0
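When setting up several nodes, the same check can be done from a script by reading the "Number of platforms" line. A minimal sketch, where platform_count is a hypothetical helper that reads clinfo output from stdin:

```shell
#!/bin/bash
# Print the value of the "Number of platforms" line, or 0 if the line is missing.
platform_count() {
    awk '/^Number of platforms/ { print $NF; found = 1; exit }
         END { if (!found) print 0 }'
}

if command -v clinfo >/dev/null 2>&1; then
    count=$(clinfo | platform_count)
    if [ "$count" -ge 1 ]; then
        echo "OpenCL OK: ${count} platform(s) found"
    else
        echo "No OpenCL platform found, check the /etc/OpenCL symlink" >&2
    fi
fi
```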

Now let's see if Hashcat is able to see our CPU/GPU by running: hashcat -I --force.

hashcat (v6.1.1-47-gb8a09615) starting...

You have enabled --force to bypass dangerous warnings and errors!
This can hide serious problems and should only be done when debugging.
Do not report hashcat issues encountered when using --force.
OpenCL Info:
============

OpenCL Platform ID #1
  Vendor..: The pocl project
  Name....: Portable Computing Language
  Version.: OpenCL 1.2 pocl 1.6-pre master-0-g984525e1, Debug+Asserts, LLVM 6.0.0, RELOC, SLEEF, FP16, POCL_DEBUG

  Backend Device ID #1
    Type...........: CPU
    Vendor.ID......: 2147483648
    Vendor.........: ARM
    Name...........: pthread-cortex-a53
    Version........: OpenCL 1.2 pocl HSTR: pthread-aarch64-unknown-linux-gnu-cortex-a53
    Processor(s)...: 4
    Clock..........: 1200
    Memory.Total...: 640 MB (limited to 256 MB allocatable in one block)
    Memory.Free....: 576 MB
    OpenCL.Version.: OpenCL C 1.2 pocl
    Driver.Version.: 1.6-pre master-0-g984525e1

The --force argument is required; otherwise Hashcat stops and complains about the outdated driver.

Without the --force argument:

hashcat (v6.1.1-47-gb8a09615) starting...

* Device #1: Outdated POCL OpenCL driver detected!

No devices found/left.
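If you would rather add --force only when it is actually needed, the warning shown above can be detected first. This is a rough sketch: needs_force is a hypothetical helper, and it simply looks for the warning text from the output above, so it may need adjusting if another hashcat version words the message differently.

```shell
#!/bin/bash
# Succeeds if the hashcat output on stdin contains the outdated-driver warning.
needs_force() {
    grep -q 'Outdated POCL OpenCL driver detected'
}

if command -v hashcat >/dev/null 2>&1; then
    if hashcat -I 2>&1 | needs_force; then
        echo "POCL driver reported as outdated, retrying with --force"
        hashcat -I --force
    else
        hashcat -I
    fi
fi
```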

Running some OpenCL tests

To make sure that your OpenCL installation works correctly, you can download and compile trivial_opencl_program.c.

# Download the test code
wget https://raw.githubusercontent.com/wimvanderbauwhede/limited-systems/master/OpenCL/trivial_opencl_program.c

# Compile the code
gcc -Wno-deprecated-declarations -o trivial_opencl_program trivial_opencl_program.c -lOpenCL

# Run the test
./trivial_opencl_program

It should return Success. If not, then you might have some compilation issues...

Running synthetic benchmark

You can also run the clpeak synthetic benchmark.

It only measures the peak metrics that can be achieved using vector operations and does not represent a real-world use case.

# Clone the repo
git clone https://github.com/krrishnarraj/clpeak.git

# Move to the project folder
cd clpeak

# Create build folder
mkdir -v build

# Move to the build folder
cd build

# Create makefiles
cmake ..

# Compile the code
make -j `nproc`

# Install everything
sudo make install

Now to run the benchmark, simply execute clpeak:

ubuntu@rpi-3b-01:~/clpeak/build$ clpeak

Platform: Portable Computing Language
  Device: pthread-cortex-a53
    Driver version  : 1.6-pre master-0-g984525e1 (Linux ARM64)
    Compute units   : 4
    Clock frequency : 1200 MHz

    Global memory bandwidth (GBPS)
      float   : 1.13
      float2  : 1.03
      float4  : 1.19
      float8  : 1.12
      float16 : 1.40

    Single-precision compute (GFLOPS)
      float   : 1.19
      float2  : 2.37
      float4  : 4.72
      float8  : 9.31
      float16 : 18.23

    Half-precision compute (GFLOPS)
      half   : 0.40
      half2  : 0.79
      half4  : 1.58
      half8  : 2.59
      half16 : 1.41

    Double-precision compute (GFLOPS)
      double   : 1.19
      double2  : 2.37
      double4  : 4.72
      double8  : 9.27
      double16 : 9.36

    Integer compute (GIOPS)
      int   : 3.15
      int2  : 3.78
      int4  : 7.51
      int8  : 12.46
      int16 : 18.07

    Integer compute Fast 24bit (GIOPS)
      int   : 3.15
      int2  : 3.78
      int4  : 7.50
      int8  : 12.46
      int16 : 18.07

    Transfer bandwidth (GBPS)
      enqueueWriteBuffer              : 1.25
      enqueueReadBuffer               : 1.25
      enqueueWriteBuffer non-blocking : 1.25
      enqueueReadBuffer non-blocking  : 1.25
      enqueueMapBuffer(for read)      : 717.17
        memcpy from mapped ptr        : 1.25
      enqueueUnmap(after write)       : 1870.63
        memcpy to mapped ptr          : 1.24

    Kernel launch latency : 40.28 us

These results come from a Raspberry Pi 3 Model B with:

  • 512MB allocated to the GPU memory.
  • A total of 1.9GB of global memory with zram.

Run clpeak --help for more options.
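To compare several nodes of the cluster, it helps to pull a single figure out of this output. The sketch below prints the highest value found in one section of the clpeak output; clpeak_peak is a hypothetical helper, and it assumes the text layout shown above (a section title, indented "name : value" lines, then a blank line).

```shell
#!/bin/bash
# Print the highest value of one section of clpeak output read from stdin.
# $1 = beginning of the section title, e.g. "Single-precision compute"
clpeak_peak() {
    awk -v want="$1" '
        NF == 0         { in_section = 0; next }    # a blank line ends the section
        index($0, want) { in_section = 1; next }    # section title line
        in_section && NF >= 3 && $(NF - 1) == ":" {
            if ($NF + 0 > max) max = $NF + 0
        }
        END { print max + 0 }
    '
}

if command -v clpeak >/dev/null 2>&1; then
    echo "Peak single-precision GFLOPS: $(clpeak | clpeak_peak 'Single-precision compute')"
fi
```

A title prefix that matches more than one section, such as "Integer compute", returns the highest value across all of the matching sections.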

Setup the brain server

Now we come to the most interesting part of this gist 😁.

[TODO]

@hostileblue2020

Hey!

Did you ever get to doing any more on this?

I actually want to do this very thing as well and have just started playing around with it

@TypeNaN

TypeNaN commented Dec 10, 2021

Raspberry pi 3B
hashcat compile error: make -j 4

/usr/bin/ld: obj/combined.NATIVE.a(7zCrc.LZMA.NATIVE.o): in function `CrcUpdateT0_32':
7zCrc.c:(.text+0x18): undefined reference to `__crc32b'
/usr/bin/ld: 7zCrc.c:(.text+0x4c): undefined reference to `__crc32w'
/usr/bin/ld: 7zCrc.c:(.text+0x54): undefined reference to `__crc32w'
/usr/bin/ld: 7zCrc.c:(.text+0x5c): undefined reference to `__crc32w'
/usr/bin/ld: 7zCrc.c:(.text+0x68): undefined reference to `__crc32w'
/usr/bin/ld: 7zCrc.c:(.text+0x98): undefined reference to `__crc32b'
/usr/bin/ld: obj/combined.NATIVE.a(7zCrc.LZMA.NATIVE.o): in function `CrcUpdateT0_64':
7zCrc.c:(.text+0xc0): undefined reference to `__crc32b'
/usr/bin/ld: 7zCrc.c:(.text+0xf4): undefined reference to `__crc32d'
/usr/bin/ld: 7zCrc.c:(.text+0xfc): undefined reference to `__crc32d'
/usr/bin/ld: 7zCrc.c:(.text+0x104): undefined reference to `__crc32d'
/usr/bin/ld: 7zCrc.c:(.text+0x110): undefined reference to `__crc32d'
/usr/bin/ld: 7zCrc.c:(.text+0x140): undefined reference to `__crc32b'
collect2: error: ld returned 1 exit status
make: *** [src/Makefile:610: hashcat] Error 1
make: *** Waiting for unfinished jobs....

@emmettprexus

@TypeNaN I had the same issue. gcc was missing a flag for the ARM CPUs. You can check out the conversation here where we managed to build and run hashcat successfully on a RPi 3B+ on Raspbian Bullseye with OpenCL:

Hope it'll help you! :)

@jharriga

jharriga commented Mar 9, 2022

Just ran through steps on Pi-OS Bullseye rpi4
I needed to make one modification to your procedure:

  1. In "Compile POCL / Install required packages" step, add pkg 'libclang-cpp-dev' to avoid error during 'cmake ..' step

POCL built and installed with 'clinfo' finding one device, pthread-cortex-a72

Thanks, John
