This Gist compares several matrix multiplication implementations on the CPU: AVX, SSE4.1, and plain loops. It also implements matrix multiplication in CUDA, using both cuBLAS and a hand-written kernel.
Note: This is not a rigorous benchmark, just an informal test to see how well the implementations perform in a general, qualitative sense rather than a quantitative one.
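For orientation, here is a minimal sketch of what a plain-loop baseline can look like. It is not the code from this Gist: the names `Matrix`, `matmul_naive`, and `matmul_ikj` are hypothetical, and the square, row-major, flat-vector layout is an assumption. The second function does the same arithmetic with the loops reordered so the innermost loop reads and writes contiguous memory, which is typically the access pattern that SIMD versions (SSE4.1, AVX) build on.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed layout: an n x n matrix stored row-major in a flat vector.
using Matrix = std::vector<float>;

// Baseline: textbook i-j-k triple loop computing C = A * B.
// The innermost loop strides through B column-wise (step of n floats).
Matrix matmul_naive(const Matrix& a, const Matrix& b, std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    Matrix c(n * n, 0.0f);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (std::size_t k = 0; k < n; ++k) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
    return c;
}

// Same result with i-k-j loop order: the innermost loop walks one row
// of B and one row of C contiguously.
Matrix matmul_ikj(const Matrix& a, const Matrix& b, std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    Matrix c(n * n, 0.0f);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t k = 0; k < n; ++k) {
            const float aik = a[i * n + k];
            for (std::size_t j = 0; j < n; ++j) {
                c[i * n + j] += aik * b[k * n + j];
            }
        }
    }
    return c;
}
```

Both functions take the matrix dimension `n` explicitly, so `matmul_naive(a, b, 2)` multiplies two 2x2 matrices. Any speed difference between the two loop orders depends on the compiler, its flags, and the matrix size, so it should be measured rather than assumed.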
sid@sid-ubuntu ~> neofetch
sid@sid-ubuntu
--------------
OS: Ubuntu 22.04.2 LTS x86_64
Host: Vivobook_ASUSLaptop K6500ZE_K6500ZE 1.0
Kernel: 5.19.0-46-generic
Uptime: 4 hours, 48 mins
Packages: 2531 (dpkg), 21 (flatpak), 17 (snap)
Shell: fish 3.3.1
Resolution: 1920x1080
DE: GNOME 42.5
WM: Mutter
WM Theme: Adwaita
Theme: Fluent-Dark-compact [GTK2/3]
Icons: Fluent-green-dark [GTK2/3]
Terminal: xfce4-terminal
Terminal Font: FiraCode Nerd Font Mono weight
CPU: 12th Gen Intel i5-12450H (12) @ 4.400GHz
GPU: NVIDIA GeForce RTX 3050 Ti Mobile
GPU: Intel Alder Lake-P GT1 [UHD Graphics]
Memory: 5930MiB / 15709MiB