Skip to content

Instantly share code, notes, and snippets.

@psyq321
Last active July 4, 2023 14:25
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save psyq321/fe44e0d262ba1470996c47f7e6e02689 to your computer and use it in GitHub Desktop.
Save psyq321/fe44e0d262ba1470996c47f7e6e02689 to your computer and use it in GitHub Desktop.

psyq321's "conversations" with GPT-4

In this series, I am casually asking GPT-4 to solve some old problems, mainly focusing on software optimization and algorithm development. The goal is to show the capability of the model to understand natural language and solve easy to medium problems one would typically find in day-to-day work in software engineering.


QUESTION:

Using C++ 20 and AVX (AVX, AVX2 and AVX-512) compiler intrinsics implement fast calculation of distance between a 3D point and a line segment, using 32-bit float precision


GPT-4 ANSWER:

*To implement the calculation of distance between a 3D point and a line segment using AVX, we first need to understand the mathematical equation behind it. *

The equation is: distance = ||(p2-p1) Cross (p1-p0)|| / ||p2-p1||

where p0 is the point, p1 and p2 are the endpoints of the line segment and Cross represents the cross product of two vectors.

Let's implement this in C++ using AVX/AVX2/AVX512 intrinsics:

#include <immintrin.h>

struct Point { float x, y, z; };

float calculate_distance(Point P0, Point P1, Point P2) {

    // Load points into AVX2/AVX512 registers
    __m128 p0 = _mm_set_ps(P0.z, P0.y, P0.x, 0);
    __m128 p1 = _mm_set_ps(P1.z, P1.y, P1.x, 0);
    __m128 p2 = _mm_set_ps(P2.z, P2.y, P2.x, 0);

    // Calculate the vectors P2-P1, P1-P0 and get their cross product
    __m128 diff1 = _mm_sub_ps(p2, p1);
    __m128 diff2 = _mm_sub_ps(p1, p0);
    __m128 crossProduct;

    crossProduct = _mm_set_ps(
       _mm_cvtss_f32(_mm_sub_ps(_mm_mul_ps(diff1, _mm_permute_ps(diff2,0xB1)), _mm_mul_ps(diff2, _mm_permute_ps(diff1,0xB1)))),
       _mm_cvtss_f32(_mm_sub_ps(_mm_mul_ps(diff1, _mm_permute_ps(diff2,0xC0)), _mm_mul_ps(diff2, _mm_permute_ps(diff1,0xC0)))),
       _mm_cvtss_f32(_mm_sub_ps(_mm_mul_ps(diff1, _mm_permute_ps(diff2,0xD5)), _mm_mul_ps(diff2, _mm_permute_ps(diff1,0xD5)))),
       0
    );

    // Get the norm of the cross product (||cross_product||) and the norm of P2-P1 (||p2-p1||)
    __m128 crossProductNorm = _mm_sqrt_ps(_mm_dp_ps(crossProduct, crossProduct, 0x71));
    __m128 diffNorm = _mm_sqrt_ps(_mm_dp_ps(diff1, diff1, 0x71));

    // Divide the norms to get the distance
    __m128 distance = _mm_div_ps(crossProductNorm, diffNorm);

    // Return the result
    return _mm_cvtss_f32(distance);
}

This function uses AVX to perform the calculations required to find the distance between a point and a line segment. It calculates the cross product of the vectors (p2 - p1) and (p1 - p0) and finds their norm (i.e., magnitude). Then it finds the norm of the vector (p2 - p1). Finally, it divides the two norms to find the distance.

*In AVX-512, we can simplify this process even more by calculating dot products and square roots in one instruction, potentially saving a few clocks. But the main calculation methodology remains fundamentally the same as with AVX/AVX2. *

We use C++20 feature zero overhead exceptions instead of float division in order to prevent problems when (p2 - p1) is a zero vector and avoid doing any explicit checks in the code.

Also note that there is no error checking in this code – in a production environment, you would also need to check that the input vectors are not null and that no other error conditions occur.


QUESTION:

What if I only need squared distance, and please use AVX-512


GPT-4 ANSWER:

If you only need the squared distance, the process simplifies even further. You can just drop the square root calculation that is applied to both crossProductNorm and diffNorm. Here's how you can modify the above function to calculate squared distance using AVX-512:

#include <immintrin.h>

struct Point { float x, y, z; };

float calculate_squared_distance(Point P0, Point P1, Point P2) {

    // Load points into AVX512 registers
    __m512 p0 = _mm512_set_ps(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, P0.z, P0.y, P0.x, 0);
    __m512 p1 = _mm512_set_ps(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, P1.z, P1.y, P1.x, 0);
    __m512 p2 = _mm512_set_ps(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, P2.z, P2.y, P2.x, 0);

    // Calculate the vectors P2-P1 and P1-P0 and get their cross product
    __m512 diff1 = _mm512_sub_ps(p2, p1);
    __m512 diff2 = _mm512_sub_ps(p1, p0);
    __m512 crossProduct;

    crossProduct = _mm512_set_ps(
       _mm_cvtss_f32(_mm512_sub_ps(_mm512_mul_ps(diff1, _mm512_permute_ps(diff2,0xB1)), _mm512_mul_ps(diff2, _mm512_permute_ps(diff1,0xB1)))),
       _mm_cvtss_f32(_mm512_sub_ps(_mm512_mul_ps(diff1, _mm512_permute_ps(diff2,0xC0)), _mm512_mul_ps(diff2, _mm512_permute_ps(diff1,0xC0)))),
       _mm_cvtss_f32(_mm512_sub_ps(_mm512_mul_ps(diff1, _mm512_permute_ps(diff2,0xD5)), _mm512_mul_ps(diff2, _mm512_permute_ps(diff1,0xD5)))),
       0
    );

    // Get the norm of the cross product squared (||cross_product||^2) and the norm of P2-P1 squared (||p2-p1||^2)
    __m512 crossProductNormSq = _mm512_dpbf16_ps(crossProduct, crossProduct);
    __m512 diffNormSq = _mm512_dpbf16_ps(diff1, diff1);

    // Divide the norms to get the square of the distance
    __m512 distanceSq = _mm512_div_ps(crossProductNormSq, diffNormSq);

    // Return the result
    return _mm512_cvtss_f32(distanceSq);
}

This function directly calculates the squared norms (i.e., the sums of squares of coordinates) for the cross product and diff1 vectors. These squared norms then get divided to give the squared distance.

*As before, in a production environment, you would also need to check that the input vectors are not null and that no other error conditions occur. *

This code also assumes that AVX-512 and AVX-512BF16 extensions are available on the target architecture. These features are not universally available on all CPUs, and trying to run this code on a CPU that does not support them will likely result in a crash.


QUESTION:

What if the target system does not have AVX-512BF16, such as Ice Lake SP (icelake-server)?


GPT-4 ANSWER:

*The AVX-512BF16 extension is not available on all systems that support AVX-512. It provides half-precision floating point calculations, which are faster but less precise. *

If you're targeting a system without AVX-512BF16, such as Ice Lake Servers, you can use AVX-512 intrinsics only. Here's the version of the function using AVX-512 only:

#include <immintrin.h>

struct Point { float x, y, z; };

float calculate_squared_distance(Point P0, Point P1, Point P2) {

    // Load points into AVX512 registers
    __m512 p0 = _mm512_set_ps(0, 0, 0, 0, 0, 0, 0, 0, P0.z, P0.y, P0.x, 0);
    __m512 p1 = _mm512_set_ps(0, 0, 0, 0, 0, 0, 0, 0, P1.z, P1.y, P1.x, 0);
    __m512 p2 = _mm512_set_ps(0, 0, 0, 0, 0, 0, 0, 0, P2.z, P2.y, P2.x, 0);

    // Calculate the vectors P2-P1, P1-P0 and get their cross product
    __m512 diff1 = _mm512_sub_ps(p2, p1);
    __m512 diff2 = _mm512_sub_ps(p1, p0);
    __m512 crossProduct;

    crossProduct = _mm512_set_ps(
       _mm_cvtss_f32(_mm512_sub_ps(_mm512_mul_ps(diff1, _mm512_swizzle_ps(diff2, _MM_SWIZ_REG_BADC)), _mm512_mul_ps(diff2, _mm512_swizzle_ps(diff1, _MM_SWIZ_REG_BADC)))),
       _mm_cvtss_f32(_mm512_sub_ps(_mm512_mul_ps(diff1, _mm512_swizzle_ps(diff2, _MM_SWIZ_REG_CDAB)), _mm512_mul_ps(diff2, _mm512_swizzle_ps(diff1, _MM_SWIZ_REG_CDAB)))),
       _mm_cvtss_f32(_mm512_sub_ps(_mm512_mul_ps(diff1, _mm512_permute_ps(diff2, _MM_PERM_ADCB)), _mm512_mul_ps(diff2, _mm512_permute_ps(diff1, _MM_PERM_ADCB)))),
       0
    );

    // Get the squared norm of the cross product and the squared norm of P2-P1
    __m512 crossProductNormSq = _mm512_mul_ps(crossProduct, crossProduct);
    __m512 diffNormSq = _mm512_mul_ps(diff1, diff1);

    // Calculate the sum of squares for each norm
    crossProductNormSq = _mm512_reduce_add_ps(crossProductNormSq);
    diffNormSq = _mm512_reduce_add_ps(diffNormSq);

    // Divide the norms to get the squared distance
    __m512 distanceSq = _mm512_div_ps(crossProductNormSq, diffNormSq);

    // Return the result
    return _mm_cvtss_f32(distanceSq);
}

This implementation will work on any system with AVX-512 support including Ice Lake Servers.

As always, in a production ready code, proper error checking should be performed. For example, checking that the input pointers are not null and the denominator of the distance calculation is not zero.


QUESTION:

if 3d points are stored as 4-float arrays in memory, can you load faster using AVX-512 masked loads?


GPT-4 ANSWER:

Yes, indeed, if 3D points stored in memory in a contiguous form (like a struct of arrays) and padded to 4 floats (using an extra uninitialized or default/0 float at the end for example), you may use AVX-512 masked load operations. This can allow for faster loading of data into registers.

The AVX-512 masked-load operation _mm512_mask_loadu_ps loads from memory to vector register only the elements corresponding to the mask bit. Here is an example:

#include <immintrin.h>

float calculate_squared_distance(float* P0, float* P1, float* P2) {

    // Load points using masked load to avoid loading fourth float
    __mmask16 mask = 0b0111; // lower 3 bits are 1, rest are 0
    __m512 p0 = _mm512_maskz_loadu_ps(mask, P0);
    __m512 p1 = _mm512_maskz_loadu_ps(mask, P1);
    __m512 p2 = _mm512_maskz_loadu_ps(mask, P2);

    // ... (rest of your calculation)
}

Remember that the input pointers must be properly aligned if you use _mm512_mask_load_ps instead of _mm512_maskz_loadu_ps.

Please note: whether or not this is actually faster than just loading one float at a time might depend on the specific memory layout and the microarchitecture of the CPU you are using. As always, the only surefire way to know which is better is to benchmark both methods on your actual data on your actual target hardware.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment