Instantly share code, notes, and snippets.

# maksbotan/1_text.md Secret

Last active September 10, 2023 23:16
Star You must be signed in to star a gist
python for blogpost

We will implement a very simple Signal/Image Processing task: smooth 1-d array with Gaussian kernel (in 2-d case this would be called "Gaussian blur").

This is achieved with a convolution: we move our kernel along the signal and compute sliding dot products.

## Pure python implementation

(See attached notebook for more details)

In pure python it is simply two nested loops:

```def smooth(signal, kernel):
smoothed =  * len(signal)
half_width = (len(kernel) - 1) // 2

for i in range(len(signal)):
left = max(0, i - half_width)
right = min(i + half_width, len(signal) - 1)

smoothed[i] = sum(
signal[j] * kernel[half_width - (i - j)] for j in range(left, right + 1)
)

return smoothed```

Here for each point of input signal we try to overlap the kernel and cut off the edges (assuming the signal is 0 out of its bounds).

If we measure it with IPython's default benchmarking tool, `%timeit`, we get these figures for signal of length 5000 and kernel of length 13:

``````14.6 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
``````

Not so nice!

## NumPy

Now, in Python it is standard to solve such tasks with NumPy: an efficient implementation of arrays and operations on them in C.

With numpy we just call `np.convolve` to solve our task. Let's benchmark it on the same input:

``````84.5 µs ± 371 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
``````

Much better than 15ms.

In standard haskell we can use only linked lists to represent our arrays. How do we write sensible convolution on linked lists?

We will use `Data.List.tails` function that for a list `[a]` will return a list of all of its tails: `[[a]]`. For example, `tails [1,2,3] = [[1,2,3], [2,3], , []]`. Then we apply `take kernel_width` to every tail and we get lists suitable for computing of the dot-product with our kernel. The only thing we need to account for is the beginning of the list, where we can't use the whole kernel. Let's just prepend `(kernel_width - 1) / 2` zeros and then our generic procedure is recovered.

The code:

```smooth :: [Double] -> [Double] -> [Double]
smooth signal kernel = take signal_length \$ map (\t -> take kernel_width t `dot` kernel) \$ tails padded
where
signal_length = length signal
kernel_width = length kernel
half_width = (kernel_width - 1) `div` 2

dot :: [Double] -> [Double] -> Double
dot a b = sum \$ zipWith (*) a b```

Take a not how this just reads off from English description of the algorithm, contrary to what we did in Python.

So, what about the performance? We will compile with optimizations (`-O2`) and use `criterion`, Haskell's de-facto standard tool for benchmarking:

``````benchmarking list/tails
time                 310.1 μs   (309.3 μs .. 310.7 μs)
1.000 R²   (1.000 R² .. 1.000 R²)
mean                 310.2 μs   (309.6 μs .. 311.0 μs)
std dev              2.282 μs   (1.749 μs .. 3.279 μs)
``````

So, naive and very simple Haskell implementation outperforms naive Python code 50 times!

Can we do even more faster?

## Vector

In Haskell there is a dedicated library for working with arrays: `vector`. Unlike NumPy, `vector` is implemented in pure Haskell, without any `C` code.

The cool thing about it is that we can write our code very similarly to plain lists:

```smoothVunfoldr :: V.Vector Double -> V.Vector Double -> V.Vector Double
smoothVunfoldr signal kernel = V.take signal_length \$ V.unfoldr calc padded
where
signal_length = V.length signal
kernel_width = V.length kernel
half_width = (kernel_width - 1) `div` 2

dot :: V.Vector Double -> V.Vector Double -> Double
dot a b = V.sum \$ V.zipWith (*) a b

calc :: V.Vector Double -> Maybe (Double, V.Vector Double)
calc vec
| V.null vec = Nothing
| otherwise = Just (V.take kernel_width vec `dot` kernel, V.tail vec)```

Here `V` is `import qualified Data.Vector.Unbox as V`: unboxed vectors, which represent our arrays as simple memory arrays, without any additional indirections through pointers.

Unfortunately, `vector` lacks `tails` function, so here we use `V.unfoldr`, which lets us generate our result one by one, while keeping some sort of state. In our case the state is the always shrinking tail of the input vector.

Benchmark result:

``````benchmarking vector/unfodlr
time                 132.8 μs   (132.3 μs .. 133.2 μs)
1.000 R²   (1.000 R² .. 1.000 R²)
mean                 131.7 μs   (131.3 μs .. 132.2 μs)
std dev              1.509 μs   (1.184 μs .. 1.936 μs)
``````

Almost 3 times better than linked lists, approaching the performance of Numpy while still writing pure and idiomatic Haskell.

## Vector with slices

Now, take another look at the code we've written. Actually, the only reason we used `tails` and `map` in the lists example is because indexing the list is `O(n)` in haskell, and thus we can't just port Python code as it is.

However, in `vector` indexing and taking slices is `O(1)`, so now we can recreate the Python implementation:

```smoothVslice :: V.Vector Double -> V.Vector Double -> V.Vector Double
smoothVslice signal kernel = V.generate signal_length calc
where
signal_length = V.length signal
kernel_width = V.length kernel
half_width = (kernel_width - 1) `div` 2

calc :: Int -> Double
calc i = V.sum \$ V.zipWith (*) signal_part kernel_part
where
left = max 0 (i - half_width)
right = min (i + half_width) (signal_length - 1)

signal_part = V.slice left (right - left + 1) signal
kernel_part = V.slice (half_width - (i - left)) (right - left + 1) kernel```

Here we just iterate through our input signal, take appropriate slices of it and of the kernel and calculate the dot product.

Recall that in Python we calculated the dot product by explicitly iterating through some indices in the inputs. And in Haskell we first take a slice of the vector and then calculate the product. Doesn't it involve memory copying?

No! Due to Haskell's immutable-by-default nature, `V.slice` can refer to the same memory as used by the parent vector and still be safe from any corruptions this sharing might introduce in an impure language.

Here's the benchmark:

``````benchmarking vector/slice
time                 107.0 μs   (106.8 μs .. 107.2 μs)
1.000 R²   (1.000 R² .. 1.000 R²)
mean                 106.6 μs   (106.4 μs .. 106.8 μs)
std dev              617.6 ns   (511.7 ns .. 779.0 ns)
``````

107 microseconds: we almost achieved the performance of hand-crafted highly optimized C code in NumPy.

Bonus point: if we rewrite our pure Python implementation to work with NumPy arrays and to use `np.dot` instead of manual loop for the dot product, we still get bad performance:

``````11.1 ms ± 148 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
``````

as opposed to 88 microseconds of `np.convolve`.

This means that in Python the developer must always think how to offload as much computation as possible to pre-made C functions in NumPy instead of writing the usual loops. And in Haskell there is no such problem, since the optimizer will happily process pure Haskell code in the `vector` library and pure Haskell code in your project just the same and glue them nicely.

This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters