@cathalmccabe
Created November 29, 2021 10:38
PYNQ Alveo lab exercise solutions
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Example Solutions for Exercises\n",
"\n",
"This notebook is designed to give lab helpers guidance for possible ways to complete the exercises. It is not meant to be a definitive set of \"correct\" solutions. Possible variants are noted in the comments. This notebook is also not self contained - the cells will need to be run in the context of the notebooks.\n",
"\n",
"## Lab 1 - Introduction"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise 1\n",
"\n",
"This is a copy-paste exercise from the vadd code in the notebook"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vmult = ol.vmult_1\n",
"vmult.call(in_a, in_b, out_c, 1024*1024)\n",
"out_c.sync_from_device()\n",
"np.array_equal(in_a * in_b, out_c)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise 2\n",
"\n",
"Main complexity here is creating the temporary buffer. This could be done inside `vmac` for more flexibility at the expense of performance."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"temp = pynq.allocate((1024,1024), 'u4')\n",
"def vmac(in_a, in_b, acc):\n",
" vmult.call(in_a, in_b, temp, in_a.size)\n",
" vadd.call(acc, temp, acc, in_a.size)\n",
" \n",
"out_c[:] = 1\n",
"out_c.sync_to_device()\n",
"vmac(in_a, in_b, out_c)\n",
"out_c.sync_from_device()\n",
"np.array_equal(in_a*in_b + 1, out_c)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise 3\n",
"\n",
"Look up the `np.random.randint` function and the rest of the code is pretty self explanatory. I've used the `vmac` function to drive the hardware as that way we test both at the same time"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"in_a[:] = np.random.randint(0, 100, (1024,1024))\n",
"in_b[:] = np.random.randint(0, 100, (1024,1024))\n",
"out_c[:] = 1\n",
"\n",
"in_a.sync_to_device()\n",
"in_b.sync_to_device()\n",
"out_c.sync_to_device()\n",
"\n",
"vmac(in_a, in_b, out_c)\n",
"out_c.sync_from_device()\n",
"np.array_equal(in_a * in_b + 1, out_c)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Lab 2 Optimization\n",
"\n",
"### Exercise 1\n",
"\n",
"This is mostly a straight copy and paste from the previous notebook applied to the new vmac APIs. The only challenge might be computing the expected result."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"True\n",
"True\n",
"True\n"
]
}
],
"source": [
"import numpy as np\n",
"\n",
"def test(f):\n",
" in_a[:] = np.random.randint(0, 100, (8,1024,1024))\n",
" in_b[:] = np.random.randint(0, 100, (8,1024,1024))\n",
" in_a.sync_to_device()\n",
" in_b.sync_to_device()\n",
" expected = sum([a * b for a, b in zip(in_a, in_b)])\n",
"\n",
" # Make sure the acculmulator is cleared\n",
" acc[:] = 0\n",
" acc.sync_to_device()\n",
"\n",
" f(in_a, in_b, acc)\n",
" \n",
" acc.sync_from_device()\n",
" return np.array_equal(expected, acc)\n",
" \n",
"print(test(vmac_plain))\n",
"print(test(vmac_overlapped))\n",
"print(test(vmac_waitfor))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise 2\n",
"\n",
"This is probably the most complicated of the exercises and involves really thinking about the timeline diagram for an overlapped vmac. Unfortunately at present this can't easily be done using `waitfor` but I'm hopefully we'll add that capability in 2.6"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def vmac_communication(a, b, acc):\n",
" # Perform the first data transfer\n",
" in_a[0].sync_to_device()\n",
" in_b[0].sync_to_device()\n",
" # Perform the first multiplication\n",
" wh = vmult.start(a[0], b[0], temp[0], 1024*1024)\n",
" # Copy the second block and the accumulator\n",
" in_a[1].sync_to_device()\n",
" in_b[1].sync_to_device()\n",
" acc.sync_to_device()\n",
" wh.wait()\n",
" # Loop over all of the additions\n",
" for i in range(8):\n",
" wh = vadd.start(acc, temp[i%2], acc, 1024*1024)\n",
" if i != 7:\n",
" wh2 = vmult.start(a[i+1], b[i+1], temp[(i+1)%2], 1024*1024)\n",
" if i != 6:\n",
" in_a[i+2].sync_to_device()\n",
" in_b[i+2].sync_to_device()\n",
" wh2.wait()\n",
" wh.wait()\n",
" acc.sync_from_device()\n",
" \n",
"in_a[:] = np.random.randint(0, 100, (8,1024,1024))\n",
"in_b[:] = np.random.randint(0, 100, (8,1024,1024))\n",
"expected = sum([a * b for a, b in zip(in_a, in_b)])\n",
"\n",
"acc[:] = 0\n",
"\n",
"vmac_communication(in_a, in_b, acc)\n",
" \n",
"np.array_equal(expected, acc)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise 3\n",
"\n",
"Main complexity here is picking a timing library and a plotting library. I've shown how to do this with `timeit` and `pandas` as I think this is the most succint approach"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.axes._subplots.AxesSubplot at 0x7ff2dc41d650>"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"import timeit\n",
"import functools\n",
"\n",
"def time_vmac(dim, f):\n",
" in_a = pynq.allocate((8, dim, dim), 'u4')\n",
" in_b = pynq.allocate((8, dim, dim), 'u4')\n",
" acc = pynq.allocate((dim, dim), 'u4')\n",
" \n",
" in_a[:] = 100\n",
" in_b[:] = 200\n",
" \n",
" reps, duration = timeit.Timer(functools.partial(f, in_a, in_b, acc)).autorange()\n",
" \n",
" return duration / reps\n",
"\n",
"import pandas as pd\n",
"df = pd.DataFrame(columns=['plain', 'overlapped', 'waitfor'])\n",
"for i in range(11):\n",
" dim = 2 ** i\n",
" df.loc[dim] = [\n",
" time_vmac(dim, vmac_plain),\n",
" time_vmac(dim, vmac_overlapped),\n",
" time_vmac(dim, vmac_waitfor)\n",
" ]\n",
" \n",
"df.plot(loglog=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Lab 3 Compression\n",
"\n",
"These exercises aren't too complex they just involve a fair number of moving parts. The main advice I would give is to focus on whole multiples of the buffer sizes rather than trying to handle arbitrary files.\n",
"\n",
"### Exercise 1\n",
"\n",
"This should be a case of applying the lessons of the previous lab to this one. The test data is big enough for 8 * 8 blocks which mirrors lab 2 exactly"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"in_buffers = pynq.allocate((8, 8, BLOCK_SIZE), 'u1', target=ol.bank1)\n",
"out_buffers = pynq.allocate((8, 8, BLOCK_SIZE), 'u1', target=ol.bank1)\n",
"compressed_size = pynq.allocate((8, 8), 'u4', target=ol.bank1)\n",
"\n",
"in_buffers.reshape((8*8*BLOCK_SIZE))[:] = memoryview(test_data[0:8*8*1024*1024])\n",
"\n",
"def sync_output(sizes, buffers):\n",
" sizes.sync_from_device()\n",
" for s, b in zip(sizes, buffers):\n",
" b[0:s].sync_from_device()\n",
"\n",
"def compress_overlapped():\n",
" in_buffers[0].sync_to_device()\n",
" for i in range(8):\n",
" wh = compress.start(in_buffers[i], out_buffers[i],\n",
" compressed_size[i], uncompressed_size,\n",
" 1024, 8*BLOCK_SIZE)\n",
" if i != 7:\n",
" in_buffers[i+1].sync_to_device()\n",
" if i != 0:\n",
" sync_output(compressed_size[i-1], out_buffers[i-1])\n",
" wh.wait()\n",
" sync_output(compressed_size[7], out_buffers[7])\n",
" \n",
"compress_overlapped()\n",
"\n",
"for i in range(8):\n",
" for j in range(8):\n",
" uncompressed = lz4.block.decompress(out_buffers[i,j,0:compressed_size[i,j]],\n",
" uncompressed_size=1024*1024)\n",
" if len(uncompressed) != BLOCK_SIZE:\n",
" print(\"Wrong block length\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise 2\n",
"\n",
"This one is mainly an exercise in code organisation. Classes are basically essential to avoid variable overload but the actual interleaving should be manageable"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"class Compressor:\n",
" def __init__(self, kernel, bank):\n",
" self._kernel = kernel\n",
" self._in = pynq.allocate((8, BLOCK_SIZE), 'u1', target=bank)\n",
" self._out = pynq.allocate((8, BLOCK_SIZE), 'u1', target=bank)\n",
" self._comp_size = pynq.allocate((8,), 'u4', target=bank)\n",
" self._uncomp_size = pynq.allocate((8,), 'u4', target=bank)\n",
" \n",
" self._uncomp_size[:] = BLOCK_SIZE\n",
" self._uncomp_size.sync_to_device\n",
" \n",
" def transfer_in(self, data):\n",
" self._in.reshape((8*BLOCK_SIZE))[:] = memoryview(data)\n",
" self._in.sync_to_device()\n",
" \n",
" def transfer_out(self):\n",
" sync_output(self._comp_size, self._out)\n",
" return [b[0:s].copy() for s, b in zip(self._comp_size, self._out)]\n",
" \n",
" def start(self):\n",
" return self._kernel.start(self._in, self._out, self._comp_size, self._uncomp_size, 1024, 8*BLOCK_SIZE)\n",
" \n",
"c1 = Compressor(ol.xilLz4Compress_1, ol.bank0)\n",
"c2 = Compressor(ol.xilLz4Compress_2, ol.bank1)\n",
"\n",
"c1.transfer_in(test_data[0:8*BLOCK_SIZE])\n",
"wh1 = c1.start()\n",
"c2.transfer_in(test_data[8*BLOCK_SIZE:16*BLOCK_SIZE])\n",
"wh2 = c2.start()\n",
"wh1.wait()\n",
"result1 = c1.transfer_out()\n",
"wh2.wait()\n",
"result2 = c2.transfer_out()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise 3\n",
"\n",
"Practise implementing an actual spec in Python and _hoepfully_ seeing it's not too bad. Each block is prefixed with a length and the file header is already provided"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import io\n",
"import struct\n",
"\n",
"stream = io.BytesIO()\n",
"\n",
"stream.write(LZ4_HEADER)\n",
"for i in range(8):\n",
" for j in range(8):\n",
" s = compressed_size[i,j]\n",
" subbuf = out_buffers[i,j,0:s]\n",
" stream.write(struct.pack('<I', s))\n",
" stream.write(subbuf)\n",
"stream.write(struct.pack('<I', 0))\n",
"\n",
"stream.seek(0)\n",
"\n",
"import lz4.frame\n",
"uncompressed = lz4.frame.decompress(stream.read())\n",
"uncompressed == test_data[0:64*BLOCK_SIZE]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 4
}