Skip to content

Instantly share code, notes, and snippets.

@mino98
Created July 11, 2019 14:22
Show Gist options
  • Save mino98/e148d71b40c7d434441820c9f390e8c0 to your computer and use it in GitHub Desktop.
Save mino98/e148d71b40c7d434441820c9f390e8c0 to your computer and use it in GitHub Desktop.
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from scipy.stats import bernoulli\n",
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Mi dicevi che in totale sono 4096 bits, e stimavi:\n",
" * $\\frac{1}{5}$ dei bits hanno un valore, diciamo `1`\n",
" * $\\frac{4}{5}$ l'altro valore, diciamo `0`\n",
" \n",
"Dunque l'entropia di `0` è minore di un bit. Vediamo al volo, in pratica quando potresti risparmiare con vari metodi. Iniziamo gzippando (Huffman coding):"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import zlib"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"def gzip_random_bitstring_and_return_compression_ratio(p, bitlength):\n",
" bitstring = bernoulli.rvs(p=p, size=bitlength)\n",
" \n",
" # Convertiamo in bytes e impacchettiamo per un totale di 512 bytes:\n",
" bitstring = np.packbits(bitstring).tobytes()\n",
" \n",
" bitstring = zlib.compress(bitstring, level=9)\n",
" compressed_size = len(bitstring) * 8\n",
"\n",
" return (bitlength - compressed_size) / bitlength"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Facciamo qualche migliaio di tentativi random (il ratio dipenderà dal fatto che gli `0` e `1` sono più o meno contigui):"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"compression_ratios = [gzip_random_bitstring_and_return_compression_ratio(1/5, 4096) for i in range(10000)]"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.132551953125"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.average(compression_ratios)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In media, risparmi circa 13%. Invece con LZMA (`xz`):"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"import lzma"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"def lzma_random_bitstring_and_return_compression_ratio(p, bitlength):\n",
" bitstring = bernoulli.rvs(p=p, size=bitlength)\n",
" \n",
" # Convertiamo in bytes e impacchettiamo per un totale di 512 bytes:\n",
" bitstring = np.packbits(bitstring).tobytes()\n",
" \n",
" bitstring = lzma.compress(bitstring, \n",
" format=lzma.FORMAT_RAW, \n",
" filters=[{'id': lzma.FILTER_LZMA2, 'preset': 9 | lzma.PRESET_EXTREME}])\n",
"\n",
" compressed_size = len(bitstring) * 8\n",
"\n",
" return (bitlength - compressed_size) / bitlength\n"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"compression_ratios = [lzma_random_bitstring_and_return_compression_ratio(1/5, 4096) for i in range(1000)]"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.057291015625"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.average(compression_ratios)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In media, risparmi circa 6% (`lzma.FORMAT_RAW` già esclude gli header) ed è molto più lento."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In `gzip`, i primi 2 bytes sono header e gli ultimi 4 son checksum. Potresti provare a stripparli. Ridefiniamo come segue:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"def gzip_random_bitstring_and_return_compression_ratio(p, bitlength):\n",
" bitstring = bernoulli.rvs(p=p, size=bitlength)\n",
" \n",
" # Convertiamo in bytes e impacchettiamo per un totale di 512 bytes:\n",
" bitstring = np.packbits(bitstring).tobytes()\n",
" \n",
" compressor = zlib.compressobj(level=9, wbits=-15)\n",
" compressed_bitstring = compressor.compress(bitstring) + compressor.flush()\n",
" compressed_size = len(compressed_bitstring) * 8\n",
" \n",
" return (bitlength - compressed_size) / bitlength"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"compression_ratios = [gzip_random_bitstring_and_return_compression_ratio(1/5, 4096) for i in range(10000)]"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.1442943359375"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.average(compression_ratios)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Come previsto, guadagni un po' di più, circa 14%. Hai comunque bisogno di 3.5KB abbondanti:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3523"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"round(4096 * (1 - 0.14))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Se non si possono fare altre assunzioni su questa bitmap di cui mi dici, dubito cambi molto..."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment