Created
July 11, 2019 14:22
-
-
Save mino98/e148d71b40c7d434441820c9f390e8c0 to your computer and use it in GitHub Desktop.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
{ | |
"cells": [ | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"from scipy.stats import bernoulli\n", | |
"import numpy as np" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Mi dicevi che in totale sono 4096 bits, e stimavi:\n", | |
" * $\\frac{1}{5}$ dei bits hanno un valore, diciamo `1`\n", | |
" * $\\frac{4}{5}$ l'altro valore, diciamo `0`\n", | |
" \n", | |
"Dunque l'entropia di `0` è minore di un bit. Vediamo al volo, in pratica quando potresti risparmiare con vari metodi. Iniziamo gzippando (Huffman coding):" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import zlib" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"def gzip_random_bitstring_and_return_compression_ratio(p, bitlength):\n", | |
" bitstring = bernoulli.rvs(p=p, size=bitlength)\n", | |
" \n", | |
" # Convertiamo in bytes e impacchettiamo per un totale di 512 bytes:\n", | |
" bitstring = np.packbits(bitstring).tobytes()\n", | |
" \n", | |
" bitstring = zlib.compress(bitstring, level=9)\n", | |
" compressed_size = len(bitstring) * 8\n", | |
"\n", | |
" return (bitlength - compressed_size) / bitlength" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Facciamo qualche migliaio di tentativi random (il ratio dipenderà dal fatto che gli `0` e `1` sono più o meno contigui):" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"compression_ratios = [gzip_random_bitstring_and_return_compression_ratio(1/5, 4096) for i in range(10000)]" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"0.132551953125" | |
] | |
}, | |
"execution_count": 7, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"np.average(compression_ratios)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"In media, risparmi circa 13%. Invece con LZMA (`xz`):" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import lzma" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 9, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"def lzma_random_bitstring_and_return_compression_ratio(p, bitlength):\n", | |
" bitstring = bernoulli.rvs(p=p, size=bitlength)\n", | |
" \n", | |
" # Convertiamo in bytes e impacchettiamo per un totale di 512 bytes:\n", | |
" bitstring = np.packbits(bitstring).tobytes()\n", | |
" \n", | |
" bitstring = lzma.compress(bitstring, \n", | |
" format=lzma.FORMAT_RAW, \n", | |
" filters=[{'id': lzma.FILTER_LZMA2, 'preset': 9 | lzma.PRESET_EXTREME}])\n", | |
"\n", | |
" compressed_size = len(bitstring) * 8\n", | |
"\n", | |
" return (bitlength - compressed_size) / bitlength\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 10, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"compression_ratios = [lzma_random_bitstring_and_return_compression_ratio(1/5, 4096) for i in range(1000)]" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 11, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"0.057291015625" | |
] | |
}, | |
"execution_count": 11, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"np.average(compression_ratios)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"In media, risparmi circa 6% (`lzma.FORMAT_RAW` già esclude gli header) ed è molto più lento." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"In `gzip`, i primi 2 bytes sono header e gli ultimi 4 son checksum. Potresti provare a stripparli. Ridefiniamo come segue:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 12, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"def gzip_random_bitstring_and_return_compression_ratio(p, bitlength):\n", | |
" bitstring = bernoulli.rvs(p=p, size=bitlength)\n", | |
" \n", | |
" # Convertiamo in bytes e impacchettiamo per un totale di 512 bytes:\n", | |
" bitstring = np.packbits(bitstring).tobytes()\n", | |
" \n", | |
" compressor = zlib.compressobj(level=9, wbits=-15)\n", | |
" compressed_bitstring = compressor.compress(bitstring) + compressor.flush()\n", | |
" compressed_size = len(compressed_bitstring) * 8\n", | |
" \n", | |
" return (bitlength - compressed_size) / bitlength" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 13, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"compression_ratios = [gzip_random_bitstring_and_return_compression_ratio(1/5, 4096) for i in range(10000)]" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 14, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"0.1442943359375" | |
] | |
}, | |
"execution_count": 14, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"np.average(compression_ratios)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Come previsto, guadagni un po' di più, circa 14%. Hai comunque bisogno di 3.5KB abbondanti:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 17, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"3523" | |
] | |
}, | |
"execution_count": 17, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"round(4096 * (1 - 0.14))" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Se non si possono fare altre assunzioni su questa bitmap di cui mi dici, dubito cambi molto..." | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.7.3" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 4 | |
} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment