mino98/compress_biaso.ipynb

## compress_biaso.ipynb
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from scipy.stats import bernoulli\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Mi dicevi che in totale sono 4096 bits, e stimavi:\n",
    " * $\\frac{1}{5}$ dei bits hanno un valore, diciamo `1`\n",
    " * $\\frac{4}{5}$ l'altro valore, diciamo `0`\n",
    " \n",
    "Dunque l'entropia di `0` è minore di un bit. Vediamo al volo, in pratica quando potresti risparmiare con vari metodi. Iniziamo gzippando (Huffman coding):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import zlib"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "def gzip_random_bitstring_and_return_compression_ratio(p, bitlength):\n",
    "    bitstring = bernoulli.rvs(p=p, size=bitlength)\n",
    "    \n",
    "    # Convertiamo in bytes e impacchettiamo per un totale di 512 bytes:\n",
    "    bitstring = np.packbits(bitstring).tobytes()\n",
    "    \n",
    "    bitstring = zlib.compress(bitstring, level=9)\n",
    "    compressed_size = len(bitstring) * 8\n",
    "\n",
    "    return (bitlength - compressed_size) / bitlength"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Facciamo qualche migliaio di tentativi random (il ratio dipenderà dal fatto che gli `0` e `1` sono più o meno contigui):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "compression_ratios = [gzip_random_bitstring_and_return_compression_ratio(1/5, 4096) for i in range(10000)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.132551953125"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.average(compression_ratios)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In media, risparmi circa 13%. Invece con LZMA (`xz`):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "import lzma"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "def lzma_random_bitstring_and_return_compression_ratio(p, bitlength):\n",
    "    bitstring = bernoulli.rvs(p=p, size=bitlength)\n",
    "    \n",
    "    # Convertiamo in bytes e impacchettiamo per un totale di 512 bytes:\n",
    "    bitstring = np.packbits(bitstring).tobytes()\n",
    "    \n",
    "    bitstring = lzma.compress(bitstring, \n",
    "                              format=lzma.FORMAT_RAW, \n",
    "                              filters=[{'id': lzma.FILTER_LZMA2, 'preset': 9 | lzma.PRESET_EXTREME}])\n",
    "\n",
    "    compressed_size = len(bitstring) * 8\n",
    "\n",
    "    return (bitlength - compressed_size) / bitlength\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "compression_ratios = [lzma_random_bitstring_and_return_compression_ratio(1/5, 4096) for i in range(1000)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.057291015625"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.average(compression_ratios)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In media, risparmi circa 6% (`lzma.FORMAT_RAW` già esclude gli header) ed è molto più lento."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In `gzip`, i primi 2 bytes sono header e gli ultimi 4 son checksum. Potresti provare a stripparli. Ridefiniamo come segue:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "def gzip_random_bitstring_and_return_compression_ratio(p, bitlength):\n",
    "    bitstring = bernoulli.rvs(p=p, size=bitlength)\n",
    "    \n",
    "    # Convertiamo in bytes e impacchettiamo per un totale di 512 bytes:\n",
    "    bitstring = np.packbits(bitstring).tobytes()\n",
    "    \n",
    "    compressor = zlib.compressobj(level=9, wbits=-15)\n",
    "    compressed_bitstring = compressor.compress(bitstring) + compressor.flush()\n",
    "    compressed_size = len(compressed_bitstring) * 8\n",
    "    \n",
    "    return (bitlength - compressed_size) / bitlength"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "compression_ratios = [gzip_random_bitstring_and_return_compression_ratio(1/5, 4096) for i in range(10000)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.1442943359375"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.average(compression_ratios)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Come previsto, guadagni un po' di più, circa 14%. Hai comunque bisogno di 3.5KB abbondanti:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "3523"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "round(4096 * (1 - 0.14))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Se non si possono fare altre assunzioni su questa bitmap di cui mi dici, dubito cambi molto..."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
	{
	"cells": [
	{
	"cell_type": "code",
	"execution_count": 1,
	"metadata": {},
	"outputs": [],
	"source": [
	"from scipy.stats import bernoulli\n",
	"import numpy as np"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Mi dicevi che in totale sono 4096 bits, e stimavi:\n",
	" * $\\frac{1}{5}$ dei bits hanno un valore, diciamo `1`\n",
	" * $\\frac{4}{5}$ l'altro valore, diciamo `0`\n",
	" \n",
	"Dunque l'entropia di `0` è minore di un bit. Vediamo al volo, in pratica quando potresti risparmiare con vari metodi. Iniziamo gzippando (Huffman coding):"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 2,
	"metadata": {},
	"outputs": [],
	"source": [
	"import zlib"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 3,
	"metadata": {},
	"outputs": [],
	"source": [
	"def gzip_random_bitstring_and_return_compression_ratio(p, bitlength):\n",
	" bitstring = bernoulli.rvs(p=p, size=bitlength)\n",
	" \n",
	" # Convertiamo in bytes e impacchettiamo per un totale di 512 bytes:\n",
	" bitstring = np.packbits(bitstring).tobytes()\n",
	" \n",
	" bitstring = zlib.compress(bitstring, level=9)\n",
	" compressed_size = len(bitstring) * 8\n",
	"\n",
	" return (bitlength - compressed_size) / bitlength"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Facciamo qualche migliaio di tentativi random (il ratio dipenderà dal fatto che gli `0` e `1` sono più o meno contigui):"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 6,
	"metadata": {},
	"outputs": [],
	"source": [
	"compression_ratios = [gzip_random_bitstring_and_return_compression_ratio(1/5, 4096) for i in range(10000)]"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 7,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"0.132551953125"
	]
	},
	"execution_count": 7,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"np.average(compression_ratios)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"In media, risparmi circa 13%. Invece con LZMA (`xz`):"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 8,
	"metadata": {},
	"outputs": [],
	"source": [
	"import lzma"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 9,
	"metadata": {},
	"outputs": [],
	"source": [
	"def lzma_random_bitstring_and_return_compression_ratio(p, bitlength):\n",
	" bitstring = bernoulli.rvs(p=p, size=bitlength)\n",
	" \n",
	" # Convertiamo in bytes e impacchettiamo per un totale di 512 bytes:\n",
	" bitstring = np.packbits(bitstring).tobytes()\n",
	" \n",
	" bitstring = lzma.compress(bitstring, \n",
	" format=lzma.FORMAT_RAW, \n",
	" filters=[{'id': lzma.FILTER_LZMA2, 'preset': 9 \| lzma.PRESET_EXTREME}])\n",
	"\n",
	" compressed_size = len(bitstring) * 8\n",
	"\n",
	" return (bitlength - compressed_size) / bitlength\n"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 10,
	"metadata": {},
	"outputs": [],
	"source": [
	"compression_ratios = [lzma_random_bitstring_and_return_compression_ratio(1/5, 4096) for i in range(1000)]"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 11,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"0.057291015625"
	]
	},
	"execution_count": 11,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"np.average(compression_ratios)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"In media, risparmi circa 6% (`lzma.FORMAT_RAW` già esclude gli header) ed è molto più lento."
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"In `gzip`, i primi 2 bytes sono header e gli ultimi 4 son checksum. Potresti provare a stripparli. Ridefiniamo come segue:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 12,
	"metadata": {},
	"outputs": [],
	"source": [
	"def gzip_random_bitstring_and_return_compression_ratio(p, bitlength):\n",
	" bitstring = bernoulli.rvs(p=p, size=bitlength)\n",
	" \n",
	" # Convertiamo in bytes e impacchettiamo per un totale di 512 bytes:\n",
	" bitstring = np.packbits(bitstring).tobytes()\n",
	" \n",
	" compressor = zlib.compressobj(level=9, wbits=-15)\n",
	" compressed_bitstring = compressor.compress(bitstring) + compressor.flush()\n",
	" compressed_size = len(compressed_bitstring) * 8\n",
	" \n",
	" return (bitlength - compressed_size) / bitlength"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 13,
	"metadata": {},
	"outputs": [],
	"source": [
	"compression_ratios = [gzip_random_bitstring_and_return_compression_ratio(1/5, 4096) for i in range(10000)]"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 14,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"0.1442943359375"
	]
	},
	"execution_count": 14,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"np.average(compression_ratios)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Come previsto, guadagni un po' di più, circa 14%. Hai comunque bisogno di 3.5KB abbondanti:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 17,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"3523"
	]
	},
	"execution_count": 17,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"round(4096 * (1 - 0.14))"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Se non si possono fare altre assunzioni su questa bitmap di cui mi dici, dubito cambi molto..."
	]
	}
	],
	"metadata": {
	"kernelspec": {
	"display_name": "Python 3",
	"language": "python",
	"name": "python3"
	},
	"language_info": {
	"codemirror_mode": {
	"name": "ipython",
	"version": 3
	},
	"file_extension": ".py",
	"mimetype": "text/x-python",
	"name": "python",
	"nbconvert_exporter": "python",
	"pygments_lexer": "ipython3",
	"version": "3.7.3"
	}
	},
	"nbformat": 4,
	"nbformat_minor": 4
	}