mrocklin/tensordot-memory-rechunk.ipynb

## tensordot-memory-rechunk.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Analyze x.T.dot(x)\n",
    "\n",
    "This workload is both common and easy to make pathological.\n",
    "\n",
    "This notebook explores some tricks to modify its behavior, starting with rechunking."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import dask.array as da"
   ]
  },
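  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Use integer literals for shapes and chunk sizes; float literals such as 5e6 are not reliably accepted\n",
    "x = da.random.normal(size=(5_000_000, 2_500), chunks=(10_000, 500))\n",
    "# x = x.rechunk({0: -1, 1: 100})  # alternative: all rows in one block, narrow column blocks\n",
    "x = x.rechunk({0: 5_000, 1: -1})  # row blocks of 5000, all columns in one block\n",
    "cov = x.T.dot(x)"
   ]
  },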
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x.nbytes / 1e9"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cov.nbytes / 1e9"
   ]
  },
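  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A rough size estimate helps explain the memory pressure. The sketch below reads only array metadata, so nothing is computed. It assumes the contraction produces one partial `(n_cols, n_cols)` product per row block before these are summed, and it treats the case where every partial is in memory at once as a worst-case bound, not a measurement.\n",
    "\n",
    "With the chunking above (float64 elements) this works out to 1000 row blocks of 0.1 GB each, 0.05 GB per partial product, and up to 50 GB of partial products."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Back-of-envelope sizes from metadata only; assumes x from the cells above\n",
    "itemsize = x.dtype.itemsize                               # bytes per element\n",
    "block_bytes = x.chunksize[0] * x.chunksize[1] * itemsize  # one input block\n",
    "partial_bytes = x.shape[1] ** 2 * itemsize                # one partial (n_cols, n_cols) product\n",
    "n_row_blocks = x.numblocks[0]                             # blocks along the contracted axis\n",
    "\n",
    "print(f'row blocks:            {n_row_blocks}')\n",
    "print(f'input block:           {block_bytes / 1e9:.2f} GB')\n",
    "print(f'partial product:       {partial_bytes / 1e9:.2f} GB')\n",
    "print(f'all partials at once:  {n_row_blocks * partial_bytes / 1e9:.1f} GB (worst case)')"
   ]
  },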
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "len(cov.__dask_graph__())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# This takes a while, but can be interesting to look at\n",
    "# cov.visualize(color='order', node_attr={'penwidth': '6'})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We set `processes=True` to get a few workers, each in its own memory space, or `processes=False` to get a single worker that never has to move data around. Comparing the two shows the effect of having a few workers.\n",
    "\n",
    "Currently we find that having a few workers definitely increases the total amount of distributed memory used."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from dask.distributed import Client\n",
    "client = Client(processes=True)  # True: a few worker processes; False: a single in-process worker\n",
    "client"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We watch the status and graph pages of the dashboard during execution.\n",
    "\n",
    "It would be nice to have a new page showing how much data is replicated across workers, and how large that data is."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cov.compute()"
   ]
  },
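  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Until such a page exists, replication can be inspected by hand. The cell below is a minimal sketch, not an established recipe: it submits the graph without blocking (so it recomputes `cov`), waits an arbitrary five seconds, and then uses `client.who_has()` to count keys that are held by more than one worker. It reports key counts only, not bytes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch: count keys held on more than one worker while the graph is running.\n",
    "# Assumes client and cov from the cells above; the 5 second pause is arbitrary.\n",
    "import time\n",
    "\n",
    "future = client.compute(cov)  # non-blocking, unlike cov.compute()\n",
    "time.sleep(5)                 # give the workers time to finish some tasks\n",
    "\n",
    "who_has = client.who_has()    # mapping: key -> workers currently holding it\n",
    "replicated = {key: workers for key, workers in who_has.items() if len(workers) > 1}\n",
    "print(f'{len(replicated)} of {len(who_has)} keys in memory are on more than one worker')\n",
    "\n",
    "result = future.result()      # block until the computation finishes"
   ]
  },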
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}