@ogrisel
Last active October 28, 2022 13:54
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"NumPy's `Generator.integers` returns `int64` values by default. One simple way to get uniform values in the uint64 domain is to draw `int64` values and reinterpret their bytes via `np.frombuffer`:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"int64_values = np.random.default_rng().integers(\n",
" np.iinfo(np.int64).min, np.iinfo(np.int64).max, size=int(1e7)\n",
")\n",
"uint64_values = np.frombuffer(int64_values.data, dtype=np.uint64)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Visually check that we actually have uniform data in the uint64 domain:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import matplotlib.pyplot as plt\n",
"\n",
"_ = plt.hist(uint64_values, bins=300)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Map the uint64 values to the float32 space:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# naive_float32_values = uint64_values.astype(np.float32) / np.float32(np.iinfo(np.uint64).max)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.uint32(1)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"uint32_values = np.frombuffer(uint64_values.data, dtype=np.uint32)\n",
"\n",
"# XXX: this would also work but would waste random bytes...\n",
"# uint32_values = uint64_values.astype(np.uint32)\n",
"\n",
"float32_values = (uint32_values >> np.uint32(8)) * (\n",
" np.float32(1.0) / (np.uint32(1) << np.uint32(24))\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"assert np.isfinite(float32_values).all()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Visually check if we have introduced a bias with this naive method:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"_ = plt.hist(float32_values, bins=300)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Still looks uniform enough to me... Let's check with a KS test:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(0.00011932065124509172, 0.938274054695534)"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from scipy import stats\n",
"\n",
"ksstat, p = stats.kstest(float32_values, stats.uniform(loc=0.0, scale=1.0).cdf)\n",
"ksstat, p"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So the p-value is large and we cannot reject the Null Hypothesis (the data is uniform in the [0, 1] range).\n",
"\n",
"If the bias had been large, we could perhaps have estimated the empirical CDF once on such a large collection and then used it to correct `float32_values`.\n",
"\n",
"There might be more subtle issues but maybe for non-cryptographic machine learning applications, this would be enough.\n",
"\n",
"Let's do a simple sanity check:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>y</th>\n",
" <th>y_m1</th>\n",
" <th>y_m2</th>\n",
" <th>y_m3</th>\n",
" <th>y_m4</th>\n",
" <th>y_m5</th>\n",
" <th>y_m6</th>\n",
" <th>y_m7</th>\n",
" <th>y_m8</th>\n",
" <th>y_m9</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>0.401732</td>\n",
" <td>0.901828</td>\n",
" <td>0.621645</td>\n",
" <td>0.929796</td>\n",
" <td>0.065658</td>\n",
" <td>0.097534</td>\n",
" <td>0.117509</td>\n",
" <td>0.309837</td>\n",
" <td>0.565413</td>\n",
" <td>0.189782</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>0.553764</td>\n",
" <td>0.401732</td>\n",
" <td>0.901828</td>\n",
" <td>0.621645</td>\n",
" <td>0.929796</td>\n",
" <td>0.065658</td>\n",
" <td>0.097534</td>\n",
" <td>0.117509</td>\n",
" <td>0.309837</td>\n",
" <td>0.565413</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>0.557354</td>\n",
" <td>0.553764</td>\n",
" <td>0.401732</td>\n",
" <td>0.901828</td>\n",
" <td>0.621645</td>\n",
" <td>0.929796</td>\n",
" <td>0.065658</td>\n",
" <td>0.097534</td>\n",
" <td>0.117509</td>\n",
" <td>0.309837</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>0.229553</td>\n",
" <td>0.557354</td>\n",
" <td>0.553764</td>\n",
" <td>0.401732</td>\n",
" <td>0.901828</td>\n",
" <td>0.621645</td>\n",
" <td>0.929796</td>\n",
" <td>0.065658</td>\n",
" <td>0.097534</td>\n",
" <td>0.117509</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>0.737321</td>\n",
" <td>0.229553</td>\n",
" <td>0.557354</td>\n",
" <td>0.553764</td>\n",
" <td>0.401732</td>\n",
" <td>0.901828</td>\n",
" <td>0.621645</td>\n",
" <td>0.929796</td>\n",
" <td>0.065658</td>\n",
" <td>0.097534</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19999995</th>\n",
" <td>0.496097</td>\n",
" <td>0.072483</td>\n",
" <td>0.377848</td>\n",
" <td>0.173808</td>\n",
" <td>0.063203</td>\n",
" <td>0.640214</td>\n",
" <td>0.838051</td>\n",
" <td>0.698533</td>\n",
" <td>0.970963</td>\n",
" <td>0.697886</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19999996</th>\n",
" <td>0.167435</td>\n",
" <td>0.496097</td>\n",
" <td>0.072483</td>\n",
" <td>0.377848</td>\n",
" <td>0.173808</td>\n",
" <td>0.063203</td>\n",
" <td>0.640214</td>\n",
" <td>0.838051</td>\n",
" <td>0.698533</td>\n",
" <td>0.970963</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19999997</th>\n",
" <td>0.106291</td>\n",
" <td>0.167435</td>\n",
" <td>0.496097</td>\n",
" <td>0.072483</td>\n",
" <td>0.377848</td>\n",
" <td>0.173808</td>\n",
" <td>0.063203</td>\n",
" <td>0.640214</td>\n",
" <td>0.838051</td>\n",
" <td>0.698533</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19999998</th>\n",
" <td>0.868544</td>\n",
" <td>0.106291</td>\n",
" <td>0.167435</td>\n",
" <td>0.496097</td>\n",
" <td>0.072483</td>\n",
" <td>0.377848</td>\n",
" <td>0.173808</td>\n",
" <td>0.063203</td>\n",
" <td>0.640214</td>\n",
" <td>0.838051</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19999999</th>\n",
" <td>0.478732</td>\n",
" <td>0.868544</td>\n",
" <td>0.106291</td>\n",
" <td>0.167435</td>\n",
" <td>0.496097</td>\n",
" <td>0.072483</td>\n",
" <td>0.377848</td>\n",
" <td>0.173808</td>\n",
" <td>0.063203</td>\n",
" <td>0.640214</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>19999991 rows × 10 columns</p>\n",
"</div>"
],
"text/plain": [
" y y_m1 y_m2 y_m3 y_m4 y_m5 \\\n",
"9 0.401732 0.901828 0.621645 0.929796 0.065658 0.097534 \n",
"10 0.553764 0.401732 0.901828 0.621645 0.929796 0.065658 \n",
"11 0.557354 0.553764 0.401732 0.901828 0.621645 0.929796 \n",
"12 0.229553 0.557354 0.553764 0.401732 0.901828 0.621645 \n",
"13 0.737321 0.229553 0.557354 0.553764 0.401732 0.901828 \n",
"... ... ... ... ... ... ... \n",
"19999995 0.496097 0.072483 0.377848 0.173808 0.063203 0.640214 \n",
"19999996 0.167435 0.496097 0.072483 0.377848 0.173808 0.063203 \n",
"19999997 0.106291 0.167435 0.496097 0.072483 0.377848 0.173808 \n",
"19999998 0.868544 0.106291 0.167435 0.496097 0.072483 0.377848 \n",
"19999999 0.478732 0.868544 0.106291 0.167435 0.496097 0.072483 \n",
"\n",
" y_m6 y_m7 y_m8 y_m9 \n",
"9 0.117509 0.309837 0.565413 0.189782 \n",
"10 0.097534 0.117509 0.309837 0.565413 \n",
"11 0.065658 0.097534 0.117509 0.309837 \n",
"12 0.929796 0.065658 0.097534 0.117509 \n",
"13 0.621645 0.929796 0.065658 0.097534 \n",
"... ... ... ... ... \n",
"19999995 0.838051 0.698533 0.970963 0.697886 \n",
"19999996 0.640214 0.838051 0.698533 0.970963 \n",
"19999997 0.063203 0.640214 0.838051 0.698533 \n",
"19999998 0.173808 0.063203 0.640214 0.838051 \n",
"19999999 0.377848 0.173808 0.063203 0.640214 \n",
"\n",
"[19999991 rows x 10 columns]"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"y_series = pd.Series(float32_values).rename(\"y\")\n",
"\n",
"# Build some lagged features to attempt to predict y from previously generated values\n",
"df = pd.concat(\n",
" [y_series] + [y_series.shift(i).rename(f\"y_m{i}\") for i in range(1, 10)], axis=\"columns\"\n",
").dropna()\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"X = df.drop(\"y\", axis=\"columns\").values\n",
"y = df[\"y\"].values"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0.90182781, 0.62164545, 0.92979622, 0.06565809, 0.09753442,\n",
" 0.11750877, 0.30983722, 0.565413 , 0.18978167])"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X[0]"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.4017317295074463"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y[0]"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.\n",
"[Parallel(n_jobs=4)]: Done 5 tasks | elapsed: 25.9s\n",
"[Parallel(n_jobs=4)]: Done 10 tasks | elapsed: 39.6s\n",
"[Parallel(n_jobs=4)]: Done 17 tasks | elapsed: 58.9s\n",
"[Parallel(n_jobs=4)]: Done 24 tasks | elapsed: 1.2min\n",
"[Parallel(n_jobs=4)]: Done 33 tasks | elapsed: 1.7min\n",
"[Parallel(n_jobs=4)]: Done 42 tasks | elapsed: 2.2min\n",
"[Parallel(n_jobs=4)]: Done 53 tasks | elapsed: 2.7min\n",
"[Parallel(n_jobs=4)]: Done 64 tasks | elapsed: 3.2min\n",
"[Parallel(n_jobs=4)]: Done 77 tasks | elapsed: 3.9min\n",
"[Parallel(n_jobs=4)]: Done 90 tasks | elapsed: 4.6min\n",
"[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed: 5.0min finished\n"
]
}
],
"source": [
"from sklearn.ensemble import RandomForestRegressor\n",
"from sklearn.model_selection import TimeSeriesSplit\n",
"from sklearn.model_selection import permutation_test_score\n",
"\n",
"\n",
"rf = RandomForestRegressor(min_samples_leaf=30, n_estimators=100)\n",
"cv = TimeSeriesSplit(n_splits=5, max_train_size=int(5e3), test_size=int(5e3))\n",
"score, permutation_scores, pvalue = permutation_test_score(\n",
" rf, X, y, cv=cv, n_permutations=100, verbose=10, n_jobs=4\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.43564356435643564"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pvalue"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So the p-value is large again: we cannot reject the null hypothesis that a `RandomForestRegressor` predicts the next RNG value from the past `X.shape[1]` generated values no better than chance.\n",
"\n",
"To conclude: there is no obvious predictable signal in this RNG stream of `float32` values."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.10.6 ('base')",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
},
"orig_nbformat": 4,
"vscode": {
"interpreter": {
"hash": "c91744e846ab1fb46a81a92b1fa828c0e6b1381e7e12fd7b2bb300d813000458"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}
@smason

smason commented Oct 24, 2022

not sure if it makes much difference, but I think your "smarter" version is preferred these days, i.e.:

(uint32_values >> 8).astype(np.float32) * np.float32(1 / (1 << 24))

the bottom of https://prng.di.unimi.it/ has some comments on some different definitions of uniformity that could apply here

I noticed that your twitter message said you wanted $[0..1]$, so you might want a -1 in there. Operator precedence is a bit annoying so it needs more brackets than you might expect:

(uint32_values >> 8).astype(np.float32) * np.float32(1 / ((1 << 24) - 1))

e.g. test with:

uint32_values = np.uint32(-1)
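
A minimal sketch of the two scalings discussed above, run on the two extreme `uint32` inputs. It is only an illustration under stated assumptions, not the notebook's own code: the names `top24`, `half_open`, `closed` and `mixed` are made up for this example, and `np.iinfo(np.uint32).max` stands in for `np.uint32(-1)` because newer NumPy releases may reject that out-of-bounds conversion. The last expression mirrors the notebook's formula to show which dtype NumPy's promotion rules give when a `float32` scalar is divided by a `uint32` scalar.

```python
import numpy as np

# Edge cases: the smallest and largest possible uint32 inputs.
uint32_values = np.array([0, np.iinfo(np.uint32).max], dtype=np.uint32)

# Keep the top 24 bits: a float32 significand holds exactly 24 bits,
# so these integers convert to float32 without rounding.
top24 = (uint32_values >> 8).astype(np.float32)

# Scaling by 2**-24 targets the half-open interval [0, 1).
half_open = top24 * np.float32(1 / (1 << 24))

# Scaling by 1 / (2**24 - 1) targets the closed interval [0, 1].
closed = top24 * np.float32(1 / ((1 << 24) - 1))

# Same shape as the notebook's expression: float32 / uint32 is promoted
# to float64, so the product with a uint32 array is float64 as well.
mixed = (uint32_values >> np.uint32(8)) * (
    np.float32(1.0) / (np.uint32(1) << np.uint32(24))
)

print(half_open.dtype, half_open)
print(closed.dtype, closed)
print(mixed.dtype, mixed)
```

The explicit `.astype(np.float32)` together with a `float32` scale factor is what keeps the whole computation in single precision.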

@ogrisel
Author

ogrisel commented Oct 28, 2022

Thanks @smason!

@fcharras

fcharras commented Oct 28, 2022

`float32(np.uint32(x >> uint32(32)) >> uint32(8)) * (float32(1) / float32(uint32(1) << uint32(24)))` seems to be better: it seems equivalent to casting an intermediate float64 to float32, without actually materializing the float64!
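
One possible vectorized reading of the expression above, as a hedged sketch rather than a reference implementation. It assumes `x` is a `uint64` array; the names `high32`, `top24`, `direct` and `via_float64` are hypothetical, and the random input is built with `Generator.bytes` plus `np.frombuffer` purely for the demo. The check at the end only confirms that scaling the 24-bit integers directly in `float32` gives the same values as doing the same scaling in `float64` and casting down, which is expected because both the integers and the power-of-two factor are exactly representable in `float32`.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1000 random uint64 values obtained by reinterpreting raw random bytes.
x = np.frombuffer(rng.bytes(8 * 1000), dtype=np.uint64)

# Take the high 32 bits of each uint64, then keep the top 24 of those.
high32 = (x >> np.uint64(32)).astype(np.uint32)
top24 = high32 >> np.uint32(8)

# Everything below stays in float32: no float64 array is materialized.
direct = top24.astype(np.float32) * (
    np.float32(1) / np.float32(np.uint32(1) << np.uint32(24))
)

# Reference path for comparison: scale in float64, then cast to float32.
via_float64 = (top24.astype(np.float64) * 2.0 ** -24).astype(np.float32)

print(direct.dtype, np.array_equal(direct, via_float64))
```

Note that this variant uses only the upper half of each `uint64`, so it consumes twice as many random bytes per output as reinterpreting the buffer as `uint32`.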
