@dwhswenson
Last active July 11, 2021 23:20
SimStore NumPy Array Storage Fix
{
"cells": [
{
"cell_type": "markdown",
"id": "71756bc7",
"metadata": {},
"source": [
"# Fixing the NumPy array storage problem in SimStore\n",
"\n",
"Before OPS 1.5.1, SimStore had a bug when reloading stored NumPy arrays: they came back as raw byte strings instead of arrays. This notebook illustrates the problem and two ways to fix it.\n",
"\n",
"First, we'll load up a SimStore storage written with an older version of OPS (1.5.0 or earlier)."
]
},
{
"cell_type": "markdown",
"id": "507a064b",
"metadata": {},
"source": [
"Contents:\n",
"\n",
"* **Background: The problem and proof that it is recoverable**: Overview of what's going on, and what in general needs to be done to fix it\n",
"* **Fixing it for this process**: Fixing this for a single analysis script/notebook, if you don't want to change the underlying file.\n",
"* **Fixing the file permanently**: Modifying the file in a way that it will permanently load correctly (for OPS 1.5.1 or later)"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "1148848d",
"metadata": {},
"outputs": [],
"source": [
"from openpathsampling.experimental.storage import Storage, monkey_patch_all\n",
"import openpathsampling as paths\n",
"paths = monkey_patch_all(paths)\n",
"\n",
"# we'll use this for some of the fixes\n",
"from openpathsampling.experimental.simstore.uuids import get_uuid"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "bdf41388",
"metadata": {},
"outputs": [],
"source": [
"storage = Storage(\"./tps.db\", mode='r')"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "57b6e238",
"metadata": {},
"outputs": [],
"source": [
"snap = storage.snapshots[0]"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "a71b1181",
"metadata": {},
"outputs": [],
"source": [
"dihedrals = storage.cvs['dihedrals'] # this is a CV that returns an array"
]
},
{
"cell_type": "markdown",
"id": "81b9a500",
"metadata": {},
"source": [
"## Background: The problem and proof that it is recoverable\n",
"\n",
"The general problem is that results from CVs that return NumPy arrays were reloaded as raw byte strings, because nothing told NumPy the correct type to load them as. In the specific case I'm showing here, I have a CV that calculates several dihedrals, and then I use other CVs to select individual dihedrals from that. I only save the values of the `dihedrals` CV to disk (the others are trivial once that is saved).\n",
"\n",
"Just to show how this works, here's the CV for the `omega` dihedral, which takes element 2 (counting from 0) of the dihedrals array:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "8fae234f",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"def getitem(snapshot, cv, num):\n",
" return cv(snapshot)[num]\n",
"\n",
"cv = dihedrals ; num = 2\n"
]
}
],
"source": [
"omega = storage.cvs['omega']\n",
"print(omega.source)\n",
"print(\"cv =\", omega.kwargs['cv'].name, \"; num =\", omega.kwargs['num'])"
]
},
{
"cell_type": "markdown",
"id": "c4e99a33",
"metadata": {},
"source": [
"However, the problem is that the `dihedrals` CV is returning nonsense:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "6874a549",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"b'\\x07\\x92\\x92\\xbf\"M\\xdf?*:.\\xc0\\xa5c6\\xc0'"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"results = dihedrals(snap)\n",
"results"
]
},
{
"cell_type": "markdown",
"id": "311e5314",
"metadata": {},
"source": [
"There's the issue: results are reloaded as byte strings (that's the `b'...'` with a bunch of nonsense between the quotes) instead of NumPy arrays.\n",
"\n",
"To show how this should look, I'll make a copy of that snapshot (so SimStore doesn't know it is identical and has to recalculate the value) and then use the `dihedrals` calculation on that."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "408073be",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([-1.1450814, 1.7445414, -2.7223 , -2.8498318], dtype=float32)"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"new = snap.copy() # copy to make a new UUID\n",
"correct = dihedrals(new)\n",
"correct"
]
},
{
"cell_type": "markdown",
"id": "37559d47",
"metadata": {},
"source": [
"An array of 4 floats: this is what I should see.\n",
"\n",
"The problem is even worse for the `omega` CV, where the nonsense is less obvious:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "b3570c19",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"146"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"omega(snap)"
]
},
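{
"cell_type": "markdown",
"id": "53d529f2",
"metadata": {},
"source": [
"That `146` is just byte 2 (`0x92`) of the byte string above: indexing a Python `bytes` object returns an integer, so the result looks like a plausible number.\n",
"\n",
"To recover the correct values of the `dihedrals` CV, we need to know the correct type for it. We can get the \"type identification string\" used by SimStore with the following (on a correctly-typed result of the function):"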
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "9f4df844",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'ndarray.float32(4)'"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type_name = storage.type_identification.identify(correct)\n",
"type_name"
]
},
{
"cell_type": "markdown",
"id": "f419a20d",
"metadata": {},
"source": [
"The `ndarray` tells us that this is a NumPy `ndarray`, `float32` tells us that the NumPy `dtype` is `float32`, and the `(4)` tells us that the shape is `(4,)`."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "9840e1c6",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"dtype: float32\n",
"shape: (4,)\n"
]
}
],
"source": [
"print('dtype:', correct.dtype)\n",
"print('shape:', correct.shape)"
]
},
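{
"cell_type": "markdown",
"id": "e1a40c11",
"metadata": {},
"source": [
"The type identification string can also be taken apart by hand. The cell below is a minimal sketch, not SimStore's own parser: the helper name `parse_type_string` is hypothetical, and it assumes the string always has the form `ndarray.<dtype>(<dimensions>)` seen above, with the dimensions written as integers."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e1a40c12",
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"import numpy as np\n",
"\n",
"def parse_type_string(type_str):\n",
"    # hypothetical helper: split e.g. 'ndarray.float32(4)' into (dtype, shape)\n",
"    # ASSUMPTION: the string looks like 'ndarray.<dtype>(<dimensions>)'\n",
"    match = re.fullmatch(r'ndarray\\.(\\w+)\\((.*)\\)', type_str)\n",
"    if match is None:\n",
"        raise ValueError(f'not an ndarray type string: {type_str!r}')\n",
"    dtype = np.dtype(match.group(1))\n",
"    # collect every integer inside the parentheses as one dimension\n",
"    shape = tuple(int(dim) for dim in re.findall(r'\\d+', match.group(2)))\n",
"    return dtype, shape\n",
"\n",
"parse_type_string('ndarray.float32(4)')  # dtype float32, shape (4,)"
]
},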
{
"cell_type": "markdown",
"id": "0b2d8b51",
"metadata": {},
"source": [
"Knowing the `dtype`, we can get the results back using NumPy's `frombuffer` function:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "a8fd80fb",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[ 0.48908299 -22.38924791] <== No dtype: incorrect!\n",
"[-1.1450814 1.7445414 -2.7223 -2.8498318] <== Explicit dtype gives correct values\n"
]
}
],
"source": [
"import numpy as np\n",
"# importantly, I need to know the dtype here -- it defaults to float64, which is wrong!\n",
"print(np.frombuffer(results), \"<== No dtype: incorrect!\")\n",
"print(np.frombuffer(results, dtype='float32'), \"<== Explicit dtype gives correct values\")"
]
},
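{
"cell_type": "markdown",
"id": "e1a40c13",
"metadata": {},
"source": [
"To make the recovery step reproducible without the storage file, the cell below is a small self-contained sketch that starts from the byte string shown earlier. The helper name `bytes_to_array` is hypothetical (it is not part of OPS or SimStore); it only wraps `np.frombuffer` and `reshape`, and it assumes the bytes were written in the machine's native byte order."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e1a40c14",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# the 16 raw bytes that the `dihedrals` CV returned above\n",
"raw = b'\\x07\\x92\\x92\\xbf\"M\\xdf?*:.\\xc0\\xa5c6\\xc0'\n",
"\n",
"def bytes_to_array(data, dtype, shape):\n",
"    # hypothetical helper, not an OPS/SimStore function\n",
"    # frombuffer returns a read-only view of the bytes; copy to get a writable array\n",
"    return np.frombuffer(data, dtype=dtype).reshape(shape).copy()\n",
"\n",
"fixed = bytes_to_array(raw, 'float32', (4,))\n",
"print(fixed)   # the four dihedral values\n",
"print(raw[2])  # indexing bytes gives an int: this is where 146 came from"
]
},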
{
"cell_type": "markdown",
"id": "f706a0cd",
"metadata": {},
"source": [
"Up until now, all of this would work with OPS 1.5.0. The following tricks were implemented as part of OPS 1.5.1, and will enable easier use of your data."
]
},
{
"cell_type": "markdown",
"id": "bce81d1d",
"metadata": {},
"source": [
"## Fixing it for this process\n",
"\n",
"If you don't want to modify the underlying file, you can fix it up in each analysis script/notebook. This is mainly useful if you don't want to change the file modification dates, or if you use some sort of cryptographic signature on your output files that will break if the contents are modified. Most users will want the permanent fix described in the next section."
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "48f6b9bb",
"metadata": {},
"outputs": [],
"source": [
"storage.backend.sfr_result_types[get_uuid(dihedrals)] = type_name\n",
"storage.backend.serialization[type_name] = storage.class_info.handler_for(type_name)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "1a7ccf0b",
"metadata": {},
"outputs": [],
"source": [
"dihedrals.preload_cache() # this clears and reloads the cache"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "41f836f6",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([-1.1450814, 1.7445414, -2.7223 , -2.8498318], dtype=float32)"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dihedrals(snap)"
]
},
{
"cell_type": "markdown",
"id": "d03a6474",
"metadata": {},
"source": [
"Note that this fix works for CVs that depend on this CV, too. In this specific example, the `dihedrals` CV stores 4 different dihedrals, and very simple CVs such as `omega` (whose code and keyword arguments are shown in the background section) extract each individual dihedral. After clearing the value that was cached earlier, `omega` returns the correct result:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "19dc8b92",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"-2.7223"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"omega.local_cache.clear() # clear the result we previously loaded\n",
"omega(snap)"
]
},
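{
"cell_type": "markdown",
"id": "e1a40c15",
"metadata": {},
"source": [
"If several stored CVs return arrays, the registration lines above have to be repeated for each of them. The cell below is a minimal sketch of how that could be wrapped up. The helper name `register_result_type` is hypothetical; it only uses the attributes and methods already used in this notebook, and it assumes that recalculating the CV on one copied snapshot is cheap enough to obtain a correctly-typed example result."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e1a40c16",
"metadata": {},
"outputs": [],
"source": [
"from openpathsampling.experimental.simstore.uuids import get_uuid\n",
"\n",
"def register_result_type(storage, cv, snapshot):\n",
"    # hypothetical helper: applies the per-process fix shown above to one CV\n",
"    # recalculate on a copy (new UUID) to get a correctly-typed example result\n",
"    example = cv(snapshot.copy())\n",
"    type_name = storage.type_identification.identify(example)\n",
"    backend = storage.backend\n",
"    backend.sfr_result_types[get_uuid(cv)] = type_name\n",
"    backend.serialization[type_name] = storage.class_info.handler_for(type_name)\n",
"    cv.preload_cache()  # clear and reload the cache with the new type\n",
"    return type_name\n",
"\n",
"# usage; the list of array-valued CV names is specific to your file:\n",
"# for name in ['dihedrals']:\n",
"#     register_result_type(storage, storage.cvs[name], storage.snapshots[0])"
]
},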
{
"cell_type": "markdown",
"id": "2c8d2233",
"metadata": {},
"source": [
"## Fixing the file permanently"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "94c88c96",
"metadata": {},
"outputs": [],
"source": [
"# making a new copy of the file where I'll make the permanent fix\n",
"!cp tps.db permafix.db"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "f360ca78",
"metadata": {},
"outputs": [],
"source": [
"storage = Storage(\"./permafix.db\", mode='a')"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "4f1cedef",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"b'\\x07\\x92\\x92\\xbf\"M\\xdf?*:.\\xc0\\xa5c6\\xc0'"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# proving that the problem is still here\n",
"snap = storage.snapshots[0]\n",
"dihedrals = storage.cvs['dihedrals']\n",
"dihedrals(snap)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "afd5e3bd",
"metadata": {},
"outputs": [],
"source": [
"# create a table to store result types (only do once per file)\n",
"backend = storage.backend\n",
"backend.register_schema(\n",
" {'sfr_result_types': [('uuid', 'str'), ('result_type', 'str')]},\n",
" []\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "6a62af2f",
"metadata": {},
"outputs": [],
"source": [
"# register the result type for this specific CV (must do for each CV)\n",
"sfr_result_type_table = backend.metadata.tables['sfr_result_types']\n",
"func_uuid = get_uuid(dihedrals)\n",
"with storage.backend.engine.connect() as conn:\n",
" conn.execute(sfr_result_type_table.insert(),\n",
" {'uuid': func_uuid, 'result_type': type_name})"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "21aa3aa0",
"metadata": {},
"outputs": [],
"source": [
"storage.close()\n",
"# resetting _known_storages isn't needed if you do the rest in a new process\n",
"Storage._known_storages = {}"
]
},
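{
"cell_type": "markdown",
"id": "e1a40c17",
"metadata": {},
"source": [
"As an independent check that the new table was written, the file can be inspected without OPS. The cell below is a sketch that assumes the storage file is a SQLite database (an assumption about the backend, not something shown above) and uses only Python's standard `sqlite3` module. The helper name `read_result_types` is hypothetical."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e1a40c18",
"metadata": {},
"outputs": [],
"source": [
"import sqlite3\n",
"\n",
"def read_result_types(path):\n",
"    # hypothetical helper; ASSUMPTION: `path` is a SQLite file that\n",
"    # contains the sfr_result_types table registered above\n",
"    conn = sqlite3.connect(path)\n",
"    try:\n",
"        rows = conn.execute('SELECT uuid, result_type FROM sfr_result_types')\n",
"        return dict(rows)  # maps each CV's UUID to its type identification string\n",
"    finally:\n",
"        conn.close()\n",
"\n",
"read_result_types('./permafix.db')"
]
},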
{
"cell_type": "code",
"execution_count": 22,
"id": "9437b4d2",
"metadata": {},
"outputs": [],
"source": [
"new_storage = Storage(\"./permafix.db\", mode='r')"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "951967e2",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([-1.1450814, 1.7445414, -2.7223 , -2.8498318], dtype=float32)"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"snap = new_storage.snapshots[0]\n",
"dihedrals = new_storage.cvs['dihedrals']\n",
"dihedrals(snap)"
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "fb238f59",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"-2.7223"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"omega = new_storage.cvs['omega']\n",
"omega(snap)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [conda env:tmp-fabulous]",
"language": "python",
"name": "conda-env-tmp-fabulous-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}