@shubhampateliitm
Created September 30, 2018 07:06
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p> I have been looking for a way to handle large amounts of data easily and quickly in TensorFlow. After some research, I figured out that TFRecord was exactly what I was looking for.\n",
"</p> <p>This tutorial is a summary of my understanding of TFRecord.</p>\n",
"\n",
"# What is TFRecord?\n",
"\n",
"<p> TFRecord is a file format for storing data. It can hold a whole dataset in a single file (contiguous memory allocation, easy to read), and it stores the data in binary format. Moreover, it works well with TensorFlow's Dataset API. </p>\n",
"\n",
"The pipeline for storing data in TFRecord format looks something like this:\n",
"> <b> Data -> Feature -> Features -> Example -> Serialized Example -> tfRecord.</b>\n",
"\n",
"Let's take a look at each stage.\n",
"\n",
"## Feature\n",
"\n",
"\n",
"The TensorFlow documentation describes it as <b>\"Protocol messages for describing features for machine learning model training or inference\"</b>. I know, it is hard to parse a sentence with that many features in it.\n",
"So basically, a \"Feature\" is a container that stores data in the form of a list. Three types are supported: float, int64, and bytes (for strings). A list of floats goes into a FloatList, a list of int64s into an Int64List, and bytes into a BytesList."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"import tensorflow as tf\n",
"import numpy as np"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"value: 3\n",
"value: 4\n",
"value: 5\n",
"\n"
]
}
],
"source": [
"## Creating an Int64List.\n",
"## Int64List is the container that Feature expects for integer data. value can\n",
"# take anything that is an iterable of integers,\n",
"# for example a set {3,4,6}, a list [2,5,4], or a tuple (6,4,2).\n",
"# All work fine.\n",
"intL = tf.train.Int64List(value=[3,4,5])\n",
"\n",
"# Let's see how it looks.\n",
"print(intL)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"value: 3.0\n",
"value: 4.0\n",
"value: 5.0\n",
"\n"
]
}
],
"source": [
"## Similarly for floats.\n",
"floatL = tf.train.FloatList(value=[3.0,4.0,5.0])\n",
"print(floatL)"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Before सुई\n",
"After b'\\xe0\\xa4\\xb8\\xe0\\xa5\\x81\\xe0\\xa4\\x88'\n"
]
}
],
"source": [
"## To store text in byte form, we first need to encode the unicode string.\n",
"# The text contains the Hindi word \"sui\" (needle).\n",
"text = u\"सुई\"\n",
"\n",
"## Checking it out before the conversion\n",
"print(\"Before \",text)\n",
"\n",
"## Encoding it as UTF-8\n",
"text = text.encode(\"utf8\")\n",
"\n",
"## Seeing the change\n",
"print(\"After \",text)"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"value: \"\\340\\244\\270\\340\\245\\201\\340\\244\\210\""
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Now let's store it in byte form.\n",
"## Remember to pass it as a list, since BytesList expects one.\n",
"bytesL = tf.train.BytesList(value=[text])\n",
"bytesL"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"int64_list {\n",
" value: 3\n",
" value: 4\n",
" value: 5\n",
"}\n",
"\n"
]
}
],
"source": [
"## Converting the list into a Feature\n",
"intF = tf.train.Feature(int64_list=intL)\n",
"print(intF) ## how it is structured"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[3, 4, 5]"
]
},
"execution_count": 61,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## getting value back\n",
"intF.int64_list.value"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {},
"outputs": [],
"source": [
"## Converting the FloatList and BytesList the same way\n",
"floatF = tf.train.Feature(float_list=floatL)\n",
"bytesF = tf.train.Feature(bytes_list=bytesL)"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"float_list {\n",
" value: 3.0\n",
" value: 4.0\n",
" value: 5.0\n",
"}\n",
" \n",
" bytes_list {\n",
" value: \"\\340\\244\\270\\340\\245\\201\\340\\244\\210\"\n",
"}\n",
"\n"
]
}
],
"source": [
"## Taking a look\n",
"print(floatF,\"\\n\",bytesF)"
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[b'\\xe0\\xa4\\xb8\\xe0\\xa5\\x81\\xe0\\xa4\\x88']\n"
]
}
],
"source": [
"## Getting the bytes back\n",
"te = bytesF.bytes_list.value\n",
"print(te)"
]
},
{
"cell_type": "code",
"execution_count": 80,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'सुई'"
]
},
"execution_count": 80,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## Decoding text and getting it back.\n",
"te[0].decode('utf-8')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Features\n",
"<p> Now that we have seen \"Feature\", let's move on to something called \"Features\".</p>\n",
"> <b>\"Features\"</b> is basically a key-value store, like a Python dictionary: it maps a string \"key\" to a \"Feature\".\n",
" "
]
},
{
"cell_type": "code",
"execution_count": 81,
"metadata": {},
"outputs": [],
"source": [
"## Now create a dict mapping keys to Features.\n",
"features = {\n",
" 'float' : floatF,\n",
" 'byte' : bytesF,\n",
" 'int' : intF\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 82,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'float': float_list {\n",
" value: 3.0\n",
" value: 4.0\n",
" value: 5.0\n",
" }, 'byte': bytes_list {\n",
" value: \"\\340\\244\\270\\340\\245\\201\\340\\244\\210\"\n",
" }, 'int': int64_list {\n",
" value: 3\n",
" value: 4\n",
" value: 5\n",
" }}"
]
},
"execution_count": 82,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"features"
]
},
{
"cell_type": "code",
"execution_count": 84,
"metadata": {},
"outputs": [],
"source": [
"tfFeatures = tf.train.Features(feature=features)"
]
},
{
"cell_type": "code",
"execution_count": 85,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"feature {\n",
" key: \"byte\"\n",
" value {\n",
" bytes_list {\n",
" value: \"\\340\\244\\270\\340\\245\\201\\340\\244\\210\"\n",
" }\n",
" }\n",
"}\n",
"feature {\n",
" key: \"float\"\n",
" value {\n",
" float_list {\n",
" value: 3.0\n",
" value: 4.0\n",
" value: 5.0\n",
" }\n",
" }\n",
"}\n",
"feature {\n",
" key: \"int\"\n",
" value {\n",
" int64_list {\n",
" value: 3\n",
" value: 4\n",
" value: 5\n",
" }\n",
" }\n",
"}"
]
},
"execution_count": 85,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tfFeatures"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Example \n",
"<p> These lines from TensorFlow's official documentation are straightforward and helpful in understanding \"Example\": </p>\n",
"> An Example is a mostly-normalized data format for storing data for\n",
" training and inference. It contains a key-value store (features); where\n",
" each key (string) maps to a Feature message (which is oneof packed BytesList,\n",
" FloatList, or Int64List). This flexible and compact format allows the\n",
" storage of large amounts of typed data, but requires that the data shape\n",
" and use be determined by the configuration files and parsers that are used to\n",
" read and write this format. That is, the Example is mostly *not* a\n",
" self-describing format. <br> ___In TensorFlow, Examples are read in row-major\n",
" format, so any configuration that describes data with rank-2 or above\n",
" should keep this in mind.___ <br> For example, __to store an M x N matrix of Bytes,\n",
" the BytesList must contain M*N bytes, with M rows of N contiguous values\n",
" each. That is, the BytesList value must store the matrix as:\n",
" .... row 0 .... .... row 1 .... // ........... // ... row M-1 ....__<br><br>\n",
" A conformant Example data set obeys the following conventions:\n",
" - If a Feature K exists in one example with data type T, it must be of type T in all other examples when present. It may be omitted.\n",
" - The number of instances of Feature K list data may vary across examples, depending on the requirements of the model.\n",
" - If a Feature K doesn't exist in an example, a K-specific default will be used, if configured.\n",
" - If a Feature K exists in an example but contains no items, the intent is considered to be an empty tensor and no default will be used.\n",
" \n",
"Whenever we store images, we have to flatten them into a single dimension (which can be done by reshaping). Moreover, we also need to save the dimensions of the image, so that at reading time the image can be reshaped back into its original shape. The same logic applies to any matrix-like data."
]
},
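{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a small sketch of the matrix convention described above (the names `matrix`, `matrixFeatures`, and `matrixExample` are just illustrative, not from the TFRecord API): we flatten a matrix row-major into a FloatList and store its shape in a separate Int64List, so a reader can reshape it back."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## A sketch of storing an M x N matrix: flatten it row-major and keep the shape\n",
"## alongside so it can be reshaped back on reading.\n",
"matrix = np.arange(6, dtype=np.float32).reshape(2, 3)\n",
"matrixFeatures = tf.train.Features(feature={\n",
"    'data' : tf.train.Feature(float_list=tf.train.FloatList(value=matrix.flatten().tolist())),\n",
"    'shape': tf.train.Feature(int64_list=tf.train.Int64List(value=list(matrix.shape)))\n",
"})\n",
"matrixExample = tf.train.Example(features=matrixFeatures)\n",
"\n",
"## On reading, the flat values are reshaped using the stored shape.\n",
"np.array(matrixExample.features.feature['data'].float_list.value).reshape(\n",
"    list(matrixExample.features.feature['shape'].int64_list.value))"
]
},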
{
"cell_type": "code",
"execution_count": 88,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"features {\n",
" feature {\n",
" key: \"byte\"\n",
" value {\n",
" bytes_list {\n",
" value: \"\\340\\244\\270\\340\\245\\201\\340\\244\\210\"\n",
" }\n",
" }\n",
" }\n",
" feature {\n",
" key: \"float\"\n",
" value {\n",
" float_list {\n",
" value: 3.0\n",
" value: 4.0\n",
" value: 5.0\n",
" }\n",
" }\n",
" }\n",
" feature {\n",
" key: \"int\"\n",
" value {\n",
" int64_list {\n",
" value: 3\n",
" value: 4\n",
" value: 5\n",
" }\n",
" }\n",
" }\n",
"}\n",
"\n"
]
}
],
"source": [
"# An Example is like one sample of our dataset. It contains all the Features related to that sample.\n",
"# So we are basically creating a structure, or container, that stores one sample.\n",
"firstExample = tf.train.Example(features=tfFeatures)\n",
"## Seeing example\n",
"print(firstExample)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<b> Examples are serializable: they can be converted to a byte string that can be stored on disk and later retrieved exactly as it was, without losing any data. </b>"
]
},
{
"cell_type": "code",
"execution_count": 95,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"b'\\nB\\n\\x19\\n\\x05float\\x12\\x10\\x12\\x0e\\n\\x0c\\x00\\x00@@\\x00\\x00\\x80@\\x00\\x00\\xa0@\\n\\x0e\\n\\x03int\\x12\\x07\\x1a\\x05\\n\\x03\\x03\\x04\\x05\\n\\x15\\n\\x04byte\\x12\\r\\n\\x0b\\n\\t\\xe0\\xa4\\xb8\\xe0\\xa5\\x81\\xe0\\xa4\\x88'\n"
]
}
],
"source": [
"firstSerializableExample = firstExample.SerializeToString()\n",
"print(firstSerializableExample)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It is this serialized form that we store in our TFRecord file. When reading the TFRecord file back, we go through the same pipeline in the reverse direction."
]
},
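{
"cell_type": "markdown",
"metadata": {},
"source": [
"To illustrate the reverse direction, here is a minimal sketch: `tf.train.Example.FromString` (a standard protobuf class method) parses the serialized bytes back into an Example, from which the original values can be read. The name `restoredExample` is just illustrative."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## A minimal sketch of deserialization: parse the bytes back into an Example.\n",
"restoredExample = tf.train.Example.FromString(firstSerializableExample)\n",
"\n",
"## The original values come back intact.\n",
"print(restoredExample.features.feature['int'].int64_list.value)\n",
"print(restoredExample.features.feature['byte'].bytes_list.value[0].decode('utf-8'))"
]
},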
{
"cell_type": "markdown",
"metadata": {},
"source": [
"TFRecordWriter is a class for writing TFRecord files. Its constructor takes a file path and an options argument (None by default). The resulting object can be used like a normal file object (for writing): it provides write, flush, and close.\n",
"> A class to write records to a TFRecords file. This class implements `__enter__` and `__exit__`, and can be used\n",
" in `with` blocks like a normal file."
]
},
{
"cell_type": "code",
"execution_count": 96,
"metadata": {},
"outputs": [],
"source": [
"with tf.python_io.TFRecordWriter('data.tfrecord') as writer:\n",
" writer.write(firstSerializableExample)"
]
},
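{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a closing sketch (assuming the same TF 1.x API used above), `tf.python_io.tf_record_iterator` yields the serialized records stored in `data.tfrecord` one by one, and each can be parsed back into an Example as shown earlier."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## Reading data.tfrecord back: iterate over the serialized records and parse each one.\n",
"for serialized in tf.python_io.tf_record_iterator('data.tfrecord'):\n",
"    readExample = tf.train.Example.FromString(serialized)\n",
"    print(readExample.features.feature['int'].int64_list.value)"
]
},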
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}