Skip to content

Instantly share code, notes, and snippets.

@Jessime
Last active September 30, 2015 21:36
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save Jessime/fbed0639fc69fe751398 to your computer and use it in GitHub Desktop.
Save Jessime/fbed0639fc69fe751398 to your computer and use it in GitHub Desktop.
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a random matrix to use in place of the full kmer counts matrix. We'll use numbers between 0 and 10, and use 6 rows and 4 columns."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"counts = np.random.randint(0, 10, [6, 4])"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([[9, 7, 7, 6],\n",
" [5, 8, 5, 8],\n",
" [7, 1, 8, 8],\n",
" [4, 6, 0, 4],\n",
" [7, 5, 7, 9],\n",
" [2, 1, 7, 1]])"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"counts"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll save the matrix for reproducibility and so we can load it from a file into my pearson correlation script."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"counts_file = '/home/jessime/Desktop/count_test.npy'\n",
"np.save(counts_file, counts)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I've got a script that will calculate the pearson correlation matrix. We can go ahead and run that. Since we want to calculate the pearson of all of the rows of this column against itself, we'll pass the pearson function the count matrix twice, along with a file name to save the correlation matrix."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"/home/jessime/Code/kmers\n"
]
}
],
"source": [
"cd kmers"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from basic_similarity import pearson"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Done with 0 rows.\n"
]
}
],
"source": [
"pearson(counts_file, counts_file, '/home/jessime/Desktop/counts_pearson.npy')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can retrieve the matrix we just made and see what that look like. We can round to 3 decimals so it looks better."
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[ 1. -0.688 0. 0.053 -0.324 0.023]\n",
" [-0.688 1. -0.514 0.688 0. -0.704]\n",
" [ 0. -0.514 1. -0.708 0.849 0.448]\n",
" [ 0.053 0.688 -0.708 1. -0.324 -0.945]\n",
" [-0.324 0. 0.849 -0.324 1. 0. ]\n",
" [ 0.023 -0.704 0.448 -0.945 0. 1. ]]\n"
]
}
],
"source": [
"dist_mat = np.load('/home/jessime/Desktop/counts_pearson.npy')\n",
"dist_mat = np.around(dist_mat, decimals=3)\n",
"print dist_mat"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This looks fine. The next step is to row standardize the original counts matrix. We can center out the means first and then normalize the variance by dividing by the standard deviation.\n",
"\n",
"Numpy naturally does things columnwise, so we'll have to transpose the matrix, subtract, and transpose it back. "
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 1.75, -0.25, -0.25, -1.25],\n",
" [-1.5 , 1.5 , -1.5 , 1.5 ],\n",
" [ 1. , -5. , 2. , 2. ],\n",
" [ 0.5 , 2.5 , -3.5 , 0.5 ],\n",
" [ 0. , -2. , 0. , 2. ],\n",
" [-0.75, -1.75, 4.25, -1.75]])"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"counts_centered = (counts.T - np.mean(counts, axis=1)).T\n",
"counts_centered"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The third row makes it particularly easy to make sure that the centering is correct."
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 1.60591014, -0.22941573, -0.22941573, -1.14707867],\n",
" [-1. , 1. , -1. , 1. ],\n",
" [ 0.34299717, -1.71498585, 0.68599434, 0.68599434],\n",
" [ 0.22941573, 1.14707867, -1.60591014, 0.22941573],\n",
" [ 0. , -1.41421356, 0. , 1.41421356],\n",
" [-0.30151134, -0.70352647, 1.70856429, -0.70352647]])"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"counts_norm = (counts_centered.T / np.std(counts_centered, axis=1)).T\n",
"counts_norm"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can take the dot product of this normalized matrix and its transpose (equivalent to the inner product). We'll also print the pearson calculated matrix just to have it nearby for comparison."
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[ 4. -2.753 0. 0.211 -1.298 0.092]\n",
" [-2.753 4. -2.058 2.753 0. -2.814]\n",
" [ 0. -2.058 4. -2.833 3.395 1.793]\n",
" [ 0.211 2.753 -2.833 4. -1.298 -3.781]\n",
" [-1.298 0. 3.395 -1.298 4. 0. ]\n",
" [ 0.092 -2.814 1.793 -3.781 0. 4. ]]\n",
"\n",
"[[ 1. -0.688 0. 0.053 -0.324 0.023]\n",
" [-0.688 1. -0.514 0.688 0. -0.704]\n",
" [ 0. -0.514 1. -0.708 0.849 0.448]\n",
" [ 0.053 0.688 -0.708 1. -0.324 -0.945]\n",
" [-0.324 0. 0.849 -0.324 1. 0. ]\n",
" [ 0.023 -0.704 0.448 -0.945 0. 1. ]]\n"
]
}
],
"source": [
"count_product = np.dot(counts_norm, counts_norm.T)\n",
"count_product = np.around(count_product, decimals=3)\n",
"print count_product\n",
"print \"\"\n",
"print dist_mat"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now all that's left to do is divide by n where n is the number of columns of the counts matrix."
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[ 1. -0.688 0. 0.053 -0.324 0.023]\n",
" [-0.688 1. -0.514 0.688 0. -0.704]\n",
" [ 0. -0.514 1. -0.708 0.849 0.448]\n",
" [ 0.053 0.688 -0.708 1. -0.324 -0.945]\n",
" [-0.324 0. 0.849 -0.324 1. 0. ]\n",
" [ 0.023 -0.704 0.448 -0.945 0. 1. ]]\n",
"\n",
"[[ 1. -0.688 0. 0.053 -0.324 0.023]\n",
" [-0.688 1. -0.514 0.688 0. -0.704]\n",
" [ 0. -0.514 1. -0.708 0.849 0.448]\n",
" [ 0.053 0.688 -0.708 1. -0.324 -0.945]\n",
" [-0.324 0. 0.849 -0.324 1. 0. ]\n",
" [ 0.023 -0.704 0.448 -0.945 0. 1. ]]\n"
]
}
],
"source": [
"print np.around(count_product/4, decimals=3)\n",
"print \"\"\n",
"print dist_mat"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#Column Standardizing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I'm going to rerun through everything, but column standardize this time."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([[9, 7, 7, 6],\n",
" [5, 8, 5, 8],\n",
" [7, 1, 8, 8],\n",
" [4, 6, 0, 4],\n",
" [7, 5, 7, 9],\n",
" [2, 1, 7, 1]])"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"counts = np.load('/home/jessime/Desktop/count_test.npy')\n",
"counts"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 3.3333, 2.3333, 1.3333, 0. ],\n",
" [-0.6667, 3.3333, -0.6667, 2. ],\n",
" [ 1.3333, -3.6667, 2.3333, 2. ],\n",
" [-1.6667, 1.3333, -5.6667, -2. ],\n",
" [ 1.3333, 0.3333, 1.3333, 3. ],\n",
" [-3.6667, -3.6667, 1.3333, -5. ]])"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Subtract the mean\n",
"counts_col = counts - np.mean(counts, axis=0)\n",
"np.around(counts_col, decimals=4)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 1.4586, 0.8489, 0.4961, 0. ],\n",
" [-0.2917, 1.2127, -0.2481, 0.7223],\n",
" [ 0.5835, -1.3339, 0.8682, 0.7223],\n",
" [-0.7293, 0.4851, -2.1086, -0.7223],\n",
" [ 0.5835, 0.1213, 0.4961, 1.0835],\n",
" [-1.6045, -1.3339, 0.4961, -1.8058]])"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Divide out the standard deviation\n",
"counts_col = counts_col / np.std(counts_col, axis=0)\n",
"np.around(counts_col, decimals=4)"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 3.0944, 0.4808, 0.1495, -1.6982, 1.2002, -3.2266],\n",
" [ 0.4808, 2.139 , -1.4815, 0.8023, 0.6364, -2.577 ],\n",
" [ 0.1495, -1.4815, 3.3954, -3.4251, 1.392 , -0.0303],\n",
" [-1.6982, 0.8023, -3.4251, 5.7351, -2.1955, 0.7813],\n",
" [ 1.2002, 0.6364, 1.392 , -2.1955, 1.7752, -2.8083],\n",
" [-3.2266, -2.577 , -0.0303, 0.7813, -2.8083, 7.8609]])"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Take the dot product of the column normalized count matrix and its transpose.\n",
"counts_col = np.dot(counts_col, counts_col.T)\n",
"np.around(counts_col, decimals=4)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.10"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment