darkblue-b/f0.adoc

    Search and Matching / Gene Sequences


Notes on implementation language environments, execution performance and machine considerations, based on the software technical supplement to "Uncovering disease-disease relationships through the incomplete interactome" (Science Magazine, Feb 2015).


19Apr15 dbb  _v0.2


Observations


supplied python code ran quickly on all compute environments (~2 seconds)


non-trivial transform of a python networkx interactome requires 9GB of physical RAM


neo4j interactive environment is mature and visually pleasing


neo4j queries use a single core only


neo4j visualization breaks down at a few hundred nodes, even on the Xeon E5


Initial Supplement Code Execution


Three compute environments (described below) were used for various incarnations of the problem: first as supplied, in Python 2.7.x with supporting data files, and subsequently ported to neo4j. Initial run:


create 'agilesde' account on i7d


ensure Python 2.7.x and the networkx and numpy libraries are present


copy the example data files and code to server


execute sample code  (first run)


## two programs are run; the first is an analysis of a single disease profile
##  against the reference interactome; the second uses two disease profiles and executes the analysis

agilesde@i7d:~/Documents/saleh_pkg/source$ python localization.py -n interactome.tsv -g PD.txt \
    -o output.txt

> default network from "interactome.tsv" will be used

> done loading network:
> network contains 13460 nodes and 141296 links

> done reading genes:
> 20 genes found in PD.txt

> lcc size = 3
> mean shortest distance = 1.05

> random simulation [1000 of 1000]
> gene set from "PD.txt": 20 genes
> lcc size   S = 3
> diameter d_s = 1.05

> Random expectation:
> lcc [rand] = 19.671
> => z-score of observed lcc = -27.0169651597

> results have been saved to output.txt


#----------------------------------------------------
agilesde@i7d:~/Documents/saleh_pkg/source$ ./separation.py -n interactome.tsv --g1 MS.txt --g2 PD.txt \
    -o output.txt

> default network from "interactome.tsv" will be used

> done loading network:
> network contains 13460 nodes and 141296 links

> done reading genes:
> 108 genes found in MS.txt
> ignoring 39 genes that are not in the network
> remaining number of genes: 69

> done reading genes:
> 20 genes found in PD.txt

> gene set A from "MS.txt": 69 genes, network-diameter d_A = 1.85507246377
> gene set B from "PD.txt": 20 genes, network-diameter d_B = 1.05
> mean shortest distance between A & B: d_AB = 2.73033707865
> network separation of A & B:          s_AB = 1.27780084677

> results have been saved to output.txt
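
The s_AB value in the separation.py output above is consistent with the separation measure defined in the paper, s_AB = d_AB - (d_A + d_B) / 2. The following is a minimal sketch that re-derives it from the three reported distances; the function name `network_separation` is illustrative and is not taken from the supplied code.

```python
def network_separation(d_a, d_b, d_ab):
    """Return s_AB = d_AB - (d_A + d_B) / 2.

    d_a, d_b: mean shortest distance within gene sets A and B
    d_ab:     mean shortest distance between the two sets
    """
    return d_ab - (d_a + d_b) / 2.0


# values copied from the MS.txt / PD.txt run above
s_ab = network_separation(1.85507246377, 1.05, 2.73033707865)
print("s_AB = %.10f" % s_ab)
```

The result agrees with the reported 1.27780084677 to within rounding. In the paper, a positive s_AB is read as two topologically separated gene sets and a negative value as overlapping ones.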


Show an example of another disease input


pick a new gene set from data/DataS2_disease_genes.tsv


make an input file in the expanded form of one gene number per line


ovarian neoplasms  (20 genes total, 14 from OMIM, 6 from GWAS)
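
The two preparation steps above can be sketched in Python. This assumes that each row of data/DataS2_disease_genes.tsv keeps its gene IDs in semicolon-separated fields, which these notes do not confirm; the function names, the output file name and the sample IDs are placeholders, not real ovarian neoplasms genes.

```python
import os
import tempfile


def expand_gene_field(field, sep=";"):
    """Split one delimited gene field into a list of gene ID strings.

    The ';' separator is an assumption about the TSV layout.
    """
    return [g.strip() for g in field.split(sep) if g.strip()]


def write_gene_file(path, gene_ids):
    """Write one gene ID per line, the form passed to the -g option."""
    with open(path, "w") as handle:
        for gene_id in gene_ids:
            handle.write("%s\n" % gene_id)


# hypothetical OMIM and GWAS fields for one disease row
omim_field = "101;102;103"
gwas_field = "103;204"

# merge both sources, dropping duplicates while keeping first-seen order
genes = []
for gene_id in expand_gene_field(omim_field) + expand_gene_field(gwas_field):
    if gene_id not in genes:
        genes.append(gene_id)

# write to a scratch directory so the sketch leaves the source tree untouched
out_path = os.path.join(tempfile.mkdtemp(), "example_genes.txt")
write_gene_file(out_path, genes)
```

The resulting file can then be passed to localization.py with -g in the same way as PD.txt.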


python localization.py -n interactome.tsv -g ovarian_neoplasms.txt -o on_0.txt

> default network from "interactome.tsv" will be used

> done loading network:
> network contains 13460 nodes and 141296 links

> done reading genes:
> 20 genes found in ovarian_neoplasms.txt
> ignoring 3 genes that are not in the network
> remaining number of genes: 17

> lcc size = 10
> mean shortest distance = 1.70588235294

> random simulation [1000 of 1000]
> gene set from "ovarian_neoplasms.txt": 17 genes
> lcc size   S = 10
> diameter d_s = 1.70588235294

> Random expectation:
> lcc [rand] = 16.831
> => z-score of observed lcc = -15.9928345284

> results have been saved to on_0.txt

##-----------------------------------
agilesde@i7d:~/Documents/saleh_pkg/source$ python separation.py -n interactome.tsv \
    --g1 ovarian_neoplasms.txt --g2 PD.txt -o on_PD0.txt

> default network from "interactome.tsv" will be used

> done loading network:
> network contains 13460 nodes and 141296 links

> done reading genes:
> 20 genes found in ovarian_neoplasms.txt
> ignoring 3 genes that are not in the network
> remaining number of genes: 17

> done reading genes:
> 20 genes found in PD.txt

> gene set A from "ovarian_neoplasms.txt": 17 genes, network-diameter d_A = 1.70588235294
> gene set B from "PD.txt": 20 genes, network-diameter d_B = 1.05
> mean shortest distance between A & B: d_AB = 2.59459459459
> network separation of A & B:          s_AB = 1.21665341812

> results have been saved to on_PD0.txt

#--------------------------------------------------
agilesde@i7d:~/Documents/saleh_pkg/source$ python separation.py -n interactome.tsv \
    --g1 ovarian_neoplasms.txt --g2 MS.txt -o on_PD0.txt

> default network from "interactome.tsv" will be used

> done loading network:
> network contains 13460 nodes and 141296 links

> done reading genes:
> 20 genes found in ovarian_neoplasms.txt
> ignoring 3 genes that are not in the network
> remaining number of genes: 17

> done reading genes:
> 108 genes found in MS.txt
> ignoring 39 genes that are not in the network
> remaining number of genes: 69

> gene set A from "ovarian_neoplasms.txt": 17 genes, network-diameter d_A = 1.70588235294
> gene set B from "MS.txt": 69 genes, network-diameter d_B = 1.85507246377
> mean shortest distance between A & B: d_AB = 2.13953488372
> network separation of A & B:          s_AB = 0.359057475366

> results have been saved to on_PD0.txt
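
The z-scores printed by localization.py in the runs above compare the observed lcc size with the sizes obtained for randomly drawn gene sets. The following is a minimal sketch of that comparison, assuming the usual definition z = (observed - mean) / standard deviation with a population standard deviation. The function name is illustrative, and the random sizes are made-up placeholders, not values from the 1000 simulations.

```python
import math


def z_score(observed, random_values):
    """z = (observed - mean) / std over the random samples (population std).

    Raises ZeroDivisionError if all random samples are identical.
    """
    n = len(random_values)
    mean = sum(random_values) / float(n)
    variance = sum((v - mean) ** 2 for v in random_values) / float(n)
    return (observed - mean) / math.sqrt(variance)


# hypothetical lcc sizes from five random gene sets
random_lcc_sizes = [1, 2, 3, 4, 5]
z = z_score(6, random_lcc_sizes)
print("z-score = %.4f" % z)
```

If the script uses this definition, the PD.txt run (observed lcc 3, random mean 19.671, z-score about -27.02) implies a standard deviation of roughly 0.62 across the random lcc sizes.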


IPython / Jupyter Hub


IPython is an interactive command-line environment for Python programming. It has recently been extended to multi-node, networked execution with a very straightforward clustering model, and re-branded as Jupyter, with Jupyter Hub as its multi-user server. The interactome code was ported easily into the Jupyter environment for execution.


A notable benefit of the IPython environment is inline graphics: both library-generated output, such as Matplotlib figures, and HTML-friendly formats, such as PNG, JPEG, video and iframes.


(see IPython session udd_graph_ex0.html)


PostgreSQL Import


Import the interactome CSV definition into a PostgreSQL table, for convenient counting, de-duplication, sub-selection and generation of new CSV files.


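-- start clean; IF EXISTS avoids an error when the table is not there yet
DROP TABLE IF EXISTS interactom_test0;
CREATE TABLE interactom_test0
(
  node_a integer,
  node_b integer,
  desc_orig text
);

-- export the distinct node_b values to a new CSV file
Agile=# copy (select distinct node_b from interactom_test0) to
  '/Users/Shared/chalice_review_assets/neo4j_import_work/dist_b.csv';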


Neo4j // Import and Execution


Neo4j Community Edition 2.2.1; Oracle Java 1.7.0_79; Mac OS X 10.10.3


dbms.pagecache.memory set to 10GB


Start / Authenticate


Import Data


Indexes / Query


Backup


in conf/neo4j-server.properties, set org.neo4j.server.database.location=data/import0.db


$ bin/neo4j start


CSV Data file formats


nodes_cmb_uniq.csv

node_id:ID,attra
1,1
10,10
100,100
1000,1000
10000,10000
10001,10001
10002,10002
100049587,100049587
10005,10005


rels_cmb.csv

:START_ID,:END_ID,:TYPE
100290337,4214,INTERA
122704,54460,INTERA
4790,79155,INTERA
2597,70,INTERA
5923,7157,INTERA
509,6122,INTERA
4067,933,INTERA
398,998,INTERA
1748,5976,INTERA
1537,55967,INTERA
10989,54927,INTERA
55890,7920,INTERA
6629,9140,INTERA
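
Files in these two layouts can also be produced without the database round trip. The sketch below is a Python alternative to the PostgreSQL route, not the method used for these notes. It assumes the interactome file is tab-separated with the two gene IDs in the first two columns and '#' marking comment lines; the function names are illustrative.

```python
def build_import_rows(lines, rel_type="INTERA"):
    """Return (node_rows, rel_rows) for neo4j-import from edge-list lines."""
    node_ids = set()
    rel_rows = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comment lines (assumed convention)
        fields = line.split("\t")
        a, b = int(fields[0]), int(fields[1])
        node_ids.update((a, b))
        rel_rows.append((a, b, rel_type))
    # each node row repeats the ID as the 'attra' property, as in the sample
    node_rows = [(n, n) for n in sorted(node_ids)]
    return node_rows, rel_rows


def to_csv(header, rows):
    """Join a header and rows into comma-separated text, one row per line."""
    out = [",".join(header)]
    out.extend(",".join(str(v) for v in row) for row in rows)
    return "\n".join(out) + "\n"


# three edges taken from the relationships sample above; the comment line is hypothetical
sample = ["# gene_a\tgene_b", "2597\t70", "5923\t7157", "398\t998"]
nodes, rels = build_import_rows(sample)
nodes_csv = to_csv(["node_id:ID", "attra"], nodes)
rels_csv = to_csv([":START_ID", ":END_ID", ":TYPE"], rels)
print(nodes_csv)
print(rels_csv)
```

Node IDs are de-duplicated through a set, while repeated edges in the input are passed through unchanged. Row order does not have to match the samples shown here.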


$ bin/neo4j-import

calvisitor-10-105-155-98:neo4j-community-2.2.1 Agile$ bin/neo4j-import --into data/import0.db  --id-type INTEGER \
> --nodes /Users/Shared/chalice_review_assets/neo4j_import_work/nodes_cmb_uniq.csv \
> --relationships:INTERA /Users/Shared/chalice_review_assets/neo4j_import_work/rels_cmb.csv
Nodes
[>:??---------------------|PROPERTIES---|*NODE:7.63 MB--------------|v:??----------------------] 20k
Done in 421ms
Prepare node index
[*DETECT:11.44 MB------------------------------------------------------------------------------] 10k
Done in 80ms
Calculate dense nodes
[>:??------------------------|*PREPARE----------------------------------------------|CALCULATOR]200k
Done in 225ms
Relationships
[>:??---|*PREPARE----------------------|RELATIONSHIP------------------|v:??--------------------]150k
Done in 243ms
Node --> Relationship
[*>:??------------------------------------------------|LINK------------------------------------] 20k
Done in 34ms
Relationship --> Relationship
[>:??-----------------------|*LINK------------------------------------------------|v:??--------]150k
Done in 88ms
Node counts
[*COUNT:0.00 B---------------------------------------------------------------------------------] 20k
Done in 12ms
Relationship counts
[*COUNT----------------------------------------------------------------------------------------]150k
Done in 36ms

IMPORT DONE in 1s 939ms


Snapshot of neo4j live environment


MATCH (n)-[r]->(m) WHERE n.node_id < 22 RETURN n,r,m;


Compute Hardware


AgileSDE MacPro Bullet


6-core Xeon E5
16 GB RAM @ 1866 MHz
SSD 256GB


i7d


4-core / 8-thread i7-960
16 GB RAM @ 2000 MHz
2TB Western Digital black label


MacBook Pro laptop 2007


2-core Intel Core 2 Duo
4 GB RAM @ 667 MHz
500GB Western Digital black label


References - neo4j


http://neo4j.com/docs/2.2.1/graphdb-neo4j.html


http://neo4j.com/docs/stable/import-tool-types-labels.html


http://neo4j.com/docs/2.2.1/operations-security.html


http://neo4j.com/developer/guide-sql-to-cypher/


http://neo4j.com/docs/stable/cypher-refcard/


http://neo4j.com/developer/guide-data-visualization/


https://github.com/jexp/neo4j-shell-tools#setup-auto-indexing


http://stackoverflow.com/questions/8372788/show-all-nodes-and-relationships


https://groups.google.com/forum/#!forum/neo4j


References - neo4j Minimal Import


http://gist.asciidoctor.org/?dropbox-14493611%2Fblog%2Fadoc%2Fsimplest_import_example.adoc


http://graphgist.neo4j.com/#!/gists/d8f251a948f5df83473a


https://groups.google.com/forum/#!topic/neo4j/MZY0YrKo4vE


http://www.intelliwareness.org/2014/12/neo4j-new-neo4j-import/


References - python and networkx


https://jupyter.org/


http://networkx.github.io/documentation/networkx-1.9.1/reference/introduction.html#networkx-basics


http://www.slideshare.net/nigelsmall/introduction-to-py2neo


References - AsciiDoc and GraphGist


http://gist.neo4j.org/?5956246&_ga=1.184611425.975418876.1429035786


http://gist.asciidoctor.org/


http://graphgist.neo4j.com/#!/gists/about


http://asciidoctor.org/docs/asciidoc-syntax-quick-reference/