Skip to content

Instantly share code, notes, and snippets.

@ikwattro
Created July 30, 2018 13:53
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save ikwattro/b008e839e2f857b5c331d01d9d71f9ee to your computer and use it in GitHub Desktop.
Save ikwattro/b008e839e2f857b5c331d01d9d71f9ee to your computer and use it in GitHub Desktop.
GKP ES index
{"took":4,"timed_out":false,"_shards":{"total":5,"successful":5,"skipped":0,"failed":0},"hits":{"total":140,"max_score":1.0,"hits":[{"_index":"documents","_type":"documents","_id":"1583e58d-93fb-11e8-9133-b630aa296667","_score":1.0,"_source":{"keywords":[{"score":1.0,"value":"incredible datum source"},{"score":0.8740276985632897,"value":"heterogeneous biological datum"},{"score":0.8660201774065337,"value":"datum source"},{"score":0.5607271616408498,"value":"batch importer"},{"score":0.4324037082232962,"value":"research arm"},{"score":0.40140374839319426,"value":"gigantic graph"},{"score":0.3802973686515041,"value":"first time"},{"score":0.38014363343342666,"value":"csv file"},{"score":0.3616156120916232,"value":"cypher query"},{"score":0.3616156120916232,"value":"next generation"},{"score":0.3609459065770126,"value":"neo4j"},{"score":0.34879835843755574,"value":"large graph"},{"score":0.34049163631799256,"value":"text mining"},{"score":0.27726250538402347,"value":"happy"},{"score":0.25983900689817496,"value":"beginning"},{"score":0.2577749800741522,"value":"compound"},{"score":0.24669948832809252,"value":"database"},{"score":0.2271278594003202,"value":"medicine"},{"score":0.22239434707241948,"value":"stephan reiling"},{"score":0.21938229373401422,"value":"biology"}],"entities":[{"id":"Novartis Institute for Biomedical Research","type":"ORGANIZATION","value":"Novartis Institute for Biomedical Research","frequency":1},{"id":"importer","type":"TITLE","value":"importer","frequency":2},{"id":"Stephan Reiling","type":"PERSON","value":"Stephan Reiling","frequency":1},{"id":"scientist","type":"TITLE","value":"scientist","frequency":1},{"id":"CSV","type":"ORGANIZATION","value":"CSV","frequency":1},{"id":"disease","type":"CAUSE_OF_DEATH","value":"disease","frequency":1},{"id":"Novartis Pharmaceuticals","type":"ORGANIZATION","value":"Novartis Pharmaceuticals","frequency":1}],"text":"- My name is Stephan Reiling. I am a senior scientist at the Novartis Institute for Biomedical Research, which is the research arm of Novartis Pharmaceuticals. We created a large graph of all kinds of heterogeneous biological data, and we're combining this with the results of text mining that we're doing. And so we're merging all of this data together, it goes into this gigantic graph, and we're using it to better understand the biology and how we can use this knowledge about biology to come up with the next generation of medicines. It's really about Cypher. It became very easy to adopt it once the data was in there, and since we see this mostly as a, we don't use it in a transactional mode, this is for us, we put the data there and then we mine the data, and so being able to, you haven't thought about what might be relevant and you can then formulate it easily as a Cypher query and get the results out. That, for us, is probably the killer feature with Neo4j. The connections that you're finding. So before we had, in biology, there is a ton of data out there, and you have these incredible data sources, but they're all, there's this website, there's this website and so on, and I really bring this together, we can for the first time say, \"I want to find compounds that are similar to this compound, \"that have annotations about this disease.\" So this flexibility to basically just navigate all of these data sources that we couldn't do before, that is really cool. What we didn't do initially, and what is now really working very well for us, is that we didn't use the batch importer. So we used it as a, you start the server, you have to use several process, and then you start loading the data into this, and since we're still exploring, so we have been using Neo4j since the beginning of the year. We're still exploring how to do this. What we do quite a bit is we say, \"Oh, we wanna use this data source, \"eh, it didn't quite work out.\" So for us, it's actually faster to then redo the database from the CSV files than issuing the delete statements to get the data out of there. So we wasted a little bit of time by not using the batch importer. For us, this is just the beginning. We're getting initially some really good results, and probably we're gonna test the scalability of Neo4j pretty soon. We have right now half a billion relationships in the database, and we're probably gonna easily triple this. If we load all the data that we're thinking about, we're gonna go to more than a billion relationships, and then we'll see if it performs the way that it does right now. We're very happy with the performance.","title":"Stephan Reiling, Senior Scientist, Novartis-Y7n_a7fK3VM.en.vtt"}},{"_index":"documents","_type":"documents","_id":"15839768-93fb-11e8-9133-b630aa296667","_score":1.0,"_source":{"keywords":[{"score":1.0,"value":"next graph database"},{"score":0.9586797292489581,"value":"graph database space"},{"score":0.8999172968956168,"value":"graph database"},{"score":0.6988255187526496,"value":"valid cipher implementation"},{"score":0.6137352197665269,"value":"rule planner"},{"score":0.6096677001904821,"value":"different language"},{"score":0.6045230840365675,"value":"language specification"},{"score":0.6045230840365675,"value":"natural language"},{"score":0.603224514565808,"value":"beautiful london coffee shop"},{"score":0.5880325037445523,"value":"andres"},{"score":0.5834730289049364,"value":"cost planner"},{"score":0.5823350914180191,"value":"nitty gritty technical detail"},{"score":0.574272981299989,"value":"database implemental"},{"score":0.559127179559236,"value":"common language"},{"score":0.5587054555687984,"value":"opencypher initiative"},{"score":0.5587054555687984,"value":"opencypher topic"},{"score":0.5528802941539883,"value":"execution plan side"},{"score":0.544580819835581,"value":"technology compliance kit"},{"score":0.5105705159442716,"value":"little road trip"},{"score":0.49485949054704004,"value":"neo technology"}],"entities":[{"id":"San Francisco","type":"CITY","value":"San Francisco","frequency":3},{"id":"Oracle","type":"TITLE","value":"Oracle","frequency":1},{"id":"DBA","type":"ORGANIZATION","value":"DBA","frequency":1},{"id":"London","type":"CITY","value":"London","frequency":1},{"id":"Narrator","type":"TITLE","value":"Narrator","frequency":1},{"id":"Tool Vendors","type":"ORGANIZATION","value":"Tool Vendors","frequency":1},{"id":"operator","type":"TITLE","value":"operator","frequency":1},{"id":"last year","type":"DATE","value":"last year","frequency":1},{"id":"model","type":"TITLE","value":"model","frequency":1},{"id":"Friday","type":"DATE","value":"Friday","frequency":1},{"id":"Europe","type":"LOCATION","value":"Europe","frequency":1},{"id":"Twitter","type":"ORGANIZATION","value":"Twitter","frequency":1},{"id":"Rik Van Bruggen","type":"PERSON","value":"Rik Van Bruggen","frequency":1},{"id":"Rik","type":"PERSON","value":"Rik","frequency":3},{"id":"Andres Taylor","type":"PERSON","value":"Andres Taylor","frequency":1},{"id":"execution","type":"CAUSE_OF_DEATH","value":"execution","frequency":1},{"id":"Andres","type":"PERSON","value":"Andres","frequency":12},{"id":" Hi Rik","type":"PERSON","value":" Hi Rik","frequency":1}],"text":"(indistinct chatter) - [Narrator] Hello everyone. My name is Rik, Rik Van Bruggen from Neo Technology, and here I am, in a beautiful London coffee shop recording a podcast, with someone that I've been speaking to a lot in the past couple of days, because we've been doing a little road trip through Europe about openCypher. That's Andres Taylor. Hi Andres. - [Andres] Hi Rik. - [Rik] (laughs) We're very good. Yeah, we've been on the road for the past two days, and its been all about Cipher, it's been all about openCypher. So, maybe before we dig into the openCypher topic, maybe we can just talk a little bit about Cipher itself. We had the corapi, and gremenant, and with my background a squeal DBA. Could you tell us a little bit more about, you know, where it came from, how did you get into it, where did it start, just very briefly? - [Andres] Sure. When I joined Neo Technology, we had two different languages that ran on our graph database. I felt that we needed something more (inaudible) (pot/something metal falling on the floor) - More, eye level. So we started working on that, as a, do what you want Friday project. And, for some time until we put it as part of the product, and it grew organically since... Yeah, and then today it's the primary API for neo4j that we use. But, technology has moved on, and now we have even better tools, even higher level of extraction to work with. So yeah, it's very much meant to make the Querying easier to understand, not just fast. So that's one side and the model itself is also... So original sequel was meant to have that power that non-programmers should be able to viterace. - [Rik] Absolutely, you all have been, as you know I'm a big fan of Cipher because, you know, non-programmers like myself to use a graph database, which is super important. I think, that's the power of, the clarity of languages, right? - [Andres] Yeah, exactly. - [Rik] Yep. in Neo4j has been a lot of work on Cipher, like you said it was a hobby project at first, but now it's like the primary API with a full on infrastructure, planners, all of those types of things, right? What are some of the big components there? - [Andres] So when we started, we built a heuristics planner, a rule planner, that has pretty simple rules saying, for example, if you find an index, use it. If you, see this type of pattern use this type of operator to solve it. Which is a great start, but a cost planner allows for even better plans to come out from the planner. better way of getting your data is starting from this index and then traversing this way instead of the other way around. The cost planner is the statistics that we store about the data in your graph, so that when we're building a plan, we know, or we estimate, we guess that this is a... - [Rik] That was a big change, right? Doing cost based traversals rather than rule based. I've seen that a lot and there's more work on the way I understand, you're doing work on more infrastructure components to make it better for you, faster, stuff like that? - [Andres] Yes. So, today we have an interpretive run time, which builds an objects structure and, has data flow between the subjects when you run a query. So this is the execution plan side of things, and, what we're working on at the moment is, component time, which takes your query plan, your logical plan, and transforms that into a java class, with an execute method. So when you're running your query, you're actually running a java class. - [Rik] Yep. I'm looking forward to that, that's, the next couple of versions right? But, one of the big things that's coming up, and that we've been announced last year at GraphConnect, in San Francisco, is this whole new openCypher initiative. Could you tell us a little bit more about that? You know, where does it come from, what do we want to do with it, who's working with it, those types of things. - [Andres] Right. So what we felt is that, we want to grow the graph database space. We think that, that's a critical factor. We don't want to have a bigger, piece of that market, instead we want to grow the market, and to do that we felt that, we need a common language across graph databases, so that people, when they invest in learning this technology, they can take that investment and use it in other products, and not feel like they're stuck, using Neo Technology. So, at GraphConnect, our big conference in San Francisco last year, we announced openCypher, which is a project that will, open up the language to make it useful for database implementals, for Tool Vendors and for any use of the worms to, get to the nitty gritty technical details of... - [Rik] So, sorta like, rogue vikings that want to build the next graph database, they could actually use it as well? - [Andres] That's the plan. - [Rik] So what are some of the key components of openCypher, you know, what's in there? - [Andres] Well, what we're releasing is, a grammar so people can create a parser and, know what valid syntax is and isn't. We are building a technology compliance kit, a TCK, which is a sample driven way of testing the implementation, to see that it behaves the way it's supposed to behave. We're actually moving that out from our neo4j repository to it's own, so that we are going to use this TCK, to validate neo4j as a valid Cipher implementation. We're also creating a reference implementation, which is sort of a way of showing how this is supposed to work. The TCK and the grammar are really useful and good starts, but it's not enough. So a reference implementation would show, how you can actually build something like this, and, what the details mean. And lastly, we're working on a language specification, a natural language, semi-formal way of describing the, expected behavior of the language. - [Rik] Wow that's really cool. Well when I was at the GraphConnect San Francisco, I saw some really big names on stage, and I was in the support. People like data breaks,and Oracle. Is that list expanding? - [Andres] It is. We're working with many different companies at the moment, and they're all so, already Tool Vendors that have started using the deliver books from openCypher, such as intellij plugins, we see projects popping up all the time using our software. - [Rik] Really cool. Well that should be, helpful at that ambitious goal of creating that new Language for Graphs, right? So, I think if anyone else wants to read up on it they can go to openCypher.org, right, and absolutely also through the neo4j.com website to find more information. Find us on Twitter or wherever they want to find more information. As you know I want to keep these podcasts fairly short so, thank you so much for taking the time to come to a noisy coffee bar with me, (small laugh) and talk about this lovely project. Thank you Andres. - [Andres] Thank you so much for having me. - [Rik] Yes, bye.","title":"Podcast interview with Andrès Taylor, Neo Technology - reloaded-KKidU8cq7ts.en.vtt"}},{"_index":"documents","_type":"documents","_id":"158433b2-93fb-11e8-9133-b630aa296667","_score":1.0,"_source":{"keywords":[{"score":1.0,"value":"future neo4j project"},{"score":0.9568862365312415,"value":"digital media manager"},{"score":0.8143554812037266,"value":"interesting story"},{"score":0.44948598645188287,"value":"guy"},{"score":0.39891099054798085,"value":"beginning"},{"score":0.37273804927470094,"value":"case"},{"score":0.34652921030650624,"value":"hästens"},{"score":0.34652921030650624,"value":"graphconnect europe"},{"score":0.34250026028282526,"value":"stuff"},{"score":0.3348551995401694,"value":"yeah"},{"score":0.3264691562488315,"value":"london"},{"score":0.32458404190716483,"value":"happy"},{"score":0.32458404190716483,"value":"thought"},{"score":0.3236680098065646,"value":"sure"},{"score":0.31664760590445173,"value":"kent lovestjärna"}],"entities":[{"id":"Kent Lovestjärna","type":"PERSON","value":"Kent Lovestjärna","frequency":1},{"id":"London","type":"CITY","value":"London","frequency":1},{"id":"Hästens","type":"LOCATION","value":"Hästens","frequency":1},{"id":"Bryce Merkl Sasaki","type":"PERSON","value":"Bryce Merkl Sasaki","frequency":1},{"id":"Manager","type":"TITLE","value":"Manager","frequency":1},{"id":"GraphConnect Europe","type":"LOCATION","value":"GraphConnect Europe","frequency":1},{"id":"Sweden","type":"COUNTRY","value":"Sweden","frequency":1},{"id":"SAP","type":"ORGANIZATION","value":"SAP","frequency":1}],"text":"Hi. I'm Bryce Merkl Sasaki, and I'm here at GraphConnect Europe in London and I have with me Kent Lovestjärna, the Digital Media Manager at Hästens in Sweden. So, can you talk to me about what you guys use Neo4j for? Well, for the moment we are using it to a lot of stuff. First off, we are using it for the web and also we are combining that to-- we're connecting it to the SAP systems, making a better 360 view of our clients, and also to the sales force. So since we do have a lot of partners, we are using it to a lot of stuff and connecting it to get as much as possible, to get a better view. Okay. So a lot of master data management? Yeah, exactly. Okay, good. And then why'd you guys choose Neo4j? What made it stand out? I think for us as a company, it was important to-- since we don't know everything and how we want to build it, we thought it was really a good way to not being sure of everything yet, so we could collect the data and start building it afterwards. And also the process of it has been much faster, regarding the old ones. So, yeah, those are the key for us. Okay, good. And then what have been some of your most surprising or interesting results you've seen as you've started using Neo4j? I think it's the graph, that you can use it to so much more. Because in the beginning we just thought that we could do this and that, but now we realize that we can use it so much more. So we're integrating it more into the company and trying to find new ways of solution that we have been doing for a lot of years. Okay, great. And then, if you could take everything you know about Neo4j now, or maybe even how you guys have used it in terms of use case, and go back to the beginning of when you first started using it, what would you do differently, or where would you change? I think I would go all in from the beginning, because the road map for start using it. It was like half a year we look at it, but seeing the results we would maybe have been better prepared as a company to push it out faster because of the outcome. So I think that is the-- yeah, looking at the better history, I think that would be the case. Okay, great. And then, do you have any other interesting stories or any other thoughts or anything else you want to add? Wow. Well, I just can't think of something right now, but we're really happy about it and the results that is coming in right now. Okay, great. Well, thank you so much for your time. Thank you. And best of luck with all of your future Neo4j projects. Thank you.","title":"MDM with Neo4j - Interview of Kent Lovestjärna, Hästens Sängar AB-ByLA5PbNdvg.en.vtt"}},{"_index":"documents","_type":"documents","_id":"15845ab9-93fb-11e8-9133-b630aa296667","_score":1.0,"_source":{"keywords":[{"score":1.0,"value":"document content repository"},{"score":0.9664732791837762,"value":"classic content management system"},{"score":0.947235354403565,"value":"content management system"},{"score":0.8699885364981965,"value":"content cleaning area"},{"score":0.8501924720816666,"value":"cisco content"},{"score":0.8371443161338942,"value":"content management platform"},{"score":0.8293078673287296,"value":"topic list content creation"},{"score":0.8188127339365704,"value":"content reading service"},{"score":0.8152745290809024,"value":"right content"},{"score":0.7927697922922922,"value":"content people"},{"score":0.7850320530823559,"value":"content type classification"},{"score":0.7821079551414456,"value":"content recommendation"},{"score":0.7821079551414456,"value":"recommendation content"},{"score":0.7761841526641379,"value":"content piece"},{"score":0.7707821553755291,"value":"content classification policy"},{"score":0.7674756691755598,"value":"rich content"},{"score":0.7629053303740143,"value":"content consumption"},{"score":0.7623719465758934,"value":"content repository"},{"score":0.7553127082874997,"value":"content change"},{"score":0.7540453751894831,"value":"marketing content"}],"entities":[{"id":"Amazon","type":"ORGANIZATION","value":"Amazon","frequency":1},{"id":"Cisco","type":"ORGANIZATION","value":"Cisco","frequency":15},{"id":"layer","type":"TITLE","value":"layer","frequency":1},{"id":"general","type":"TITLE","value":"general","frequency":3},{"id":"second","type":"TITLE","value":"second","frequency":2},{"id":"author","type":"TITLE","value":"author","frequency":1},{"id":"Google","type":"ORGANIZATION","value":"Google","frequency":1},{"id":"count","type":"TITLE","value":"count","frequency":3},{"id":"Alfresco","type":"MISC","value":"Alfresco","frequency":1},{"id":"MDS","type":"ORGANIZATION","value":"MDS","frequency":1},{"id":"LDA","type":"ORGANIZATION","value":"LDA","frequency":1},{"id":"RDF","type":"ORGANIZATION","value":"RDF","frequency":2},{"id":"today","type":"DATE","value":"today","frequency":1},{"id":"Prem Malhotra","type":"PERSON","value":"Prem Malhotra","frequency":1},{"id":"model","type":"TITLE","value":"model","frequency":1}],"text":"It is good to be back here again. I was here about three years back when we were talking of one of the first largescale deployments of Neo4j at Cisco. My name is Prem Malhotra. I work for Cisco IT and my team, a bunch of them are here. What we did in the past was leverage Neo4j for a system that’s called IT Management Platform that’s still going strong and being used. But today I’m going to talk to you about a totally different area which was on the area of content findability. What I mean is find any content -- documents, files, presentation and so on. Considering the size of Cisco, 70,000 employees, maybe 30,000 vendors, our documents are really large in number, close to 20 million documents around different repositories. A typical problem that one gets into is you have your good search engines but search engines are not really giving you the kind of relevance that you need to find exactly the content you want. The reason we started looking is a subset of our users, these are our sellers and sellers are very important. I want to be selling them all the time, they have to be spending all their time with the customer but our sellers were telling us that they were spending close to 40 minutes to one hour every day finding the content they wanted to take to their customers. I don’t know how many of you have the same kind of problem in your organizations but we definitely had one. The intent was how can I come up with techniques that will not only help our sellers but also our bigger, larger rules across Cisco? When you start looking at this findability problem you learn what the key is and how we can make the search engines do a better job and what is the missing link? We can put the best search engine on the job but still you don’t get what you want. Why the findability problem there? We look at a typical search engine. It has a nice index, it looks at all the content in the different repositories and it designates words to document a situation and what happens next is a user comes in and typical users have two or three words as their queries. It’s a very tight, small request and the engine is trying to do the best job it can to match the request to the content that it has indexed. And often the case is like when you go to get your luggage at the airport you end up with a lot of black luggage bags which are hard to find which is yours and often people start decorating them with different labels or strings or what have you and that is exactly what they’re doing to put in the metadata on the content. The problem was too much content, no deeper understanding of the content, which means no more characterization on top of this content and the content is word files, presentations, recordings and what have you. The queries are typically very brief so one can definitely do a manual tuning of the search engines and we do that and they work only for the top queries that you can manually do and so there is an expense associated with that. As the content changes you might go back and retune because your business has changed and so on. That’s a doable technique. We use it but it’s not something that can scale and the results are ineffective because the context of the user you can’t really get that well with really few amounts of words in that. The key reason for this problem that we identified was we needed to add metadata to our unstructured content. You will notice that a lot of content management systems do provide a place to add metadata but how many people actually put it? Less than one percent and these are even if you put in a few words it’s kind of time consuming and people are too busy to do that. So what did we do? Our teams set their different ways to get to the metadata but the first attempt is to figure out how to get it. Manually you can read a few thousand documents, not millions of documents. So we have reached to the machine learning and big data techniques which we are fortunately in the right time and the right place and that helped us get a deeper inspection of our content, what is inside it and be able to extract concepts which I’ll talk more about later. That forms one kind of metadata that we have. The second thing we did was let’s look at what people do with the content. It’s important to know how well the content is being used. We give people capabilities to plug into the end consumption point, the ability to read the content, track how well it is downloaded, used and do people want to do it in a particular way. All these are rich pieces of information that are telling us something about the content, so that we characterize that as another part of the metadata. Combining these two is the rich metadata framework that you get. Once we got that... And we were doing all this for the domain across Cisco but all the applications of that were to the smaller area of sellers’ content. So how did we use this? If you start on the side of searching the content, the first thing was when you get all this rich content can I use it as a tagging mechanism? Can I automatically machine tag your content which is already metadata _____ or new content that you’re creating? So we did that. Along with that we provided a very nifty way to provide governed tagging which are if there are these content creators like technical writers and so on who do have the incentive to tag, can they do it really effectively? And that’s called governed tagging in our space. Next was a thing called query assist for the search. This was leveraging what we learned about the content to be able to help the search engine do a better job. This has become quite effective and you’ll see it in other following slides. On the right you’re going to look for the content in a conventional search. On the left is newer ways to get the content and there were some talks here earlier on recommending content. That is the content comes to you. The recommendation in this work is being able to use metadata to bring the right content to the right context of a user and that has been quite effective for us as well. Intelligent interactive browsing: How can you make the conventional browsing through your documents, the different hierarchies that people use to see how documents are organized, make it smarter so people don’t have to go through or click through different routes to get to the content they were looking for and you also only show the browsing that makes sense. So this is one area that we’re still working on and the kind of roadmap in this direction is that eventually you should be able to interactively tell the user -- Here’s something you dived in. Here’s another piece that I know because you’re looking for Cisco content that you should also be interested in. Are you? That kind of content, that’s on a future roadmap in this particular area. The intent here is to make it easier and easier to find the content that you’re looking for. This third category that I call here content type classification, this has something to do with as our systems are all going to cloud, the security of our content is a very important consideration. How do you secure your content if you don’t know what type of content it is? So using machine learning techniques to be able to put a type on the content and the type means this is a product requirement specification or this is a test specification or this is actual customer feedback or what have you. That is very important so that then you can apply your content classification policies on top in terms of security and really tighten down the security on only the limited content that you need to. So in that bigger domain of moving our systems into cloud this is one area we’re working on. The remaining of the presentation I’ll walk through some of the details in areas I have talked about and then kind of conclude with a thing called Sales Connect. Sales Connect is an application that used many of these capabilities that evolved out of this work and there is also a demo that goes with it that I will get to show you. Before we do that let’s look at where Neo4j comes into the picture. At the heart of our system where all the information kind of sticks around and gets served or recorded is the Neo4j database and start from left and go counterclockwise, the first thing here is the document content repositories that are being processed automatically using big data techniques and concepts. They go in there, then the content reading service which is like Amazon style reading. You can read it as your comments. Intelligent query assist has been the search engine. I’ll talk more about it. Machine tagging I just described is sending a document in real time; it can actually give you the tags which are very useful for finding it later on. Content organization is services that can let exports in different areas say -- These are all pieces that all hang together for a particular purpose. It’s almost like saying I put it in my briefcase or I put it in what you do in your browsers conventionally, or you can organize it in a much easier way. Content browsing, we just mentioned that it’s easy to browser around and then content consumption and personalization is how can I localize my application. When you localize you have a lot of personal choices that people are making so we capture that as well in the system. The content recommendation we’ll talk more about it. This is metadata based content recommendation. So you see everything hangs around the Neo4j ecosystem. Let’s talk about the ability to find concepts in the content. What happens here is we leverage document clustering techniques using Hadoop and Spark big data platforms. So start on the bottom up. You get your content in different forms. The first thing is you do stripping of [stock] words and things like that so make it in a form that can be easily represented for the clustering process and the clustering process uses a version called [LDA]. There are quite a few options available. We found LDA useful. The outcome of that is a collection of key words or phrases that represent a group of documents. So a [string] of clusters or documents represented by this bunch of phrases that you can call a concept and this is all fed into the database. So here’s an example of the architecture, a block diagram of how the machine tagging works. What you see below the dotted line is the previous layer I showed, the bulk data processing and whether we can process about 1.7 terabyte of content. Do you have a question? Question: _____ Most of our processing does use the _____ techniques, but semantics are used for things like... We don’t really use semantic techniques, but we use content cleaning areas, we use language _____ for that purpose. So what you see here on the top left is a real time service, you can tag a content of a moderate sized document in less than a second or so. How it does work is you bring the file in, let’s say content management system on the right is the one that’s creating your document so it sends the document into the service in real time. Once the user says I’m about to check it in, it’s almost done and what the service does here, that white box on the left top is essentially doing almost the same thing that you would have done for a lot of documents in the bulk processing but now do it in a form that you can do a distance matching of that document to the concepts that are in the graph database so we use a cosine distance technique and out comes a set of tags which look like this. What you see on the left of the document, you can see the tags with different colors. The colors are just there to highlight different phrases out there in the words and so notice on thing. This is not just giving you words; it’s giving you phrases that we have discovered in the content that we associate with a document. In most cases the phrases do appear in the document but in many other cases they might be appearing in a collection of documents that all hang together because of far back in processing. So you might find a combination of two things here. Typically about 20 tags is what we keep but there is no limit and we can also have a large number of tags as well. Let’s say we did all this machine tagging, what does it help you with? Let’s revisit the slide I showed you before. You had the tags and you had the search engine index and it was the person was typing something to be searched for. The difference between that picture and what we do with the tagging is that now we apply it on the left to machine tagging so that was a metadata service box that is the one that is doing this tagging that I showed you in the previous slide and the tags are kept along with the document. Now what happens, the search engine can do some smart things. The search engine can say -- Oh, I’ve got more stuff to look at so I can look at the entire text along with the tags or I can relook at the text and the tags as a separate location and the relevancy of items have a lot of options to be designed for. You can just do a search based on tags or you can do a search based on text plus tags. All these options are available. So that is what gets you the richer index, a richer way to understand what the user was looking for and you get a better matching by the search engine. Another thing I talked about was query assist to the search. Query assist was something that we stumbled on when we were trying different things with the content that we had and the understanding that we had built on the back end processing. What we found was that as you look in the right top, if you take the search request and trap it before it goes to the search engine and send it to our system, the intelligent query service, what it does is since it knows the Cisco content, because we are looking at Cisco content only. You’re not looking at Google, you are looking inside Cisco. Therefore, we know what we understood about our content. We can do a very smart matching of what the user asks for this content and identify phrases and accruements that can then expand the request into a logical expression, something like this. So what you do here on the top you’ve got a search request called Internet of Things at the application-centric architecture. There is a very Cisco lingo kind of search because it’s talking of Internet of Things that Cisco works in and application-centric architecture is one of the new ideas that we have solutions built around. So what the middle row here is showing is what the query just did. It applied IoT as acronym along with Internet of Things but the important thing is identifying on the quotes the phrases and once you give a phrase to a search engine you have a very different outcome. You have fewer matches which are more effective, more accurate, instead of the search engine trying to make sense of it using other techniques. This is here to help it and what we did was, based on the amount of content that we understood today, which is not the whole amount. We have about 20 million documents and we are probably at three million or less but it is half the time. Half the time it can actually say -- I can do a change. Then we did A/B testing kind of evaluation to see how effective is it and it turns out it is as good as your manual curation of content which means that manually you can tune the relevance to get an outcome. The same thing happens for the same kind of queries almost as effectively as when it’s done by query assist. This is the last main thing I’ll talk about. Content recommendation was done for our sellers. Remember that we are trying to say -- How can I get the right context? We did first for the selling community. As you know, most of the companies have their sellers on a Salesforce.com system where they start their day and end their day entering information, how are the status of their deals and so on. We said we also have it, let’s look at it. If you have the Salesforce.com system you know what kind of deals are in the system for each user and what is the age of the deal and all kinds of other attributes of the deal so that’s the way to get the context. The next was content. I said earlier we tag the content. Some of it could be manually tagged or automatically tagged and then it is grouped into bundles. Bundles means it’s like a collection of three or four documents which all hang together for a purpose and they’re talking of the same topic and it’s particularly done by exports. Then the magic of this happens during the matching, which is smart matching that happens which uses the deal information and the tag information from the content repository to be able to find out, along with applying the business rules, one interesting thing that happened when we were talking to our sellers is that I only need help if my deal is less than one month old. Once it has gone past the date, either it has fallen off or it has gone into a more mature state. I don’t need any more help from you. So those are the kind of business rules that we were able to incorporate in this smart matching and the outcome was a few content pieces that they can actually take to the sales call right away. What does this all look like now? This is one snapshot from the application I’ll give you a demo on but we will not have this snapshot because I am not a seller so I won’t get this particular recommendation. It is telling you based on your CCW... CCW is the internal name for one of the applications like Salesforce that our partner sellers go to and by the way, we have close to 300,000 partner sellers along with our sellers so it’s a large selling community that this is all targeted towards. Here it’s telling you it’s giving the documents. There are about four documents it should be looking at for that particular seller. This is a bit of a busy diagram but it brings together the different areas that I talked about. Sales Connect is the name of that application but the foundation is used not just by that but the entire Cisco use as well. What you see on the left is a graph database which is doing all these smart things I talked about and then we just... The next is Al Fresco. Alfresco is a classic content management system and next to it is the content management system for web content which is like the content you see on cisco.com, the web pages and so on that comes from Adobe. It’s called Adobe AEM. Then you also have Box. We have 100,000 users on Box and so that content is as well part of our content management platform. A company of this size needs to have multiple systems but we have been on the journey of consolidating all of them into these three systems. Then the right of cisco.com there’s a new search engine that we deployed at the start of this journey about a year and a half back and doing very well. The rest is there to demonstrate the different kinds of APIs which are all REST APIs. The services view that we have here is you can consume it from anyplace with fairly good throughput. What I’ll do next is give you a sense of this application that I talked about so give me a moment and then we’ll do a Q&A. Here is a SalesConnect application. I logged in from wherever I am. Any Cisco employee can get into it. I already logged in. What you see here is a look at things like global sales kit. The global sales kit was one of the mechanicals to converge very related content together and it has all the style and it looks the same on iPad and iPhone. Going through this list and you can click on any of these groupings but let’s say I was looking and something that caught my eye like life sciences, for example. You will see the content that comes under it and go into each of the content and it actually provides ratings and so on into it. It tracks on if you viewed it, how often it was downloaded, all the shared value on all that, all that rich information is coming back to our system. This is the piece that I was talking about. Recommendation based on CCW. I’m not a seller. I don’t get it. Because I’m tapping into the real system that people are using right now. Then we have things like SalesConnect. As the content gets refreshed or redone, updated and so on, the system can get a recommendation to you based on that content. Then here is something interesting. What’s trending? What’s trending is based on the usage rating and comments, evaluation of what content people are finding very useful that you should also look at. This is not customized to the context but in general the richness of the content on the system. Each of these, click on it you will find either the usual pieces of content or there would be bundles of content depending on... Like in this particular case, let’s look at this. This has a very large view count, share count, download count and so on. One other thing I wanted to show you here is a thing called browse by category. Remember I talked to you about smart browsing so it’s something that it does here in the browsing case is you go to these different categories but then you can come up with, let’s say, data center and virtualization is one of the groupings of what Cisco does and you can actually go look at content in the same way. So only you are going through the recommendation of the content here, you can say it’s something I want to click through myself and find the content. In this particular case all these pieces of content are coming to you because search engine is looking at the metadata that goes with the content. So think of my kind of dark areas, very wise, very important for finding the content and there are several ways to look at it and how to apply it and how we have just started our journey, we have more to go in this particular area. I think I’ll stop here and there’s a fair amount of time, about 12 minutes left so we can have a Q&A. Question: In your diagram you referenced an MDS ontology. I’m curious what is the nature of that ontology and is it RDF triple store or is it something which is simply a model that’s used conceptually? Prem: We don’t use RDF there but we store it in a graph and it has a collection of this information that we got from the data processing in this testing technique to be able to say -- Here’s a sense of this collection of documents in terms of words and phrases and then how are they linked together. Question: So it’s an ontology that is dedicated to the content that it was processed? It’s not a metadata ontology for content in general? Prem: It represents the metadata of the processed content and then you also augment it with the human developed hierarchies which are part of our ecosystem like the products and the verticals and so on so it’s a mix of the two. Question: The reason I’m asking is when people talk about ontologies they’re talking about models of things in general that can be applied to instances of classes but this sounds like it’s more customized to the meaning that’s contained in the content. I see. Prem: Correct. Yeah? Question: When you generate the automatic tags, do those get presented back to the original author when they contribute a document to give them a chance to interact with those, to say whether they were right or wrong? Prem: Yeah, that’s a good question. When the tags are generated what happens is they come back to the user who is checking in the document and at that point the user has the option to say -- I can take some off, I can take all of them off and so on. So they are absolutely in control all the time. Question: Any parts on getting into the documents and using partial content from the different documents? Has that request come up? Because for the salespeople the problem is always new, it’s always a new area and then you may have to assemble content from different documents. So are you using graphs for looking at portions of documents? Prem: I think that is a very, very timely question because we are right now in that journey for our technical writers. Our technique of documentation in the legacy sense was like in this huge framework of documents and we still have them. You give it to a user it’s useless because you’re looking at a 500-page document and you want to find a small snippet there. So where Cisco is moving in that area is a topic list content creation so once you have topic list content creation you actually have the ability to categorize the metadata and now you can have a dynamic assembly possible. So a user comes in and says -- I’m looking for information for this particular thing. It’s almost like search but instead of giving you a preassembled search document from search, you could actually assemble it all together using the metadata. We are not there right now but it is an active area of work that we’re doing. A thing to add is this same problem is not just there for sellers. The problem is diffused throughout our organization for marketing content and the technique of recommendation content and other areas that we’re exploring as well. Content and finding it is a real hard problem for large companies. Question: In your talk you mentioned about doing some A/B testing. Can you give some references or some examples of what you guys do there? Prem: How do they do the A/B testing? Question: Yeah, what are the different approaches that you guys use and are there any open source tools that you guys use to find the effectiveness? Prem: I think with this particular case we do not use the open source tool but in the marketing group we do use an open source tool for A/B testing. I don’t remember the name. What we did was we turned off the query assist for some content and then once we turned it off, that content happened to be also manually curated. The relevancy was manually tuned for it and so we were able to compare the results of that. If there are no more questions we can stop here. Thank you.","title":"GraphConnect SF 2015 _ Prem Malhotra, Cisco - MetaData Graph-dbNzAD4gOaY.en.vtt"}},{"_index":"documents","_type":"documents","_id":"158433aa-93fb-11e8-9133-b630aa296667","_score":1.0,"_source":{"keywords":[{"score":1.0,"value":"relationship type"},{"score":1.0,"value":"type relationship"},{"score":0.879953322726063,"value":"different type"},{"score":0.8243432016780697,"value":"node type"},{"score":0.7895658575991136,"value":"single type"},{"score":0.7801403989033585,"value":"multiple type"},{"score":0.6652537253359065,"value":"compound disease pair"},{"score":0.6513545477053289,"value":"public neo4j instance"},{"score":0.6347602022663937,"value":"different side effect"},{"score":0.5941204462885509,"value":"public neo4j browser"},{"score":0.519205137889599,"value":"different compound"},{"score":0.506641838842852,"value":"little package"},{"score":0.5056604154004164,"value":"academic research project"},{"score":0.49419000288984133,"value":"neo4j support"},{"score":0.4817663777037778,"value":"specific network path"},{"score":0.44771561927473924,"value":"permissible open licens"},{"score":0.4449199699979563,"value":"interesting research"},{"score":0.4400979914671897,"value":"academic research"},{"score":0.4340185728217701,"value":"neo4j browser"},{"score":0.4329734416246809,"value":"open source licens"}],"entities":[{"id":"Belgium","type":"COUNTRY","value":"Belgium","frequency":1},{"id":"diseases","type":"CAUSE_OF_DEATH","value":"diseases","frequency":3},{"id":"disease","type":"CAUSE_OF_DEATH","value":"disease","frequency":8},{"id":"Daniel Himmelstein","type":"PERSON","value":"Daniel Himmelstein","frequency":2},{"id":"Graphistania Podcast","type":"LOCATION","value":"Graphistania Podcast","frequency":1},{"id":"Daniel","type":"PERSON","value":"Daniel","frequency":5},{"id":"Philadelphia","type":"CITY","value":"Philadelphia","frequency":1},{"id":"Nicole White","type":"PERSON","value":"Nicole White","frequency":1},{"id":"drugs","type":"CRIMINAL_CHARGE","value":"drugs","frequency":4},{"id":"guide","type":"TITLE","value":"guide","frequency":3},{"id":"diabetes","type":"CAUSE_OF_DEATH","value":"diabetes","frequency":1},{"id":"University of Pennsylvania","type":"ORGANIZATION","value":"University of Pennsylvania","frequency":1},{"id":" Too-da-loo","type":"PERSON","value":" Too-da-loo","frequency":1},{"id":"Rick","type":"PERSON","value":"Rick","frequency":1},{"id":"scientist","type":"TITLE","value":"scientist","frequency":2},{"id":"Rick Van Bruggen","type":"PERSON","value":"Rick Van Bruggen","frequency":2},{"id":"gene","type":"TITLE","value":"gene","frequency":1},{"id":"Dutch","type":"NATIONALITY","value":"Dutch","frequency":1},{"id":"Doctor","type":"TITLE","value":"Doctor","frequency":1},{"id":"Hetenet","type":"PERSON","value":"Hetenet","frequency":1},{"id":"San Francisco","type":"CITY","value":"San Francisco","frequency":2},{"id":"Nicole","type":"PERSON","value":"Nicole","frequency":2}],"text":"- [ Rick Van Bruggen] Hello everyone. My name is Rick, Rick Van Bruggen from Neo Technology and today I am recording another episode for our Graphistania Podcast and I'm being joined by Daniel Himmelstein from University of Pennsylvania. You're a postdoc fellow there, right, Daniel? - [Daniel Himmelstein] That is true. Just got my PhD, - Fantastic. - Out in San Francisco and then moved east to Philadelphia - Fantastic. Why don't you introduce yourself a little bit, Daniel, and then you'll work into your relationship to the wonderful world of graphs. - Okay, so, I guess I could introduce myself with my twitter description which is digital craftsman of the biodata revolution. - Wow. - (laughs) Yeah - That sounds great. - Wow. (laughs) What I really do is I'm a scientist working on integrating a lot of medical data and making predictions about biology and disease. So, it's an exciting time because there is so much data that's becoming available and we need ways to organize and store that data and learn from it. And that's where Neo4J has filled the gap for us. - Wow. So how did you get into the world of Neo4J? You know, how did you get to know us? - Yeah, so, I work with what I call hetnets, and a hetnet I define as a network with multiple node or relationship types. And when I started doing this research, about four or five years ago, I looked at Neo4J a little bit but it didn't quite suit my needs then. I don't think Cypher was mature at that point, which is a query language. And, so, I wrote a little package and python to work with graphs with multiple types of relationships because a lot of the built in, like, python packages, or mature packages, didn't really do a good job representing types on a network. - Okay. - So that's how I got interested in it. And then, several years down the road, I reevaluated Neo4J and I said, \"This will solve a lot of the problems we are having, it'll take a huge development burden off of our shoulders and we'll get to be part of this great ecosystem.\" So... This is a database for hetnets.\" Even though, I don't think anyone..., I asked Nicole, \"Do you know the term hetnets?\" And she didn't. - I think you met some of our people at a meet up in San Francisco, right? Nicole White, and those types of people, right? - That's right. It was a fun meet up and it just really clicked with me 'cause Nicole, who was going over the basic concepts, like how each relationship has a type, each node has one or more labels, edges are directed, and I was like, \"Wow. This is what we need. I think, in Neo4J speak, you call it a property graph. - Yes. Yeah well I mean, a hetnet... I'm from Belgium, and my mother tongue is Dutch and hetnet means \"the net\". (laughing) \"Het\" is the, how do you say it? Is the equivalent of \"the\". So, that's a bit funny in my language. But um... - I like it. (laughs) - Exactly. The Net. So, can you tell us a little bit more about it. You know, why is it such a good fit for hetnets? You described it in your GraphGist and you made a public instance of Neo4J available, which I'll obviously link to from the podcast. But, you know, why is it such a good fit, Daniel? - Yeah, so, I guess to answer that I'll tell you a little bit more about what we're doing. We're trying to encode as much of the knowledge produced by biomedical research in the past 50 years as possible. So, we take data from millions of medical and scientific studies and we condense it into a network. And traditional people have done this but they've done it with a single type of node and single type of relationship. So, for example, people would make networks with genes and they would connect the genes if they interacted inside of a cell. But, obviously biology is very complex and given that complexity it helps to model it with the actually diversity of types that are involved in health and disease. - Can you give me an example of the different types of interactions? - Yeah, so, what we've created is something we call heteonet. Version 1.0 has 11 different types of nodes and 24 types of relationships. So, what these would be, would be, like, a compound or drug. So, that's something like Aspirin. Then we have diseases. So, a disease would be multiple sclerosis, diabetes, etc. We have the symptoms of diseases. We have the side effects of diseases. And those are all node types. But then we'll have relationships. So, for example, a compound is known to cause different side effects and that's information that's actually extracted from the drug labels or the little package you'd get on the inside of your medication when you pick it up from the pharmacy. - Absolutely. Yeah. - And then of course we have genes. So in the past decade there's been a lot of research on how different compounds affect genes in your body. Does giving someone a drug or compound make more or less of a given gene? So we have that type relationship. We also have relationship for which genes does the compound target in the body. So, how were the compounds designed to act? - So you model all this information in a graph? In a property graph? In a hetnet? And what are the types of questions that you want to ask of that then? You know, is it about drug interactions or is it about new treatment paradigms or what's the end goal there? - Yeah, so, the question that we've been asking most recently is, \"Can we systematically learn why drugs work?\" So, traditionally drug development is often very serendipitous. So, people will observe that a drug has a certain effect. And often times a lot of the main pharmaceutical therapies it's not entirely known why they work, just that they were observed to have a positive effect on a disease. And, traditional pharmacology, when actually looking at why compounds work has, or why drugs work, has done it on a single drug/disease level. So they look at a single therapy and try to understand why it works. But we're looking for patterns across all drugs that work. So, from a machine learning perspective, what makes compound/disease pairs that actually are efficacious? What makes them different from non-efficacious compound/disease pairs? - Wow. That sounds like there's a, there could be a lot of potential there. You know, a lot of new drugs that could be repurposed or new applications. Is that what you're looking for? - Totally. The end result of our algorithm is we make about 200,000 predictions and each one of those predictions is for a compound/disease pair and we give a probability we think that that compound disease pair represents a treatment. So, if you're interested you can go to our website and you can browse by compound or disease and see all of the predictions. And actually what's cool is that when you have a specific prediction you're interested in, you can click on it and it takes you to a guide in our public Neo4J browser so you can see what parts of the network contributed to that prediction; the specific network paths that we think provide evidence or support that a drug treats a disease. - I've seen that I thought that was so well done. Congratulations on that. Really, really, really, really cool actually. So this sounds like a mountain of gold, you know. Is this all in the public domain? Or is this just academic research? Or does it have business applications as well? - Yeah, so, we're part of an open science movement where we release all the code for what we do under open source licenses. We release all of the data as openly as possible. So, everything, if possible, is put into the public domain. And we're really looking to get people to use the research we make. It's fine if they profit of of it. That would be great. We just want to produce something that people find useful. I guess, because I'm a publicly funded scientist. I get to do what I want and make it available for free. - I think that is just so admirable. I really, really applaud that for you. I mean, We were talking about it earlier, right, so this podcast is going to be published under a creative commons license, as well, because, you know, that's how you want to publish your work. I really applaud that. That's fantastic. Really, really, appreciate it. - Thanks, yeah. I guess, it may just be a selfish thing that I like when my work is free used. (laughs) - No. - But... - I think it's uh, especially in the type of data you're dealing with and this type of research that you're doing, I mean, this could save lives, right? I think it's important that people do stuff like that. And I congratulate you on that. - Thanks. Yeah. Well, so, I've also experienced from both sides because we had to take data from about 30 different resources to integrate it into Heteonet and a lot of them would have licenses, even though they're publicly funded academic research projects, that made it really hard to integrate the data. So, that taught me the hard way the importance of having permissible open licenses. - So, let's talk about the future, Daniel. You know, where is this going? You know, what are your plans with graphs and with Neo4J? You know, where do you want to take this? - Yeah, so, right now Hetenet has about 2.5 million edges/relationships and I'd like to not only grow that number, but start to get more meaningful edges. So I think we can grow the network quite a bit and we can look at new applications. So, we were predicting whether a compound treats a disease, but we could also predict, say, new side effects of compounds, or we could predict... or we could start to get a more nuanced algorithm. So that's also of interest. As far as Neo4J goes, I've been really excited about the guide technology. So, you briefly mentioned that but we have this public Neo4J instance which lets anyone just go to the URL which is: neo4j.het which is \"h\" \"e\" \"t\" .io and then immediately see a Neo4J browser with our network in it and we have guides which are like a little kind of web page, or html tutorial, that just shows up naturally in the browser and can inform you about the network So, I think that really will help, like, biologists and pharmacologists interact with our network to have these guides. - Great. Well I'll put some links to this with the podcast transcription writes so hopefully you'll get some people visiting it and I really thought it was very impressive what you did there and much more impressive than ... I did a beer guide (laughs) - Ah, I think I've seen that. Yep. (laughs) - But, which is a lot less interesting but you know that's the only thing I know anything about. So (laughs) - I did see on one of the previous podcasts, I think, - Yeah - it was a network of movies - Yep - And it was like, date night, two people would put in the movies they like and it would find an intermediate movie. - Yeah. Yeah. - That was cool. They're on the west coast. Cool. Well, Daniel, thank you so much for coming online and doing this interview with me. I wish you the best of luck with all your research and hopefully, you know, it'll lead to lots of new treatments and new interesting research. Thank you so much and I hope to meet you some day at one of the GraphConnect conferences, perhaps. - Yeah. Totally. - That would be great. - I'm excited. Yeah. I think the whole community is developing so quickly, like, we use Doctor to deploy our cloud instance and the Neo4J support there is good. It's just a really fast moving project. - Totally is. - So, it's exciting. - Great. Thank you so much. I wanna keep this digestible and short so I'm going to wrap up here and I'll talk to you soon. - Okay. Too-da-loo. - Too-da-loo, exactly. Bye.","title":"Podcast Interview with Daniel Himmelstein, University of Pennsylvania-QfElj7A12Rw.en.vtt"}},{"_index":"documents","_type":"documents","_id":"1583e581-93fb-11e8-9133-b630aa296667","_score":1.0,"_source":{"keywords":[{"score":1.0,"value":"neo4j hashtag"},{"score":0.6847427495489364,"value":"graph database kind"},{"score":0.5270990416005276,"value":"graph kind"},{"score":0.5166520364717305,"value":"content kind"},{"score":0.44426075186484687,"value":"cool import mechanism"},{"score":0.3532556267731977,"value":"cool job"},{"score":0.35236223494680297,"value":"cool factor"},{"score":0.3485699546215091,"value":"cool kid"},{"score":0.34657634501233814,"value":"real world problem"},{"score":0.3438228528211575,"value":"developer advocate"},{"score":0.3144455967427122,"value":"cipher query"},{"score":0.30781161756795755,"value":"wikipedia article"},{"score":0.29457865168430775,"value":"first-class citizen"},{"score":0.29283122876989615,"value":"datum project"},{"score":0.2770928898495096,"value":"content marketing manager"},{"score":0.268185925462601,"value":"cooler way"},{"score":0.2669769463546938,"value":"java developer"},{"score":0.26622896992512296,"value":"java slant"},{"score":0.26427165037611017,"value":"closing thought"},{"score":0.2623478055448759,"value":"social media"}],"entities":[{"id":"model","type":"TITLE","value":"model","frequency":1},{"id":"Developer","type":"TITLE","value":"Developer","frequency":1},{"id":"Spring","type":"DATE","value":"Spring","frequency":2},{"id":"developer","type":"TITLE","value":"developer","frequency":1},{"id":"Second","type":"TITLE","value":"Second","frequency":1},{"id":"Mark Needham","type":"PERSON","value":"Mark Needham","frequency":1},{"id":"Neo4j","type":"MISC","value":"Neo4j","frequency":2},{"id":"Java","type":"MISC","value":"Java","frequency":2},{"id":"Marketing Manager","type":"TITLE","value":"Marketing Manager","frequency":1},{"id":"Michael Hunger","type":"PERSON","value":"Michael Hunger","frequency":1},{"id":"Wikidata","type":"ORGANIZATION","value":"Wikidata","frequency":1},{"id":"Graph Connect Europe","type":"LOCATION","value":"Graph Connect Europe","frequency":1},{"id":"Pivotal","type":"PERSON","value":"Pivotal","frequency":1},{"id":"London","type":"CITY","value":"London","frequency":1},{"id":"James Weaver","type":"PERSON","value":"James Weaver","frequency":1},{"id":"Bryce Merkl Sasaki","type":"PERSON","value":"Bryce Merkl Sasaki","frequency":1},{"id":"95","type":"PERCENT","value":"95","frequency":1},{"id":"Wikipedia","type":"ORGANIZATION","value":"Wikipedia","frequency":3},{"id":"Jim","type":"PERSON","value":"Jim","frequency":4}],"text":"Hi, I'm Bryce Merkl Sasaki and I'm here at Graph Connect Europe in London, and I'm here with James Weaver, a Developer Advocate with Pivotal. My friends call me Jim, by the way. You can call me Jim. Oh, Jim. Great. So, Jim, talk to me about-- how do you guys use Neo4j? Well, I can tell you how I've been using it. So, I just did a presentation here that uses Neo4j, it's called Navigating All The Knowledge, and so it kind of fuses Wikidata and Wikipedia together, and demonstrates some of Pivotal's technologies - like Spring and Club Foundry - in the context of being able to semantically navigate Wikipedia articles with Wikidata. And so, we're using Neo4j as part of that story, holding 11 million records or notes and 74 million relationships from Wikipedia and Wikidata to pull that off. And also, Spring has a Spring data project that interfaces with Neo4j, and I understand from talking to the Neo4j doctors out there at the booth, that there's going to be even tighter integration with Neo4j in the form of maybe even marketplace, to where Neo4j becomes a first-class citizen in the ecosystem. So, I'm very excited about that. Okay, great, great. Why did you choose Neo4j for your project? What made it stand out? Well, it's got the cool factor. So, all the cool kids are using Neo4j, first of all. That's first and foremost. Second of all, it's got a Java slant to it, I'm a long time Java developer and advocate, so the technologies are right. It's fast, it's a very smart way to solve the graph database kind of-- to create graph database kinds of solutions. There are a lot of options out there, triplestores and things like that. But for my taste, and for just really solving real world problems, and be able to model, graph kind of problems and then solve them - Neo4j is the best thing in town. Great. So, what have been some of your most surprising or interesting results when using Neo4j? Any kind of \"a-ha\" moments? Yeah. Yeah, really. So, with 11 million nodes and 74 million relationships, I was astounded at the speed. I'm actually using Neo4j hosted with Grafian DB, and they have a great hosting solution, and so I'm just really astounded at how quickly results come back from queries with that much data, and so that was an \"a-ha\" moment. Another one was the simplicity with cipher queries, but also the amount of how sophisticated you can make them. You can certainly make them complex, but most of them tend to be very straightforward and very simple to create. So, I've just had a very pleasurable experience with Neo4j. All right. Great. And then, if you could take everything you know about Neo4j right now, and you can go back in time to when you first started using it, what would you do differently, or what would you tell yourself? Okay, so I attended a session today, it was Michael Hunger and Mark Needham, and it was about importing data, and I learned a lot of things there. So, I had to import those 11 million nodes and 74 million relationships, and so I learned a lot through going through that exercise, and I learned that Neo4j already has tools to do that, so I could have saved myself a lot of trouble. And then, I found out that with Neo4j 3, they are actually using, I believe, the sort of procedures that are even quicker and cooler ways to do that. So, knowing that-- so if I had to do it all over again, I would probably wait until right now, when Neo4j 3 came out, and use that cool import mechanism that exists. Okay, great. Anything else you want to add or say? Any closing thoughts? Do you have any closing thoughts? I'd like to know about your job. You seem like you have a pretty cool job being a content kind of person. Tell me about what you do. Yeah. So, I'm the Content Marketing Manager at Neo, and I help make sure that our community's well taken care of on social media. People have questions, things like that. Always helped them find the answer quickly, and then I'm always here for making sure people's projects get shown off on the Neo4j blog - so, guest contributors, things like that. I'm the facilitator that makes sure they can show stuff off. So, when I'm tweeting, \"Graph connect hashtag\" and \"Neo4j hashtag\" you may be one of the people re-tweeting. I'm probably 95% chance the person re-tweeting it. Awesome. It was great meeting you. So, yeah, great meeting you. Thank you so much. Thank you.","title":"Navigating all the Knowledge with Spring + Neo4j - Interview of James Weaver, Pivotal-lQTnYwKGLoo.en.vtt"}},{"_index":"documents","_type":"documents","_id":"158433a7-93fb-11e8-9133-b630aa296667","_score":1.0,"_source":{"keywords":[{"score":1.0,"value":"graph datum visualization"},{"score":0.9569944732788476,"value":"page rank algorithm"},{"score":0.8582777108878485,"value":"page rank property"},{"score":0.8303593412031659,"value":"highest page rank score"},{"score":0.7728096105767913,"value":"page rank score"},{"score":0.7476751862102555,"value":"neo4j sandbox instance"},{"score":0.7024913608861905,"value":"page rank"},{"score":0.6764583297848125,"value":"graph visualization"},{"score":0.6444058569510247,"value":"graph algorithm"},{"score":0.639749496882517,"value":"troll relationship"},{"score":0.6218777462823151,"value":"community detection algorithm"},{"score":0.6191589846952305,"value":"troll node"},{"score":0.6100972554416323,"value":"datum visualization"},{"score":0.6003216909220396,"value":"neo4j instance"},{"score":0.5784420998897905,"value":"neo4j javascript driver"},{"score":0.5772788748939873,"value":"retweet relationship"},{"score":0.5741850468383382,"value":"neo4j sandbox"},{"score":0.555117088893292,"value":"github page"},{"score":0.5484101726628962,"value":"neo4j browser"},{"score":0.5111511439601805,"value":"troll user"}],"entities":[{"id":"model","type":"TITLE","value":"model","frequency":2},{"id":"editor","type":"TITLE","value":"editor","frequency":1},{"id":"Neo4j Browser","type":"MISC","value":"Neo4j Browser","frequency":1},{"id":"driver","type":"TITLE","value":"driver","frequency":1},{"id":"Twitter","type":"ORGANIZATION","value":"Twitter","frequency":1},{"id":"Twitter Trolls","type":"ORGANIZATION","value":"Twitter Trolls","frequency":2},{"id":"Twitter Troll","type":"ORGANIZATION","value":"Twitter Troll","frequency":1},{"id":"second","type":"TITLE","value":"second","frequency":1},{"id":"CDN","type":"ORGANIZATION","value":"CDN","frequency":1},{"id":"Instructor","type":"TITLE","value":"Instructor","frequency":1},{"id":"count","type":"TITLE","value":"count","frequency":4},{"id":"ID","type":"STATE_OR_PROVINCE","value":"ID","frequency":2},{"id":"guide","type":"TITLE","value":"guide","frequency":1},{"id":"developer","type":"TITLE","value":"developer","frequency":1}],"text":"- [Instructor] In this screen cast, we're going to explore graph data visualization and graph algorithms with Neo4j. First, we're going to spin up a Neo4j sandbox instance with some Twitter data to create a database to use for this example. Then, we're going to apply some graph algorithms, like page rank and community detection, and finally, we'll see how we can use a JavaScript library called NeoVis.js to create graph visualizations that we can embed in a web app. Now, there are many different motivations and tools for creating graph visualizations. This includes tools for exploring the graph, the type of interactive visualizations you might see in Neo4j Browser, and these are visualizations for showing the results of some analysis. These can be interactive, something to be embedded in a web app, or static, meant to contain meaning that might be used in print or a blog post. I'm going to focus on a specific tool that addresses specific goals of graph visualization. This tool is NeoVis.js, and is used for creating JavaScript based graph visualizations that are embedded in a web app. It's basically a combination of the Neo4j JavaScript driver to connect to and fetch data from Neo4j, and a JavaScript library for visualization called vis.js. NeoVis can also take into account the results of graph algorithms, like page rank and community detection, for styling the visualization by binding property values in the graph to visual components. Specifically, there are three style components that can be styled according to the results of graph algorithms. The first is binding node size to the result of a centrality algorithm, so this allows us to see, at a glance, the most important nodes in the network. Visually grouping communities or clusters in the graph is done with the use of color, so that we can quickly identify these distinct groupings, and finally, styling relationship thickness to an edge weight. In social network data, this might be the number of interactions between two characters. In logistics and routing data, it might be the distance between two distribution centers. That would be useful for pathfinding. So, we're going to use the Twitter Troll sandbox as our data set. This data contains tweets from known troll accounts, including retweets, and it's specifically the retweets that we're going to be interested in today. So, what we want to do is look where trolls have retweeted other trolls and run page rank and community detection to try to find the most important trolls and see if we can group those into clusters. Now, here we have a user that posted a tweet and a user who posted another tweet that retweeted that first tweet, so we have a implied retweets relationship between the two users. So, our first step will be to find those inferred retweets relationships, and run our graph algorithms. So, I'm going to switch over to Neo4j Sandbox. Now, I've already signed in and spun up this Twitter Trolls sandbox instance, so I think anyone can do that. Just sign in and select the Twitter Trolls instance. You can see lots of other data sets that you can load as well, but this is the one we're going to use today. If we look in the details tab, we can see we have the connection information for our Neo4j instance, including IP address and password, and we also have access to Neo4j Browser for this instance. Neo4j Browser is a query workbench for working with Neo4j, and in the case of the sandbox instances, they all include these interactive browser guides that embed images and queries and text to help us explore the data. I'm gonna skip the guide, and we'll just write some queries on our own here. Now, the first thing that we need to do is create the retweets relationships connecting the troll accounts, so let's write a cipher query matching on the graph pattern for all retweets, so where a troll has posted a tweet, and there is another tweet that retweets that tweet. You'll need to grab both of the troll users, then we'll do an aggregation for all distinct pairs of these trolls. We'll count the number of retweets, then let's create our retweets relationship, where our two retweets are one, and on this relationship that we're creating, so here, r refers to this retweets relationship, and let's set a relationship property called count that's equal to the count of retweets, in this case, where r2 has retweeted r1. Let's go ahead and run that, and we end up creating a few hundred new relationships. We can verify this if we inspect the data model. We can see now that we have this troll retweets troll relationship in our data model. Okay, so the next thing we want to do is run page rank on this retweets piece of the graph, so let's run page rank on the troll nodes following the retweets relationships, and let's be sure to write the data back. Don't just compute it, but update the nodes with a page rank property. So, now, we can see which trolls have the highest page rank score. So, match on all trolls. Order by page rank in descending order. Let's look at the top 10, and so you can see, here are the top 10 screen names of the top 10 trolls by page rank. Okay, and the next algorithm we're interested in is something that can help us identify communities. We have a few different options for community detection. Alright, let's use the label propagation algorithm, in this case, on troll nodes, following the retweets relationship, label propagation. We choose a direction, and again, we need to write that data back to the graph. Let's, this time, set a community property that identifies the community, and also, let's be sure to take into account the weights of the relationship. Remember, we set that count property on those relationships. Okay, so we run that. Let's verify that data was written back. If we just select some trolls at random here, we can see now that we have a community, value, as well as a page rank score for these nodes. Okay, great. So, now, I'm going to get side by side with my text editor here, and let's jump over to a blank html document, starting off, and I've opened that in Chrome here. So, what we want to do now is pull in this data from Neo4j and create a graph visualization, something that we might be embedding in a web application, taking advantage of some of those graph algorithms that we just ran, and we're going to use this NeoVis.js JavaScript library to do that, so the first thing I'm going to do is go to the GitHub page for NeoVis. Now, I can go to the release tab and grab the JavaScript file for the latest release, but in this case, I'm just going to grab the link to pull that in from the CDN, and we'll go ahead and do that inside head here, that's fine. So, pull in our JavaScript file for the library. Let's make the text just a little bigger. There we go. Okay, so imagine this is our web application, and we want to create a visualization somewhere in here. Maybe this is a dashboard or something like that. Well, we need a body, and let's create a single div here, and we'll set an ID on this div. Okay, now, I'm going to add an unload function call here, so when the page has loaded, let's call this draw function, and let's define this draw function up here in the script tag. So, what draw is going to do is it is going to create a new NeoVis instance, specifying some configuration, and then render that visualization. So, we say vis is new. NeoVis passing in this config object, and then render the disk. So, NeoVis takes this config object which specifies how to connect it to Neo4j, and how to style our visualization. So, for instance, we'll need to add a container ID. In this case, that's the dom element that we want to populate with our visualization. We'll need to specify a server URL, a server user, and password to connect to Neo4j, and then we'll also need to specify what labels and relationships we want to visualize, and we'll need to specify an initial cipher statement for grabbing some data for Neo4j to populate our visualization. In this case, we want to match on all trolls that retweet other trolls, then return all of those. Now, we don't have to specify this initial cipher, although, if we don't specify that, we will pull all of the data in, into our visualization, which we don't wanna do in this case. Okay, so let's jump back to Neo4j sandbox. We need our connection information to connect to Neo4j, so let's copy the IP address. Now, in the server URL, we're going to use the bolt protocol, and we want to be sure to grab the bolt port, not the HTTP port. Bolt is the binary protocol for Neo4j that the Neo4j drivers use to connect to and talk to Neo4j. HTTP, that port is the one that Neo4j Browser is served on, so we don't want that. Username is Neo4j, so we'll set that, and specify the password. Okay, so let's save that and go back and refresh our page, and let's see if we can visualize anything here, so we'll refresh that, and we don't have a visualization yet. Here we go. Okay, so we can see here that it fetches some data from Neo4j, and it's showing us the results of this cipher query where trolls have retweeted other trolls. So, if we zoom in here a bit, we can see, yep, here's a troll that has retweet connections to other trolls, but that's not telling us much information, so let's update our config object. In this case, let's specify the styling that we want for the troll label. We'll specify the property to use for the caption. In this case, that's user key or the screen name, and now, we're going to say I want the size of the troll nodes to be proportional to the value of the page rank property, and similarly, take into account the value of the community property for our troll node. Remember, we ran our page rank algorithm to update that value for page rank, and label propagation to set the value for community. So, now, when we refresh that, we can see pretty clearly the result of our community detection algorithm, so here, we can see these green nodes are one community. We can see the purple nodes here, blue here, so we can see some distinct communities that were identified, and we can also see that some nodes are larger than others. So, here's TheFoundingSon. Remember, this account had the highest page rank score, and you can see here that it's the largest node in our visualization. Okay, there's just one more change we're going to make here, so let's configure how we want to style the retweets relationship. I am just going to turn off the captions and set the caption as the same for all retweets relationships, and we're going to use the count property to style the thickness of the relationship, and we'll go ahead and refresh that page to generate the visualization again. We're fetching data from Neo4j, and then we're rendering our visualization, and now we can see that the size, the thickness of our relationship is now proportional to the count property, so here, we can see that, for instance, count here, 36, so this user has retweeted the other 36 times, and that has a thicker relationship or a stronger connection than this case, where the value is just two. Okay, that's the basic approach for using NeoVis to generate graph visualization with data in Neo4j. There are some other layouts, some other styling options that we can use. They're described in the documentation on the read.me on GitHub for NeoVis. If there are features you'd like to see added, I would encourage you to open an issue on GitHub so that we can start working on those. And finally, I just want to leave you with a few resources based on things we talked about today, so the first is Neo4j Sandbox, which we saw is a great tool for spinning up Neo4j instances with data sets. The GitHub page for NeoVis.js is the second link there. The code for the example that we used today as well as some other examples are on that GitHub page as well, so you can find the code there. And then, there are two pages on Neo4j.com in the developer section, one on data visualization, and another that goes into more detail on some of the graph algorithms that we used. We looked at centrality and community detection. There are lots of other algorithms in there as well, things in pathfinding and so on, so if you're interested in graph algorithms, I would encourage you to check out that page. Great, well, that's it. Thanks a lot.","title":"Screencast - Graph Visualization With Neo4j Using Neovis.js-0-1A7f8993M.en.vtt"}},{"_index":"documents","_type":"documents","_id":"15845ac4-93fb-11e8-9133-b630aa296667","_score":1.0,"_source":{"keywords":[{"score":1.0,"value":"aggregation function"},{"score":0.9337457315771053,"value":"custom function"},{"score":0.895567145024072,"value":"procedure library"},{"score":0.806372936387508,"value":"built-in function"},{"score":0.7944454422513225,"value":"standalone procedure"},{"score":0.7933335038874176,"value":"official language driver"},{"score":0.7609102478144678,"value":"awesome procedure"},{"score":0.6559394897832255,"value":"apoc video series"},{"score":0.6167166046149003,"value":"dedicated apoc channel"},{"score":0.57561940914869,"value":"binary driver"},{"score":0.5665278491807646,"value":"custom business logic"},{"score":0.5592120109608263,"value":"high performance business logic"},{"score":0.5292993840302251,"value":"annotated java method"},{"score":0.5283830069325787,"value":"java clause"},{"score":0.5267480035161758,"value":"cypher statement"},{"score":0.4893409963013549,"value":"custom detail clause"},{"score":0.48511113480659285,"value":"official driver"},{"score":0.4752283896274882,"value":"further web video"},{"score":0.46678380088970706,"value":"neo4j access"},{"score":0.45825208810736645,"value":"different part"}],"entities":[{"id":"Cypher","type":"MISC","value":"Cypher","frequency":1},{"id":"Michael","type":"PERSON","value":"Michael","frequency":1},{"id":"Java","type":"MISC","value":"Java","frequency":4},{"id":"general","type":"TITLE","value":"general","frequency":1},{"id":"Scala","type":"PERSON","value":"Scala","frequency":1},{"id":"APOC","type":"ORGANIZATION","value":"APOC","frequency":3},{"id":"JVM","type":"ORGANIZATION","value":"JVM","frequency":1},{"id":"Kotlin","type":"LOCATION","value":"Kotlin","frequency":1},{"id":"Python","type":"PERSON","value":"Python","frequency":1},{"id":"Boolean","type":"MISC","value":"Boolean","frequency":1},{"id":"Java API","type":"MISC","value":"Java API","frequency":1},{"id":"Swiss","type":"NATIONALITY","value":"Swiss","frequency":1},{"id":"2016","type":"DATE","value":"2016","frequency":1}],"text":"- [Michael] Due to public demand, I want to create an APOC video series that explains how to install, use, and make the most out of the awesome procedures for Cypher, a procedure library for Neo4j. First, I want to start with a little bit of background, and then, in the next episodes, we will look at installation and different parts of APOC. Neo4j has developed over the years from an embedded Java API, over REST server, to the introduction of Cypher, that originally was served over HTTP, to until 2016, a binary protocol called Bolt, with official language drivers, and due to the change to the binary protocol, we couldn't use the REST APIs for management functionality anymore, so, as part of that, user defined procedures were added to Neo4j, added as a new capability. And in the subsequent versions also, user defined functions and aggregation functions were added, which was really cool. The, basically, binary drivers for Neo4j access for JavaScript, .NET, Java, and Python, as official drivers, and for all the other languages as community drivers, so you could, should find something for your language pretty easily. But today, we want to look at user defined procedures and user defined functions and aggregation functions. So, basically, you can implement them in any programming language on the JVM and Java, and Scala, Kotlin, and Groovy. And so, it's basically a annotated Java method that has then deployed to Neo4j, and can be called from Cypher either standalone, so as a call procedure, or as part of Cypher statements. So here's an example. We just call it apoc.index.nodes function, pass in an index name and a query, and it returns a node and a score, and we can return then from our query the node and the score, as results. But you could also call apoc.index.nodes as a standalone procedure, and then the query would have returned a node and score, as added two columns. It actually looks like this. So you have a Java clause that has something injected, in this case, a graph database service, and the procedure of this name that you just saw, and parameters that you just saw, index and query, and basically, it returns a search hit stream. That means it doesn't just return a single row, but it returns a streaming result of data that could be potentially very large. And internally, most of procedures are quite small, and return a, a custom detail clause. Yeah. For user defined function, it's similar. They're a little bit more flexible, because they can be called in any expression or computation, so, in arithmetic expressions, Boolean expressions, and predictors, relational expressions, and creates statements or merge statements. Everywhere they can have an input in expression of Cypher or built-in function, you can also use a user defined function. And in this case, we have the apoc.text.join function here, if you just call as part of the return statement, and if you take this list of words and join it with a space, and then we see hello world. A permutation is even simpler, because it doesn't return a stream, but just a single value, so it could be any of the Cypher types, or strings, numbers, Booleans, lists of node, relationship, path, maps, and so on. And, so we just basically call this code and return this string, and then we are good. And, for aggregation functions, this is a little bit more evolved, because they have to keep state, because they have to aggregate data in some data structure, and here's just an example that you can use for your own aggregations. So basically, here's an example that computes longest strings of lists. And, internally, it just creates an instance of this aggregator that keeps the longest string and the length of the string, and then, for each update, for each row, it gets an update call, and we can see this being computed, and then at the end, it returns result. But actually, you don't want to all to go out and now write your own user defined functions for anything that's kind of more utility level code. Of course, you can do it for your custom business logic, and probably should, also should do that for high performance business logic and custom functions, but in general, as utility, you shouldn't do that. So, we have quite a lot of projects in Neo4j Contrib that already use user defined procedures and functions, so neo4j-graph-algorithms, APOC that we talked about, neo4j-graphql, and spatial all provide user defined functions and procedures for you to use out of the box. And that brings us to APOC, which is a project I started way before Neo4j 3.0 came out, to basically play around with this procedure API, and which now has evolved into a pretty active and large project. You see that we have quite a lot of contributors, and had already almost 30 releases and almost 80,000 downloads so far. And that's quite cool, and as you would love to see it used, and if you use it, please let us know. Also, if you have any questions or so, please raise an issue, or ask in our Slack channel about it. Basically, APOC can be seen as a standard library or a typical Swiss army knife of procedures and functions, and there's nothing you can't find in there, and this video series is meant to give you a little bit of overview about all the things that are there, and to show you in practical examples how you use them, and find the things that you could utilize. And, this is the first bit. Following up, we will do installation, and then, the individual parts of APOC. If you have questions, join our Slack channel. There's a dedicated APOC channel that you can ask, and the repository I created, short link here, neo4j-apoc and the bit.ly, and if you would want to watch further web videos and other videos around Neo4j, go to our YouTube channel. Thank you.","title":"Introduction to Neo4j Procedures & Functions and the APOC Utility Library (#1)-V1DTBjetIfk.en.vtt"}},{"_index":"documents","_type":"documents","_id":"15845abd-93fb-11e8-9133-b630aa296667","_score":1.0,"_source":{"keywords":[{"score":1.0,"value":"neo4j graph datum platform"},{"score":0.9836871173719974,"value":"graph datum platform"},{"score":0.9547307569861131,"value":"graph datum"},{"score":0.8533803874017133,"value":"neo4j graph"},{"score":0.8084465100271243,"value":"different graph representation"},{"score":0.802945030849319,"value":"same graph"},{"score":0.8022540565432427,"value":"social network graph"},{"score":0.7724620925276199,"value":"different graph"},{"score":0.7649687320908789,"value":"cypher gremlin toolkit"},{"score":0.7481956328179074,"value":"cypher declarative query"},{"score":0.7361102756745955,"value":"gremlin cypher"},{"score":0.7313837248994396,"value":"cypher query"},{"score":0.7257831303402383,"value":"datum lake integration demo"},{"score":0.6957020776053906,"value":"pluggable graph source factory library"},{"score":0.6886103071072075,"value":"graph view"},{"score":0.6724512073792333,"value":"actual gremlin query"},{"score":0.6632846746062526,"value":"city friend graph"},{"score":0.6556615929902796,"value":"different datum"},{"score":0.6407135435682643,"value":"product graph"},{"score":0.6378092264376108,"value":"analytic graph algorithm executions"}],"entities":[{"id":"Cypher","type":"PERSON","value":"Cypher","frequency":1},{"id":"Cosmos DB","type":"ORGANIZATION","value":"Cosmos DB","frequency":1},{"id":"Oracle","type":"TITLE","value":"Oracle","frequency":1},{"id":"independent","type":"RELIGION","value":"independent","frequency":2},{"id":"New York","type":"STATE_OR_PROVINCE","value":"New York","frequency":1},{"id":"Mats","type":"PERSON","value":"Mats","frequency":7},{"id":"Alpha Centauri","type":"ORGANIZATION","value":"Alpha Centauri","frequency":1},{"id":"North","type":"MISC","value":"North","frequency":2},{"id":"Gremlin Server-based","type":"MISC","value":"Gremlin Server-based","frequency":1},{"id":"American","type":"NATIONALITY","value":"American","frequency":2},{"id":"second","type":"TITLE","value":"second","frequency":2},{"id":"Alastair","type":"PERSON","value":"Alastair","frequency":2},{"id":"Dimitry Solovyov","type":"PERSON","value":"Dimitry Solovyov","frequency":1},{"id":"Gremlin","type":"ORGANIZATION","value":"Gremlin","frequency":1},{"id":"explorer","type":"TITLE","value":"explorer","frequency":1},{"id":"Manager","type":"TITLE","value":"Manager","frequency":1},{"id":"European Union","type":"ORGANIZATION","value":"European Union","frequency":1},{"id":"Microsoft Azure","type":"ORGANIZATION","value":"Microsoft Azure","frequency":1},{"id":"Mats Rydberg","type":"PERSON","value":"Mats Rydberg","frequency":1},{"id":"Mila","type":"PERSON","value":"Mila","frequency":1},{"id":"London","type":"CITY","value":"London","frequency":2},{"id":"JanusGraph","type":"PERSON","value":"JanusGraph","frequency":1},{"id":"Dimitry","type":"PERSON","value":"Dimitry","frequency":6},{"id":"Gremlin Server","type":"ORGANIZATION","value":"Gremlin Server","frequency":1},{"id":"European","type":"NATIONALITY","value":"European","frequency":1},{"id":"Proxima Centauri","type":"ORGANIZATION","value":"Proxima Centauri","frequency":1},{"id":"Java API","type":"MISC","value":"Java API","frequency":1},{"id":"Cloudera Cluster","type":"MISC","value":"Cloudera Cluster","frequency":1},{"id":"Apache-licensed","type":"MISC","value":"Apache-licensed","frequency":1},{"id":"last week","type":"DATE","value":"last week","frequency":1},{"id":"Riga","type":"CITY","value":"Riga","frequency":1},{"id":"Parquet","type":"MISC","value":"Parquet","frequency":1},{"id":"Cypher for Apache","type":"MISC","value":"Cypher for Apache","frequency":1},{"id":"Gremlin-utilizing","type":"MISC","value":"Gremlin-utilizing","frequency":1},{"id":"Apache Tinkerpop","type":"PERSON","value":"Apache Tinkerpop","frequency":1},{"id":"ID","type":"STATE_OR_PROVINCE","value":"ID","frequency":2},{"id":"50","type":"PERCENT","value":"50","frequency":1},{"id":"header","type":"TITLE","value":"header","frequency":1},{"id":"Cypher Everywhere","type":"MISC","value":"Cypher Everywhere","frequency":1},{"id":"Mats","type":"ORGANIZATION","value":"Mats","frequency":1},{"id":"Sweden","type":"COUNTRY","value":"Sweden","frequency":1},{"id":"Hadoop","type":"MISC","value":"Hadoop","frequency":3},{"id":"analyst","type":"TITLE","value":"analyst","frequency":1},{"id":"Tinkerpop","type":"MISC","value":"Tinkerpop","frequency":2},{"id":"Neo4j JavaScript","type":"MISC","value":"Neo4j JavaScript","frequency":1},{"id":"Hadoop","type":"PERSON","value":"Hadoop","frequency":1},{"id":"director","type":"TITLE","value":"director","frequency":1},{"id":"Gremlin","type":"PERSON","value":"Gremlin","frequency":3},{"id":"repo","type":"TITLE","value":"repo","frequency":3},{"id":"IntelliJ IDEA Cypher","type":"MISC","value":"IntelliJ IDEA Cypher","frequency":1},{"id":"Explorer","type":"TITLE","value":"Explorer","frequency":1},{"id":"manager","type":"TITLE","value":"manager","frequency":1},{"id":"Neueda","type":"LOCATION","value":"Neueda","frequency":2},{"id":"Cypher","type":"MISC","value":"Cypher","frequency":6},{"id":"model","type":"TITLE","value":"model","frequency":1},{"id":"folder","type":"TITLE","value":"folder","frequency":2},{"id":"Neo4j","type":"MISC","value":"Neo4j","frequency":1},{"id":"Kerberos","type":"MISC","value":"Kerberos","frequency":1},{"id":"ISO SQL","type":"ORGANIZATION","value":"ISO SQL","frequency":1},{"id":"Scala","type":"PERSON","value":"Scala","frequency":1},{"id":"Java","type":"MISC","value":"Java","frequency":2},{"id":"Friday","type":"DATE","value":"Friday","frequency":1},{"id":"Neo4j Cypher","type":"MISC","value":"Neo4j Cypher","frequency":1},{"id":"Alastair Green","type":"PERSON","value":"Alastair Green","frequency":1},{"id":"execution","type":"CAUSE_OF_DEATH","value":"execution","frequency":1},{"id":"Apache Software Foundation","type":"ORGANIZATION","value":"Apache Software Foundation","frequency":1},{"id":"Sirius","type":"ORGANIZATION","value":"Sirius","frequency":2},{"id":"Microsoft","type":"ORGANIZATION","value":"Microsoft","frequency":4},{"id":"principal","type":"TITLE","value":"principal","frequency":1},{"id":"Tinkerpop","type":"LOCATION","value":"Tinkerpop","frequency":1},{"id":"Hadoop","type":"LOCATION","value":"Hadoop","frequency":1},{"id":"Alistair","type":"PERSON","value":"Alistair","frequency":2},{"id":"executions","type":"CAUSE_OF_DEATH","value":"executions","frequency":1},{"id":"Gremlin","type":"LOCATION","value":"Gremlin","frequency":10},{"id":"driver","type":"TITLE","value":"driver","frequency":5},{"id":"Gremlin-based","type":"MISC","value":"Gremlin-based","frequency":1},{"id":"SD","type":"STATE_OR_PROVINCE","value":"SD","frequency":1},{"id":"Bill Miller","type":"PERSON","value":"Bill Miller","frequency":1},{"id":"Gremlin Server","type":"LOCATION","value":"Gremlin Server","frequency":1}],"text":"- We can get started with this session on Cypher Everywhere which is a follow-up to some of the points that I was talking about in the keynote and also follows on from the talk we had earlier on today about Cypher for Apache Spark. So we've got three people here to present, myself, Alastair Green, we have Mats Rydberg who's part of the Cypher for Apache Spark team and also a member of the Neo4j Cypher language group, and Dimitry Solovyov, who is from Neueda which is a technical partner of Neo4j. Dimitry's been heavily involved in the work we've been doing for Cypher over Gremlin. And we're gonna talk about how Cypher for Apache Spark actually interacts with a Hadoop environment in a bit more detail, so really that side of things. And we're also gonna talk about, we've done the spoiler in the keynote, the unexpected bit here is the work that we've been doing on Cypher over Gremlin. I guess the next thing we need to see is that's Mats's back with the new syntax of Cypher for constructing graphs, so that's just a graphic to indicate where we're going in paths of this stuff. And this really shows us where openCypher is up to from the color perspective. So openCypher has, effectively, a repo that is related to the tooling, things like the grammar and the technology compatibility kit and there's a new repo for the Cypher for Apache Spark that was only opened up on Friday, I guess it was, yeah? And there is some stuff in terms of website maintenance and so forth. And then there's a private repo at the moment which contains the Cypher over Gremlin work. And Dimitry will be telling us more about that. So I think that the broad direction here is that Cypher over Gremlin is also headed for Apache-licensed, open source, under the rubric of the openCypher project, but we haven't quite got to the point of actually working out when and exactly how, what the modalities are with that. But that's sort of where we're going. Neo4j has a Cypher frontend which effectively does query compilation and preprocesses query planning in the Neo4j product. We've taken that Cypher frontend and we've made that Apache Two licensed. So anybody can use it, it's a very sophisticated parser, dot dot dot, SD construction and so forth. Query rewriting to some extent. And we've used it in all three of these different implementations of Cypher and that's quite deliberate. So we've got Neo4j, we've got Cypher for Apache Spark, and Cypher over Gremlin is using exactly the same frontend as well. And already mentioned the Apache Two licensing on those two fronts. So really what we're gonna talk about, two things in this discussion, we're talking about Cypher over Gremlin, work initiated by a sponsor for Neo4j but conducted on the ground by developers within Neueda very effectively, indeed. Great technical partners. You may also know them from, for example, the IntelliJ IDEA Cypher plugin was produced by the Neueda team. This is the Neueda labs in Riga. And then secondly, we're gonna talk about data lake integration, how we work with Hadoop and Hive and Allied Matters, and here again we made a partnership, we've been working with a small consultancy that specializes in the big data ecosystem in London called 51zero. Unfortunately none of them were able to be here today in New York but we'll be representing their work effectively and Mats will be discussing that. So before passing over to Dimitry to talk about the details of Cypher over Gremlin, just want to remind you, this is a diagram from the Apache Tinkerpop project. Apache Tinkerpop is a project that provides a low-level API called Blueprints and it also provides a higher-level API called Gremlin. Blueprints API actually originated in work that took place within Neo4j, effectively, we still have a native Java API which is kind of similar in terms of its functional intent and characteristics. Above that Gremlin, and Gremlin is really very much the focus of the work that's gone on with Tinkerpop going into the Apache Software Foundation. So Apache Tinkerpop Gremlin consists of a traversal API which represents itself to users as an embedded DSL in your language of choice, which means it's not independent of a general-purpose programming language at the point of use. So it's different in that respect from SQL, and different in that respect from Cypher. And then there is a Gremlin engine, which many implementers using Gremlin take this engine and actually then have pluggable backends as this diagram indicates. So you could put different databases behind the Gremlin framework. And actually, Neo4j is, I think one of the three reference implementations that are used to maintain the Gremlin stack, as it were. So at the front, there's a concept that there could be higher-level languages that might make it easier for you to program against the Gremlin-based implementation. So we sort of took that red Cypher that you can see on the top left, red means they haven't done it yet, as a kind of invitation, well, why don't we do it? Let's have a Cypher language over Gremlin and hopefully, if this project is useful, it will be picked up and it will mean that Cypher can become available as an API, a surface, for any Gremlin-utilizing implementation. And Gremlin is often being used as a quick way of getting yourself into the graph world, because you've got this software framework that you can put on top of existing backing store. It doesn't actually deal with things like transactionality, for example, if you use a non-transactional backing store, one that can't actually span transactions across partitions of a graph, it won't address those issues. But it certainly is a very good way of getting quickly into the world of actually interfacing graphs over existing stores. And we're taking that one level up and saying, well let's have Cypher at the top of that stack. What happens here is, the work that we're gonna see is effectively taking a Cypher declarative query and working out how to map that to the traversal API of Gremlin, and there's a fair amount of complexity in terms of the different places where that has to happen and the different tricks that have had to be performed in order to make that work. But as we'll see, we're gonna be seeing, I hope, this kind of capability running, not over Gremlin Server-based applications or databases like JanusGraph, which is formerly Titan, which many of you will have heard of. But also seeing it running over Cosmos DB, the Microsoft graph for data service. So just in terms of coverage, there's a lot of the Cypher language that's already covered. I won't go through the details of all of this. Many of these aspects you'll find that there's some corner case or less-used part which hasn't been fully implemented yet. We have in openCypher, we've used it, I think, Dimitry, you found this extremely useful, yeah? The Technology Compatibility Kit. 873 scenarios taken from Neo4j's Java implementation test environment, taken out and represented as cucumber scenarios that can be used with any language. So programming language independent. And enables people to test their conformance to the defacto Cypher specification based on the largest footprint implementation in terms of Neo4j. So that's part of the process that we begin to create, of tools that constitute ultimate ways of formally specifying the actual content of Cypher, that's gonna be part of the openCypher initiative. It was used here, this is kind of, it's difficult; this diagram is a little bit out of date, I believe the lines moved just beyond 50% of the scenarios passing at this point. I don't mind, if you think it means that we've done half the language or the guys have done half the language, that's actually just fine 'cause this is not a finished project. In some other ways, it probably underestimates the actual effective useful implementation coverage. Anyway, this is work in progress. So I will therefore pass over to Dimitry to talk further about this and to show us some interesting demonstrations of the technology, thanks. (audience applause) - Hey, do you hear me well, right? Yeah, okay. So I'm going to go over a couple of theoretical things first and then we'll go to live demo of the actual thing. So this is a kind of conceptual picture of all of the things you can do with our Cypher over Gremlin translation, which we internally call the Cypher Gremlin toolkit. And it's a toolkit because it consists of different parts that kind of inject themselves in the Gremlin ecosystem. And the parts that you're interested in are highlighted in yellow here. So that's the Gremlin plugin which works in the Gremlin console. Sorry, Cypher plugin which works in the Gremlin console. The Cypher plugin which works inside the Gremlin Server instance, and that custom client which essentially uses the same translation module that translates Cypher queries to Gremlin queries as the plugin in Gremlin Server. So why all of these different pieces? Gremlin has different instances of how it works. The usual one that you find is when you download parts of the Tinkerpop project, is that they have a thing called Gremlin Server, right? Which is a Java application that allows you to run a sort of an API on top of different backends. One of them would be JanusGraph or Titan. And there is Gremlin the language that could be used across other things, not just the Gremlin server itself. And one implementer of the Gremlin language is Cosmos DB. So they have a completely redesigned and remade instance of a service that accepts Gremlin queries. And what the toolkit allows you to do is actually either send Cypher directly to a Gremlin server and then do the translation and execution of the queries there, or you can do the translation on the client side and send actual Gremlin query to a server that accepts Gremlin. Which might be less efficient than when you are working in the Gremlin server because when we're working in the Gremlin server we actually do the translation to the so-called traversal API which Alastair mentioned as Blueprints, which was the name for the traversal API in Tinkerpop two. So Tinkerpop three is just essentially the traversal API. So depending on the use case, you can use this translation module in different contexts. Some things that this allows us to do is actually unify usage of Cypher across different databases or backends that accept Cypher queries. And that would be a Gremlin server and different types of databases that support Gremlin input, that would be CAPS that you might have seen in action in the session previously today and in the keynote. And anything else that accepts Cypher including Neo4j itself. What you see here is pieces of code that we use in a demo of our application that I'm going to show you briefly that actually uses an API that we've built completely identical to the existing Neo4j JavaScript client, with the only exception that allows you to run Cypher queries over anything that accepts Cypher. And it works like this, you essentially initialize it in the same way that Neo4j JavaScript drivers initialize and accept. You can specify other protocols besides those here which is what we do. And in the center, you see different connections that are used in React that serve as parameters to this code and allow you to initialize different connections to different databases. So I'll just go right into the demo. You might have seen a short clip of this application in the keynote. So this is the one. The application is called Cypher Explorer and this is essentially a demo where we use it to just show how stuff works and it's configured in the same setup as it was in the keynote where we have Neo4j 3.2. Unfortunately not 3.3 but that's what it's been made for. Connection then CAPS. The alphabet is currently open sourced and JanusGraph 0.1.1 which was the previous release before they made the 0.2 release just before Graph Connect. So all of these backends are running in Docker Compose here. They've already started and you can see the locks for them, right? So that's CAPS demo, that's JanusGraph, they've just been logging some service information up until now. But now I'm going to just run a query. So I'm going to select this and then execute it with a button. And we have results across all three different backends. As you can see, they are absolutely identical, right? So this is working on the social network graph that if you've been to the CAPS demo you've seen it there. We are just working on a European Union slice of the social network graph here. And this query just returned the list of countries for us. The tabs that are in each column are kind of in the same vein as how it works in Neo4j browser so we can switch to the code tab here and see the payload that we get from running this across each different backend. And as you can see, the payload is completely the same. And if you've ever looked at the internals of what you get, or the code tab in Neo4j browser, this is the actual structure that gets returned and it gets returned across any of the backends. If I go back to the log center I try to find the piece of the log where it ran over JanusGraph. Hopefully. It's just that CAPS logs a lot of things unfortunately, so it's hard to find here. Ah, right, let's just find something else and maybe it will be more like it. Right, so this is another example of a query that we do that ran over Gremlin, CAPS, and Neo4j. Let's take a look if there are actually. Right, CAPS, yeah. So here are three tiny lines of what JanusGraph logged when we ran that query, and we can see that we are logging things about more that we've sent this query to our Gremlin Server instance which is backed by JanusGraph. And it also logs the translation that this thing got transformed into. It's a bit different from the one that Alastair showed on his slides because we are changing the translation every day in order to better suit the corner cases of Cypher semantics because we are keeping the Gremlin translation very close, it's learned by Cypher semantics, right? Actually, I've got people in the audience that have worked with Gremlin the language before. So they understand, so anyone? Great, so a couple of people, that's great. So as you can see, this is a usual Gremlin query and it might look a bit machine-like because it's of course, it was generated. The way that we do it is that we just traverse the Cypher query and then for each subpiece of the query we generate a respective piece of Gremlin and we keep a bit of context because depending on parts of Cypher you have to add some stuff at the end or in the middle. But this is what it ends up looking like. Looking at the result here, the graphs are laid out a bit differently because of the layout algorithm which is force driven and it's a bit random. But these are the same results, essentially. So we have a list of the cities and relationships that the Sweden node pushes, the country for the cities. Right, so this is one and this is the identical city here, and if we look at the table, we see that there's also the tables are identical as well for each node and then the relationship between them. And again, the code is just the same across all different backends. So the nice thing of having a completely identical API built that's identical to Neo4j's JavaScript driver is that we can actually use this library in contexts where Neo4j JavaScript driver is used and the application wouldn't tell the difference mostly. Which is what we did and I am going to show you how it works on Azure. Microsoft Azure has a service named Cosmos DB which was previously DocumentDB but then they kind of expanded the name because it became multi-model. And one of the models that they support is a graph model, and you can run Gremlin queries on top of things in the database. So here you see the data explorer UI in Microsoft Azure that allows you to look at you graph data inside it. And right now it's empty, so I'm doing a g.V Gremlin query which essentially gives you the traversal from each node in the graph and since the graph is empty we don't really get any nodes back. And in the next tab, I have Neo4j browser running that has actually been hacked to work on top of the Cypher client instead of the Neo4j JavaScript driver. There has been some fixes to support that, for example, you don't really get service information as you can see, these lines about the database are empty as well as their counters here, just because this is not Neo4j, right? So we can't really get that information running on top of Gremlin Cypher like thing. But what I am going to do is, I'm going to first of all clean up everything just in case and now I have a query prepared here that's going to create a small graph inside this backend that will create some star systems then connect stars which reside in these star systems to them and then add some planets. So it essentially recreates a small star map or solar system, Alpha Centauri and Sirius and their interconnections. I'm going to run this and the query will finish and now in Azure we will run the traversal for the graph, we see that there is no data inside Microsoft Azure from a Cypher query that run over Neo4j browser over Cypher client 'cause you got translated into Gremlin which then got proxied to Microsoft Azure. And if we click around, we can actually see the nodes that got created. So it visualizes them node by node, but if I select one I can see that this is a star system, it says here it's a star system with the name Sirius and it's connected to a couple of others like this one is Alpha Centauri star system. Right, so this is a solar system and this is a star that's connected to the Sirius star system. And there are a bunch of other nodes here. The interesting thing is that I could now continue running Gremlin queries here and just work with this data but I don't want to because I can run Cypher. So I can go back to the browser and run a different query that I got prepared. So this one, it mentions Alpha Centauri in the graph and then returns every connection that is to Alpha Centauri. And this is the result. And as you see, since the browser doesn't really know that it has been tricked and it is running not on the Neo4j JavaScript driver but on the Cypher client, it works normally and since the structure of the result is the same. It even can do things like build its visualization because it gets the data that it expects. So if I zoom in a bit, actually no, I have no idea how to zoom in on this graph, sorry. But this is Alpha Centauri in the center and there are three relationships they are a member of and that's Proxima Centauri, Beta and Alpha Centauri Alpha. And everything else works as well, so we can see the table here that works as expected and this is the trick, is that the browser received the proper structure of the payload that it can continue working with and rendering these results in the table view under the graph view. So maybe I can just zoom in like this, yeah. So this is the structure that Neo4j tooling expects. So essentially, if we slap in Cypher client anywhere else, or if you swap in Cypher client anywhere else inside of your applications that work with Neo4j JavaScript driver it should just work. Besides the queries that wouldn't be supported because of course, these backends don't support the full extent of the Cypher language. In particular, in Neo4j browser we actually made some hacks, so it's a bit of smoke and mirrors in that we intercept several call queries 'cause Neo4j browser fetches some service information from the backend that it's running on top of. Which would be Neo4j normally. I guess that's it, that's the thing that I wanted to show and I took my chance to pass to Alistair. (audience applause) - Thanks very much, Dimitry. So a couple of points to make before passing over to Mats, we'll look at the Hadoop side of things. But all we've illustrated here are some potential capabilities. The demo ware is the word, that's a prototype way of looking at an API that might go to multiple data sources and so forth. We're trying to illustrate the fact that we're thinking about how to get Cypher Everywhere and how to make it easier, actually, to bring ultimately data into the Neo4j graph data platform. Because the theme of the next part of this talk is about data integration. It's about connecting to sources of data which could include existing graph sources of data to actually enable the utilization of that data within the graph data platform. And data lake integration is a very important part of our thinking as far as this is concerned beaus we see a lot of our enterprise customers they're either using Neo4j or they want to use Neo4j, they have a preexisting investment in a Hadoop ecosystem including Spark, and they frequently use that as a necessary source of information for the kind of processing that's required for graphs. Just one sort of introductory slide before passing over to Mats, this picture sort of renders the idea that using openCypher in Spark, we can wrangle data out of the lake, we can get data out of the data lake and assemble it into useful subunits as graphs and then those graphs can be processed, they might be put into Neo4j transaction insertions analytic graph algorithm executions and so forth. And indeed, snapshots could be preserved and taken back into HDFS and Hive using this kind of capability. And we'll talk first about that, that idea of the in and the out, the save and the load, and then we'll finally talk a little bit about another aspect which is the idea of being able to superimpose graph views on an existing lake of data using Hadoop and Spark. I'll pass it over to Mats for data lake integration demo. - Yes, hello. I'm audible? Great. So my name is Mats and I'm, as introduced by Alistair, part of the Cypher language group and also the Cypher for Apache plus Spark implementation team. I'm going to show you an example scenario on how to integrate existing data in a data lake scenario, in this example it's Cloudera Cluster and data in Hadoop and Parquet files. So you can see here the AWS console. We have a few instances running, in particular, we have two Neo4j instances down here hosting assertion networks, and if you're part of the presentation earlier today, those are the same social networks that you saw there. Additionally, we have a Hadoop name node, master node here, Cloudera manager, Kerberos server, Cloudera director and some HDFS data nodes. And I'm logged in, no, not here, here. Logged in here as onto the name node where I have the demo deployed running in the Kerberos cluster. Currently not authorized by Kerberos. So if I try and run the demo, it's gonna throw me out. And the demo itself, as you can see, just says no, no, no. So I'm going to use kinit to authorize myself in the cluster and I'm going to run this demo using the credentials that we have here. See, now that we're authorized, and if you want to authorize using key tab file, for example, I'm also going to show you this Scala code for the demo that is being run. Good font here. Mila hears this weirdness. You can just configure the principal and the path to the key tab file on the pluggable graph source factory libraries that we have initialized with our CAPS session here. These are developed by a technology partner of ours, 51zero from London. And what essentially is done is that we've implemented a Parquet file-based graph representation on disk. And additionally support for a Hive SQL view of those Parquet files so that we can load graphs as stored in tabular formats in the data lake and pre-ETL, so to say, as graph formatted files. And then we can store different graphs back or the same graphs. And it also loads here Neo4j graphs and social networks and then it loads here the product graph, looks it up via a URL. And I'm just going to run the demo again and I add a few breakpoints in the demo. And while that is actually loading, as you can see, it starts now to load the product graph which is a few hundred thousand nodes and some millions of relationships. You can actually see here, so this is, can you see this? It's kind of small, maybe I can zoom. So this is Hue, which is a tool that you can get if you're using the Cloudera distribution of Hadoop and the Cloudera Manager comes with the Hue view. So it enables us to, for example, see the files in the HDFS cluster here. So we have here at the user, mats, products, that stores a graph in our format. So there are some metadata and some other, metadata folder for storing things like the schema of the graph which is needed by the Cypher processing engine. And then the actual graph representation. And here we're storing just one, but you can imagine many different graph representations depending on what kind of queries you want to issue to the graph, different representations make more or less sense from a performance perspective. So the one we've developed here is node edges by label, we call it preliminary. So it partitions the graph over nodes with the same label, relationship of the same type, end up in the same file. So while that is loading, it's done, press enter to continue, now we're at this point in the demo. It's loaded all the graphs and we want to start query the graphs. Press enter here, you see it outputs some Cypher queries in the log, what it's actually doing is for the social networks it extracts sub-graphs of people who are located in the same city and know each other by one or two hops. And then we say that these people are acquainted with one another. And we project a graph called result of these nodes and the acquainted relationship. And we do the same for the North American, we do that for the North American social network and the European social network. And then we're computing the union of these to get one single graph of all the acquainted people in the social networks. And this one we're actually storing back into HDFS plus Hive at the location user/mats/all-city-friends. And this is now done in the demo, it's done all the data stuff. So if we go back here and refresh this file path, you can see, voila, there's a new folder all-city-friends that we've stored at. You can see it has the same kind of structure here. It has the same files. Not the same files, obviously, different data. Same format. And additionally, if we switch over to the database view of things there are no tables, but if I refresh here, can I see that it created five Hive external tables that we can query using this here. This is just me cleaning up the last time. Just query one of these, for example, the person. So it stored back, as you saw here, it stored back the all cities friends graph which was a union of the subgraphs which contained persons who were acquainted with other persons. So the schema of these all city friends graphs are just persons acquainted to other persons. Nothing else. So there's a table for persons, there's a table for the edges acquainted, and then there are some metadata tables. And if you click here, you can see the SQL schemas of these tables, this one doesn't have much, just the elementary properties of relationships, an ID, source and target, and a type. And for persons, there's an ID, and there's first name, a last name, and email. And also the label. So we can execute the SQL query and get a SQL view of the data that we wrote back from CAPS. So we can see that there are some persons, they have some IDs, some emails, and names. So we let the demo continue. From here on, it now computes the more elaborate queries that we've produced to this demo. And also, now it reads back in this all city friends from the storage that we just wrote it to to match for people. And then we switch over to looking in the products graph again, located somewhere in HDFS using the URI and we match for users. Because in the products graph we have users who bought products, right? And in the social network graph we have persons who know other persons. But maybe some of those users in one graph are the same real-world people as the persons who know people in the social network graph. And in this case, we're saying that if you have the same email, then you're the same real-world person. So then we project a new graph that U is actually P. Call that graph links. And then we union everything we got. We union the social networks with the links, with the products and the city friends who were acquainted with one another in order to compute our big recommendation query. And there's no multiple graph features in this one, this is a query that if you're doing recommendations with Neo4j you've probably wrote queries like this before. So persons who are acquainted with other persons, i.e. they live in the same city, we've defined this previously, live in the same city and they know each other by one or two hops. And the second person likes some interest, that interests comes in here again. Because the A person is a user who bought a product that belonged to a group with the name of the same interest name as the second person that A was acquainted with. And then some predicate that says that the product is good or that A liked it, then we could recommend this product for B. And that's exactly what we return here, we return the product, first and last name of B. And we're gonna print that out. And as you can see here in the application right now, it's about to print. So it's printed the header because we can get that immediately. Since CAPS builds some data frames, it's all lazily evaluated so Spark didn't actually do anything until it had to write the rows of this table and if you're familiar with Spark you'll recognize this log output, this is actually the Spark job now executing all its stages in order to print this table that we see here. Apparently Life of Pi is a good recommendation for the Bill Miller person. And then it executes writing recommendations back to Neo4j. So what do you wanna do at the end? Well, now we have some recommendations, we could either push that back again into HDFS, we're not doing that in this demo, or you could push it out to some other storage, let some analyst look at it, see were these good recommendations or not. Do I need to change my recommendation algorithm? Do I need to do something else? But in this example we're actually writing this back to the social networks again. So we're taking all the results and we match for persons with those names and we're setting this p.should_buy property with the product name. So new queries can then be issued to the social networks looking for these should_buy properties, and then you can understand which persons should be recommended what products. And that's the end of this demo. And again, you can inspect these graphs here, but I think we're running out of time. So I'm actually gonna switch back here. - Thanks very much, Mats. (audience applause) There's a lot of stuff going on there but the one thing that I wanted to really draw out is, there's a reason why we didn't just write the data into HDFS but we put the Hive QL DDL view on top of that. And that's not so obvious when you look at this demonstration because we create this format which effectively maps the data frames in Spark and we push it down into HDFS. We take files in directories which are mapped to tables, which effectively just lift it up into memory and become data frames in Spark. But there's another evolution of this, because one of the big problems that people have is their existing data. Data that was never modeled as a graph and the problem of how to actually extract or distill a graph view from data inside the data lake. And that's the thing that I want to finish on. And I'm gonna have to talk about this pretty quickly. I was at a meeting of the SQL, ISO SQL working group three which is a committee that defines the SQL standard last week and we were talking about the issue of how to think about graph data in the relational context and that's actually relevant for us in Cypher as well. We were discussing exactly the model that we've just seen from Mats where we've got labels, effectively tables for labels which are data frames in the Spark world. And this is the bit that matters, I think. Because if you imagine that we've got tables, which might actually be views, and we could, in the big data world using Hive, make those views. We could make those views but pluck any columns that we want from any tables that we want to represent a table for a label from a graph perspective. Once we've assembled those views, if you imagine you have the tools to do that, you could then say, well those views or tables are associated with labels, and there are things about labels that are different from views like for example, they might have keys. So we might have a unique key for a particular node, label, or unique key for an edge label, or a relationship label, and once we've got that kind of label understanding superimposed on these tables, if we understand that, then describing the graph is basically a matter of assembling a set of labels for nodes and a set of labels for edges, and possibly in some future evolution, we could have more complex schema that would constrain those relationships and say you can only connect persons to cities and person PEPs to cities, you can't use the lives-in relationship for anything else. That's music of the future, but it's something we're interested in both in Cypher and there's been some discussion in the SQL world around that. But the thing that I really wanted to leave you with from this is, if those views and tables represent your user-defined understanding of the data inside the data lake, and it could be, by the way, you're using to find understanding of the data in your Oracle database, and you can then have a mechanism for defining how that actually maps together a view as labels, we can then put that together and say that's a graph. And that could be the framework for snapshotting the data that you have in your data lake into a graph format for graph processing. You could do that every night for your nightly view, you could use this as a template for how you actually pull out data and shape it in the kind of bit of data wrangling facilities that we've seen over these two presentations. So I think that's a very powerful direction that we intend to pursue within Neo4j in order to be able to easily absorb your data and lift it up into memory without moving it. Take the data in HDFS, bring it into memory, and manipulate it as a graph. I've run out of time, last slide won't happen, but anyway, thank you very much indeed.","title":"Cypher Everywhere - Neo4j, Hadoop_Spark and the Unexpected — A. Green, M. Rydberg, D. Solovyov, Neo4j-IrWnUFXjeMQ.en.vtt"}},{"_index":"documents","_type":"documents","_id":"15845ab6-93fb-11e8-9133-b630aa296667","_score":1.0,"_source":{"keywords":[{"score":1.0,"value":"datum analytic thing"},{"score":0.9287604593537697,"value":"enterprise architecture information repository"},{"score":0.7980366179401086,"value":"system information"},{"score":0.7677000941537718,"value":"heavy weight datum load process"},{"score":0.7644788645117436,"value":"enterprise system"},{"score":0.7089453682375825,"value":"sight datum warehouse"},{"score":0.7013357973857607,"value":"different system"},{"score":0.6921013305058004,"value":"enterprise architecture modeling tool"},{"score":0.6899362210770739,"value":"datum source"},{"score":0.6848420012123678,"value":"datum visualization"},{"score":0.6820214866866993,"value":"single datum asset"},{"score":0.6676323452906608,"value":"program datum"},{"score":0.658535022661721,"value":"datum asset"},{"score":0.6514347286607927,"value":"datum element level"},{"score":0.6487814670042017,"value":"little bit information"},{"score":0.6004315595253272,"value":"database information"},{"score":0.600237745228606,"value":"datum warehouse"},{"score":0.5947536384419969,"value":"eair system"},{"score":0.5871600344128879,"value":"different component organization"},{"score":0.5855788980435069,"value":"datum profile"}],"entities":[{"id":"Enterprise Architecture Office","type":"ORGANIZATION","value":"Enterprise Architecture Office","frequency":1},{"id":"DHS","type":"ORGANIZATION","value":"DHS","frequency":19},{"id":"Glance","type":"MISC","value":"Glance","frequency":1},{"id":"web developer","type":"TITLE","value":"web developer","frequency":1},{"id":"Border Immigration and Customs","type":"ORGANIZATION","value":"Border Immigration and Customs","frequency":1},{"id":"manager","type":"TITLE","value":"manager","frequency":1},{"id":"2002","type":"DATE","value":"2002","frequency":1},{"id":"Explorer","type":"TITLE","value":"Explorer","frequency":1},{"id":"Transportation Security","type":"ORGANIZATION","value":"Transportation Security","frequency":1},{"id":"Oracles","type":"MISC","value":"Oracles","frequency":1},{"id":"Neo","type":"MISC","value":"Neo","frequency":1},{"id":"Blackstone Technology Group","type":"ORGANIZATION","value":"Blackstone Technology Group","frequency":1},{"id":"Cypher","type":"MISC","value":"Cypher","frequency":1},{"id":"CPB","type":"ORGANIZATION","value":"CPB","frequency":1},{"id":"Blackstone Federal","type":"ORGANIZATION","value":"Blackstone Federal","frequency":3},{"id":"Customs","type":"ORGANIZATION","value":"Customs","frequency":1},{"id":"Jessica Dembe","type":"PERSON","value":"Jessica Dembe","frequency":1},{"id":"cold","type":"TITLE","value":"cold","frequency":1},{"id":"general","type":"TITLE","value":"general","frequency":1},{"id":" Coast Guard","type":"ORGANIZATION","value":" Coast Guard","frequency":1},{"id":"OCIO","type":"ORGANIZATION","value":"OCIO","frequency":1},{"id":"EA","type":"ORGANIZATION","value":"EA","frequency":1},{"id":"2018","type":"DATE","value":"2018","frequency":3},{"id":"Facebook","type":"ORGANIZATION","value":"Facebook","frequency":1},{"id":"Congress","type":"ORGANIZATION","value":"Congress","frequency":5},{"id":"Cypher","type":"PERSON","value":"Cypher","frequency":1},{"id":"execution","type":"CAUSE_OF_DEATH","value":"execution","frequency":1},{"id":"chief technology officer","type":"TITLE","value":"chief technology officer","frequency":1},{"id":"Patrick Elder","type":"PERSON","value":"Patrick Elder","frequency":1},{"id":"second","type":"TITLE","value":"second","frequency":1},{"id":"March","type":"DATE","value":"March","frequency":1},{"id":"Neo","type":"LOCATION","value":"Neo","frequency":2},{"id":"United States","type":"COUNTRY","value":"United States","frequency":1},{"id":"NLP","type":"ORGANIZATION","value":"NLP","frequency":1},{"id":"CSV","type":"ORGANIZATION","value":"CSV","frequency":1},{"id":"Oracle","type":"TITLE","value":"Oracle","frequency":3},{"id":"June","type":"DATE","value":"June","frequency":2},{"id":"Jessica","type":"PERSON","value":"Jessica","frequency":1},{"id":"EAIR","type":"ORGANIZATION","value":"EAIR","frequency":4},{"id":"Patrick","type":"PERSON","value":"Patrick","frequency":5},{"id":"Application Express","type":"ORGANIZATION","value":"Application Express","frequency":1},{"id":"Secret Service","type":"ORGANIZATION","value":"Secret Service","frequency":1},{"id":"Coast Guard","type":"ORGANIZATION","value":"Coast Guard","frequency":1},{"id":"CIO","type":"ORGANIZATION","value":"CIO","frequency":1},{"id":"Wikipedia","type":"ORGANIZATION","value":"Wikipedia","frequency":1},{"id":"Department of Homeland Security","type":"ORGANIZATION","value":"Department of Homeland Security","frequency":2},{"id":"architect","type":"TITLE","value":"architect","frequency":3},{"id":"TSA","type":"ORGANIZATION","value":"TSA","frequency":2},{"id":"last two days","type":"DATE","value":"last two days","frequency":1},{"id":"chief information officer","type":"TITLE","value":"chief information officer","frequency":2},{"id":"Windows","type":"MISC","value":"Windows","frequency":1},{"id":"Information Sharing for Enterprise Architects","type":"ORGANIZATION","value":"Information Sharing for Enterprise Architects","frequency":1},{"id":"Federal Emergency Management Association","type":"ORGANIZATION","value":"Federal Emergency Management Association","frequency":1},{"id":"85","type":"PERCENT","value":"85","frequency":1}],"text":"- Welcome to our presentation. This is called Information Sharing for Enterprise Architects. - You guys are all troopers for hanging in til the end, by the way. Really appreciate that. - Yup. So my name is Jessica Dembe. I am front-end web developer. I got started in this project in March. I've been playing around with Neo4J, but I've been really fascinated by the data visualization that Neo4J gives, and I've been playing with it ever since. - My name's Patrick Elder. I'm the product architect for our program EAIR. I've been with the program about two years now. I was intrigued by Neo4J, kind of seeing how that was able to provide some visualization and show relationships and was able to kind of bring this on and convince our customers that it was worth pursuing. So we're gonna tell you our story and how we did this. - So a little bit about the company we work for, it's called Blackstone Federal. It's a division within Blackstone Technology Group with three divisions, federal, staffing and financial services. Our branch within Blackstone Federal focuses on delivering for federal clients and product delivery, methodology, which is based on our four principles of engaging with the customer, building products, measuring for metrics and how to improve, and to learn and to build on for future best practices. - And that's born out of our main three practice areas, which are cyber security, Agile, and DevOps So we're all about iterating over things and trying to improve continually as we do this stuff. You may be asking, how did we end up getting into this data analytics thing? Has nothing to do with best practice areas, but this is where we have our growth opportunity. So we had a lot of latitude as started to try to take our path down here and take these principles with us. - Right. - Okay, so, about our customers, so we are federal contractors. We work for DHS, that's the Department of Homeland Security, which was founded in 2002 as part of a reaction to some increased awareness about protecting the homeland. And really the mission of DHS was to bring together a bunch of disparate component organizations in the United States such as the Federal Emergency Management Association, TSA, work for Transportation Security, CPB, Customs and Border Immigration and Customs, and Coast Guard, Secret Service, just to name a few. Really at headquarters DHS, which is who we work for, our mission and their mission is to collaborate across all those component organizations so that they can share information, hopefully reduce risk and reduce duplication of effort. Our group sits under the chief information officer, underneath the chief technology officer in the Enterprise Architecture Office at DHS headquarters, and that's where our program resides. - So onto our work, on what is enterprise architecture. - Yeah, so anybody here aware of what enterprise architecture is? Anybody, any enterprise architects in the house? All right. Thank God. Well, for those that aren't, I was able to pull this off of Wikipedia, a very trusted source for reliable definitions of these kinds of things. And I was able to find that enterprise architecture is a well-defined protocol for conducting enterprise analysis, design, planning and implementation using a comprehensive approach at all times for the successful development and execution of a strategy. Everybody got it? Great. In real life, what it's really about is the confluence of where IT and business process comes together. So what is the business purpose of our organization and how is IT gonna help us do that? And from enterprise architecture standpoint, it's developing the target state of what that's going to be. So that breaks down to these four different areas. So we have a target state, describing and defining what's the to be architecture. Okay, what is that technology, what are the things that we're trying to do, what are we sharing? What are we doing individually? From a framework perspective, trying to develop processes and things that are repeatable and standard across the organization so that everybody's on the same sheet of music whenever possible. From risk reduction, we're talking about reducing duplication and we're talking about reducing risk by using those frameworks and doing these same kind of processes at a high level. And lastly, the alignment piece. This is the part where we're starting to tie together those different pieces of the organization, so what is the mission of the organization? What are the capabilities that it serves? How is it funded? What IT is it using? What system they're using, the stuff, what programs are doing these things. And this is the alignment part is where Neo4J really came in for us. - So a little bit about our program, EAIR or the Enterprise Architecture Information Repository. Well, it's been going strong for the past seven years. It started off as an Access database on someone's laptop. So he knew the system very well, and he was able to track into it, but it wasn't really available to the public. - Yeah, and from that it grew out and just started answering some data calls from Congress, and all of sudden Department of Homeland Security realized they might need an enterprise system for this. And out of that, this system has grown to be the CIO's official line of sight data warehouse. So all the reporting that comes out of DHS to Congress to any external organizations comes from this program. - A little bit of our line of sight, how do we stitch together data from external sources? We have as the data warehouse for the office of the chief information office, you can imagine that there's a lot of data sources that we have to piece together and have people understand. - Yeah, and these can be relational databases, so some of these are just direct connections to relational databases, other cases we're hitting web services, other cases we're getting flat files, that could be CSV or XML. So a lot of different ways that we're collecting data and trying to put this together and be that integration point for not only all the offices within the chief information officer, but all those components across DHS as well. And with those components we are in use by over 15 different component organizations within DHS, so as you can imagine, again, different components, different sources, how do you make sense of it all? - Right, and one of the things that's been a real challenge for us and for DHS as a whole is that all these component organizations have a very diverse mission. TSA cares about transportation security. They could care less about emergency management. U.S. Coast Guard cares about protecting the waters, they don't really care about customs and border patrol. But all these guys do have some overarching common mission, and we're trying to make sure that all that stuff is shared and that we can reuse any pieces or parts of those components that we can. - The thing that the EAIR does very well is that we've been able to aggregate data for the past seven years, and with Neo4J and how to connect those lines of data, we were now seeking to align ourselves as not just a data aggregator, but as an information provider for all components across DHS. - And one of the things that EAIR is known for is that term, line of sight. We use that a lot to talk about what we have. And what this is is really showing how all these things that are within the organization are related. You know, and again, we'll go through this, we've got a diagram that we can talk through, but thinking about how all these disparate component pieces of data are related to each other and then how can we show that within our application. - So who uses the enterprise architecture information repository? Some of our main biggest users are not just our enterprise architects, they're also managers and they're also decision makers. We have seen other sources. Folks use them for what systems is DHS using, products, investments, and different aspects to the line of sight that we will be introducing to you shortly. - Yeah, and so we've got this nebulous decision makers thing and a lot of people want to know what that is. That can be somebody that's basically deciding are we gonna step forward with another technology on a system, or it's gonna be talking about handing off that data to Congress or another entity outside of DHS to make a decision based on the information that we've gathered together. - So next here is the EAIR line of sight. - [Patrick] Yeah, so this diagram represents the line of sight in general terms, and you know, those of you that have been here for the last two days can recognize this looks an awful lot like a graph. We've got strategic objectives, so these are the things that come down from Congress as saying these are the objectives of your organization. We've got the mission and program data, so saying this is the mission that you serve, and these are the programs that are going to meet those missions. Budgeting and investments, this is how this stuff is funded. Where are we getting this stuff from Congress, and when we ask for money, this is where it's all documented. Capabilities and activities, these are the actual things that the systems do. What do they do? That could be, you know, biometrics, screening, things like that. The systems themselves, so these are the applications that we're talking about. Like EAIR would be an example of a system at DHS. We have the technology that supports them, so that's the Neo4J's, the Oracles, all the other things that we'll talk about in our stack a little bit later. Data assets, so we think about these as these are the containers of data inside of DHS, so maybe it's a database, maybe it's a part of a database, but this is a data asset that we have, and those can have agreements and information exchanges, so when we talk about interfaces that we have across these different systems, that's where we're documenting that stuff. And then they share sets of data. So the sets of data contains all these different pieces and parts of data. Could be many in a single data asset. And lastly, the performance measures and goals. These are the things that we're gonna be evaluated against as we go through each one of our planning sessions. And so the more accurate and better that we perform on those performance measures, the more likely we are to get the budgeting that we request when we go to do that. - So I just want to share with you EAIR Today So this is our application. Just a little bit information about our stack, current it is an AngularJS application with Oracle APEX on the front end, and with all our database information stored on an Oracle database. So as you see here, this is the front page of our application. Basically it's just showing you this is a most comprehensive site information for the OCIO and quickly find the data you need safely and with our goal to improve data quality. - Just to give a bit of history here, as this program evolved from an Access database into an enterprise system, DHS has an enterprise license for Oracle database, so they decided to use that as their implementation. With that comes the application called Application Express, and that allows you to very quickly deploy web pages that are based on your database tables that allow you to manage the data, add, update, modify, delete, and really quickly represent tables and charts and things like that that you'll see when we start to show some more detail. As we started to modernize our product, we realized that that was a more antiquated way of looking at the data, so we wanted to try to add some new kind of look and feel, so that's where we started to pull an AngularJS. We're continuing to do that as we talk about bringing in Neo4J and some of the other visualization stuff we'll get into a little bit more. - So I'm gonna walk you through a search of a brand new enterprise architect at DHS, so they're gonna look at the EAIR system. A little bit of information about this. The search engine is on Elasticsearch. So it pulls data from our back end and displays it on the screen here. And then once you click on the system of EAIR, you get this lovely page. So you have system information, you have a description, you have a major help, what kind of application it is. Then you keep going, all about the aliases. - [Patrick] And these are just, you know, there's a tremendous amount of information that we have for each one of these content types, but as you can see, it's a lot to digest. If you don't know exactly what you're looking for, you're going to be scrolling a lot, and you'll start to see these associated lists that represent the way that we tie stuff together and the way that it's represented. - So for example you'll see here, associated function areas, segments and functional activities. You keep going, investments and child systems. Data profiles. Does anyone remember what they just saw up on the screen? No? So that's why we thought about the information, and it's like what does this mean for enterprise architects? Well first, it means the user is responsible for figuring out how systems are related to functional activities, investments, and other systems. Again, a lot of scrolling you have to do, and I have to be the one to connect the information. That's not really a good time. - Yeah, and part of the thing that you can do is this is mainly a search and discovery tool right now, so there's not a ton of analysis that's happening on its front end. The onus is really on the user to go in and do analysis, search for the thing that they're looking for and click through and kind of walk through this stuff. Now you can imagine when you see these pages, yes, there are links everywhere where I can get more information about the related entities, but then I'm jumping to another page. And now I can't remember where I was or how I started to get there. So as you're doing this exploration, there's a different kind of analysis happening, but there's a lot of data here that's really untapped, and these are the things that we're trying to evolve as we add Neo4J to our stack. - The next is the user responsible for understanding how systems support the capabilities and activities of DHS. Of course, those are broad, those are disparate, so why not use a relational database to connect all those pieces together? - Yep, exactly. And I mean, as we said before, all the other content areas are related as well, because we could just as easily be talking about investments instead of systems here, or products instead of systems. But all these things have those interrelations that you saw in the diagram. We can click through all that stuff, but again, it's something where you're really jumping from page to page, and you don't really have that lineage that you might get from a graph. - And finally, the user must comprehend how systems meet DHS' current and future goals and objectives. Again, a lot of the analysis and the onus is on the enterprise architect or the manager or the decision maker to figure out how everything is aligned and if we are aligning to the goals and objectives properly. - And this is a great example of having to be two degrees of separation away from those direct relationships that we saw. So systems are not directly related to a mission goal or objective. That's through the mission that they support, the investment that they're funded by, or the capability and activity that they do. So imagine trying to do this in your mind as you're clicking through all this text and remembering which thing you clicked on to get to, okay, so what thing does this support after all? - So that brings in our case to add Neo4J. - And so from here, when we started looking at Neo4J, we saw this graph visualization show up, and it just, it hit for me. We had had another product on our stack that was a modeling tool, was an enterprise architecture modeling tool. And it was supposed to take all the data that we had in EAIR and be able to present architecture diagrams for the enterprise architects. Sounds like a great idea. They all wanted to use it. But they couldn't. It was too difficult to use. We couldn't figure out how to load the data, and they couldn't figure out how to query their architecture diagrams. That's a major problem. When I saw the Neo interface, I said wow, this is easy. I can add the data very easily, and I can get something that at least shows this stuff related to each other right out of the box. We don't have to do anything extra except load the data. And so what we decided to do was bring this to our client and say, \"Hey, do you think this is something \"that enterprise architects might like?\" And they almost jumped out of their chair. \"Yeah, we think this is worth showing to people.\" So back in June, we had started to kind of develop a prototype. We had gotten to a point where we felt pretty good about what we had. We wanted to give it a shot. So say we had the enterprise architects collaboration forum, where all the enterprise architects from across DHS, all the components are there, all the folks that are supporting them, and even a couple of other agencies were guests on this one. We showed them the Neo4J interface and they loved it. They wanted it the next day. And of course we had to pump the brakes on that one, because we can't cut everybody loose without trying to explain Cypher and all that good stuff. But it did give us enough momentum to try to add that to our stack. And so this is the case that we came up with. Obviously the diagram piece was huge for us. That's something that we just needed to be able to show how all this stuff's related in one quick look. The At a Glance thing was key. We wanted to make sure that yes, you can get all the detail for the associated lists if you want, but if I want just a little bit higher level than that, I can see it and I don't have to do a lot of reading. And then the next piece was the analysis part. We didn't even know what analysis to develop for our customers, because they've never had the opportunity to ask these questions of us yet. So we're trying to hopefully see as they walk through this and then they start clicking through to see all these different networks and how they're all tied together, tell us what you want to know, so that we can automate that and deliver it to you. And then lastly, our innovation, kind of mission for our team. This was something that we thought was gonna be a huge thing. We were bringing another tool into the suite that could really augment what we've already done without necessarily replacing anything that we've already done. - So right after this, before the collaboration formed, sorry, we had to kind of discuss our value proposition on how EAIR and Neo4J can bring us to next level EA. - [Patrick] And from that perspective, a lot of the things that we've already talked about, right? So EAIR, we're doing a lot of integration from disparate sources. That's not broken. We still need to do that. We have search-based navigation. Search is used by almost all of our users. 85% of our hits are searches. And that's good, we want to keep that. We still want to be able to search and discover information about the enterprise, but we need to add more. And then we've got detailed information, and we've got a lot of detailed information. Still important, but we want to make sure that there are other ways to do this and give people options. You know, not everybody analyzes data the same way. So by bringing in Neo4J, we had a simple data loading mechanism. We wouldn't be worried about having a heavy weight data load process to sync up the data between Neo and our existing database. We've got the flexible query interface that we cold use based on nodes and relationships. So we could actually start thinking about this data in a totally different way. And then lastly, the built-in visualization was something that jumped out, but then we started to think about how could we integrated that into the EAIR tool so that it's a seamless interface for our users that requires a little bit less for them to learn in order to take part and make a real benefit from this. So with that, that's what we tried to bring in. Synchronized data that goes across the two databases that's available in one place. We wanted to have searchable and clickable navigation of data that wouldn't necessarily be a page jump, but really just allow you to navigate within place. And lastly, multiple views of the dataset. So if I want to look at the list values, I can absolutely do that. If I want to look at it as a diagram, I can do that too. - So going back to this diagram on line of sight, this is how we were able to structure our database and investigate those relationships and dig into them further, and how to say those strategic objectives, influence the mission programs, and the performance goals into the investments for the budget year, for example. - Yeah, and so, as far as our database design and what we decided to do with Neo4J, because the visualization piece was the part that was the most attractive to us, we got to thinking that this would be something that we could make very lean. So as far as we were concerned, the only things that we really needed to maintain were the names, the acronyms, and the relationships. That's all we needed to show what we needed to show to do our value prop. So that's the way that we did it. And this is the part that we're seeing as we go forward and kind of look to our customers to tell us like what kind of things are you asking? What other pieces of information do you need to know while you're doing this traversal of relationships to get the information out that you want? So that was our kind of lightweight approach to this. We used the LOAD CSV mechanism to do that. And our database is pretty small, comparatively. I think I've heard some pretty big numbers with B's and all that stuff in there. We're probably about 30,000 entities and 50,000 relationships. - So... - I'm sorry? - [Audience Member] 30,000 entities and... - And 50,000 relationships. - So a little bit about what we've shown so far... Come on. So here are some examples of the relationships. So first I would do the search on the activity. If you look at the activity here, you'll see how this spreads out to the primary activities. More primary activities here, and how they're so disparate. Now, this was the list that you were seeing earlier and just seeing, associated activity, associated activity. Instead of clicking into each associated activity, this would provide the mechanism to show you an activity that you'd be interested in. These dig in further, but if I do a search for material, for example, that was not as extensive, but... - [Patrick] The significance here is that as we're looking at this, the purple ones are activities. So these are the actual things that these systems do. As we're looking at these systems that do the same kinds of activities, are there ways to share those services? Are there products that they're using that do the same thing that we should be sharing licenses with? Does one of these systems do it better than everybody else and they should just be a shared tenant of what they're doing? Or are they really that different that they need to be segregated and be their own entities? That's what we're trying to think of find as we're doing this enterprise architecture effort. And as you go across, you start to see the network, that kind of analysis is really what we'll be able to pull up. - So we also have the same thing for capabilities, which can get a little more complex, like so. So you see these once again now, these two degrees of separation between, say, this functional area, which is demand situational awareness, and how it's associated with monitoring homeland security risks and threats. - [Patrick] And so this capability and activities and functional area's our hierarchy and what we have right now, so you can start to see how those are all related as you traverse the tree. The way that we had that represented in the system now is a standard kind of tree navigation that you would see, kind of think of Windows Explorer or something like that. This allows something that's a little bit more visual, and you can collapse and expand depending on what exactly you're looking for. - And again we have a similar example for functional area and goals. I wanted to especially show you the goals, because then we now are mixing in the missions with the goal and the objectives and to these tree nodes that will traverse each other. - Right, like we mentioned before, this is a good example of that extra degree of separation that we weren't able to represent in the original system by just having just the one distance away. So by being able to have this explorability and be able to show those different length of depth of relationship is something totally new that hasn't really been tapped by our clients yet. Yeah? - [Audience Member] How long did it take you (crowd murmuring drowns out audience member) - For the Neo part? - [Audience Member] Yeah. - Not very long. I think I was able to get the prototype database up and running within a couple of days, and that was using a test set of data, but I was able to drop that in with the production set the next day, pretty much. So it was pretty clean. Like I said, we kept this pretty lean and followed the entities that we had in the existing relational system. Not that we haven't thought about revisiting that as we go forward, but we want to see some more use cases as far as the analysis part goes. Right now it's more the visual representation part. - And speaking of visual representation part, so this is the part that we demonstrated to our stakeholders in June. So we gave the example again of the EAIR system and how it uses different software, different activities, and different investments and data assets that makes up that system. You also see here, this is a data asset. What's the difference between the EAIR and the data asset? Well, the data has to come from somewhere, and that's where we're showing the relationship from where it's getting the data from, the investments, and the activities. - Yeah, so for data assets, we can talk about producing and consuming data versus just being a system that uses it. So this is in D3. This is what we wanted to try to do to bring in this visualization inside of our app and embed it so it's just a one user interface. Jessica's been doing a lot of great work with that, and as a part of what we've been doing to try to modernize the look and feel, we've gotten a lot of influence from some new leadership on the federal side to try to push for this open source stuff. So we're also using D3 to do some representations of some of the data that we have relationally. It's allowed us to kind of create a one stop shop that allows us to kind of put more of a clickable interface, so less of a, hey, I'm gonna do a Cypher query to get the data that I want, and I'm just gonna push buttons at the top. Do I want to see capabilities, do I want to see systems, do I want to see products? And allow that to take the place of the associated lists that we had on the original app. So allow you to see that list as a network diagram. - All right. And back here. A second. All right. So before, again, just to reiterate where we've been before is a heavily text-based application that requires users to kind of fix the puzzle pieces for themselves. Again, a lot of navigation. I have to go to this system, to this data asset, to figure out where the pieces go and to kind of follow the trail mix on where, how the system is created and what's producing and what's consuming and all that information. And then, like I said before, the user is expected to remember these relationships. With Neo4J now, we can use D3 JS to visualize each representation on every page. Navigation will be in place to help the user decide what they want to see and what they wouldn't want to see, and most importantly, for the user to easily trace relationships. So going forward, we're looking for a complete redesign of our system with a new logo, new name, color, everything, just a new system to be completed in 2018. We would like to continue to leverage open source technologies, including Neo4J and all it has to offer, and to transform ourselves from an information aggregator to an information provider. - Yeah, so, like I said earlier, part of this is making sure that we can have visualizations on every page, so not quite so much text, that's what we're looking for. But then also the data analysis piece, like what are the visualizations that are gonna provide additional information to just the aggregation that we've done by pulling in all these different sources. We've got a question? (crowd murmuring drowns out audience member) - Yeah, so, I think we're gonna basically start out with a community edition to kind of see how far we can go with that and see what our performance looks like. We're trying to share the servers that we have for our application as it is. So we haven't done a whole lot of load testing yet as we're still in development, but we're gonna be pushing into test I think actually next week. So we'll be able to start looking at that with a full dataset and then see if we can get some of our users in there and start testing it out. (crowd murmuring drowns out audience member) Yeah, that's what we have for our application server right now. Like I said, not a huge amount of data in our system, and our users range from 500 to 1,000 each month. - All right. - Yep. (crowd murmuring drowns out audience member) Yeah, so for our stuff, we can drill down to the data element level, but that stuff all resides in our repository. So we do get copies of that from our sources. We don't do a whole lot of linking out, with the exception of our governance tracking tool, which is also kind of lockstep with us as far as being a new tool that's gonna launch in 2018. But we do have that lineage to go all the way down and see the whole thing, and add as much detail as you want. There's even a lot of detail that's in the system in the database that we don't display because of certain sensitivities and stuff like that, but it's available upon request. (crowd murmuring drowns out audience member) (laughs) Yeah, I was hoping that we could get away with not having to be MEGA or True or any of those guys. I think what we wanted to try to be able to do is provide some visualization and some iconography that would be custom for DHS and allow them to see the stuff that they want to, and then ultimately provide some analysis on top of that. I don't know if we can compete with MEGA though. (crowd murmuring drowns out audience member) Yeah, yeah. We're trying to take the approach of can we get the first big bang for the buck out in 2018, shoot for that and see if we can get some good feedback, and then let that kind of drive where we want to go. We've also been looking our document repositories right now are pretty much keyword searches and Sharepoint. We want to start seeing if we can pull that stuff in and do some NLP to do some topic modeling in Neo to help drive a lot of the review boards that are trying to do some analysis on those documents. - All right. So feel free to connect with us. At Blackstone Federal, we are passionate engineers who would like to address any technical needs especially across the client space. We are on LinkedIn, Facebook and on Twitter. And we always love new ideas and love how we can all improve upon our system. You can find us on email or you can just Tweet at us. I'm sure someone will tell us about this. So that's what we got. Cool. - Yup. - Any questions? - That's pretty much it for our content so far. Any other questions? - Cool. - All right. - Great, thank you very much for your time, guys.","title":"Information Sharing for Enterprise Architects — Patrick Elder & Jessica Dembe-aMPm4Zo58E4.en.vtt"}}]}}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment