@kirk-marple
Created December 11, 2023 06:18
Transcript of Mapscaping podcast episode
[00:00:02] ...that day's data or that week's data, but once it starts to age out a little bit, it goes dark. And that "dark data" concept is starting to become an industry term.
[00:00:13] Welcome to another episode of the Mapscaping podcast. My name is Daniel, and this is a podcast for the geospatial community. My guest on the show today is Kirk Marple. Kirk is the founder of Unstruct Data, and today on the podcast we're talking about unstructured data, but we cover a few other interesting concepts along the way. Kirk is going to introduce us to the idea of first, second, and third order metadata,
[00:00:37] and we'll touch briefly on edge computing and knowledge graphs. Just before we get started, I want to say a big thank you to Lizzie, who is one of the brand new supporters of this podcast on Patreon, and, of course, to all the other people supporting this podcast via Patreon. If that's something you might be interested in, you'll find a link to the Mapscaping Patreon account in the show notes of this episode.
[00:01:00] Hi, Kirk. Welcome to the podcast. You are the founder and CEO of something called Unstruct Data, and today I'd really like to talk with you about unstructured data. But before we do that, can you introduce yourself to us, please? Let us know who you are and how you got involved in geospatial. Yeah, for sure. I'm Kirk Marple. I founded Unstruct Data and have been a longtime software developer, and I actually
[00:01:24] just remembered yesterday that I've been dealing with geospatial data even back in my first job, dealing with maps on laserdiscs. It goes back that far. I've been more in the media space, the media software space, I guess, but I've dabbled in geospatial from time to time and am now a bit more focused on it. Well, I think we'll end up coming back to your experience in the media space later on.
[00:01:45] But let's start here. Tell me, what is unstructured data for you? For us, it's really everything: imagery, audio,
[00:01:54] but also 3D, geometry, point clouds, as well as documents and email. So it's a broad set of data. I came from the video and media space, and we would just call them files: file-based workflows.
[00:02:06] But for us, it's really a broad set of file-based data. Okay. So every file has a really well-defined structure. Why do you call it unstructured data? Because if it's in a file, it's in this perfect little container that we all know, and there are probably open standards around it, or there might be. Why is it unstructured? That's a great point. I think it's partly a marketing thing, just to differentiate the structured, modern-data-stack world from everything else. I do think it's a bit of a misnomer, because essentially a lot of what we do is parse files. There's a known schema or file format
[00:02:41] for all these file types, and I've been dealing with those since, I mean, TIFF files and
[00:02:46] fax files back in the day. So there's always a structure there, but for a lot of people, when they see a document or an image, they're looking at the content; they're not thinking about the bits on disk. So I do agree, I think it's a bit of a misnomer. Where does metadata play into this idea of unstructured data? Can I have structured data, or unstructured data, without metadata?
[00:03:08] I think with structured data, a lot of what you're seeing these days is people adding metadata
[00:03:16] for data discovery tools around databases and things. In the unstructured space, the metadata is often already there, and that's a bit more of a solved problem: there's EXIF data in images, or there's XMP, or Dublin Core.
[00:03:30] And that's a lot of where we start. We have that first order metadata that's in the files, and we parse that out and use it to figure out what to do next. Okay, so what is first order metadata for you? That's sort of my own terminology, but it's the EXIF metadata, the data that would be in the header of a file.
[00:03:52] If all you get is the file, you can't read the document; you don't know what the image is of.
[00:03:58] It's the bare minimum of metadata that we could get out of a file. So when I think about data that's embedded in a file like that, and then I think about geospatial data, does that make geospatial data less unstructured, because it has that extra bit, that geography,
[00:04:13] attached to it? Yeah, because commonly in, say, EXIF metadata, you can get
[00:04:19] not just GPS location, like from a phone image or a drone image,
[00:04:24] but even things like your speed and your acceleration. They're now putting a lot more information and context into the files themselves. So there is a lot of structure.
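As one concrete illustration of that first order metadata: EXIF stores GPS coordinates as degree/minute/second values plus a hemisphere reference, which you typically convert to decimal degrees before indexing. A minimal sketch, with made-up example coordinates (the surrounding extraction with a library such as Pillow is assumed):

```python
def dms_to_decimal(degrees, minutes, seconds, ref):
    """Convert an EXIF-style DMS coordinate to signed decimal degrees.

    ref is the hemisphere reference tag: "N"/"E" positive, "S"/"W" negative.
    """
    decimal = degrees + minutes / 60.0 + seconds / 3600.0
    return -decimal if ref in ("S", "W") else decimal

# Example: a GPSLatitude / GPSLatitudeRef pair as parsed from a photo header
lat = dms_to_decimal(47, 36, 22.8, "N")   # roughly 47.6063
lon = dms_to_decimal(122, 19, 55.2, "W")  # roughly -122.332
```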
[00:04:34] Even for GeoTIFFs and different things like that, you can definitely get a ton of context from a single file just from the metadata. In our pre-interview, you talked about this concept of first order, second order, and third order metadata. You mentioned a little bit before how you describe first order, but could you walk us through that again, please? I think it'd be really helpful for the rest of the conversation.
[00:04:56] Yeah. This is a concept I've used to structure our thinking internally on how we look at the data we're dealing with. First order metadata is what you get by just opening a file: the file headers. There's data there without doing much else. We call that first order metadata, so that would be the EXIF or XMP metadata in a file. Second order is, okay, let's start actually reading the data in the file. There's a document, there's an image, and that would be something like doing object detection on an image, seeing what's in the image and maybe getting tagged bounding boxes,
[00:05:38] or, for a document, getting the terms that are found. We call that second order metadata. But then third order metadata is really more inferences
[00:05:46] of, okay, I'm looking at, let's say, a conveyor belt in a picture. Someone's walking around on their maintenance route, and they took an iPhone image
[00:05:55] that has EXIF metadata in it. They run it through a computer vision algorithm, and it can see the conveyor belt. But third order metadata would be that the conveyor belt is actually linked in an SAP database somewhere. And so there's that contextualization,
[00:06:10] which could just come from inferring across a whole bunch of images or a bunch of data. We call that third order metadata; that's when you start to get into machine learning and more complex inference.
[00:06:23] And that could be something where we now know this is an image of this piece of equipment, this physical asset, which is something a customer has in another database
[00:06:33] somewhere. So it's creating that edge, essentially. We think in knowledge graphs, so everything is an edge connecting something to something.
[00:06:41] Creating those edges, to us, is that third order metadata. Is there any limit to how far we could spider out, from first order metadata to second order to third order? Could this essentially just carry on and on, depending on our compute capabilities?
[00:06:56] Yeah. It's not that different from what Google is doing with their knowledge graph for the web, or other companies, but theoretically
[00:07:03] you could essentially create a web spider. I know we talked about this in the pre-call, but I was doing this for podcasts, where you have an RSS feed that has MP3 files referenced in it. There are
[00:07:16] terms that are spoken that you can extract as second order metadata, and then you could relate, say, the show notes from that RSS feed and start spidering out links, providing more and more context. And
[00:07:29] really, that data enrichment
[00:07:30] is kind of recursive. You can
[00:07:33] continue on and get more and more data, as long as you can find a link or something to connect to. It's theoretically kind of infinite. So, again thinking about geospatial data with that geography element to it, would you ever use that as well? Say the bounding box of this data is here, and in that object detection stage you identified there were some pipes. So immediately you know where the pipes are in the world, or you have a rough idea, and you can start making inferences or building relationships from there as well? Exactly. And that's actually something we're working on right now; it's something we haven't
[00:08:10] finished yet. But say you're looking at pipes, and from the drone metadata
[00:08:16] you know the camera view, the camera frustum of what you're looking at. You can project that onto the real world and figure out, essentially, a geofence: okay, I know this pipe is in this general region. Then you could cross-reference that and search it against a database, asking where my pipes are in this region, and do a database lookup to try and identify, okay, is this pipe A or pipe B?
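A rough sketch of that geofence-and-lookup idea: approximate the ground area the camera sees, then query an asset table for records inside it. All the numbers and pipe records here are hypothetical, and a real implementation would project the full camera frustum rather than the simple nadir footprint used here:

```python
import math

def ground_footprint_bbox(lat, lon, altitude_m, fov_deg):
    """Approximate the ground area a nadir-pointing camera sees as a
    lat/lon bounding box: half-width = altitude * tan(fov / 2)."""
    half_width_m = altitude_m * math.tan(math.radians(fov_deg / 2))
    dlat = half_width_m / 111_320.0  # meters per degree of latitude
    dlon = half_width_m / (111_320.0 * math.cos(math.radians(lat)))
    return (lat - dlat, lon - dlon, lat + dlat, lon + dlon)

def assets_in_bbox(assets, bbox):
    """Cross-reference an asset table against the camera footprint."""
    min_lat, min_lon, max_lat, max_lon = bbox
    return [a["id"] for a in assets
            if min_lat <= a["lat"] <= max_lat and min_lon <= a["lon"] <= max_lon]

# Hypothetical pipe records that might live in an SAP or GIS database
pipes = [
    {"id": "pipe-a", "lat": 47.6201, "lon": -122.3492},
    {"id": "pipe-b", "lat": 47.9000, "lon": -122.0000},
]
bbox = ground_footprint_bbox(47.6200, -122.3490, altitude_m=80, fov_deg=84)
print(assets_in_bbox(pipes, bbox))  # pipe-a falls inside; pipe-b does not
```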
[00:08:41] And that's where things get really interesting, because then you can start creating those links to say, okay, show me all the images of this pipe. Currently, humans have to do that. You have to create those relationships: oh, I'm taking a picture of this pipe, I know it's this pipe. But what if you could use machine learning and this knowledge graph approach to automatically link those things together?
[00:09:03] Then you can pivot on any piece of equipment and say, hey, show me all the images of this pipe from the last 30 days, and those connections have already been created for you. So you talked about a human being in the loop before.
[00:09:16] Is a human not in the loop already, even though you have this process in place? I'm thinking about the object detection side of it. Is there not someone labeling images, saying, oh, this is a pipe, this is a boat, this is a house? How are you solving that problem? Yeah, there always has to be a human in the loop at some point, typically for the model training, but also for that review and approval step. Someone has to create a model for a pipe, so you have to have that sample data, that labeled training data, to start with. So there is definitely a good amount of work that has to go on to get to that level. But once you can start to have some commodity-type models that are good enough, they may not be specific to your environment, but they've been trained on enough data that they can find pipes in a similar environment. That's one starting point. And then there's also the review and approval: okay, what it found wasn't really a pipe, it was a
[00:10:15] gas hose or something like that. So you have to reject that data and then go back and continually train the model, telling it that the data wasn't good. You talked about this idea of commodity object detection, and I remember from a previous
[00:10:28] conversation, you talked about
[00:10:30] Azure feature detection.
[00:10:32] Is that what you're talking about? And if it is, could you describe what it is and how it works? Yeah. Computer vision has commoditized a lot over the last several years, so there are cloud services from Azure, from AWS, and from third-party vendors where you can get a model off the shelf, like Azure Cognitive Services Vision, which we use. They just have a model; it's an API. You call it, you give it an image, and they give you back the metadata they gather out of that image. But you can also go to some of these more no-code services to build a model, where you may have, say, 100 or 300 images; you do the annotation, and they do everything else, all the heavy lifting to train the model, build the model,
[00:11:17] deploy the model. Or, if you have a data science team, you could do this all via code. You might go right to the metal, write Python code, and do all of this yourself. So there's a really big swath of different types of computer vision you could do, from easy to hard, but they all give you some level of detail about your images. This is awesome. So it sounds like I can take my images, send them off to this Azure feature detection service via an API, and say, tell me what's in the image, and it'll say pipe, bucket, house, whatever it can identify, and send it back to me as a labeled image. Am I understanding that right? Exactly. The one downside is that the models are trained somewhat generically. What we found is, you run Azure Cognitive Services on, say, a
[00:12:04] drone image, and it'll find things like, it'll call it aerial, and maybe it'll say
[00:12:09] outdoor, and maybe it'll say building.
[00:12:12] Honestly, it's somewhat useful for filtering, but it's not useful for identifying the specific things in the real world that you care about. And so that's why we look at what I think is commonly called an ensemble
[00:12:25] of models, where you may have this rough-cut model that says, hey, I have a building. But if you find a building in the image, you might want to run another model and say, okay, differentiate
[00:12:39] sheds from
[00:12:41] garages, or something. That could be a more tuned model.
[00:12:46] And then you can build up this layering of different models to get to the specific things you want to deal with. There's not one model to serve all, I think, is really the big point. But we do see that more models are going to be the norm rather than fewer; the tooling is really evolving to make that possible. That is really interesting. This was actually going to be one of my next questions, around ontology.
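That cascaded "ensemble of models" idea can be sketched as a cheap generic classifier running first, with finer-grained models dispatched only on images the rough cut flags. The model functions here are made-up stand-ins for real computer vision calls:

```python
def generic_model(image):
    # Stand-in for a commodity vision API returning coarse tags
    return image["coarse_tags"]

def building_type_model(image):
    # Stand-in for a fine-tuned model that only runs when needed
    return image["building_type"]

def classify(image):
    """Cascade: run the rough-cut model, then dispatch finer models."""
    tags = set(generic_model(image))
    if "building" in tags:
        # e.g. differentiate sheds from garages
        tags.add(building_type_model(image))
    return tags

images = [
    {"coarse_tags": ["aerial", "outdoor"], "building_type": None},
    {"coarse_tags": ["aerial", "building"], "building_type": "garage"},
]
results = [classify(img) for img in images]
```

Because the fine model only runs on images the coarse model flagged, this layering is also what keeps per-image cost down on large captures.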
[00:13:13] You sort of hinted at it there, I think: if you found a house, go and look for these other things. There's also this sort of parent-child relationship. Exactly. So you found a house; we can make some assumptions. Maybe it's time to run the window model or the door model or something else. Is that the way I'm supposed to understand this? That this is perhaps what the future of object detection looks like? That's the way I see it. I think it also helps with cost management, because it costs a little bit, probably a few cents, to run each model. So say you're flying a drone over
[00:13:49] an outdoor area, and maybe only 10% of the data has buildings in it. You don't want to run the window detection model on the other 90%. You want to carve out, as I always put it, smaller haystacks. As long as you can carve down the size of the data you're running it on, you can optimize cost and performance; they go hand in hand. So we've been talking about images for a little while now, because I think they're an almost endless source of unstructured data. We're creating more and more of them, and we need to find out what's in the images and create information from that. But
[00:14:25] earlier on, you talked about how you used to do something with podcasts, spidering out and creating a knowledge graph of podcasts.
[00:14:32] And I'm wondering about sentiment data as well, because data comes in many forms, as we've talked about. Can you read through documents and create sentiment? What is this document saying? What is the feeling in it? And sort of build
[00:14:47] and add that to your knowledge graph? Is stuff like that possible? Yeah, and that's where it gets really interesting. It's really about that context. What we really talk about is contextualizing
[00:14:57] the data that you're capturing to real-world entities, which are, like, people, places, and things.
[00:15:03] And, yeah, the project I had worked on a few years ago,
[00:15:08] which was kind of the start of what ended up being my company, was taking podcast feeds,
[00:15:14] analyzing them for entities: topics discussed, people,
[00:15:21] different organizations and companies, and then spidering out and creating those links to say, okay, this podcast discussed
[00:15:27] geospatial data, it discussed Python; here are the people that were on it; here are the companies that were name-dropped during the podcast.
[00:15:35] What you can essentially do then is find
[00:15:38] those edges and use them for data discovery:
[00:15:43] find me the other podcasts that had
[00:15:46] this topic and maybe this cohost or this guest.
[00:15:51] And then the show notes become a source of value, because
[00:15:57] for the podcasts that actually have good show notes, there's a bunch of links.
[00:16:01] The other interesting part is, what other data can we gather from the entities linked in the show notes, or the linked
[00:16:11] HTML documents, and start to create commonality from that as well? That's where we started to see a lot of the value. And there was so much data that, essentially, I had to build a web spider to do it, where it would just continually pull in the data, read the document, do entity analysis,
[00:16:27] come up with a list of links it found in the document, and then spider out again.
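The recursive enrichment described above can be sketched as a depth-limited spider: read a document, record its entity edges, queue the links it contains, and repeat. Everything here, the documents, entities, and links, is an invented in-memory stand-in for real fetching and entity extraction:

```python
from collections import deque

# Hypothetical corpus: document id -> (entities mentioned, outgoing links)
corpus = {
    "episode-1": ({"geospatial", "Python"}, ["shownotes-1"]),
    "shownotes-1": ({"Python", "knowledge graphs"}, ["blog-1"]),
    "blog-1": ({"knowledge graphs"}, []),
}

def spider(start, max_depth=2):
    """Spider out from one document, collecting (document, entity) edges."""
    edges, seen = set(), {start}
    queue = deque([(start, 0)])
    while queue:
        doc, depth = queue.popleft()
        entities, links = corpus[doc]
        edges |= {(doc, e) for e in entities}
        if depth < max_depth:
            for link in links:
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return edges

edges = spider("episode-1")
# Pivot on an entity: which documents mention "Python"?
python_docs = sorted(doc for doc, ent in edges if ent == "Python")
print(python_docs)  # ['episode-1', 'shownotes-1']
```

The `max_depth` cap is what keeps a theoretically infinite spider bounded in practice.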
[00:16:31] When I hear you talk about this, it feels like this is creating the network effect for data. Exactly. I only really got into knowledge graphs maybe five years ago. I've done a good bit of database work, and I understand, okay, you have a table and you have a key to something else that lives somewhere else. And I started to look at knowledge graphs as a great way to have dynamic references,
[00:16:56] where we can invent new edges on the fly. You have all your data and you say, oh, this entity is actually related to this other entity, and here's a new edge we're going to create. In the SQL world, the classic database world, updating the schema is always the biggest pain for everybody: schema migration and all that. Knowledge graphs are so much more dynamic, and they give you the ability to pivot on any entity
[00:17:22] and any edge in the system. So we can invent new edges and then see what happens: let's pivot on that edge and see what all the things are that relate to it. I've been deep into it for about five or six years now, and the ability to represent your data in that data model is really what's key for us; we're learning things about it every day. So you talked about building these edges
[00:17:47] in terms of the knowledge graph, but there's another sort of
[00:17:50] "edge"
[00:17:51] idea I want to get your opinion on, and this is the idea of edge computing.
[00:17:57] Are you familiar with this? Yeah. Could you tell me what it means to you?
[00:18:01] To me, there is some device, Internet connected, that lives
[00:18:05] typically on premise.
[00:18:07] We've talked to
[00:18:09] food and beverage companies who have a manufacturing plant. They have a video camera
[00:18:14] on the shop floor, and there's
[00:18:17] compute being done there. There's typically some Internet connection
[00:18:21] back, so some data is flowing back to the cloud. It's kind of the IoT world, with sensors and things like that, but I think it's a way to push compute
[00:18:33] closer to the source of data,
[00:18:35] and then take a derivative version of that data and push it back to the cloud for further processing. I'm sorry to put you on the spot there. I really appreciate that definition. No, no.
[00:18:45] That's just the way I see it. No, that's in line with my understanding of it as well. I first came into contact with this idea
[00:18:53] in relation to
[00:18:55] satellite platforms, where the idea was, if we could do the compute up there in space, we could save a whole bunch of data being sent down to Earth. We could remove the bad stuff and only keep the good stuff.
[00:19:06] That's my very basic understanding of it. But I'm curious, because IoT, as you mentioned before, means more and more sensors collecting more and more data. What do you think edge computing means for this idea of first order, second order, and third order metadata? Are we going to see a dramatic shift at some stage in the kinds of metadata being created at the source? That's a great question. I haven't
[00:19:32] thought too much about it, but it's interesting. I've talked to a few companies who have essentially live video capture: basically little video boxes connected to a camera, doing some kind of analysis on-site.
[00:19:44] And what we've talked to them about is
[00:19:47] almost jumping past first order metadata, because there's not really a file there; it's just a stream. So you jump right into second order metadata, where
[00:19:55] they're running ML on the device and giving us back object detection. We've been talking about a couple of projects with partners
[00:20:02] in that area, and we were just going to have them send us
[00:20:08] the object detection. We would then import that into our system, so we could still do things with it, but we wouldn't have the original files to start with. So I think what you're going to see in edge computing is that it almost ends up being more metadata management than file management, because there literally isn't a file. It definitely fits the model, and we have looked at: could they send us an archive, say, every night? If they're capturing it on-site, could they just upload the last 24 or 12 hours, and then we could connect it up later
[00:20:40] and run more analysis on it? That's something we were looking at. Do you think there's a danger there of getting it wrong at the source and never being able to come back to it? If we're talking about metadata management as opposed to file data management?
[00:20:53] Exactly. And I think that's why, for a lot of the folks I've talked to, continuous training of the models is important. You're not typically going to train the model on the edge, so you have to have some data flowing down to validate: is my model even good? In some environments you have people clicking approve or reject, saying, okay, this is good. But how do you know? We had talked to a chicken processing factory line. How do you know if the video
[00:21:20] actually saw that the little hanger the chicken was on was broken? How do you know if that's good or bad? They can send an alert, but there has to be that closing of the loop on that data.
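A minimal sketch of that closing-the-loop step: detections coming off the edge get an approve/reject review, and rejected ones are queued as corrective examples for the next training round. The record shapes and field names here are invented for illustration:

```python
def review(detections, verdicts):
    """Split detections into approved results and a retraining queue.

    verdicts maps detection id -> True (approve) / False (reject).
    Rejected detections are kept as labeled counter-examples so the
    trainer knows what the model got wrong.
    """
    approved, retrain_queue = [], []
    for det in detections:
        if verdicts.get(det["id"], False):
            approved.append(det)
        else:
            # e.g. a "pipe" that was really a gas hose
            retrain_queue.append({**det, "label_correct": False})
    return approved, retrain_queue

detections = [
    {"id": 1, "label": "pipe"},
    {"id": 2, "label": "pipe"},  # actually a gas hose
]
approved, retrain = review(detections, {1: True, 2: False})
```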
[00:21:32] I think that's the tricky part. I have talked to companies that have that continuous training, but how you get that data back, if it's sort of an approve/reject, is probably the trickiest part of it. Yeah, that would be a really difficult problem to solve. I wonder if tagging it would be enough: using it as a filtering mechanism at the source, where that first stage of filtering was done for you, where the model was in doubt and just tagged everything, like, a human needs to look at all of this stuff, please. Yeah, or even just have a big red button or something, so the person at the plant could say, oh, there was a problem. At least it gives a signal
[00:22:08] back to them to go look at the data later. It may be something like that. So you've given a bunch of different examples of projects that you've been involved with or discussed with other organizations.
[00:22:18] Who typically comes to you? Who comes to you and says, look, we've got all this unstructured data, can you structure it for us? Yeah, we're still early; we just launched about a month ago. So we're kind of the opposite: we're going out trying to find people. But trade shows have been really useful. We've done a couple of conferences where people come up to the booth, and we've gotten some really interesting use cases. One in the geospatial area was an aerial survey company, and
[00:22:43] they typically fly over; they're actually very savvy around photogrammetry
[00:22:48] and the data capture they're doing, but they're poor at data management. They're keeping their data on SharePoint.
[00:22:56] They're not cloud native yet. They don't really have a search angle to what they're doing,
[00:23:01] and the common thread we tend to hear is that
[00:23:04] people look at their data almost with blinders on. They may look at that day's data or that week's data, but once it starts to age out a little bit, it goes dark. And that dark data concept is starting to become an industry term. They've captured the data, but they're not making good use of it. So we provide a way to look across years of data and start to see trends or commonality, and that's really where the interest lies for a lot of the folks we've talked to:
[00:23:36] bridging the gap from their daily workflows to
[00:23:41] historical analytics and things like that. That's really interesting. So actually a lot of the value is in the archive, in the stuff that's old, and in making the connections within it. What did it look like in the past? What does it look like now? How do these things relate back in time?
[00:23:57] Yeah. We always talk about how everything is indexed geospatially
[00:24:01] and temporally, and then via a tagging taxonomy, and that's where we start. It's how we generate those tags; we've even generalized that to what we call observations now. An observation can be
[00:24:13] of a person, a place, or a physical real-world asset, or it could just be a simple tag, a generic word or phrase. So everything maps back to these observations in our system,
[00:24:24] and those observations can come from document analysis, from audio transcription,
[00:24:30] or from computer vision. But once you have that data,
[00:24:33] that's where it gets really interesting. You can start to look at, hey, I've seen these observations this month; we start to see a trend analysis of these observations ramping up,
[00:24:43] and then provide alerting. And that's typically what people want. They either want data discovery, which is more user directed, or they want data triage and alerting, which is more automated. Those are usually the two things it falls into. That's really where it starts: I put a mass of data in the system.
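The trend-analysis-and-alerting idea just described could be sketched as counting observations per period and flagging tags that ramp up past a threshold. The observation records and the 2x ramp-up rule here are illustrative assumptions, not the actual product logic:

```python
from collections import Counter

def ramping_observations(observations, ratio=2.0):
    """Flag tags whose count this period is >= ratio * last period's count.

    observations: list of (period, tag) pairs, e.g. ("2023-11", "corrosion").
    """
    counts = Counter(observations)
    periods = sorted({p for p, _ in observations})
    if len(periods) < 2:
        return []
    prev, cur = periods[-2], periods[-1]
    tags = {t for _, t in observations}
    return sorted(t for t in tags
                  if counts[(cur, t)] >= ratio * max(counts[(prev, t)], 1))

obs = [("2023-10", "corrosion"),
       ("2023-11", "corrosion"), ("2023-11", "corrosion"),
       ("2023-10", "vegetation"), ("2023-11", "vegetation")]
print(ramping_observations(obs))  # ['corrosion']
```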
[00:25:01] Let's see what's there. I was doing a test last night: I uploaded, I don't know, a couple thousand images from different drone data I had, and didn't realize some of the data was actually in Europe. If you're just looking at files from a DJI drone, named like DJI_1234,
[00:25:19] there's no indexing on them; there's no obvious metadata.
[00:25:22] But once you process them through a system like ours, you can see when they were taken,
[00:25:27] related things you were capturing around that time frame, both time-based and geospatial,
[00:25:34] as well as clustering of what kind of data it is. Is this outdoor data? Is it
[00:25:39] more commercial building data?
[00:25:42] All that kind of stuff is so non-obvious when you're just looking at a folder on S3. Could you imagine a world where I came to you with lots of different kinds of data and said, please run this through your system, create metadata around it, make it searchable, make it discoverable.
[00:25:59] Let let me know What's going on with my data right here now and and what's happened in the past? And then expose that as some kind of, may maybe a web catalog service or some something like that. Something That was searchable on the web where I can
[00:26:12] expose it to the public and say, here, this is all the data we've got. We've created metadata around it. If you're looking for data, Looking here. Exactly. Yeah. It's it's actually something we're thinking about, and and we're gonna be opening up our APIs.
[00:26:24] So if you're a customer of ours, you can essentially,
[00:26:28] Search, I mean, search the data, get access to it. We are thinking about putting up a public catalog of, like, here's a here's Basically, you can, I mean, publish data into it and and have that kind of site? But also, I mean, I know I think you had passed me on to the the stack,
[00:26:43] format or or API, and and it's kind of like we've done some similar things with image tiling services and things. And it would be a great fit. I mean, really to kind of expose this as a public catalog. And and we've actually talked to a it was a a group that does geospatial work in Los Angeles,
[00:26:58] And
[00:26:59] they had ideas of, well, what if people are just posting pictures and
[00:27:05] things like that. And it could even be like a crowdsource thing, what if there was some app that they could publish into
[00:27:12] our knowledge graph and then actually, I mean, have public access to it. And so it's something we're thinking about. I mean, we don't have a public angle to what we're doing yet. It wouldn't be that hard to expose, but
[00:27:23] it could do some really interesting things that way. The way I understand it, up until now, we've been talking about, like, a database or
[00:27:31] S3 bucket full of files, you know, blobs stored somewhere, where you point at that, get access to it somehow, ingest it, and then create these knowledge graphs and do all the things we've been talking about. Is there a world where I could take a service like yours and point it at an API and say, can you just pull that API,
[00:27:49] can you do that and build, like, a knowledge graph around it and, like, tell me new things about that data? Well, that's the funny thing, that's where it all started. We have this concept of a feed. It's essentially based kind of on the RSS feed concept, where we can have any API that we can read. It could be an RSS feed. It could be, what other ones do I have? Spotify. I had music ones, because this is where it kinda started, and it could get, like, the new releases, and it chops it up into, like, posts, kinda like RSS posts, and we can process that through the system. So, essentially, we have this polling model, or this pull model, where any API out there, we can kind of convert into our common data model
[00:28:33] and kinda have a continuous feed of data. So we could talk to a SQL database and look for new rows
[00:28:39] and basically generate events as posts in our system that could be processed.
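The polling model Kirk describes, reading any API and normalizing its items into RSS-like posts in a common data model, can be sketched roughly like this. The `FeedPost` shape, field names, and the fake "new releases" source are all hypothetical illustrations, not Unstruct Data's actual schema:

```python
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass(frozen=True)
class FeedPost:
    # Hypothetical common data model: every source gets normalized to this.
    source: str
    external_id: str
    title: str

def poll_feed(fetch: Callable[[], Iterable[dict]],
              normalize: Callable[[dict], FeedPost],
              seen: set) -> list[FeedPost]:
    """One polling pass: fetch raw items, normalize them, emit only new posts."""
    new_posts = []
    for raw in fetch():
        post = normalize(raw)
        if post.external_id not in seen:
            seen.add(post.external_id)
            new_posts.append(post)
    return new_posts

# Example: a fake "new releases" API, like the Spotify feed mentioned above.
def fake_releases():
    return [{"id": "r1", "name": "Album A"}, {"id": "r2", "name": "Album B"}]

to_post = lambda r: FeedPost("spotify", r["id"], r["name"])
seen_ids = set()
first = poll_feed(fake_releases, to_post, seen_ids)   # both items are new
second = poll_feed(fake_releases, to_post, seen_ids)  # nothing new this pass
```

The same loop works for a SQL table, watching for new rows instead of new API items, since only `fetch` and `normalize` change per source.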
[00:28:44] We just added email support recently, but in a file-based world. So we could drop in, like, an MSG file or an email file, or even a PST file, like an Outlook file, and we crack it open and do document analysis and everything. But I could see a world where we're listening to the Google Mail API
[00:29:02] or the Microsoft Graph,
[00:29:05] API and things like that. That's
[00:29:07] really, I mean, conceptually
[00:29:09] right in line and probably wouldn't take very long to integrate. That's really interesting. A lot of those examples, for me, they were, at least in my mind, examples of a feed that's constantly updating. So, like you said, you could sit and listen to that feed. I was thinking more about some geospatial APIs there. You show up with
[00:29:28] a geography and say, well, show me everything within this polygon, in this geographic area, and make a request based on that. And I'm wondering if you could do something like that. If you knew the bounding box of the API
[00:29:40] and just started polling it constantly,
[00:29:42] and building, like, this knowledge graph around the stuff that you're finding, that would be amazing. We have a concept of
[00:29:49] of places. So when you drop in an Esri shapefile into our system, we ingest the file. We extract,
[00:29:56] basically convert it to GeoJSON internally so we get a geofence,
[00:30:00] and then we, what we call, promote it to a place entity. And so it creates a place in the graph that becomes more searchable. It's like a top-level entity in our graph, but we also do data enrichment on that. So I go and look and call the Google Places API, and I try and map that to, okay, like, is there any other metadata essentially I can get around that
[00:30:23] place. Yeah. But we could also, I mean, I've talked to Nearmap. I've talked to a couple other satellite services.
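The promotion step described above, a GeoJSON geofence extracted from a shapefile becoming a top-level place entity with optional enrichment, might look something like this minimal sketch. The entity dictionary shape and the `enrich` hook are invented for illustration; the real system's schema is not public here:

```python
import json

def bounding_box(ring):
    """Bounding box of a GeoJSON polygon ring: (min_lon, min_lat, max_lon, max_lat)."""
    lons = [lon for lon, lat in ring]
    lats = [lat for lon, lat in ring]
    return (min(lons), min(lats), max(lons), max(lats))

def promote_to_place(feature: dict, enrich=None) -> dict:
    """Promote a GeoJSON feature (as converted from a shapefile) to a
    top-level 'place' entity, optionally merging extra metadata from an
    external lookup, e.g. a places API. Shapes here are hypothetical."""
    ring = feature["geometry"]["coordinates"][0]
    place = {
        "type": "place",
        "name": feature["properties"].get("name", "unnamed"),
        "geofence": feature["geometry"],   # keep the full geometry as the geofence
        "bbox": bounding_box(ring),
    }
    if enrich is not None:
        place.update(enrich(place))        # merge any extra metadata found
    return place

feature = json.loads("""{
  "type": "Feature",
  "properties": {"name": "Yard 12"},
  "geometry": {"type": "Polygon",
               "coordinates": [[[-122.4, 47.6], [-122.3, 47.6],
                                [-122.3, 47.7], [-122.4, 47.7], [-122.4, 47.6]]]}
}""")
place = promote_to_place(feature, enrich=lambda p: {"category": "industrial"})
```

The enrichment callback stands in for the Google Places lookup mentioned above: anything it returns gets folded into the place entity.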
[00:30:29] We could go enrich, like, go get me the latest satellite data for that region
[00:30:34] and layer that in. Because of the way we have our eventing model, we now support webhooks. So when anything happens in the system, like an entity is created or a tag is added, we can call a webhook. And so that's an area where, for now,
[00:30:49] anybody could build a data enrichment where they could call some other API
[00:30:53] and then call back to us to inject data back into the graph. But we're also looking at other ways where we can basically just do that in a box, I mean, where we could add that as a feature for a customer to say, hey, for any Esri shapefile you put in here,
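The webhook round-trip Kirk outlines, an entity-created event fires, a subscriber enriches, and the result is injected back into the graph, can be sketched as a toy in-process version. The class names and event shapes are assumptions; in practice the "webhook" would be an HTTP call to a subscriber's URL rather than a local function:

```python
from typing import Callable

class KnowledgeGraph:
    """Toy stand-in for a graph with an eventing model: subscribers are
    notified (as a webhook would be) whenever an entity is created, and
    can call back in to inject enrichment data."""
    def __init__(self):
        self.entities = {}
        self.webhooks: list[Callable[[str, dict], None]] = []

    def on_entity_created(self, hook):
        self.webhooks.append(hook)

    def create_entity(self, entity_id: str, data: dict):
        self.entities[entity_id] = dict(data)
        for hook in self.webhooks:
            hook(entity_id, self.entities[entity_id])  # fire the "webhook"

    def inject(self, entity_id: str, extra: dict):
        self.entities[entity_id].update(extra)  # the callback path into the graph

graph = KnowledgeGraph()

# A subscriber that fetches imagery metadata for any newly created place.
def satellite_enricher(entity_id, data):
    if data.get("type") == "place":
        # pretend this called out to an external imagery service
        graph.inject(entity_id, {"latest_imagery": "2023-12-01"})

graph.on_entity_created(satellite_enricher)
graph.create_entity("place:1", {"type": "place", "name": "Site A"})
```

The point of the design is that enrichment lives outside the core system: anyone can subscribe to events and write data back, without the graph knowing what each enricher does.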
[00:31:08] go get me the latest satellite data from the service, and we could just have that as an option. How do you know where to stop with this? Because I think at some stage, people might feel like they're drowning in data. How do you know? You might cross a threshold where, like,
[00:31:21] the return on investment
[00:31:24] is massive. You know, it just goes up and to the right, and then it dips off. How do you know where to stop these spiders? How do you know when the knowledge graph is, like, okay, that's enough to complete this task, or for what we're doing today? Yeah. I mean, that logic is the tricky part. In development, I have created bugs where I kinda created an infinite loop of spidering.
[00:31:43] So there's definitely a risk there. I think what we did is, if we start to enrich and we're not making any changes, if we're kinda seeing, like, okay, I'm getting more data, but it's literally the same data that was already there, I would start to cut off the spider at that point.
[00:32:00] And that is really one of the big problems, though, because you can, I mean, spend a lot of money calling out to other APIs and doing enrichment that may never be needed.
[00:32:09] So a lot of the logic around that is really around how far to spider.
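The cutoff heuristic described here, stop spidering once a pass adds nothing that wasn't already there, amounts to running enrichment to a fixed point, with a hard round limit as a guard against the infinite-loop bug Kirk mentions. This is a generic sketch of that idea, not Unstruct Data's actual logic:

```python
def enrich_until_stable(entity: dict, enrichers, max_rounds: int = 5) -> dict:
    """Run enrichment passes until a pass changes nothing (a fixed point),
    or a hard round limit is hit, which guards against infinite spidering."""
    for _ in range(max_rounds):
        before = dict(entity)
        for enricher in enrichers:
            entity.update(enricher(entity))
        if entity == before:  # same data as was already there: cut off the spider
            break
    return entity

# Two toy enrichers: one adds a tag exactly once, one only echoes existing data.
add_region = lambda e: {} if "region" in e else {"region": "pnw"}
echo = lambda e: dict(e)

result = enrich_until_stable({"name": "asset-1"}, [add_region, echo])
```

The fixed-point check is also what bounds API spend: once every enricher returns data the entity already has, there is nothing left worth paying another call for.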
[00:32:15] But yeah. I mean, that part of it, data enrichment, is almost boundless, and really it comes down to the customers.
[00:32:24] So this is, it sounds incredible, and I've learned a ton just through this conversation alone. But I'm wondering, like, who is this not for? So it sounds like a lot of people could use it, but if you had to point at,
[00:32:37] you know, an organization or
[00:32:40] a particular industry,
[00:32:44] who are you not building this for? Who shouldn't be considering this? I think a lot of it is the difference between the technology and the company. I mean, the technology is really broad, and that's why, I mean, it started as podcast discovery. It could be used for the media entertainment side, it could be used for a lot of different things. But as a company, what we've done is focus the technology
[00:33:03] for anything that, essentially, we call extracting insights from unstructured data that is perceiving real-world assets. So we've constrained it to, there's some geospatial element. Typically, the data is about something in the real world. So, I mean, we've had some interest from, like, health care and
[00:33:23] different medical research and things like that, for maybe, like,
[00:33:27] scanning images of X-rays and stuff like that. And we just haven't really gone down that road because there's no geospatial angle to it. There's a temporal angle, and there's a lot of overlap. There's a lot of data there, but it's just not our sweet spot. And so we're trying to focus on
[00:33:41] carving out something where there is, I mean, that geospatial angle. Is that just because of the extra context you're gonna get with the geospatial
[00:33:50] data?
[00:33:51] Yeah. And, I mean, it's not even a technology problem. It's just a, we can't
[00:33:56] really boil the ocean and do everything, problem. And just as a small company, we gotta have some focus. But I think it's also, what I'm really excited about with the contextualization is
[00:34:07] being able to link
[00:34:09] it. I mean, it is a little bit of kind of the digital twin concept, I mean, where physical assets and entities exist in the real world. But for us, simply by uploading a photo, we can know
[00:34:20] what piece of equipment, or pieces of equipment, are in that photo, and we can give you almost like a little heads-up display on, here's the current sensor data from that conveyor belt. Here's, like, the sound vibrations,
[00:34:33] the last 5 sound recordings of vibration of that, literally by just taking a picture. That's where we really wanna get to. And we're not all the way there, but the fact that we can reverse engineer kind of what you're looking at in 3D, kind of map back to
[00:34:50] looking that up in a database, getting that data from maybe a SQL database or a time-series database,
[00:34:56] And then start to look at contextualizing
[00:34:59] that across
[00:35:00] weather. I mean, oh, we see water pooling. Oh, but it did just rain 3 days ago, so that's not a problem.
[00:35:06] Trying to pull in other data sources,
[00:35:08] that's our long term vision. I mean, it's really this kind of knowledge hub for the real world,
[00:35:14] in a business,
[00:35:16] enterprise sense, essentially. But when you think about the big drivers of unstructured data today, what do you think about? Do you think about satellite data? Do you think about imaging? Do you think about IoT?
[00:35:29] Where do you go? Yeah. I think, I mean, we typically look at the 3 main sources of data
[00:35:34] for imaging and video we get, and even what generates 3D. It's drones, robots, or mobile phones. So it'd be like a Spot robot, a drone, or just somebody with an iPhone walking around. Those are, like, the 3 main sources of data that we get
[00:35:49] other than documentation,
[00:35:51] or CAD drawings and things like that. But typically, those are data about
[00:35:57] your real-world assets, and so the documents are like maintenance reports, or there might be a Zoom meeting that was recorded about, say, an inspection going on.
[00:36:06] And so those kind of provide context, but we tend to be more imagery heavy,
[00:36:12] just because that's a lot of where the volume of data is. But more and more, we wanna pull in other data formats that kinda relate to that. If you get to the stage where you can
[00:36:23] join,
[00:36:24] like, as-built documentation, those PDF documents that all engineers love, and CAD files of structures in the real world, with
[00:36:32] other data, like current data about those, then you are on a gold mine, my friend.
[00:36:37] I hope so.
[00:36:38] I mean, that's where we're trying to get to. And from the folks we've talked to, essentially, they just have massive data. We talked to an oil and gas customer who said, look, we don't want Google search. Like, just
[00:36:52] searching
[00:36:53] file names isn't enough. Searching
[00:36:55] full text isn't enough. You essentially want, like, a semantic search, and that's what we're creating: a way to search across the relationships
[00:37:03] that we've gleaned from their data. And, okay, sorry to interrupt, but that's a brilliant point. Like, are you not afraid that Google is gonna do this? I mean, what is their whole mission, to make everything searchable, or index everything in the world? Like, are they competition for you? In a pattern sense, I mean, what we're doing is a lot like the Google Knowledge Graph, but they're so consumer focused.
[00:37:25] I would be more concerned about, like, Palantir or a C3 or Cognite, or any of these companies that are focused on kind of these real-world things. I mean, those are more the people we would see ourselves paired up against, but I think there are unique things that we do by leaning in more on the unstructured data.
[00:37:41] We're not as vertical.
[00:37:44] What we also wanna engage with is ISVs and companies
[00:37:49] to build companies on top of our platform. Our goal is to be more of a Snowflake or Databricks, so more of a data platform that people can build around and on top of.
[00:37:58] In addition, I mean, you can use it kinda out of the box as well, but we really see us kinda doing the heavy lifting
[00:38:03] to get some really interesting vertical applications, like property inspection is one. I mean, taking iPhone photos of
[00:38:11] rental apartments and pulling in all the data around that, and automatically
[00:38:15] kind of creating a representation of, oh, like, when did this sink break? I mean, was this sink broken in the last inspection? And maybe we could track the emails about it with
[00:38:28] the renter and things like that. That's where I think there's some really interesting vertical products that could even be built on top of our technology.
[00:38:35] Well,
[00:38:36] I said it before, I've learned a ton in this conversation. I really appreciate it. I think probably now is a good time to sort of wind things down a little bit.
[00:38:44] But part of me is super
[00:38:47] excited, fascinated. Another part of me is a little bit terrified, because it feels like everything is gonna be linked to everything else at some stage. And, personally, I just don't know if I'm ready for it yet. It's interesting. I mean, we're not really going down that kind of intelligence route. I mean, I think a lot of these concepts are being done in the true, like, NSA-type, Palantir intelligence community. And I've sometimes jokingly called us kinda Palantir lite, because, I mean, I love the concept of how they approach, like, their knowledge graph and things like that. And I totally understand why it's, like, multibillion-dollar contracts and the value you can get out of it. But I think there's a smaller
[00:39:27] swath of that you can take in kind of a no-code environment
[00:39:30] for just normal business. I mean, it could be,
[00:39:33] I don't know, like, a paving company that just wants to track each of their jobs and map that against their work orders and their emails and their meetings.
[00:39:42] So we really see it can go all the way from SMB up to kinda mid enterprise. I don't anticipate us going for, like, the massive
[00:39:51] enterprise deals.
[00:39:53] We're just really focused a notch below that and downward. Well, I am gonna be following along, because I think this sounds really, really interesting. And if there's people out there listening, where can they go to follow along? Where can they go to learn about this, to catch up, to see it in action?
[00:40:09] Yeah. For sure. So, we are launched on the Azure Marketplace now.
[00:40:13] Our website is,
[00:40:14] unstruct.com.
[00:40:16] A better one's coming out soon. It's still a bit of a placeholder.
[00:40:18] And then just LinkedIn.
[00:40:20] I mean, it's the best place to watch the company and connect with myself. If anybody has problems in the space, I'd love to talk to them. I just love talking to people about the data they have, the problems they're seeing, and that discovery part of it is super fun. Well, I'm gonna keep my eye out. If I meet anyone on my travels, I'll definitely make some introductions.
[00:40:39] Appreciate it so much. Thanks very much for your time, Kirk. I've really enjoyed this conversation. Same here.
[00:40:46] Well, I really hope you enjoyed that episode with Kirk. I'll put links in the show notes to where you can catch up with him, where you can reach out to him if you're interested in perhaps working with Unstruct Data or finding out more about what they do. And, of course, I would love to hear from you too. You can connect with me on Twitter at Mapscaping,
[00:41:02] or there'll be links to my LinkedIn profile and to our website, mapscaping.com,
[00:41:06] in the show notes of this episode. So feel free to reach out. I would love to hear from you. Okay. That's it for me. That's it for another episode of the Mapscaping podcast. I'll be back again next week. We'll talk then. Bye.