@barakmich
Last active August 29, 2015 14:14
CA Triples Feedback

First, let's get a summary of what's going on:

$ cat ca | cut -f 2 -d " "| sort | uniq -c
  15066 </bill/id>
  15066 </bill/session>
  51722 </bill/sponsor>
  35194 </bill/sponsor/cosponsor>
  16528 </bill/sponsor/primary>
  15066 </bill/state>
  23086 </bill/subject>
    678 </committee/member>
    249 </legislator/name>
    118 </legislator/party>
    249 </legislator/state>
 159765 </legislator/vote/no>
 119163 </legislator/vote/other>
1349215 </legislator/vote/yes>
  53677 </vote/passed>
  53677 </vote/state>

This is a handy way to talk about the structure. You can bake your metadata into your data directly by means of something like

</vote/passed> </type/object/type> </type/property>
</vote/passed> </property/name> "Vote passed"
</vote/passed> </property/description> "Did the vote pass or not? ...."
</vote/passed> </property/expected_type> </type/bool>

And so on. Those triples form the schema graph for the dataset, which can be merged into the data or loaded separately, whichever you prefer. My uniq -c above is just a quick look around at which predicates exist.
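As a rough sketch of how the two fit together, the snippet below counts predicates the same way the uniq -c pipeline does and then lists the ones that have no entry in the schema graph. The sample lines, the node IDs (</vote/v1> and so on) and the literal values are made up for illustration; only the predicate names come from the note above, and the space-separated line format is an assumption based on the cut command.

```python
from collections import Counter

def parse_triple(line):
    """Split a 'subject predicate object' line; the object may contain spaces."""
    subject, predicate, obj = line.strip().split(" ", 2)
    return subject, predicate, obj

# Hypothetical sample data, not taken from the real file.
data_lines = [
    '</vote/v1> </vote/passed> "true"',
    '</vote/v1> </vote/state> "ca"',
    '</vote/v2> </vote/passed> "false"',
]
schema_lines = [
    '</vote/passed> </type/object/type> </type/property>',
    '</vote/passed> </property/name> "Vote passed"',
]

# Same idea as: cut -f 2 -d " " | sort | uniq -c
predicate_counts = Counter(parse_triple(line)[1] for line in data_lines)

# A predicate counts as described if the schema graph types it as a property.
described = set()
for line in schema_lines:
    subject, predicate, obj = parse_triple(line)
    if predicate == "</type/object/type>" and obj == "</type/property>":
        described.add(subject)

undescribed = sorted(set(predicate_counts) - described)

for predicate, count in sorted(predicate_counts.items()):
    print(count, predicate)
print("missing from schema:", undescribed)
```

Run against the real file, a check like this would show which predicates still need schema triples.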

Your intermediate IDs are generated fine, and for the most part the content itself looks pretty good. If you wanted </vote/state> to point at a </state/ca> node (with </state/ca> </state/name> "California") and keep a separate state subgraph, that's another possibility. But "ca" suffices for now, much like what you already did with political party.

Two structural things jump out: one of them creative, the other understandable but really just a convention.

The creative one is /legislator/vote/yes and /legislator/vote/no -- making these properties instead of CVTs is kind of cool. It makes sense from the reverse perspective (for this vote, these are the yes votes, and these the no), but the natural counterargument is: what if I want all votes by a legislator? Because it's binary, the query isn't much harder (.Out(["/legislator/vote/yes", "/legislator/vote/no"])). Multi-predicate queries like that are something we avoided in the MQL days (for obvious reasons), but they're nicer now. I just had a conversation about storing Morphisms, and this is the perfect example case for things you can do well once you store them. For instance, a /legislator/vote is uniquely defined as the union of those two properties. I like where that's going structurally and conceptually. So I'm a fan.
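To make the union idea concrete, here is a minimal in-memory sketch in plain Python (not the graph store's own query API). The triples and IDs are invented for illustration, and the out() helper is a hypothetical stand-in for following several predicates at once.

```python
from collections import defaultdict

# Hypothetical triples: legislator -> vote, split by how they voted.
triples = [
    ("</legislator/l1>", "</legislator/vote/yes>", "</vote/v1>"),
    ("</legislator/l1>", "</legislator/vote/no>", "</vote/v2>"),
    ("</legislator/l2>", "</legislator/vote/yes>", "</vote/v1>"),
    ("</legislator/l1>", "</legislator/name>", '"Example Person"'),
]

# Index objects by (subject, predicate) so lookups do not scan every triple.
index = defaultdict(set)
for s, p, o in triples:
    index[(s, p)].add(o)

def out(subject, predicates):
    """Follow any of the given predicates from subject and union the results."""
    result = set()
    for p in predicates:
        result |= index.get((subject, p), set())
    return result

# "/legislator/vote" defined as the union of the two materialized properties.
VOTE = ["</legislator/vote/yes>", "</legislator/vote/no>"]

all_votes = out("</legislator/l1>", VOTE)
print(sorted(all_votes))
```

The summary at the top also lists /legislator/vote/other; adding it to VOTE would fold those in the same way.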

Compare/contrast with the second structural note: the trio of /bill/sponsor, /bill/sponsor/primary, /bill/sponsor/cosponsor. You could build /legislator/vote the same way as /bill/sponsor -- just another triple. But you chose to materialize it here and not there.

I think the reason is that there's often a pattern of "a set of things, of which one is primary". In fact, they are all sponsors or cosponsors, but one of them is the primary sponsor. This happens a lot with, say, music ("a set of album releases, of which this is the primary") or literature ("a set of editions of the same book, of which the most published is the primary"), and so on. Happens all the time.

However, you can have multiple primary sponsors as well, which makes it trickier. If there were one and only one, I'd say the right structure would be (roughly) /bill/sponsor pointing to all sponsors (as done), /bill/sponsor/primary pointing to the single primary sponsor as well, and no /bill/sponsor/cosponsor. But these are two separate sets, and while one (/bill/sponsor) is technically the union of the other two, writing more triples never hurt anyone. So I think it's right, but it's worth noting the common case that's similar.
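Since /bill/sponsor is meant to be the union of the other two, that is easy to sanity-check. The counts in the summary at the top are at least consistent with it (16528 + 35194 = 51722). Below is a small sketch of a per-bill check; the bill and legislator IDs are invented, and sponsor_mismatch() is a hypothetical helper, not something from the dataset's tooling.

```python
def objects(triples, subject, predicate):
    """All objects reachable from subject via predicate."""
    return {o for s, p, o in triples if s == subject and p == predicate}

def sponsor_mismatch(triples, bill):
    """Sponsors that break the rule: sponsor == primary union cosponsor."""
    all_sponsors = objects(triples, bill, "</bill/sponsor>")
    primary = objects(triples, bill, "</bill/sponsor/primary>")
    cosponsors = objects(triples, bill, "</bill/sponsor/cosponsor>")
    # Symmetric difference: anything present on one side but not the other.
    return all_sponsors ^ (primary | cosponsors)

# Hypothetical sample: b1 is consistent, b2 is missing a /bill/sponsor triple.
triples = [
    ("</bill/b1>", "</bill/sponsor>", "</legislator/l1>"),
    ("</bill/b1>", "</bill/sponsor>", "</legislator/l2>"),
    ("</bill/b1>", "</bill/sponsor/primary>", "</legislator/l1>"),
    ("</bill/b1>", "</bill/sponsor/cosponsor>", "</legislator/l2>"),
    ("</bill/b2>", "</bill/sponsor>", "</legislator/l1>"),
    ("</bill/b2>", "</bill/sponsor/primary>", "</legislator/l1>"),
    ("</bill/b2>", "</bill/sponsor/cosponsor>", "</legislator/l3>"),
]

print(sponsor_mismatch(triples, "</bill/b1>"))
print(sponsor_mismatch(triples, "</bill/b2>"))
```

An empty set means the three predicates agree for that bill; anything else names the legislators that appear in only one of the two views.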

None of this is set in stone -- all points are debatable, and this is where it gets into data philosophy and ontology. In general, a damn well workable set.

You might also consider making the /bill/subject values instances of their own. Generate IDs for them, and tag them together, e.g.:

</subject/sub20002> </subject/name> "Labor and Employment"
</subject/sub20002> </type/object/type> </subject>

So I can find all subjects by their category instead of by the fact a bill has them for a subject. I can then later group them as well.
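One way that rewrite could look, as a sketch: walk the triples, mint an ID the first time each subject string appears, and repoint the bill at the new node. The sequential sub1, sub2, ... numbering is my own assumption for the example (the sub20002 above presumably comes from some other scheme), and the bill IDs are invented.

```python
def promote_subjects(triples):
    """Turn literal /bill/subject values into typed, named subject nodes."""
    ids = {}      # subject literal -> minted node ID
    result = []
    for s, p, o in triples:
        if p != "</bill/subject>":
            result.append((s, p, o))
            continue
        if o not in ids:
            # Hypothetical ID scheme: sequential numbering in order of first use.
            ids[o] = "</subject/sub%d>" % (len(ids) + 1)
            result.append((ids[o], "</subject/name>", o))
            result.append((ids[o], "</type/object/type>", "</subject>"))
        # The bill now points at the subject node instead of the literal.
        result.append((s, p, ids[o]))
    return result

# Hypothetical input: two bills sharing one subject literal.
triples = [
    ("</bill/b1>", "</bill/subject>", '"Labor and Employment"'),
    ("</bill/b2>", "</bill/subject>", '"Labor and Employment"'),
]

promoted = promote_subjects(triples)
for triple in promoted:
    print(" ".join(triple))
```

Both bills end up pointing at the same subject node, so the subjects can be listed by type and grouped later without going through the bills.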

I notice some subjects seem to be, in fact, titles? If so, those belong under /bill/title, as you'd expect.

All in all, super cool :) Now I'm wondering what I can add to best help you make use of this...
