Skip to content

Instantly share code, notes, and snippets.

@Twipped
Last active March 20, 2024 06:12
Show Gist options
  • Star 3 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save Twipped/3b6c98dc80b00d5aed75f1c4b6456bd0 to your computer and use it in GitHub Desktop.
Save Twipped/3b6c98dc80b00d5aed75f1c4b6456bd0 to your computer and use it in GitHub Desktop.
Data storage lessons that I learned the hard way

Data storage lessons that I learned the hard way

Edit: Shoutout to kdeldycke/awesome-falsehood.

General

  • Never assume what's in the database is sane

Addresses

(Related reading)

  • Addresses should probably get their own table, unless it's recording historical behavior (eg, address mailed to)
  • Mailing addresses and property addresses are not the same thing
  • A street number may be alphanumeric
  • A street number may come after the street name
  • Not all addresses have street numbers
  • Not all addresses have streets
  • Not all addresses have cities
  • A postal code may come before the city
  • Not all countries have states/provinces
  • There are more than just 50 US states
  • 5 digit Zip codes can belong to more than one county
  • 5 digit Zip codes can belong to more than one state
  • Geocoding based on zipcode (even 9 digit zip) is unreliable, as zip codes have funky shapes
  • MLS data is not a reliable way to identify property owners
  • MLS data is not a reliable way to validate addresses
  • You cannot uniquely identify a property with five fields
  • You cannot uniquely identify a property with twelve fields
  • You cannot uniquely identify a property.
    • Address validation is extremely hard and you will save yourself a lot of hurt by paying someone else to do it, preferably a company who does it as a commercial product.
    • Google is not one of those companies
  • If you need to know the county for an address, get it from an API and denormalize on insert, don’t try to infer it from postal code.
  • Don't make a table to hold every 9 digit zip code in the US, it will be huge and quickly out of date. Get what you need from an API and insert denormalized.

Names

(Related reading)

  • The correct minimum length for an individual name is 0 characters
  • The correct maximum length for a name is "bigger than a varchar".
  • Names can contain any character, including spaces.
    • Names can use any character set (cyrillic, chinese, japanise
    • Names should be stored in utf8 at a minimum, utf16 is better
  • First name and last name have different purposes in different cultures
  • A full name is not just a combination of given name and family name
  • Not all people have middle names
  • Some people have multiple given names
  • Some people have multiple family names
  • Some people have no family name
  • Full name, legal name, and latinized name are three completely different things.

Gender/Sex

(Related reading)

  • Gender and Sex are not the same thing, make sure you know which is which
  • Gender and Sex cannot be used to infer ANYTHING, especially medically
  • There are more than three genders
  • There are more than two sexes
  • Some people have multiple genders
  • Some people have no gender
  • Some people defy sexual categorization
  • Title/honorifics do not infer gender
  • Name does not infer gender
  • Unless it's required by a downstream dependency, you probably don't actually need to know a user's gender or sex.

ACL

  • Identifying what functionality users have access to should always be explicit, not implicit or inferred.

Column types

  • If an enum has more than five values, make it a reference table
  • If an enum has different values for presentation, make it a reference table
  • If an enum’s values are longer than ten characters, make it a reference table
  • If the reference table will continue to get larger over time (and doesn’t have presentation values), just make the column a string. You’ll thank yourself later
  • Do not use enums for booleans
  • A boolean should only be used to represent true and false
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment