The Inane Insistence on Patterns Pertaining to Persistance
An annoying pattern
I have been noticing an annoying pattern lately. Whenever I hear about a new project starting, more often than not, I see someone on the team jump into "DB architect" mode. For those individuals, literally every last ounce of energy goes into designing the database. Worse, they want sole ownership and literally forego all other development activities until they are satisfied.
It has been a long time since I've seen a requirements document that was thorough enough or accurate enough that I could devise a perfect database on day one of the project. I have two guesses why some developers are drawn to early database design: 1) it gives them a sense of control over the project, thinking it establishes them in some essential role; or 2) they have been burned too many times in the past with bad database designs.
My experience is that quality software projects grow organically. When you start a new project in Visual Studio, you're not going to know on day one what all settings need tweaked in your
Web.config or have your authentication/authorization scheme all worked out. You can't possibly layer your application appropriately or know enough about the domain to correctly design your classes. That's just not how software works. Whenever I think about all the small adjustments made over the lifetime of a project, I get overwhelmed. The only way I can start breathing again is by realizing that big things happen in small steps. Do people believe databases are somehow immune to this reality? Perposterous!
A database is just a data store
We live in a world where people still design brand-new systems where 90% of the application is hosted in a database. Stored procedures cover the majority of the business logic and the UI just displays the direct output of queries. You know you're working in one of these sytems when you have to package up data in XML, JSON, etc. and process everything on the database. Performance is simply not a valid justification for this type of architecture anymore.
When you accept in your heart of hearts that a database is just a data store, you are empowering yourself. You employ general-purpose languages like C# or Java to solve hard business problems, which is what those languages excel at. You address performance issues by cutting down on database hits, using concurrency, doing bulk operations or using alternative technologies.
By treating a database as just a database, you stop getting tied to specific RDBMS vendors and technologies. You can investigate NoSQL options. Best of all, you can adjust and adapt your database with a lot less pain.
Isolate your database from other systems
Back in the 80's and 90's, the easiest way for two systems to talk was directly between databases. However, how a system represents its data should be a well-guarded secret. Off-the-shelf products usually include a clause in their support contracts saying any modification to the system voids the contract. Instead, all changes have to be in the form of plugins, configuration files and extensions using standard API calls. In OOP, hiding your internal representation behind an interface is a well-accepted practice. So why would it be any different at the system level? If anything, we should be even more protective.
Systems that talk directly between databases make it almost impossible to upgrade one system without the other. If those systems are managed by different development teams, coordinating releases can be a massive undertaking. In today's world of microservices, being able to switch from MySQL to Cassandra between releases can be the difference between scaling out and your start-up going under. As easy as it is to create new APIs these days, the need for direct database interaction has gone way down. The beauty of APIs is that a compiler or unit tests will capture any breaking changes.
If you must support direct database communication, for performance reasons or whatever, never talk directly to the same tables used by the application. Expose queries behind views so you can swap out the underlying database structure. Write data to batch tables and bring the data into your main tables using a separate process, so you change how that data is imported later without affecting clients.
Support database refactoring
Once you have isolated your database, it becomes much easier to refactor your database. Here are some things that will help you to support refactoring:
- Limit your use of views, stored procedures and other DB-specific technologies
- Limit how much SQL you hard-code. Rely on ORMs to generate queries based on a configuration.
- Hide interactions with your database behind repositories in your data layer.
Unfortunately, relational databases don't warn you if altering a column will break views or stored procedures. You have to manually go through each and re-run it to make sure it still works. Some tools like Visual Studio SQL Projects (SSDT) will give you compile-time errors.
Defining your schema in code comes with the advantage that once you reflect changes in the configuration, you will get compile-time errors for any breaking changes. The only negative is you have to keep your database schema and configuration in-sync. Changes like renaming a column or table become painless. Larger changes will have equally larger impacts on your code. Still, I'd rather deal with fixing code than hunting down changes in my database. In dynamically typed systems, only a suite of unit tests can provide a decent safety net.
"Future use" my foot
Maybe I'm a horrible person, but I'd rather pull the cord than hide in a corner whenever I encounter a column that's not being used. Let there be fire and brimstone if that column is actually doing something. I'd rather face that than go on worrying about a column that does nothing. In software, we call code that never gets called the Lava Flow Anti-pattern. Unused columns are basically the same thing.
I hear people using the phrase "future use" to justify the most horrible design atrocities. Please don't add columns to your database unless you are absolutely sure you need them. Don't try to anticipate future features or requirements. If you find you accidentally added a column that you no longer need, just DROP it. Don't reuse an unused column for some retrofitted purpose either.
Understanding audit fields and optimistic concurrency
A lot of people put audit columns in every table, such as
ModifiedOn. Be thoughtful when following a convention like this. If your
User table has an
ActivatedOn column, do you really need a
CreatedOn field? How will you populate
ModifiedBy? Many systems use the same service account to connect to the database, so the user information isn't available. Will you have to specify this in code? If you do it in code, should you set
ModifiedOn in code, too?
Figure out if a single timestamp is adequate auditing for your system. Maybe your client wants you to track all changes! If you do end up tracking changes in separate tables, including these fields is just overkill. Are these fields something the business wants or just to help you diagnose issues? Would auditing via logging be more appropriate?
A similar technique that people use blindly are timestamp or version columns. This practices stems from optimistic concurrency, where the developer takes it upon themselves to prevent lost updates. When you query data, you keep track of the timestamp or version that is sent back. When it is time to
DELETE, you include the timestamp or version in the
WHERE clause. If
0 records are updated, then you know that the record must have changed since you queried it and you can alert the user. This practice can help to reduce table locking when you feel conflicts will be rare. Unfortunately, many developers include these columns in their databases with no idea what purpose they serve. Choosing between optimistic and pessimistic concurrency needs more careful consideration.
The point is, don't just include columns because you've seen it done in a thousand other systems.
Many-to-many tables aren't inherently evil. Don't prematurely optimize and ugly up your database design without seeing if many-to-many tables are really going to be a performance issue. They are an unfortunate reality of working in a relational model. They don't necessarily correspond to entities in your domain. Still, I would rather work with a database that captures many-to-many relationships using many-to-many tables than a database that employed one of many techniques invented to avoid them:
- Using a fixed number of columns (e.g.,
- Using an unconstrained ID with a discriminator column (e.g.,
- Using a one-to-many relationship and just duplicating data
- Storing delimited IDs (e.g.,
Going the opposite direction, don't assume the need for a many-to-many table until you are absolutely sure. It's easier to migrate from a one-to-many table to a many-to-many table than the other way around.
Recognize when many-to-many tables are entities
If you do need a many-to-many table, ask yourself if it belongs as a separate entity in your domain. Consider a system that allowed users (
User) to join multiple groups (
Group). Furthermore, that system allows users to be elected to different roles (
Role) in a group (e.g., President, VP, Secretary, Treasurer).
An initial design might look like this:
In my experience, whenever many-to-many tables carry their own state, it indicates a missing concept in the domain model. In this case, I would rename the table to
Membership and give it its own primary key. In order to preserve uniqueness, I use a unique key:
Membership it's own primary key is a judgement call. However, I find composite keys are less supported by tooling, like ORMs. Furthermore, I think it emphasises that Membership is a real thing and that uniqueness is more a business requirement or artifact of the domain itself.
Consider this system eventually needing to support members taking on multiple roles. You'd need a new
MembershipRole table. Which would you prefer?
At this point, using a composite key in
Membership will lead to all the
Membership fields being duplicated for every role. You might start asking whether you can drop
Membership altogether. But what if you support members that don't have a role? Erm...?
Avoid redundant relationships
Consider a system that allows customers (
Customer) to provide multiple forms of payment (
PaymentMethod). Let's also say the system allows them to pick a "primary" payment method. Here's one potential design:
In code, this will result in two navigation properties on your
PaymentMethods (a collection of
PaymentMethod objects) and
PaymentMethod will have a navigation property back to
With this approach, is there anything to prevent you from setting a primary payment method associated to a different customer? If you remove the primary method from the collection, do you also have to set the primary navigation property to
Here's an alternative approach:
Now every customer payment method must indicate whether it is the primary method.
I wish there was a way to constrain the database so there was exactly one primary payment method. One option might be to assign an "order" rather than a primary flag. The primary payment method would then be whichever payment had the smallest order value.
These types of unique constraints can be tricky to work with. Trying to update one record at a time will result in violations. You need to temporarily disable the constraint or perform the updates within a transaction. In a heavy traffic environment, these kinds of constraints can lead to a lot of locking, so it might be easier to just move the check into code.
Stop perpetuating bad practices
I still stumble on databases using old naming conventions, the equivalent of Hungarian notation in code. Please stop prefixing all of your tables with
tbl, all your views with
vw and all your stored procedures with
sp. Instead, give your tables names similar to what appears in your domain model. Singular? Plural? I don't care. Give your views a name that indicates what information they provide, e.g.,
AccountsWithSalesByMonth. Use verb names for stored procedures
UpdateAccount. With names like these, it's hard to confuse them.
In your tables, standardize whether you want to call your keys
Key. If you include the table name in the primary key name, such as
Id, be sure to do that consistently.
When naming foreign keys, add meaning to the column names if necessary. For example, on an
Event, name the initiator column
InitiatorUserId rather than
UserId. On an
Approval, name the approver column
ApproverUserId rather than
UserId. Notice I include the type the foreign key points to; if it's obvious what the foreign key refers to, feel free to drop it.
Use the equivalent of
BIT on your database rather than
CHAR(1) for booleans. Prefer
DATETIMES. Date ranges rarely need to be
DATETIME and timestamps almost always need to be
DATETIME. Design ahead and anticipate the need for UTC. Try not to include "date" or "dt" in the name of all your date/time fields.
CreatedOnDate feels redundant, so stick with
ExpiresOn instead of
On for dates and
At for date/times, such as
Stop worrying about saving space. Stop using
TINYINT and just use
INT. When it comes to floating-point numbers, be a little more anal about matching your precision and scale to your requirements. If you aren't sure, don't worry about wasting space and just use the default precision/scale.
Just like with floating-point numbers, be careful to give your text fields adequate length, based on the requirements. If you are not sure, just use
MAX -- there's less performance impact than in the past and you can always clean up later. Prefer Unicode over ASCII, using
Be vigilent about nullable columns. My experience is that nullable columns are often a sign of bad design or denormalization. Booleans should almost never be
NULL, as there's almost always a default and reserve the use of
NULL to mean indecision (maybe) or indetermined (unknown).
Batch tables need batch IDs
If you must allow other systems to interact with your database, do it through batch tables rather than allowing them to directly
delete records in your main tables. Create a separate
Batch table with a primary key (
BatchId) that should be included in all records uploaded to your batch table(s). Then you can process batches on a schedule or on-demand via a stored procedure call.
If the record being batched has it's own primary key (e.g.,
AccountId), remember to include the batch key as part of the primary key (or unique key) in your batch tables, so you don't get constraint violations the second time a batch is run.
Batch tables are great places to store important information about a batch, such as when it was imported, when it was processed, the status of the batch processing, etc.
The nice part about batch tables is you can clean up after yourself. Simply
DELETE FROM Batch WHERE BatchId = @BatchId. While this isn't as performant as a
TRUNCATE TABLE, it allows your batch process to be run simultaneously by multiple users. It also lets you keep records around in the case of errors, so you can research them.
Write an ETL process
ETL scripts are one of my favorite programming tasks. They are typically well-defined, do not involve much human interaction and can usually be spit out in less than a day. There's an entire industry surrounding writing ETL scripts using tools like SSIS. Personally, I find it takes longer to use these tools than just whipping out a quick console application. How SSIS integrates with the rest of your application is awkward: basically calling a command-line tool from a stored procedure and just checking once every minute to see if the task completed with almost no support for error reporting. AHHH!
Say you just need to read a CSV file (E), transform that data (T) and load it into your database (L). There are about 10,000 .NET libraries for reading CSV that work about 10,000 times better than whatever broken CSV parser SSIS and SQL Server are using. You can dump that data into simple .NET classes (POCOs). Then you can use LINQ to transform that data into the same database entities the rest of your system is using. Then you just have to dump them into your database. No more staging tables for you!
Entity Framework taking too long? Use Dapper. Dapper taking too long? Use SQL Bulk Copy. Still not fast enough? LINQ parallel extensions! Memory issues? Process the data in chunks! Honestly, you're going to find a home-spun ETL process more demanding than your typical development activities. If that doesn't sound like fun, I don't know what does!
A few decades ago, the tooling simply wasn't there. These days, the database has really taken a back seat to more scalable, powerful options. Actual code is taking the fore-front. Strangely, with all the advancements in other areas, we as a development community have seemed resistent to new ideas when it comes to databases.
I think the gradual influx of agile development passed up database development somehow. I am not sure why developers still think they need to have a perfect schema on day one of a project. I don't know why teams don't work more closely together to decide what belongs in a database when the time is right, when the requirements are more clear. Think of all the strife among team members when someone checks-in a schema change without telling anyone.
Finally, I feel like teams don't take the necessary steps to protect their underlying technology choices and lock themselves into static schemas. Mostly, I feel it is a lack of education about alternative solutions; we're too embroiled with relational databases. Worse, we blindly continue practices from older generations without taking time to understand them.