Database Naming Convention and Data Warehouse Design Principles


[TOC]

ETL: From Data Lake to Data Warehouse

  1. Extract: pull out the data you need, remove noise, standardize, parse, ...
  2. Transform: aggregate, map, combine, change data types
  3. Load: into the dimensional model.
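The three steps above can be sketched as a minimal in-memory pipeline. The row shapes, column names, and status codes below are made up for illustration, not taken from a real source:

```python
# Hypothetical raw rows from a data lake extract; row 2 is noise to be dropped.
RAW_ROWS = [
    {"ID": "1", "NAME": " Matthew Lien ", "STATUS": "1"},
    {"ID": "2", "NAME": "junk", "STATUS": "x"},
]

def extract(rows):
    """Extract: keep only rows whose status code is valid, i.e. drop the noise."""
    return [r for r in rows if r["STATUS"] in {"0", "1"}]

def transform(rows):
    """Transform: map status codes, change data types, standardize text."""
    status_labels = {"0": "inactive", "1": "active"}
    return [
        {"id": int(r["ID"]),
         "name": r["NAME"].strip(),
         "status": status_labels[r["STATUS"]]}
        for r in rows
    ]

def load(rows, warehouse):
    """Load: append the cleaned rows into the (here: in-memory) dimensional model."""
    warehouse.extend(rows)

warehouse = []
load(transform(extract(RAW_ROWS)), warehouse)
```

In a real pipeline each step would read from and write to warehouse tables, but the shape of the flow is the same.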

Example of Data Lake schema

Build One Truth by merging multiple data sources

De-normalize, and identify types and meanings of columns and values

Cleanup

Single Source of Truth

Example Result Table

| id | name | display_name | email | location | access_level | status |
|----|------|--------------|-------|----------|--------------|--------|
| 1 | Matthew Lien | BlueT | bluet@bluet.org | Taichung | owner | active |

Data Pipeline

Data Source -ETL-> Data Warehouse -ETL-> Data Mart

Data Warehouse Schema Model and Design

De-normalized

  • In a Data Lake, the schema reflects the transactional logic of an application and follows best practices (such as third normal form) so that updating values will not produce errors. But this type of schema can be difficult to navigate, and many of its tables will never be used in an analysis. In the past, books recommended dimensional modeling to reduce schema complexity, make queries easier to run, and enhance performance. Today, due to advances in BI tools such as Chartio and in data warehouse technologies, dimensional modeling is often no longer worth the effort.

Dimensional Modeling

If dimensional modeling is required, e.g. for analytical processing without BI tools, two major types are commonly used. Remember that a dimensional model should be designed from a business perspective, not from an engineer's technical point of view.

Star Model

The star schema is deliberately denormalized to speed up processing.

  • Characteristics of the star schema
    • Dimension tables are not normalized.
    • Each dimension table contains its full set of attributes.
    • Dimension tables are joined only to fact tables; they are not joined to each other.
    • The fact table stores the keys from the dimension tables plus the measures.
    • The star schema is easy to understand and provides optimal disk usage.
  • Pros

    • Simple queries: join conditions in this schema are plain joins
    • Simplified business logic: this model simplifies common reporting logic
    • Performance: this model speeds up reporting applications
    • Fast aggregation
    • Feeding cubes: OLAP systems generally use this model to build cubes, and building a cube is a fast process
  • Cons

    • The main disadvantage is that data integrity is not enforced as it is in OLTP databases
    • It is not as flexible for analytical applications as a normalized database
    • It does not support many-to-many relationships between business entities; such relationships are simplified away in a star schema

Snowflake Model

The snowflake model is in the same family as the star schema; in fact, it is a special case of the star schema in which the dimension tables are normalized.

  • Pros

    • Some OLAP multidimensional model tools are optimized for the snowflake schema
    • Normalizing the dimension tables saves storage
    • Query performance can improve thanks to the smaller disk footprint and joins against smaller lookup tables
  • Cons

    • Additional maintenance effort is needed because of the increased number of lookup tables
    • SQL queries need more joins to fetch the same records
    • Data loads into the snowflake model must be highly controlled and managed to avoid update and insert anomalies

Style guide

When recreating the schema with views of the relevant tables, you should also clean up what’s in each table: exclude irrelevant columns and rename any columns that are confusing. Naming conventions help people analyze data consistently because they clarify what each column is and how it can be used.

Simplify

It’s quite common for raw data to be extremely complex; it was typically meant to be consumed by applications, not directly by business users. By taking some time to simplify the data, we can greatly improve business users' success when querying.

Only include fields with obvious analytical purpose

It's best to start modeling with only the most relevant columns, excluding any columns that have no immediate or obvious analytical purpose.

Extract relevant data from complex data types

Application data sources may contain JSON, arrays, hstore and other complex data types. These are typically hard to query from business intelligence tooling and should have relevant data extracted into new columns.

Example: Suppose a table books contains an id column and the following JSON column.

```json
{
  "title": "Moby Dick",
  "author": "Herman Melville",
  "genres": ["novel", "fiction"]
}
```

The resulting modeled books table would contain id, title, and author columns. Genres could be modeled as an independent table, reduced to a single genre based on custom rules, or handled by some other method.
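A minimal sketch of this extraction, assuming a hypothetical raw row shape in which the JSON payload is stored as a string column named data:

```python
import json

# Hypothetical raw `books` row: an id plus a JSON payload column.
raw_row = {
    "id": 1,
    "data": '{"title": "Moby Dick", "author": "Herman Melville",'
            ' "genres": ["novel", "fiction"]}',
}

def model_book(row):
    """Flatten the analytically useful JSON fields into plain columns."""
    payload = json.loads(row["data"])
    return {"id": row["id"],
            "title": payload["title"],
            "author": payload["author"]}

modeled = model_book(raw_row)
```

The genres array is deliberately left out here; per the note above it would go into its own table or be reduced by custom rules.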

Change flags and cryptic abbreviations to meaningful values

It's common for application databases to have flags or cryptic abbreviations in columns that work well for the application but terribly for a business user. It's important to transform these values into easy, human-readable values. Some examples:

  • Boolean values 0 and 1 should be transformed into relevant strings, such as true and false or on and off.
  • Flag values should be transformed into relevant strings. If a column billing_status has three numeric values (i.e. 0, 1, 2) that represent some status, they should be transformed into relevant business concepts such as Active, Not Active, Delinquent.
  • Cryptic values should also be transformed into easy-to-understand business concepts.
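A sketch of such a transformation, using the hypothetical billing_status codes from the examples above:

```python
# Assumed code-to-label mapping for the hypothetical billing_status flag.
BILLING_STATUS_LABELS = {0: "Active", 1: "Not Active", 2: "Delinquent"}

def humanize_billing_status(code):
    """Turn a cryptic numeric flag into a readable business concept."""
    return BILLING_STATUS_LABELS.get(code, "Unknown")

def humanize_boolean(value):
    """Turn 0/1 flags into 'false'/'true' strings."""
    return "true" if value == 1 else "false"
```

In practice the same mapping is often expressed as a SQL CASE expression inside the modeling view, but the lookup-table idea is identical.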

De-normalize where possible

Applications typically have highly normalized tables to prevent duplicates, reduce space, and make modification easier. However, this makes it harder for business users to browse the schema, because the complexity of the joins may be hard to follow. Build wider tables where appropriate, collapsing common concepts into a single table. Some examples could be:

  • Combine the sources, sources_redshift, sources_postgres, and sources_mysql tables into a single sources table with the lowest common denominator of values that make sense for a business user.
  • Combine the users and addresses tables into a single users table since addresses are meaningless on their own.

This simplification requires trial and error, and you may not always get it right.
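One way to sketch the users/addresses collapse is with an in-memory SQLite database; the table and column names below are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, first_name TEXT, address_id INTEGER);
    CREATE TABLE addresses (id INTEGER PRIMARY KEY, city TEXT);
    INSERT INTO addresses VALUES (7, 'Taichung');
    INSERT INTO users VALUES (1, 'Matthew', 7);

    -- The de-normalized, business-user-friendly shape: one wide row per user.
    CREATE VIEW users_wide AS
        SELECT u.id, u.first_name, a.city
        FROM users u LEFT JOIN addresses a ON a.id = u.address_id;
""")
row = conn.execute("SELECT id, first_name, city FROM users_wide").fetchone()
```

Whether this lives as a view or as a materialized wide table depends on warehouse size and tooling.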

Cleaning

Data is messy and requires some cleaning to ensure accurate results. Cleaning prevents common problems that might cause a query to produce incorrect results.

Attempt to eliminate NULLs

NULL values have unexpected consequences in SQL (is "string" <> NULL?). It's best to replace all nulls with values. Some examples:

  • Change all NULL values in the first_name column to the string Blank.
  • Change all NULL values in the last_login_type column to the string Never Logged In for customers that have never logged in.
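A COALESCE-style sketch of these replacements, with the per-column defaults taken from the examples above:

```python
# Per-column defaults; which columns get which default is a modeling decision.
NULL_DEFAULTS = {"first_name": "Blank", "last_login_type": "Never Logged In"}

def fill_nulls(row):
    """Replace None (SQL NULL) with a meaningful per-column default."""
    return {col: (NULL_DEFAULTS.get(col, "") if val is None else val)
            for col, val in row.items()}

cleaned = fill_nulls({"id": 1, "first_name": None, "last_login_type": None})
```

In SQL the equivalent is COALESCE(first_name, 'Blank') inside the modeling view.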

Fix common data inconsistencies

Bad data always makes its way into raw data sources. Whether it is misspellings or just junk data, it is important to clean up the data as much as possible. Some examples:

  • State names that have a mix of abbreviations, full names, and junk data should be transformed into a single, consistent format such as the full state name.
  • Phone numbers might be garbage text entered by users to avoid getting phone calls.
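A sketch of state-name normalization with an illustrative (and deliberately incomplete) synonym table; a real mapping would cover every state plus known junk values:

```python
# Illustrative synonym lookup keyed on a lowercased, trimmed form.
STATE_SYNONYMS = {
    "ca": "California", "calif.": "California", "california": "California",
    "tx": "Texas", "texas": "Texas",
}

def normalize_state(raw):
    """Map abbreviations and misspellings to the full state name; flag junk."""
    return STATE_SYNONYMS.get(raw.strip().lower(), "Unknown")
```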

Follow Naming Conventions

Schemas, tables, and columns should all be named in accordance with naming conventions listed below. At a minimum, names should be human readable and be representative of the data type and values stored.

Remove irrelevant data

Rows that are irrelevant for various reasons should be removed entirely from the data set. Some examples could be:

  • Employee testing
  • Fraud or spam
  • Inactive

Obviously, if analysis is being done on fraud or spam, that data should not be removed; but in most cases, if a row would always be excluded from a query, go ahead and remove it in modeling.

Change Data Types

Modeling is a great time to change data types to more appropriate types. Unix timestamps could be converted from int columns to datetime for example.
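A sketch of the Unix-timestamp conversion in Python:

```python
from datetime import datetime, timezone

def epoch_to_datetime(epoch_seconds):
    """Convert a Unix timestamp (int seconds) into a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)

converted = epoch_to_datetime(1406716997)
```

In MySQL the same conversion is FROM_UNIXTIME() applied during the load into the modeled table.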

Naming Conventions

Initially there will be a variety of naming conventions used for tables, columns, and values. Creating a standard for all of these makes it easier for others to find and understand the data they are looking for.

  • Names should be in American English

Table Names

Plural

A table of leads should be titled "Leads", not "Lead". When a name contains more than one word, only the last word needs to be pluralized: opportunity_histories

| Leads | opportunity_histories | Opportunity_Histories |
|-------|-----------------------|-----------------------|
| Good  | Good                  | Bad                   |

Prefix and Suffix

Fact tables and OLAP-ized tables should not have a prefix. The advantage of this method is that, given most database tools sort tables alphabetically, all tables related to a business object will be grouped together.

Some exceptions (including but not limited to):

  • [tablename]_dim: dimension table
  • ref_[tablename]: reference table, e.g. a list of national holidays
  • tmp_[tablename]: temporary table; an intermediate table produced mid-pipeline. These can appear when operations span different databases, e.g. from Postgres to BigQuery (some intermediate tables may end up on BigQuery)
  • test_[tablename]: testing table; test tables should usually have an expiry date, or a script should clean them up periodically

Field / Column Names

id as primary key

A simple numeric primary key labeled id should be standard for all tables.

| id (PK) | Column 2 | Column 3 |
|---------|----------|----------|
| 1234 (autoincrement or manually assigned) | Text | Text |

Name abbreviation and acronyms

For abbreviations that are well known among software developers, such as config and spec, the abbreviation should be used instead of the full spelling. This keeps the source code easy to read and write. In formal documentation, the full spelling should be used. Examples:

  • config (configuration)
  • id (identifier)
  • spec (specification)
  • stats (statistics)

Some systems enforce a character limit on object names; e.g. Oracle 12.1 and below only allows a maximum object name length of 30 bytes. Abbreviations and acronyms may therefore have to be considered during naming, despite the fact that they can often lead to misinterpretation. To combat this, it is suggested to compile a document listing the abbreviations in use, with detailed descriptions of their meanings. However, to limit possible confusion, avoid excessive use of abbreviations and acronyms.

In logical models, it is advisable for object names to be as self-explanatory as possible, i.e. most words should be fully spelled out, except common abbreviations for longer words such as "dept" for "department" or "org" for "organization". In physical models, however, abbreviations and acronyms are typically used to keep object names short.

foreign keys follow [tablename]_[id] format

Foreign keys should follow this format to make it very clear where the column links to. If there are two foreign keys to the same table, you can prepend a distinguishing name to them following the format:

  • [uniquename]_[tablename]_[id]

An accounts table linking to a users table with both a billing contact and a main owner would look like this:

Users
id fullname address phone_home
Bills
id account_id (FK)
Accounts
id owner_user_id (FK) billing_contact_user_id (FK)
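The convention can be sketched with an in-memory SQLite schema; the data values below are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT);
    -- Two foreign keys to users, disambiguated by a prepended name:
    CREATE TABLE accounts (
        id INTEGER PRIMARY KEY,
        owner_user_id INTEGER REFERENCES users(id),
        billing_contact_user_id INTEGER REFERENCES users(id)
    );
    INSERT INTO users VALUES (1, 'Matthew Lien'), (2, 'Some Contact');
    INSERT INTO accounts VALUES (10, 1, 2);
""")
owner, billing = conn.execute("""
    SELECT owner.fullname, billing.fullname
    FROM accounts
    JOIN users AS owner   ON owner.id   = accounts.owner_user_id
    JOIN users AS billing ON billing.id = accounts.billing_contact_user_id
""").fetchone()
```

Because each foreign key embeds its target table's name, the join conditions read naturally even with two links to the same table.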

Start columns with a _ if they are needed but should be hidden in Visual mode.

If there are columns you need in the model for joining or other purposes but don’t want visible by default in visual mode you can prefix them. They will otherwise be treated just as any other column.

Let’s say you didn’t think the foreign keys in the accounts table above needed to be shown in Visual mode. You can simply prefix them as shown below. The relationships will still be detected. It’s a best practice not to show the foreign keys visually.

Users
id fullname address phone_home
Accounts
id name _owner_user_id (FK) _billing_contact_user_id (FK)

This should not be used for columns you're on the fence about needing. Those just shouldn't be included. These are for columns that are needed for querying purposes but have no use in a Visual setting - primarily foreign keys.

Lower case, underscored naming

These names will be mapped to the native naming convention in generated code for each programming language. Our data model needs to be easily editable in SQL mode so we should follow conventions that make editing raw SQL easier. Therefore, we should attempt to have column names like id, first_name, last_name, and last_login_type instead of more human readable forms in the model. Chartio will handle that conversion.

| IWonderIfYouReallyLikeThis | this-has-compatibility-issue | this_looks_cleaner |
|----------------------------|------------------------------|--------------------|
| Value | Value | Value |

Field names should not include prepositions (e.g. "for", "during", "at"), for example:

  • reason_for_error should instead be error_reason
  • cpu_usage_at_time_of_failure should instead be failure_time_cpu_usage
| reason_for_error | error_reason | cpu_usage_at_time_of_failure | failure_time_cpu_usage |
|------------------|--------------|------------------------------|------------------------|
| Bad | Good | Bad | Good |

Field names should not use postpositive adjectives (modifiers placed after the noun), for example:

  • items_collected should instead be collected_items
  • objects_imported should instead be imported_objects
| items_collected | collected_items | objects_imported | imported_objects |
|-----------------|-----------------|------------------|------------------|
| Bad | Good | Bad | Good |

Plural field names

A field that holds a one-to-many mapping must use the proper plural form. This matches the convention and the common expectation of external developers.

| pending_salary | pending_salaries |
|----------------|------------------|
| Only 1 unpaid salary? | OK, now I get it: there is more than one |

Quantities

Quantities represented by an integer type must include the unit of measurement.

xxx_{bytes|width_pixels|meters}

If the quantity is a number of items, then the field should have the suffix _count, for example node_count.

Time and Duration

To represent a point in time independent of any time zone or calendar, TIMESTAMP (UTC) should be used, and the field name should end with _time, such as start_time and end_time.

If the time refers to an activity, the field name should have the form of verb_time, such as create_time, update_time. Avoid using past tense for the verb, such as created_time or last_updated_time.

  • Always use UNIX Timestamp (UTC)
  • [verb]_time

To represent a span of time between two points in time independent of any calendar and concepts like "day" or "month", _duration should be used.

  • _duration

If you have to represent time-related fields using an integer type for legacy or compatibility reasons, including wall-clock time, duration, delay and latency, the field names must have the following form:

  • xxx_{time|duration|delay|latency}_{seconds|millis|micros|nanos}
```sql
CREATE TABLE ts (
    id                       INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY,
    takeoff_time             TIMESTAMP NOT NULL,
    flight_duration_seconds  INT NOT NULL,
    receive_time_millis      INT NOT NULL
) AUTO_INCREMENT = 1;
```

If and only if you have to represent timestamp using string type for legacy or compatibility reasons, the field names should not include any unit suffix. The string representation should use RFC 3339 format, e.g. "2014-07-30T10:43:17Z". Or use INT to store Epoch.
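A sketch of formatting and parsing this representation, covering only the 'Z'-suffixed UTC variant of RFC 3339:

```python
from datetime import datetime, timezone

def to_rfc3339(dt):
    """Format an aware datetime as an RFC 3339 UTC string, e.g. 2014-07-30T10:43:17Z."""
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

def from_rfc3339(text):
    """Parse the 'Z'-suffixed RFC 3339 form back into an aware UTC datetime."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)

stamp = to_rfc3339(datetime(2014, 7, 30, 10, 43, 17, tzinfo=timezone.utc))
```

Note this sketch does not accept offset forms like +05:30; a full RFC 3339 parser would.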

MySQL converts TIMESTAMP values from the current time zone to UTC for storage, and back from UTC to the current time zone for retrieval. (This does not occur for other types such as DATETIME.) By default, the current time zone for each connection is the server's time. The time zone can be set on a per-connection basis. As long as the time zone setting remains constant, you get back the same value you store. If you store a TIMESTAMP value, and then change the time zone and retrieve the value, the retrieved value is different from the value you stored. This occurs because the same time zone was not used for conversion in both directions. The current time zone is available as the value of the time_zone system variable.

As of MySQL 8.0.19, you can specify a time zone offset when inserting TIMESTAMP and DATETIME values into a table. The offset is appended to the date part of a datetime literal, with no intervening spaces, and uses the same format used for setting the time_zone system variable, with the following exceptions:

  • For hour values less than 10, a leading zero is required.
  • The value '-00:00' is rejected.
  • Time zone names such as 'EET' and 'Asia/Shanghai' cannot be used; 'SYSTEM' also cannot be used in this context.
```sql
SET @@time_zone = 'SYSTEM';
INSERT INTO ts (col) VALUES ('2020-01-01 10:10:10'),
    ('2020-01-01 10:10:10+05:30'), ('2020-01-01 10:10:10-08:00');

SET @@time_zone = '+00:00';
INSERT INTO ts (col) VALUES ('2020-01-01 10:10:10'),
    ('2020-01-01 10:10:10+05:30'), ('2020-01-01 10:10:10-08:00');
```

Time with more precisions

MySQL supports fractional seconds for TIMESTAMP values, with up to microsecond (6-digit) precision:

```sql
SELECT UTC_TIMESTAMP(), UTC_TIMESTAMP() + 0;
-- -> '2003-08-14 18:08:04', 20030814180804.000000
CREATE TABLE t1 (ts TIMESTAMP(6));
```

Inserting a TIME, DATE, or TIMESTAMP value with a fractional seconds part into a column of the same type but having fewer fractional digits results in rounding. Consider a table created and populated as follows:

```sql
CREATE TABLE fractest (c1 TIME(2), c2 DATETIME(2), c3 TIMESTAMP(2));
INSERT INTO fractest VALUES
('17:51:04.777', '2018-09-08 17:51:04.777', '2018-09-08 17:51:04.777');
```

The temporal values are inserted into the table with rounding:

SELECT * FROM fractest;
+-------------+------------------------+------------------------+
| c1          | c2                     | c3                     |
+-------------+------------------------+------------------------+
| 17:51:04.78 | 2018-09-08 17:51:04.78 | 2018-09-08 17:51:04.78 |
+-------------+------------------------+------------------------+

Date and Time of Day

For dates that are independent of time zone and time of day, the DATE type should be used, and the field should have the suffix _date. If a date must be represented as a string, it should use the ISO 8601 date format YYYY-MM-DD, e.g. 2014-07-30.

For times of day that are independent of time zone and date, the TIME type should be used, with the suffix _time. If a time of day must be represented as a string, it should use the ISO 8601 24-hour time format HH:MM:SS[.FFF], e.g. 14:55:01.672.

| opening_date | opening_time |
|--------------|--------------|
| 2014-07-30 | 14:55:01.672 |

Usually only TIMESTAMP → Unix timestamp conversion is accepted, and not DATETIME → TIMESTAMP, because DATETIME carries no time zone and the conversion will blow up.
