
Introduction

The goal of data architecture is to make any data-driven organization more effective in the pursuit of its mission. It does this through the appropriate design of databases and data systems. This document starts from these facts and derives principles at successive levels; level i is derived from level i-1.

Level 1

Figure 1 shows the sequence of processes required by a data user in a typical data-driven organization such as DOJ.

Figure 1: This workflow diagram shows the sequence of processes carried out by data users. Some of these steps may be automated for the user, and there may be multiple users in a single flow. Blue rectangles indicate data management processes; gray represents processes outside data management. In the end, mission success depends on whether the data and data systems were properly designed.

Principle 1.1

**In order to enable mission success, data and data systems must be properly designed.**

Users depend on four top-level data processes: discovery, access, combination, and analysis. These processes are executed via a set of databases (or more generally, data stores) and data systems. Data stores hold data in queryable configurations. Data systems process data and move it through a network of storage and processing nodes. Processes, data stores, and data systems may have sub-processes, sub-data stores, and sub-systems respectively.

The principle of proper design may seem obvious. However, practitioners often assume that one design is about as good as another. In reality, design choices have a major impact on overall system capabilities, accuracy, speed, and efficiency.

Principle 1.2

**Systems should be designed to maximize efficiency of implementation and operation.**

Time and budgets are finite, yet we want to get the most from these resources. This requires efficiency. Efficiency is defined as (quality output)/(resource expended). The resource is generally either time or money. Doing more with less input, doing the same with less input, and doing more with the same input all represent increases in efficiency. Herein, efficiency is meant as something precise and measurable.
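As a rough formalization of this definition (the symbols are illustrative, not from the source), efficiency can be written as:

```latex
\[
  \text{efficiency} \;=\; \frac{Q_{\text{output}}}{R_{\text{expended}}},
  \qquad R_{\text{expended}} \in \{\text{time},\ \text{money}\}
\]
```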

Level 2

Principle 2.1

**Data discovery must ensure that the right data gets discovered by the right people at the right time.**

There are two forms of data discovery: the first, where the user searches for data, and the second, where the user is notified. The 9/11 attacks and many other national security failures resulted from failures to notify the right people at the right time. A common mistake is to assume that the people who need to be notified are within one's own organization or sub-organization.

This principle implies the need for an enterprise data catalog.

Principle 2.2

**Data access must ensure that the right data gets to the right people at the right time.**

This principle implies a number of things:

  1. a)"Right data" implies accuracy, content, and quality of data.
  2. b)"Right people" implies that a potential user who will benefit from having the data should not be denied if he can be trusted and authorized.
  3. c)"Right people" excludes those who should not have access, such as hackers and inside threats. This is cyber security.
  4. d)"Right time" implies speed or reporting within some time limit.
  5. e)Data is keep secure. Corrupted data prevents access to the valid data.
  6. f)Data is protected from loss. Lost data is not accessible.

Principle 2.3

Whenever possible, integrate systems that report on the same events, people, or things.

Most sophisticated analyses depend on the combination of multiple data sources.

Principle 2.4

Design databases and data systems to maximize analytic power.

A wider range of sophisticated analyses are made possible via principled database design and data system design.

Principle 2.5

Automate systems whenever possible.

Automation improves efficiency in terms of time and also improves accuracy by reducing human error. It can also reduce costs for human labor.

Principle 2.6

Document database and data system designs.

This is necessary so that

  • Users understand what the data represents.
  • Users understand how the data was computed.
  • Engineers can maintain the databases and systems.

Level 3

Principle 3.1

Document all planned and existing databases and systems using appropriate diagrammatic and text based methods.

These include

  • Data flow diagrams – to document how data flows between data processes and stores. Also documents processes.
  • Workflow diagrams – to document sequences of processes
  • Entity relationship diagrams – to document database design
  • Unified Modeling Language (UML) object diagrams – to document data formats and metadata
  • Data dictionaries – to document databases and metadata
  • Structured English – to document processes

Arbitrary combinations of boxes and arrows are not informative. Each method serves as a language with its own grammatical rules. Practitioners should follow each method rigorously.

Principle 3.2

**Identify all users who need data contained within your databases. Provide them with the appropriate discovery tools and access.**

This identification should take the form of a mapping between each significant data item and associated users. Pay particular attention to users outside your organization. Users with a valid need for data, and holding appropriate clearance, should be granted access.

Principle 3.3

**All data should be discoverable via search and some via notification. Determine which data needs to be discoverable via notification.**

There are data items that, if not known to the user at a particular time, would result in mission failure. These must be selected for discovery via notification.

Principle 3.4

**In the case of confidential or classified data, access should be restricted to authorized users.**

Everyone knows Principle 3.4. It is included here for completeness and as a constraint on 3.2. The balance between 3.4 and 3.2 is critical. Neither should be assumed predominant over the other.

Principle 3.5

When integrating two systems (which become sub-systems) that report on the same event, person, or thing, there should be a single report for each event, person, or thing, and each newly formed subsystem should contribute to that single report. This single report will have a unique identifier. Further, the system will support the coordination of multiple human reporters.

The reason for this design principle is that the alternative often generates a large number of errors. That alternative is a combined system in which multiple reports per event, person, or thing are retained and matching occurs between these reports. Matching is a step that is both unnecessary and error prone.

If the single report design is deemed not feasible for some reason, a logical equivalent should be implemented. The logical equivalent design employs the same unique identifiers across all subsystems. These unique identifiers are passed automatically between subsystems without human transcription.
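A minimal sketch of the logical-equivalent design (the function and field names below are hypothetical, not from the source): a system-generated identifier is created once and then passed programmatically between subsystems, so no human ever transcribes it and no matching step is needed.

```python
import uuid

def create_report(event_description: str) -> dict:
    """Create the single authoritative report for an event, with a system-generated ID."""
    return {"report_id": str(uuid.uuid4()), "description": event_description}

def forward_to_subsystem(report: dict, subsystem_store: dict) -> None:
    """Pass the report to another subsystem, keyed by the same identifier.
    The ID travels with the record automatically; no transcription, no matching."""
    subsystem_store[report["report_id"]] = report

# Usage: both subsystems end up referring to one report via one identifier.
subsystem_a: dict = {}
subsystem_b: dict = {}
report = create_report("Traffic stop at 5th and Main")
forward_to_subsystem(report, subsystem_a)
forward_to_subsystem(report, subsystem_b)
assert subsystem_a.keys() == subsystem_b.keys()
```

Because both subsystems key their records on the same identifier, there is never more than one report per event to reconcile.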

Principle 3.6

**Collect or generate structured data and use it for analysis.**

Structured data provides the basis for the most powerful analyses. Structured data is organized according to well-defined architectures with strong theoretical foundations. Other dataset architectures that lack clear definitions and theoretical foundations are faddish and are not as useful as sometimes claimed.

The reason that structured data provides the basis for powerful analyses is that structured data is semantic whereas unstructured data is not.

One of the best known forms of structured data is the relational database. It is based on first order predicate logic. Technology may change but the critical role of logic remains constant over time.

One way to create structured data is to design your collection process so that data is structured upon collection.

Unstructured data includes sensor data and human natural language. Humans can easily extract semantics from natural language but traditional computer algorithms cannot. Artificial intelligence (AI) is an increasingly capable technology for converting unstructured data into structured data. Examples of AI subtopics include

  • Voice recognition
  • Object recognition
  • Expert systems

Principle 3.7

When collecting data from human reporters, constrain choices to valid values whenever possible.

This is a quality assurance (QA) principle. QA is the discipline of error avoidance. It helps assure accuracy. If only valid values are permitted, through pull-down menus, check boxes, etc., then invalid input becomes impossible.

Free text fields should be avoided if possible. Free text fields not only permit invalid data collection, they also permit multiple representations of the same item. For example, Robert Smith might also be written as Robert J. Smith or Rob Smith.

Constraining choices brings the data closer to a structured condition upon input.
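A minimal sketch of this idea, assuming a hypothetical code list of offense categories (the names are illustrative, not from the source); a pull-down menu would be populated from the same list:

```python
from enum import Enum

class OffenseCategory(Enum):
    """Code list of valid values; the input form offers only these choices."""
    ASSAULT = "assault"
    BURGLARY = "burglary"
    FRAUD = "fraud"

def record_offense(category: str) -> OffenseCategory:
    """Reject anything not in the code list, so invalid input never enters the database."""
    return OffenseCategory(category)  # raises ValueError for invalid input

record_offense("fraud")        # accepted
# record_offense("Frawd")      # would raise ValueError instead of storing a misspelling
```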

Principle 3.8

Never collect data from a human that can be auto-generated by the system.

This is a QA principle. Often we require humans to enter data when the system can compute it. This has two disadvantages: first, human data entry is highly error prone; second, it adds needlessly to the reporting load.

Examples of data that can be auto-generated include today's date, location (via GPS), and unique identifiers. In the case of users who are logged in, much information about them is likely pre-stored by the system.
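A minimal sketch, assuming a hypothetical report record (the field names are illustrative): only the narrative comes from the human reporter; the date, timestamp, and identifier are generated by the system.

```python
import uuid
from datetime import date, datetime, timezone

def new_report(reporter_id: str, narrative: str) -> dict:
    """Only the narrative is typed by the human; everything else is system-generated."""
    return {
        "report_id": str(uuid.uuid4()),           # unique identifier, never hand-typed
        "report_date": date.today().isoformat(),  # today's date from the system clock
        "created_at": datetime.now(timezone.utc).isoformat(),
        "reporter_id": reporter_id,               # pre-stored for logged-in users
        "narrative": narrative,
    }
```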

Principle 3.9

Minimize the reporting load for human reporters.

This serves multiple purposes. It helps ensure good relations with reporters, saves them time, and makes them more willing to comply fully and accurately with the reporting requirements.

Principle 3.10

Quality control (QC) your data once soon after collection or generation and whenever errors are detected.

Quality control is the process of detecting and correcting errors in data. It is inferior to QA in that it often involves guesswork. What should the answer have been? How should things have been matched?

Don't assume that QCed data is done being QCed. It is never done being QCed. Errors become evident during analysis in ways that escape the original QC process.

Principle 3.11

Perform QC in a way that identifies the cause. Rank each cause by the number and/or severity of the errors it generates. Fix the causes having the greatest impact first and work backwards from there.

Finding and fixing individual errors is worthwhile but not nearly as powerful as QC with cause tracking and elimination.
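A minimal sketch of cause tracking (the error records and cause labels are hypothetical): detected errors are tallied per cause and ranked, so the highest-impact cause is fixed first.

```python
from collections import Counter

# Each detected error carries the cause identified during QC (hypothetical labels).
errors = [
    {"record": 101, "cause": "free-text date field"},
    {"record": 102, "cause": "free-text date field"},
    {"record": 103, "cause": "missing GPS"},
    {"record": 104, "cause": "free-text date field"},
]

# Count errors per cause and rank; fix the top-ranked cause first, then work backwards.
ranked = Counter(e["cause"] for e in errors).most_common()
for cause, count in ranked:
    print(f"{cause}: {count} errors")
```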

Principle 3.12

Perform QC on raw data to identify causes associated with collection error.

After data has been "fixed" or combined with other data, errors and their causes become confounded and difficult to identify.

Principle 3.13

Quarantine data found to be in error until fixed. Do not allow it to flow downstream, take part in calculations, or be combined with other data.

During processing, errors generate a multitude of other errors.
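A minimal sketch of quarantining, assuming a hypothetical validity check (negative ages signal a collection error): only records that pass the check continue downstream.

```python
def quarantine(records: list[dict], is_valid) -> tuple[list[dict], list[dict]]:
    """Split incoming records; valid ones continue downstream,
    erroneous ones are held until fixed."""
    valid, held = [], []
    for record in records:
        (valid if is_valid(record) else held).append(record)
    return valid, held

# Usage with a hypothetical validity check.
records = [{"id": 1, "age": 34}, {"id": 2, "age": -5}]
clean, quarantined = quarantine(records, lambda r: r["age"] >= 0)
# Only `clean` is passed on to calculations or combined with other data.
```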

Principle 3.14

Archive data as soon as possible.

This is a data access principle. If your data is lost due to a storage failure or the location is not recorded, you cannot access it.

Traditionally, archiving is something that is done when an organization is done using data. However, the data management community has come to realize that this makes little sense. The purpose of archiving is to prevent loss, which can happen at any time. You want to ensure you have the opportunity to use data before it is lost. Therefore, archiving after use has the least value. If you are truly done using the data, you don't need to archive it; you can delete it.

An archive is more than a backup copy. Archive facilities should meet NARA physical standards and should include procedures for migration to fresh media and file formats. If possible, valuable data should be stored in two separate locations to prevent loss from natural and human-caused disasters.

Principle 3.15

Archived data should be accessible.

Otherwise, what is the point of preserving it? This diverges from the traditional view of archiving, in which the archive may have minimal accessibility.

Principle 3.16

Access to very large datasets via network transfer is limited. Therefore, collocate compute and storage for these very large datasets.

This often is required in the case of sensor data.

Principle 3.17

Index sensor data.

This is more a statement of a problem than a recommendation. Still, it is a principle. Sensor data can be so large that it would take a human thousands of hours to index manually. Artificial intelligence (pattern recognition) can be used, but this technology is still in its early development for some datatypes.

Principle 3.18

Provide data access in both human-accessible and machine-accessible forms.

The reason for providing human access is obvious. Humans are usually the ultimate users, except where robotics are employed.

Machine accessible data is data that is accessible to a computer program. The advantage of this is that data can flow automatically through a network of computer systems. This is superior to having humans transfer data among systems. Human transfer is slower, error prone, and inefficient.

Principle 3.19

The type of machine accessibility depends on the data type.

Provide access as appropriate:

| Data type | Access type |
| --- | --- |
| Small single table | It is pointless to provide sophisticated query capability for this type of data. Users typically want to download the table in its entirety and inspect all of it at once. |
| Relational databases | SQL query capability or, at a minimum, access to the original tables. |
| Large complex tabular data (non-relational) | Faceted query capability. |
| Sensor data and geospatial data | Access by time and spatial intervals in various combinations. Search by index if indexed. |
| Unstructured data with textual labels, or all text | Word search with logical combinations (AND, OR, NOT). |

Principle 3.20

Include all datasets in the Data Catalog.

This provides implementation of higher-level discovery principles.

Level 4

Principle 4.1

Search-based data discovery on structured data should include SQL capability or, at a minimum, be faceted.

Keyword-based searches, like those used in web search engines, are weak. They are used in that context because the data being searched is largely unstructured. SQL and faceted queries are less prone to false hits.
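A minimal sketch of a faceted query over structured records (the records and field names are hypothetical): each facet must match exactly, which avoids the false hits of keyword search.

```python
# Hypothetical structured records; each key is a facet.
records = [
    {"year": 2019, "district": "Eastern", "offense": "fraud"},
    {"year": 2020, "district": "Eastern", "offense": "assault"},
    {"year": 2020, "district": "Western", "offense": "fraud"},
]

def faceted_search(records: list[dict], **facets) -> list[dict]:
    """Return only records whose fields match every requested facet value exactly."""
    return [r for r in records if all(r.get(k) == v for k, v in facets.items())]

faceted_search(records, year=2020, offense="fraud")  # -> only the Western 2020 fraud record
```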

Principle 4.2

**Each field in metadata should represent a single dimension or facet.**

Combining multiple dimensions in a single field does not support SQL queries or faceted search. It also prevents those providing the metadata from giving a complete description of the data.

The classic example is a field like "data type" in the metadata schema. This easily has multiple meanings. It could be the file format, the sensor type (video data vs acoustic), the subject of observation (people vs buildings), or a topic (criminal vs civil). Break such a field into its subcomponents as separate fields.
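A minimal sketch of breaking such a field into its subcomponents (the field names and example values are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class DatasetMetadata:
    """Each field carries exactly one dimension, so every facet can be described and queried."""
    file_format: str   # e.g. "csv", "geotiff"
    sensor_type: str   # e.g. "video", "acoustic"
    subject: str       # e.g. "people", "buildings"
    topic: str         # e.g. "criminal", "civil"

# Instead of a single catch-all field such as data_type = "criminal video of buildings (geotiff)":
meta = DatasetMetadata(file_format="geotiff", sensor_type="video",
                       subject="buildings", topic="criminal")
```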

Principle 4.3

Relational databases should satisfy 3rd normal form, at a minimum. Consider higher normal forms when feasible.

The primary cause of errors in analyses using relational databases is redundancy. Normal forms ensure a lack of redundancy. 6th normal form ensures all the lower normal forms.
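A minimal illustration of the redundancy that normal forms eliminate, using hypothetical tables written as Python literals:

```python
# Denormalized: the district address is repeated on every case row, so a change
# must be made in many places and the copies can silently drift apart.
cases_denormalized = [
    {"case_id": 1, "district": "Eastern", "district_address": "100 Main St"},
    {"case_id": 2, "district": "Eastern", "district_address": "100 Main Street"},  # inconsistent copy
]

# Normalized (in the spirit of 3NF): each fact is stored once and referenced by key.
districts = {"Eastern": {"district_address": "100 Main St"}}
cases = [
    {"case_id": 1, "district": "Eastern"},
    {"case_id": 2, "district": "Eastern"},
]
```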

Principle 4.4

Clean up databases by removing columns and tables no longer intended for use. Archive out-of-date tables and columns before deleting.

An analyst, seeing data in out-of-date tables and columns, will reasonably assume these are available for use. The data may be out of date or incorrect, resulting in a faulty analysis.

Principle 4.5

**Every database should be documented with an entity relationship diagram and a data dictionary. These should be updated whenever a change is made to the design of the database.**

This provides continuity to new analysts and eliminates guesswork.

Principle 4.6

**Every data system should be documented with a dataflow diagram, at minimum.**

If we don't know how a data system works, we don't know what is coming out of it. If we don't know what is coming out of it, it has little analytical value.

Principle 4.7

Use standards to communicate between data and metadata repositories. Within data and metadata repositories, use the fields and design that best support a data system's intended function.

Communication always requires a common language. However, even a perfect implementation of some standard is of little use if it does not contain the organization and data fields your functionality requires.

Don't expect a perfect one-to-one correspondence between data fields in source and target databases. This rarely occurs, due to differences in functionality and hence in variable sets. Map to the greatest extent possible.

Principle 4.8

Standardize on variable names, code lists, and units of measure across all databases in your organization.

This eliminates the need for name correspondence tables. A code list is a list of valid values for a data field. If code lists are the same across databases then mapping is easy. If they differ, then one is faced with many-to-one, one-to-many, and one-to-none mappings. The second and third of these are most problematic.
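A minimal sketch of what happens when code lists are not standardized (the code lists are hypothetical): one-to-many and one-to-none mappings cannot be resolved automatically.

```python
# Hypothetical code lists from two databases that were not standardized.
source_to_target = {
    "THEFT": ["larceny", "burglary"],  # one-to-many: which target value applies?
    "ASSLT": ["assault"],              # one-to-one: mapping is trivial
    "OTHER": [],                       # one-to-none: no valid target value exists
}

def map_code(code: str) -> str:
    """Map a source code to a target code; anything but a one-to-one match needs human judgment."""
    targets = source_to_target.get(code, [])
    if len(targets) != 1:
        raise ValueError(f"{code!r} cannot be mapped automatically ({len(targets)} candidates)")
    return targets[0]
```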

Principle 4.9

Use journaling in your database design.

Journaling involves recording information about any changes made. This allows one to roll back to any previous state and to explain the effects of substantial changes when such questions arise.

Example one – Suppose statistics from a database are referenced in some journal article. Also suppose that the data in that database is collected on an ongoing basis and changes over time. The reference can only be meaningful if the previous state is referenced by date and it is possible to roll back to that date.

Example two – Suppose a time series shows that the cumulative number of murders in city X decreased from February to March. Since this is impossible, you know that it is based on a correction. People will want to know what that correction was, especially if it was large.
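A minimal sketch of an append-only journal that supports both examples (the field names and figures are hypothetical): values are never overwritten, so any past state can be reconstructed by date.

```python
from datetime import date

# Append-only journal of changes to a single statistic (hypothetical data).
journal = [
    {"as_of": date(2020, 2, 1), "field": "murders_ytd", "value": 14},
    {"as_of": date(2020, 3, 1), "field": "murders_ytd", "value": 12},  # correction applied here
]

def value_as_of(journal: list[dict], field: str, when: date):
    """Roll back: return the last value recorded for the field on or before the given date."""
    entries = sorted((e for e in journal if e["field"] == field and e["as_of"] <= when),
                     key=lambda e: e["as_of"])
    return entries[-1]["value"] if entries else None

value_as_of(journal, "murders_ytd", date(2020, 2, 15))  # -> 14, the value a February article cited
```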

Principle 4.10

Every table in a relational database should have a primary key.

Keys prevent duplicate record errors and are the means by which tables are joined.

Principle 4.11

Use compound primary keys in tables that represent many-to-many relationships.

In other cases, simple primary keys are preferred.
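A minimal sketch of a many-to-many association keyed by a compound primary key (the officer/case relationship is hypothetical): duplicate assignments are rejected outright.

```python
# Many-to-many relationship between officers and cases (hypothetical tables).
# The association table's primary key is the compound (officer_id, case_id),
# so the same assignment cannot be recorded twice.
assignments: dict[tuple[int, int], dict] = {}

def assign(officer_id: int, case_id: int, role: str) -> None:
    """Record an officer-to-case assignment, enforcing the compound key."""
    key = (officer_id, case_id)
    if key in assignments:
        raise KeyError(f"duplicate assignment {key}")
    assignments[key] = {"role": role}

assign(7, 1001, "lead")
# assign(7, 1001, "support")  # would be rejected as a duplicate record
```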
