Skip to content

Instantly share code, notes, and snippets.

@druu
Last active December 28, 2015 18:09
Show Gist options
  • Save druu/7541557 to your computer and use it in GitHub Desktop.
Save druu/7541557 to your computer and use it in GitHub Desktop.
Single File Data-Normalization Demo

normalizedSync.php - Single file demo


Are you trying to sync de-normalized data from a remote source (e.g. webservice)
and struggle importing it into your nicely normalized database?

Maybe this little demo might give you an idea!

Part 1:

Normalize existing DB-Data ( http://sqlfiddle.com/#!2/aa28e/1 )

Part 2: Keepin' it all in sync

run this from your CLI: curl -Ls http://git.io/6VTovw | php
or download the normalizedSync.php file and run it locally

So... just what exactly is going on here?

Picture this:

You're facing yourself with the task to frequently pull data from theSERVICE via their API and keep a synchronized copy on your DB cluster.
You skim the docs, run some little test scripts and have soon found just the right combination of URL and request data to get exactly the data you need to have.
You head happily and motivated into implementing and testing your little cron jobby.

And then you realize...

Yeah, but what happened???

Well... Someone over at theSERVICE thought

Nah, our users surely won't need the relational data from our systems... Let's give 'em the human friendly, easy readable, flat version.

Or maybe someone though

3NF - WHAT? Ain't nobody got tyme fo' dat!

So basically, you're facing this situation:

   +----------------------+---------------------------+--------------------+-------------+
   |  vendor              | category                  | name               |    ....     |
   +----------------------|---------------------------|--------------------|-------------+
   |                      |                           |                    |             |
   |  Vendor 1            | Category 2                | Product 1          |             |
   +-------------------------------------------------------------------------------------+
   |                      |                           |                    |             |
   |  Vendor 1            | Category 2                | Product 2          |             |
   +-------------------------------------------------------------------------------------+
   |                      |                           |                    |             |
   |  Vendor 2            | Category 1                | Product 3          |             |
   +-------------------------------------------------------------------------------------+
   |                      |                           |                    |             |
   |  Vendor 3            | Catefory 3                | Product 4          |             |
   +----------------------+---------------------------+--------------------+-------------+
   \\===> This is the structure of the data you received...

But:

Your system looks like this:

   +-----+-----------------------+          +-----+--------------------------+
   | id  |  name                 |          | id  |  name                    |
   |-----|-----------------------|          |-----|--------------------------|
   |     |                       |          |     |                          |
   | 1   |  Vendor 1             |          | 1   |  Category 43             |
   +-----------------------------+          +--------------------------------+
   |     |                       |          |     |                          |
   | 2   |  Vendor 2             |          | 2   |  Category 1              |
   +-----------------------------+          +--------------------------------+
   |     |                       |          |     |                          |
   +-----+-----------------------+          +-----+--------------------------+
   \\==============> vendors table          \\==============> categories table
     
    
   +-----+--------------+-----------------+----------------------+-----------+
   | id  | vendor_id    | category_id     | name                 |    ...    |
   |-----|--------------|-----------------|----------------------|-----------|
   |     |              |                 |                      |           |
   | 1   | 1            | 2               | Product 1            |           |
   +-------------------------------------------------------------------------+
   |     |              |                 |                      |           |
   | 2   | 1            | 3               | Product 2            |           |
   +-------------------------------------------------------------------------+
   |     |              |                 |                      |           |
   | 3   | 2            | 12              | Product OVER 9000    |           |
   +-------------------------------------------------------------------------+
   \\=========================================================> products table

And:

Your cron-script takes an hour to complete...

Dis no good.

Enter: This Demo Script!

Again: just run curl -Ls http://git.io/6VTovw | php from your CLI and see the data flowing in.

Let me quickly describe what's going on there:

  1. PHP parses the source file for any class declarations
  2. The script grabs Jeremy Dorn's (@jdorn) SqlFormatter to get some juice ;)
  3. The main loop is called. Just 3 iterations of a formloop, with constantly growing result sets
  4. Within the loop: * Related data (vendors, categories) get extracted from the flat dataset * Then 2 batch insert queries are fired (ON DUPLICATE KEY UPDATE) and the results are instantly retrieved * And a lookup transformation is applied:
    Before: vendors[] = {"id": 1, "name": "Vendor 1"}
    After: vendors["Vendor 1"] = {"id": 1, "name": "Vendor 1"} * So with that easy accessible array of meta-infos, for every row of the received data, the columns vendor and category will be replaced by vendor_id, respectively category_id * Out of this normalized result set, we can quickly create another batch insert
  5. And be done.

But seriously, read the source, get a hang of what's happening.
This was more or less hastily done, and probably has loads of things to optimize.

But you get the idea ;)

One more thing...

There are a few things I'd like to point out and/or explain:

  • Why do you use array_map() that much? Why no loops?

    I easily could've used loops, but I wanted to keep it relatively short

  • Why the hell did you even write this?

    Well, I started answering a question about how that particular person could implement an efficient way to solve exactly this problem.
    The things got out of hand...

  • So, is it efficient?

    To be fair, I just lab-tested it. I have no real-life benchmarks or anything. This particular thing uses 15 db queries in total, and the queries aren't super complex either. But again, I have not analyzed in regards of performance

  • Okay, Lab-tested? You sure have run it against a DB, haven't ya?

OK! Here's the interesting bit:

We've got 3 classes (Sorry @jdorn ):

  1. Helper
  2. Extractor
  3. DB

It is Helper's job to simulate the API-call, perform the lookup transformations, generate the query string, and so on. Just a little collection of sort of generic methods.

When you run the script you will see some output like

*** SysStats: ******************************************************************
	$> Received 918 products!
	$> DB Stats:
	$> COUNT(vendors.*) : 100 rows
	$> COUNT(categories.*) : 100 rows
	$> COUNT(products.*) : 1867 rows
	$> Current query count: 15
*** End of SysStats ************************************************************

When you take a look at Helper::get_products() you will find that I've limited the number of unique vendors and categories to a hundred results each.
That's just for the sake of creating key-collision so the update part of the query is fired to, demonstrating the desired behaviour. (Albeit only accumulating data at this point)

Then we've got the Extractor (cue dramatic music)
It's whole purpose is just to extract the meta-info from the API's data, induce the assembly of the query string, and initiate the lookup optimization.

...sigh.

Yeah, and then there's the weird uncle at the family reunion, that no one is sure of his actual relation to your family. He also stinks, and doesn't like you.

Ladies and gentlemen, please welcome: the Database-Emulator!

I can't quite explain what led my to produce this piece of... this thing.
But hey! It does it's job!

It emulates an extremely reduced subset of a DBMS, taking an actual SQL-Query, extracting the wanted data from it, and storing it in a projection of what should've some sort of Relational DBMS scheme.

Also, not wanting to overcomplicating it too much, I just gave it a few quick'n'dirty shorthand methods, to make it obey. :)


As I said, check the source!
It's always fun to see, what the human brain comes up with, when you're actually drifting into insanity :D

Drop me a note, if I haven't scared you already :)


And just btw:

Here's a call graph:

image

Have Fun! druu

P.S.: Thanks to Jeremy Dorn (@jdorn) for that supreme SQL Formatter :)

<?php
// *** PART 1: Normalizing existing data
// http://sqlfiddle.com/#!2/aa28e/1
//
//
//
// *** PART 2: Getting that webservice's data in
//
// * Assumptions for this example:
// $data = array(
// (object) array('vendor' => 'Vendor ###', 'category' => 'Cat ###', 'name' => 'Bla...'),
// (object) array('vendor' => 'Vendor ###', 'category' => 'Cat ###', 'name' => 'Bla...'),
// ...
// );
//
// JSON (starting @ actual data):
// var data = [
// {"vendor": "Vendor 1", "category": "Category 1", "name": "Product 1"},
// {"vendor": "Vendor 1", "category": "Category 1", "name": "Product 1"},
// ...
// ];
//
// Or something like this, you get the idea
//
// Just one thing: getting some external help
// PREPARATION: SQL Formatter (just for prettier output)
require_once ('https://raw.github.com/jdorn/sql-formatter/master/lib/SqlFormatter.php');
// Also: Wanna show SQL-Queries in output?
define('SHOW_SQL', true);
// YOU READY???
Helper::print_banner('Starting Demonstration...');
// Now here's what you'd have to do (aka BLACK FUCKING MAGIC):
// Connect to DB (No shit, Sherlock...)
DB::connect();
$max_products = array(50, 3000, 5000); // Values are chosen to definitely hit duplicates
Helper::print_banner('Syncin '.count($max_products).' times with up to [' . implode(', ', $max_products) . '] items...');
for($i = 0, $l = count($max_products); $i < $l; $i += 1) {
Helper::print_banner('Iteration #'.($i+1).': up to ' . $max_products[$i] . ' items...', true);
// Get data from Webservice (while increasing the fake products)
// Duplicates have to happen at some point
$products = Helper::get_products(5, $max_products[$i]);
$recv_products = count($products);
// FIRST STEP: Sync & Update Meta-Data
// Set Up: Extraction-Automation
// Structure: $conf = array(
// array( <db_table>, <desired_key>, array( <insert_key_list> ),
// array( <db_table>, <desired_key>, array( <insert_key_list> ),
// ...
// );
//
// Don't ask me, I'm going slightly mad....
$extract_conf = array(
array( 'table' => 'vendors', 'in' => 'vendor', 'keys' => array('name')),
array( 'table' => 'categories', 'in' => 'category', 'keys' => array('name')),
);
// Get Queries
$queries = Extractor::create_meta_queries($extract_conf, $products);
// Sync dem metas
foreach ($queries as $query) {
DB::statement($query);
}
unset($queries);
// GET 'EM BACK!
// And while we're at it: set result-array indexes to name-value of row
// because cache and faster lookup you know
$cache_conf = array(
'vendors' => array('key' => 'name', 'table' => 'vendors'),
'categories' => array('key' => 'name', 'table' => 'categories')
);
$meta = Extractor::create_cache($cache_conf, $products);
// Now let's get that product array sorted
// Vendor <> Venoder_id && Category <> Category_id
$replacements = array(
'vendors' => array( 'search' => 'vendor', 'replace' => 'id', 'replace_key' => 'vendor_id' ),
'categories' => array( 'search' => 'category', 'replace' => 'id', 'replace_key' => 'category_id' ),
);
$products = array_map(function($product) use ($replacements, $meta){
$normalized_product = new stdClass();
foreach ($replacements as $table => $conf) {
$conf = (object) $conf;
$product->{$conf->replace_key} = $meta[$table][$product->{$conf->search}]->{$conf->replace};
unset($product->{$conf->search});
}
return $product;
} ,$products);
// DONE! Let's smash it into the DB
// This'll hurt...
$query = Helper::generate_batch( 'products', array('name', 'vendor_id', 'category_id'), $products );
// PERMANENT STORAGE!!!!
DB::statement( $query );
Helper::print_banner('SysStats:');
Helper::print_message('$> Received ' . $recv_products . ' products!');
Helper::print_message('$> DB Stats: ');
$tables = array('vendors', 'categories', 'products');
foreach ($tables as $table) {
list($desc, $count) = each(DB::count($table));
Helper::print_message('$> ' . $desc . ' : ' . $count . ' rows');
}
Helper::print_message('$> Current query count: '. DB::get_query_counter());
Helper::print_banner('End of SysStats');
Helper::print_message('');
}
// Yup...
Helper::print_banner('Thanks for watching...');
/*******************************************************************************
* WAY TOO MOTIVATED! Only the Helper and Extractor Class are relevant, really..
******************************************************************************/
// PREPARATION: EXTRACTOR
class Extractor {
public static function create_meta_queries($extract_conf, $products)
{
// Create Meta queries
$queries = array_map(
function($conf) use ($products){
return Helper::generate_batch( $conf['table'], $conf['keys'], Helper::extract_meta_value($conf['in'], $products) ); }
, $extract_conf);
return $queries;
}
public static function create_cache($conf, $products) {
return array_map(
function($c) {
return Helper::transform_lookup( DB::select($c['table']), $c['key'] );
}
, $conf);
}
}
// PREPARATION: HELPER THINGY
class Helper {
public static function get_products($min = 5, $max = 20) {
$products = array();
$length = $counter = rand($min, $max);
while ($counter--) {
$product = new stdClass();
$product->vendor = 'Vendor ' . rand(1,100);
$product->category = 'Category ' . rand(1,100);
$product->name = 'Product ' . ($l - $c);
$products[] = (object) $product;
}
return $products;
}
public static function extract_meta_value( $key_in, $data) {
return array_unique( array_map( function($item) use ($key_in) {return $item->$key_in; }, $data));
}
public static function transform_lookup($arr, $key = 'id') {
if (is_array($arr)) {
$out = array();
foreach ($arr as $a) {
$b = (object) $a;
if (is_array($a)) { $b = (object) $b; }
$out[$b->$key] = $a;
}
return $out;
}
return array();
}
public static function generate_batch ($table, $keys, $data) {
$data = array_map(function($item) { return '("' . implode('", "', array_values( (array) $item )) . '")'; } , $data);
return "INSERT INTO `".$table."` (`".implode('`, `', $keys)."`) VALUES ".implode(', ', $data)." ON DUPLICATE KEY UPDATE `updated_at`=NOW()";
}
public static function print_banner($title, $empty_line = false) {
echo str_pad('*** ' . $title . ' ', 80, '*' ) . PHP_EOL . ($empty_line ? PHP_EOL : '');
}
public static function print_message($message, $empty_line = false) {
echo implode( PHP_EOL, array_map( function($l) {return "\t".$l;}, explode("\n", $message ) ) ) . PHP_EOL . ($empty_line ? PHP_EOL : '');
}
// Don't ask...
}
// PREPARATION: DB Class Emulator (so that this example runs standalone)
// Am I weird?
class DB {
protected static $_database;
protected static $_query_counter;
// FAKE CONNECTOR + Schema Generator
public static function connect() {
$scheme = new stdClass();
$scheme->vendors = (object) array(
'auto_increment' => 1,
'columns' => array('id', 'name', 'created_at', 'updated_at'),
'_hashkeys' => array('name'),
'data' => array()
);
$scheme->categories = (object) array(
'auto_increment' => 1,
'columns' => array('id', 'name', 'created_at', 'updated_at'),
'_hashkeys' => array('name'),
'data' => array()
);
$scheme->products = (object) array(
'auto_increment' => 1,
'columns' => array('id', 'vendor_id', 'category_id', 'name', 'created_at', 'updated_at'),
'_hashkeys' => array('vendor_id', 'category_id', 'name'),
'data' => array()
);
self::$_database = $scheme;
self::$_query_counter = 0;
if(SHOW_SQL) {
Helper::print_banner('Created Database...');
}
}
// Fake Insert query executor
public static function statement( $query ) {
self::increment_query_counter();
if(SHOW_SQL) {
Helper::print_banner('EXECUTING SQL QUERY:', true);
//SqlFormatter::format($query, false)
Helper::print_message(SqlFormatter::format($query, false), true);
Helper::print_banner('END OF QUERY', true);
}
self::store($query);
}
// Fake Select executor
public static function select( $table, $keys = '*' ) {
self::increment_query_counter();
$db = self::$_database;
$keys = array_map(trim, array_values( is_array($keys) ?: explode(',', $keys)));
if(!isset($db->$table)) { die ('!!! TABLE DOES NOT EXIST'); }
$result = array();
foreach($db->$table->data as $row) {
$tmp_row = clone $row;
if(!(count($keys) === 1 && $keys[0] === '*')) {
foreach ($tmp_row as $col => $val) {
if(!in_array($col, $keys)) { unset($tmp->$col); }
}
}
$result[] = $tmp_row;
}
return $result;
}
public static function count($table)
{
return (object) array("COUNT($table.*)" => count(self::$_database->{$table}->data));
}
public static function get_query_counter() { return self::$_query_counter; }
// This is getting out of hand....
protected static function get_relation_info( $query )
{
$info = new stdClass();
$cmd = strtoupper(trim(strstr($query, ' ', true)));
preg_match_all('~`([a-z_]+)?`~i', $query, $identifiers);
switch ($cmd) {
case 'INSERT':
$info->table = $identifiers[1][0];
$info->keys = array_slice($identifiers[1],1,-1);
$info->update_key = end($identifiers[1]);
preg_match_all('~VALUES \((.*)\) ON DUPLICATE~i', $query, $values);
$tuples = array_map(function($val){ return trim($val, " \t\n\r\0\x0B\""); }, explode('), (', $values[1][0]));
$info->rows = array_map(function($set) use ($info) { return (object) array_combine($info->keys, explode('", "', $set)); }, $tuples );
break;
default:
# code...
break;
}
return $info;
}
protected static function store( $query ) {
$info = self::get_relation_info($query);
$hash_keys = self::$_database->{$info->table}->_hashkeys;
$offset = self::$_database->{$info->table}->auto_increment;
foreach ($info->rows as $row) {
$hashable = array();
foreach ($hash_keys as $key) {
$hashable[] = $row->$key;
}
$hash = md5(implode('#', $hashable));
if(in_array($hash, array_keys(self::$_database->{$info->table}->data) ) ) {
self::$_database->{$info->table}->data[$hash]->updated_at = date('Y-m-d H:i:s');
}
else {
$row->updated_at = '0000-00-00 00:00:00';
$row->created_at = date('Y-m-d H:i:s');
$row->id = self::$_database->{$info->table}->auto_increment = $offset++;
self::$_database->{$info->table}->data[$hash] = $row;
}
}
}
protected static function increment_query_counter($value='')
{
self::$_query_counter += 1;
}
// No really.... Don't ask...
}
-- http://sqlfiddle.com/#!2/aa28e/1
-- ***************************************
-- * SCROLL DOWN FOR TEH INTERESTING BIT *
-- ***************************************
CREATE TABLE `products_old` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`vendor` varchar(50) DEFAULT NULL,
`category` varchar(50) DEFAULT NULL,
`name` varchar(50) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=10 DEFAULT CHARSET=latin1;
INSERT INTO `products_old` (`id`, `vendor`, `category`, `name`)
VALUES
(1, 'Vendor 1', 'Category 1', 'Product 1'),
(2, 'Vendor 2', 'Category 2', 'Product 2'),
(3, 'Vendor 3', 'Category 3', 'Product 3'),
(4, 'Vendor 1', 'Category 1', 'Product 4'),
(5, 'Vendor 2', 'Category 2', 'Product 5'),
(6, 'Vendor 3', 'Category 3', 'Product 6'),
(7, 'Vendor 1', 'Category 1', 'Product 7'),
(8, 'Vendor 2', 'Category 2', 'Product 8'),
(9, 'Vendor 3', 'Category 3', 'Product 9');
CREATE TABLE `categories` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(50) DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `name` (`name`)
) ENGINE=InnoDB AUTO_INCREMENT=4 DEFAULT CHARSET=latin1;
INSERT INTO `categories` (`id`, `name`)
VALUES
(1, 'Category 1'),
(2, 'Category 2'),
(3, 'Category 3');
CREATE TABLE `vendors` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(50) DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `name` (`name`)
) ENGINE=InnoDB AUTO_INCREMENT=4 DEFAULT CHARSET=latin1;
INSERT INTO `vendors` (`id`, `name`)
VALUES
(1, 'Vendor 1'),
(2, 'Vendor 2'),
(3, 'Vendor 3');
CREATE TABLE `products` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`vendor_id` int(11) DEFAULT NULL,
`category_id` int(11) DEFAULT NULL,
`name` varchar(50) DEFAULT NULL,
`updated_at` datetime DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `vendor_id` (`vendor_id`,`category_id`,`name`)
) ENGINE=InnoDB AUTO_INCREMENT=27 DEFAULT CHARSET=latin1;
-- *****************************************************************
-- * THIS IS WHERE THE MAGIC HAPPENS *
-- * SIDE NOTE: This query should be in the query window there, ==>
-- * But SQLFiddle doesn't like that. *
-- * That's why I'm running it twice... *
-- *****************************************************************
-- *** PART ONE: Normalizing existing de-normalized data:
-- First run
INSERT INTO `products` (`vendor_id`, `category_id`, `name`)
SELECT
(SELECT id from vendors v WHERE v.`name` = p.vendor) as vid,
(SELECT id from categories c WHERE c.`name` = p.category) as cid,
p.name pname
FROM products_old p
ON DUPLICATE KEY UPDATE updated_at = NOW();
-- Fake some additional products
INSERT INTO `products_old` (`vendor`, `category`, `name`)
VALUES
('Vendor 3', 'Category 1', 'Product 100'),
('Vendor 3', 'Category 2', 'Product 200'),
('Vendor 3', 'Category 3', 'Product 300');
-- UPDATE DA SHIT OUTTA IT
INSERT INTO `products` (`vendor_id`, `category_id`, `name`)
SELECT
(SELECT id from vendors v WHERE v.`name` = p.vendor) as vid,
(SELECT id from categories c WHERE c.`name` = p.category) as cid,
p.name pname
FROM products_old p
ON DUPLICATE KEY UPDATE updated_at = NOW();
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment