@rodrigopedra
Created April 17, 2022 04:22
PHP import CSV
<?php
// generate.php - writes 21 million random rows as CSV to STDOUT
$file = fopen('php://output', 'w');

fputcsv($file, ['ID', 'NAME', 'AGE']);

for ($index = 0; $index < 21_000_000; $index++) {
    fputcsv($file, [$index + 1, bin2hex(random_bytes(20)), random_int(18, 65)]);
}

fclose($file);
<?php
// import-buffered.php - imports the CSV using multi-row INSERTs of BUFFER_SIZE rows each
define('BUFFER_SIZE', 500);

$pdo = new PDO('mysql:host=127.0.0.1;dbname=dummy;charset=utf8mb4', 'root', 'password', [
    PDO::ATTR_EMULATE_PREPARES => false,
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
]);

$pdo->exec('DROP TABLE IF EXISTS `import`');

$pdo->exec('CREATE TABLE `import` (
    `id` BIGINT UNSIGNED NOT NULL,
    `name` VARCHAR(100) NOT NULL,
    `age` INT UNSIGNED NOT NULL
)');

// a single prepared statement holding BUFFER_SIZE placeholder groups: (?, ?, ?),(?, ?, ?),...
$statement = $pdo->prepare('INSERT INTO `import`(`id`, `name`, `age`) VALUES ' . implode(',', array_fill(0, BUFFER_SIZE, '(?, ?, ?)')));

echo date('Y-m-d H:i:s'), PHP_EOL;

$file = fopen($argv[1], 'r');

fgetcsv($file, 100); // skip the header line

$buffer = [];

while (!feof($file)) {
    $fields = fgetcsv($file, 100);

    if ($fields === false) {
        continue;
    }

    $buffer[] = $fields;

    // flush the buffer as soon as it holds BUFFER_SIZE rows
    if (count($buffer) === BUFFER_SIZE) {
        $statement->execute(array_merge(...$buffer));

        $buffer = [];
    }
}

fclose($file);

// insert any leftover rows with a statement sized to the remainder
$remaining = count($buffer);

if ($remaining > 0) {
    $statement = $pdo->prepare('INSERT INTO `import`(`id`, `name`, `age`) VALUES ' . implode(',', array_fill(0, $remaining, '(?, ?, ?)')));
    $statement->execute(array_merge(...$buffer));
}

echo date('Y-m-d H:i:s'), PHP_EOL;

$value = $pdo->query('SELECT COUNT(*) FROM `import`')->fetchColumn();

echo number_format($value), PHP_EOL;
<?php
// import-linear.php - imports the CSV issuing one INSERT per row
$pdo = new PDO('mysql:host=127.0.0.1;dbname=dummy;charset=utf8mb4', 'root', 'password', [
    PDO::ATTR_EMULATE_PREPARES => false,
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
]);

$pdo->exec('DROP TABLE IF EXISTS `import`');

$pdo->exec('CREATE TABLE `import` (
    `id` BIGINT UNSIGNED NOT NULL,
    `name` VARCHAR(100) NOT NULL,
    `age` INT UNSIGNED NOT NULL
)');

$statement = $pdo->prepare('INSERT INTO `import`(`id`, `name`, `age`) VALUES (:id, :name, :age)');

echo date('Y-m-d H:i:s'), PHP_EOL;

$file = fopen($argv[1], 'r');

fgetcsv($file, 100); // skip the header line

while (!feof($file)) {
    $fields = fgetcsv($file, 100);

    if ($fields === false) {
        continue;
    }

    // one round-trip to the database per CSV row
    $statement->execute([
        ':id' => $fields[0],
        ':name' => $fields[1],
        ':age' => $fields[2],
    ]);
}

fclose($file);

echo date('Y-m-d H:i:s'), PHP_EOL;

$value = $pdo->query('SELECT COUNT(*) FROM `import`')->fetchColumn();

echo number_format($value), PHP_EOL;
$ php generate.php > data.csv
$ ls -lh data.csv
-rw-r--r-- 1 rodrigo rodrigo 1,1G abr 17 01:02 data.csv
$ php import-linear.php data.csv
2022-04-17 01:02:50
2022-04-17 01:17:00
21,000,000
$ php import-buffered.php data.csv
2022-04-17 01:17:16
2022-04-17 01:19:12
21,000,000
@rodrigopedra (Author)

The index I was thinking of adding is actually to a column that will be full of unique values. From my very limited knowledge, I thought you should only add an index to a column that had some repeated values. However, when thinking about it, the primary key is usually indexed and I know indexes help lookups so I thought I would add it. I would be interested in your thoughts on this.

You can add one, and actually columns with unique values are good candidates for an index.

Even Laravel comes with a unique index already preset on the `email` field of the default users migration.

Just make sure to create a unique index, for best performance.
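
For instance, with the `import` table from the scripts above, a unique index on `id` could be added like this (a minimal sketch reusing the $pdo connection from the import scripts; the index name is just illustrative):

<?php
// hypothetical index name, added after the import finishes
$pdo->exec('ALTER TABLE `import` ADD UNIQUE INDEX `import_id_unique` (`id`)');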

Database vendors these days use very complex and fancy ways to handle indices, but if we think about how indexes were traditionally handled, they usually build a tree of buckets of values (typically some variant of a B-tree), so traversal is sped up by a lot when searching for a unique value.

The advantage of using an index is that indices are usually stored in a different data structure (a file, or whatever the database vendor uses) than the table rows. So even if you needed to traverse all the values in an index, this would be much faster than traversing the whole table, as the table usually has a lot more columns than the index, so its records are much larger to read and parse one by one.

As I already said, an index is traversed in a very optimized way, usually using some sort of tree-like data structure, and once the desired entry is found, it has a "pointer" (usually a record number used internally by the database) to the record it refers to.

So to fetch the corresponding record, the database can "seek" (an operation that skips records without reading them from storage) very fast, right to the record it is searching for.

Less diverse values end up producing large buckets at the bottom of the index tree, as a leaf node will "point" to a bunch of records instead of a single one. That is not a big problem: even in a column which allows just 1 or 0 as values, if the proportion of values is about 50/50, it is still much better to retrieve 50% of the records than to manually check 100% of them.
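
A quick way to check whether a lookup actually uses the index is MySQL's EXPLAIN (a sketch, assuming the unique index from above was created):

<?php
// with the unique index on `id`, EXPLAIN reports a constant lookup (type "const");
// without any index it reports a full table scan (type "ALL" over ~21 million rows)
$plan = $pdo->query('EXPLAIN SELECT * FROM `import` WHERE `id` = 12345')->fetch(PDO::FETCH_ASSOC);
print_r($plan);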

Hope this is clearer now =)

@rodrigopedra (Author)

The other point is around the downtime aspect. I am indeed running this import script daily. I do exactly as you did in your sample and drop the table, then create it again before the fresh import. My concern is that I'll have live requests coming into the application which rely on this data as a lookup. So if I continue with this setup then there will be lookups happening on this table as it's dropped, created and re-populated. I'm wondering if I should do the creation into a my_table_name_temp and then once completed drop the my_table_name and rename my_table_name_temp to my_table_name. Again, would welcome your thoughts on this.

You can:

  • Take the app down while updating. For example, Laravel has a maintenance mode specifically for this use case, which will return a 503 response to any request while your app is down for maintenance.
  • Or use a strategy like the one you described: drop the old table, then rename the new one.

When I dealt with large databases, I usually used a mix of both: I ingested data into a temporary table (actually a real table created with the same structure as the target table), and once the data was ingested and verified, I would drop the old table and rename the new one.
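
A minimal sketch of that swap, assuming MySQL and that a hypothetical `import_temp` table has already been populated and verified:

<?php
// drop the live table, then promote the freshly ingested one
$pdo->exec('DROP TABLE IF EXISTS `import`');
$pdo->exec('RENAME TABLE `import_temp` TO `import`');

MySQL's RENAME TABLE can also swap several names in a single atomic statement, which shrinks the window in which the table does not exist.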

Note that if you have foreign keys referencing the table being replaced, you will need to (a rough sketch follows the list):

  1. disable foreign key checks on the database
  2. drop the foreign keys
  3. drop the old table
  4. rename the new one
  5. recreate the foreign keys
  6. re-enable foreign key checks on the database
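
Here is a rough sketch of those six steps against MySQL, reusing the PDO connection from the scripts above; the `orders` table, the `import_temp` table and the constraint name are hypothetical, just to show the shape of the commands:

<?php
$pdo->exec('SET FOREIGN_KEY_CHECKS = 0');                                       // 1. disable FK checks
$pdo->exec('ALTER TABLE `orders` DROP FOREIGN KEY `orders_import_id_foreign`'); // 2. drop the foreign key
$pdo->exec('DROP TABLE `import`');                                              // 3. drop the old table
$pdo->exec('RENAME TABLE `import_temp` TO `import`');                           // 4. rename the new one
$pdo->exec('ALTER TABLE `orders` ADD CONSTRAINT `orders_import_id_foreign`
    FOREIGN KEY (`import_id`) REFERENCES `import` (`id`)');                     // 5. recreate the foreign key
$pdo->exec('SET FOREIGN_KEY_CHECKS = 1');                                       // 6. re-enable FK checks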

There are other strategies you can use if you can't afford any downtime:

  • You can create a replica database, update the replica, and then switch the app's connection to the replica.

Another one is using transactions, which will lock your app for a while, but both the database and the app can stay alive; requests are just going to be delayed. This is very similar to the drop/rename strategy above (see the sketch after the list):

  1. ingest the new data into a temporary table
  2. disable foreign key checks on the database
  3. start a transaction
  4. delete all data in the target table (yes, I am talking about a DELETE without a WHERE)
  5. insert the data from the temporary table into the target table (use INSERT with SELECT for this)
  6. commit the transaction
  7. re-enable foreign key checks on the database
  8. drop the temporary table we used to ingest the new data
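
A minimal sketch of this transaction-based variant, again assuming MySQL/InnoDB and that a hypothetical `import_temp` table already holds the verified new data (step 1):

<?php
$pdo->exec('SET FOREIGN_KEY_CHECKS = 0');                        // 2. disable FK checks

$pdo->beginTransaction();                                        // 3. start a transaction
$pdo->exec('DELETE FROM `import`');                              // 4. delete all current rows
$pdo->exec('INSERT INTO `import` SELECT * FROM `import_temp`');  // 5. copy the new data over
$pdo->commit();                                                  // 6. commit the transaction

$pdo->exec('SET FOREIGN_KEY_CHECKS = 1');                        // 7. re-enable FK checks
$pdo->exec('DROP TABLE `import_temp`');                          // 8. drop the ingestion table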

The main difference here is the transaction usage, which will try to guarantee the database stays in a good state once the data is committed.

This approach has some downsides:

  • You need sufficient disk space, as you are dealing with a big dataset and will basically hold 3 copies of the data while the transaction is happening.
    • Aside from the two tables, database servers copy the data altered inside a transaction to some sort of "undo" storage, in case you need to roll back the transaction instead of committing it.
  • Reads on the database are usually fine while the transaction is running, but writes that depend on the data inside the transaction will be queued up, which could lead to some timeouts.

Hope these ideas help =)

@rodrigopedra (Author)

@jamesmills a final disclaimer:

Although I spent some years (about 7) dealing with large datasets, I worked more on the ETL and analysis side of the job, and not on database management.

Of course, over the years you learn a trick or two when dealing with the same problem set over and over. But the options I gave above, especially in the second comment on your additional questions, might not be the best solution.

Those are strategies I would use, based on my experience, but if my team had a database management expert, I would talk to them first to assess whether this is the best strategy.

So if anyone else reading this knows better, please share in the comments how you would handle these. Thanks =)
