@rodrigopedra
Created April 17, 2022 04:22
PHP import CSV
generate.php

<?php

// Generates a 21-million-row CSV (ID, NAME, AGE) on standard output.
$file = fopen('php://output', 'w');
fputcsv($file, ['ID', 'NAME', 'AGE']);

for ($index = 0; $index < 21_000_000; $index++) {
    fputcsv($file, [$index + 1, bin2hex(random_bytes(20)), random_int(18, 65)]);
}

fclose($file);
import-buffered.php

<?php

// Imports the generated CSV using multi-row INSERTs of BUFFER_SIZE rows each.
define('BUFFER_SIZE', 500);

$pdo = new PDO('mysql:host=127.0.0.1;dbname=dummy;charset=utf8mb4', 'root', 'password', [
    PDO::ATTR_EMULATE_PREPARES => false,
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
]);

$pdo->exec('DROP TABLE IF EXISTS `import`');
$pdo->exec('CREATE TABLE `import` (
    `id` BIGINT UNSIGNED NOT NULL,
    `name` VARCHAR(100) NOT NULL,
    `age` INT UNSIGNED NOT NULL
)');

// One prepared statement with BUFFER_SIZE groups of placeholders: (?, ?, ?),(?, ?, ?),...
$statement = $pdo->prepare('INSERT INTO `import`(`id`, `name`, `age`) VALUES ' . implode(',', array_fill(0, BUFFER_SIZE, '(?, ?, ?)')));

echo date('Y-m-d H:i:s'), PHP_EOL;

$file = fopen($argv[1], 'r');

fgetcsv($file, 100); // skip first line

$buffer = [];

while (!feof($file)) {
    $fields = fgetcsv($file, 100);

    if ($fields === false) {
        continue;
    }

    $buffer[] = $fields;

    if (count($buffer) === BUFFER_SIZE) {
        $statement->execute(array_merge(...$buffer));
        $buffer = [];
    }
}

fclose($file);

// Flush the last, partially filled buffer with a statement sized to the remainder.
$remaining = count($buffer);

if ($remaining > 0) {
    $statement = $pdo->prepare('INSERT INTO `import`(`id`, `name`, `age`) VALUES ' . implode(',', array_fill(0, $remaining, '(?, ?, ?)')));
    $statement->execute(array_merge(...$buffer));
}

echo date('Y-m-d H:i:s'), PHP_EOL;

$value = $pdo->query('SELECT COUNT(*) FROM `import`')->fetchColumn();
echo number_format($value), PHP_EOL;
import-linear.php

<?php

// Imports the generated CSV one INSERT per row, for comparison.
$pdo = new PDO('mysql:host=127.0.0.1;dbname=dummy;charset=utf8mb4', 'root', 'password', [
    PDO::ATTR_EMULATE_PREPARES => false,
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
]);

$pdo->exec('DROP TABLE IF EXISTS `import`');
$pdo->exec('CREATE TABLE `import` (
    `id` BIGINT UNSIGNED NOT NULL,
    `name` VARCHAR(100) NOT NULL,
    `age` INT UNSIGNED NOT NULL
)');

$statement = $pdo->prepare('INSERT INTO `import`(`id`, `name`, `age`) VALUES (:id, :name, :age)');

echo date('Y-m-d H:i:s'), PHP_EOL;

$file = fopen($argv[1], 'r');

fgetcsv($file, 100); // skip first line

while (!feof($file)) {
    $fields = fgetcsv($file, 100);

    if ($fields === false) {
        continue;
    }

    $statement->execute([
        ':id' => $fields[0],
        ':name' => $fields[1],
        ':age' => $fields[2],
    ]);
}

fclose($file);

echo date('Y-m-d H:i:s'), PHP_EOL;

$value = $pdo->query('SELECT COUNT(*) FROM `import`')->fetchColumn();
echo number_format($value), PHP_EOL;
$ php generate.php > data.csv
$ ls -lh data.csv
-rw-r--r-- 1 rodrigo rodrigo 1,1G abr 17 01:02 data.csv
$ php import-linear.php data.csv
2022-04-17 01:02:50
2022-04-17 01:17:00
21,000,000
$ php import-buffered.php data.csv
2022-04-17 01:17:16
2022-04-17 01:19:12
21,000,000
@rodrigopedra (Author)

One note: there is a limit on how many values MySQL accepts in a single multi-row insert (I am not sure whether it is a MySQL limitation or a PDO one).

I don't remember the exact figure right now, but from memory it was somewhere around 4,000 values.

So in import-buffered.php, try to keep BUFFER_SIZE x <number of columns to be inserted> below 4,000.
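
For reference, a rough sketch of how one could derive a buffer size from that rule of thumb. The helper name and the 4,000 budget are illustrative assumptions, not a documented MySQL or PDO limit:

<?php

// Illustrative only: pick a row count per INSERT so that
// rows x columns stays under an assumed placeholder budget.
const PLACEHOLDER_BUDGET = 4_000;

function safeBufferSize(int $columnCount, int $budget = PLACEHOLDER_BUDGET): int
{
    return intdiv($budget, $columnCount);
}

echo safeBufferSize(3), PHP_EOL; // 1333 rows; the gist's BUFFER_SIZE of 500 (1,500 placeholders) is well below that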

@rodrigopedra (Author)

second note:

While this approach of inserting into a table with no indices or keys may seem unrealistic, I used it as an ETL strategy for several years when I worked in an industry that required large dataset ingestion (I no longer work in that area).

The idea is to import into a dummy table that mimics your input data layout, and then do any validation and data verification once the dataset is fully loaded into your database. If something fails during import, verification, or validation, you can discard the whole import without touching your domain tables.

Also, once you have the data inside your database, you can transform it into another temp table that mimics the destination table, and then add any needed indices to speed up the merge with the domain table. This way you minimize the time the transactional tables stay locked by a large merge.
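
To make that flow concrete, here is a minimal sketch of the staging approach, assuming a hypothetical people domain table with the same three columns as the sample data (the table names, the validation check, and the merge step are all made up for illustration):

<?php

$pdo = new PDO('mysql:host=127.0.0.1;dbname=dummy;charset=utf8mb4', 'root', 'password', [
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
]);

// 1. Validate the raw import; discard everything if a check fails.
$invalid = $pdo->query('SELECT COUNT(*) FROM `import` WHERE `age` < 18 OR `age` > 65')->fetchColumn();

if ($invalid > 0) {
    exit('Validation failed, discarding this import.' . PHP_EOL);
}

// 2. Transform into a temp table shaped like the destination table.
$pdo->exec('DROP TABLE IF EXISTS `people_staging`');
$pdo->exec('CREATE TABLE `people_staging` (
    `id` BIGINT UNSIGNED NOT NULL,
    `name` VARCHAR(100) NOT NULL,
    `age` INT UNSIGNED NOT NULL
)');
$pdo->exec('INSERT INTO `people_staging` SELECT `id`, `name`, `age` FROM `import`');

// 3. Only now add the index needed for the merge, then merge into the domain table.
$pdo->exec('ALTER TABLE `people_staging` ADD PRIMARY KEY (`id`)');
$pdo->exec('INSERT INTO `people` (`id`, `name`, `age`)
    SELECT `id`, `name`, `age` FROM `people_staging`
    ON DUPLICATE KEY UPDATE `name` = VALUES(`name`), `age` = VALUES(`age`)');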

I am not sure whether this is the best approach by current industry standards, but back then it worked very well for us and minimized both import errors and import times.

@rodrigopedra (Author)

one more note:

As this was mostly a quick exercise, most error handling is absent =)

So if this inspires anyone to use this code, please take care of the unhandled errors, such as the PDO connection failing, fopen() failing, fgetcsv() not returning the expected column count, etc.
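
As a starting point, a minimal sketch of the kind of guards meant here (the exit codes and messages are arbitrary):

<?php

try {
    $pdo = new PDO('mysql:host=127.0.0.1;dbname=dummy;charset=utf8mb4', 'root', 'password', [
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
    ]);
} catch (PDOException $exception) {
    fwrite(STDERR, 'Connection failed: ' . $exception->getMessage() . PHP_EOL);
    exit(1);
}

$file = @fopen($argv[1] ?? '', 'r');

if ($file === false) {
    fwrite(STDERR, 'Could not open input file.' . PHP_EOL);
    exit(1);
}

fgetcsv($file, 100); // skip header

while (($fields = fgetcsv($file, 100)) !== false) {
    if (count($fields) !== 3) {
        fwrite(STDERR, 'Skipping malformed line.' . PHP_EOL);
        continue;
    }

    // ... buffer and insert as in the scripts above ...
}

fclose($file);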

@jamesmills commented Apr 18, 2022

@rodrigopedra thanks so much for taking the time to put this together and for adding your additional notes.

I copied and pasted your script, made some naming adjustments, and ran it:

2022-04-18 06:10:10
2022-04-18 06:13:21
21,833,430

Worked perfectly!

I'm not sure if I was doing something wrong with the package I was using before, but for this I think it makes much more sense to run rawer code like what you have written.

I think the longest amount of time will now be spent downloading the source file from FTP, lol.

I already have a couple of separate servers which manage cron/scheduled tasks so I can add this to run on one of those daily.

I don't really need to add any additional foreign keys or anything, as it's really just a lookup table. I do need to add an INDEX on one of the columns, but I guess I'll just add that after the data has been inserted.

I'm going to continue playing around with this. Thanks again.

@rodrigopedra (Author)

@jamesmills great to learn it helped you out. If the column you want to index is not very sparse, such as the sample age field I used in the example, adding the index at table creation should not hurt insertion performance too much.

On a sparse dataset it will, as the database will have to rebuild and reorganize the index's data structure more often.

But if you can handle a small downtime, as it seems you can from your description, I would definitely add the index later. Sorry if I misunderstood, but it seems you download and replace the full dataset on a regular schedule, which is why I assume your project can handle a small downtime.
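
For example, once the import finishes, adding the index can be a single statement run from the same script (the index targets the sample age column here; swap in whichever column you actually need):

<?php

$pdo = new PDO('mysql:host=127.0.0.1;dbname=dummy;charset=utf8mb4', 'root', 'password', [
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
]);

// Added only after the bulk insert, so it does not slow the import down.
$pdo->exec('ALTER TABLE `import` ADD INDEX `import_age_index` (`age`)');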

@jamesmills

@rodrigopedra a couple of good points, which lead me to two follow-up questions.

The index I was thinking of adding is actually to a column that will be full of unique values. From my very limited knowledge, I thought you should only add an index to a column that had some repeated values. However, when thinking about it, the primary key is usually indexed and I know indexes help lookups so I thought I would add it. I would be interested in your thoughts on this.

The other point is around the downtime aspect. I am indeed running this import script daily. I do exactly as you did in your sample and drop the table, then create it again before the fresh import. My concern is that I'll have live requests coming into the application which rely on this data as a lookup. So if I continue with this setup, there will be lookups happening on this table as it's dropped, created, and re-populated. I'm wondering if I should instead build the data in a my_table_name_temp and then, once it's complete, drop my_table_name and rename my_table_name_temp to my_table_name. Again, I would welcome your thoughts on this.

Thanks!

@rodrigopedra (Author)

The index I was thinking of adding is actually to a column that will be full of unique values. From my very limited knowledge, I thought you should only add an index to a column that had some repeated values. However, when thinking about it, the primary key is usually indexed and I know indexes help lookups so I thought I would add it. I would be interested in your thoughts on this.

You can add it, and columns with unique values are actually a good candidate for an index.

Even Laravel ships with a unique index already defined on the email field of the default users migration.

Just make sure to create a unique index, for best performance.
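
For instance (the table, column, and index names here are hypothetical, since the lookup table layout was not shared):

<?php

$pdo = new PDO('mysql:host=127.0.0.1;dbname=dummy;charset=utf8mb4', 'root', 'password', [
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
]);

// A UNIQUE index both speeds up single-value lookups and rejects accidental duplicates.
$pdo->exec('ALTER TABLE `lookup` ADD UNIQUE INDEX `lookup_code_unique` (`code`)');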

Database vendors use very complex and fancy ways to handle indices these days, but thinking about how indexes were traditionally handled, they usually build a tree of buckets of values, so traversal is heavily optimized when searching for a unique value.

The advantage of using an index is that indices are usually stored in a different data structure (a file, or whatever the database vendor uses) than the table rows. So even if you needed to traverse all the values in an index, it would still be much faster than traversing the whole table, as the table usually has a lot more columns than the index, so its records are much larger to read and parse one by one.

As I already said, an index is traversed in a very optimized way, usually using some sort of tree-like data structure, and once the desired entry is found, it holds a "pointer" (usually a record number used internally by the database) to the record it refers to.

So to fetch the corresponding record, the database can "seek" (an operation that skips records without reading them from storage) very fast, right to the record it is looking for.

Less diverse values end up producing large buckets at the bottom of the index tree, as a leaf node will "point" to a bunch of records instead of a single one. That is not a big problem: even on a column which only allows 1 or 0 as values, if the proportion is about 50/50, it is still much better to retrieve 50% of the records through the index than to manually check 100% of them.

Hope this is clearer now =)

@rodrigopedra (Author)

The other point is around the downtime aspect. I am indeed running this import script daily. I do exactly as you did in your sample and drop the table, then create it again before the fresh import. My concern is that I'll have live requests coming into the application which rely on this data as a lookup. So if I continue with this setup, there will be lookups happening on this table as it's dropped, created, and re-populated. I'm wondering if I should instead build the data in a my_table_name_temp and then, once it's complete, drop my_table_name and rename my_table_name_temp to my_table_name. Again, I would welcome your thoughts on this.

You can:

  • take the app down while updating; for example, Laravel has a maintenance mode specifically for this use-case, which returns a 503 response to any request while your app is down for maintenance.
  • Or use a strategy like the one you described: drop the old table and rename the new one.

When I dealt with large databases, I usually used a mix of both. I would ingest data into a temporary table (actually a real table created with the same data structure as the target table), and once the data was ingested and verified, I would drop the old table and rename the new one (a sketch of the simple swap, without foreign keys, follows the list below).

Note that if you have foreign keys referencing the table being replaced, you will need to:

  1. disable foreign key checks on the database
  2. drop the foreign keys
  3. drop the old table
  4. rename the new one
  5. recreate the foreign keys
  6. re-enable foreign key checks on the database.
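
For the simple case with no foreign keys, the swap you described can be done with a single atomic RENAME TABLE, roughly like this (a sketch using the table names from your question, not a drop-in script):

<?php

$pdo = new PDO('mysql:host=127.0.0.1;dbname=dummy;charset=utf8mb4', 'root', 'password', [
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
]);

// Assumes `my_table_name_temp` was just imported and verified.
$pdo->exec('DROP TABLE IF EXISTS `my_table_name_old`');

// RENAME TABLE swaps both names in one atomic statement,
// so readers never see the table missing.
$pdo->exec('RENAME TABLE `my_table_name` TO `my_table_name_old`, `my_table_name_temp` TO `my_table_name`');

$pdo->exec('DROP TABLE `my_table_name_old`');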

There are other strategies you can use if you can't afford any downtime:

  • You can create a replica database, update the replica, and then switch the app's connection to the replica.

Another one is using transactions, which will lock your app for a while, but both the database and the app stay alive; requests are just delayed. This is very similar to the drop/rename strategy above (a sketch follows the list below):

  1. Ingest the new data into a temporary table
  2. disable foreign key checks on the database
  3. start a transaction
  4. delete all data in the target table (yes, I am talking about a DELETE without a WHERE)
  5. insert the data from the temporary table into the target table (use INSERT with SELECT for this)
  6. commit the transaction
  7. re-enable foreign key checks on the database
  8. drop the temporary table we used to ingest the new data
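
A rough sketch of that sequence (table names hypothetical, using MySQL's session-level foreign_key_checks setting):

<?php

$pdo = new PDO('mysql:host=127.0.0.1;dbname=dummy;charset=utf8mb4', 'root', 'password', [
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
]);

// Step 1, ingesting into `my_table_name_temp`, is assumed to have happened already.
$pdo->exec('SET foreign_key_checks = 0');

try {
    $pdo->beginTransaction();
    $pdo->exec('DELETE FROM `my_table_name`'); // yes, a DELETE without a WHERE
    $pdo->exec('INSERT INTO `my_table_name` SELECT * FROM `my_table_name_temp`');
    $pdo->commit();
} catch (Throwable $exception) {
    $pdo->rollBack();
    throw $exception;
}

$pdo->exec('SET foreign_key_checks = 1');
$pdo->exec('DROP TABLE `my_table_name_temp`');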

The main difference here is the transaction usage, which helps guarantee the database stays in a good state once the data is committed.

This approach has some downsides:

  • You need sufficient disk space, as you are dealing with a big dataset and will basically hold three copies of the data while the transaction is running.
    • Aside from the two tables, database servers copy the data altered inside a transaction to some sort of "undo" storage, in case you need to roll back the transaction instead of committing it.
  • Reads on the database are usually fine while the transaction is running, but writes that depend on the data inside the transaction will be queued up, which could lead to some timeouts.

Hope these ideas help =)

@rodrigopedra (Author)

@jamesmills a final disclaimer:

Although I spent some years (about 7) dealing with large datasets, I worked more on the ETL and analysis side of the job than on database management.

Of course, over the years you learn a trick or two from dealing with the same problem set over and over. But the options I gave above, especially in the second comment on your additional questions, might not be the best solution.

Those are strategies I would use, based on my experience, but if my team had a database management expert, I would talk to them first to assess whether this is the best strategy.

So if anyone else reading this knows better, please share in the comments how you would handle these. Thanks =)
