@bivald
Created May 22, 2023 11:41
Batch insert DuckDB

How I batch-read a Parquet file into DuckDB

On a machine with plenty of RAM, the following works:

con.sql("CREATE TABLE new_tbl AS SELECT * FROM read_parquet('file.parq')")

It uses about 20 GB of RAM or more, takes 130 s, and the resulting DuckDB file is 3.42 GB.
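For reference, a self-contained version of that one-shot load looks roughly like this (the database filename here is only an assumption, borrowed from the batch version below):

import duckdb

con = duckdb.connect('file-flights.db')  # assumed target database file
con.sql("CREATE TABLE new_tbl AS SELECT * FROM read_parquet('file.parq')")
con.close()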

Instead, I tried reading the Parquet file in batches and inserting them one at a time, keeping no more than necessary (i.e. one row group) in RAM:

import time

import duckdb
import pyarrow.parquet as pq

# Assumes new_tbl has already been created with the right schema and that
# the source is the same Parquet file as above.
existing_parquet_data = pq.ParquetFile('file.parq')

for i in range(existing_parquet_data.num_row_groups):
	table_df = existing_parquet_data.read_row_group(i)

	start = time.time()
	con = duckdb.connect('file-flights.db')
	con.register('table_df', table_df)  # register on this connection, not the module-level default
	con.execute("INSERT INTO new_tbl SELECT * FROM table_df")
	con.close()

	print(f"Read {i} took {time.time()-start}s")

Gives:

Read 0 took 1.7199020385742188s
Read 1 took 3.485285997390747s
Read 2 took 4.743414878845215s
Read 3 took 6.088403940200806s
Read 4 took 7.19077467918396s
Read 5 took 8.5175039768219s
Read 6 took 9.980829238891602s
Read 7 took 11.039855003356934s
Read 8 took 12.29391098022461s
Read 9 took 13.454188108444214s
Read 10 took 14.83984899520874s
Read 11 took 16.1385657787323s
Read 12 took 17.594982862472534s
Read 13 took 3.2629759311676025s
Read 14 took 4.532358169555664s
Read 15 took 5.8218700885772705s
Read 16 took 7.133969068527222s
Read 17 took 8.24370789527893s
Read 18 took 9.627621173858643s
Read 19 took 10.74335503578186s
Read 20 took 11.889867067337036s
Read 21 took 13.191499710083008s
Read 22 took 14.493306875228882s
Read 23 took 15.656914949417114s
Read 24 took 17.20168399810791s
Read 25 took 3.587995767593384s
Read 26 took 5.477898836135864s
Read 27 took 7.144642114639282s
Read 28 took 9.056833982467651s
Read 29 took 10.802035808563232s
Read 30 took 12.37866497039795s
Read 31 took 14.336489200592041s
Read 32 took 16.111725091934204s
Read 33 took 16.402412176132202s

This uses about 2.5 GB of RAM but takes 300 s, and the resulting database is 4.97 GB.

Each row group is 10 000 rows, which I know is probably a little low (but there are 4 000 columns).
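One variation I have been considering is inserting several row groups per statement, to amortize the per-insert overhead. A rough, untested sketch (it assumes new_tbl already exists, and groups_per_insert is just a made-up tuning knob):

import duckdb
import pyarrow.parquet as pq

groups_per_insert = 10
pf = pq.ParquetFile('file.parq')
con = duckdb.connect('file-flights.db')

for start_group in range(0, pf.num_row_groups, groups_per_insert):
	groups = list(range(start_group, min(start_group + groups_per_insert, pf.num_row_groups)))
	chunk = pf.read_row_groups(groups)  # one Arrow table spanning several row groups
	con.register('chunk', chunk)
	con.execute("INSERT INTO new_tbl SELECT * FROM chunk")
	con.unregister('chunk')

con.close()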

Is there any better way for me to do this? And, perhaps more importantly, will the second approach be slower to query given that it is larger on disk?
