Last active
July 29, 2022 17:26
-
-
Save sebres/d207ec9b973d9add5402e9b0ab2205df to your computer and use it in GitHub Desktop.
PoC -- NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL (mariadb vs. sqlite3)
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
proc prepare {type args} { | |
set sp "CREATE TEMPORARY TABLE pages ( type VARCHAR(20), id VARCHAR(10), content varchar(50), PRIMARY KEY (type, id) )" | |
set spd "CREATE TEMPORARY TABLE pages_done ( type VARCHAR(20), id VARCHAR(10), PRIMARY KEY (type, id) )" | |
switch -- $type { | |
mysql { | |
package require tdbc::mysql | |
append sp " ENGINE = MEMORY"; append spd " ENGINE = MEMORY"; | |
tdbc::mysql::connection create db {*}$args | |
} | |
sqlite3 { | |
package require tdbc::sqlite3 | |
tdbc::sqlite3::connection create db {*}$args | |
} | |
default { | |
return -code error "NYI" | |
} | |
} | |
[[db prepare $sp] execute] close | |
[[db prepare $spd] execute] close | |
set sp [db prepare {insert into pages values(:t, :i, :c)}]; | |
set spd [db prepare {insert into pages_done values(:t, :i)}]; | |
set i 0; while {$i < 10000} { | |
incr i; set t "type$i"; set c "content $i"; | |
[$sp execute] close; if {$i & 1} {[$spd execute] close} | |
} | |
$spd close; $sp close | |
db allrows {select (select count(*) from pages) pages, (select count(*) from pages_done) done} | |
} | |
proc test {} { | |
puts [format "%10s : %s" "JOIN + NULL" [timerate {db allrows {SELECT p.type, p.id, p.content FROM pages p LEFT JOIN pages_done pd ON p.type = pd.type AND p.id = pd.id WHERE pd.type IS NULL LIMIT 4999,1}}]] | |
puts [format "%10s : %s" "NOT IN " [timerate {db allrows {SELECT type, id, content FROM pages WHERE (type, id) NOT IN (SELECT type, id FROM pages_done) LIMIT 4999,1}}]] | |
puts [format "%10s : %s" "NOT EXISTS " [timerate {db allrows {SELECT type, id, content FROM pages p WHERE NOT EXISTS (SELECT 1 FROM pages_done pd WHERE (pd.type, pd.id) = (p.type, p.id)) LIMIT 4999,1}}]] | |
} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
% source sql-not-in-test.tcl | |
-% prepare sqlite3 :memory: | |
+% prepare mysql -host localhost -user user -password $pwd -database mysql | |
{pages 10000 done 5000} | |
% test | |
-JOIN + NULL : 2861.97 µs/# 350 # 349.41 #/sec 1001.689 net-ms | |
-NOT IN : 938563.0 µs/# 2 # 1.065 #/sec 1877.126 net-ms | |
-NOT EXISTS : 2873.29 µs/# 349 # 348.03 #/sec 1002.779 net-ms | |
+JOIN + NULL : 4133.95 µs/# 242 # 241.90 #/sec 1000.417 net-ms | |
+NOT IN : 6523.18 µs/# 154 # 153.30 #/sec 1004.569 net-ms | |
+NOT EXISTS : 5645.48 µs/# 178 # 177.13 #/sec 1004.896 net-ms | |
% db close |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Basically there is a large difference between IN and NOT IN:
select from where x IN (subselect)
is fastest in sqlite3 (2ms on my test sets)select from where x NOT IN (subselect)
is slowest in sqlite3 (1000ms on my test sets, but not liner even, its big-O is nearly toO(n**2)
)Algorithmic (if we exclude that the substatement can return nulls, and it can't because fields are not nullable) both intersections can be made in comparable times if first table has exactly N*2 of second table (both sets returned by IN and NOT IN statements are equal).
Anyway, there are several ways how one would implement
SCAN TABLE USING FOR IN-OPERATOR
in NOT case, just I don't think the quadratic O is the right way in that case (unrelated to estimated size of subsets or intersection result).Here is an example illustrating the issue:
My assumption is that the issue with NOT IN going to handling of possible NULL in the subset, e. g. because of:
but firstly there are no NULLs possible, secondly also in that case it'd feasible without quadratic O.
Also pretty strange is this:
All 3 queries are
NOT IN ()
, all'd return the same sets (because affecting the same rows), all'd show the same execution plan (not checked but I strictly assume so) and only tuple-ed last statement: