Skip to content

Instantly share code, notes, and snippets.

@sebres
Last active July 29, 2022 17:26
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save sebres/d207ec9b973d9add5402e9b0ab2205df to your computer and use it in GitHub Desktop.
Save sebres/d207ec9b973d9add5402e9b0ab2205df to your computer and use it in GitHub Desktop.
PoC -- NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL (mariadb vs. sqlite3)
proc prepare {type args} {
set sp "CREATE TEMPORARY TABLE pages ( type VARCHAR(20), id VARCHAR(10), content varchar(50), PRIMARY KEY (type, id) )"
set spd "CREATE TEMPORARY TABLE pages_done ( type VARCHAR(20), id VARCHAR(10), PRIMARY KEY (type, id) )"
switch -- $type {
mysql {
package require tdbc::mysql
append sp " ENGINE = MEMORY"; append spd " ENGINE = MEMORY";
tdbc::mysql::connection create db {*}$args
}
sqlite3 {
package require tdbc::sqlite3
tdbc::sqlite3::connection create db {*}$args
}
default {
return -code error "NYI"
}
}
[[db prepare $sp] execute] close
[[db prepare $spd] execute] close
set sp [db prepare {insert into pages values(:t, :i, :c)}];
set spd [db prepare {insert into pages_done values(:t, :i)}];
set i 0; while {$i < 10000} {
incr i; set t "type$i"; set c "content $i";
[$sp execute] close; if {$i & 1} {[$spd execute] close}
}
$spd close; $sp close
db allrows {select (select count(*) from pages) pages, (select count(*) from pages_done) done}
}
proc test {} {
puts [format "%10s : %s" "JOIN + NULL" [timerate {db allrows {SELECT p.type, p.id, p.content FROM pages p LEFT JOIN pages_done pd ON p.type = pd.type AND p.id = pd.id WHERE pd.type IS NULL LIMIT 4999,1}}]]
puts [format "%10s : %s" "NOT IN " [timerate {db allrows {SELECT type, id, content FROM pages WHERE (type, id) NOT IN (SELECT type, id FROM pages_done) LIMIT 4999,1}}]]
puts [format "%10s : %s" "NOT EXISTS " [timerate {db allrows {SELECT type, id, content FROM pages p WHERE NOT EXISTS (SELECT 1 FROM pages_done pd WHERE (pd.type, pd.id) = (p.type, p.id)) LIMIT 4999,1}}]]
}
% source sql-not-in-test.tcl
-% prepare sqlite3 :memory:
+% prepare mysql -host localhost -user user -password $pwd -database mysql
{pages 10000 done 5000}
% test
-JOIN + NULL : 2861.97 µs/# 350 # 349.41 #/sec 1001.689 net-ms
-NOT IN : 938563.0 µs/# 2 # 1.065 #/sec 1877.126 net-ms
-NOT EXISTS : 2873.29 µs/# 349 # 348.03 #/sec 1002.779 net-ms
+JOIN + NULL : 4133.95 µs/# 242 # 241.90 #/sec 1000.417 net-ms
+NOT IN : 6523.18 µs/# 154 # 153.30 #/sec 1004.569 net-ms
+NOT EXISTS : 5645.48 µs/# 178 # 177.13 #/sec 1004.896 net-ms
% db close
@sebres
Copy link
Author

sebres commented Jul 29, 2022

Basically there is a large difference between IN and NOT IN:

  • select from where x IN (subselect) is fastest in sqlite3 (2ms on my test sets)
  • select from where x NOT IN (subselect) is slowest in sqlite3 (1000ms on my test sets, but not liner even, its big-O is nearly to O(n**2))

Algorithmic (if we exclude that the substatement can return nulls, and it can't because fields are not nullable) both intersections can be made in comparable times if first table has exactly N*2 of second table (both sets returned by IN and NOT IN statements are equal).
Anyway, there are several ways how one would implement SCAN TABLE USING FOR IN-OPERATOR in NOT case, just I don't think the quadratic O is the right way in that case (unrelated to estimated size of subsets or intersection result).

Here is an example illustrating the issue:

sqlite3 db :memory:
db eval "CREATE TABLE pages ( type TEXT, id TEXT, content TEXT, PRIMARY KEY (type, id) ); CREATE TABLE pages_done ( type TEXT, id TEXT, PRIMARY KEY (type, id) );"
set i 0; while {$i < 10000} {incr i; set t "type$i"; set c "content $i"; db eval {insert into pages values(:t, :i, :c)}; if {$i & 1} {db eval {insert into pages_done values(:t, :i)}}}
puts -nonewline "Row counts: "; db eval {select (select count(*) from pages) pages, (select count(*) from pages_done) done} values {puts "pages: $values(pages), done: $values(done)"}
puts "IN:         [timerate {db eval {SELECT type, id, content FROM pages WHERE (type, id)     IN (SELECT type, id FROM pages_done) LIMIT 4999,1}}]"
puts "NOT IN:     [timerate {db eval {SELECT type, id, content FROM pages WHERE (type, id) NOT IN (SELECT type, id FROM pages_done) LIMIT 4999,1}}]"
puts "EXISTS:     [timerate {db eval {SELECT type, id, content FROM pages p WHERE     EXISTS (SELECT 1 FROM pages_done pd WHERE (pd.type, pd.id) = (p.type, p.id)) LIMIT 4999,1}}]"
puts "NOT EXISTS: [timerate {db eval {SELECT type, id, content FROM pages p WHERE NOT EXISTS (SELECT 1 FROM pages_done pd WHERE (pd.type, pd.id) = (p.type, p.id)) LIMIT 4999,1}}]"
 
# results:
 
# Row counts: pages: 10000, done: 5000
# IN:         1599.02 µs/# 626 # 625.38 #/sec 1000.988 net-ms
# NOT IN:     935942.5 µs/# 2 # 1.068 #/sec 1871.885 net-ms
# EXISTS:     2761.25 µs/# 363 # 362.15 #/sec 1002.335 net-ms
# NOT EXISTS: 2754.40 µs/# 364 # 363.06 #/sec 1002.602 net-ms

My assumption is that the issue with NOT IN going to handling of possible NULL in the subset, e. g. because of:

- % db eval "select 111 where 111 IN (111)       UNION ALL select 222 where 222 NOT IN (111)"
- 111 222
+ % db eval "select 111 where 111 IN (111, NULL) UNION ALL select 222 where 222 NOT IN (111, NULL)"
+ 111

but firstly there are no NULLs possible, secondly also in that case it'd feasible without quadratic O.

Also pretty strange is this:

timerate {db eval {SELECT type, id, content FROM pages WHERE type       NOT IN (SELECT type     FROM pages_done) LIMIT 4999,1}}
timerate {db eval {SELECT type, id, content FROM pages WHERE id         NOT IN (SELECT id       FROM pages_done) LIMIT 4999,1}}
timerate {db eval {SELECT type, id, content FROM pages WHERE (type, id) NOT IN (SELECT type, id FROM pages_done) LIMIT 4999,1}}

# results:

# 2002.34 µs/# 500 # 499.42 #/sec 1001.168 net-ms
# 2989.27 µs/# 335 # 334.53 #/sec 1001.407 net-ms
# 921954.5 µs/# 2 # 1.085 #/sec 1843.909 net-ms

All 3 queries are NOT IN (), all'd return the same sets (because affecting the same rows), all'd show the same execution plan (not checked but I strictly assume so) and only tuple-ed last statement:

  1. is really slow and
  2. showing quadratic big-O

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment