A function that takes a URL string and returns it with

- `https` replaced by `http` (all sites support http endpoints, only some support https);
- `www.` removed (all sites should support naked domains);
- ending slash `/` removed from path;
- lowercased domain (but not path);
- querystring parameters removed, except those listed in the `allowed_params` table, which expects records like the following:

              hostpath               |  param
-------------------------------------+----------
 news.ycombinator.com/user           | id
 news.ycombinator.com/threads        | id
 news.ycombinator.com/submitted      | id
 youtube.com/watch                   | v
 youtube.com/playlist                | list
 books.google.com                    | id
 books.google.com.br                 | id
 drive.google.com/folderview         | id
 cuapress.cua.edu/books/viewbook.cfm | book
 www.ucpress.edu/book.php            | isbn
 news.ycombinator.com/saved          | id
 news.ycombinator.com/saved          | comments

::DATABASE=> select normalize('https://www.baNANas.com/uVA/?utm=32');
       normalize
------------------------
 http://bananas.com/uVA
(1 row)

::DATABASE=> select normalize('https://news.ycombinator.com/user?id=fiatjaf');
                  normalize
---------------------------------------------
 http://news.ycombinator.com/user?id=fiatjaf
(1 row)
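
The rules above can be sketched outside the database to make their order explicit. The following is a rough Python illustration, not the function's actual source: the `ALLOWED_PARAMS` dictionary is a hypothetical in-memory stand-in for the `allowed_params` table, and the lookup key (host plus path, after `www.` and the ending slash are removed) is an assumption. Fragment handling is not described above, so the sketch leaves fragments untouched.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical stand-in for the allowed_params table:
# hostpath -> names of the querystring parameters to keep.
ALLOWED_PARAMS = {
    "news.ycombinator.com/user": {"id"},
    "news.ycombinator.com/saved": {"id", "comments"},
    "youtube.com/watch": {"v"},
}


def normalize(url: str) -> str:
    parts = urlsplit(url)

    # https is replaced by http; other schemes are left as they are.
    scheme = "http" if parts.scheme == "https" else parts.scheme

    # Lowercase the domain only, then remove a leading "www.".
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]

    # Remove the ending slash; the case of the path is preserved.
    path = parts.path.rstrip("/")

    # Keep only the parameters allowed for this host + path
    # (assumed lookup key; values are re-encoded by urlencode).
    allowed = ALLOWED_PARAMS.get(host + path, set())
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name in allowed
    ]

    # Fragment handling is an assumption: passed through unchanged.
    return urlunsplit((scheme, host, path, urlencode(kept), parts.fragment))


print(normalize("https://www.baNANas.com/uVA/?utm=32"))
# http://bananas.com/uVA
print(normalize("https://news.ycombinator.com/user?id=fiatjaf"))
# http://news.ycombinator.com/user?id=fiatjaf
```

With these assumptions the sketch reproduces the two results shown in the session above. A `hostpath` stored with a `www.` prefix would never match under this lookup order, so the real function may build its key differently.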