@hamletbatista
Created July 25, 2020 23:26
from urllib.parse import urlparse
import re

url = "https://www.amazon.com/SanDisk-128GB-microSDXC-Memory-Adapter/dp/B073JYC4XM/"
# Split the URL path on "/" and "-" and keep the unique tokens
print(set(re.split("[/-]", urlparse(url).path)))
# output (sets are unordered, so the order may differ between runs)
# {'', 'B073JYC4XM', 'dp', '128GB', 'microSDXC', 'Memory', 'SanDisk', 'Adapter'}
@eliasdabbas

My two cents if I may :)

Different directories usually carry some context about the URL. If the first directory were blog, for example, it would mean something different than if blog appeared in the last directory (as part of the title, for example).
Query parameters also contain very useful information when present.
And so... the url_to_df function was born:

urls = [
    'https://www.amazon.com/Fire-HD-8-Previous-Generation-9th/dp/B077H3HJJM/ref=dp-dss_2/130-8721979-2336513?_encoding=UTF8&pd_rd_i=B077H3HJJM&pd_rd_r=e3c9e34e-af99-438b-b0de-b9837686e2aa&pd_rd_w=GMjko&pd_rd_wg=z0GSp&pf_rd_p=0a1acd2a-fc19-49b9-9594-20e4fe9fe8f9&pf_rd_r=52VPPZD73X01S6VZARSG&psc=1&refRID=52VPPZD73X01S6VZARSG',
    'https://www.amazon.com/Staging-Product-Not-Retail-Sale/dp/B07JX8JTB2/ref=dp-dss_3/130-8721979-2336513?_encoding=UTF8&pd_rd_i=B07JX8JTB2&pd_rd_r=e3c9e34e-af99-438b-b0de-b9837686e2aa&pd_rd_w=GMjko&pd_rd_wg=z0GSp&pf_rd_p=0a1acd2a-fc19-49b9-9594-20e4fe9fe8f9&pf_rd_r=52VPPZD73X01S6VZARSG&psc=1&refRID=52VPPZD73X01S6VZARSG',
    'https://www.amazon.com/Fire-HD-10/dp/B07K1RZWMC/ref=dp-dss_4/130-8721979-2336513?_encoding=UTF8&pd_rd_i=B07K1RZWMC&pd_rd_r=e3c9e34e-af99-438b-b0de-b9837686e2aa&pd_rd_w=GMjko&pd_rd_wg=z0GSp&pf_rd_p=0a1acd2a-fc19-49b9-9594-20e4fe9fe8f9&pf_rd_r=52VPPZD73X01S6VZARSG&psc=1&refRID=52VPPZD73X01S6VZARSG'
]

import advertools as adv

adv.url_to_df(urls)
|  | url | scheme | netloc | path | query | fragment | dir_1 | dir_2 | dir_3 | dir_4 | dir_5 | query__encoding | query_pd_rd_i | query_pd_rd_r | query_pd_rd_w | query_pd_rd_wg | query_pf_rd_p | query_pf_rd_r | query_psc | query_refRID |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | https://www.amazon.com/Fire-HD-8-Previous-Generation-9th/dp/B077H3HJJM/ref=dp-dss_2/130-8721979-2336513?_encoding=UTF8&pd_rd_i=B077H3HJJM&pd_rd_r=e3c9e34e-af99-438b-b0de-b9837686e2aa&pd_rd_w=GMjko&pd_rd_wg=z0GSp&pf_rd_p=0a1acd2a-fc19-49b9-9594-20e4fe9fe8f9&pf_rd_r=52VPPZD73X01S6VZARSG&psc=1&refRID=52VPPZD73X01S6VZARSG | https | www.amazon.com | /Fire-HD-8-Previous-Generation-9th/dp/B077H3HJJM/ref=dp-dss_2/130-8721979-2336513 | _encoding=UTF8&pd_rd_i=B077H3HJJM&pd_rd_r=e3c9e34e-af99-438b-b0de-b9837686e2aa&pd_rd_w=GMjko&pd_rd_wg=z0GSp&pf_rd_p=0a1acd2a-fc19-49b9-9594-20e4fe9fe8f9&pf_rd_r=52VPPZD73X01S6VZARSG&psc=1&refRID=52VPPZD73X01S6VZARSG |  | Fire-HD-8-Previous-Generation-9th | dp | B077H3HJJM | ref=dp-dss_2 | 130-8721979-2336513 | UTF8 | B077H3HJJM | e3c9e34e-af99-438b-b0de-b9837686e2aa | GMjko | z0GSp | 0a1acd2a-fc19-49b9-9594-20e4fe9fe8f9 | 52VPPZD73X01S6VZARSG | 1 | 52VPPZD73X01S6VZARSG |
| 1 | https://www.amazon.com/Staging-Product-Not-Retail-Sale/dp/B07JX8JTB2/ref=dp-dss_3/130-8721979-2336513?_encoding=UTF8&pd_rd_i=B07JX8JTB2&pd_rd_r=e3c9e34e-af99-438b-b0de-b9837686e2aa&pd_rd_w=GMjko&pd_rd_wg=z0GSp&pf_rd_p=0a1acd2a-fc19-49b9-9594-20e4fe9fe8f9&pf_rd_r=52VPPZD73X01S6VZARSG&psc=1&refRID=52VPPZD73X01S6VZARSG | https | www.amazon.com | /Staging-Product-Not-Retail-Sale/dp/B07JX8JTB2/ref=dp-dss_3/130-8721979-2336513 | _encoding=UTF8&pd_rd_i=B07JX8JTB2&pd_rd_r=e3c9e34e-af99-438b-b0de-b9837686e2aa&pd_rd_w=GMjko&pd_rd_wg=z0GSp&pf_rd_p=0a1acd2a-fc19-49b9-9594-20e4fe9fe8f9&pf_rd_r=52VPPZD73X01S6VZARSG&psc=1&refRID=52VPPZD73X01S6VZARSG |  | Staging-Product-Not-Retail-Sale | dp | B07JX8JTB2 | ref=dp-dss_3 | 130-8721979-2336513 | UTF8 | B07JX8JTB2 | e3c9e34e-af99-438b-b0de-b9837686e2aa | GMjko | z0GSp | 0a1acd2a-fc19-49b9-9594-20e4fe9fe8f9 | 52VPPZD73X01S6VZARSG | 1 | 52VPPZD73X01S6VZARSG |
| 2 | https://www.amazon.com/Fire-HD-10/dp/B07K1RZWMC/ref=dp-dss_4/130-8721979-2336513?_encoding=UTF8&pd_rd_i=B07K1RZWMC&pd_rd_r=e3c9e34e-af99-438b-b0de-b9837686e2aa&pd_rd_w=GMjko&pd_rd_wg=z0GSp&pf_rd_p=0a1acd2a-fc19-49b9-9594-20e4fe9fe8f9&pf_rd_r=52VPPZD73X01S6VZARSG&psc=1&refRID=52VPPZD73X01S6VZARSG | https | www.amazon.com | /Fire-HD-10/dp/B07K1RZWMC/ref=dp-dss_4/130-8721979-2336513 | _encoding=UTF8&pd_rd_i=B07K1RZWMC&pd_rd_r=e3c9e34e-af99-438b-b0de-b9837686e2aa&pd_rd_w=GMjko&pd_rd_wg=z0GSp&pf_rd_p=0a1acd2a-fc19-49b9-9594-20e4fe9fe8f9&pf_rd_r=52VPPZD73X01S6VZARSG&psc=1&refRID=52VPPZD73X01S6VZARSG |  | Fire-HD-10 | dp | B07K1RZWMC | ref=dp-dss_4 | 130-8721979-2336513 | UTF8 | B07K1RZWMC | e3c9e34e-af99-438b-b0de-b9837686e2aa | GMjko | z0GSp | 0a1acd2a-fc19-49b9-9594-20e4fe9fe8f9 | 52VPPZD73X01S6VZARSG | 1 | 52VPPZD73X01S6VZARSG |
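
To make the idea concrete without installing anything, here is a rough standard-library sketch of the same kind of split: positional directories plus one entry per query parameter. It is only an illustration and not how advertools implements url_to_df. The helper name split_url is hypothetical, and the shortened query string in the usage line is just for brevity.

```python
from urllib.parse import urlparse, parse_qs


def split_url(url):
    """Hypothetical helper: split one URL into positional dirs and query params."""
    parsed = urlparse(url)
    row = {"scheme": parsed.scheme, "netloc": parsed.netloc, "path": parsed.path}

    # Keep non-empty path segments in order, so each position keeps its meaning
    # (dir_1 is the first directory, dir_2 the second, and so on).
    dirs = [segment for segment in parsed.path.split("/") if segment]
    for position, segment in enumerate(dirs, start=1):
        row[f"dir_{position}"] = segment

    # parse_qs maps each parameter name to a list of values;
    # repeated parameters are joined into one string here (an arbitrary choice).
    for name, values in parse_qs(parsed.query).items():
        row[f"query_{name}"] = ", ".join(values)

    return row


row = split_url(
    "https://www.amazon.com/Fire-HD-10/dp/B07K1RZWMC/ref=dp-dss_4/"
    "130-8721979-2336513?_encoding=UTF8&psc=1"
)
print(row["dir_1"], row["dir_3"], row["query_psc"])
```

A list of such dictionaries could be passed to pandas.DataFrame to get one row per URL, with missing directories or parameters left empty.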

What do you think?
