@alirezamika
Last active March 19, 2024 15:33
AutoScraper Examples

Grouping results and removing unwanted ones

Here we want to scrape the product name, price and rating from eBay product pages:

from autoscraper import AutoScraper

url = 'https://www.ebay.com/itm/Sony-PlayStation-4-PS4-Pro-1TB-4K-Console-Black/203084236670'

wanted_list = ['Sony PlayStation 4 PS4 Pro 1TB 4K Console - Black', 'US $349.99', '4.8']

scraper = AutoScraper()
scraper.build(url, wanted_list)

The items we wanted appear in multiple sections of the page, and the scraper tries to catch them all, so it may retrieve some extra information beyond what we have in mind. Let's run it on a different page:

scraper.get_result_exact('https://www.ebay.com/itm/Acer-Predator-Helios-300-15-6-144Hz-FHD-Laptop-i7-9750H-16GB-512GB-GTX-1660-Ti/114183725523') 

The result:

[
    "Acer Predator Helios 300 15.6'' 144Hz FHD Laptop i7-9750H 16GB 512GB GTX 1660 Ti",
    'ACER Predator Helios 300 i7-9750H 15.6" 144Hz FHD GTX 1660Ti 16GB 512GB SSD⚡RGB',
    'US $1,229.49',
    '5.0'
]

As we can see, we have one extra item here. We can run the get_result_exact or get_result_similar method with the grouped=True parameter. It will group all results by their scraping rules:

scraper.get_result_exact('https://www.ebay.com/itm/Acer-Predator-Helios-300-15-6-144Hz-FHD-Laptop-i7-9750H-16GB-512GB-GTX-1660-Ti/114183725523', grouped=True) 

Output:

{
    'rule_sks3': ["Acer Predator Helios 300 15.6'' 144Hz FHD Laptop i7-9750H 16GB 512GB GTX 1660 Ti"],
    'rule_d4n5': ['ACER Predator Helios 300 i7-9750H 15.6" 144Hz FHD GTX 1660Ti 16GB 512GB SSD⚡RGB'],
    'rule_fmrm': ['ACER Predator Helios 300 i7-9750H 15.6" 144Hz FHD GTX 1660Ti 16GB 512GB SSD⚡RGB'],
    'rule_2ydq': ['US $1,229.49'],
    'rule_buhw': ['5.0'],
    'rule_vpfp': ['5.0']
}

Now we can use keep_rules or remove_rules methods to prune unwanted rules:

scraper.keep_rules(['rule_sks3', 'rule_2ydq', 'rule_buhw'])
 
scraper.get_result_exact('https://www.ebay.com/itm/Acer-Predator-Helios-300-15-6-144Hz-FHD-Laptop-i7-9750H-16GB-512GB-GTX-1660-Ti/114183725523') 

And now the result contains only the ones we want:

[
    "Acer Predator Helios 300 15.6'' 144Hz FHD Laptop i7-9750H 16GB 512GB GTX 1660 Ti",
    'US $1,229.49',
    '5.0'
]

Building a scraper to work with multiple websites with incremental learning

Suppose we want to make a price scraper that works with multiple websites. Here we consider ebay.com, walmart.com and etsy.com. We create some sample data for each website and then feed it to the scraper. By passing the update=True parameter when calling the build method, all previously learned rules are kept and the new rules are added to them:

from autoscraper import AutoScraper

data = [
   # some Ebay examples
   ('https://www.ebay.com/itm/Sony-PlayStation-4-PS4-Pro-1TB-4K-Console-Black/193632846009', ['US $349.99']),
   ('https://www.ebay.com/itm/Acer-Predator-Helios-300-15-6-FHD-Gaming-Laptop-i7-10750H-16GB-512GB-RTX-2060/303669272117', ['US $1,369.00']),
   ('https://www.ebay.com/itm/8-TAC-FORCE-SPRING-ASSISTED-FOLDING-STILETTO-TACTICAL-KNIFE-Blade-Pocket-Open/331625445801', ['US $8.95']),
   
   # some Walmart examples
   ('https://www.walmart.com/ip/8mm-Classic-Sterling-Silver-Plain-Wedding-Band-Ring/113651182', ['US $8.95']),
   ('https://www.walmart.com/ip/Apple-iPhone-11-64GB-Red-Fully-Unlocked-A-Grade-Refurbished/806414606', ['$659.99']),

   # some Etsy examples
   ('https://www.etsy.com/listing/805075149/starstruck-silk-face-mask-black-silk', ['$12.50+']),
   ('https://www.etsy.com/listing/851553172/apple-macbook-pro-i9-32gb-500gb-radeon', ['$1,500.00']),
]

scraper = AutoScraper()
for url, wanted_list in data:
   scraper.build(url=url, wanted_list=wanted_list, update=True)

Now hopefully the scraper has learned to scrape all 3 websites. Let's check some new pages:

>>> scraper.get_result_exact('https://www.ebay.com/itm/PUMA-Mens-Turino-Sneakers/274324387149')

['US $24.99', "PUMA Men's Turino Sneakers  | eBay"]


>>> scraper.get_result_exact('https://www.walmart.com/ip/Pack-of-8-Gerber-1st-Foods-Baby-Food-Peach-2-2-oz-Tubs/267133209')

['$8.71', '(Pack of 8) Gerber 1st Foods Baby Food, Peach, 2-2 oz Tubs - Walmart.com']


>>> scraper.get_result_exact('https://www.etsy.com/listing/863615551/matte-black-smart-wireless-bluetooth')

['$60.00']

Almost done! But there's some extra info; let's fix it:

>>> scraper.get_result_exact('https://www.walmart.com/ip/Pack-of-8-Gerber-1st-Foods-Baby-Food-Peach-2-2-oz-Tubs/267133209', grouped=True)

{'rule_cqhs': [],
 'rule_h4sy': [],
 'rule_jqtb': [],
 'rule_r9qd': ['$8.71'],
 'rule_6lt7': ['$8.71'],
 'rule_2nrk': ['$8.71'],
 'rule_wy9j': ['$8.71'],
 'rule_v395': [],
 'rule_4ej6': ['(Pack of 8) Gerber 1st Foods Baby Food, Peach, 2-2 oz Tubs - Walmart.com']}


>>> scraper.remove_rules(['rule_4ej6'])
>>> scraper.get_result_exact('https://www.ebay.com/itm/PUMA-Mens-Turino-Sneakers/274324387149')

['US $24.99']


>>> scraper.get_result_exact('https://www.walmart.com/ip/Pack-of-8-Gerber-1st-Foods-Baby-Food-Peach-2-2-oz-Tubs/267133209')

['$8.71']


>>> scraper.get_result_exact('https://www.etsy.com/listing/863615551/matte-black-smart-wireless-bluetooth')

['$60.00']

Now we have a scraper that works with eBay, Walmart and Etsy!
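At this point it may be worth persisting the learned rules so the model can be reused without rebuilding. A minimal sketch using AutoScraper's own save and load methods (the file name is arbitrary):

# Persist the learned rules to a file.
scraper.save('price-scraper')

# Later, or in another process, restore them without rebuilding:
from autoscraper import AutoScraper
scraper = AutoScraper()
scraper.load('price-scraper')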

Fuzzy matching for html tag attributes

Some websites use different tag values on different pages (like different styles for the same element). In these cases you can adjust the attr_fuzz_ratio parameter when getting the results. See this issue for a sample usage.
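A minimal sketch of that parameter, assuming a model has already been built (the URL is a placeholder); values below 1.0 allow near-matches on attribute values:

# Fuzzy-match tag attributes at 90% similarity; URL is a placeholder.
result = scraper.get_result_similar('https://example.com/some-product', attr_fuzz_ratio=0.9)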

Using regular expressions

You can use regular expressions for wanted items:

import re

wanted_list = [re.compile('Lorem ipsum.+est laborum')]
@anoduck

anoduck commented Jun 7, 2022

Hi @alirezamika! I tried to scrape this page: https://u.gg/lol/top-lane-tier-list. But I got an empty list. What did I do wrong?

from autoscraper import AutoScraper

url = 'https://u.gg/lol/top-lane-tier-list'

wanted_list = ['Shen', '52.68%', '2.1%']

scraper = AutoScraper()

result = scraper.build(url, wanted_list=wanted_list)

print(result)

@nixonthe --> The page loads content via JavaScript, not plain HTML or PHP. Not only that, it loads the data from a different domain than the one hosting the page, which might have to do with XSS policy.

@Vponed

Vponed commented Jul 2, 2022

Thank you so much for the code. Please tell me, is it possible to use the model from other programming languages?
Or, failing that, to somehow extract the underlying data request itself?

@Abdul-Hannan96

How can we apply autoscraper to multiple pages?
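One hedged way to do this, reusing a scraper already built as in the examples above: run the get methods over each page URL in a loop (the URLs here are placeholders):

# Reuse one built scraper across many pages; URLs are placeholders.
results = {}
for url in ['https://example.com/page/1', 'https://example.com/page/2']:
    results[url] = scraper.get_result_similar(url)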

@JettScythe

What about pages that use JS to load content? Does it work for those too?

@natzar Nope. You will need to use some kind of library (like requests-html or Selenium) to render the content and pass it to the builder.
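A minimal sketch of that workaround, assuming Selenium and Chrome are installed; build() accepts raw HTML via its html parameter, so the rendered page source can be passed in instead of a URL:

from selenium import webdriver
from autoscraper import AutoScraper

# Render the JS-driven page in a real browser first.
driver = webdriver.Chrome()
driver.get('https://u.gg/lol/top-lane-tier-list')
html = driver.page_source
driver.quit()

# Hand the rendered HTML to the builder instead of a URL.
scraper = AutoScraper()
result = scraper.build(html=html, wanted_list=['Shen'])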

@meirpertzz

meirpertzz commented Jan 28, 2023

Hi everyone, I have started using this tool and I love it!
I do have a question though. I am using it to run over product pages (using wanted_dict key-value pairs). On these product pages I sometimes don't have all the values: for example, not all pages have both x and y; sometimes there is just x. How can I teach the model about the pages with some missing values?

I would highly appreciate any suggestions.

Thank you very much
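A hedged idea based on the incremental-learning example above, not a confirmed answer: build with update=True on samples of both kinds of pages, so rules are also learned from pages where y is absent (URLs and values are placeholders):

# Placeholders throughout; the point is the update=True pattern shown earlier.
scraper.build(url='https://example.com/product-with-x-and-y',
              wanted_dict={'x': ['sample x'], 'y': ['sample y']}, update=True)
scraper.build(url='https://example.com/product-with-only-x',
              wanted_dict={'x': ['sample x']}, update=True)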

@RishabAgr

The text_fuzz_ratio parameter in the build method seems to be causing errors in the code.
I wrote this code in order to extract items, and it returns a list:

scraper = AutoScraper()
result = scraper.build(url=url, wanted_list=sample_item)

However, when I add the text_fuzz_ratio parameter to try and get a more general list:

scraper = AutoScraper()
result = scraper.build(url=url, wanted_list=sample_item, text_fuzz_ratio=0.9)

it raises a TypeError (screenshot omitted).

Thoughts?

@debrupf2946

Hi, can someone help with how to save the scraped data into a CSV file?
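A hedged sketch of one way to do this with the standard library, assuming a scraper has already been built; grouped=True returns a dict of lists as shown earlier in the gist, and the URL and column pairing are illustrative only:

import csv

# {rule_id: [values, ...]}, as in the grouped examples above; URL is a placeholder.
result = scraper.get_result_exact('https://example.com/product', grouped=True)

with open('results.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(result.keys())        # one column per rule
    for row in zip(*result.values()):     # pairs values positionally
        writer.writerow(row)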

@rhythm-04

Hi @alirezamika, I want to know: can we send keys or click on a link on a website using autoscraper?

@furkannkilicc

Hi, when I tried to get the list of product names and prices, there were 2 problems:
1 - a warning about bulk data => I solved it by zipping the lists in a for loop
2 - it does not return all the data, just 36 items (I don't know why or how to solve it)
Could you please help me?

@akoredenitan

Hi @alirezamika ,

I tried the code here and maybe I am doing something wrong, but I couldn't get it to work as I had expected.

Firstly, I am relatively new to web scraping and saw this while working on another project.

I would like to fetch information from a table on a webpage. When I specify the model of the CPU in my wanted list, I get an empty array returned most times, except when I use the CPU name, which then returns only 2 results.

from autoscraper import AutoScraper

cpuUrl = 'https://www.techpowerup.com/cpu-specs'
gpuUrl = "https://www.techpowerup.com/gpu-specs/?mobile=No&workstation=No&sort=name"

# We can add one or multiple candidates here.
# You can also put urls here to retrieve urls.
cpu_wanted_list = {"Manufacturer":["AMD","Intel"],"Release Date":['2020','2021','2022','2023']}
#gpu_wanted_list = ["AMD","Intel"]
gpu_wanted_list = ["Nvidia","Geforce","AMD", "Radeon"]

scraper = AutoScraper()
cpuResult = scraper.build(cpuUrl, cpu_wanted_list)
gpuResult = scraper.build(gpuUrl, gpu_wanted_list)

print("Printing CPU relsult:")
print(cpuResult)

relatedCPUResult = scraper.get_result_similar('https://www.techpowerup.com/cpu-specs/?mobile=No&server=No&sort=name')
print(relatedCPUResult)

I am going to https://www.techpowerup.com/cpu-specs/?mobile=No&server=No&sort=name and trying to fetch the table items as my result.

When I do get a result, it seems to be fetching values from the Refine Search parameters instead of the actual results:

Printing CPU result:
['Manufacturer', 'Release Date', 'Mobile', 'Server', 'TDP', 'Cores', 'Threads', 'Generation', 'Socket', 'Codename', 'Process', 'Multiplier Unlocked', 'Integrated Graphics']
['AMD', 'Intel', '2022', 'No', '9 W', '2', 'Intel Pentium', 'AMD Socket 939', 'Alder Lake-S', '7 nm', 'Yes']

I appended my code above in case I am missing something, and thanks in advance.

@anoduck

anoduck commented Oct 10, 2023

OK, I was hoping I could figure this out, but it is 7am and I have been up all night... so... burning spent fuel here.

Autoscraper saved me SO much time, it is ridiculous, but there is a little hitch. All of my data was returned in one huge list. Items are grouped together, but the number of items differs from type to type, and they are not matched with the associated data.

(example data generated from Faker module)

In other words, the results look like:

['Tiffany Williams',
 'Kimberly Ramirez',
 'Marissa Wilson',
 'David King',
 'Jasmine Wilson',
 'Rebecca Swanson',
 ...
'511-538-9955x9371',
 '582-816-1125x878',
 '343.352.1379x820',
 '805.755.6352x44545',
 '001-541-393-9153x0600',
 '362-438-7059x3506',
 '802.484.3879x623',
 '+1-280-463-9311x9452',
 '+1-436-455-5647x6468',
...
'9385 Sean Courts Suite 395\nLopezborough, GU 46379',
 '338 Andrea Locks Suite 075\nMontgomerytown, OH 92933',
 '075 Barnett Walks\nNorth Tannerview, NH 64984',
 '00146 Newton Expressway\nSarahfort, MS 62136',
 '0111 Porter Curve Apt. 986\nFosterstad, KS 27560',
 '0668 Douglas Harbor\nWest Amyport, PW 83959',
 '317 Theresa Run\nNorth Angelafurt, MO 32844',
 '46803 Mueller Parks Suite 903\nPort Patrickmouth, MI 09350',
 '8474 Kimberly Point Suite 958\nPhamfort, MN 21067',
 '59573 William Light Suite 476\nSouth Dylan, DC 73663',
 'USCGC Wilson\nFPO AE 95307',
 '49219 Mcconnell Ranch\nNorth Robertport, UT 17995',
 '1369 Jeffrey Island\nCatherinemouth, MO 90968',
 '91483 Petersen Flats Apt. 265\nSilvaland, CO 46272']

Rather than:

                 Name                  Number                                            Address
0    Tiffany Williams       511-538-9955x9371  9385 Sean Courts Suite 395\nLopezborough, GU 4...
1    Kimberly Ramirez        582-816-1125x878  338 Andrea Locks Suite 075\nMontgomerytown, OH...
2      Marissa Wilson        343.352.1379x820      075 Barnett Walks\nNorth Tannerview, NH 64984
3          David King      805.755.6352x44545                                                Nan
4      Jasmine Wilson   001-541-393-9153x0600   0111 Porter Curve Apt. 986\nFosterstad, KS 27560
5     Rebecca Swanson       362-438-7059x3506        0668 Douglas Harbor\nWest Amyport, PW 83959
6    Christina Potter        802.484.3879x623        317 Theresa Run\nNorth Angelafurt, MO 32844
7         James Eaton    +1-280-463-9311x9452  46803 Mueller Parks Suite 903\nPort Patrickmou...
8      Laura Gonzalez    +1-436-455-5647x6468  8474 Kimberly Point Suite 958\nPhamfort, MN 21067
9     Rebecca Freeman       604.974.6647x2368  59573 William Light Suite 476\nSouth Dylan, DC...
10   Mr. William Lara     (848)389-3506x26756                                                Nan
11       Daniel Avila   001-490-540-8510x3636  49219 Mcconnell Ranch\nNorth Robertport, UT 17995
12       Andrew Price       886.582.9972x2800      1369 Jeffrey Island\nCatherinemouth, MO 90968
13       Joseph Smith         +1-388-226-7496  91483 Petersen Flats Apt. 265\nSilvaland, CO 4...
14     James Anderson        433-526-5687x642                                                Nan
15       Brandon Tate              6705266223  5467 Logan Terrace Apt. 127\nMichaelberg, PW 6...
16     Travis Wallace  001-767-613-1216x64547  99422 Justin Ramp Apt. 203\nNew Johnmouth, FL ...
17      Michelle Wong        001-249-343-4216  9324 Meghan Trail Apt. 103\nPhillipburgh, AK 4...
18          Jean Lowe              8703452366  0053 Dale Plains Suite 173\nEast Deniseburgh, ...
19     Melinda Tucker    001-649-372-1670x229          3281 Sarah Points\nPort Richard, PW 02531

@alirezamika
Author

Please provide your code.

@anoduck

anoduck commented Oct 15, 2023

@alirezamika
Really just straightforward from the examples.

from autoscraper import AutoScraper

scraper = AutoScraper()

url = 'https://justia.com/lawyers/civil-rights/california/los-angeles'

wants = ['Jonathon Howard Kaplan', '(213) 553-4550', '355 S. Grand Ave. Suite 2450 Los Angeles, CA 90071', 'Civil Rights and Employment', 'Duke University School of Law and Duke University Law School']

try:
    results = scraper.build(url, wants)
except Exception as exc:
    print(exc)

As previously mentioned, this returns one long list, e.g. [[all names] + [all phone numbers] + [all addresses]]. It isn't such a big issue, because the list could be broken apart using list.index() and slicing, except that the length of each categorical list differs, e.g. len(name_list) = 39 while len(phone_list) = 27, etc. Thus, without knowing exactly which categorical item went with which name, programmatically reassembling the original dataset appears impossible.

I even attempted to break the scraping process down into individual parts, but as with before the categorical lists varied.

It was not until much later that I discovered the wanted_dict variable. I was just unsure how to structure the dict in order for autoscraper to accept it.
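For reference, a minimal sketch of the wanted_dict structure, based on this gist's other examples: keys act as aliases for the rules learned from each sample value, and group_by_alias keys the results by those aliases:

from autoscraper import AutoScraper

url = 'https://justia.com/lawyers/civil-rights/california/los-angeles'
wanted_dict = {
    'name': ['Jonathon Howard Kaplan'],
    'phone': ['(213) 553-4550'],
    'address': ['355 S. Grand Ave. Suite 2450 Los Angeles, CA 90071'],
}

scraper = AutoScraper()
scraper.build(url=url, wanted_dict=wanted_dict)

# Results keyed by 'name', 'phone' and 'address' instead of one flat list.
result = scraper.get_result_similar(url, group_by_alias=True)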

@karrtikiyer-tw

Is it possible to define the crawl depth? One URL might contain other URLs; if we give the parent one, can it crawl all the child ones along with the content present on the parent?
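AutoScraper has no built-in crawler, but since wanted items can themselves be URLs (as the techpowerup comment above notes), one hedged approach is to learn a rule for the child links and loop over them one level deep; everything below is hypothetical:

from autoscraper import AutoScraper

# Hypothetical URLs: learn a rule that captures child links on the parent page.
link_scraper = AutoScraper()
link_scraper.build('https://example.com/catalog', wanted_list=['https://example.com/item/1'])

# A second scraper for the content on each child page.
content_scraper = AutoScraper()
content_scraper.build('https://example.com/item/1', wanted_list=['$9.99'])

for child_url in link_scraper.get_result_similar('https://example.com/catalog'):
    print(child_url, content_scraper.get_result_exact(child_url))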
