Last active November 7, 2024 13:56
AutoScraper Examples

Grouping results and removing unwanted ones

Here we want to scrape the product name, price and rating from eBay product pages:

from autoscraper import AutoScraper

url = ''

wanted_list = ['Sony PlayStation 4 PS4 Pro 1TB 4K Console - Black', 'US $349.99', '4.8']

scraper = AutoScraper()
scraper.build(url, wanted_list)

The items we wanted appear in multiple sections of the page, and the scraper tries to catch them all, so it may retrieve more information than we had in mind. Let's run it on a different page:

scraper.get_result_exact('')

The result:

[
    "Acer Predator Helios 300 15.6'' 144Hz FHD Laptop i7-9750H 16GB 512GB GTX 1660 Ti",
    'ACER Predator Helios 300 i7-9750H 15.6" 144Hz FHD GTX 1660Ti 16GB 512GB SSD⚡RGB',
    'US $1,229.49',
    '5.0'
]

As we can see, there is one extra item here. We can call the get_result_exact or get_result_similar method with the grouped=True parameter, which groups the results by the scraping rule that produced them:

scraper.get_result_exact('', grouped=True) 


{
    'rule_sks3': ["Acer Predator Helios 300 15.6'' 144Hz FHD Laptop i7-9750H 16GB 512GB GTX 1660 Ti"],
    'rule_d4n5': ['ACER Predator Helios 300 i7-9750H 15.6" 144Hz FHD GTX 1660Ti 16GB 512GB SSD⚡RGB'],
    'rule_fmrm': ['ACER Predator Helios 300 i7-9750H 15.6" 144Hz FHD GTX 1660Ti 16GB 512GB SSD⚡RGB'],
    'rule_2ydq': ['US $1,229.49'],
    'rule_buhw': ['5.0'],
    'rule_vpfp': ['5.0']
}

Now we can use the keep_rules or remove_rules method to prune unwanted rules:

scraper.keep_rules(['rule_sks3', 'rule_2ydq', 'rule_buhw'])

Now the result contains only the items we want:

[
    "Acer Predator Helios 300 15.6'' 144Hz FHD Laptop i7-9750H 16GB 512GB GTX 1660 Ti",
    'US $1,229.49',
    '5.0'
]
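
In the grouped output above, several rules return the same value (for example rule_d4n5 and rule_fmrm). A possible way to shortlist rule IDs before calling keep_rules is to keep only the first rule for each distinct result. The helper below is a hypothetical sketch, not part of AutoScraper; it works only on the plain dict returned with grouped=True.

```python
def first_rule_per_result(grouped):
    """Return rule IDs, keeping only the first rule for each distinct non-empty result."""
    seen = set()
    keep = []
    for rule_id, values in grouped.items():
        key = tuple(values)
        # Skip rules that matched nothing or that repeat an earlier rule's output.
        if not values or key in seen:
            continue
        seen.add(key)
        keep.append(rule_id)
    return keep


# The grouped result shown in this section.
grouped = {
    'rule_sks3': ["Acer Predator Helios 300 15.6'' 144Hz FHD Laptop i7-9750H 16GB 512GB GTX 1660 Ti"],
    'rule_d4n5': ['ACER Predator Helios 300 i7-9750H 15.6" 144Hz FHD GTX 1660Ti 16GB 512GB SSD⚡RGB'],
    'rule_fmrm': ['ACER Predator Helios 300 i7-9750H 15.6" 144Hz FHD GTX 1660Ti 16GB 512GB SSD⚡RGB'],
    'rule_2ydq': ['US $1,229.49'],
    'rule_buhw': ['5.0'],
    'rule_vpfp': ['5.0'],
}

candidates = first_rule_per_result(grouped)
print(candidates)
```

This only removes duplicates. A rule that is unique but still unwanted (here rule_d4n5, the second title) has to be dropped by hand before passing the list to keep_rules.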

Building a scraper to work with multiple websites with incremental learning

Suppose we want to make a price scraper that works with multiple websites. Here we consider eBay, Walmart and Etsy. We create some sample data for each website and then feed it to the scraper. When the build method is called with the update=True parameter, all previously learned rules are kept and the new rules are added to them:

from autoscraper import AutoScraper

data = [
    # some eBay examples
    ('', ['US $349.99']),
    ('', ['US $1,369.00']),
    ('', ['US $8.95']),

    # some Walmart examples
    ('', ['US $8.95']),
    ('', ['$659.99']),

    # some Etsy examples
    ('', ['$12.50+']),
    ('', ['$1,500.00']),
]

scraper = AutoScraper()
for url, wanted_list in data:
    scraper.build(url=url, wanted_list=wanted_list, update=True)

Now hopefully the scraper has learned to scrape all 3 websites. Let's check some new pages:

>>> scraper.get_result_exact('')

['US $24.99', "PUMA Men's Turino Sneakers  | eBay"]

>>> scraper.get_result_exact('')

['$8.71', '(Pack of 8) Gerber 1st Foods Baby Food, Peach, 2-2 oz Tubs -']

>>> scraper.get_result_exact('')


Almost done! But there's some extra info. Let's fix it:

>>> scraper.get_result_exact('', grouped=True)

 {'rule_cqhs': [],
 'rule_h4sy': [],
 'rule_jqtb': [],
 'rule_r9qd': ['$8.71'],
 'rule_6lt7': ['$8.71'],
 'rule_2nrk': ['$8.71'],
 'rule_wy9j': ['$8.71'],
 'rule_v395': [],
 'rule_4ej6': ['(Pack of 8) Gerber 1st Foods Baby Food, Peach, 2-2 oz Tubs -']}

>>> scraper.remove_rules(['rule_4ej6'])
>>> scraper.get_result_exact('')

['US $24.99']

>>> scraper.get_result_exact('')


>>> scraper.get_result_exact('')


Now we have a scraper that works with eBay, Walmart and Etsy!
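
When many rules have been learned, spotting the unwanted one by eye gets tedious. The sketch below flags rules whose grouped output contains anything that does not look like a price. The regular expression and the helper name are assumptions made for this walkthrough (they cover formats such as 'US $24.99', '$1,500.00' and '$12.50+'); they are not part of AutoScraper.

```python
import re

# Heuristic for the price formats seen in this walkthrough.
PRICE_RE = re.compile(r'(?:US )?\$\d[\d,]*(?:\.\d{2})?\+?')


def non_price_rules(grouped):
    """Return IDs of rules that produced at least one value that is not price-like."""
    return [
        rule_id
        for rule_id, values in grouped.items()
        if any(PRICE_RE.fullmatch(value) is None for value in values)
    ]


# The grouped result shown earlier in this section.
grouped = {
    'rule_cqhs': [],
    'rule_h4sy': [],
    'rule_jqtb': [],
    'rule_r9qd': ['$8.71'],
    'rule_6lt7': ['$8.71'],
    'rule_2nrk': ['$8.71'],
    'rule_wy9j': ['$8.71'],
    'rule_v395': [],
    'rule_4ej6': ['(Pack of 8) Gerber 1st Foods Baby Food, Peach, 2-2 oz Tubs -'],
}

to_remove = non_price_rules(grouped)
print(to_remove)
# The list could then be passed to scraper.remove_rules(to_remove).
```

Rules that returned nothing on this page are left alone, since they may belong to one of the other websites.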

Fuzzy matching for HTML tag attributes

Some websites use different tag attribute values on different pages (for example, different styles for the same element). In these cases you can adjust the attr_fuzz_ratio parameter when getting the results. See this issue for a sample usage.
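
To get a feel for what a fuzz ratio below 1.0 means, the sketch below compares two attribute strings with the standard library's difflib. It only illustrates the idea of similarity-based matching; whether AutoScraper computes similarity in exactly this way is an assumption, and the class names are made up.

```python
from difflib import SequenceMatcher


def attrs_match(a, b, ratio):
    """True if two attribute strings are at least `ratio` similar (1.0 means identical)."""
    if ratio >= 1.0:
        return a == b
    return SequenceMatcher(None, a, b).ratio() >= ratio


learned = 'price price--large'   # hypothetical class seen when building
seen = 'price price--small'      # hypothetical class on another page

strict = attrs_match(learned, seen, 1.0)   # exact comparison
loose = attrs_match(learned, seen, 0.7)    # tolerates small differences
print(strict, loose)

# With the library, the ratio would be passed when getting results, e.g.:
# scraper.get_result_exact(url, attr_fuzz_ratio=0.7)
```

A lower ratio accepts more variation but also raises the chance of matching unrelated elements, so it is worth checking the grouped output after changing it.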

Using regular expressions

You can use regular expressions for wanted items:

import re

wanted_list = [re.compile('Lorem ipsum.+est laborum')]
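
Before handing a pattern to the scraper, it can help to check it against the text you expect on the page. The sample strings below are made up, and the sketch uses fullmatch as the stricter check; whether the library applies a full or partial match is not stated here, so treat that as an assumption.

```python
import re

pattern = re.compile('Lorem ipsum.+est laborum')

one_line = 'Lorem ipsum dolor sit amet, id est laborum'
two_lines = 'Lorem ipsum dolor sit amet,\nid est laborum'

# By default '.' does not match a newline, so text spanning lines fails.
matches_one_line = pattern.fullmatch(one_line) is not None
matches_two_lines = pattern.fullmatch(two_lines) is not None

# re.DOTALL lets '.' match newlines as well.
dotall = re.compile('Lorem ipsum.+est laborum', re.DOTALL)
matches_with_dotall = dotall.fullmatch(two_lines) is not None

print(matches_one_line, matches_two_lines, matches_with_dotall)
```
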
Hi, can someone help with how to save the scraped data into a CSV file?
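
A minimal sketch for the CSV question, assuming the results are already available as a dict of column name to list of values (for example, a grouped result with the rule IDs renamed). The function name and sample data are hypothetical; only the standard library is used.

```python
import csv
import io
from itertools import zip_longest


def grouped_to_csv(columns, out):
    """Write a dict of column name -> list of values as CSV, padding short columns."""
    writer = csv.writer(out)
    writer.writerow(columns.keys())
    # zip_longest pads shorter columns with '' so no values are dropped.
    for row in zip_longest(*columns.values(), fillvalue=''):
        writer.writerow(row)


columns = {'name': ['Item A', 'Item B'], 'price': ['$8.71']}

buffer = io.StringIO()
grouped_to_csv(columns, buffer)
csv_text = buffer.getvalue()
print(csv_text)

# To write a real file instead of an in-memory buffer:
# with open('results.csv', 'w', newline='') as f:
#     grouped_to_csv(columns, f)
```

Padding keeps every value but does not guarantee that values on the same row belong to the same item when the columns have different lengths.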

Hi @alirezamika, I want to know, can we send keys or click on any link in any website using autoscraper?

Hi, when I tried to get a list of product names and prices, there were 2 problems:
1 - a warning about bulk data => I solved it by zipping in a for loop
2 - not all the data is returned, just 36 items (I don't know why or how to solve it)
Could you please help me?

Hi @alirezamika ,

I tried the code here and maybe I am doing something wrong but couldn't get it to work as I had expected.

Firstly, I am relatively new to Web-scraping and saw this while working on another project.

I would like to fetch information from a table on a webpage. When I specify the model of the CPU in my wanted list, I get an empty array returned most of the time, except when I use the CPU name, which then returns only 2 results.

import requests
import auto_scraper
import autoscraper

from autoscraper import AutoScraper

cpuUrl = ''
gpuUrl = ""

# We can add one or multiple candidates here.
# You can also put urls here to retrieve urls.
cpu_wanted_list = {"Manufacturer":["AMD","Intel"],"Release Date":['2020','2021','2022','2023']}
#gpu_wanted_list = ["AMD","Intel"]
gpu_wanted_list = ["Nvidia","Geforce","AMD", "Radeon"]

scraper = AutoScraper()
cpuResult = scraper.build(cpuUrl, cpu_wanted_list)
gpuResult = scraper.build(gpuUrl, gpu_wanted_list)

print("Printing CPU relsult:")

relatedCPUResult = scraper.get_result_similar('')

I am going to and trying to fetch the table items as my result.

When I do get a result, it seems to be fetching values from the Refine Search parameters instead of the actual result:
Printing CPU relsult: ['Manufacturer', 'Release Date', 'Mobile', 'Server', 'TDP', 'Cores', 'Threads', 'Generation', 'Socket', 'Codename', 'Process', 'Multiplier Unlocked', 'Integrated Graphics'] ['AMD', 'Intel', '2022', 'No', '9 W', '2', 'Intel Pentium', 'AMD Socket 939', 'Alder Lake-S', '7 nm', 'Yes'].

I appended my code above in case I am missing something. Thanks in advance.

anoduck commented Oct 10, 2023

OK, I was hoping I could figure this out, but it is 7am and I have been up all night... so... burning spent fuel here.

Autoscraper saved me SO much time, it is ridiculous, but there is a little hitch. All of my data was returned in one huge list. Items are grouped together, but the number of items differs from type to type, and the items are not matched with their associated data.

(example data generated from Faker module)

In other words, the results look like:

['Tiffany Williams',
 'Kimberly Ramirez',
 'Marissa Wilson',
 'David King',
 'Jasmine Wilson',
 'Rebecca Swanson',
 '9385 Sean Courts Suite 395\nLopezborough, GU 46379',
 '338 Andrea Locks Suite 075\nMontgomerytown, OH 92933',
 '075 Barnett Walks\nNorth Tannerview, NH 64984',
 '00146 Newton Expressway\nSarahfort, MS 62136',
 '0111 Porter Curve Apt. 986\nFosterstad, KS 27560',
 '0668 Douglas Harbor\nWest Amyport, PW 83959',
 '317 Theresa Run\nNorth Angelafurt, MO 32844',
 '46803 Mueller Parks Suite 903\nPort Patrickmouth, MI 09350',
 '8474 Kimberly Point Suite 958\nPhamfort, MN 21067',
 '59573 William Light Suite 476\nSouth Dylan, DC 73663',
 'USCGC Wilson\nFPO AE 95307',
 '49219 Mcconnell Ranch\nNorth Robertport, UT 17995',
 '1369 Jeffrey Island\nCatherinemouth, MO 90968',
 '91483 Petersen Flats Apt. 265\nSilvaland, CO 46272']

Rather than:

                 Name                  Number                                            Address
0    Tiffany Williams       511-538-9955x9371  9385 Sean Courts Suite 395\nLopezborough, GU 4...
1    Kimberly Ramirez        582-816-1125x878  338 Andrea Locks Suite 075\nMontgomerytown, OH...
2      Marissa Wilson        343.352.1379x820      075 Barnett Walks\nNorth Tannerview, NH 64984
3          David King      805.755.6352x44545                                                Nan
4      Jasmine Wilson   001-541-393-9153x0600   0111 Porter Curve Apt. 986\nFosterstad, KS 27560
5     Rebecca Swanson       362-438-7059x3506        0668 Douglas Harbor\nWest Amyport, PW 83959
6    Christina Potter        802.484.3879x623        317 Theresa Run\nNorth Angelafurt, MO 32844
7         James Eaton    +1-280-463-9311x9452  46803 Mueller Parks Suite 903\nPort Patrickmou...
8      Laura Gonzalez    +1-436-455-5647x6468  8474 Kimberly Point Suite 958\nPhamfort, MN 21067
9     Rebecca Freeman       604.974.6647x2368  59573 William Light Suite 476\nSouth Dylan, DC...
10   Mr. William Lara     (848)389-3506x26756                                                Nan
11       Daniel Avila   001-490-540-8510x3636  49219 Mcconnell Ranch\nNorth Robertport, UT 17995
12       Andrew Price       886.582.9972x2800      1369 Jeffrey Island\nCatherinemouth, MO 90968
13       Joseph Smith         +1-388-226-7496  91483 Petersen Flats Apt. 265\nSilvaland, CO 4...
14     James Anderson        433-526-5687x642                                                Nan
15       Brandon Tate              6705266223  5467 Logan Terrace Apt. 127\nMichaelberg, PW 6...
16     Travis Wallace  001-767-613-1216x64547  99422 Justin Ramp Apt. 203\nNew Johnmouth, FL ...
17      Michelle Wong        001-249-343-4216  9324 Meghan Trail Apt. 103\nPhillipburgh, AK 4...
18          Jean Lowe              8703452366  0053 Dale Plains Suite 173\nEast Deniseburgh, ...
19     Melinda Tucker    001-649-372-1670x229          3281 Sarah Points\nPort Richard, PW 02531

Please provide your code.

anoduck commented Oct 15, 2023

Really just straightforward from the examples.

from autoscraper import AutoScraper

scraper = AutoScraper()

url = ''

wants = ['Jonathon Howard Kaplan', '(213) 553-4550', '355 S. Grand Ave. Suite 2450 Los Angeles, CA 90071', 'Civil Rights and Employment', 'Duke University School of Law and Duke University Law School']

results = scraper.build(url, wants)

As previously mentioned, this returns one long list, e.g. [all names] + [all phone numbers] + [all addresses]. It isn't such a big issue, because this list can be broken down using list.index() and list[x:], except that the length of each categorical list differs, e.g. len(name_list) = 39 and len(phone_list) = 27. Thus, without knowing exactly which categorical item goes with which name, reassembling the original dataset programmatically appears impossible.

I even attempted to break the scraping process down into individual parts, but as before, the lengths of the categorical lists varied.

It was not until much later that I discovered the wanted_dict parameter. I was just unsure how to structure the dict for autoscraper to accept it.
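
On the structure question: as far as I understand the library's README, wanted_dict maps an alias (a name you choose) to a list of example values, in the same way wanted_list is a list of examples. The sketch below only builds that shape from one hand-copied record; the aliases 'name', 'phone' and 'address' and the helper name are made up, and the commented-out library calls are assumptions to verify against your installed version.

```python
def to_wanted_dict(sample_record):
    """Wrap each sample value in a list, giving the alias -> [examples] shape."""
    return {alias: [value] for alias, value in sample_record.items()}


# One record copied by hand from the page (values taken from the comment above).
sample_record = {
    'name': 'Jonathon Howard Kaplan',
    'phone': '(213) 553-4550',
    'address': '355 S. Grand Ave. Suite 2450 Los Angeles, CA 90071',
}

wanted_dict = to_wanted_dict(sample_record)
print(wanted_dict)

# Assumed usage with the library (not executed here):
# scraper.build(url=url, wanted_dict=wanted_dict)
# scraper.get_result_similar(url, group_by_alias=True)
```

Grouping by alias gives one list per field, which removes the need for list.index() slicing. It does not by itself align rows when some records lack a field, so the differing list lengths would still need handling.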

Is it possible to define the crawl depth? For example, one URL might link to other URLs. If we give the parent URL, can it crawl all the child URLs along with the content present on the parent?

