AutoScraper Examples

Grouping results and removing unwanted ones

Here we want to scrape the product name, price and rating from eBay product pages:

from autoscraper import AutoScraper

url = 'https://www.ebay.com/itm/Sony-PlayStation-4-PS4-Pro-1TB-4K-Console-Black/203084236670'

wanted_list = ['Sony PlayStation 4 PS4 Pro 1TB 4K Console - Black', 'US $349.99', '4.8']

scraper = AutoScraper()
scraper.build(url, wanted_list)

The items we want appear in multiple sections of the page, and the scraper tries to catch them all, so it may retrieve some extra information beyond what we have in mind. Let's run it on a different page:

scraper.get_result_exact('https://www.ebay.com/itm/Acer-Predator-Helios-300-15-6-144Hz-FHD-Laptop-i7-9750H-16GB-512GB-GTX-1660-Ti/114183725523') 

The result:

[
    "Acer Predator Helios 300 15.6'' 144Hz FHD Laptop i7-9750H 16GB 512GB GTX 1660 Ti",
    'ACER Predator Helios 300 i7-9750H 15.6" 144Hz FHD GTX 1660Ti 16GB 512GB SSD⚡RGB',
    'US $1,229.49',
    '5.0'
]

As we can see, we have one extra item here. We can run the get_result_exact or get_result_similar method with the grouped=True parameter; it will group all results by the scraping rule that produced them:

scraper.get_result_exact('https://www.ebay.com/itm/Acer-Predator-Helios-300-15-6-144Hz-FHD-Laptop-i7-9750H-16GB-512GB-GTX-1660-Ti/114183725523', grouped=True) 

Output:

{
    'rule_sks3': ["Acer Predator Helios 300 15.6'' 144Hz FHD Laptop i7-9750H 16GB 512GB GTX 1660 Ti"],
    'rule_d4n5': ['ACER Predator Helios 300 i7-9750H 15.6" 144Hz FHD GTX 1660Ti 16GB 512GB SSD⚡RGB'],
    'rule_fmrm': ['ACER Predator Helios 300 i7-9750H 15.6" 144Hz FHD GTX 1660Ti 16GB 512GB SSD⚡RGB'],
    'rule_2ydq': ['US $1,229.49'],
    'rule_buhw': ['5.0'],
    'rule_vpfp': ['5.0']
}

Now we can use the keep_rules or remove_rules methods to prune unwanted rules:

scraper.keep_rules(['rule_sks3', 'rule_2ydq', 'rule_buhw'])
 
scraper.get_result_exact('https://www.ebay.com/itm/Acer-Predator-Helios-300-15-6-144Hz-FHD-Laptop-i7-9750H-16GB-512GB-GTX-1660-Ti/114183725523') 

And now the result contains only the items we want:

[
    "Acer Predator Helios 300 15.6'' 144Hz FHD Laptop i7-9750H 16GB 512GB GTX 1660 Ti",
    'US $1,229.49',
    '5.0'
]
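
Equivalently, we could have dropped the unwanted rules instead of listing the ones to keep, using the rule IDs from the grouped output above (rule IDs are generated per run, so yours may differ):

scraper.remove_rules(['rule_d4n5', 'rule_fmrm', 'rule_vpfp'])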

Building a scraper to work with multiple websites with incremental learning

Suppose we want to make a price scraper that works with multiple websites. Here we consider ebay.com, walmart.com and etsy.com. We create some sample data for each website and then feed it to the scraper. By passing the update=True parameter when calling the build method, all previously learned rules are kept and new rules are added to them:

from autoscraper import AutoScraper

data = [
   # some Ebay examples
   ('https://www.ebay.com/itm/Sony-PlayStation-4-PS4-Pro-1TB-4K-Console-Black/193632846009', ['US $349.99']),
   ('https://www.ebay.com/itm/Acer-Predator-Helios-300-15-6-FHD-Gaming-Laptop-i7-10750H-16GB-512GB-RTX-2060/303669272117', ['US $1,369.00']),
   ('https://www.ebay.com/itm/8-TAC-FORCE-SPRING-ASSISTED-FOLDING-STILETTO-TACTICAL-KNIFE-Blade-Pocket-Open/331625445801', ['US $8.95']),
   
   # some Walmart examples
   ('https://www.walmart.com/ip/8mm-Classic-Sterling-Silver-Plain-Wedding-Band-Ring/113651182', ['US $8.95']),
   ('https://www.walmart.com/ip/Apple-iPhone-11-64GB-Red-Fully-Unlocked-A-Grade-Refurbished/806414606', ['$659.99']),

   # some Etsy examples
   ('https://www.etsy.com/listing/805075149/starstruck-silk-face-mask-black-silk', ['$12.50+']),
   ('https://www.etsy.com/listing/851553172/apple-macbook-pro-i9-32gb-500gb-radeon', ['$1,500.00']),
]

scraper = AutoScraper()
for url, wanted_list in data:
   scraper.build(url=url, wanted_list=wanted_list, update=True)

Now hopefully the scraper has learned to scrape all 3 websites. Let's check some new pages:

>>> scraper.get_result_exact('https://www.ebay.com/itm/PUMA-Mens-Turino-Sneakers/274324387149')

['US $24.99', "PUMA Men's Turino Sneakers  | eBay"]


>>> scraper.get_result_exact('https://www.walmart.com/ip/Pack-of-8-Gerber-1st-Foods-Baby-Food-Peach-2-2-oz-Tubs/267133209')

['$8.71', '(Pack of 8) Gerber 1st Foods Baby Food, Peach, 2-2 oz Tubs - Walmart.com']


>>> scraper.get_result_exact('https://www.etsy.com/listing/863615551/matte-black-smart-wireless-bluetooth')

['$60.00']

Almost done! But there's some extra info; let's fix it:

>>> scraper.get_result_exact('https://www.walmart.com/ip/Pack-of-8-Gerber-1st-Foods-Baby-Food-Peach-2-2-oz-Tubs/267133209', grouped=True)

{'rule_cqhs': [],
 'rule_h4sy': [],
 'rule_jqtb': [],
 'rule_r9qd': ['$8.71'],
 'rule_6lt7': ['$8.71'],
 'rule_2nrk': ['$8.71'],
 'rule_wy9j': ['$8.71'],
 'rule_v395': [],
 'rule_4ej6': ['(Pack of 8) Gerber 1st Foods Baby Food, Peach, 2-2 oz Tubs - Walmart.com']}


>>> scraper.remove_rules(['rule_4ej6'])
>>> scraper.get_result_exact('https://www.ebay.com/itm/PUMA-Mens-Turino-Sneakers/274324387149')

['US $24.99']


>>> scraper.get_result_exact('https://www.walmart.com/ip/Pack-of-8-Gerber-1st-Foods-Baby-Food-Peach-2-2-oz-Tubs/267133209')

['$8.71']


>>> scraper.get_result_exact('https://www.etsy.com/listing/863615551/matte-black-smart-wireless-bluetooth')

['$60.00']

Now we have a scraper which works with eBay, Walmart and Etsy!

Fuzzy matching for HTML tag attributes

Some websites use different attribute values on different pages (like different styles for the same element). In these cases, you can adjust the attr_fuzz_ratio parameter when getting the results. See this issue for a sample usage.
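
For example, a minimal sketch (the URL here is illustrative); a ratio below 1.0 tolerates small differences in attribute values between pages:

scraper.get_result_similar('https://example.com/some-product-page', attr_fuzz_ratio=0.9)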

Using regular expressions

You can use regular expressions for wanted items:

wanted_list = [re.compile('Lorem ipsum.+est laborum')]
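
For example, a minimal sketch using the eBay page from the first example (the pattern here is illustrative; any text on the page matching it becomes a wanted item):

import re
from autoscraper import AutoScraper

url = 'https://www.ebay.com/itm/Sony-PlayStation-4-PS4-Pro-1TB-4K-Console-Black/203084236670'
# match the price text with a pattern instead of an exact string
wanted_list = [re.compile(r'US \$\d+\.\d+')]

scraper = AutoScraper()
scraper.build(url, wanted_list)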
@vjhebbar commented Sep 16, 2020

Would the trained model be able to parse a 4th website that might be similar to the first 3?

@raokrutarth commented Sep 16, 2020

If I wanted to use this to scrape the text and images from a medium or similar article, what would the parameters look like?

@alirezamika (owner) commented Sep 17, 2020

Would the trained model be able to parse a 4th website that might be similar to the first 3?

The structure of the new website should be nearly identical to the learned ones.

If I wanted to use this to scrape the text and images from a medium or similar article, what would the parameters look like?

For Medium, you can use it like this:

from autoscraper import AutoScraper 
 
url = 'https://medium.com/@Medium/statistics-2971adaa615' 
 
wanted_list = ['Your stats page will allow you to see how people are interacting with your stories. To access your stats, click on you avatar in the top right corner and then chooseStatsfrom the menu.', "https://miro.medium.com/max/2528/0*1RlS1lN-dt4Igp1J.png"]
 
scraper = AutoScraper() 
scraper.build(url, wanted_list) 

scraper.get_result_similar(url, contain_sibling_leaves=True, keep_order=True) 
@raokrutarth commented Sep 20, 2020

Thanks @alirezamika! Will give it a shot and report results. Approximately what should the rate of false negatives be for blog sites? I.e. out of 100 potential articles, how many should I expect to get incorrectly truncated or with missing images? (Just curious since I don't yet know the internal algorithm of the learning process.)

@stordopoulos commented Sep 22, 2020

@alirezamika, is it possible to get the results along with their corresponding target locator (e.g. id, class, etc.)?

@alirezamika (owner) commented Sep 23, 2020

Thanks @alirezamika! Will give it a shot and report results. Approximately what should the rate of false negatives be for blog sites? I.e. out of 100 potential articles, how many should I expect to get incorrectly truncated or with missing images? (Just curious since I don't yet know the internal algorithm of the learning process.)

It depends on the structure of the site. If it's consistent across pages, the false-negative rate should be low. If it's not consistent, you should add samples for different pages.

@alirezamika, is it possible to get the results along with their corresponding target locator (e.g. id, class, etc.)?

All the locators are present in the stack_list attribute of the scraper. You can get the results grouped by rule_id with the grouped=True parameter and check the corresponding rule locators in the stack_list.
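
For example, a minimal sketch of inspecting them (the fields follow the stack_list structure shown later in this thread):

for rule in scraper.stack_list:
    # each 'content' entry is [tag, attributes, index], describing the path down to the target element
    print(rule['stack_id'], rule['alias'], rule['wanted_attr'])
    print(rule['content'])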

@Mmokwa commented Sep 29, 2020

If I wanted to scrape videos on Instagram, what would the parameters be?

@mzakariaCERN commented Oct 4, 2020

Hi! This is very interesting. I am trying to retrieve two columns from, say, Wikipedia:
https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States_by_population

1. Proper usage

I know that the following will give me two lists that I can join as a DF and continue:

from autoscraper import AutoScraper

url = 'https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States_by_population'

wanted_list_state = ["California"]
wanted_list_population = ["37,253,956"]

scraper = AutoScraper()
result_statename = scraper.build(url, wanted_list_state)
print(result_statename)

result_population = scraper.build(url, wanted_list_population)
print(result_population)

Is this the proper way to use this method, or is there a better practice to pass State and 2020 population to the parser?

2. I notice that the example I gave was the state of California, and I ended up with a list of the 50 states + DC (missing the 5 territories). Now how on earth did it do it that way? I expected it to just pull the 50 states + DC + 5 territories (the list as is).
@alirezamika (owner) commented Oct 4, 2020

@mzakariaCERN you can do it like this:

from autoscraper import AutoScraper

url = 'https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States_by_population'
wanted_dict = {'state': ["California"], 'population': ["37,253,956"]}

scraper = AutoScraper()
scraper.build(url, wanted_dict=wanted_dict)
result = scraper.get_result_similar(url, group_by_alias=True)
print(result)

If you want to get territories, you can add one of them to the list as well.
Sometimes you may get duplicate results, which you can fine-tune with the keep_rules and remove_rules methods (or using the unique=True parameter would be okay in some cases).

@mzakariaCERN commented Oct 4, 2020

This makes good sense. Thank you @alirezamika

Moving to a different website

https://cancerstatisticscenter.cancer.org/#!/cancer-site/Non-Hodgkin%20lymphoma

I am trying to pull the data from a table that lists estimated infection rate by state

from autoscraper import AutoScraper

url = 'https://cancerstatisticscenter.cancer.org/#!/cancer-site/Non-Hodgkin%20lymphoma'
wanted_dict = {'State': ["Alaska"], 'Estimated new cases, 2020': ["120"]}

scraper = AutoScraper()
scraper.build(url, wanted_dict=wanted_dict)
result = scraper.get_result_similar(url, group_by_alias=True)
print(result)

gave an empty dictionary. What is the cause of this? Is it the format of the website not being friendly to this approach?

@alirezamika (owner) commented Oct 4, 2020

@mzakariaCERN This website uses JS and AJAX calls to retrieve and populate data. You should get the rendered HTML source of the page (with tools like Puppeteer, or just copy it from your browser if it's a one-time job) and pass it to the scraper via the html parameter.
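
For example, a minimal sketch assuming you have saved the rendered page from your browser to a local file (the filename is illustrative; the URL and wanted values are taken from your snippet above):

from autoscraper import AutoScraper

# rendered HTML saved from the browser (after the JS/AJAX calls have run)
with open('rendered_page.html', encoding='utf-8') as f:
    html = f.read()

url = 'https://cancerstatisticscenter.cancer.org/#!/cancer-site/Non-Hodgkin%20lymphoma'
wanted_dict = {'State': ["Alaska"], 'Estimated new cases, 2020': ["120"]}

scraper = AutoScraper()
scraper.build(url, wanted_dict=wanted_dict, html=html)
result = scraper.get_result_similar(url, html=html, group_by_alias=True)
print(result)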

@rodrihazar commented Oct 7, 2020

Hi. I don't have much programming experience and I'm learning; I've been looking for something like this for 2 months. It's cool what you created!
I would like to know if you could explain to me how to extract specific data from a web page (specifically the price of a share) using CSS or HTML, because in Argentina (my country) web pages are not as well structured as in the United States, where it is simpler to do this.
I want to use this scraper to extract stock data in real time and display it on my WordPress website.
Hope you can help me, thank you very much.

@alirezamika (owner) commented Oct 7, 2020

@rodrihazar please open an issue containing the website and the code which you have tried.

@rodrihazar commented Oct 8, 2020

@rodrihazar please open an issue containing the website and the code which you have tried.

First of all, sorry if what I'm going to ask is very silly, but I'm not very familiar with the code on GitHub or how to implement it.
What I want to do is the following:

I want to extract a real-time quote from this page https://www.puentenet.com/cotizaciones/accion/ALUA and show it in a block on my web page created with WordPress.
To do this, I came up with the following logical path: extract the data, connect it to MySQL with code, and show it on the web with PHP code or some plugin.
The problem is that I don't know how to extract the data; I spent months searching and I think that with your creation I can do it. But when I copied the following code in Python, it gives me the answer: None

from autoscraper import AutoScraper

url = 'https://www.puentenet.com/cotizaciones/accion/ALUA'

scraper.get_result_exact('https://www.puentenet.com/cotizaciones/accion/ALUA')

scraper = AutoScraper()

result = scraper.build(url, wanted_list)

print(result)

Obviously I did something wrong, and I understand that the syntax is wrong, but I don't understand where the error is.

Using other code, they asked me for the XPath (in this case it is //*[@id="mainContainer"]/div[1]/div[1]/div[1]/span), but with your code I don't know where to place it, or how to extract just that data.

Once the data is extracted, can your code extract data in real time? Or should I create a cron job to update it every so often?

Thank you very much!

@alirezamika (owner) commented Oct 9, 2020

@rodrihazar this website uses cookies, which you can obtain via the requests session module, like this:

import requests
from autoscraper import AutoScraper

scraper = AutoScraper()
s = requests.session()

url = 'https://www.puentenet.com/cotizaciones/accion/ALUA'
s.get(url, headers=scraper.request_headers)
s.get(url, headers=scraper.request_headers)

wanted_list = ['53,00']

scraper.build(url, wanted_list, request_args={'cookies': s.cookies.get_dict()})
@rulosant commented Oct 21, 2020

Great tool! I want to know how to understand saved projects, so I can manually edit them.
I'm getting a very specific rule and I need something more generic when I scrape this site: http://gnula.nu/

I get these rules:

{"stack_list": [{"content": [["html", {"style": "", "class": ""}, 0], ["body", {"style": "", "class": ""}, 0], ["div", {"style": "", "class": ""}, 0], ["div", {"class": ["content"], "style": ""}, 1], ["div", {"style": "", "class": ""}, 0], ["div", {"class": ["post"], "style": ""}, 0], ["div", {"class": ["cover"], "style": ""}, 0], ["div", {"class": ["entry"], "style": "padding:5px 10px"}, 0], ["div", {"style": "text-align: center;", "class": ""}, 0], ["div", {"class": ["widget-content"], "style": ""}, 0], ["table", {"style": "background: none repeat scroll 0% 0% #ffffff; border-collapse: collapse; border-color: #ffd1ba; border-width: 0pt;", "class": ""}, 0], ["tbody", {"style": "", "class": ""}, 0], ["tr", {"style": "", "class": ""}, 0], ["td", {"style": "background: none repeat scroll 0% 0% #382F2A; border: 1px 1px 0px solid #3b3430;", "class": ""}, XXXXX], ["a", {"style": "", "class": ""}]], "wanted_attr": "href", "is_full_url": null, "url": "", "hash": "ee645243ec557475acca2196a919a73c9de73ec0f6c00b88cbd9ebc903b83aa1", "stack_id": "rule_irdn", "alias": ""}]}

The XXXXX is the TD number; if I change that number, I will get another TD, but I need all TD elements, of all rows, of the first table.

Can you tell me how I can do that?

@alirezamika (owner) commented Oct 21, 2020

@rulosant can you share your code, so I can understand your problem better?

@rulosant commented Oct 22, 2020

Here is the code:

from autoscraper import AutoScraper

url = 'http://gnula.nu/'

# We can add one or multiple candidates here.
# You can also put urls here to retrieve urls.
wanted_list = ["http://gnula.nu/thriller/ver-tenet-2020-online/"]

scraper = AutoScraper()

result = scraper.build(url, wanted_list)
print(result)
result = scraper.get_result_similar(url, group_by_alias=True)

print(result)
scraper.save('gnula')

That wanted list item returns only the first TD. When I change it or add multiple items, it finds only the TD corresponding to the item in the wanted list, but I need all the TDs of the table.

@manoj-nain commented Oct 22, 2020

Regarding the Medium example above (get_result_similar with contain_sibling_leaves=True, keep_order=True):

I tried this on Medium, but it's only working on the blog link that you have given here. It doesn't work on any other Medium blog. It also doesn't work on any WordPress blog; maybe the issue with WordPress is AJAX requests, but I don't know why it's not working on Medium.

@alirezamika (owner) commented Oct 22, 2020

That wanted list item returns only the first TD. When I change it or add multiple items, it finds only the TD corresponding to the item in the wanted list, but I need all the TDs of the table.

@rulosant you can use:

scraper.get_result_similar(url, contain_sibling_leaves=True)
@rulosant commented Oct 22, 2020

That works perfectly! Is that documented anywhere?

Now I want to get the results only from one of these divs (screenshot omitted).

In the saved project I see:
["div", {"style": "text-align: center;", "class": ""}, 0]

I thought that changing 0 to 1 would take only the 1st div, but I was wrong.

Thank you!

@alirezamika (owner) commented Oct 26, 2020

That works perfectly! Is that documented anywhere?

Just in the module doc-string.

I thought that changing 0 to 1 would take only the 1st div, but I was wrong.

You can use the get_result_exact method if you want to get specific elements.

@sadickam commented Oct 28, 2020

Hi @alirezamika, thank you for this very interesting tool. I tried the code below and got an empty dictionary. I would be grateful if you could have a look and kindly point me in the right direction.

from autoscraper import AutoScraper

url = "http://www.austlii.edu.au/cgi-bin/viewdoc/au/cases/vic/VCAT/2020/460.html? 
context=1;query=construction%20dispute;mask_path=au/cases/vic/VCAT"

wanted_dict = {'catchwords': ["Building & Property List –Termination of domestic building contacts by repudiation – Breach of implied duty 
to cooperate – Defective workmanship – Variations – Non-compliance with notice requirements – Whether exceptional circumstances or 
hardship – Calculation of award of damages."]}

scraper = AutoScraper()
scraper.build(url, wanted_dict=wanted_dict)
result = scraper.get_result_similar(url, group_by_alias=True)
print(result)

Thank you for your help.

@alirezamika (owner) commented Oct 28, 2020

Hi @alirezamika, thank you for this very interesting tool. I tried the code below and got an empty dictionary. I would be grateful if you could have a look and kindly point me in the right direction.

@sadickam The text which you are trying to scrape contains lots of newline characters. You can copy it from the page source. Try this:

from autoscraper import AutoScraper

url = "http://www.austlii.edu.au/cgi-bin/viewdoc/au/cases/vic/VCAT/2020/460.html?context=1;query=construction%20dispute;mask_path=au/cases/vic/VCAT"

wanted_dict = {'catchwords': ["""Building & Property List –Termination of domestic building
contacts by repudiation – Breach of implied duty to cooperate

Defective workmanship – Variations – Non-compliance with notice
requirements – Whether exceptional circumstances
or hardship –
Calculation of award of damages."""]}

scraper = AutoScraper()
scraper.build(url, wanted_dict=wanted_dict)
result = scraper.get_result_similar(url, group_by_alias=True)
print(result)
@sadickam commented Oct 29, 2020

@alirezamika, thank you so much. That makes a lot of sense. So the algorithm needed to learn that the text has lots of newline characters, am I right? Thanks all the same. Really grateful.

@alirezamika (owner) commented Oct 29, 2020

That makes a lot of sense. So the algorithm needed to learn that the text has lots of newline characters, am I right?

The scraper expects an exact match. In some rare cases like yours, the text copied from the browser may differ from the source code.

@zkung2 commented Nov 1, 2020

@alirezamika
Chinese content is not accessible. Can you modify the code to add something like this?

    @classmethod
    def _get_soup(cls, url=None, html=None, request_args=None):
        request_args = request_args or {}

        if html:
            html = unicodedata.normalize("NFKD", unescape(html))
            return BeautifulSoup(html, 'lxml')

        headers = dict(cls.request_headers)
        if url:
            headers['Host'] = urlparse(url).netloc

        user_headers = request_args.pop('headers', {})
        headers.update(user_headers)
        # Change the code here
        r = requests.get(url, headers=headers, **request_args)
        r.encoding = r.apparent_encoding
        html = r.text
        html = unicodedata.normalize("NFKD", unescape(html))

        return BeautifulSoup(html, 'lxml')

original

        html = requests.get(url, headers=headers, **request_args).text
        html = unicodedata.normalize("NFKD", unescape(html))

modified

        r = requests.get(url, headers=headers, **request_args)
        r.encoding = r.apparent_encoding
        html = r.text
        html = unicodedata.normalize("NFKD", unescape(html))
@alirezamika (owner) commented Nov 1, 2020

Chinese content is not accessible. Can you modify the code to add something like this?

Can you give an example? There shouldn't be any language-specific problem, as all strings are unicode.

@zkung2 commented Nov 2, 2020

Chinese content is not accessible. Can you modify the code to add something like this?

Can you give an example? There shouldn't be any language-specific problem, as all strings are unicode.

It is not possible to extract content on some Chinese websites.
Is there any way to extract Chinese content from this website? Thank you.

from autoscraper import AutoScraper

url = 'https://top.chinaz.com/diqu/index_GuangDong_ShenZhen.html'

# We can add one or multiple candidates here.
# You can also put urls here to retrieve urls.
data_old = {'web_name':['腾讯网'], 'url':['qq.com']}

scraper = AutoScraper()
scraper.build(url, wanted_dict=data_old)
scraper.get_result_similar('https://top.chinaz.com/diqu/index_GuangDong_ShenZhen.html', group_by_alias=True)

Result:

{'url': ['qq.com',
  'v.qq.com',
  'kuaidi100.com',
  'bendibao.com',
  '500.com',
  'y.qq.com',
  'sf-express.com',
  'huawei.com',
  'maigoo.com',
  'to8to.com',
  'b2b168.com',
  'elecfans.com',
  '11467.com',
  'chachaba.com',
  'szhk.com',
  'tencent.com',
  'tvsou.com',
  'news.qq.com',
  'jiwu.com',
  'sz.gov.cn',
  '51sole.com',
  'vmall.com',
  'mail.qq.com',
  'shejiben.com',
  'ent.qq.com',
  'lol.qq.com',
  'xunlei.com',
  'aicai.com',
  'szhome.com',
  'sports.qq.com']}

html source code text encoding:

<!doctype html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>深圳网站 - 深圳网站排名 - 网站排行榜</title>
@alirezamika (owner) commented Nov 2, 2020

Is there any way to extract Chinese content from this website? Thank you.

@zkung2 try:

import requests
from autoscraper import AutoScraper

url = 'https://top.chinaz.com/diqu/index_GuangDong_ShenZhen.html'
res = requests.get(url)
res.encoding = 'utf-8'
html = res.text

wanted_dict = {'web_name':['腾讯网'], 'url':['qq.com']}

scraper = AutoScraper()
scraper.build(url, html=html, wanted_dict=wanted_dict)
scraper.get_result_similar(url, html=html, group_by_alias=True)
@zkung2 commented Nov 2, 2020

@alirezamika Thank you very much.

@cpun94 commented Nov 4, 2020

This is a great tool!

I'm trying to perform a scrape of the device brand, device name, monthly price (..../mo.), down payment ($0) for most, and savings (i.e., "Save up to $450") from the following website:

https://www.telus.com/en/mobility/phones

When I try to scrape the rules, the rule IDs corresponding to the data change, so I am unable to save rules. Was wondering if I could get your help on this. My code is below.

from autoscraper import AutoScraper

url = 'https://www.telus.com/en/mobility/phones'
wanted_list = ['Apple', 'iPhone 12 Pro', '41', '.25', '0', 'Save up to $450']

scraper = AutoScraper()
scraper.build(url=url, wanted_list=wanted_list)

scraper.get_result_exact(url, grouped=True)

@alirezamika (owner) commented Nov 4, 2020

@cpun94 if you want to have the same IDs on each run, add this at the start of your code:

import random
random.seed(0)

I also suggest achieving this by using aliases instead of rule IDs if you can.
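
For example, a minimal sketch of the alias approach, reusing the URL and a few of the values from your snippet (the alias names here are just illustrative):

from autoscraper import AutoScraper

url = 'https://www.telus.com/en/mobility/phones'
wanted_dict = {'brand': ['Apple'], 'name': ['iPhone 12 Pro'], 'saving': ['Save up to $450']}

scraper = AutoScraper()
scraper.build(url=url, wanted_dict=wanted_dict)
result = scraper.get_result_similar(url, group_by_alias=True)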

@cpun94 commented Nov 4, 2020

@cpun94 if you want to have the same IDs on each run, add this at the start of your code:

import random
random.seed(0)

I also suggest achieving this by using aliases instead of rule IDs if you can.

Sorry, I'm having a little bit of trouble understanding your suggestion at the very end. Is there an example you can refer me to, to help me better understand it?

@PaperChonChon commented Nov 14, 2020

How do I get variable values or direct HTML codes instead?
For example:

<script type="text/javascript" language="JavaScript">
                                            var totalPrice = "\'$55.40\'";
                                            var totalQuantity = '7';
</script>

I want to capture 7 or '7', can the scraper do this?
I have tried wanted_list = ["xxx","totalQuantity = '7'"] but it did not work.

Also, using get_result_exact or get_result_similar, I always get empty results like ({'name': [], 'desc': [], 'article': [], 'photo': [], 'price': [], 'subtotal': []}, {'name': [], 'desc': [], 'article': [], 'photo': [], 'price': [], 'subtotal': []}) even though my build() works just fine.
My dictionary:

wanted_dict = {	"name":["a"],
				"desc":["b"],
				"article":["111"],
				"photo":["c"],
				"desc":["d"],
				"price":["$1"],
				"subtotal":["$2"]
}
result = scraper.build(url=url, wanted_dict=wanted_dict, update=True, request_args=dict(proxies=proxies, verify=False,cookies=cookies1))
result1 = scraper.get_result(url, group_by_alias=True) 
print(result1)
@alirezamika (owner) commented Nov 14, 2020

@PaperChonChon the scraper works with text and attribute values for now, not scripts.
And for your second problem, you should pass request_args to the get_result methods too.
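
For example, a minimal sketch, reusing the same request_args you already pass to build:

result1 = scraper.get_result(url, group_by_alias=True,
                             request_args=dict(proxies=proxies, verify=False, cookies=cookies1))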

@hellosunny commented Nov 20, 2020

It is one of the best web scrapers I have ever seen. Great job!

Actually, how can I handle empty cells in some rows? Please find the code below. I have found that one of the columns has empty cells, and it makes the scraper return different lengths of data.

from autoscraper import AutoScraper
url='http://www.etnet.com.hk/www/eng/stocks/sector_business.php?business=6'
wanted_dict={'code':['02616','03613'], 'currency':['HKD','HKD']}
scraper = AutoScraper()
scraper.build(url, wanted_dict=wanted_dict)
result=scraper.get_result_similar(url, group_by_alias=True)

print('length code={} currency={}'.format(len(result['code']), len(result['currency'])))

length code=27 currency=338

@alirezamika (owner) commented Nov 22, 2020

Actually, how can I handle empty cells in some rows? Please find the code below. I have found that one of the columns has empty cells, and it makes the scraper return different lengths of data.

The scraper omits empty results for now. Adding it as an option would be nice.
You may need to use the grouped option and fine-tune the rules to remove duplicate ones.

@hugocool commented Nov 22, 2020

Edit: I already found the answer in the code; the build method allows one to pass HTML. Sorry for wasting your time!

First let me say this is an amazing project!

We do a lot of scraping of dynamic sites; can I use this scraper with Selenium?

So instead of passing the URL of a website, I would like to pass the HTML content of the page after some interaction with the page's JavaScript.

Is this possible?

@jaff248 commented Nov 29, 2020

Is there an example of passing an HTML tag into the build method?

@alirezamika (owner) commented Nov 29, 2020

Is there an example of passing an HTML tag into the build method?

Just pass its value as regular text.

@j3vr0n commented Dec 7, 2020

Hi, cool work! I'm just wondering how this would scale. Let's say I want to collect the title and price from a list of Etsy URLs. Does the algo pick up on the first set of inputs in order to scale to future URLs?

For example, let's say I wanted to pull from Etsy using the following parameters:

('https://www.etsy.com/listing/851553172/apple-macbook-pro-i9-32gb-500gb-radeon', ['$1,500.00'])

Then, I want to capture future URLs from Etsy with a similar format. Would I need to pass in the exact inputs, or can the algo adapt to future URLs? I think I'm just a bit confused by the documentation about how this can scale to more URLs over time without needing exact matches for the inputs.

@alirezamika (owner) commented Dec 9, 2020
@alirezamika alirezamika commented Dec 9, 2020

Then, I want to capture future URLs from Etsy with a similar format. Would I need to pass in the exact inputs, or can the algo adapt to future URLs? I think I'm just a bit confused by the documentation about how this can scale to more URLs over time without needing exact matches for the inputs.

Hi! I'm not sure if I understood your question, but for each piece of training data, the scraper learns the structure of that page and of pages that are completely similar to it. So if your target website has multiple page formats, it's good to provide one sample for each page format. If the HTML attributes change in a minimal way between pages, you can also check the attr_fuzz_ratio parameter.

@mzakariaCERN commented Jan 3, 2021

Hello @alirezamika

Two points please:

1. I am trying to get the second table from this page:
https://acsjournals.onlinelibrary.wiley.com/doi/full/10.3322/caac.21590

from autoscraper import AutoScraper
url = 'https://acsjournals.onlinelibrary.wiley.com/doi/full/10.3322/caac.21590'

#wanted_list = ["NON‐HODGKIN LYMPHOMA"]
wanted_list = ['Estimated New Cases for Selected Cancers by State']
scraper = AutoScraper()
#scraper.build(url, wanted_dict=wanted_dict)
scraper.build(url, wanted_list)

result = scraper.get_result_similar(url, group_by_alias=False)

print(result)

gives me an empty list. Can you please help?

2. I notice that, for the same page, using requests gives a 403 error. I am curious to know why this is not the case with AutoScraper:
import requests
page = requests.get('https://acsjournals.onlinelibrary.wiley.com/doi/full/10.3322/caac.21590')
page
@slavahu commented Feb 13, 2021

Hello,
Thank you for the great work. I'm trying to process a seemingly very simple page, but it completely doesn't work.
It doesn't work the "simple way". It doesn't work with "requests". If I try to process a downloaded mhtml (but not html!) file,
I'm able to process the dates, but not the rates. Could you please look at this page? Thank you in advance.

import requests
from autoscraper import AutoScraper

url = 'https://www.federalreserve.gov/releases/h10/hist/dat00_eu.htm'
res = requests.get(url)
html = res.text

#fname="./The Fed - Foreign Exchange Rates - H.10 - February 08, 2021.mhtml"
#HtmlFile = open(fname, 'r')
#html = HtmlFile.read()

#print(html)

wanted_dict = {'Date': ["3-Jan-00"], 'Rate': ["1.0155"]}
#wanted_list = ["3-Jan-00"]
#wanted_list = ["1.0155"]

scraper = AutoScraper()
result = scraper.build(url, html=html, wanted_dict=wanted_dict)
#result = scraper.build(url, html=html, wanted_list=wanted_list)
print(result)
#result = scraper.get_result_similar(url, html=html, group_by_alias=True, keep_order=True, grouped=True)
#print(result)

@harishsg99 commented Mar 3, 2021

import requests
from autoscraper import AutoScraper
from requests_html import HTMLSession

scraper = AutoScraper()
s = requests.session()
session = HTMLSession()

url = 'https://www.flipkart.com/search?q=iphone'
response = session.get(url)
print(response.content)
html1 = response.text
s.get(url, headers=scraper.request_headers)

wanted_list = ['₹49,999','https//dl.flipkart.com/dl/home','Apple iPhone 11 (White, 64 GB)']
scraper.build(url, wanted_list, html=html1,request_args={'cookies': s.cookies.get_dict()})
scraper.get_result_similar(url,html=html1,grouped=True,request_args={'cookies': s.cookies.get_dict()})

How do I get the URL links from the website I am trying to scrape?

@OrkhanS commented Mar 19, 2021

Hi, after using build, will it automatically keep the rules? I tried to build with one product on Amazon and later tried get_results_exact(), but it doesn't work. Am I using it right? (Screenshots of the code and the empty result were attached but are not reproduced here.)

@rodrihazar commented Apr 3, 2021

Hi Mika!
How can I create a crawler to find and print all the URLs from the DOM?

Thank you!

@alirezamika (owner) commented Apr 6, 2021

How can I create a crawler to find and print all the URLs from the DOM?

Hi, you can use a regex to find URLs.
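
For example, a minimal sketch of pulling hrefs out of a fetched page with a plain regex, independent of AutoScraper (the URL here is illustrative):

import re
import requests

html = requests.get('https://www.puentenet.com/cotizaciones/accion/ALUA').text
# collect absolute links from the page source
urls = re.findall(r'href="(https?://[^"]+)"', html)
print(urls)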
