@alirezamika
Last active March 19, 2024 15:33
AutoScraper Examples

Grouping results and removing unwanted ones

Here we want to scrape the product name, price, and rating from eBay product pages:

from autoscraper import AutoScraper

url = 'https://www.ebay.com/itm/Sony-PlayStation-4-PS4-Pro-1TB-4K-Console-Black/203084236670'

wanted_list = ['Sony PlayStation 4 PS4 Pro 1TB 4K Console - Black', 'US $349.99', '4.8']

scraper = AutoScraper()
scraper.build(url, wanted_list)

The items we want appear in multiple sections of the page, and the scraper tries to catch them all, so it may retrieve some extra information beyond what we have in mind. Let's run it on a different page:

scraper.get_result_exact('https://www.ebay.com/itm/Acer-Predator-Helios-300-15-6-144Hz-FHD-Laptop-i7-9750H-16GB-512GB-GTX-1660-Ti/114183725523') 

The result:

[
    "Acer Predator Helios 300 15.6'' 144Hz FHD Laptop i7-9750H 16GB 512GB GTX 1660 Ti",
    'ACER Predator Helios 300 i7-9750H 15.6" 144Hz FHD GTX 1660Ti 16GB 512GB SSD⚡RGB',
    'US $1,229.49',
    '5.0'
]

As we can see, we have one extra item here. We can run the get_result_exact or get_result_similar method with the grouped=True parameter, which groups the results by the scraping rule that produced them:

scraper.get_result_exact('https://www.ebay.com/itm/Acer-Predator-Helios-300-15-6-144Hz-FHD-Laptop-i7-9750H-16GB-512GB-GTX-1660-Ti/114183725523', grouped=True) 

Output:

{
    'rule_sks3': ["Acer Predator Helios 300 15.6'' 144Hz FHD Laptop i7-9750H 16GB 512GB GTX 1660 Ti"],
    'rule_d4n5': ['ACER Predator Helios 300 i7-9750H 15.6" 144Hz FHD GTX 1660Ti 16GB 512GB SSD⚡RGB'],
    'rule_fmrm': ['ACER Predator Helios 300 i7-9750H 15.6" 144Hz FHD GTX 1660Ti 16GB 512GB SSD⚡RGB'],
    'rule_2ydq': ['US $1,229.49'],
    'rule_buhw': ['5.0'],
    'rule_vpfp': ['5.0']
}

Now we can use the keep_rules or remove_rules methods to prune the unwanted rules:

scraper.keep_rules(['rule_sks3', 'rule_2ydq', 'rule_buhw'])
 
scraper.get_result_exact('https://www.ebay.com/itm/Acer-Predator-Helios-300-15-6-144Hz-FHD-Laptop-i7-9750H-16GB-512GB-GTX-1660-Ti/114183725523') 

And now the result contains only the items we want:

[
    "Acer Predator Helios 300 15.6'' 144Hz FHD Laptop i7-9750H 16GB 512GB GTX 1660 Ti",
    'US $1,229.49',
    '5.0'
]

Building a scraper to work with multiple websites with incremental learning

Suppose we want to make a price scraper that works with multiple websites. Here we consider ebay.com, walmart.com, and etsy.com. We create some sample data for each website and then feed it to the scraper. By passing the update=True parameter when calling the build method, all previously learned rules are kept and new rules are added to them:

from autoscraper import AutoScraper

data = [
   # some Ebay examples
   ('https://www.ebay.com/itm/Sony-PlayStation-4-PS4-Pro-1TB-4K-Console-Black/193632846009', ['US $349.99']),
   ('https://www.ebay.com/itm/Acer-Predator-Helios-300-15-6-FHD-Gaming-Laptop-i7-10750H-16GB-512GB-RTX-2060/303669272117', ['US $1,369.00']),
   ('https://www.ebay.com/itm/8-TAC-FORCE-SPRING-ASSISTED-FOLDING-STILETTO-TACTICAL-KNIFE-Blade-Pocket-Open/331625445801', ['US $8.95']),
   
   #some Walmart examples
   ('https://www.walmart.com/ip/8mm-Classic-Sterling-Silver-Plain-Wedding-Band-Ring/113651182', ['US $8.95']),
   ('https://www.walmart.com/ip/Apple-iPhone-11-64GB-Red-Fully-Unlocked-A-Grade-Refurbished/806414606', ['$659.99']),

   #some Etsy examples
   ('https://www.etsy.com/listing/805075149/starstruck-silk-face-mask-black-silk', ['$12.50+']),
   ('https://www.etsy.com/listing/851553172/apple-macbook-pro-i9-32gb-500gb-radeon', ['$1,500.00']),
]

scraper = AutoScraper()
for url, wanted_list in data:
   scraper.build(url=url, wanted_list=wanted_list, update=True)

Now hopefully the scraper has learned to scrape all 3 websites. Let's check some new pages:

>>> scraper.get_result_exact('https://www.ebay.com/itm/PUMA-Mens-Turino-Sneakers/274324387149')

['US $24.99', "PUMA Men's Turino Sneakers  | eBay"]


>>> scraper.get_result_exact('https://www.walmart.com/ip/Pack-of-8-Gerber-1st-Foods-Baby-Food-Peach-2-2-oz-Tubs/267133209')

['$8.71', '(Pack of 8) Gerber 1st Foods Baby Food, Peach, 2-2 oz Tubs - Walmart.com']


>>> scraper.get_result_exact('https://www.etsy.com/listing/863615551/matte-black-smart-wireless-bluetooth')

['$60.00']

Almost done! But there's some extra info; let's fix it:

>>> scraper.get_result_exact('https://www.walmart.com/ip/Pack-of-8-Gerber-1st-Foods-Baby-Food-Peach-2-2-oz-Tubs/267133209', grouped=True)

 {'rule_cqhs': [],
 'rule_h4sy': [],
 'rule_jqtb': [],
 'rule_r9qd': ['$8.71'],
 'rule_6lt7': ['$8.71'],
 'rule_2nrk': ['$8.71'],
 'rule_wy9j': ['$8.71'],
 'rule_v395': [],
 'rule_4ej6': ['(Pack of 8) Gerber 1st Foods Baby Food, Peach, 2-2 oz Tubs - Walmart.com']}


>>> scraper.remove_rules(['rule_4ej6'])
>>> scraper.get_result_exact('https://www.ebay.com/itm/PUMA-Mens-Turino-Sneakers/274324387149')

['US $24.99']


>>> scraper.get_result_exact('https://www.walmart.com/ip/Pack-of-8-Gerber-1st-Foods-Baby-Food-Peach-2-2-oz-Tubs/267133209')

['$8.71']


>>> scraper.get_result_exact('https://www.etsy.com/listing/863615551/matte-black-smart-wireless-bluetooth')

['$60.00']

Now we have a scraper that works with eBay, Walmart, and Etsy!

Fuzzy matching for HTML tag attributes

Some websites use different tag values for different pages (for example, different styles for the same element). In these cases you can adjust the attr_fuzz_ratio parameter when getting the results. See this issue for a sample usage.
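
As a rough illustration (the URL here is a placeholder and the exact ratio is just an example), the ratio is passed to the result methods:

# A minimal sketch, assuming the scraper has already been built as above.
# attr_fuzz_ratio takes a value between 0 and 1; values below 1.0 tolerate
# small differences in the HTML attribute values between pages.
result = scraper.get_result_similar(
    'https://www.ebay.com/itm/some-other-product-page',
    attr_fuzz_ratio=0.9,
)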

Using regular expressions

You can use regular expressions for wanted items:

wanted_list = [re.compile('Lorem ipsum.+est laborum')]
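
For instance, a minimal self-contained sketch (the URL and the pattern are placeholders) might look like this:

import re

from autoscraper import AutoScraper

# Hypothetical page and pattern: any wanted item may be a compiled regex
# instead of a literal string, and the scraper matches it against the
# page's text values.
url = 'https://example.com/some-article'
wanted_list = [re.compile(r'Lorem ipsum.+est laborum')]

scraper = AutoScraper()
result = scraper.build(url, wanted_list)
print(result)
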
@cpun94

cpun94 commented Nov 4, 2020

This is a great tool!

I'm trying to perform a scrape of the device brand, device name, monthly price (..../mo.), down payment ($0) for most, and savings (i.e., "Save up to $450") from the following website:

https://www.telus.com/en/mobility/phones

When I try to scrape, the rule IDs corresponding to the data change, so I am unable to save rules. I was wondering if I could get your help on this. My code is below.

from autoscraper import AutoScraper

url = 'https://www.telus.com/en/mobility/phones'
wanted_list = ['Apple', 'iPhone 12 Pro', '41', '.25', '0', 'Save up to $450']

scraper = AutoScraper()
scraper.build(url=url, wanted_list=wanted_list)

scraper.get_result_exact(url, grouped=True)

@alirezamika
Author

@cpun94 if you want to have the same IDs for each run, add this at the start of your code:

import random
random.seed(0)

I also suggest achieving this by using aliases instead of rules if you can.
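
A minimal sketch of the alias approach, reusing the sample values from the Telus question above (the alias names are arbitrary): passing a wanted_dict instead of a wanted_list attaches an alias to each learned rule, and group_by_alias=True returns the results keyed by those aliases.

from autoscraper import AutoScraper

url = 'https://www.telus.com/en/mobility/phones'

# Keys are aliases of our choosing; values are lists of sample texts from the page.
wanted_dict = {
    'brand': ['Apple'],
    'model': ['iPhone 12 Pro'],
    'savings': ['Save up to $450'],
}

scraper = AutoScraper()
scraper.build(url=url, wanted_dict=wanted_dict)

# Results come back grouped by alias instead of by auto-generated rule id.
result = scraper.get_result_exact(url, group_by_alias=True)
print(result)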

@cpun94

cpun94 commented Nov 4, 2020

@cpun94 if you want to have the same IDs for each run, add this at the start of your code:

import random
random.seed(0)

I also suggest achieving this by using aliases instead of rules if you can.

Sorry, I'm having a little bit of trouble understanding your suggestion at the very end. Is there an example you can refer me to, to help me better understand it?

@PaperChonChon

PaperChonChon commented Nov 14, 2020

How do I get variable values or raw HTML code instead?
For example:

<script type="text/javascript" language="JavaScript">
                                            var totalPrice = "\'$55.40\'";
                                            var totalQuantity = '7';
</script>

I want to capture 7 or '7', can the scraper do this?
I have tried wanted_list = ["xxx","totalQuantity = '7'"] but it did not work.

Also, using get_result_exact or get_result_similar, I always get empty results like ({'name': [], 'desc': [], 'article': [], 'photo': [], 'price': [], 'subtotal': []}, {'name': [], 'desc': [], 'article': [], 'photo': [], 'price': [], 'subtotal': []}) even though my build() works just fine.
My dictionary:

wanted_dict = {
    "name": ["a"],
    "desc": ["b"],
    "article": ["111"],
    "photo": ["c"],
    "desc": ["d"],
    "price": ["$1"],
    "subtotal": ["$2"],
}
result = scraper.build(url=url, wanted_dict=wanted_dict, update=True, request_args=dict(proxies=proxies, verify=False,cookies=cookies1))
result1 = scraper.get_result(url, group_by_alias=True) 
print(result1)

@alirezamika
Author

@PaperChonChon the scraper works with text and attribute values for now, not the scripts.
And for your second problem, you should pass request_args to the get_result methods too.
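
A minimal sketch of that fix, reusing the snippet above (scraper, url, wanted_dict, proxies, and cookies1 are assumed to be defined as in the original code):

# The same request_args used for build() also need to be passed when fetching
# results; otherwise the result pages are requested without the proxies/cookies.
request_args = dict(proxies=proxies, verify=False, cookies=cookies1)

result = scraper.build(url=url, wanted_dict=wanted_dict, update=True,
                       request_args=request_args)
result1 = scraper.get_result(url, group_by_alias=True, request_args=request_args)
print(result1)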

@hellosunny

It is one of the best web scrapers I have ever seen. Great job!

Actually, how can I handle empty cells in some rows? Please find the code below. I have found that one of the columns has empty cells, and it makes the scraper return different lengths of data.

from autoscraper import AutoScraper
url='http://www.etnet.com.hk/www/eng/stocks/sector_business.php?business=6'
wanted_dict={'code':['02616','03613'], 'currency':['HKD','HKD']}
scraper = AutoScraper()
scraper.build(url, wanted_dict=wanted_dict)
result=scraper.get_result_similar(url, group_by_alias=True)

print('lenght code={} currency={}'.format(len(result['code']), len(result['currency'])))

lenght code=27 currency=338

@alirezamika
Author

Actually, how can I handle empty cells in some rows? Please find the code below. I have found that one of the columns has empty cells, and it makes the scraper return different lengths of data.

The scraper omits the empty results for now. Adding them as an option would be nice.
You may need to use the grouped option and fine-tune the rules to remove the duplicate ones.

@hugocool

hugocool commented Nov 22, 2020

Edit: I already found the answer in the code, the build method allows one to pass HTML, sorry for wasting your time!

First let me say this is an amazing project!

We do a lot of scraping of dynamic sites, can I use this scraper with selenium?

So instead of passing a URL of a website, I would like to pass the HTML content of the page after some interaction with the page's javascript.

Is this possible?
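
Following the edit above, a minimal sketch of that approach (the URL and wanted text are placeholders, and the webdriver setup is environment-specific): render the page with Selenium, then hand the resulting HTML to build via its html parameter.

from autoscraper import AutoScraper
from selenium import webdriver

url = 'https://example.com/some-dynamic-page'

driver = webdriver.Chrome()
driver.get(url)
# ... interact with the page's javascript here if needed ...
html = driver.page_source
driver.quit()

scraper = AutoScraper()
wanted_list = ['Some text that only appears after the page has rendered']
result = scraper.build(url, wanted_list, html=html)
print(result)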

@jaff248

jaff248 commented Nov 29, 2020

Is there an example of passing an HTML tag into the build method?

@alirezamika
Author

Is there an example of passing an HTML tag into the build method?

Just pass its value as regular text.

@j3vr0n

j3vr0n commented Dec 7, 2020

Hi, cool work! I'm just wondering how this would scale. Let's say I want to collect the title and price from a list of Etsy urls. Does the algo pick up on the first set of inputs in order to scale to future urls?

For example, let's say I wanted to pull from Etsy using the following parameters:

('https://www.etsy.com/listing/851553172/apple-macbook-pro-i9-32gb-500gb-radeon', ['$1,500.00'])

Then, I want to capture future urls from etsy with a similar format. Would I need to pass in the exact inputs or can the algo adapt to future urls? I think I'm just a bit confused from the documentation for how this can scale to more urls over time without needing exact matches for the inputs.

@alirezamika
Author

Then, I want to capture future urls from etsy with a similar format. Would I need to pass in the exact inputs or can the algo adapt to future urls? I think I'm just a bit confused from the documentation for how this can scale to more urls over time without needing exact matches for the inputs.

Hi! I'm not sure if I understood your question, but for each training sample, the scraper learns the structure of that page and of the pages which are completely similar to it. So if your target website has multiple page formats, it's good to provide one sample for each page format. If the HTML attributes change in a minimal way between pages, you can also check the attr_fuzz_ratio parameter.

@mzakariaCERN

mzakariaCERN commented Jan 3, 2021

Hello @alirezamika

Two points please:

  1. I am trying to get the second table from this page:
    https://acsjournals.onlinelibrary.wiley.com/doi/full/10.3322/caac.21590
from autoscraper import AutoScraper
url = 'https://acsjournals.onlinelibrary.wiley.com/doi/full/10.3322/caac.21590'

#wanted_list = ["NON‐HODGKIN LYMPHOMA"]
wanted_list = ['Estimated New Cases for Selected Cancers by State']
scraper = AutoScraper()
#scraper.build(url, wanted_dict=wanted_dict)
scraper.build(url, wanted_list)

result = scraper.get_result_similar(url, group_by_alias=False)

print(result)

gives me an empty list. Can you please help?

  2. I noticed that, for the same page, using requests gives a 403 error. I am curious to know why this is not the case with AutoScraper:
import requests
page = requests.get('https://acsjournals.onlinelibrary.wiley.com/doi/full/10.3322/caac.21590')
page

@slavahu

slavahu commented Feb 13, 2021

Hello,
Thank you for the great work. I'm trying to process a seemingly very simple page, but it completely doesn't work.
It doesn't work the "simple way". It doesn't work with "requests". If I try to process a downloaded mhtml (but not html!) file,
I'm able to process the dates, but not the rates. Could you please look at this page? Thank you in advance.

import requests
from autoscraper import AutoScraper

url = 'https://www.federalreserve.gov/releases/h10/hist/dat00_eu.htm'
res = requests.get(url)
html = res.text

#fname="./The Fed - Foreign Exchange Rates - H.10 - February 08, 2021.mhtml"
#HtmlFile = open(fname, 'r')
#html = HtmlFile.read()

#print(html)

wanted_dict = {'Date': ["3-Jan-00"], 'Rate': ["1.0155"]}
#wanted_list = ["3-Jan-00"]
#wanted_list = ["1.0155"]

scraper = AutoScraper()
result = scraper.build(url, html=html, wanted_dict=wanted_dict)
#result = scraper.build(url, html=html, wanted_list=wanted_list)
print(result)
#result = scraper.get_result_similar(url, html=html, group_by_alias=True, keep_order=True, grouped=True)
#print(result)

@harishsg99

import requests
from autoscraper import AutoScraper
from requests_html import HTMLSession

scraper = AutoScraper()
s = requests.session()
session = HTMLSession()

url = 'https://www.flipkart.com/search?q=iphone'
response = session.get(url)
print(response.content)
html1 = response.text
s.get(url, headers=scraper.request_headers)

wanted_list = ['₹49,999','https//dl.flipkart.com/dl/home','Apple iPhone 11 (White, 64 GB)']
scraper.build(url, wanted_list, html=html1,request_args={'cookies': s.cookies.get_dict()})
scraper.get_result_similar(url,html=html1,grouped=True,request_args={'cookies': s.cookies.get_dict()})

How do I get the URL links from the website I am trying to scrape?

@OrkhanS

OrkhanS commented Mar 19, 2021

Hi, after using the build, will it automatically keep the rules? Like I tried to build with one product on Amazon and later tried with get_results_exact(), but it doesn't work. Am I using it right?

(two screenshots were attached here: the build code and its result)

@rodrigonzalezok

Hi Mika!
How can I create a crawler to find and print all the URLs from the DOM?

Thank you!

@alirezamika
Author

How can I create a crawler to find and print all the URLs from the DOM?

Hi, you can use a regex to find URLs.
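
As a rough sketch of that idea (the URL is taken from an earlier example, the pattern is a placeholder, and this assumes the links show up as text or attribute values the scraper can match), a compiled regex can be used as the wanted item, as in the regular-expressions example above:

import re

from autoscraper import AutoScraper

url = 'https://www.ebay.com/itm/PUMA-Mens-Turino-Sneakers/274324387149'

# Hypothetical pattern: anything that looks like an absolute http(s) URL.
wanted_list = [re.compile(r'https?://\S+')]

scraper = AutoScraper()
result = scraper.build(url, wanted_list)
print(result)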

@ianbracing90

Hello,

I can't seem to handle fetching information from the following website: https://www.bet365.com/#/AC/B2/C172/D101/E50379427/F2/P10/

I wish to pull all selections and their prices, if possible. Here's another format (link might have expired)
https://www.bet365.com/#/AC/B2/C101/D20210422/E20745422/F101650958/P10/

from autoscraper import AutoScraper

url = 'https://www.bet365.com/#/AC/B73/C104/D20210420/E20745288/F101613130/G1/H543/P10/'

wanted_dict = {'items': ['Shazeera','Southern Tales','Wrs Buster Brown','Jess No Foolin']}

scraper = AutoScraper()
scraper.build(url, wanted_dict=wanted_dict)
result = scraper.get_result_similar(url, group_by_alias=True)
print(result)

@rishabhjain6377

Hello,

I want to scrape all the comments on a Kaggle notebook. But when I ran the code below, it gave me an empty list.

url = "https://www.kaggle.com/jeongyoonlee/dae-with-2-lines-of-code-with-kaggler/comments"
wanted_list = ["RESPECT! :)"]
scraper = AutoScraper()
result = scraper.build(url,wanted_list)
print(result)

@Hanisonian

I would like to use AutoScraper to fetch numbers greater than zero on a website which displays numbers in a row, and then, after fetching those numbers, auto-click on the NEXT page and keep fetching numbers until the last page. Kindly help (I am using the main example code above).

@nileshchilka1

please add scrolling

@nixonthe

nixonthe commented Jan 12, 2022

Hi @alirezamika! I tried to scrape this page: https://u.gg/lol/top-lane-tier-list. But I got an empty list. What did I do wrong?

from autoscraper import AutoScraper

url = 'https://u.gg/lol/top-lane-tier-list'

wanted_list = ['Shen', '52.68%', '2.1%']

scraper = AutoScraper()

result = scraper.build(url, wanted_list=wanted_listl)

print(result)

@natzar

natzar commented Feb 18, 2022

What about pages that use JS to load content? Does it work too?

@anoduck

anoduck commented Jun 7, 2022

Hi @alirezamika! I tried to scrape this page: https://u.gg/lol/top-lane-tier-list. But I got an empty list. What did I do wrong?

from autoscraper import AutoScraper

url = 'https://u.gg/lol/top-lane-tier-list'

wanted_list = ['Shen', '52.68%', '2.1%']

scraper = AutoScraper()

result = scraper.build(url, wanted_list=wanted_listl)

print(result)

@nixonthe --> The page loads content via JavaScript, not plain HTML or PHP. Not only that, it loads the data from a different domain than the one hosting the page, which might have to do with XSS policy.

@Vponed

Vponed commented Jul 2, 2022

Thank you so much for the code. Please tell me, is it possible to use the model in other programming languages?
Or to somehow extract the data request itself?

@Abdul-Hannan96

How can we apply auto scraper on multiple pages?

@JettScythe

What about pages that use JS to load content? Does it work too?

@natzar Nope. You will need to use some kind of lib (like requests-html or selenium) to render the content and pass it to the builder.
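
A minimal sketch of that approach with requests-html (the URL and wanted text are placeholders; render() downloads a Chromium build on first use):

from autoscraper import AutoScraper
from requests_html import HTMLSession

url = 'https://example.com/some-js-heavy-page'

session = HTMLSession()
response = session.get(url)
response.html.render()  # execute the page's javascript
html = response.html.html  # the rendered html

scraper = AutoScraper()
wanted_list = ['Some text that only appears after rendering']
result = scraper.build(url, wanted_list, html=html)
print(result)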

@meirpertzz

meirpertzz commented Jan 28, 2023

Hi everyone, I have started using this tool and I love it!
I have a question though. I am using it to run over product pages (using wanted_dict key-values). On these product pages I sometimes don't have all the values; for example, not all the pages have both x and y, sometimes you just have x. Now I wonder, how can I teach the model about the pages with some empty values?

I will highly appreciate any suggestions

Thank you very much

@RishabAgr

The text_fuzz_ratio parameter in the build method seems to be causing errors in the code.
I wrote this code in order to extract items and it returns a list:

scraper = AutoScraper()
result = scraper.build(url=url, wanted_list=sample_item)

However, when I add the text_fuzz_ratio parameter to try to get a more general list:

scraper = AutoScraper()
result = scraper.build(url=url, wanted_list=sample_item, text_fuzz_ratio=0.9)

It returns a TypeError
(a screenshot of the traceback was attached here)

Thoughts?

@debrupf2946

Hi, can someone help with how to save the scraped data into a CSV file?

@rhythm-04

Hi @alirezamika, I want to know, can we send keys or click on any link in any website using autoscraper?

@furkannkilicc

Hi, when I tried to get the list of product names and prices there were 2 problems:
1 - a warning about bulk data => I solved it by zipping the lists in a for loop
2 - it does not return all the data, just 36 items (I don't know why, or how to solve it)
Could you please help me?

@akoredenitan

Hi @alirezamika ,

I tried the code here and maybe I am doing something wrong but couldn't get it to work as I had expected.

Firstly, I am relatively new to Web-scraping and saw this while working on another project.

I would like to fetch information from a table on a webpage. When I specify the model of the CPU in my wanted list, I get an empty array returned most of the time, with the exception of when I used the CPU name, which then returns only 2 results.

import requests
import auto_scraper
import autoscraper

from autoscraper import AutoScraper

cpuUrl = 'https://www.techpowerup.com/cpu-specs'
gpuUrl = "https://www.techpowerup.com/gpu-specs/?mobile=No&workstation=No&sort=name"

# We can add one or multiple candidates here.
# You can also put urls here to retrieve urls.
cpu_wanted_list = {"Manufacturer":["AMD","Intel"],"Release Date":['2020','2021','2022','2023']}
#gpu_wanted_list = ["AMD","Intel"]
gpu_wanted_list = ["Nvidia","Geforce","AMD", "Radeon"]

scraper = AutoScraper()
cpuResult = scraper.build(cpuUrl, cpu_wanted_list)
gpuResult = scraper.build(gpuUrl, gpu_wanted_list)

print("Printing CPU relsult:")
print(cpuResult)

relatedCPUResult = scraper.get_result_similar('https://www.techpowerup.com/cpu-specs/?mobile=No&server=No&sort=name')
print(relatedCPUResult)

I am going to https://www.techpowerup.com/cpu-specs/?mobile=No&server=No&sort=name and trying to fetch the table items as my result.

When I do get a result, it seems to be fetching values from the Refine Search parameters instead of the actual result:
Printing CPU relsult: ['Manufacturer', 'Release Date', 'Mobile', 'Server', 'TDP', 'Cores', 'Threads', 'Generation', 'Socket', 'Codename', 'Process', 'Multiplier Unlocked', 'Integrated Graphics'] ['AMD', 'Intel', '2022', 'No', '9 W', '2', 'Intel Pentium', 'AMD Socket 939', 'Alder Lake-S', '7 nm', 'Yes'].

I appended my code above in case I am missing something. Thanks in advance.

@anoduck

anoduck commented Oct 10, 2023

OK, I was hoping I could figure this out, but it is 7am and I have been up all night... so... burning spent fuel here.

Autoscraper saved me SO much time, it is ridiculous, but there is a little hitch. All of my data was returned in one huge list. Items are grouped together, but the number of items differs from type to type, and they are not matched with the associated data.

(example data generated from the Faker module)

In other words, the results look like:

['Tiffany Williams',
 'Kimberly Ramirez',
 'Marissa Wilson',
 'David King',
 'Jasmine Wilson',
 'Rebecca Swanson',
 ...
'511-538-9955x9371',
 '582-816-1125x878',
 '343.352.1379x820',
 '805.755.6352x44545',
 '001-541-393-9153x0600',
 '362-438-7059x3506',
 '802.484.3879x623',
 '+1-280-463-9311x9452',
 '+1-436-455-5647x6468',
...
'9385 Sean Courts Suite 395\nLopezborough, GU 46379',
 '338 Andrea Locks Suite 075\nMontgomerytown, OH 92933',
 '075 Barnett Walks\nNorth Tannerview, NH 64984',
 '00146 Newton Expressway\nSarahfort, MS 62136',
 '0111 Porter Curve Apt. 986\nFosterstad, KS 27560',
 '0668 Douglas Harbor\nWest Amyport, PW 83959',
 '317 Theresa Run\nNorth Angelafurt, MO 32844',
 '46803 Mueller Parks Suite 903\nPort Patrickmouth, MI 09350',
 '8474 Kimberly Point Suite 958\nPhamfort, MN 21067',
 '59573 William Light Suite 476\nSouth Dylan, DC 73663',
 'USCGC Wilson\nFPO AE 95307',
 '49219 Mcconnell Ranch\nNorth Robertport, UT 17995',
 '1369 Jeffrey Island\nCatherinemouth, MO 90968',
 '91483 Petersen Flats Apt. 265\nSilvaland, CO 46272']

Rather than:

                 Name                  Number                                            Address
0    Tiffany Williams       511-538-9955x9371  9385 Sean Courts Suite 395\nLopezborough, GU 4...
1    Kimberly Ramirez        582-816-1125x878  338 Andrea Locks Suite 075\nMontgomerytown, OH...
2      Marissa Wilson        343.352.1379x820      075 Barnett Walks\nNorth Tannerview, NH 64984
3          David King      805.755.6352x44545                                                Nan
4      Jasmine Wilson   001-541-393-9153x0600   0111 Porter Curve Apt. 986\nFosterstad, KS 27560
5     Rebecca Swanson       362-438-7059x3506        0668 Douglas Harbor\nWest Amyport, PW 83959
6    Christina Potter        802.484.3879x623        317 Theresa Run\nNorth Angelafurt, MO 32844
7         James Eaton    +1-280-463-9311x9452  46803 Mueller Parks Suite 903\nPort Patrickmou...
8      Laura Gonzalez    +1-436-455-5647x6468  8474 Kimberly Point Suite 958\nPhamfort, MN 21067
9     Rebecca Freeman       604.974.6647x2368  59573 William Light Suite 476\nSouth Dylan, DC...
10   Mr. William Lara     (848)389-3506x26756                                                Nan
11       Daniel Avila   001-490-540-8510x3636  49219 Mcconnell Ranch\nNorth Robertport, UT 17995
12       Andrew Price       886.582.9972x2800      1369 Jeffrey Island\nCatherinemouth, MO 90968
13       Joseph Smith         +1-388-226-7496  91483 Petersen Flats Apt. 265\nSilvaland, CO 4...
14     James Anderson        433-526-5687x642                                                Nan
15       Brandon Tate              6705266223  5467 Logan Terrace Apt. 127\nMichaelberg, PW 6...
16     Travis Wallace  001-767-613-1216x64547  99422 Justin Ramp Apt. 203\nNew Johnmouth, FL ...
17      Michelle Wong        001-249-343-4216  9324 Meghan Trail Apt. 103\nPhillipburgh, AK 4...
18          Jean Lowe              8703452366  0053 Dale Plains Suite 173\nEast Deniseburgh, ...
19     Melinda Tucker    001-649-372-1670x229          3281 Sarah Points\nPort Richard, PW 02531

@alirezamika
Author

(quoting @anoduck's comment above)

Please provide your code.

@anoduck

anoduck commented Oct 15, 2023

@alirezamika
Really just straightforward from the examples.

from autoscraper import AutoScraper

scraper = AutoScraper()

url = 'https://justia.com/lawyers/civil-rights/california/los-angeles'

wants = ['Jonathon Howard Kaplan', '(213) 553-4550', '355 S. Grand Ave. Suite 2450 Los Angeles, CA 90071', 'Civil Rights and Employment', 'Duke University School of Law and Duke University Law School']

results = scraper.build(url, wants)

As previously mentioned, this returns one long list, e.g. [[all names] + [all phone numbers] + [all addresses]]. It isn't such a big issue, because this list can be broken down using list.index() and list[x:], except that the length of each categorical list differs, e.g. len(name_list) = 39 and len(phone_list) = 27, etc. Thus, without knowing exactly what categorical item went with what name, reassembling the original dataset programmatically appears impossible.

I even attempted to break the scraping process down into individual parts, but as before, the categorical list lengths varied.

It was not until much later that I discovered the wanted_dict parameter. I was just unsure how to structure the dict in order for autoscraper to accept it.
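
For reference, a minimal sketch of the wanted_dict structure, reusing the sample values from the snippet above (the alias names are arbitrary): each key is an alias, each value is a list of example strings from the page, and group_by_alias=True keys the results by those aliases.

from autoscraper import AutoScraper

url = 'https://justia.com/lawyers/civil-rights/california/los-angeles'

# Aliases of our choosing mapped to sample values taken from the page.
wanted_dict = {
    'name': ['Jonathon Howard Kaplan'],
    'phone': ['(213) 553-4550'],
    'address': ['355 S. Grand Ave. Suite 2450 Los Angeles, CA 90071'],
}

scraper = AutoScraper()
scraper.build(url, wanted_dict=wanted_dict)
results = scraper.get_result_similar(url, group_by_alias=True)
print(results)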

@karrtikiyer-tw

Is it possible to define the crawl depth? For example, one URL might contain other URLs; if we give the parent one, can it crawl all the child ones along with the content present on the parent one?
