web scraping in php

Have you ever wanted to get specific data from another website, but there's no API available for it? That's where web scraping comes in: if the data is not made available through an API, we can just scrape it from the website itself.

But before we dive in let us first define what web scraping is. According to Wikipedia:

{% blockquote %} Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from websites. Usually, such software programs simulate human exploration of the World Wide Web by either implementing low-level Hypertext Transfer Protocol (HTTP), or embedding a fully-fledged web browser, such as Internet Explorer or Mozilla Firefox. {% endblockquote %}

So yes, web scraping lets us extract information from websites. But there are some legal issues regarding web scraping. Some consider it an act of trespassing on the website that you are scraping the data from. That's why it is wise to read the terms of service of the specific website that you want to scrape, because you might be doing something illegal without knowing it. You can read more about it in this Wikipedia page.

##Web Scraping Techniques

There are many techniques in web scraping as mentioned in the Wikipedia page earlier. But I will only discuss the following:

  • Document Parsing
  • Regular Expressions

###Document Parsing

Document parsing is the process of converting HTML into a DOM (Document Object Model) that we can traverse. Here's an example of how we can scrape data from a public website:

<?php
$html = file_get_contents('http://pokemondb.net/evolution'); //get the html returned from the following url

$pokemon_doc = new DOMDocument();

libxml_use_internal_errors(TRUE); //disable libxml errors

if(!empty($html)){ //if any html is actually returned

	$pokemon_doc->loadHTML($html);
	libxml_clear_errors(); //remove errors for yucky html
	
	$pokemon_xpath = new DOMXPath($pokemon_doc);

	//get all the h2's with an id
	$pokemon_row = $pokemon_xpath->query('//h2[@id]');

	if($pokemon_row->length > 0){
		foreach($pokemon_row as $row){
			echo $row->nodeValue . "<br/>";
		}
	}
}
?>

What we did with the code above was to get the html returned from the url of the website that we want to scrape. In this case the website is pokemondb.net.

<?php
$html = file_get_contents('http://pokemondb.net/evolution'); 
?>
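
Note that file_get_contents returns false when the request fails, so it can help to wrap the call. The following is only a minimal sketch and not part of the original script: the function name fetch_html, the User-Agent string and the 10 second timeout are all assumptions made for illustration.

```php
<?php
// Hypothetical helper: returns the response body as a string, or null when the fetch fails.
function fetch_html($url, $timeout_seconds = 10) {
    // Options under the 'http' key are used for http:// and https:// URLs.
    $context = stream_context_create(array(
        'http' => array(
            'method'  => 'GET',
            'timeout' => $timeout_seconds,
            'header'  => "User-Agent: my-scraper/0.1 (hypothetical)\r\n",
        ),
    ));

    // @ suppresses the warning that file_get_contents emits on failure;
    // we check the return value ourselves instead.
    $body = @file_get_contents($url, false, $context);

    return ($body === false || $body === '') ? null : $body;
}

// Usage against the real site (performs an HTTP request):
// $html = fetch_html('http://pokemondb.net/evolution');

// A path that does not exist shows the failure case without touching the network.
$missing = fetch_html(__DIR__ . '/no-such-file.html');
echo ($missing === null) ? "fetch failed\n" : "fetch ok\n";
```

Returning null for both a failed request and an empty body keeps the later `!empty($html)` style check simple, at the cost of not telling the two cases apart.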

Then we declare a new DOMDocument. This is used for converting the html string returned from file_get_contents into an actual Document Object Model which we can traverse:

<?php
$pokemon_doc = new DOMDocument();
?>

Then we tell libxml to handle its errors internally so that they won't be output on the screen. Instead they are buffered, and we can retrieve them later with libxml_get_errors() if we need to:

<?php
libxml_use_internal_errors(TRUE); //disable libxml errors
?>

Next we check whether any html has actually been returned:

<?php
if(!empty($html)){ //if any html is actually returned
}
?>

Next we use the loadHTML() method of the DOMDocument instance that we created earlier to load the html that was returned. Simply pass the html as the argument:

<?php
$pokemon_doc->loadHTML($html);
?>

Then we clear the errors, if any. Most of the time yucky html causes these errors. Examples of yucky html are unclosed or mismatched tags, invalid attributes and invalid elements. Elements and attributes are considered invalid if the parser does not recognize them; libxml's HTML parser predates HTML5, so newer elements such as section or nav can be reported as invalid even on an otherwise well-formed page.

<?php
libxml_clear_errors(); //remove errors for yucky html
?>
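
If you are curious about what libxml actually buffered, you can look at the errors before clearing them. This is a small sketch using a made-up snippet of html with a stray closing tag; the markup and variable names are assumptions for illustration, and the exact messages depend on the libxml version that is installed.

```php
<?php
// Hypothetical markup with a stray </div>, used only for illustration.
$sample_html = '<html><body><h2 id="gen1">Generation 1</h2><p>Some text</div></body></html>';

libxml_use_internal_errors(true); // buffer parse errors instead of emitting warnings

$doc = new DOMDocument();
$doc->loadHTML($sample_html);

// libxml_get_errors() returns an array of LibXMLError objects.
$errors = libxml_get_errors();
foreach ($errors as $error) {
    echo 'Line ' . $error->line . ': ' . trim($error->message) . "\n";
}

libxml_clear_errors(); // empty the buffer so old errors don't pile up
$remaining = count(libxml_get_errors());
echo "Errors left in the buffer: $remaining\n";
```

Even when errors are reported, loadHTML() usually still produces a usable document, which is why the script can simply clear them and move on.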

Next we declare a new instance of DOMXPath. This allows us to run queries against the DOM Document that we created. It requires an instance of the DOM Document as its argument.

<?php
$pokemon_xpath = new DOMXPath($pokemon_doc);
?>

Finally, we simply write the query for the specific elements that we want to get. If you have used jQuery before then this process is similar to what you do when you select elements from the DOM. What we're selecting here is all the h2 tags that have an id. We make the location of the h2 unspecific by using double slashes // right before the element that we want to select, so it matches no matter where in the document it sits. The value of the id also doesn't matter; as long as there's an id the element will get selected. The nodeValue property contains the text inside the h2 that was selected.

<?php
//get all the h2's with an id
$pokemon_row = $pokemon_xpath->query('//h2[@id]');

if($pokemon_row->length > 0){
	foreach($pokemon_row as $row){
		echo $row->nodeValue . "<br/>";
	}
}
?>
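
To see the query in isolation, here is a sketch that runs the same XPath expression against a small inline html string instead of the live site. The sample markup and the ids in it are made up for this example.

```php
<?php
// Hypothetical markup: two h2 elements with an id and one without.
$sample_html = '<html><body>'
    . '<h2 id="gen1">Generation 1</h2>'
    . '<h2>No id here</h2>'
    . '<div><h2 id="gen2">Generation 2</h2></div>'
    . '</body></html>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($sample_html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Collect the text of every h2 that has an id, keyed by that id.
$headings = array();
foreach ($xpath->query('//h2[@id]') as $h2) {
    $headings[$h2->getAttribute('id')] = trim($h2->nodeValue);
}

print_r($headings);
```

The h2 without an id is skipped, and the one nested inside the div is still found because of the leading double slash.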

This results in the following text printed out on the screen:

Generation 1 - Red, Blue, Yellow
Generation 2 - Gold, Silver, Crystal
Generation 3 - Ruby, Sapphire, Emerald
Generation 4 - Diamond, Pearl, Platinum
Generation 5 - Black, White, Black 2, White 2

Let's do one more example with document parsing before we move on to regular expressions. This time we're going to get a list of all the pokemon along with their specific types (e.g. Fire, Grass, Water).

First let's examine what we have on pokemondb.net/evolution so that we know what particular element to query.

[Screenshot: checking]

As you can see from the screenshot, the information that we want to get is contained within a span element with a class of infocard-tall . Yes, the space there is included. An XPath comparison such as @class="..." matches the attribute value exactly, so if the value in the page has a trailing space then the query has to include it too, otherwise it won't match.

Converting what we know into actual query, we come up with this:

//span[@class="infocard-tall "]

This selects all the span elements that have a class of infocard-tall . It doesn't matter where in the document the span is because we used the double forward slash before the actual element.
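
Because the comparison is exact, the query breaks as soon as the site drops the trailing space or adds a second class. A common XPath idiom for matching a single class name regardless of spacing is to pad the normalized attribute with spaces and use contains(). The sketch below compares the two approaches on made-up markup; the sample spans are assumptions for illustration.

```php
<?php
// Hypothetical markup: the same class written four different ways.
$sample_html = '<html><body>'
    . '<span class="infocard-tall ">A</span>'
    . '<span class="infocard-tall">B</span>'
    . '<span class="infocard-tall extra">C</span>'
    . '<span class="infocard-taller">D</span>'
    . '</body></html>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($sample_html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Exact string match on the whole attribute value, trailing space included.
$exact = $xpath->query('//span[@class="infocard-tall "]');

// Match "infocard-tall" as one whole class name among possibly several.
$by_class = $xpath->query(
    '//span[contains(concat(" ", normalize-space(@class), " "), " infocard-tall ")]'
);

$exact_text = array();
foreach ($exact as $span) {
    $exact_text[] = $span->nodeValue;
}

$by_class_text = array();
foreach ($by_class as $span) {
    $by_class_text[] = $span->nodeValue;
}

echo 'Exact match: ' . implode(', ', $exact_text) . "\n";
echo 'Class match: ' . implode(', ', $by_class_text) . "\n";
```

The padded contains() form is longer, but it keeps working for elements with multiple classes and does not pick up similar names such as infocard-taller.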

Once we're inside the span we have to get to the actual elements which directly contain the data that we want, namely the name and the types of the pokemon. As you can see from the screenshot below, the name of the pokemon is directly contained within an anchor element with a class of ent-name, and the types are stored within a small element with a class of aside.

[Screenshot: info card]

We can then use that knowledge to come up with the following code:

<?php
$pokemon_list = array();

$pokemon_and_type = $pokemon_xpath->query('//span[@class="infocard-tall "]');

if($pokemon_and_type->length > 0){	
	
	//loop through all the pokemons
	foreach($pokemon_and_type as $pat){
		
		//get the name of the pokemon
		$name = $pokemon_xpath->query('a[@class="ent-name"]', $pat)->item(0)->nodeValue;
		
		$pkmn_types = array(); //reset $pkmn_types for each pokemon
		$types = $pokemon_xpath->query('small[@class="aside"]/a', $pat);

		//loop through all the types and store them in the $pkmn_types array
		foreach($types as $type){
			$pkmn_types[] = $type->nodeValue; //the pokemon type
		}

		//store the data in the $pokemon_list array
		$pokemon_list[] = array('name' => $name, 'types' => $pkmn_types);
		
	}
}

//output what we have
echo "<pre>";
print_r($pokemon_list);
echo "</pre>";
?>

There's nothing new in the code that we have above except for using query inside the foreach loop. We use this particular line of code to get the name of the pokemon. You might notice that we specified a second argument when we used the query method. The second argument is the current row, and we use it as the context node of the query. This means that we're limiting the scope of the query to that of the current row.

<?php
$name = $pokemon_xpath->query('a[@class="ent-name"]', $pat)->item(0)->nodeValue;
?>
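
The context node only has an effect on relative paths. The following sketch, which uses made-up markup that mimics the structure described above, counts the anchors found by a relative path, a descendant path and an absolute path, all with the same context node. It also checks the result of item(0) before reading nodeValue, since item(0) returns null when nothing matches.

```php
<?php
// Hypothetical markup that mimics the info card structure described above.
$sample_html = '<html><body>'
    . '<span class="infocard-tall "><a class="ent-name">Bulbasaur</a>'
    . '<small class="aside"><a>Grass</a> <a>Poison</a></small></span>'
    . '<span class="infocard-tall "><a class="ent-name">Charmander</a>'
    . '<small class="aside"><a>Fire</a></small></span>'
    . '</body></html>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($sample_html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$first_card = $xpath->query('//span[@class="infocard-tall "]')->item(0);

// Relative paths are evaluated from the context node.
$child_links      = $xpath->query('a', $first_card)->length;    // direct children only
$descendant_links = $xpath->query('.//a', $first_card)->length; // anywhere inside the card

// A path starting with // is absolute, so the context node is ignored.
$all_links = $xpath->query('//a', $first_card)->length;

// Guard against a missing element before reading nodeValue.
$name_node = $xpath->query('a[@class="ent-name"]', $first_card)->item(0);
$name = $name_node ? $name_node->nodeValue : null;

echo "$name: $child_links child, $descendant_links inside the card, $all_links in the document\n";
```

This is why the queries inside the loop are written without a leading slash: with `//a` every iteration would return the anchors of the whole page instead of those of the current card.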

The results would be something like this:

Array
(
    [0] => Array
        (
            [name] => Bulbasaur
            [types] => Array
                (
                    [0] => Grass
                    [1] => Poison
                )
        )
    [1] => Array
        (
            [name] => Ivysaur
            [types] => Array
                (
                    [0] => Grass
                    [1] => Poison
                )
        )
    [2] => Array
        (
            [name] => Venusaur
            [types] => Array
                (
                    [0] => Grass
                    [1] => Poison
                )
        )

###Regular Expressions

##Web Scraping Tools

###Simple HTML Dom

To make web scraping easier you can use libraries such as Simple HTML DOM. Here's an example of getting the names of the pokemon using Simple HTML DOM:

<?php
include_once 'simple_html_dom.php'; //load the library, assuming the file sits next to this script

$html = file_get_html('http://pokemondb.net/evolution');

foreach($html->find('a[class=ent-name]') as $element){
	echo $element->innertext . '<br>'; //outputs bulbasaur, ivysaur, etc...
} 
?>

The syntax is simpler, so you have less code to write, and there are also some convenience functions and attributes which you can use. An example is the plaintext attribute which extracts all the text from a web page:

<?php
echo file_get_html('http://pokemondb.net/evolution')->plaintext; 
?>

###Ganon

##Scraping non-public parts of website

###Scraping Amazon

##Resources

@AlexCarlson AlexCarlson commented Jan 15, 2015

Nice work .. But what if i want to extract the data from two or more web pages ? .... at an instant of time....

@asadpk asadpk commented Feb 13, 2015

Hi,
I need these detail in xls sheet .can you modify this script? in a way that i can get data in xls format in same way bellow.

Item name ASIN By 1st price 2nd price 3rd price Category

@prasadmunna prasadmunna commented Feb 4, 2016

thanks for such a nice post..

@knaveenchand knaveenchand commented Apr 14, 2016

Very well written.

@TheKetan2 TheKetan2 commented May 31, 2016

suppose we have structure like : Main Link one->Sub Link->Sub Sub Link and we have to get info from all those links and come back to Main Link page and do the same thing with Main Link Two->Sub Link->Sub Sub Link ..how should we so that.

@saurabh-vijayvargiya saurabh-vijayvargiya commented Aug 2, 2016

awesome dude, nice explanation along with the code.

@bhawnam193 bhawnam193 commented Sep 5, 2016

very informative article but for the first example when i use different URL than the one listed it does not show anything and that url has h2 tags, tried changing url not for one but for umpteen number of URLs. Any idea why?

@verma-ashish verma-ashish commented Jan 2, 2017

At present, scraping the data is coming as plain text not with existing html tags. So, is it possible to scrap the data with all html tags as well e.g. <a>, <b>, <i>?

@sebastian2609 sebastian2609 commented Feb 17, 2017

I need to get content from table how can i do this...? please help me as soon as posible..

WINNING NUMBERS
2017-02-15 (Wed) 3712/17
1ST 2ND 3RD

9411

3367

9162

@EdwinChua EdwinChua commented Feb 23, 2017

Thanks! I managed to write my first web scraper for a local news website thanks to this article. :)

@verma-ashish have you tried query('//a'); ?

@sebastian2609 try query('//tr //td'); or something similar

@CCHFWBAN CCHFWBAN commented Mar 3, 2017

Instead of echo -ing the values how can you update them as different rows in a MySQL table?

@CCHFWBAN CCHFWBAN commented Mar 3, 2017

Or, is there a way to select one specific H2 from the array instead of just outputting all the H2's?

@Girish0406 Girish0406 commented Mar 21, 2017

that's a very nice explanation . but i want to know if i want to store data into mysql than how to do i m not getting can anyone help me out as soon as possible

@vishnu1991 vishnu1991 commented Apr 20, 2017

@CCHFWBAN
To output just one specific h2 value you can use its index. For example, if you want the third h2 then you can use

$scrap_row = $scrap_xpath->query('//h2');
echo "Get third H2 Value:" . $scrap_row->item(2)->nodeValue . "<br>";

@Whip Whip commented Jun 9, 2017

Any ideas about how to get contents of the page which requires login? Assuming you do have the login access for the website.

@manualvarado22 manualvarado22 commented Aug 2, 2017

Thank you so much! This is amazing.

@anamahmed2012 anamahmed2012 commented Aug 4, 2017

That's what I was looking for.

@imran300 imran300 commented Aug 15, 2017

What if I have a div class="description" and it contains a ul with 5 li tags?
Now I want to extract the data from these li tags, but this ul doesn't have a class or id and there are hundreds of li on a single web page, so how are we going to extract this information?

@norcaljohnny norcaljohnny commented Aug 25, 2017

VeeK727 I would assume just like any site you can use the l/p in the url itself to gain access.
Example.. http://username:password@www.example.com/

@norcaljohnny norcaljohnny commented Aug 25, 2017

@imran300 you can try using the simple_html_dom.php and then included it in the php file.
As such.

find('li') as $element) echo $element ; ?>

Yes, it is literally that easy and will scrape the full details for each 'li'

@norcaljohnny norcaljohnny commented Aug 25, 2017

looks like it got cropped. Testing in full once more. (removing opening and closing tags to post)

include_once 'simple_html_dom.php';
// Create DOM from URL
$html = file_get_html('https://www.example.com/');

// Find all li elements
foreach($html->find('li') as $element)
echo $element ;

@kasabesiddhi kasabesiddhi commented Sep 16, 2017

Awesome explanation

@hamzamumtaz007 hamzamumtaz007 commented Nov 22, 2017

I have an element with multiple classes how can i detect it using html document parsing?

@peterpilip peterpilip commented Jan 12, 2018

Great work bro..

@stephanoapiolaza stephanoapiolaza commented Jan 20, 2018

Nice Article

@EhabElzeny EhabElzeny commented Apr 17, 2018

very nice & i think there more ways for this thank you

@fahadhowlader fahadhowlader commented Sep 12, 2018

Please have a look my HTML likes
<li>১, ২, ৩, ৪। <br>1, 2, 3, 4 </br><span>It's test.</span></li>
But i need to scraping only ১, ২, ৩, ৪ and
not need <br>1, 2, 3, 4 </br><span>It's test.</span>
how can it possible ?

@nootype nootype commented Oct 27, 2020

wow cool written thank you very much. Can you tell me on what scrape data from website on this service https://finddatalab.com/how-to-scrape-data-from-a-website?? What language are they using?
