erud1te-sec/gist:53006068886fb35367deea6069a741e6

## gistfile1.txt
If you’re using a recent version of Debian or Ubuntu Linux, you can install Beautiful Soup with the system package manager:

$ apt-get install python-bs4

Beautiful Soup 4 is published through PyPI, so if you can’t install it with the system packager, you can install it with easy_install or pip. The package name is beautifulsoup4, and the same package works on Python 2 and Python 3.

$ easy_install beautifulsoup4

$ pip install beautifulsoup4

(The BeautifulSoup package is probably not what you want. That’s the previous major release, Beautiful Soup 3. Lots of software uses BS3, so it’s still available, but if you’re writing new code you should install beautifulsoup4.)

If you don’t have easy_install or pip installed, you can download the Beautiful Soup 4 source tarball and install it with setup.py.

$ python setup.py install

If all else fails, the license for Beautiful Soup allows you to package the entire library with your application. You can download the tarball, copy its bs4 directory into your application’s codebase, and use Beautiful Soup without installing it at all.

I use Python 2.7 and Python 3.2 to develop Beautiful Soup, but it should work with other recent versions.

Problems after installation
Beautiful Soup is packaged as Python 2 code. When you install it for use with Python 3, it’s automatically converted to Python 3 code. If you don’t install the package, the code won’t be converted. There have also been reports on Windows machines of the wrong version being installed.

If you get the ImportError “No module named HTMLParser”, your problem is that you’re running the Python 2 version of the code under Python 3.

If you get the ImportError “No module named html.parser”, your problem is that you’re running the Python 3 version of the code under Python 2.

In both cases, your best bet is to completely remove the Beautiful Soup installation from your system (including any directory created when you unzipped the tarball) and try the installation again.
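The two ImportErrors above come from the standard-library HTML parser living under a different module name in each major Python version. The following is a minimal diagnostic sketch, not part of Beautiful Soup; the helper names `expected_parser_module` and `can_import` are made up for illustration.

```python
import importlib
import sys


def expected_parser_module():
    """Name of the standard-library HTML parser module for this interpreter."""
    # Python 3 ships html.parser; Python 2 shipped HTMLParser.
    return "html.parser" if sys.version_info[0] >= 3 else "HTMLParser"


def can_import(name):
    """Return True if the named module imports cleanly, False on ImportError."""
    try:
        importlib.import_module(name)
    except ImportError:
        return False
    return True


name = expected_parser_module()
print("Python %d expects %s (importable: %s)"
      % (sys.version_info[0], name, can_import(name)))
```

If the module named in your ImportError is not the one printed here, the installed copy of Beautiful Soup was most likely built for the other major Python version.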

If you get the SyntaxError “Invalid syntax” on the line ROOT_TAG_NAME = u'[document]', you need to convert the Python 2 code to Python 3. You can do this either by installing the package:

$ python3 setup.py install

or by manually running Python’s 2to3 conversion script on the bs4 directory:

$ 2to3-3.2 -w bs4

Installing a parser
Beautiful Soup supports the HTML parser included in Python’s standard library, but it also supports a number of third-party Python parsers. One is the lxml parser. Depending on your setup, you might install lxml with one of these commands:

$ apt-get install python-lxml

$ easy_install lxml

$ pip install lxml

Another alternative is the pure-Python html5lib parser, which parses HTML the way a web browser does. Depending on your setup, you might install html5lib with one of these commands:

$ apt-get install python-html5lib

$ easy_install html5lib

$ pip install html5lib

This table summarizes the advantages and disadvantages of each parser library:

Parser	Typical usage	Advantages	Disadvantages
Python’s html.parser	BeautifulSoup(markup, "html.parser")	Batteries included; decent speed; lenient (as of Python 2.7.3 and 3.2.2)	Not very lenient (before Python 2.7.3 or 3.2.2)
lxml’s HTML parser	BeautifulSoup(markup, "lxml")	Very fast; lenient	External C dependency
lxml’s XML parser	BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml")	Very fast; the only currently supported XML parser	External C dependency
html5lib	BeautifulSoup(markup, "html5lib")	Extremely lenient; parses pages the same way a web browser does; creates valid HTML5	Very slow; external Python dependency

If you can, I recommend you install and use lxml for speed. If you’re using a version of Python 2 earlier than 2.7.3, or a version of Python 3 earlier than 3.2.2, it’s essential that you install lxml or html5lib, because Python’s built-in HTML parser is just not very good in those older versions.

Note that if a document is invalid, different parsers will generate different Beautiful Soup trees for it. See Differences between parsers for details.
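One way to act on the recommendation above is to prefer lxml when it is installed and fall back to the built-in parser otherwise. This is only a sketch under that assumption: `pick_parser` is a hypothetical helper, not a Beautiful Soup function, and it checks only whether the third-party module can be found, not whether it works.

```python
from importlib.util import find_spec


def pick_parser(preferred=("lxml", "html5lib")):
    """Return the first installed parser name, else the built-in "html.parser"."""
    for name in preferred:
        # find_spec returns None when no top-level module of that name exists.
        if find_spec(name) is not None:
            return name
    return "html.parser"


print(pick_parser())
```

The returned string can be passed as the second argument to the constructor, as in BeautifulSoup(markup, pick_parser()). Naming the parser explicitly also keeps the resulting tree the same across machines that have different parsers installed.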

Making the soup
To parse a document, pass it into the BeautifulSoup constructor. You can pass in a string or an open filehandle:

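from bs4 import BeautifulSoup

# Pass an open filehandle; the with block closes the file afterwards.
with open("index.html") as fp:
    soup = BeautifulSoup(fp)

# Or pass the markup as a string.
soup = BeautifulSoup("<html>data</html>")

First, the document is converted to Unicode, and HTML entities are converted to Unicode characters: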

BeautifulSoup("Sacr&eacute; bleu!")
# <html><head></head><body>Sacré bleu!</body></html>

Beautiful Soup then parses the document using the best available parser. It will use an HTML parser unless you specifically tell it to use an XML parser. (See Parsing XML.)
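As a small illustration of both input styles with the parser named explicitly, the sketch below assumes Beautiful Soup 4 is installed and uses the built-in "html.parser". The `io.StringIO` object stands in for an open file so the example does not depend on an index.html existing; the variable names are illustrative only.

```python
import io

from bs4 import BeautifulSoup

PARSER = "html.parser"  # built-in parser, chosen so the result is predictable

# Markup passed as a string.
from_string = BeautifulSoup("<html>data</html>", PARSER)

# Markup passed as a file-like object (anything with a read() method).
with io.StringIO("<p>Sacr&eacute; bleu!</p>") as handle:
    from_file = BeautifulSoup(handle, PARSER)

print(from_string.get_text())
print(from_file.p.string)  # the &eacute; entity is now a Unicode character
```

Different parsers may wrap the same fragment in different surrounding tags, so fixing the parser name is what makes the output reproducible.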

Kinds of objects
Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. But you’ll only ever have to deal with about four kinds of objects: Tag, NavigableString, BeautifulSoup, and Comment.
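To make the four kinds concrete, here is a minimal sketch that builds one tiny document and checks the type of each piece. It assumes Beautiful Soup 4 is installed and uses the built-in "html.parser"; the markup and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

markup = '<b class="boldest">Extremely bold<!-- a comment --></b>'
soup = BeautifulSoup(markup, "html.parser")

tag = soup.b                   # the <b> element
text, comment = tag.contents   # its two children: a string and a comment

for obj in (soup, tag, text, comment):
    print(type(obj).__name__)
```

Comment is a special kind of NavigableString, and the BeautifulSoup object itself behaves much like a Tag, which is why you rarely need to think about more than these four.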

Tag
A Tag object corresponds to an XML or HTML tag in the original document:

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>

Tags have a lot of attributes and methods, and I’ll cover most of them in Navigating the tree and Searching the tree. For now, the most important features of a tag are its name and attributes.

Name