@code-boxx
Last active November 8, 2023 03:31
Python Web Scraper

PYTHON WEB SCRAPER EXAMPLE

https://code-boxx.com/python-web-scraper/

NOTES

  1. Run unpack.bat (Windows) or unpack.sh (Linux/Mac). This will automatically:
    • Create 2 folders - static and templates.
    • Move S1_dummy.html into templates.
    • Save the sample image below as static/sandwich.png
    • Create a virtual environment - virtualenv venv
    • Activate the virtual environment - venv\Scripts\activate (Windows) or source venv/bin/activate (Linux/Mac)
    • Install the required modules - pip install flask requests beautifulsoup4
    • Start the dummy HTTP server - python S1_http.py
  2. Make sure the dummy HTTP server is still running, then access http://localhost in your web browser.
  3. Open another command line terminal.
    • Activate the virtual environment - venv\Scripts\activate (Windows) or source venv/bin/activate (Linux/Mac)
    • Run python S2_scrape.py. It should extract the product name, description, price, and image path from the dummy HTML page.
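
Step 2 asks you to confirm that the dummy server is up before scraping. As an optional aid, here is a minimal sketch that checks the same thing from Python using only the standard library. The helper name `is_port_open` is hypothetical and not part of this gist; the host and port are taken from `app.run("localhost", 80)` in S1_http.py.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (refused, unreachable, timed out) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Assumption: the dummy server listens on localhost:80, as in S1_http.py
    if is_port_open("localhost", 80):
        print("Dummy server is reachable, S2_scrape.py can be run")
    else:
        print("Nothing is listening on localhost:80, start S1_http.py first")
```

This only confirms that something accepts connections on the port, not that it is the Flask app, so treat it as a quick sanity check rather than a health check.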

IMAGES

sandwich

LICENSE

Copyright by Code Boxx

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

<!DOCTYPE html>
<html>
<head>
<title>Dummy Product Page</title>
<meta charset="utf-8">
<style>
* {
font-family: Arial, Helvetica, sans-serif;
box-sizing: border-box;
}
#product {
width: 300px;
padding: 15px;
border: 1px solid #eee;
background: #f5f5f5;
}
#pImg {
display: block;
width: 100%;
margin-bottom: 15px;
}
#pName {
font-size: 18px;
font-weight: 700;
}
#pPrice {
font-weight: 700;
color: #f34242;
}
#pDesc {
font-size: 14px;
color: #5c5c5c;
}
#pAdd {
width: 100%;
margin-top: 10px;
border: 0;
padding: 10px;
color: #fff;
background: #b72b2b;
cursor: pointer;
}
</style>
</head>
<body>
<div id="product">
<img src="static/sandwich.png" id="pImg">
<div id="pName">Sandwich</div>
<div id="pPrice">$12.34</div>
<div id="pDesc">Just a regular SAND wich.</div>
<input type="button" value="Add To Cart" id="pAdd">
</div>
</body>
</html>
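
The ids in the page above (pName, pPrice, pDesc, pImg) are what the scraper keys on. S2_scrape.py uses BeautifulSoup for this. As a dependency-free illustration of the same idea, the sketch below swaps in the standard library's `html.parser` instead. The class `IdTextCollector`, the function `extract`, and the trimmed `SAMPLE` markup are hypothetical names introduced here, not part of the gist.

```python
from html.parser import HTMLParser

# Assumption: a trimmed copy of the product markup from S1_dummy.html
SAMPLE = """
<div id="product">
<img src="static/sandwich.png" id="pImg">
<div id="pName">Sandwich</div>
<div id="pPrice">$12.34</div>
<div id="pDesc">Just a regular SAND wich.</div>
</div>
"""


class IdTextCollector(HTMLParser):
    """Collect text of elements with the wanted ids, or src for <img> tags."""

    def __init__(self, wanted_ids):
        super().__init__()
        self.wanted = set(wanted_ids)
        self.found = {}
        self._current = None  # id of the element whose text is being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        elem_id = attrs.get("id")
        if elem_id not in self.wanted:
            return
        if tag == "img":
            # Images carry their value in the src attribute, not in text
            self.found[elem_id] = attrs.get("src")
        else:
            self._current = elem_id
            self.found[elem_id] = ""

    def handle_data(self, data):
        if self._current is not None:
            self.found[self._current] += data

    def handle_endtag(self, tag):
        # Sketch only: assumes the wanted elements contain no nested tags
        self._current = None


def extract(html, ids=("pName", "pPrice", "pDesc", "pImg")):
    """Return a dict mapping each wanted id found in html to its text or src."""
    parser = IdTextCollector(ids)
    parser.feed(html)
    parser.close()
    return parser.found


if __name__ == "__main__":
    print(extract(SAMPLE))
```

For anything beyond a flat page like this one, BeautifulSoup (as used in S2_scrape.py) is likely the more practical choice, since it handles nesting and malformed markup for you.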
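# (A) LOAD MODULES
from flask import Flask, render_template
# (B) FLASK
app = Flask(__name__)
@app.route("/")
def index():
    # Serve templates/S1_dummy.html as the dummy product page
    return render_template("S1_dummy.html")
# (C) RUN!
# Note: binding to port 80 may need elevated privileges on some systems
if __name__ == "__main__":
    app.run("localhost", 80)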
# (A) LOAD REQUIRED MODULES
import requests
from bs4 import BeautifulSoup
# (B) GET HTML
html = requests.get("http://localhost").text
# print(html)
# (C) HTML PARSER
soup = BeautifulSoup(html, "html.parser")
name = soup.find("div", {"id": "pName"}).text
desc = soup.find("div", {"id": "pDesc"}).text
price = soup.find("div", {"id": "pPrice"}).text
image = soup.find("img", {"id": "pImg"})["src"]
print(name)
print(desc)
print(price)
print(image)
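
The scraper above returns the price as text ("$12.34"). If you want to do arithmetic on it, one possible follow-up step is to convert it to a number. The sketch below is an assumption about how that could be done, not part of the original script; `parse_price` is a hypothetical helper, and it assumes a dot as the decimal separator.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional


def parse_price(text: str) -> Optional[Decimal]:
    """Convert scraped price text such as "$12.34" to a Decimal, or None if unparseable."""
    # Keep ASCII digits and the decimal point; drop currency symbols, commas, spaces
    cleaned = "".join(ch for ch in text if ch in "0123456789.")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        # Raised for an empty string or malformed numbers such as "1.2.3"
        return None


if __name__ == "__main__":
    print(parse_price("$12.34"))  # prints 12.34
```

Decimal is used here instead of float so that currency values are not subject to binary rounding.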
md templates
md static
move S1_dummy.html templates
curl https://user-images.githubusercontent.com/11156244/281251426-1737b31d-d813-4913-aa99-0d6cf490e757.png --ssl-no-revoke --output static/sandwich.png
virtualenv venv
call venv\Scripts\activate
pip install flask requests beautifulsoup4
python S1_http.py
mkdir -m 777 templates
mkdir -m 777 static
mv ./S1_dummy.html ./templates
curl https://user-images.githubusercontent.com/11156244/281251426-1737b31d-d813-4913-aa99-0d6cf490e757.png --ssl-no-revoke --output ./static/sandwich.png
virtualenv venv
source "venv/bin/activate"
pip install flask requests beautifulsoup4
python S1_http.py