Web Scraper Python LXML & XPATH
################################################################################################################
################################################################################################################
# purpose of this program is to scrape data from a page using lxml and XPath
# original tutorial found at http://docs.python-guide.org/en/latest/scenarios/scrape/
# here we go
### was told to install lxml using
### pip install lxml ###CONTINUE READING BEFORE INSTALLING
### pip install requests ###CONTINUE READING BEFORE INSTALLING
###YES
###got an error so had to install the dev libraries
#sudo apt install libxml2-dev libxslt1-dev
###NO
###this install hung for a while while building lxml from source
#sudo pip install lxml
# pip install requests
###x86_64-linux-gnu-gcc: internal compiler error: Killed (program cc1)
###http://stackoverflow.com/questions/24455238/lxml-installation-error-ubuntu-14-04-internal-compiler-error
###says the issue was solved by increasing RAM when building on a virtual machine (I am on one); however, I don't want to resize the machine yet, so trying an alternate route
###the alternate solution says use apt-get install python3-lxml, or for Python 2, apt-get install python-lxml
###YES
###success with no issues executing this command
#apt-get install python3-lxml
###YES
###requirements already satisfied (already installed)
#pip install requests
#FAILURE
#no package lxml (presumably the apt package landed in a different Python than the one running the script)
###OK, TRY ALTERNATE ROUTE: the internet says that with more RAM, pip install lxml should complete properly, so let's try that
###STEPS
###power off the digital ocean droplet with the command 'poweroff'
###create a snapshot of the digital ocean droplet
###choose resize on the side panel
###choose the flexible resize, which preserves disk size (increases RAM and CPUs ONLY), allowing the process to be reversed after the install
###hit resize at the bottom and wait
###power on the VM
### log in to the shell
### I personally went to the folder where my projects are (probably makes no difference)
### then run the command
# pip install lxml
###took a while to install but didn't give any errors
###one line said "building without Cython"; a quick search suggests Cython is only needed when building lxml from an unreleased source checkout, not from a release
###after the install, run the program
###worked successfully!!!! output values to the terminal
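###(hypothetical quick sanity check, not part of the original steps: confirm the module imports and report its version)
#python -c "import lxml.etree; print(lxml.etree.LXML_VERSION)"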
###FINAL STEPS
### power off the server and resize at digital ocean back down to the $5 server from the $80 server
###
### TO EXECUTE USE #> python filename.py    (python3 filename.py gives errors about no module named requests)
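### (presumably requests was only installed for the Python 2 pip; running 'python3 -m pip install requests' would likely fix the python3 error, though this is an untested assumption)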
###
################################################################################################################
################################################################################################################
from lxml import html
import requests
#page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
#tree = html.fromstring(page.content)
#
##This will create a list of buyers:
#buyers = tree.xpath('//div[@title="buyer-name"]/text()')
##This will create a list of prices
#prices = tree.xpath('//span[@class="item-price"]/text()')
#
#print('Buyers: ', buyers)
#print('Prices: ', prices)
####
page = requests.get('https://www.kickstarter.com/projects/2134501481/eclipse-playing-card-deck')
tree = html.fromstring(page.content)
#This will create a list of all link targets (href attributes) on the page:
links = tree.xpath('//a/@href') #NEED TO LEARN XPATH SYNTAX (see quick reference below)
for link in links:
    print(link)
#print('links: ', links)
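#quick XPath reference, since the syntax still needs learning (a sketch; the
#selectors below are generic examples, not taken from the kickstarter page):
#  //a                          every <a> element anywhere in the document
#  //a/@href                    the href attribute of every <a>
#  //div[@class="x"]/text()     the text inside every <div class="x">
#  //ul/li[1]                   the first <li> child of each <ul>
#as a runnable example, grab the page <title> from the tree built above:
titles = tree.xpath('//title/text()')
print('title: ', titles)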