Web Scraper Python LXML & XPATH
################################################################################################################
################################################################################################################
# purpose of this program is to scrape data from a page using lxml and XPath
# original tutorial found at http://docs.python-guide.org/en/latest/scenarios/scrape/
# here we go
### was told to install lxml using
### pip install lxml ###CONTINUE READING BEFORE INSTALLING
### pip install requests ###CONTINUE READING BEFORE INSTALLING
###YES
###got an error so had to install dev libraries
#sudo apt install libxml2-dev libxslt1-dev
###NO
###this install hung for a while while it built lxml from source
#sudo pip install lxml
# pip install requests
###x86_64-linux-gnu-gcc: internal compiler error: Killed (program cc1)
###http://stackoverflow.com/questions/24455238/lxml-installation-error-ubuntu-14-04-internal-compiler-error
###says the issue was solved by increasing RAM when building on a virtual machine (which I am); however, I don't want to resize the machine yet, so trying an alternate route first
###alternate solution says to use apt-get install python3-lxml, or for Python 2, apt-get install python-lxml
###YES
###Success with no issues executing this command
#apt-get install python3-lxml
###YES
###requirements already satisfied (already installed)
#pip install requests
#FAILURE
#the program still reports no lxml package (likely because python3-lxml installs for Python 3 only, while I was running the script with Python 2)
###OK, TRY THE ALTERNATE ROUTE: the internet says that if I add more RAM to the machine, pip install lxml should complete properly, let's try that
###STEPS
###Power off digital ocean droplet with command line 'poweroff'
###create snapshot of digital ocean droplet
###choose resize on side panel
###choose flexible resize, which preserves the disk size (increases RAM and CPUs ONLY), allowing the process to be reversed after the install
###Hit resize at bottom and wait
###power on VM
### login to shell
### I personally went to the folder where my projects are (probably makes no difference)
### then run command
# pip install lxml
###took a while to install but didn't give any errors
###one line says "building without Cython"; quick research shows Cython is only needed when building lxml from modified sources, not for a normal pip install
###after install run program
###worked successfully!!!! output values to terminal
###FINAL STEPS
### power off the server and resize the DigitalOcean droplet back down from the $80 plan to the $5 plan
###
### TO EXECUTE USE #> python filename.py   (python3 filename.py gives errors about no module named requests, presumably because requests was installed with the Python 2 pip only)
###
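###OPTIONAL SANITY CHECK (my own addition, not part of the original notes)
###confirm which interpreter actually has both modules by printing their versions first:
# python -c "import lxml.etree, requests; print(lxml.etree.LXML_VERSION, requests.__version__)"
###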
################################################################################################################
################################################################################################################
from lxml import html
import requests
#page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
#tree = html.fromstring(page.content)
#
##This will create a list of buyers:
#buyers = tree.xpath('//div[@title="buyer-name"]/text()')
##This will create a list of prices
#prices = tree.xpath('//span[@class="item-price"]/text()')
#
#print('Buyers: ', buyers)
#print('Prices: ', prices)
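# Offline check of the same selectors (my own addition, not from the tutorial):
# parse a small hard-coded HTML document instead of fetching a page, so the
# XPath syntax above can be tested without any network access.
sample = html.fromstring('<html><body><div title="buyer-name">Carson</div><span class="item-price">$29.95</span></body></html>')
print(sample.xpath('//div[@title="buyer-name"]/text()'))   # ['Carson']
print(sample.xpath('//span[@class="item-price"]/text()'))  # ['$29.95']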
####
page = requests.get('https://www.kickstarter.com/projects/2134501481/eclipse-playing-card-deck')
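# Optional guard (my own addition): stop early if the request failed, since
# html.fromstring would otherwise happily parse whatever error page came back.
if page.status_code != 200:
    raise SystemExit('Request failed with status %d' % page.status_code)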
tree = html.fromstring(page.content)
#This will create a list of every link (href attribute) on the page:
links = tree.xpath('//a/@href') #NEED TO LEARN XPATH SYNTAX
for link in links:
    print(link)
#print('links: ', links)
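################################################################################
# A few more XPath patterns for reference (my own additions; apart from <title>
# and <a>, nothing here is specific to the Kickstarter page, so treat these as
# syntax examples rather than guaranteed selectors):
page_title = tree.xpath('//title/text()')                              # text content of the <title> tag
absolute_links = tree.xpath('//a[starts-with(@href, "http")]/@href')   # only hrefs that are full URLs
print('Page title: %s' % page_title)
print('Absolute links found: %d' % len(absolute_links))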