@rajasankar
rajasankar / createfinal.py
Created June 10, 2015 07:09
Convert the extracted text back into songs
co=[]
for e in range(0,len(E)):
    if E[e].find('\t')!=-1 and E[e].find('(')==-1:
        if E[e].find('.')!=-1:
            print ' '
            print E[e]
            print ''
            print ' '
            print E[e+2]
            print ''
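The preview above is cut off and has lost its indentation, so the following is only a sketch of the selection rule it appears to apply: keep a line that contains a tab and a period but no opening parenthesis, together with the line two positions below it. The function name `select_song_lines` is hypothetical, the nesting of the two `if` tests is assumed, and the bounds check on `i + 2` is an addition that the original does not have. It is written for Python 3, while the gist itself uses Python 2 `print` statements.

```python
def select_song_lines(lines):
    """Return (matched_line, line_two_below) pairs.

    A line matches when it contains a tab and a period but no '(',
    mirroring the three find() tests in the gist preview.
    """
    pairs = []
    for i, line in enumerate(lines):
        matches = "\t" in line and "(" not in line and "." in line
        # The gist reads E[e+2] unguarded; this check is an added safeguard.
        if matches and i + 2 < len(lines):
            pairs.append((line.rstrip("\n"), lines[i + 2].rstrip("\n")))
    return pairs


if __name__ == "__main__":
    sample = [
        "1.\tfirst song line\n",
        "\n",
        "second line\n",
        "(note)\tskipped. has a parenthesis\n",
    ]
    for matched, below in select_song_lines(sample):
        print(matched)
        print(below)
```

Returning pairs instead of printing keeps the rule easy to check on a few sample lines before running it over the whole extracted file.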
rajasankar / ExtractOnlySongs.py
Created June 10, 2015 07:06
This code extracts only the songs, not the explanations that come with them.
A=open('file.txt')
E=A.readlines()
A.close()
startcount= #put the linecount here
emptyline=0
state=True
value=True
linevalue=0
i=0
rajasankar / extracttext.py
Created June 10, 2015 07:02
This code extracts data from the HTML files based on class information.
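import urllib2
# Assumed import (not shown in the preview): BeautifulSoup 3.
# With bs4, use "from bs4 import BeautifulSoup" instead.
from BeautifulSoup import BeautifulSoup

inputfilename='file.html'
data=urllib2.urlopen(inputfilename)
soup = BeautifulSoup(data)
# Re-parse the prettified markup
data=soup.prettify()
soup = BeautifulSoup(data)
# Remove every element with class "pno"
ti=soup.findAll(attrs={'class':'pno'})
for t in ti:
    t.extract()
# Same for class "subhead" (the preview is cut off after this line)
ti=soup.findAll(attrs={'class':'subhead'})
for t in ti: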
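The snippet above removes elements by class with BeautifulSoup's `findAll` and `extract`. As a rough illustration of the same idea without third-party packages, here is a minimal Python 3 sketch built on the standard library's `html.parser`. This is a swapped-in technique, not the gist's method: it collects the remaining text instead of editing a parse tree, and the names `ClassStripper` and `text_without_classes` are hypothetical.

```python
from html.parser import HTMLParser

# Void elements have no end tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ClassStripper(HTMLParser):
    """Collect text, skipping elements whose class is in `drop`."""

    def __init__(self, drop):
        super().__init__(convert_charrefs=True)
        self.drop = set(drop)
        self.skip_depth = 0  # > 0 while inside a dropped element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        # Once inside a dropped element, count every nested tag as well.
        if self.skip_depth or self.drop.intersection(classes):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.parts.append(text)


def text_without_classes(html, drop):
    """Return the text chunks of `html` outside elements classed in `drop`."""
    parser = ClassStripper(drop)
    parser.feed(html)
    parser.close()
    return parser.parts


if __name__ == "__main__":
    page = ('<p class="pno">12</p><p>First line</p>'
            '<div class="subhead"><b>Note</b></div><p>Second line</p>')
    print(text_without_classes(page, ["pno", "subhead"]))
```

The depth counter assumes reasonably well-formed markup; for messy real-world pages, a tolerant parser such as BeautifulSoup, as used in the gist, is the safer choice.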
rajasankar / combine.py
Created June 10, 2015 06:55
Cleaning up downloaded files
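import urllib2
# Assumed import (not shown in the preview): BeautifulSoup 3.
from BeautifulSoup import BeautifulSoup

outfilename=<html file to write>
number_of_pages= #number of files downloaded from site
for i in range(1,number_of_pages):
    data=urllib2.urlopen(outfilename)
    soup = BeautifulSoup(data)
    # Re-parse the prettified markup
    data=soup.prettify()
    soup = BeautifulSoup(data)
    # Remove every element with class "link"
    ti=soup.findAll(attrs={'class':'link'})
    for t in ti:
        t.extract()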

Here is a list of headless browsers that I know about:

  • [HtmlUnit][1] - Java. Custom browser engine. JavaScript support/DOM emulated. Open source.
  • [Ghost][2] - Python only. WebKit-based. Full JavaScript support. Open source.
  • [Twill][3] - Python/command line. Custom browser engine. No JavaScript. Open source.
  • [PhantomJS][4] - Command line/all platforms. WebKit-based. Full JavaScript support. Open source.
  • [Awesomium][5] - C++/.Net/all platforms. Chromium-based. Full JavaScript support. Commercial/free.
  • [SimpleBrowser][6] - .Net 4/C#. Custom browser engine. No JavaScript support. Open source.
  • [ZombieJS][7] - Node.js. Custom browser engine. JavaScript support/emulated DOM. Open source.
  • [EnvJS][8] - JavaScript via Java/Rhino. Custom browser engine. JavaScript support/emulated DOM. Open source.