@suranands
Created October 2, 2016 17:12
"""
Following Links in Python
In this assignment you will write a Python program that expands on
http://www.pythonlearn.com/code/urllinks.py (http://www.pythonlearn.com/code/urllinks.py). The program will
use urllib to read the HTML from the data files below, extract the href= vaues from the anchor tags, scan for a
tag that is in a particular position relative to the first name in the list, follow that link and repeat the process a
number of times and report the last name you find.
We provide two files for this assignment. One is a sample file where we give you the name for your testing and
the other is the actual data you need to process for the assignment
- Sample problem: Start at http://python­data.dr­chuck.net/known_by_Fikret.html (http://python­data.dr­
chuck.net/known_by_Fikret.html)
Find the link at position 3 (the first name is 1). Follow that link. Repeat this process 4 times. The answer
is the last name that you retrieve.
Sequence of names: Fikret Montgomery Mhairade Butchi Anayah
Last name in sequence: Anayah
- Actual problem: Start at: http://python­data.dr­chuck.net/known_by_Inaara.html (http://python­data.dr­
chuck.net/known_by_Inaara.html)
Find the link at position 18 (the first name is 1). Follow that link. Repeat this process 7 times. The
answer is the last name that you retrieve.
Hint: The first character of the name of the last page that you will load is: R
Strategy
The web pages tweak the height between the links and hide the page after a few seconds to make it difficult for
you to do the assignment without writing a Python program. But frankly with a little effort and patience you can
overcome these attempts to make it a little harder to complete the assignment without writing a Python
program. But that is not the point. The point is to write a clever Python program to solve the program.
"""
import urllib
from BeautifulSoup import *

all_links = []
all_names = []
url_first_part = 'http://python-data.dr-chuck.net/known_by_'
url_last_part = '.html'
first_entry = 'Inaara'

for i in range(7):
    url = url_first_part + first_entry + url_last_part
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html)
    tags = soup('a')
    links = []
    for tag in tags:
        links.append(tag.get('href', None))
    url = links[17]           # position 18 (the first name is 1)
    print url
    name = url[41:]           # drop the fixed URL prefix
    next_entry = name[:-5]    # drop the trailing '.html'
    all_names.append(next_entry)
    first_entry = next_entry
    url = url_first_part + first_entry + url_last_part
    all_links.append(url)

print all_names[-1]
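The script above is Python 2 (print statements, `urllib.urlopen`, the legacy BeautifulSoup package). For readers on Python 3, here is a minimal sketch of the same loop. To stay dependency-free it parses anchors with the standard library's html.parser instead of BeautifulSoup; the `LinkCollector` and `nth_link` names are my own for this sketch, not part of the assignment.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.links.append(value)


def nth_link(html, position):
    """Return the href at 1-based `position` among the page's anchor tags."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links[position - 1]


if __name__ == '__main__':
    # Values from the "actual problem": start page, position 18, repeat 7 times.
    url = 'http://python-data.dr-chuck.net/known_by_Inaara.html'
    for _ in range(7):
        html = urlopen(url).read().decode()
        url = nth_link(html, 18)
        print(url)
```

The last URL printed contains the answer; the name can be sliced out of it the same way the Python 2 script does.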
@fushuai1229

For Python 3, here is new code that works for the problem.

To run this, you can install BeautifulSoup

https://pypi.python.org/pypi/beautifulsoup4

Or download the file

http://www.py4e.com/code3/bs4.zip

and unzip it in the same directory as this file

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
coun = input('Enter your count: ')
pos = input('Enter your position: ')
print(url)

for i in range(int(coun)):
    html = urllib.request.urlopen(url, context=ctx).read()
    soup = BeautifulSoup(html, "html.parser")
    tags = soup('a')
    url = tags[int(pos)-1].get('href', None)
    print(url)

@yts61

yts61 commented May 17, 2018

Would anyone help me with this question?

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = 'http://py4e-data.dr-chuck.net/known_by_Nabeel.html'
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')

post = int(input("Enter position: ")) - 1  # the position of the link relative to the first link
count = int(input("Enter count: "))        # the number of times to be repeated

# Build a tag list
tags = soup('a')

# Check the list
#print(tags)

# Retrieve all the links and put into dictionary
for tag in tags:
    # retrieve the url every 18
    url = tag.get('href', None)

for i in range(count):
    ans = url[post]
    print(ans)

what is wrong with my code?

@Kajol-Kumari

In your code you are getting the letter at the 18th position of the link, because you are indexing into the characters of a single URL string rather than into the list of tags.
Try the code below and then run your own code; you'll see the fault:
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter URL : ')
count = input('Enter count : ')
position = input('Enter position : ')

for i in range(int(count)):
    html = urllib.request.urlopen(url, context=ctx).read()
    soup = BeautifulSoup(html, 'html.parser')
    tags = soup('a')
    url = tags[int(position)-1].get('href', None)
    print(url)
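The bug described above comes down to what gets indexed: `tags[i]` selects the i-th anchor tag on the page, while `url[i]` selects the i-th character of one URL string. A quick illustration with plain Python values (the link names here are made up for the demo):

```python
links = ['known_by_Alice.html', 'known_by_Bob.html', 'known_by_Carol.html']
url = links[0]

# Indexing the LIST selects a whole link:
third_link = links[2]   # 'known_by_Carol.html'

# Indexing the STRING selects a single character:
char = url[2]           # 'o' -- the third character of 'known_by_Alice.html'

print(third_link)
print(char)
```

So `ans = url[post]` prints one letter of a URL, not the URL at that position; the fix is to index the list of tags instead.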

@vedant-milind

I'm still not getting the answer. What's the name?

@snehasrinija2000

I am getting the error that the BeautifulSoup module is not available, even though I have downloaded bs4.

@vedant-milind

Unzip the bs4 archive and copy the bs4 folder inside it into the same directory as your script. Then use the line

from bs4 import BeautifulSoup

It'll work .

@tejas22198

What is the name? I can't find it.

@tejas22198

"I'm still not getting the answer. What's the name?"

Did you get it?

@MohammedBasheerUddin

for i in range(int(coun)):
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")
    tags = soup('a')
    url = tags[int(pos)-1].get('href', None)
    print(url)

(First of all, great code, but can you explain why and how the variable "pos" is used? I cannot understand its use in the question either. It would be helpful if you explained this code. Thank you in advance.)
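To the question above: `pos` is the 1-based link position from the assignment text ("Find the link at position 18 (the first name is 1)"). Python lists count from 0, which is why the code indexes `tags[int(pos)-1]`. A small sketch with made-up hrefs standing in for `tag.get('href')` of each anchor on a page:

```python
# Hypothetical hrefs for illustration only.
hrefs = ['known_by_Fikret.html',
         'known_by_Montgomery.html',
         'known_by_Mhairade.html']

pos = 3                   # "position 3" in the assignment's 1-based counting
chosen = hrefs[pos - 1]   # convert to Python's 0-based indexing
print(chosen)             # known_by_Mhairade.html
```

Following `chosen` and repeating the lookup is exactly what the loop does `count` times.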

@snehasrinija2000

Thanks everyone for the support... The mistake was that I didn't unzip bs4... Thanks and sorry for the inconvenience

@BuvanasriAK

BuvanasriAK commented Jun 23, 2020

from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter URL: ')
c = input('Enter count: ')
pos = input('Enter position: ')
print(url)

for i in range(int(c)):
    html = urlopen(url, context=ctx).read()
    soup = BeautifulSoup(html, "html.parser")
    tags = soup('a')
    url = tags[int(pos)-1].get('href', None)
    print(url)

# You get a URL and print it. Then you look for the anchor tag at index pos - 1
# and read its href attribute, which is the next URL. You open that URL, print
# it, and repeat, count times in total. The last URL printed is the answer.
