"""
Following Links in Python
In this assignment you will write a Python program that expands on
http://www.pythonlearn.com/code/urllinks.py. The program will use urllib to read the HTML from
the data files below, extract the href= values from the anchor tags, scan for a tag that is in a
particular position relative to the first name in the list, follow that link, repeat the process
a number of times, and report the last name you find.
We provide two files for this assignment. One is a sample file where we give you the name for your testing and
the other is the actual data you need to process for the assignment.
- Sample problem: Start at http://python-data.dr-chuck.net/known_by_Fikret.html
Find the link at position 3 (the first name is 1). Follow that link. Repeat this process 4 times. The answer
is the last name that you retrieve.
Sequence of names: Fikret Montgomery Mhairade Butchi Anayah
Last name in sequence: Anayah
- Actual problem: Start at: http://python-data.dr-chuck.net/known_by_Inaara.html
Find the link at position 18 (the first name is 1). Follow that link. Repeat this process 7 times. The
answer is the last name that you retrieve.
Hint: The first character of the name of the last page that you will load is: R
Strategy
The web pages tweak the height between the links and hide the page after a few seconds to make it difficult for
you to do the assignment without writing a Python program. But frankly with a little effort and patience you can
overcome these attempts to make it a little harder to complete the assignment without writing a Python
program. But that is not the point. The point is to write a clever Python program to solve the problem.
"""
import re, urllib
from BeautifulSoup import *  # Python 2, BeautifulSoup 3

all_links = []
all_names = []
url_first_part = 'http://python-data.dr-chuck.net/known_by_'
url_last_part = '.html'
first_entry = 'Inaara'
for i in range(7):  # repeat the process 7 times
    url = url_first_part + first_entry + url_last_part
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html)
    #def get_next_name(url)
    tags = soup('a')
    links = []
    for tag in tags:
        links.append(tag.get('href', None))
    url = links[17]  # position 18, counting from 1
    print url
    name = url[41:]  # drop url_first_part (41 characters)
    next_entry = name[:-5]  # drop '.html'
    all_names.append(next_entry)
    first_entry = next_entry
    url = url_first_part + first_entry + url_last_part
    all_links.append(url)
print all_names[-1]
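
For readers on Python 3, the loop described in the assignment text can be sketched without hard-coded slice offsets. This is a minimal sketch, not the gist author's code: it swaps BeautifulSoup for the standard library's `html.parser`, and `AnchorCollector`, `fetch_page`, `follow_links`, and the injected `fetch` parameter are hypothetical names chosen for illustration. Passing `fetch` in lets the loop be exercised without network access.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class AnchorCollector(HTMLParser):
    """Collect the href value of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def fetch_page(url):
    """Download a page and decode it as text (assumes UTF-8 content)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def follow_links(start_url, position, count, fetch=fetch_page):
    """Follow the link at 1-based `position`, `count` times.

    Returns the list of URLs visited after the start page.
    """
    visited = []
    url = start_url
    for _ in range(count):
        collector = AnchorCollector()
        collector.feed(fetch(url))
        url = collector.hrefs[position - 1]
        visited.append(url)
    return visited
```

Assuming the assignment pages are still reachable, `follow_links('http://python-data.dr-chuck.net/known_by_Inaara.html', 18, 7)[-1]` would be the URL of the last page, from which the final name can be read.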

fushuai1229 commented Sep 4, 2017

For Python 3, here is a new version of the code that works for this problem:

# To run this, you can install BeautifulSoup
# https://pypi.python.org/pypi/beautifulsoup4
# Or download the file
# http://www.py4e.com/code3/bs4.zip
# and unzip it in the same directory as this file

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
coun=input('Enter your count:')
pos=input('Enter your position:')
print(url)

for i in range(int(coun)):
    # Pass ctx so the SSL settings above actually take effect
    html = urllib.request.urlopen(url, context=ctx).read()
    soup = BeautifulSoup(html, "html.parser")
    tags = soup('a')
    url = tags[int(pos)-1].get('href', None)
    print(url)
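
The loop above prints full URLs, while the assignment asks for the last name in the sequence. One way to pull the name out of a URL, sketched here under the assumption that every link has the shape `.../known_by_<Name>.html` as in the assignment text; `name_from_url` is a hypothetical helper, not part of the code above.

```python
import posixpath
from urllib.parse import urlparse


def name_from_url(url, prefix="known_by_", suffix=".html"):
    """Return the name embedded in a .../known_by_<Name>.html URL."""
    # Take only the last path component, e.g. 'known_by_Anayah.html'
    filename = posixpath.basename(urlparse(url).path)
    if not (filename.startswith(prefix) and filename.endswith(suffix)):
        raise ValueError("unexpected URL shape: " + url)
    # Strip the fixed prefix and suffix, leaving just the name
    return filename[len(prefix):-len(suffix)]
```

Calling `name_from_url(url)` on the last URL printed by the loop would give the answer directly, without counting characters by hand.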


yts61 commented May 17, 2018

Would anyone help me with this question?

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = ('http://py4e-data.dr-chuck.net/known_by_Nabeel.html')
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')

post = int(input("Enter position: ")) -1 #The position of link relative to first link
count = int(input("Enter count: ")) #The number of times to be repeated

# Build a tag list
tags = soup('a')

# check the list
#print (tags)
#retrieve all the links and put into dictionary
for tag in tags:
    #retrieve the url every 18
    url = tag.get('href',None)
for i in range(count):
    ans=url[post]
    print (ans)

What is wrong with my code?


Kajol-Kumari commented Aug 19, 2019

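Your code prints a single character rather than a link: url holds one href string, so url[post] picks the character at that position in the string instead of the link at that position in tags. It also never fetches the next page, so the count loop repeats the same lookup.
Compare your code with the version below and you will see the difference: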
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter URL : ')
count = input('Enter count : ')
position = input('Enter position : ')

for i in range(int(count)):
    # Pass ctx so the SSL settings above actually take effect
    html = urllib.request.urlopen(url, context=ctx).read()
    soup = BeautifulSoup(html, 'html.parser')
    tags = soup('a')
    url = tags[int(position)-1].get('href', None)
    print(url)
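
The difference between indexing a link string and indexing the list of links can be seen in isolation, without any network access. The snippet below is only an illustration: the `hrefs` list is made-up stand-in data, and `wrong` and `right` are hypothetical names.

```python
# Stand-in data; these hrefs are made up for illustration.
hrefs = [
    "http://example.com/known_by_Ann.html",
    "http://example.com/known_by_Bob.html",
    "http://example.com/known_by_Cy.html",
]
position = 2  # 1-based, as in the assignment

# Bug pattern: a loop that keeps overwriting `url` leaves only the LAST
# link in it, so indexing `url` yields one character of that string.
url = None
for href in hrefs:
    url = href
wrong = url[position - 1]

# Intended behaviour: index the list of links, not one link's text.
right = hrefs[position - 1]

print(repr(wrong))
print(right)
```

Here `wrong` is the single character `'t'` (the second character of the last URL), while `right` is the second link in the list.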
