TED Scraper

Introduction

In this tutorial we'll learn, step by step, how to create a simple web scraper with Python 3. By the end you'll be able to get a list of TED Talks from a Python script with parameters. You'll be executing something like:

$ python tedscraper.py -s "Artificial Intelligence" --page 1 --results-per-page 5

1 - Playlist: Artificial intelligence (10 talks)
2 - Gil Weinberg: Can robots be creative?
3 - Peter Norvig | TED Speaker
4 - Dan Finkel: Can you solve the rogue AI riddle?
5 - Margaret Mitchell | TED Speaker

Index

  0. Set the environment
  1. Check what we want to scrape
  2. Check the libraries we need
  3. Get the data
  4. Parse the content
  5. Improve the execution interface

0. Set the environment

For this tutorial we're going to use PyCharm with Python 3 on Ubuntu 19.10.

The first thing we need to do is create our new project. If you just installed PyCharm or closed all your open projects, you should see this screen: Insert screenshot here

but if you're inside an open project, just go to File > New Project

  • Set the project name (I suggest tedscraper)
  • Unfold the Project Interpreter options and make sure New environment is selected and the base interpreter is python3
  • Click the Create button

Insert screenshot here

Now let's make sure the virtual environment was created successfully. Go to the Terminal tab at the bottom; you should see the venv prefix in the prompt. Insert screenshot here and highlight the venv prefix

Now we have PyCharm working with a virtualenv. For more information about virtual environments in Python, check HERE
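
If you'd rather work outside PyCharm, you can create an equivalent virtual environment from the terminal; a minimal sketch on Ubuntu:

$ python3 -m venv venv
$ source venv/bin/activate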

1. Check what we want to scrape

Now let's explore our target. In the browser, go to ted.com and run a search:

  • Go to the search bar at the top right
  • Copy the resulting URL. It should look something like this: https://www.ted.com/search?page=2&q=artificial+intelligence (we'll reuse those query parameters below) Insert screenshot here
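
Note how the search parameters (page and q) travel in the query string. Later we'll pass them to Requests as a dict instead of building the URL by hand; a quick illustrative sketch of that mapping:

import requests

# Requests encodes a params dict into the same query string the browser showed us
params = {'page': 2, 'q': 'artificial intelligence'}
prepared = requests.Request('GET', 'https://www.ted.com/search', params=params).prepare()
print(prepared.url)  # https://www.ted.com/search?page=2&q=artificial+intelligence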

Now activate the Developer Tools and inspect any article's title to find a unique identifier, usually in the class or the id. In this case we'll use the <article> element with the class m1 search__result, plus the <h3> and <a> elements inside it. Insert screenshot here
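
For reference, the relevant markup looks roughly like this (class names taken from the inspection above; the other attributes and values are illustrative, and TED may change the markup at any time):

<article class="m1 search__result">
  <h3 class="h7 m4">
    <a href="/talks/...">Can robots be creative?</a>
  </h3>
</article>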

2. Check the libraries we need

Requests, an HTTP library for making the GET request:

https://requests.readthedocs.io/en/master/

BeautifulSoup4, a parser for pulling data out of the returned HTML:

https://www.crummy.com/software/BeautifulSoup/bs4/doc/
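
A tiny, self-contained sketch of how the two fit together (the HTML string here is made up, mimicking the structure we found above):

from bs4 import BeautifulSoup

html = '<article class="m1 search__result"><h3 class="h7 m4"><a href="/talks/x">Example talk</a></h3></article>'
soup = BeautifulSoup(html, "html.parser")
# find() locates the first element matching the tag and class
print(soup.find('h3', {'class': 'h7 m4'}).a.text)  # Example talk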

3. Get the data

Let's install the packages we'll use: pip install -r requirements.txt
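
We haven't written requirements.txt yet; a minimal version, assuming only the two libraries above (add version pins if you want reproducible installs):

requests
beautifulsoup4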

Now, let's create our script (insert screenshot)

Let's run a quick test of the request

import requests

# Request the TED homepage and print the HTTP status code
url = "https://www.ted.com"
response = requests.get(url)
print(response.status_code)

Let's execute our script (insert screenshot). We'll see a 200 status code, which means success.
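
If you ever get something other than 200, it helps to fail loudly. One option is Requests' built-in check, which raises an exception on error responses:

response = requests.get(url)
response.raise_for_status()  # raises requests.exceptions.HTTPError on 4xx/5xx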

Now let's make the real request

import requests
from bs4 import BeautifulSoup
import argparse

# Command-line interface: search term, page number, and results per page
parser = argparse.ArgumentParser(description="Scrape TED Talks")
parser.add_argument('-s', '--search-term', required=True)
parser.add_argument('-p', '--page', type=int, default=1)
parser.add_argument('-rp', '--results-per-page', type=int, default=10)
args = parser.parse_args()

search_term = args.search_term
page_number = args.page
RESULTS_PER_PAGE = args.results_per_page

# TED's search endpoint and its query-string parameters
url = "https://www.ted.com/search"
params = {'page': page_number, 'per_page': RESULTS_PER_PAGE, 'q': search_term}


def scrape():
    # Fetch the search results page and parse it with BeautifulSoup
    response = requests.get(url, params=params)
    soup = BeautifulSoup(response.content, "html.parser")
    # Each result title lives in an <h3 class="h7 m4"> element
    return soup.find_all('h3', {'class': 'h7 m4'})


if __name__ == '__main__':
    print("Start")
    articles = scrape()
    # Print a numbered list of the result titles
    for idx, article in enumerate(articles, 1):
        article_title = article.a.text
        print(f"{idx} - {article_title}")

Insert commit here

At this point we've made the GET request, parsed the HTML, and printed each result's title. Next, let's pull out more of each element's contents.
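
A hedged sketch of that next step, reusing the url, params, and imports from the script above: pull both the title and the link out of each result (urljoin is from the standard library; the selector is the same one we found in the Developer Tools):

from urllib.parse import urljoin

def scrape_details():
    # Fetch and parse the search results page, as in scrape()
    response = requests.get(url, params=params)
    soup = BeautifulSoup(response.content, "html.parser")
    results = []
    for h3 in soup.find_all('h3', {'class': 'h7 m4'}):
        title = h3.a.text.strip()
        # hrefs are relative (/talks/...), so join them onto the site root
        link = urljoin("https://www.ted.com", h3.a.get('href', ''))
        results.append((title, link))
    return results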
