TED Scraper

Introduction

In this tutorial we'll learn, step by step, how to create a simple web scraper with Python 3. By the end you'll be able to get a list of TED Talks from a Python script with parameters. You'll be executing something like:

$ python tedscraper.py -s "Artificial Intelligence" --page 1 --results-per-page 5

1 - Playlist: Artificial intelligence (10 talks)
2 - Gil Weinberg: Can robots be creative?
3 - Peter Norvig | TED Speaker
4 - Dan Finkel: Can you solve the rogue AI riddle?
5 - Margaret Mitchell | TED Speaker

Index

  0. Set the environment
  1. Check what we want to scrape
  2. Check the libraries we need
  3. Get the data
  4. Parse the content
  5. Improve the execution interface

0. Set the environment

For this tutorial we're going to use PyCharm with Python 3 on Ubuntu 19.10.

The first thing we need to do is create our new project. If you just installed PyCharm or closed all your open projects, you should see this screen: Insert screenshot here

but if you're inside an open project, just go to File > New Project

  • Set the project name (I suggest tedscraper)
  • Unfold the Project Interpreter options and make sure New environment is selected and the base interpreter is python3
  • Click the Create button

Insert screenshot here

Now let's make sure the virtual environment was created successfully. Go to the Terminal tab at the bottom; you should see the venv prefix in the prompt. Insert screenshot here and highlight the venv prefix

Now we have PyCharm working with a virtualenv. For more information about virtual environments in Python, check HERE
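
If you'd rather work outside PyCharm, you can create an equivalent virtual environment from the terminal; a minimal sketch on Ubuntu:

$ python3 -m venv venv
$ source venv/bin/activate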

1. Check what we want to scrape

Now let's explore our target. In the browser, go to ted.com and run a search:

  • Go to the search bar at the top right
  • Copy the resulting URL. It should look something like this: https://www.ted.com/search?page=2&q=artificial+intelligence (we'll reuse those query parameters below) Insert screenshot here
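
Note how the search parameters (page and q) travel in the query string. Later we'll pass them to Requests as a dict instead of building the URL by hand; a quick illustrative sketch of that mapping:

import requests

# Requests encodes a params dict into the same query string the browser showed us
params = {'page': 2, 'q': 'artificial intelligence'}
prepared = requests.Request('GET', 'https://www.ted.com/search', params=params).prepare()
print(prepared.url)  # https://www.ted.com/search?page=2&q=artificial+intelligence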

Now activate the Developer Tools and inspect any article's title to find a unique identifier, usually in the class or the id. In this case we'll use the <article> element with the class m1 search__result, plus the <h3> and <a> elements inside it. Insert screenshot here
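
For reference, the relevant markup looks roughly like this (class names taken from the inspection above; the other attributes and values are illustrative, and TED may change the markup at any time):

<article class="m1 search__result">
  <h3 class="h7 m4">
    <a href="/talks/...">Can robots be creative?</a>
  </h3>
</article>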

2. Check the libraries we need

Requests, an HTTP library for making the GET request:

https://requests.readthedocs.io/en/master/

BeautifulSoup4, a parser for pulling data out of the returned HTML:

https://www.crummy.com/software/BeautifulSoup/bs4/doc/
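
A tiny, self-contained sketch of how the two fit together (the HTML string here is made up, mimicking the structure we found above):

from bs4 import BeautifulSoup

html = '<article class="m1 search__result"><h3 class="h7 m4"><a href="/talks/x">Example talk</a></h3></article>'
soup = BeautifulSoup(html, "html.parser")
# find() locates the first element matching the tag and class
print(soup.find('h3', {'class': 'h7 m4'}).a.text)  # Example talk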

3. Get the data

Let's install the packages we'll use: pip install -r requirements.txt
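
We haven't written requirements.txt yet; a minimal version, assuming only the two libraries above (add version pins if you want reproducible installs):

requests
beautifulsoup4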

Now, let's create our script (insert screenshot)

Let's run a quick test of the request

import requests

# Request the TED homepage and print the HTTP status code
url = "https://www.ted.com"
response = requests.get(url)
print(response.status_code)

Let's execute our script (insert screenshot). We'll see a 200 status code, which means success.
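
If you ever get something other than 200, it helps to fail loudly. One option is Requests' built-in check, which raises an exception on error responses:

response = requests.get(url)
response.raise_for_status()  # raises requests.exceptions.HTTPError on 4xx/5xx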

Now let's make the real request

import requests
from bs4 import BeautifulSoup
import argparse

# Command-line interface: search term, page number, and results per page
parser = argparse.ArgumentParser(description="Scrape TED Talks")
parser.add_argument('-s', '--search-term', required=True)
parser.add_argument('-p', '--page', type=int, default=1)
parser.add_argument('-rp', '--results-per-page', type=int, default=10)
args = parser.parse_args()

search_term = args.search_term
page_number = args.page
RESULTS_PER_PAGE = args.results_per_page

# TED's search endpoint and its query-string parameters
url = "https://www.ted.com/search"
params = {'page': page_number, 'per_page': RESULTS_PER_PAGE, 'q': search_term}


def scrape():
    # Fetch the search results page and parse it with BeautifulSoup
    response = requests.get(url, params=params)
    soup = BeautifulSoup(response.content, "html.parser")
    # Each result title lives in an <h3 class="h7 m4"> element
    return soup.find_all('h3', {'class': 'h7 m4'})


if __name__ == '__main__':
    print("Start")
    articles = scrape()
    # Print a numbered list of the result titles
    for idx, article in enumerate(articles, 1):
        article_title = article.a.text
        print(f"{idx} - {article_title}")

Insert commit here

At this point we've made the GET request, parsed the HTML, and printed each result's title. Next, let's pull out more of each element's contents.
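
A hedged sketch of that next step, reusing the url, params, and imports from the script above: pull both the title and the link out of each result (urljoin is from the standard library; the selector is the same one we found in the Developer Tools):

from urllib.parse import urljoin

def scrape_details():
    # Fetch and parse the search results page, as in scrape()
    response = requests.get(url, params=params)
    soup = BeautifulSoup(response.content, "html.parser")
    results = []
    for h3 in soup.find_all('h3', {'class': 'h7 m4'}):
        title = h3.a.text.strip()
        # hrefs are relative (/talks/...), so join them onto the site root
        link = urljoin("https://www.ted.com", h3.a.get('href', ''))
        results.append((title, link))
    return results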
