@tananin
Last active October 16, 2021 10:34

Parsing a website with requests, BeautifulSoup, and csv

Install the required libraries

pip install requests 
pip install beautifulsoup4 
pip install lxml

Import the libraries

import requests
from bs4 import BeautifulSoup
import csv 

Define the variables the script needs

HOST = 'http://avrora-arm.ru/'
URL = 'http://avrora-arm.ru/index/sitemenu/14790'
HEADERS = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36'
}
FILE = 'links.csv'
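
Links and image paths scraped from a page are often relative, so a base address such as HOST or URL is needed to turn them into full URLs. The sketch below compares plain string concatenation with `urljoin` from the standard library. The href and src values are invented for illustration; the real site's markup may differ.

```python
from urllib.parse import urljoin

HOST = 'http://avrora-arm.ru/'
URL = 'http://avrora-arm.ru/index/sitemenu/14790'

# Hypothetical href/src values, invented for illustration only.
root_relative = '/images/photo.jpg'
page_relative = 'item/5'
absolute = 'http://example.com/a.png'

# Plain concatenation keeps the whole page path in front of the href.
glued = URL + root_relative

# urljoin resolves each value the way a browser would.
from_root = urljoin(URL, root_relative)   # leading slash: resolved from the site root
from_page = urljoin(URL, page_relative)   # no slash: replaces the last path segment
untouched = urljoin(URL, absolute)        # already absolute: returned unchanged
via_host = urljoin(HOST, root_relative)   # same result when HOST is the base

for value in (glued, from_root, from_page, untouched, via_host):
    print(value)
```

If the site uses root-relative paths, concatenating them onto URL would likely produce addresses that do not exist, which is why `urljoin` is the safer default.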

Functions

Page download function

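def get_html(url, params=None):
    # Returns the whole requests.Response, not just the markup:
    # check r.status_code and read the HTML from r.text.
    # timeout (10 s is an arbitrary choice) keeps a dead server from hanging the script.
    r = requests.get(url, headers=HEADERS, params=params, timeout=10)
    return r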
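
Since get_html returns a response object, the caller has to decide whether the request succeeded before parsing. A minimal sketch of that check follows. `extract_html` is a hypothetical helper, not part of the gist, and the stand-in objects only mimic the `status_code` and `text` attributes of `requests.Response` so the example runs without network access.

```python
from types import SimpleNamespace


def extract_html(response):
    """Return the page body as text, or None when the request did not succeed."""
    if response.status_code == 200:
        return response.text
    return None


# Stand-ins carrying the same two attributes a requests.Response has.
ok = SimpleNamespace(status_code=200, text='<html></html>')
missing = SimpleNamespace(status_code=404, text='Not Found')

print(extract_html(ok))
print(extract_html(missing))
```

In the real script the argument would be the value returned by `get_html(URL)`.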

Card parsing function

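# Standard library; this import can also sit with the other imports.
from urllib.parse import urljoin

def get_content(html, url=URL):
    soup = BeautifulSoup(html, 'html.parser')
    items = soup.find_all('div', class_='elements-list_autogenerated')

    cards = []
    for item in items:
        title, link, img = item.find('span'), item.find('a'), item.find('img')
        # Skip blocks that lack any of the three expected tags
        # instead of failing with AttributeError on None.
        if title is None or link is None or img is None:
            continue
        cards.append(
            {
                'title': title.get_text(strip=True),
                # urljoin resolves relative paths against the page URL
                # rather than gluing them onto its end.
                'link': urljoin(url, link.get('href', '')),
                'img': urljoin(url, img.get('src', ''))
            }
        )

    return cards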
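
FILE is defined among the variables, but none of the snippets writes to it. One way the cards might be saved with the csv module is sketched below. `save_cards` is a hypothetical name, the column order is an assumption based on the dictionary keys used in get_content, and the sample row is invented.

```python
import csv


def save_cards(cards, path):
    """Write card dictionaries to a CSV file with a header row."""
    # newline='' lets the csv module control line endings itself.
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=['title', 'link', 'img'])
        writer.writeheader()
        writer.writerows(cards)


if __name__ == '__main__':
    # Invented sample row; in the script this would be the result of get_content.
    sample = [
        {
            'title': 'Example card',
            'link': 'http://avrora-arm.ru/example',
            'img': 'http://avrora-arm.ru/example.jpg',
        }
    ]
    save_cards(sample, 'links.csv')
```

If the file is meant to be opened in a spreadsheet program with a regional locale, passing `delimiter=';'` to `csv.DictWriter` may be needed for the columns to split correctly.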