@ruanbekker
Last active January 22, 2024 05:49
Python Script that Scrapes a Sitemap and Ingests URL, Title and Tags into Elasticsearch
# centos: libxslt-devel python-devel
# debian: libxslt1-dev python-dev
import time

import requests
from bs4 import BeautifulSoup
from elasticsearch import Elasticsearch

es_client = Elasticsearch(['http://10.0.1.11:9200'])

# recreate the index: delete it if it exists, then create it
drop_index = es_client.indices.delete(index='myindex-test', ignore=[400, 404])
create_index = es_client.indices.create(index='myindex-test', ignore=400)
def urlparser(url):
    # fetch the post and scrape its title
    page = requests.get(url).content
    soup = BeautifulSoup(page, 'lxml')
    title_name = soup.title.string
    # scrape the tags from the article:tag meta properties
    tag_names = []
    for tag in soup.find_all(attrs={"property": "article:tag"}):
        tag_names.append(tag['content'])
    # payload for elasticsearch
    doc = {
        'date': time.strftime("%Y-%m-%d"),
        'title': title_name,
        'tags': tag_names,
        'url': url
    }
    # ingest the payload into elasticsearch
    res = es_client.index(index="myindex-test", doc_type="docs", body=doc)
    time.sleep(0.5)

# fetch the sitemap and collect every post url from the <loc> elements
sitemap_feed = 'https://sysadmins.co.za/sitemap-posts.xml'
page = requests.get(sitemap_feed)
sitemap_index = BeautifulSoup(page.content, 'html.parser')
urls = [element.text for element in sitemap_index.find_all('loc')]

for url in urls:
    urlparser(url)
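The title and tag scraping relies on the page's `<title>` element and Open Graph-style `article:tag` meta properties. A minimal standalone sketch of that parsing step, using inline sample HTML (a hypothetical stand-in for a fetched blog post) instead of a live request:

```python
from bs4 import BeautifulSoup

# sample HTML standing in for a fetched blog post (hypothetical content)
html = """
<html>
  <head>
    <title>My Post</title>
    <meta property="article:tag" content="python"/>
    <meta property="article:tag" content="elasticsearch"/>
  </head>
  <body></body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')
title_name = soup.title.string
tag_names = [tag['content'] for tag in soup.find_all(attrs={"property": "article:tag"})]

print(title_name)  # My Post
print(tag_names)   # ['python', 'elasticsearch']
```

This mirrors what the script builds into the `doc` payload for each post before indexing it.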
@alexellis:
Really concise, looks good 👍
