Skip to content

Instantly share code, notes, and snippets.

@Naesstrom
Created February 22, 2019 16:01
Show Gist options
  • Save Naesstrom/a871748f6c10715c6dc826d0436e348b to your computer and use it in GitHub Desktop.
Save Naesstrom/a871748f6c10715c6dc826d0436e348b to your computer and use it in GitHub Desktop.
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Web scraping\n",
"To get values from websites which don't provide an API is often only through scraping. It can be very tricky to get to the right values but this example here should help you to get started. This is very similar to the work-flow the [`scrape` sensor](https://home-assistant.io/components/sensor.scrape/) is using."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Get the value"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Importing the needed modules."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"from bs4 import BeautifulSoup"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We want to scrape the counter for all our implementations from the [Component overview](https://home-assistant.io/components/).\n",
"\n",
"The section (extracted from the source) which is relevant for this example is shown below.\n",
"\n",
"```html\n",
"...\n",
"<div class=\"grid__item one-sixth lap-one-whole palm-one-whole\">\n",
"<div class=\"filter-button-group\">\n",
"<a href='#all' class=\"btn\">All (444)</a>\n",
"<a href='#featured' class=\"btn featured\">Featured</a>\n",
"<a href='#alarm' class=\"btn\">\n",
"Alarm\n",
"(9)\n",
"</a>\n",
"...\n",
"```\n",
"\n",
"The line `<a href='#all' class=\"btn\">All (444)</a>` contains the counter. "
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"URL = 'https://home-assistant.io/components/'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With `requests` the website is retrieved and with `BeautifulSoup` parsed."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"raw_html = requests.get(URL).text\n",
"data = BeautifulSoup(raw_html, 'html.parser')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now you have the complete content of the page. [CSS selectors](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors) can be used to identify the counter. We have several options to get the part in question. As `BeautifulSoup` is giving us a list with the findings, we only need to identify the position in the list."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<a class=\"btn\" href=\"#all\">All (791)</a>\n"
]
}
],
"source": [
"print(data.select('a')[10])"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<a class=\"btn\" href=\"#all\">All (791)</a>\n"
]
}
],
"source": [
"print(data.select('.btn')[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`nth-of-type(x)` gives you element `x` back."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[<a class=\"btn\" href=\"#all\">All (791)</a>]\n"
]
}
],
"source": [
"print(data.select('a:nth-of-type(11)'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make your selector as robust as possible, it's recommended to look for unique elements like `id`, `URL`, etc."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[<a class=\"btn\" href=\"#all\">All (791)</a>]\n"
]
}
],
"source": [
"print(data.select('a[href=\"#all\"]'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The value extration is handled with `value_template` by the [`scrape` sensor](https://home-assistant.io/components/sensor.scrape/). The next two step are only shown here to show all manual steps.\n",
"\n",
"We only need the actual text."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"All (791)\n"
]
}
],
"source": [
"print(data.select('a[href=\"#all\"]')[0].text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is a string and can be manipulated. We focus on the number."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"791\n"
]
}
],
"source": [
"print(data.select('a[href=\"#all\"]')[0].text[5:8])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is the number of the current platforms/components from the [Component overview](https://home-assistant.io/components/) which are available in Home Assistant.\n",
"\n",
"The details you identified here can be re-used to configure [`scrape` sensor](https://home-assistant.io/components/sensor.scrape/)'s `select`. This means that the most efficient way is to apply `nth-of-type(x)` to your selector."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Send the value to the Home Assistant frontend\n",
"The [\"Using the Home Assistant Python API\"](http://nbviewer.jupyter.org/github/home-assistant/home-assistant-notebooks/blob/master/home-assistant-python-api.ipynb) notebooks contains an intro to the [Python API](https://home-assistant.io/developers/python_api/) of Home Assistant and Jupyther notebooks. Here we are sending the scrapped value to the Home Assistant frontend."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import homeassistant.remote as remote\n",
"\n",
"HOST = '127.0.0.1'\n",
"PASSWORD = 'YOUR_PASSWORD'\n",
"\n",
"api = remote.API(HOST, PASSWORD)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"new_state = data.select('a[href=\"#all\"]')[0].text[5:8]\n",
"attributes = {\n",
" \"friendly_name\": \"Home Assistant Implementations\",\n",
" \"unit_of_measurement\": \"Count\"\n",
"}\n",
"remote.set_state(api, 'sensor.ha_implement', new_state=new_state, attributes=attributes)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.2"
}
},
"nbformat": 4,
"nbformat_minor": 1
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment