@dylanroy
Last active October 12, 2020 13:39
name: scrape-wikipedia
on:
  push:
    branches:
      - master
  schedule:
    - cron: "0 */1 * * *"
jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - name: 🍽️ Get working copy
        uses: actions/checkout@master
        with:
          fetch-depth: 1
      - name: 🐍 Set up Python 3.8
        uses: actions/setup-python@v2
        with:
          python-version: '3.8'
      - name: 💿 Install Requirements
        run: pip install -r requirements.txt
      - name: 🍳 Update dataset
        run: python main.py
      - name: 🚀 Commit and push if it changed
        run: |
          git config user.name "${GITHUB_ACTOR}"
          git config user.email "${GITHUB_ACTOR}@users.noreply.github.com"
          git add -A
          timestamp=$(date -u)
          git commit -m "Latest data: ${timestamp}" || exit 0
          git push
import re

import pandas as pd


def scrape_wiki_table(url, table_index=0):
    # read_html returns every <table> on the page as a DataFrame; pick one by index.
    df = pd.read_html(url, header=0)[table_index]
    # Strip Wikipedia citation markers such as "[1]" or "[2, 3]" from the CSV text.
    return re.sub(r"\[?\s*(\d+)(?=(?:, \d+)|\])(?=[^\[]*\]).", "", df.to_csv(index=False))


if __name__ == '__main__':
    with open('data.csv', 'w') as f:
        f.write(scrape_wiki_table('https://en.wikipedia.org/wiki/List_of_chief_executive_officers'))
df = pd.read_html(url, header=0)[table_index]
Company | Executive | Title | Since | Notes | Updated
Accenture | Julie Sweet[1] | CEO | 2019 | Succeeded Pierre Nanterme, died | 2019-01-31
Aditya Birla Group | Kumar Birla | Chairman | 1995 | Part of the Birla family business house in India[2] | 2018-10-01
Adobe Systems | Shantanu Narayen | Chairman, president and CEO | 2007 | Formerly with Apple Inc.[3] | 2018-10-01
return re.sub(r"\[?\s*(\d+)(?=(?:, \d+)|\])(?=[^\[]*\]).", "", df.to_csv(index=False))
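To see what the substitution above does in isolation, here is a minimal sketch that applies the same pattern to a small hand-written CSV string instead of a live Wikipedia page. The names `CITATION_PATTERN`, `strip_citations`, and `sample` are illustrative and not part of the gist; the sample rows are adapted from the table shown earlier.

```python
import re

# Same pattern as in main.py: it consumes an optional "[", the citation
# number, and the "," or "]" that follows, but only when a closing "]" is
# reached before any new "[" opens.
CITATION_PATTERN = re.compile(r"\[?\s*(\d+)(?=(?:, \d+)|\])(?=[^\[]*\]).")


def strip_citations(text):
    """Remove bracketed numeric citation markers such as [1] or [2, 3]."""
    return CITATION_PATTERN.sub("", text)


# Hand-written stand-in for df.to_csv(index=False) output (an assumption).
sample = (
    "Company,Executive,Title,Since\n"
    "Accenture,Julie Sweet[1],CEO,2019\n"
    'Adobe Systems,Shantanu Narayen,"Chairman, president and CEO",2007\n'
)

if __name__ == "__main__":
    print(strip_citations(sample))
```

Plain numbers such as the year `2019` are left alone because they are not followed by `]` or by `, <digits>` inside a bracket. The pattern is a heuristic, though: text with an unmatched `]` could still lose digits, so it is worth checking the output against the source table.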