@pteacher
Created April 10, 2023 13:33
Text mining and scraping from Wikipedia (List of Scientists)
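import string

import pandas as pd
import wikipedia

page = wikipedia.page("List of chemists")
scientists_dict = {"scientist": [], "summary": [], "birth_year": []}

# Strip punctuation so tokens such as "(1879" or "1955)" become plain digits.
punctuation_table = str.maketrans("", "", string.punctuation)

# Only the first two links are processed to keep the demo fast.
for p in page.links[:2]:
    print(p)
    try:
        scientist = wikipedia.page(p)
    except wikipedia.exceptions.WikipediaException:
        # Skip link titles that are ambiguous or have no page.
        continue
    summary = scientist.summary

    # Treat the first 4-digit number in the summary as the birth year.
    # TODO: a regex could replace this word-by-word scan.
    birth_year = None
    for s in summary.split():
        s = s.translate(punctuation_table)
        if s.isnumeric() and len(s) == 4:
            birth_year = s
            break

    # Append to every column, even when no year is found,
    # so all lists stay the same length for the DataFrame.
    scientists_dict["scientist"].append(p)
    scientists_dict["summary"].append(summary)
    scientists_dict["birth_year"].append(birth_year)

df = pd.DataFrame(scientists_dict)
df.to_csv("out.csv", index=False)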
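
The year search above works word by word, so it can miss years that are joined to other text, for example a lifespan written as "1879–1955" with an en dash, which `string.punctuation` does not cover. A regular expression is one possible alternative. The sketch below is not part of the original gist: the name `first_year`, the pattern, and the sample sentence are all hypothetical, and it assumes that a plausible year is a four-digit number from 1000 to 2099.

```python
import re
from typing import Optional

# Assumption: a "year" is a four-digit number from 1000 to 2099
# that is not embedded in a longer run of digits.
YEAR_PATTERN = re.compile(r"(?<!\d)(1\d{3}|20\d{2})(?!\d)")


def first_year(text: str) -> Optional[str]:
    """Return the first plausible four-digit year in text, or None."""
    match = YEAR_PATTERN.search(text)
    return match.group(1) if match else None


# Hypothetical summary text, used only for illustration.
sample = "Jane Doe (12 March 1879–4 April 1955) was a hypothetical chemist."
print(first_year(sample))  # prints 1879
```

Like the original loop, this still takes the first year mentioned in the summary, which may not be the birth year for every article.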