Created
September 24, 2019 14:43
Wikipedia scraping with Python
#!/usr/bin/python3
# Scrape the Wikipedia page named on the command line
import sys
import requests
import bs4

# ANSI escape codes: switch to red text, then reset
RED = '\033[31m'
END = '\033[0m'
ascii_art = RED \
    + """
iiii kkkkkkkk iiii
i::::i k::::::k i::::i
iiii k::::::k iiii
k::::::k
wwwwwww wwwww wwwwwwwiiiiiii k:::::k kkkkkkkiiiiiiippppp pppppppppyyyyyyy yyyyyyy
w:::::w w:::::w w:::::w i:::::i k:::::k k:::::k i:::::ip::::ppp:::::::::py:::::y y:::::y
w:::::w w:::::::w w:::::w i::::i k:::::k k:::::k i::::ip:::::::::::::::::py:::::y y:::::y
w:::::w w:::::::::w w:::::w i::::i k:::::k k:::::k i::::ipp::::::ppppp::::::py:::::y y:::::y
w:::::w w:::::w:::::w w:::::w i::::i k::::::k:::::k i::::i p:::::p p:::::p y:::::y y:::::y
w:::::w w:::::w w:::::w w:::::w i::::i k:::::::::::k i::::i p:::::p p:::::p y:::::y y:::::y
w:::::w:::::w w:::::w:::::w i::::i k:::::::::::k i::::i p:::::p p:::::p y:::::y:::::y
w:::::::::w w:::::::::w i::::i k::::::k:::::k i::::i p:::::p p::::::p y:::::::::y
w:::::::w w:::::::w i::::::ik::::::k k:::::k i::::::ip:::::ppppp:::::::p y:::::::y
w:::::w w:::::w i::::::ik::::::k k:::::k i::::::ip::::::::::::::::p y:::::y
w:::w w:::w i::::::ik::::::k k:::::k i::::::ip::::::::::::::pp y:::::y
www www iiiiiiiikkkkkkkk kkkkkkkiiiiiiiip::::::pppppppp y:::::y
p:::::p y:::::y
p:::::p y:::::y
p:::::::p y:::::y
p:::::::p y:::::y
p:::::::p yyyyyyy
ppppppppp
[++] wikipy is simple wikipedia scraper [++]
Coded By: Ankit Dobhal
Let's Begin To Scrape..!
-------------------------------------------------------------------------------
wikipy version 1.0
""" \
    + END
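print(ascii_art)
# Join the command-line words with underscores, as Wikipedia titles do
res = requests.get('https://en.wikipedia.org/wiki/' + '_'.join(sys.argv[1:]))
# Raise an exception if the response is an HTTP error (4xx or 5xx)
res.raise_for_status()
# The "lxml" parser needs the lxml package installed; "html.parser" ships with Python
wiki = bs4.BeautifulSoup(res.text, "lxml")
# Print the text of every paragraph on the page
for elem in wiki.select('p'):
    print(elem.getText())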
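Titles with spaces or special characters are easier to handle if the URL is built in one place. The following is only a sketch: `wiki_url` is a hypothetical helper that is not part of the original script, and it assumes that the language subdomain is the only part of the URL that differs between Wikipedia editions.

```python
from urllib.parse import quote


def wiki_url(words, lang="en"):
    """Build an article URL from command-line words (hypothetical helper)."""
    # Wikipedia titles use underscores where the typed title has spaces.
    title = "_".join(words)
    # Percent-encode anything that is not safe in a URL path, e.g. "+".
    return f"https://{lang}.wikipedia.org/wiki/{quote(title)}"
```

In the script this could replace the string concatenation, for example `requests.get(wiki_url(sys.argv[1:]))`.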
@heelrayner, just change line 47. The APIs are common across wikis (except Wikidata). The real question is where to get the full list of wiki pages. See below:
Namespaces
- 0: (main)
- 1: Talk:
- 2: User:
- 3: User_talk:
Dumps & paths
- List of dumps
- /ngwiki/20200220 - manual (change the date)
- /ngwiki/latest - directory
- /ngwiki-latest-all-titles.gz
- /ngwiki-latest-all-titles-in-ns0.gz - articles only
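Once one of the title dumps is downloaded, the list of pages can be read locally. This is a rough sketch rather than a tested recipe: `read_titles` is a hypothetical helper, and it assumes the ns0 file is gzip-compressed UTF-8 text with one title per line and a `page_title` header on the first line.

```python
import gzip


def read_titles(path):
    """Yield page titles from an all-titles-in-ns0 dump (hypothetical helper)."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for lineno, line in enumerate(fh):
            title = line.rstrip("\n")
            # Assumption: the first line is a "page_title" header, not a title.
            if lineno == 0 and title == "page_title":
                continue
            if title:
                yield title
```

Each yielded title could then be passed to the scraper in place of the command-line argument.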
I get the error:
Traceback (most recent call last):
  File "/data/user/0/ru.iiec.pydroid3/files/accomp_files/iiec_run/iiec_run.py", line 31, in <module>
    start(fakepyfile,mainpyfile)
  File "/data/user/0/ru.iiec.pydroid3/files/accomp_files/iiec_run/iiec_run.py", line 30, in start
    exec(open(mainpyfile).read(), __main__.__dict__)
  File "<string>", line 48, in <module>
  File "/data/user/0/ru.iiec.pydroid3/files/aarch64-linux-android/lib/python3.8/site-packages/bs4/__init__.py", line 242, in __init__
    raise FeatureNotFound(
bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?
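This error means the `lxml` package is not installed, so installing it (`pip install lxml`) is the direct fix. Where that is not possible, one workaround is to fall back to the `html.parser` builder that ships with Python. A minimal sketch, assuming the hypothetical helpers `pick_parser` and `make_soup` (neither is in the original script):

```python
import importlib.util


def pick_parser(preferred="lxml", fallback="html.parser"):
    """Return `preferred` if that package is importable, else `fallback`."""
    # find_spec returns None when the top-level package is not installed.
    if importlib.util.find_spec(preferred) is not None:
        return preferred
    return fallback


def make_soup(html):
    """Parse HTML with whichever parser is available (hypothetical helper)."""
    import bs4  # imported here so pick_parser works without bs4 installed
    return bs4.BeautifulSoup(html, pick_parser())
```

In the script, `bs4.BeautifulSoup(res.text,"lxml")` would become `make_soup(res.text)`. The two parsers can differ slightly on malformed HTML, so the output may not be identical.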