Skip to content

Instantly share code, notes, and snippets.

@Ajinkz
Created June 15, 2020 18:44
Show Gist options
  • Save Ajinkz/1d1b0066f0cef54ccc7e1a5d344b600c to your computer and use it in GitHub Desktop.
Save Ajinkz/1d1b0066f0cef54ccc7e1a5d344b600c to your computer and use it in GitHub Desktop.
Generate fake profile and fetch its bios data using Beautiful soup and selenium
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "code",
"execution_count": 93,
"metadata": {},
"outputs": [],
"source": [
"\n",
"import re\n",
"import time\n",
"import random\n",
"import requests\n",
"import numpy as np\n",
"import pandas as pd\n",
"import _pickle as pickle\n",
"from selenium import webdriver\n",
"from bs4 import BeautifulSoup as bs\n",
"# from tqdm import tqdm_notebook as tqdm\n",
"from tqdm.notebook import tqdm\n",
"\n",
"\n",
"from selenium.webdriver.common.keys import Keys"
]
},
{
"cell_type": "code",
"execution_count": 94,
"metadata": {},
"outputs": [],
"source": [
"biolist = []"
]
},
{
"cell_type": "code",
"execution_count": 91,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"5"
]
},
"execution_count": 91,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# launch WebDriver instance\n",
"driver = webdriver.Chrome()\n",
"\n",
"# launch website\n",
"driver.get(\"https://www.character-generator.org.uk/bio/\")\n",
"# find required element and perform action \n",
"driver.find_element_by_id('fill_all').click()\n",
"driver.find_element_by_id('quick_submit').click()\n",
"#wait for reponse\n",
"time.sleep(2)\n",
"# get new url\n",
"url = driver.current_url\n",
"#get conents from url\n",
"page = requests.get(url)\n",
"soup = bs(page.content)\n",
"biostr=''\n",
"\n",
"try:\n",
" # Getting the bios\n",
" bios = soup.find('div', class_='bio').find_all('p')\n",
"\n",
" # Adding to a list of the bios\n",
" for i in bios:\n",
" biostr += re.sub('<p>|</p>','',str(i)) \n",
"\n",
" biolist.append(biostr)\n",
" # close the instance\n",
" driver.close()\n",
"except:\n",
" pass\n",
"\n",
"len(biolist)"
]
},
{
"cell_type": "code",
"execution_count": 96,
"metadata": {},
"outputs": [],
"source": [
"biolist = [\"Hannah Helen Chan is a 22-year-old medical student who enjoys walking, meditation and watching YouTube videos. She is inspiring and kind, but can also be very unfriendly and a bit rude.She is a Thai Buddhist who defines herself as pansexual. She is currently at college. studying medicine. She is allergic to milk. Physically, Hannah is slightly overweight but otherwise in good shape. She is average-height with pale skin, ginger hair and brown eyes. She grew up in a working class neighbourhood. After her mother died when she was young, she was raised by her father She is currently single. Her most recent romance was with a gym assistant called Gemma Nellie Simmons, who was the same age as her. They broke up because Gemma wanted a quieter life than Hannah could provide.Hannah's best friend is a medical student called Hector Mclaughlin. They are inseparable. She also hangs around with Dawn Matthews and Aliya O'Connor. They enjoy Rubix cube together. \",\n",
" \"John Jeff Trescothik is a 18-year-old teenager who enjoys swimming, glamping and painting. He is considerate and energetic, but can also be very evil and a bit stingy.He is Thai who defines himself as asexual. He didn't finish school. He has a severe phobia of heightsPhysically, John is slightly overweight but otherwise in good shape. He is very short with bronze skin, blonde hair and yellow eyes. He grew up in a middle class neighbourhood. He was raised by his father, his mother having left when he was young. John's best friend is a teenager called Allison Bell. They have a very firey friendship. He also hangs around with Geraldine Hamilton and Anthony Zhang. They enjoy duck herding together. \",\n",
" \"Christiana Wenna Nolan is a 40-year-old tea maker who enjoys competitive dog grooming, cookery and watching television. She is inspiring and gentle, but can also be very selfish and a bit untrustworthy.She is addicted to the internet, something which a friend pointed out when she was 17. The problem intensified in 1999. Christiana has lost two jobs as a result of her addiction, specifically: golf caddy and health centre receptionist.She is Australian who defines herself as asexual. She has a degree in medicine. She has a severe phobia of cats, and is obsessed with films. Physically, Christiana is in good shape. She is short with bronze skin, brown hair and yellow eyes. She has Due to the the internet, Christiana looks older than her actual age..She grew up in an upper class neighbourhood. Her mother left when she was young, leaving her with her father, who was a drunk. Christiana's best friend is a tea maker called Jasmine Barnes. They have a very firey friendship. She also hangs around with a tea maker called Spencer Wang. They enjoy ferret racing together. \",\n",
" \"Forest Cameron Hemingway is a 20-year-old gym assistant who enjoys spreading right-wing propoganda, ferret racing and watching television. He is bright and generous, but can also be very unstable and a bit evil.He is Brazilian who defines himself as bisexual. He started studying sports science at college but never finished the course. He is a vegetarian. He has phobias of blood, sausages and fingersPhysically, Forest is not in great shape. He needs to lose quite a lot of weight. He is very tall with dark chocolate skin, brown hair and blue eyes. He grew up in an upper class neighbourhood. After his father died when he was young, he was raised by his mother He is currently single. His most recent romance was with an electrician called Brayden Larry Young, who was 19 years older than him. They broke up because their strong personalities clashed.Forest's best friend is a gym assistant called Perry King. They are inseparable. He also hangs around with Teddy Cook and Joanna Ramos. They enjoy hockey together. \"]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 95,
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "3b56c37fd44e4165a361a6d44755f445",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"HBox(children=(FloatProgress(value=0.0, max=1000.0), HTML(value='')))"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Exception ignored in: <function Service.__del__ at 0x0000016EB0B311F8>\n",
"Traceback (most recent call last):\n",
" File \"d:\\ProgramData\\Anaconda3\\lib\\site-packages\\selenium\\webdriver\\common\\service.py\", line 176, in __del__\n",
" self.stop()\n",
" File \"d:\\ProgramData\\Anaconda3\\lib\\site-packages\\selenium\\webdriver\\common\\service.py\", line 151, in stop\n",
" self.send_remote_shutdown_command()\n",
" File \"d:\\ProgramData\\Anaconda3\\lib\\site-packages\\selenium\\webdriver\\common\\service.py\", line 132, in send_remote_shutdown_command\n",
" if not self.is_connectable():\n",
" File \"d:\\ProgramData\\Anaconda3\\lib\\site-packages\\selenium\\webdriver\\common\\service.py\", line 115, in is_connectable\n",
" return utils.is_connectable(self.port)\n",
" File \"d:\\ProgramData\\Anaconda3\\lib\\site-packages\\selenium\\webdriver\\common\\utils.py\", line 106, in is_connectable\n",
" socket_ = socket.create_connection((host, port), 1)\n",
" File \"d:\\ProgramData\\Anaconda3\\lib\\socket.py\", line 716, in create_connection\n",
" sock.connect(sa)\n",
"KeyboardInterrupt: \n"
]
},
{
"ename": "KeyboardInterrupt",
"evalue": "",
"output_type": "error",
"traceback": [
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[1;31mKeyboardInterrupt\u001b[0m Traceback (most recent call last)",
"\u001b[1;32m<ipython-input-95-f37148ee7ebe>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m\u001b[0m\n\u001b[0;32m 7\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 8\u001b[0m \u001b[1;31m# launch WebDriver instance\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 9\u001b[1;33m \u001b[0mdriver\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mwebdriver\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mChrome\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 10\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 11\u001b[0m \u001b[1;31m# launch website\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;32md:\\ProgramData\\Anaconda3\\lib\\site-packages\\selenium\\webdriver\\chrome\\webdriver.py\u001b[0m in \u001b[0;36m__init__\u001b[1;34m(self, executable_path, port, options, service_args, desired_capabilities, service_log_path, chrome_options, keep_alive)\u001b[0m\n\u001b[0;32m 79\u001b[0m \u001b[0mremote_server_addr\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mservice\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mservice_url\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 80\u001b[0m keep_alive=keep_alive),\n\u001b[1;32m---> 81\u001b[1;33m desired_capabilities=desired_capabilities)\n\u001b[0m\u001b[0;32m 82\u001b[0m \u001b[1;32mexcept\u001b[0m \u001b[0mException\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 83\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mquit\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;32md:\\ProgramData\\Anaconda3\\lib\\site-packages\\selenium\\webdriver\\remote\\webdriver.py\u001b[0m in \u001b[0;36m__init__\u001b[1;34m(self, command_executor, desired_capabilities, browser_profile, proxy, keep_alive, file_detector, options)\u001b[0m\n\u001b[0;32m 155\u001b[0m warnings.warn(\"Please use FirefoxOptions to set browser profile\",\n\u001b[0;32m 156\u001b[0m DeprecationWarning, stacklevel=2)\n\u001b[1;32m--> 157\u001b[1;33m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mstart_session\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mcapabilities\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mbrowser_profile\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 158\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_switch_to\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mSwitchTo\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 159\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_mobile\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mMobile\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;32md:\\ProgramData\\Anaconda3\\lib\\site-packages\\selenium\\webdriver\\remote\\webdriver.py\u001b[0m in \u001b[0;36mstart_session\u001b[1;34m(self, capabilities, browser_profile)\u001b[0m\n\u001b[0;32m 250\u001b[0m parameters = {\"capabilities\": w3c_caps,\n\u001b[0;32m 251\u001b[0m \"desiredCapabilities\": capabilities}\n\u001b[1;32m--> 252\u001b[1;33m \u001b[0mresponse\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mexecute\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mCommand\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mNEW_SESSION\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mparameters\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 253\u001b[0m \u001b[1;32mif\u001b[0m \u001b[1;34m'sessionId'\u001b[0m \u001b[1;32mnot\u001b[0m \u001b[1;32min\u001b[0m \u001b[0mresponse\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 254\u001b[0m \u001b[0mresponse\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mresponse\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;34m'value'\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;32md:\\ProgramData\\Anaconda3\\lib\\site-packages\\selenium\\webdriver\\remote\\webdriver.py\u001b[0m in \u001b[0;36mexecute\u001b[1;34m(self, driver_command, params)\u001b[0m\n\u001b[0;32m 317\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 318\u001b[0m \u001b[0mparams\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_wrap_value\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mparams\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 319\u001b[1;33m \u001b[0mresponse\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mcommand_executor\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mexecute\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mdriver_command\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mparams\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 320\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mresponse\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 321\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0merror_handler\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mcheck_response\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mresponse\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;32md:\\ProgramData\\Anaconda3\\lib\\site-packages\\selenium\\webdriver\\remote\\remote_connection.py\u001b[0m in \u001b[0;36mexecute\u001b[1;34m(self, command, params)\u001b[0m\n\u001b[0;32m 372\u001b[0m \u001b[0mdata\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mutils\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mdump_json\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mparams\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 373\u001b[0m \u001b[0murl\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;34m'%s%s'\u001b[0m \u001b[1;33m%\u001b[0m \u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_url\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mpath\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 374\u001b[1;33m \u001b[1;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_request\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mcommand_info\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;36m0\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0murl\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mbody\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mdata\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 375\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 376\u001b[0m \u001b[1;32mdef\u001b[0m \u001b[0m_request\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mmethod\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0murl\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mbody\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;32mNone\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;32md:\\ProgramData\\Anaconda3\\lib\\site-packages\\selenium\\webdriver\\remote\\remote_connection.py\u001b[0m in \u001b[0;36m_request\u001b[1;34m(self, method, url, body)\u001b[0m\n\u001b[0;32m 395\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 396\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mkeep_alive\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 397\u001b[1;33m \u001b[0mresp\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_conn\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mrequest\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mmethod\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0murl\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mbody\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mbody\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mheaders\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mheaders\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 398\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 399\u001b[0m \u001b[0mstatuscode\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mresp\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mstatus\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;32md:\\ProgramData\\Anaconda3\\lib\\site-packages\\urllib3\\request.py\u001b[0m in \u001b[0;36mrequest\u001b[1;34m(self, method, url, fields, headers, **urlopen_kw)\u001b[0m\n\u001b[0;32m 78\u001b[0m \u001b[1;32melse\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 79\u001b[0m return self.request_encode_body(\n\u001b[1;32m---> 80\u001b[1;33m \u001b[0mmethod\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0murl\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mfields\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mfields\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mheaders\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mheaders\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0murlopen_kw\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 81\u001b[0m )\n\u001b[0;32m 82\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;32md:\\ProgramData\\Anaconda3\\lib\\site-packages\\urllib3\\request.py\u001b[0m in \u001b[0;36mrequest_encode_body\u001b[1;34m(self, method, url, fields, headers, encode_multipart, multipart_boundary, **urlopen_kw)\u001b[0m\n\u001b[0;32m 169\u001b[0m \u001b[0mextra_kw\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mupdate\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0murlopen_kw\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 170\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 171\u001b[1;33m \u001b[1;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0murlopen\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mmethod\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0murl\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0mextra_kw\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[1;32md:\\ProgramData\\Anaconda3\\lib\\site-packages\\urllib3\\poolmanager.py\u001b[0m in \u001b[0;36murlopen\u001b[1;34m(self, method, url, redirect, **kw)\u001b[0m\n\u001b[0;32m 328\u001b[0m \u001b[0mresponse\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mconn\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0murlopen\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mmethod\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0murl\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0mkw\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 329\u001b[0m \u001b[1;32melse\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 330\u001b[1;33m \u001b[0mresponse\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mconn\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0murlopen\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mmethod\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mu\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mrequest_uri\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0mkw\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 331\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 332\u001b[0m \u001b[0mredirect_location\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mredirect\u001b[0m \u001b[1;32mand\u001b[0m \u001b[0mresponse\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mget_redirect_location\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;32md:\\ProgramData\\Anaconda3\\lib\\site-packages\\urllib3\\connectionpool.py\u001b[0m in \u001b[0;36murlopen\u001b[1;34m(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)\u001b[0m\n\u001b[0;32m 670\u001b[0m \u001b[0mbody\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mbody\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 671\u001b[0m \u001b[0mheaders\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mheaders\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 672\u001b[1;33m \u001b[0mchunked\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mchunked\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 673\u001b[0m )\n\u001b[0;32m 674\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;32md:\\ProgramData\\Anaconda3\\lib\\site-packages\\urllib3\\connectionpool.py\u001b[0m in \u001b[0;36m_make_request\u001b[1;34m(self, conn, method, url, timeout, chunked, **httplib_request_kw)\u001b[0m\n\u001b[0;32m 419\u001b[0m \u001b[1;31m# Python 3 (including for exceptions like SystemExit).\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 420\u001b[0m \u001b[1;31m# Otherwise it looks like a bug in the code.\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 421\u001b[1;33m \u001b[0msix\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mraise_from\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0me\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;32mNone\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 422\u001b[0m \u001b[1;32mexcept\u001b[0m \u001b[1;33m(\u001b[0m\u001b[0mSocketTimeout\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mBaseSSLError\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mSocketError\u001b[0m\u001b[1;33m)\u001b[0m \u001b[1;32mas\u001b[0m \u001b[0me\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 423\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_raise_timeout\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0merr\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0me\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0murl\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0murl\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mtimeout_value\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mread_timeout\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;32md:\\ProgramData\\Anaconda3\\lib\\site-packages\\urllib3\\packages\\six.py\u001b[0m in \u001b[0;36mraise_from\u001b[1;34m(value, from_value)\u001b[0m\n",
"\u001b[1;32md:\\ProgramData\\Anaconda3\\lib\\site-packages\\urllib3\\connectionpool.py\u001b[0m in \u001b[0;36m_make_request\u001b[1;34m(self, conn, method, url, timeout, chunked, **httplib_request_kw)\u001b[0m\n\u001b[0;32m 414\u001b[0m \u001b[1;31m# Python 3\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 415\u001b[0m \u001b[1;32mtry\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 416\u001b[1;33m \u001b[0mhttplib_response\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mconn\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mgetresponse\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 417\u001b[0m \u001b[1;32mexcept\u001b[0m \u001b[0mBaseException\u001b[0m \u001b[1;32mas\u001b[0m \u001b[0me\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 418\u001b[0m \u001b[1;31m# Remove the TypeError from the exception chain in\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;32md:\\ProgramData\\Anaconda3\\lib\\http\\client.py\u001b[0m in \u001b[0;36mgetresponse\u001b[1;34m(self)\u001b[0m\n\u001b[0;32m 1342\u001b[0m \u001b[1;32mtry\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1343\u001b[0m \u001b[1;32mtry\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 1344\u001b[1;33m \u001b[0mresponse\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mbegin\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 1345\u001b[0m \u001b[1;32mexcept\u001b[0m \u001b[0mConnectionError\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1346\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mclose\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;32md:\\ProgramData\\Anaconda3\\lib\\http\\client.py\u001b[0m in \u001b[0;36mbegin\u001b[1;34m(self)\u001b[0m\n\u001b[0;32m 304\u001b[0m \u001b[1;31m# read until we get a non-100 response\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 305\u001b[0m \u001b[1;32mwhile\u001b[0m \u001b[1;32mTrue\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 306\u001b[1;33m \u001b[0mversion\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mstatus\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mreason\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_read_status\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 307\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mstatus\u001b[0m \u001b[1;33m!=\u001b[0m \u001b[0mCONTINUE\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 308\u001b[0m \u001b[1;32mbreak\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;32md:\\ProgramData\\Anaconda3\\lib\\http\\client.py\u001b[0m in \u001b[0;36m_read_status\u001b[1;34m(self)\u001b[0m\n\u001b[0;32m 265\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 266\u001b[0m \u001b[1;32mdef\u001b[0m \u001b[0m_read_status\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 267\u001b[1;33m \u001b[0mline\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mstr\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mfp\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mreadline\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0m_MAXLINE\u001b[0m \u001b[1;33m+\u001b[0m \u001b[1;36m1\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;34m\"iso-8859-1\"\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 268\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mlen\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mline\u001b[0m\u001b[1;33m)\u001b[0m \u001b[1;33m>\u001b[0m \u001b[0m_MAXLINE\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 269\u001b[0m \u001b[1;32mraise\u001b[0m \u001b[0mLineTooLong\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m\"status line\"\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;32md:\\ProgramData\\Anaconda3\\lib\\socket.py\u001b[0m in \u001b[0;36mreadinto\u001b[1;34m(self, b)\u001b[0m\n\u001b[0;32m 587\u001b[0m \u001b[1;32mwhile\u001b[0m \u001b[1;32mTrue\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 588\u001b[0m \u001b[1;32mtry\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 589\u001b[1;33m \u001b[1;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_sock\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mrecv_into\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mb\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 590\u001b[0m \u001b[1;32mexcept\u001b[0m \u001b[0mtimeout\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 591\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_timeout_occurred\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;32mTrue\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;31mKeyboardInterrupt\u001b[0m: "
]
}
],
"source": [
"# Randomizing the refresh rate\n",
"seq = [i/10 for i in range(8,18)]\n",
"\n",
"\n",
"# Gathering bios by looping and refreshing the web page\n",
"for _ in tqdm(range(1000)):\n",
" \n",
" # launch WebDriver instance\n",
" driver = webdriver.Chrome()\n",
"\n",
" # launch website\n",
" driver.get(\"https://www.character-generator.org.uk/bio/\")\n",
" # find required element and perform action \n",
" driver.find_element_by_id('fill_all').click()\n",
" driver.find_element_by_id('quick_submit').click()\n",
" #wait for reponse\n",
" time.sleep(2)\n",
" # get new url\n",
" url = driver.current_url\n",
" #get conents from url\n",
" page = requests.get(url)\n",
" soup = bs(page.content)\n",
" biostr=''\n",
"\n",
" try:\n",
" # Getting the bios\n",
" bios = soup.find('div', class_='bio').find_all('p')\n",
"\n",
" # Adding to a list of the bios\n",
" for i in bios:\n",
" biostr += re.sub('<p>|</p>','',str(i)) \n",
"\n",
" biolist.append(biostr)\n",
" # close the instance\n",
" driver.close()\n",
" except:\n",
" pass\n",
"\n",
" \n",
" # Sleeping \n",
" time.sleep(random.choice(seq))\n",
" \n",
"len(biolist)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"biolist"
]
},
{
"cell_type": "code",
"execution_count": 97,
"metadata": {},
"outputs": [],
"source": [
"# Creating a DF from the bio list\n",
"bio_df = pd.DataFrame(biolist, columns=['Bios'])"
]
},
{
"cell_type": "code",
"execution_count": 98,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Bios</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Hannah Helen Chan is a 22-year-old medical stu...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>John Jeff Trescothik is a 18-year-old teenager...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Christiana Wenna Nolan is a 40-year-old tea ma...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Forest Cameron Hemingway is a 20-year-old gym ...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Bios\n",
"0 Hannah Helen Chan is a 22-year-old medical stu...\n",
"1 John Jeff Trescothik is a 18-year-old teenager...\n",
"2 Christiana Wenna Nolan is a 40-year-old tea ma...\n",
"3 Forest Cameron Hemingway is a 20-year-old gym ..."
]
},
"execution_count": 98,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"bio_df"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment