{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Part 2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Hi there again! In the last blog, we were able to get some results. But we noticed that our program spent majority of its time requesting results from the API. As we are spending a lot of time in requests, it is good to have them done in parallel. So, let's leverage **threading** in Python to spawn multiple threads at the same time. Thankfully, as our threads are not interdependant on each other, we can be chill about concurrancy issues. Keep in mind that threading does not mean that our program will run on multiple cores."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import concurrent.futures, threading"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Another improvement that we can do is instead of sending all of our information data again and again ( ```response = requests.request(\"GET\", url = browseQuotesURL, headers = self.headers)``` ), we create a session. This session will be remembered by our object and the next time, we just have to make a direct request. This will be also helpful in the future if we want to get the url for ticket purchase."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"class finder:\n",
" \n",
" def __init__(self, originCountry = \"DE\", currency = \"EUR\", locale = \"en-US\", rootURL=\"https://skyscanner-skyscanner-flight-search-v1.p.rapidapi.com\"):\n",
" self.currency = currency\n",
" self.locale = locale\n",
" self.rootURL = rootURL\n",
" self.originCountry = originCountry\n",
" self.airports = {}\n",
" \n",
" def setHeaders(self, headers):\n",
" self.headers = headers\n",
" self.createSession()\n",
"\n",
" # Create a session\n",
" def createSession(self):\n",
" self.session = requests.Session() \n",
" self.session.headers.update(self.headers)\n",
" return self.session\n",
" \n",
" def browseQuotes(self, source, destination, date):\n",
" quoteRequestPath = \"/apiservices/browsequotes/v1.0/\"\n",
" browseQuotesURL = self.rootURL + quoteRequestPath + self.originCountry + \"/\" + self.currency + \"/\" + self.locale + \"/\" + source + \"/\" + destination + \"/\" + date.strftime(\"%Y-%m-%d\")\n",
" # Use the same session to request again and again\n",
" response = self.session.get(browseQuotesURL)\n",
" resultJSON = json.loads(response.text)\n",
" self.printResult(resultJSON,date)\n",
" \n",
" # A bit more elegant print\n",
" def printResult(self, resultJSON,date):\n",
" if(\"Quotes\" in resultJSON):\n",
" for Places in resultJSON[\"Places\"]:\n",
" self.airports[Places[\"PlaceId\"]] = Places[\"Name\"] \n",
" for Quotes in resultJSON[\"Quotes\"]:\n",
" source = Quotes[\"OutboundLeg\"][\"OriginId\"]\n",
" dest = Quotes[\"OutboundLeg\"][\"DestinationId\"]\n",
" print(date.strftime(\"%d-%b %a\") + \" | \" + \"%s --> %s\"%(self.airports[source],self.airports[dest]) + \" | \" + \"%s EUR\" %Quotes[\"MinPrice\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As this is a new Jupyter Notebook, let's dump all our variables from previous code here"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"import requests, json, timeit, time, datetime, dateutil, osmapi\n",
"import calendar\n",
"import pandas as pd\n",
"import time\n",
"\n",
"rootURL = \"https://skyscanner-skyscanner-flight-search-v1.p.rapidapi.com\"\n",
"originCountry = \"DE\"\n",
"currancy = \"EUR\"\n",
"locale = \"en-US\"\n",
"\n",
"source_begin_date = \"2020-01-18\"\n",
"source_end_date = \"2020-01-24\" \n",
"daterange_source = pd.date_range(source_begin_date, source_end_date)\n",
"airports = { }\n",
"source_array = {\"BERL-sky\"} \n",
"destination_array = {\"MAD-sky\", \"BCN-sky\", \"SVQ-sky\", \"VLC-sky\"}\n",
"\n",
"\n",
"headers = {\n",
" 'x-rapidapi-host': \"skyscanner-skyscanner-flight-search-v1.p.rapidapi.com\",\n",
" 'x-rapidapi-key': \"ae922034c6mshbd47a2c270cbe96p127c54jsnfec4819a7799\"\n",
" }"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now here comes the multi-threading part. The concurrent features library hands it for us. Using threadpoolexecutor as a wrapper, we just submit our task and its parameters to a pool of threads. The executor automatically schedules them as they arrive and they get executed in paralllel. Easiest multithreading ever! "
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"19-Jan Sun | Berlin Tegel --> Madrid | 44.0 EUR\n",
"22-Jan Wed | Berlin Schoenefeld --> Valencia | 45.0 EUR\n",
"20-Jan Mon | Berlin Tegel --> Valencia | 67.0 EUR\n",
"18-Jan Sat | Berlin Tegel --> Madrid | 44.0 EUR\n",
"18-Jan Sat | Berlin Schoenefeld --> Valencia | 62.0 EUR\n",
"19-Jan Sun | Berlin Schoenefeld --> Valencia | 60.0 EUR\n",
"24-Jan Fri | Berlin Tegel --> Madrid | 44.0 EUR\n",
"20-Jan Mon | Berlin Tegel --> Barcelona | 57.0 EUR\n",
"20-Jan Mon | Berlin Schoenefeld --> Barcelona | 74.0 EUR\n",
"18-Jan Sat | Berlin Schoenefeld --> Barcelona | 71.0 EUR\n",
"18-Jan Sat | Berlin Schoenefeld --> Barcelona | 204.0 EUR\n",
"21-Jan Tue | Berlin Schoenefeld --> Seville | 118.0 EUR\n",
"21-Jan Tue | Berlin Tegel --> Seville | 50.0 EUR\n",
"23-Jan Thu | Berlin Tegel --> Madrid | 49.0 EUR\n",
"23-Jan Thu | Berlin Tegel --> Madrid | 53.0 EUR\n",
"21-Jan Tue | Berlin Tegel --> Madrid | 46.0 EUR\n",
"21-Jan Tue | Berlin Tegel --> Barcelona | 46.0 EUR\n",
"21-Jan Tue | Berlin Schoenefeld --> Barcelona | 103.0 EUR\n",
"24-Jan Fri | Berlin Schoenefeld --> Valencia | 44.0 EUR\n",
"23-Jan Thu | Berlin Tegel --> Seville | 58.0 EUR\n",
"19-Jan Sun | Berlin Tegel --> Barcelona | 100.0 EUR\n",
"19-Jan Sun | Berlin Schoenefeld --> Barcelona | 72.0 EUR\n",
"20-Jan Mon | Berlin Tegel --> Madrid | 46.0 EUR\n",
"21-Jan Tue | Berlin Schoenefeld --> Madrid | 44.0 EUR\n",
"23-Jan Thu | Berlin Tegel --> Barcelona | 60.0 EUR\n",
"23-Jan Thu | Berlin Tegel --> Barcelona | 53.0 EUR\n",
"22-Jan Wed | Berlin Schoenefeld --> Seville | 45.0 EUR\n",
"24-Jan Fri | Berlin Schoenefeld --> Barcelona | 49.0 EUR\n",
"24-Jan Fri | Berlin Tegel --> Barcelona | 66.0 EUR\n",
"22-Jan Wed | Berlin Tegel --> Madrid | 44.0 EUR\n",
"18-Jan Sat | Berlin Schoenefeld --> Seville | 49.0 EUR\n",
"21-Jan Tue | Berlin Tegel --> Valencia | 36.0 EUR\n",
"21-Jan Tue | Berlin Schoenefeld --> Valencia | 236.0 EUR\n",
"19-Jan Sun | Berlin Tegel --> Seville | 74.0 EUR\n",
"24-Jan Fri | Berlin Schoenefeld --> Seville | 58.0 EUR22-Jan Wed | Berlin Tegel --> Barcelona | 48.0 EUR\n",
"\n",
"22-Jan Wed | Berlin Schoenefeld --> Barcelona | 95.0 EUR\n",
"23-Jan Thu | Berlin Tegel --> Valencia | 44.0 EUR\n",
"\n",
"Benchmark Stats :\n",
"Time spent in program: 2.650348 seconds\n"
]
}
],
"source": [
"cheapest_flight_finder2 = finder()\n",
"cheapest_flight_finder2.setHeaders(headers)\n",
"\n",
"function_start = time.time()\n",
"\n",
"with concurrent.futures.ThreadPoolExecutor(max_workers=32) as executor:\n",
" for single_date in daterange_source:\n",
" for destination in destination_array:\n",
" for source in source_array:\n",
" request_start = time.time()\n",
" executor.submit(cheapest_flight_finder2.browseQuotes,source, destination,single_date)\n",
"\n",
"print(\"\\nBenchmark Stats :\")\n",
"print(\"Time spent in program: %f seconds\"%(time.time()-function_start))\n"
]
},
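{
"cell_type": "markdown",
"metadata": {},
"source": [
"One thing to be aware of: ```executor.submit``` returns a ```Future```, and because we throw the futures away above, any exception raised inside ```browseQuotes``` is silently swallowed. An optional pattern (not used in the benchmark above) is to keep the futures and check them with ```as_completed```:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"futures = []\n",
"with concurrent.futures.ThreadPoolExecutor(max_workers=32) as executor:\n",
"    for single_date in daterange_source:\n",
"        for destination in destination_array:\n",
"            for source in source_array:\n",
"                futures.append(executor.submit(\n",
"                    cheapest_flight_finder2.browseQuotes, source, destination, single_date))\n",
"\n",
"# Surface any exception that happened inside a worker thread\n",
"for future in concurrent.futures.as_completed(futures):\n",
"    if future.exception() is not None:\n",
"        print(\"A request failed:\", future.exception())"
]
},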
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Whoa, thats a lot faster! But just how fast? Lets compare the same code using single thread."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"18-Jan Sat | Berlin Tegel --> Madrid | 44.0 EUR\n",
"18-Jan Sat | Berlin Schoenefeld --> Valencia | 62.0 EUR\n",
"18-Jan Sat | Berlin Schoenefeld --> Barcelona | 204.0 EUR\n",
"18-Jan Sat | Berlin Schoenefeld --> Barcelona | 71.0 EUR\n",
"18-Jan Sat | Berlin Schoenefeld --> Seville | 49.0 EUR\n",
"19-Jan Sun | Berlin Tegel --> Madrid | 44.0 EUR\n",
"19-Jan Sun | Berlin Schoenefeld --> Valencia | 60.0 EUR\n",
"19-Jan Sun | Berlin Tegel --> Barcelona | 100.0 EUR\n",
"19-Jan Sun | Berlin Schoenefeld --> Barcelona | 72.0 EUR\n",
"19-Jan Sun | Berlin Tegel --> Seville | 74.0 EUR\n",
"20-Jan Mon | Berlin Tegel --> Madrid | 46.0 EUR\n",
"20-Jan Mon | Berlin Tegel --> Valencia | 67.0 EUR\n",
"20-Jan Mon | Berlin Tegel --> Barcelona | 57.0 EUR\n",
"20-Jan Mon | Berlin Schoenefeld --> Barcelona | 74.0 EUR\n",
"21-Jan Tue | Berlin Tegel --> Madrid | 46.0 EUR\n",
"21-Jan Tue | Berlin Schoenefeld --> Madrid | 44.0 EUR\n",
"21-Jan Tue | Berlin Schoenefeld --> Valencia | 236.0 EUR\n",
"21-Jan Tue | Berlin Tegel --> Valencia | 36.0 EUR\n",
"21-Jan Tue | Berlin Tegel --> Barcelona | 46.0 EUR\n",
"21-Jan Tue | Berlin Schoenefeld --> Barcelona | 103.0 EUR\n",
"21-Jan Tue | Berlin Tegel --> Seville | 50.0 EUR\n",
"21-Jan Tue | Berlin Schoenefeld --> Seville | 118.0 EUR\n",
"22-Jan Wed | Berlin Tegel --> Madrid | 44.0 EUR\n",
"22-Jan Wed | Berlin Schoenefeld --> Valencia | 45.0 EUR\n",
"22-Jan Wed | Berlin Schoenefeld --> Barcelona | 95.0 EUR\n",
"22-Jan Wed | Berlin Tegel --> Barcelona | 48.0 EUR\n",
"22-Jan Wed | Berlin Schoenefeld --> Seville | 45.0 EUR\n",
"23-Jan Thu | Berlin Tegel --> Madrid | 49.0 EUR\n",
"23-Jan Thu | Berlin Tegel --> Madrid | 53.0 EUR\n",
"23-Jan Thu | Berlin Tegel --> Valencia | 44.0 EUR\n",
"23-Jan Thu | Berlin Tegel --> Barcelona | 60.0 EUR\n",
"23-Jan Thu | Berlin Tegel --> Barcelona | 53.0 EUR\n",
"23-Jan Thu | Berlin Tegel --> Seville | 58.0 EUR\n",
"24-Jan Fri | Berlin Tegel --> Madrid | 44.0 EUR\n",
"24-Jan Fri | Berlin Schoenefeld --> Valencia | 44.0 EUR\n",
"24-Jan Fri | Berlin Tegel --> Barcelona | 66.0 EUR\n",
"24-Jan Fri | Berlin Schoenefeld --> Barcelona | 49.0 EUR\n",
"24-Jan Fri | Berlin Schoenefeld --> Seville | 58.0 EUR\n",
"\n",
"Benchmark Stats :\n",
"Time spent in program: 16.976593 seconds\n"
]
}
],
"source": [
"\n",
"function_start = time.time()\n",
"\n",
"for single_date in daterange_source:\n",
" for destination in destination_array:\n",
" for source in source_array:\n",
" request_start = time.time()\n",
" cheapest_flight_finder2.browseQuotes(source, destination,single_date)\n",
"\n",
"print(\"\\nBenchmark Stats :\")\n",
"print(\"Time spent in program: %f seconds\"%(time.time()-function_start))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Thats a tremendous imporvement. Our program is almost 7-8 times faster! You have to do a bit of trial and error for ```max_thread``` value in order to get the optimum number of threads. For me 32 was the best.\n",
"\n",
"\n",
"Also, did you notice the text glitches in our multi-thread version? Our threads are independant, so it could happen that two threads want to use print at the exact same time. Thus, somehow, the next line is not inserted properly. However, if our threads needed a synchronisation, we would have to deal with a lot of stuff like semaphores and mutexs. But, we are safe for now! \n",
"\n",
"\n",
"Now that we have finished our thread *thread* , we move back to our application"
]
},
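{
"cell_type": "markdown",
"metadata": {},
"source": [
"As promised, a minimal sketch of serialising print calls with a ```threading.Lock``` so that lines never interleave. This is not part of the ```finder``` class above; the ```safe_print``` helper is only an illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import threading\n",
"\n",
"print_lock = threading.Lock()\n",
"\n",
"def safe_print(*args, **kwargs):\n",
"    # Only one thread at a time may enter this block, so each line is written\n",
"    # out in one piece instead of interleaving with other threads' output.\n",
"    with print_lock:\n",
"        print(*args, **kwargs)\n",
"\n",
"# finder.printResult could then call safe_print(...) instead of print(...)"
]
},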
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Databases and Docker"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can see that we are generating lots of data for a single request. Now we also have to operate on it to find the cheapest trip.We can use python's data structures or even pandas. But why not use what is actually used in real world products? Databases!\n",
"\n",
"Databases are organized way of holding data. Think of something like an Excel Spreadsheet, but for easy access from other programs. Like Pokemons, there are 100s of differnt datasbases. But there 2 major types: Relational ones (SQL) and Non relational ones (NoSQL). If you want to know more about their differences and how they function, [this](https://clockwise.software/blog/relational-vs-non-relational-databases-advantages-and-disadvantages/) blog has explained it in much detail. But in our case we are working with JSON anyway and MongoDB is just meant for that kind of data! I could go on and on about why we chose a perticular database. For now, let's use MonogoDB. \n",
"\n",
"We can install it like a normal program using exe or deb packages. But, let's consider a practical scenario. If this reaches production, it will likeley be running on a server. And if it , god forbids, gets popular! ; we might me getting a tons of different requests per second from different parts of the globe. In this case, if our code fails, everything will just stop.\n",
"\n",
"The clever way of overcoming this is using microservices, i.e. using Kubernetes and Docker. Now this is definately a topic of another blog. But now, to explain you how easy it is to set up a service, let's use mongodb's official docker container. Follow the installation process for [docker](https://docs.docker.com/install/). Once you have done that, just run following command in the terminal:\n",
"\n",
"```docker run --name eurotrip-planner-mongo mongo:latest```\n",
"\n",
"This will pull the latest mongodb image and all it's required parts, build a contianer and then start a mongodb server! All in one line! You can check if it is working or not using [MongoDB client application](https://www.mongodb.com/products/compass). Our setup is done!"
]
},
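{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you prefer to check the connection from Python instead of a GUI client, a quick ping works too. This is an optional sanity check and assumes the ```pymongo``` package is installed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from pymongo import MongoClient\n",
"\n",
"# Connect to the MongoDB server we just started in Docker and ping it.\n",
"client = MongoClient(\"mongodb://localhost:27017\", serverSelectionTimeoutMS=2000)\n",
"print(client.admin.command(\"ping\"))  # {'ok': 1.0} means the server is reachable"
]
},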
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, I am cheating a bit, but I already wrote wrapper drvier module for MongoDB. You can read the docs on GitHub but it is fairly easy to understand. I am just going to use it directly here.\n",
"\n",
"Our MongoDB paramerters are here. We create 2 different collections or tables for our Incoming and Outgoing flights."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"import wrapymongo\n",
"authdb = \"admin\"\n",
"monogdbport = \"27017\"\n",
"host = \"localhost\"\n",
"link = \"mongodb://\" + host + \":\" + monogdbport\n",
"database = \"SkyScanner\"\n",
"outgoingTable = \"Outgoing\"\n",
"incomingTable = \"Incoming\"\n",
"placesTable = \"Places\""
]
},
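{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since the rest of the notebook relies on the wrapper's methods (```defineDB```, ```defineCollection```, ```insertRecords```, ```sortRecords```, ```dropCollection```), here is a rough sketch of what such a driver might look like internally on top of ```pymongo```. This is only an illustration of the idea; the real ```wrapymongo``` module on GitHub may differ."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pymongo\n",
"\n",
"class SketchDriver:\n",
"    \"\"\"Illustrative stand-in for wrapymongo.driver (assumed interface).\"\"\"\n",
"\n",
"    def __init__(self, link):\n",
"        self.client = pymongo.MongoClient(link)\n",
"        self.db = None\n",
"        self.collection = None\n",
"\n",
"    def defineDB(self, dbName):\n",
"        self.db = self.client[dbName]\n",
"\n",
"    def defineCollection(self, collectionName):\n",
"        self.collection = self.db[collectionName]\n",
"\n",
"    def insertRecords(self, records):\n",
"        # Accept a single document or a list of documents\n",
"        try:\n",
"            if isinstance(records, list):\n",
"                self.collection.insert_many(records)\n",
"            else:\n",
"                self.collection.insert_one(records)\n",
"            return True\n",
"        except Exception as e:\n",
"            print(\"insertRecords failed:\", e)\n",
"            return False\n",
"\n",
"    def sortRecords(self, keys, limit):\n",
"        # keys is a list of (field, direction) tuples, e.g. [('MinPrice', 1)]\n",
"        return list(self.collection.find().sort(keys).limit(limit))\n",
"\n",
"    def dropCollection(self):\n",
"        self.collection.drop()\n",
"        return True"
]
},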
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I am adding a function that acts like a template maker that gets an object instance of our MongoDB class."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"# Function to make wrapymongo object\n",
"def makeObject(link,dbName = \"SkyScanner\", dbCollection=\"test\"):\n",
" mdbobject = wrapymongo.driver(link)\n",
" mdbobject.defineDB(dbName)\n",
" mdbobject.defineCollection(dbCollection)\n",
" return mdbobject"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We instantiate our objects and clear their contents if they have any."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mdbOutgoing = makeObject(link,dbName = database,dbCollection = outgoingTable)\n",
"mdbPlaces = makeObject(link,dbName = database,dbCollection = placesTable)\n",
"mdbIncoming = makeObject(link,dbName = database,dbCollection = incomingTable)\n",
"\n",
"mdbOutgoing.dropCollection()\n",
"mdbPlaces.dropCollection()\n",
"mdbIncoming.dropCollection()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And we initalize our arrays as usual. I saved our ```finder``` class in another file called ```flightfinder.py``` to make it easier to work with."
]
},
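{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of how the ```flightfinder``` version could collect results on top of the class defined earlier. This is an assumed illustration (the subclass name ```collectingFinder``` and its method bodies are mine), not the actual contents of ```flightfinder.py```."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import threading\n",
"\n",
"class collectingFinder(finder):\n",
"    \"\"\"Keeps the raw quotes around so they can be fetched after the threads finish.\"\"\"\n",
"\n",
"    def __init__(self, *args, **kwargs):\n",
"        super().__init__(*args, **kwargs)\n",
"        self.quotes = []\n",
"        self._lock = threading.Lock()\n",
"\n",
"    def printResult(self, resultJSON, date):\n",
"        # Collect the raw quotes thread-safely in addition to printing them\n",
"        with self._lock:\n",
"            self.quotes.extend(resultJSON.get(\"Quotes\", []))\n",
"        super().printResult(resultJSON, date)\n",
"\n",
"    def getQuotes(self):\n",
"        return self.quotes\n",
"\n",
"    def getAirports(self):\n",
"        return self.airports"
]
},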
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"import flightfinder as ff\n",
"\n",
"airports = { }\n",
"\n",
"\n",
"outgoing_flight_finder = ff.finder()\n",
"outgoing_flight_finder.setHeaders(headers)\n",
"\n",
"incoming_flight_finder = ff.finder()\n",
"incoming_flight_finder.setHeaders(headers)\n",
"\n",
"source_array = {\"BERL-sky\"} \n",
"destination_array = {\"MAD-sky\", \"BCN-sky\", \"SVQ-sky\", \"VLC-sky\"}\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let it rip!"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"An error occured:documents must be a non-empty list\n"
]
}
],
"source": [
"processing_start = time.time()\n",
"\n",
"with concurrent.futures.ThreadPoolExecutor(max_workers=32) as executor:\n",
" for single_date in daterange_source:\n",
" for destination in destination_array:\n",
" for source in source_array:\n",
" request_start = time.time()\n",
" executor.submit(outgoing_flight_finder.browseQuotes,source, destination,single_date)\n",
"\n",
"outgoingQuotes = outgoing_flight_finder.getQuotes()\n",
"\n",
"for quote in outgoingQuotes:\n",
" mdbOutgoing.insertRecords(quote) \n",
"\n",
"airports.update(outgoing_flight_finder.getAirports())\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"At this point, you can go on the previously installed mongodb client application and see how the database has updated with a tablename _outgoing_ and all the entries related to it! We are just adding all the quotes, one by one, in the database. Now, let's do the same for the \"coming back\" part of the trip. "
]
},
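{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to verify this from Python rather than a GUI, counting the documents is enough. Again an optional check that assumes ```pymongo``` is installed; the database and collection names match the variables defined above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from pymongo import MongoClient\n",
"\n",
"check_client = MongoClient(link)\n",
"# One document per quote that was inserted into the Outgoing collection\n",
"print(check_client[database][outgoingTable].count_documents({}))"
]
},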
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"destination_begin_date = \"2020-01-24\"\n",
"destination_end_date = \"2020-01-30\" \n",
"daterange_destination = pd.date_range(destination_begin_date, destination_end_date)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"# We reverse the arrays here\n",
"with concurrent.futures.ThreadPoolExecutor(max_workers=32) as executor:\n",
" for single_date in daterange_destination:\n",
" for destination in source_array:\n",
" for source in destination_array:\n",
" request_start = time.time()\n",
" executor.submit(incoming_flight_finder.browseQuotes,source, destination,single_date)\n",
"\n",
"incomingQuotes = incoming_flight_finder.getQuotes()\n",
"\n",
"for quote in incomingQuotes:\n",
" mdbIncoming.insertRecords(quote) \n",
"\n",
"airports.update(incoming_flight_finder.getAirports())\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"At this point we have all we need stored in the database. We just have to make sense of all the data. So first, let's get top 20 cheapest entries from each of the colllection. We just make a query for ```sortRecords``` with ```MinPrice``` as key and sorted from lowest to highest (indicated by 1)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"# Sort both dbs by cheapest\n",
"cheapestOutgoingFlights = {}\n",
"cheapestOutgoingFlights = mdbOutgoing.sortRecords([('MinPrice', 1)], 20)\n",
"\n",
"cheapestIncomingFlights = {}\n",
"cheapestIncomingFlights = mdbIncoming.sortRecords([('MinPrice', 1)], 20)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To get the cheapest trip, we check for all possible combinations between incoming and outgoing quotes. Let's just combine the data first and print it."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"finalListElement = {}\n",
"finalList = []\n",
"for incomingQuotes in cheapestIncomingFlights:\n",
" for outgoingQuotes in cheapestOutgoingFlights:\n",
" finalListElement = {}\n",
" finalListElement[\"TotalPrice\"] = incomingQuotes[\"MinPrice\"] + outgoingQuotes[\"MinPrice\"]\n",
" finalListElement[\"TakeOff1\"] = airports[outgoingQuotes[\"OutboundLeg\"][\"OriginId\"]] \n",
" finalListElement[\"Land1\"] = airports[outgoingQuotes[\"OutboundLeg\"][\"DestinationId\"]] \n",
" finalListElement[\"TakeOff2\"] = airports[incomingQuotes[\"OutboundLeg\"][\"OriginId\"]] \n",
" finalListElement[\"Land2\"] = airports[incomingQuotes[\"OutboundLeg\"][\"DestinationId\"]] \n",
" finalListElement[\"Date1\"] = outgoingQuotes[\"OutboundLeg\"][\"DepartureDate\"]\n",
" finalListElement[\"Date2\"] = incomingQuotes[\"OutboundLeg\"][\"DepartureDate\"]\n",
" finalList.append(finalListElement)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[{'TotalPrice': 54.0, 'TakeOff1': 'Berlin Tegel', 'Land1': 'Valencia', 'TakeOff2': 'Seville', 'Land2': 'Berlin Schoenefeld', 'Date1': '2020-01-21T00:00:00', 'Date2': '2020-01-28T00:00:00'}, {'TotalPrice': 62.0, 'TakeOff1': 'Berlin Tegel', 'Land1': 'Madrid', 'TakeOff2': 'Seville', 'Land2': 'Berlin Schoenefeld', 'Date1': '2020-01-22T00:00:00', 'Date2': '2020-01-28T00:00:00'}, {'TotalPrice': 62.0, 'TakeOff1': 'Berlin Tegel', 'Land1': 'Madrid', 'TakeOff2': 'Seville', 'Land2': 'Berlin Schoenefeld', 'Date1': '2020-01-18T00:00:00', 'Date2': '2020-01-28T00:00:00'}, {'TotalPrice': 62.0, 'TakeOff1': 'Berlin Tegel', 'Land1': 'Madrid', 'TakeOff2': 'Seville', 'Land2': 'Berlin Schoenefeld', 'Date1': '2020-01-24T00:00:00', 'Date2': '2020-01-28T00:00:00'}, {'TotalPrice': 62.0, 'TakeOff1': 'Berlin Schoenefeld', 'Land1': 'Madrid', 'TakeOff2': 'Seville', 'Land2': 'Berlin Schoenefeld', 'Date1': '2020-01-21T00:00:00', 'Date2': '2020-01-28T00:00:00'}, {'TotalPrice': 62.0, 'TakeOff1': 'Berlin Tegel', 'Land1': 'Valencia', 'TakeOff2': 'Seville', 'Land2': 'Berlin Schoenefeld', 'Date1': '2020-01-23T00:00:00', 'Date2': '2020-01-28T00:00:00'}, {'TotalPrice': 62.0, 'TakeOff1': 'Berlin Tegel', 'Land1': 'Madrid', 'TakeOff2': 'Seville', 'Land2': 'Berlin Schoenefeld', 'Date1': '2020-01-19T00:00:00', 'Date2': '2020-01-28T00:00:00'}, {'TotalPrice': 62.0, 'TakeOff1': 'Berlin Schoenefeld', 'Land1': 'Valencia', 'TakeOff2': 'Seville', 'Land2': 'Berlin Schoenefeld', 'Date1': '2020-01-24T00:00:00', 'Date2': '2020-01-28T00:00:00'}, {'TotalPrice': 63.0, 'TakeOff1': 'Berlin Schoenefeld', 'Land1': 'Seville', 'TakeOff2': 'Seville', 'Land2': 'Berlin Schoenefeld', 'Date1': '2020-01-22T00:00:00', 'Date2': '2020-01-28T00:00:00'}, {'TotalPrice': 63.0, 'TakeOff1': 'Berlin Schoenefeld', 'Land1': 'Valencia', 'TakeOff2': 'Seville', 'Land2': 'Berlin Schoenefeld', 'Date1': '2020-01-22T00:00:00', 'Date2': '2020-01-28T00:00:00'}]\n"
]
}
],
"source": [
"print(finalList[:10])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Awesome!! Now we just leverage our mongodb to put all these records in the database and then sort them by their total cost"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The Top ten cheapest flights are:\n",
"\n",
"*****\n",
"Onwards: 2020-01-21T00:00:00 Berlin Tegel --> Valencia \n",
"Return: 2020-01-28T00:00:00 Seville --> Berlin Schoenefeld \n",
" \t | 54.0 EUR\n",
"\n",
"*****\n",
"Onwards: 2020-01-21T00:00:00 Berlin Tegel --> Valencia \n",
"Return: 2020-01-29T00:00:00 Valencia --> Berlin Schoenefeld \n",
" \t | 55.0 EUR\n",
"\n",
"*****\n",
"Onwards: 2020-01-21T00:00:00 Berlin Tegel --> Valencia \n",
"Return: 2020-01-30T00:00:00 Valencia --> Berlin Schoenefeld \n",
" \t | 55.0 EUR\n",
"\n",
"*****\n",
"Onwards: 2020-01-21T00:00:00 Berlin Tegel --> Valencia \n",
"Return: 2020-01-30T00:00:00 Barcelona --> Berlin Schoenefeld \n",
" \t | 55.0 EUR\n",
"\n",
"*****\n",
"Onwards: 2020-01-21T00:00:00 Berlin Tegel --> Valencia \n",
"Return: 2020-01-26T00:00:00 Barcelona --> Berlin Schoenefeld \n",
" \t | 57.0 EUR\n",
"\n",
"*****\n",
"Onwards: 2020-01-21T00:00:00 Berlin Tegel --> Valencia \n",
"Return: 2020-01-30T00:00:00 Seville --> Berlin Schoenefeld \n",
" \t | 59.0 EUR\n",
"\n",
"*****\n",
"Onwards: 2020-01-21T00:00:00 Berlin Tegel --> Valencia \n",
"Return: 2020-01-28T00:00:00 Seville --> Berlin Schoenefeld \n",
" \t | 59.0 EUR\n",
"\n",
"*****\n",
"Onwards: 2020-01-21T00:00:00 Berlin Tegel --> Valencia \n",
"Return: 2020-01-30T00:00:00 Madrid --> Berlin Schoenefeld \n",
" \t | 59.0 EUR\n",
"\n",
"*****\n",
"Onwards: 2020-01-21T00:00:00 Berlin Tegel --> Valencia \n",
"Return: 2020-01-29T00:00:00 Barcelona --> Berlin Tegel \n",
" \t | 60.0 EUR\n",
"\n",
"*****\n",
"Onwards: 2020-01-21T00:00:00 Berlin Tegel --> Valencia \n",
"Return: 2020-01-28T00:00:00 Barcelona --> Berlin Schoenefeld \n",
" \t | 60.0 EUR\n"
]
}
],
"source": [
"mdbFinal = makeObject(link, dbName=database, dbCollection=\"FinalDatabase\")\n",
"mdbFinal.dropCollection()\n",
"mdbFinal.insertRecords(finalList)\n",
"\n",
"print(\"The Top ten cheapest flights are:\")\n",
"topQuotes = mdbFinal.sortRecords([('TotalPrice', 1)], 10)\n",
"for quote in topQuotes:\n",
" print(\"\\n*****\\nOnwards: \" + quote[\"Date1\"] + \" \" + quote[\"TakeOff1\"] + \" --> \" + quote[\"Land1\"] + \" \\nReturn: \" +\n",
" quote[\"Date2\"] + \" \" + quote[\"TakeOff2\"] + \" --> \" + quote[\"Land2\"] + \" \\n \\t | \" + \"%s EUR\" % quote[\"TotalPrice\"])\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Wohoo! We have finally received what we were looking for. Now I have some cool options to choose my trip from.\n",
"\n",
"If you have made it until this point, then congratulations!! You have conqured the mountain and the summit is yours! Now its time for retrospection. Was this really necessary, or you could have just flown on the top of the using helicopter? (Wiz: could you just have used Google Flights over and over to do this?) Yes!! Is our path (solution) the most elegent and the easiest of all? Of course not! Does it even make sense to use Docker and MongoDB for such small tasks? Mostly not! Is it overengineered? You bet! Is it at least useful? Mostly not as we don't even get the timings!\n",
"\n",
"But then, even in this toy problem, we went through major steps in software designing. We developed a real, scalable system which can give us some results. It may seem useless to just get the chepest flights, but we can easily extend it to any number of parameters that we want. We could sort using dates, airlines, stopovers and create a real product. This was somewhat real problem, and we found a real solution. I think thats a win!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The code of this project is available on GitHub. If you would like to contribute towards it, just send me a pull request. Also, if you think this project can be developed into a real website which a lot of people would like to use it for, hit me up!!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
}
},
"nbformat": 4,
"nbformat_minor": 2
}