Complete Solution to Machinehack's Predicting Restaurant Food Cost Hackathon.
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "Predict_Restaurant_Food_Cost_Final.ipynb",
"version": "0.3.2",
"provenance": [],
"collapsed_sections": []
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"accelerator": "GPU"
},
"cells": [
{
"metadata": {
"id": "KH0KhRX3hO4a",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"# Predicting Restaurant Food Cost Hackathon by MachineHack\n",
"\n",
"This is one of many approaches to solving the [\"Predicting Restaurant Food Cost Hackathon\"](https://www.machinehack.com/course/predicting-restaurant-food-cost-hackathon/), the latest hackathon by [MachineHack](https://www.machinehack.com/course/predicting-restaurant-food-cost-hackathon/).\n",
"\n",
"This tutorial is for all data science enthusiasts who have just begun their journey. Use it to learn and submit your predictions at MachineHack. The winner will get a free pass to the MachineCon 2019 event.\n",
"\n",
"Check out the details here: https://www.machinehack.com/course/predicting-restaurant-food-cost-hackathon/"
]
},
{
"metadata": {
"id": "uJ8CZAQukbmD",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"## Importing the Data Sets"
]
},
{
"metadata": {
"id": "aurtDLutuv3k",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"training_set = pd.read_excel(\"Data_Train.xlsx\")\n",
"test_set = pd.read_excel(\"Data_Test.xlsx\")"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "U8sBPtYeggZW",
"colab_type": "code",
"outputId": "0bee7791-cd78-4ad7-d802-cbd168844586",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
}
},
"cell_type": "code",
"source": [
"training_set.head()"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" TITLE RESTAURANT_ID \\\n",
"0 CASUAL DINING 9438 \n",
"1 CASUAL DINING,BAR 13198 \n",
"2 CASUAL DINING 10915 \n",
"3 QUICK BITES 6346 \n",
"4 DESSERT PARLOR 15387 \n",
"\n",
" CUISINES \\\n",
"0 Malwani, Goan, North Indian \n",
"1 Asian, Modern Indian, Japanese \n",
"2 North Indian, Chinese, Biryani, Hyderabadi \n",
"3 Tibetan, Chinese \n",
"4 Desserts \n",
"\n",
" TIME CITY LOCALITY RATING \\\n",
"0 11am – 4pm, 7:30pm – 11:30pm (Mon-Sun) Thane Dombivali East 3.6 \n",
"1 6pm – 11pm (Mon-Sun) Chennai Ramapuram 4.2 \n",
"2 11am – 3:30pm, 7pm – 11pm (Mon-Sun) Chennai Saligramam 3.8 \n",
"3 11:30am – 1am (Mon-Sun) Mumbai Bandra West 4.1 \n",
"4 11am – 1am (Mon-Sun) Mumbai Lower Parel 3.8 \n",
"\n",
" VOTES COST \n",
"0 49 votes 1200 \n",
"1 30 votes 1500 \n",
"2 221 votes 800 \n",
"3 24 votes 800 \n",
"4 165 votes 300 "
],
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>TITLE</th>\n",
" <th>RESTAURANT_ID</th>\n",
" <th>CUISINES</th>\n",
" <th>TIME</th>\n",
" <th>CITY</th>\n",
" <th>LOCALITY</th>\n",
" <th>RATING</th>\n",
" <th>VOTES</th>\n",
" <th>COST</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>CASUAL DINING</td>\n",
" <td>9438</td>\n",
" <td>Malwani, Goan, North Indian</td>\n",
" <td>11am – 4pm, 7:30pm – 11:30pm (Mon-Sun)</td>\n",
" <td>Thane</td>\n",
" <td>Dombivali East</td>\n",
" <td>3.6</td>\n",
" <td>49 votes</td>\n",
" <td>1200</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>CASUAL DINING,BAR</td>\n",
" <td>13198</td>\n",
" <td>Asian, Modern Indian, Japanese</td>\n",
" <td>6pm – 11pm (Mon-Sun)</td>\n",
" <td>Chennai</td>\n",
" <td>Ramapuram</td>\n",
" <td>4.2</td>\n",
" <td>30 votes</td>\n",
" <td>1500</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>CASUAL DINING</td>\n",
" <td>10915</td>\n",
" <td>North Indian, Chinese, Biryani, Hyderabadi</td>\n",
" <td>11am – 3:30pm, 7pm – 11pm (Mon-Sun)</td>\n",
" <td>Chennai</td>\n",
" <td>Saligramam</td>\n",
" <td>3.8</td>\n",
" <td>221 votes</td>\n",
" <td>800</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>QUICK BITES</td>\n",
" <td>6346</td>\n",
" <td>Tibetan, Chinese</td>\n",
" <td>11:30am – 1am (Mon-Sun)</td>\n",
" <td>Mumbai</td>\n",
" <td>Bandra West</td>\n",
" <td>4.1</td>\n",
" <td>24 votes</td>\n",
" <td>800</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>DESSERT PARLOR</td>\n",
" <td>15387</td>\n",
" <td>Desserts</td>\n",
" <td>11am – 1am (Mon-Sun)</td>\n",
" <td>Mumbai</td>\n",
" <td>Lower Parel</td>\n",
" <td>3.8</td>\n",
" <td>165 votes</td>\n",
" <td>300</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
]
},
"metadata": {
"tags": []
},
"execution_count": 11
}
]
},
{
"metadata": {
"id": "kROz9ISJggC7",
"colab_type": "code",
"outputId": "aca5a101-daea-4c39-83b2-c133ab44755c",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
}
},
"cell_type": "code",
"source": [
"test_set.head()"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" TITLE RESTAURANT_ID CUISINES \\\n",
"0 CASUAL DINING 4085 North Indian, Chinese, Mughlai, Kebab \n",
"1 QUICK BITES 12680 South Indian, Fast Food, Pizza, North Indian \n",
"2 CASUAL DINING 1411 North Indian, Seafood, Biryani, Chinese \n",
"3 None 204 Biryani \n",
"4 QUICK BITES 13453 South Indian, Kerala \n",
"\n",
" TIME CITY LOCALITY \\\n",
"0 12noon – 12midnight (Mon-Sun) Noida Sector 18 \n",
"1 7am – 12:30AM (Mon-Sun) Mumbai Grant Road \n",
"2 11am – 11:30pm (Mon-Sun) Mumbai Marine Lines \n",
"3 9am – 10pm (Mon, Wed, Thu, Fri, Sat, Sun), 10:... Faridabad NIT \n",
"4 11am – 10pm (Mon-Sun) Kochi Kaloor \n",
"\n",
" RATING VOTES \n",
"0 4.3 564 votes \n",
"1 4.2 61 votes \n",
"2 3.8 350 votes \n",
"3 3.8 1445 votes \n",
"4 3.6 23 votes "
],
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>TITLE</th>\n",
" <th>RESTAURANT_ID</th>\n",
" <th>CUISINES</th>\n",
" <th>TIME</th>\n",
" <th>CITY</th>\n",
" <th>LOCALITY</th>\n",
" <th>RATING</th>\n",
" <th>VOTES</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>CASUAL DINING</td>\n",
" <td>4085</td>\n",
" <td>North Indian, Chinese, Mughlai, Kebab</td>\n",
" <td>12noon – 12midnight (Mon-Sun)</td>\n",
" <td>Noida</td>\n",
" <td>Sector 18</td>\n",
" <td>4.3</td>\n",
" <td>564 votes</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>QUICK BITES</td>\n",
" <td>12680</td>\n",
" <td>South Indian, Fast Food, Pizza, North Indian</td>\n",
" <td>7am – 12:30AM (Mon-Sun)</td>\n",
" <td>Mumbai</td>\n",
" <td>Grant Road</td>\n",
" <td>4.2</td>\n",
" <td>61 votes</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>CASUAL DINING</td>\n",
" <td>1411</td>\n",
" <td>North Indian, Seafood, Biryani, Chinese</td>\n",
" <td>11am – 11:30pm (Mon-Sun)</td>\n",
" <td>Mumbai</td>\n",
" <td>Marine Lines</td>\n",
" <td>3.8</td>\n",
" <td>350 votes</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>None</td>\n",
" <td>204</td>\n",
" <td>Biryani</td>\n",
" <td>9am – 10pm (Mon, Wed, Thu, Fri, Sat, Sun), 10:...</td>\n",
" <td>Faridabad</td>\n",
" <td>NIT</td>\n",
" <td>3.8</td>\n",
" <td>1445 votes</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>QUICK BITES</td>\n",
" <td>13453</td>\n",
" <td>South Indian, Kerala</td>\n",
" <td>11am – 10pm (Mon-Sun)</td>\n",
" <td>Kochi</td>\n",
" <td>Kaloor</td>\n",
" <td>3.6</td>\n",
" <td>23 votes</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
]
},
"metadata": {
"tags": []
},
"execution_count": 13
}
]
},
{
"metadata": {
"id": "cKcy8vq4gwQA",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"## Exploratory Data Analysis"
]
},
{
"metadata": {
"id": "VzYSKPlVOAT8",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"In the code blocks below, we will analyze the given data.\n",
"\n",
"1. Checking the features in the datasets\n",
"2. Data analysis:\n",
"\n",
"\n",
"* The training and test data are combined for further analysis.\n",
"* For the features TITLE and CUISINES, we identify the maximum number of items listed in a single cell and then split each feature into that many new features.\n",
"* In the code below, you will see that TITLE is split into two new columns, TITLE1 and TITLE2, while CUISINES is split into 8 different features.\n",
"* The NaNs in the CITY and LOCALITY columns are replaced by \"NOT AVAILABLE\".\n",
"* The unique values in TITLE, CUISINES, CITY and LOCALITY are also identified; these will be used for encoding in the Data Preprocessing part.\n",
"\n",
"\n",
"---\n",
"\n"
]
},
{
"metadata": {
"id": "DyrO4buQuyZs",
"colab_type": "code",
"outputId": "37710797-f6f7-4951-a49d-0e72de1953c6",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 1428
}
},
"cell_type": "code",
"source": [
"###############################################################################################################################################\n",
"\n",
"# Checking the features in the datasets\n",
"\n",
"###############################################################################################################################################\n",
"\n",
"\n",
"#Training Set\n",
"\n",
"print(\"\\nEDA on Training Set\\n\")\n",
"print(\"#\"*30)\n",
"print(\"\\nFeatures/Columns : \\n\", training_set.columns)\n",
"print(\"\\n\\nNumber of Features/Columns : \", len(training_set.columns))\n",
"print(\"\\nNumber of Rows : \",len(training_set))\n",
"print(\"\\n\\nData Types :\\n\", training_set.dtypes)\n",
"print(\"\\nContains NaN/Empty cells : \", training_set.isnull().values.any())\n",
"print(\"\\nTotal empty cells by column :\\n\", training_set.isnull().sum(), \"\\n\\n\")\n",
"\n",
"\n",
"# Test Set\n",
"print(\"#\"*30)\n",
"print(\"\\nEDA on Test Set\\n\")\n",
"print(\"#\"*30)\n",
"print(\"\\nFeatures/Columns : \\n\",test_set.columns)\n",
"print(\"\\n\\nNumber of Features/Columns : \",len(test_set.columns))\n",
"print(\"\\nNumber of Rows : \",len(test_set))\n",
"print(\"\\n\\nData Types :\\n\", test_set.dtypes)\n",
"print(\"\\nContains NaN/Empty cells : \", test_set.isnull().values.any())\n",
"print(\"\\nTotal empty cells by column :\\n\", test_set.isnull().sum())\n",
"\n",
"\n",
"\n"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"\n",
"EDA on Training Set\n",
"\n",
"##############################\n",
"\n",
"Features/Columns : \n",
" Index(['TITLE', 'RESTAURANT_ID', 'CUISINES', 'TIME', 'CITY', 'LOCALITY',\n",
" 'RATING', 'VOTES', 'COST'],\n",
" dtype='object')\n",
"\n",
"\n",
"Number of Features/Columns : 9\n",
"\n",
"Number of Rows : 12690\n",
"\n",
"\n",
"Data Types :\n",
" TITLE object\n",
"RESTAURANT_ID int64\n",
"CUISINES object\n",
"TIME object\n",
"CITY object\n",
"LOCALITY object\n",
"RATING object\n",
"VOTES object\n",
"COST int64\n",
"dtype: object\n",
"\n",
"Contains NaN/Empty cells : True\n",
"\n",
"Total empty cells by column :\n",
" TITLE 0\n",
"RESTAURANT_ID 0\n",
"CUISINES 0\n",
"TIME 0\n",
"CITY 112\n",
"LOCALITY 98\n",
"RATING 2\n",
"VOTES 1204\n",
"COST 0\n",
"dtype: int64 \n",
"\n",
"\n",
"##############################\n",
"\n",
"EDA on Test Set\n",
"\n",
"##############################\n",
"\n",
"Features/Columns : \n",
" Index(['TITLE', 'RESTAURANT_ID', 'CUISINES', 'TIME', 'CITY', 'LOCALITY',\n",
" 'RATING', 'VOTES'],\n",
" dtype='object')\n",
"\n",
"\n",
"Number of Features/Columns : 8\n",
"\n",
"Number of Rows : 4231\n",
"\n",
"\n",
"Data Types :\n",
" TITLE object\n",
"RESTAURANT_ID int64\n",
"CUISINES object\n",
"TIME object\n",
"CITY object\n",
"LOCALITY object\n",
"RATING object\n",
"VOTES object\n",
"dtype: object\n",
"\n",
"Contains NaN/Empty cells : True\n",
"\n",
"Total empty cells by column :\n",
" TITLE 0\n",
"RESTAURANT_ID 0\n",
"CUISINES 0\n",
"TIME 0\n",
"CITY 35\n",
"LOCALITY 30\n",
"RATING 2\n",
"VOTES 402\n",
"dtype: int64\n"
],
"name": "stdout"
}
]
},
{
"metadata": {
"id": "31f9dsW8u21w",
"colab_type": "code",
"outputId": "1677efa2-8e73-4555-a797-67addf317801",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 3196
}
},
"cell_type": "code",
"source": [
"###############################################################################################################################################\n",
"\n",
"# Data Analysis\n",
"\n",
"###############################################################################################################################################\n",
"\n",
"\n",
"# Combining the training and test sets for analysing the data and finding patterns\n",
"\n",
"data_temp = [training_set[['TITLE', 'RESTAURANT_ID', 'CUISINES', 'TIME', 'CITY', 'LOCALITY','RATING', 'VOTES']], test_set]\n",
"\n",
"data_temp = pd.concat(data_temp)\n",
"\n",
"\n",
"# Analysing Titles \n",
"\n",
"titles = list(data_temp['TITLE'])\n",
"\n",
"# Finding Maximum number of titles mentioned in a single cell\n",
"maxim = 1\n",
"for i in titles :\n",
" if len(i.split(',')) > maxim:\n",
" maxim = len(i.split(','))\n",
" \n",
"print(\"\\n\\nMaximum Titles in a Cell : \", maxim) \n",
"\n",
"all_titles = []\n",
"\n",
"for i in titles :\n",
" if len(i.split(',')) == 1:\n",
" all_titles.append(i.split(',')[0].strip().upper())\n",
" else :\n",
" for it in range(len(i.split(','))):\n",
" all_titles.append(i.split(',')[it].strip().upper())\n",
"\n",
"print(\"\\n\\nNumber of Unique Titles : \", len(pd.Series(all_titles).unique()))\n",
"print(\"\\n\\nUnique Titles:\\n\", pd.Series(all_titles).unique())\n",
"\n",
"all_titles = list(pd.Series(all_titles).unique())\n",
"\n",
"# Analysing cuisines \n",
"\n",
"cuisines = list(data_temp['CUISINES'])\n",
"\n",
"maxim = 1\n",
"for i in cuisines :\n",
" if len(i.split(',')) > maxim:\n",
" maxim = len(i.split(','))\n",
" \n",
"print(\"\\n\\nMaximum cuisines in a Cell : \", maxim) \n",
"\n",
"all_cuisines = []\n",
"\n",
"for i in cuisines :\n",
" if len(i.split(',')) == 1:\n",
" #print(i.split(',')[0])\n",
" all_cuisines.append(i.split(',')[0].strip().upper())\n",
" else :\n",
" for it in range(len(i.split(','))):\n",
" #print(i.split(',')[it])\n",
" all_cuisines.append(i.split(',')[it].strip().upper())\n",
"\n",
"print(\"\\n\\nNumber of Unique Cuisines : \", len(pd.Series(all_cuisines).unique()))\n",
"print(\"\\n\\nUnique Cuisines:\\n\", pd.Series(all_cuisines).unique())\n",
"\n",
"all_cuisines = list(pd.Series(all_cuisines).unique())\n",
"\n",
"# Analysing CITY\n",
"\n",
"all_cities = list(data_temp['CITY'])\n",
"\n",
"for i in range(len(all_cities)):\n",
" if type(all_cities[i]) == float:\n",
" all_cities[i] = 'NOT AVAILABLE'\n",
" all_cities[i] = all_cities[i].strip().upper()\n",
" \n",
"print(\"\\n\\nNumber of Unique cities (Including NOT AVAILABLE): \", len(pd.Series(all_cities).unique()))\n",
"print(\"\\n\\nUnique Cities:\\n\", pd.Series(all_cities).unique())\n",
" \n",
"all_cities = list(pd.Series(all_cities).unique())\n",
"\n",
"\n",
"# Cleaning LOCALITY\n",
"\n",
"all_localities = list(data_temp['LOCALITY'])\n",
"\n",
"for i in range(len(all_localities)):\n",
" if type(all_localities[i]) == float:\n",
" all_localities[i] = 'NOT AVAILABLE'\n",
" all_localities[i] = all_localities[i].strip().upper()\n",
" \n",
"print(\"\\n\\nNumber of Unique Localities (Including NOT AVAILABLE) : \", len(pd.Series(all_localities).unique()))\n",
"print(\"\\n\\nUnique Localities:\\n\", pd.Series(all_localities).unique())\n",
"\n",
"all_localities = list(pd.Series(all_localities).unique())\n"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"\n",
"\n",
"Maximum Titles in a Cell : 2\n",
"\n",
"\n",
"Number of Unique Titles : 25\n",
"\n",
"\n",
"Unique Titles:\n",
" ['CASUAL DINING' 'BAR' 'QUICK BITES' 'DESSERT PARLOR' 'CAFÉ'\n",
" 'MICROBREWERY' 'BEVERAGE SHOP' 'IRANI CAFE' 'BAKERY' 'NONE' 'PUB'\n",
" 'FINE DINING' 'SWEET SHOP' 'LOUNGE' 'FOOD COURT' 'FOOD TRUCK' 'MESS'\n",
" 'KIOSK' 'CLUB' 'CONFECTIONERY' 'DHABA' 'MEAT SHOP' 'COCKTAIL BAR'\n",
" 'PAAN SHOP' 'BHOJANALYA']\n",
"\n",
"\n",
"Maximum cuisines in a Cell : 8\n",
"\n",
"\n",
"Number of Unique Cuisines : 130\n",
"\n",
"\n",
"Unique Cuisines:\n",
" ['MALWANI' 'GOAN' 'NORTH INDIAN' 'ASIAN' 'MODERN INDIAN' 'JAPANESE'\n",
" 'CHINESE' 'BIRYANI' 'HYDERABADI' 'TIBETAN' 'DESSERTS' 'SEAFOOD' 'CAFE'\n",
" 'PIZZA' 'BURGER' 'BAR FOOD' 'SOUTH INDIAN' 'FAST FOOD' 'BEVERAGES'\n",
" 'ARABIAN' 'MUGHLAI' 'MAHARASHTRIAN' 'PARSI' 'THAI' 'BAKERY' 'MOMOS'\n",
" 'CONTINENTAL' 'EUROPEAN' 'ROLLS' 'ANDHRA' 'ITALIAN' 'BBQ' 'FINGER FOOD'\n",
" 'TEA' 'AMERICAN' 'HEALTHY FOOD' 'COFFEE' 'INDONESIAN' 'KOREAN' 'NEPALESE'\n",
" 'ICE CREAM' 'MEXICAN' 'KERALA' 'INDIAN' 'MITHAI' 'STREET FOOD'\n",
" 'MALAYSIAN' 'VIETNAMESE' 'IRANIAN' 'KEBAB' 'JUICES' 'SANDWICH'\n",
" 'MEDITERRANEAN' 'SALAD' 'GUJARATI' 'RAJASTHANI' 'TEX-MEX' 'ROAST CHICKEN'\n",
" 'BURMESE' 'CHETTINAD' 'NORTH EASTERN' 'LEBANESE' 'COFFEE AND TEA' 'GRILL'\n",
" '' 'BIHARI' 'BENGALI' 'LUCKNOWI' 'AWADHI' 'STEAK' 'FRENCH' 'PORTUGUESE'\n",
" 'WRAPS' 'SRI LANKAN' 'ORIYA' 'ETHIOPIAN' 'KONKAN' 'SUSHI' 'SPANISH'\n",
" 'RUSSIAN' 'MANGALOREAN' 'TURKISH' 'BUBBLE TEA' 'AFGHAN' 'NAGA'\n",
" 'SINGAPOREAN' 'GERMAN' 'MIDDLE EASTERN' 'SINDHI' 'CANTONESE' 'HOT POT'\n",
" 'PAN ASIAN' 'SATAY' 'DUMPLINGS' 'KASHMIRI' 'RAW MEATS' 'DRINKS ONLY'\n",
" 'MOROCCAN' 'PANINI' 'CAFE FOOD' 'CHARCOAL CHICKEN' 'BELGIAN' 'MONGOLIAN'\n",
" 'TAMIL' 'AFRICAN' 'PAAN' 'ASSAMESE' 'HOT DOGS' 'POKÉ' 'BRITISH' 'BOHRI'\n",
" 'FUSION' 'ARMENIAN' 'SOUTH AMERICAN' 'GREEK' 'PAKISTANI' 'PERUVIAN'\n",
" 'CUISINE VARIES' 'IRISH' 'MULTI CUISINE' 'JEWISH' 'VEGAN' 'ORIENTAL'\n",
" 'MODERN AUSTRALIAN' 'EGYPTIAN' 'FISH AND CHIPS' 'BRAZILIAN' 'MISHTI'\n",
" 'FALAFEL' 'HAWAIIAN']\n",
"\n",
"\n",
"Number of Unique cities (Including NOT AVAILABLE): 445\n",
"\n",
"\n",
"Unique Cities:\n",
" ['THANE' 'CHENNAI' 'MUMBAI' 'BANGALORE' 'GURGAON' 'HYDERABAD' 'KOCHI'\n",
" 'THANE WEST' 'ANDHERI LOKHANDWALA' 'NEW DELHI' 'ANDHERI WEST'\n",
" 'MALAD EAST' '682036' 'BANGALOR' 'NAVI MUMBAI' 'BANDRA WEST' 'DELHI'\n",
" 'NOIDA' 'BANGALORE-560066' 'SECUNDERABAD' 'NOT AVAILABLE' 'INDIA'\n",
" 'MADHURANAGAR' 'CHENNAI TEYNAMPET' 'FARIDABAD' 'CHEMBUR.' 'MAHARASHTRA'\n",
" 'OPP GURUDWARA SHAKURPUR' 'TELAGANA LAND LINE:040-48507016' 'GHAZIABAD'\n",
" 'KARNATAKA' 'KERALA' 'EDAPPALLY' 'KADAVANTHRA' 'ERNAKULAM CIRCLE KOCHI'\n",
" 'BENGALORE' 'NEAR RELIANCE FRESH' 'KILPAUK' 'BENGALURU' 'KOTHAGUDA'\n",
" 'GOREGAON WEST' 'BANGLORE' 'TAMIL NADU' 'KAKKANAD' 'KOCHI ELAMKULAM'\n",
" 'OUTER RING ROAD' 'MULUND EAST'\n",
" 'SECUNDERABAD MAIN ROAD NEAR SIGNAL NMREC COLLEGE' 'TELANGANA'\n",
" 'PONNURUNI KOCHI' 'GACHIBOWLI' 'SEMMANCHERI'\n",
" '5TH MAIN TEACHERS COLONY KORAMANGALA BLOCK 1 BANGALORE 560034'\n",
" 'MUMBAI MAHIM' 'POWAI (NEXT TO POWAI PLAZA)' 'DOMBIVALI EAST'\n",
" 'KOCHI VYTTILA' 'KANDIVALI' 'KOCHI PALARIVATTOM' 'DEWAN RAMA ROAD'\n",
" 'GURUGRAM' 'SECTOR 51 NOIDA' 'KALOOR' 'BESANT NAGAR'\n",
" 'ARUMBAKKAM CHENNAI-600106.' 'ADJACENT TO COMMERCIAL STREET' 'DELHI NCR'\n",
" 'DWARKA' '682035.' 'KALYAN WEST' 'AVADI' 'KONDAPUR' 'MEHDIPATNAM'\n",
" 'GANDIPET' 'VELACHERY' 'PALLAVARAM' 'VIJAYA NAGAR' 'BTM LAYOUT'\n",
" 'CHENNAI 600034.'\n",
" 'METRO PILLAR NO 21. METTUGUDA MAIN ROAD NEAR RAILWAY DEGREE COLLEGE.'\n",
" 'CHENNAI - 600040' 'JP NAGAR BANGALORE' 'MADHAPUR' 'ERNAKULAM' 'SARJAPUR'\n",
" 'WHITEFIELD BANGALORE' 'KOCHI CHULLICKAL' 'KOCHI-683101'\n",
" 'BANGALORE - 560076' 'ROHINI' 'HYDERABAD BEHIND VACS PASTRIES'\n",
" 'HYDERABAD NEERUS EMPORIUM.' 'NAVI MUMBAI.' 'KAROL BAGH' 'PERUNGUDI'\n",
" 'THYKOODAM' 'GREATER NOIDA' 'BANGALORE.' 'KHAIRATABAD' 'CHULLICKAL'\n",
" 'GRANT ROAD WEST' 'HITECH CITY' 'WEST MAREDPALLY' 'MUMBAI - 400007'\n",
" 'CHENNAI PADUR' 'CHANDER NAGAR NEW DELHI' 'NEDUMBASSERY' 'MG ROAD'\n",
" 'NAYA NAGAR MIRA ROAD' 'PITAMPURA' 'LOWER PAREL' 'HBR LAYOUT'\n",
" 'TELANGANA 500003' 'RAJIV GANDHI NAGAR' 'NEW DELHI.' 'MEDAVAKKAM'\n",
" 'SATHYA NAGAR' 'P.O KOCHI' 'BEHIND RAMALAYAM TEMPLE' 'PALARIVATTOM'\n",
" 'BRIGADE ROAD' 'MUMBAI.' 'MUMBAI ANDHERI EAST' 'VIRAR WEST' 'B-1 STAGE'\n",
" 'CHENNAI KOVALAM' 'HYDERABAD.' 'ALUVA' 'TELANGANA 500034'\n",
" 'IOB BANK KAMALA NAGAR' 'HSR LAYOUT' 'MARINE DRIVE' 'DLF GALLERIA'\n",
" 'NALLATHAMBI MAIN ROAD' 'CHENNAI OPP: VASANTH & CO' 'CITYPARK'\n",
" 'KARNATAKA 560103' 'BHAYANDAR' 'ALUVA CIRCLE' 'THAMMENAHALLI VILLAGE'\n",
" 'SG PALYA' 'ATTAPUR.' 'NEAR SHANGRILLA BUS STOP' 'KHAR (WEST)' 'ROAD 3'\n",
" 'KUKATPALLY' 'FARIDABD' 'TELANGANA 500032' 'DILSUKHNAGAR'\n",
" 'MOGAPPAIR. CHENNAI' 'NEAR MUNRSHWARA TEMPLE' 'OFF BRIGADE ROAD'\n",
" 'KHAR WEST' 'POTHERI' 'CHENNAI PERUNGUDI' 'CHENNAI THURAIPAKKAM'\n",
" 'OMR KARAPAKKAM' 'HYDERABAD-500032' 'MUMBAI DOMBIVALI EAST'\n",
" 'CHENNAI THOUSAND LIGHTS' 'MAHIM' 'LINGAMPALLY' 'POWAI'\n",
" 'NEW DELHI-110024' 'CHENNAI- 600107' 'KERALA 683104' 'VASAI WEST.'\n",
" 'THANE (W)' 'NEAR SANTOSH BANJARA HYDERABAD'\n",
" 'BANASWADI (NEXT TO INDIAN BANK) BANGALORE' 'BTM BANGALORE'\n",
" 'GREATER KAILASH 2 NEW DELHI' 'SECUNDERABAD ECIL'\n",
" 'BANGALORE KORAMANGALA 7TH BLOCK' 'BANGALORE : 560085'\n",
" 'GACHIBOWLI HYDERABAD'\n",
" 'CPR LAYOUT HARLUR MAIN ROAD OPPOSITE TO OZONE EVER GREEN APARTMENT BANGALORE -'\n",
" 'ECR NEELANKARAI CHENNAI 600115' 'WARD X11' 'PERUMBAVOOR'\n",
" 'MIRA RAOD EAST' 'KERALA 682013' 'CHENNAI.' 'POKHRAN ROAD 2'\n",
" 'UTTAR PRADESH' 'KARNATAKA 560102' 'MUMBAI - 400013' 'NAHARPAR'\n",
" 'HOSUR ROAD' 'NEAR BHARAT PETROLEUM.'\n",
" 'CHENNAI (BANG OPPOSITE INDIAN BANK)' 'SRIRAM NAGAR' 'WEST MUMBAI'\n",
" 'VYTTILA' 'BANJARA HILLS' 'MALAPALLIPURAM P .O THRISSUR'\n",
" 'ANDHERI WEST MUMBAI' 'KARNATAKA 560043' 'PANAMPILLY NAGAR'\n",
" 'BORIVALI EAST.' 'ECIL' 'JUBILEE HILLS'\n",
" 'AMRIT KAUR MARKET OPPOSITE NEW DELHI RAILWAY STATION PAHARGANJ'\n",
" 'CHENNAI OPPOSITE 5C BUS STAND' 'TELENGANA' 'KOCHI RAVIPURAM' 'RAJANPADA'\n",
" 'MAHABALIPURAM' 'SECUNDERABAD. WE HAVE NO BRANCHES.' 'TELANGANA 500081'\n",
" 'GURGOAN' 'ELAMAKKARA' 'SECTOR 1' 'BANDRA W' 'KOLATHUR'\n",
" 'CHENNAI MAHABALIPURAM' '3RD STREET' 'MUMBAI CHAKALA' 'BORIVALI WEST'\n",
" 'RODEO DRIVE SECTOR 49' 'PALLIMUKKU' 'DELHI 110085' 'SECTOR 51'\n",
" 'CHAMPAPET' 'ANDAVAR NAGAR' 'BANGALORE - 560103' 'KERALA 690525'\n",
" 'OPP MUKTESHWAR ASHRAM POWAI' 'NUNGAMBAKKAM' 'BK GUDA'\n",
" 'JOGESHWARI (W) MUMBAI' 'KUKATAPALLY' 'NEAR SECTOR 110 NOIDA' 'NAVALLUR'\n",
" 'BESIDE EXCELLENCY GARDENS' 'MUMBAI - 80' 'BEGUMPET'\n",
" 'MAHARAJA HOTEL BESIDE GARDANIA BAR' 'ASHOK VIHAR PHASE 1' 'TRIVANDRUM'\n",
" 'KOCHI-18' 'NARAYANGUDA' 'THEVERA' 'CHENNAI-40' 'PALM BEACH ROAD'\n",
" 'EAST COAST ROAD (ECR)' 'RAMAPURAM' 'CHENNAI CHROMPET' 'NANDANAM' 'SAKET'\n",
" 'MG ROAD ERNAKULAM' 'ANDHERI LOKHANDWALA.' 'INDIRANAGAR' 'THIRUVANMIYUR'\n",
" 'AMBATTUR' 'BANGLAORE' 'CHENNAI - 34 LANDMARK - NEAR LOYOLA COLLEGE'\n",
" 'ANNA NAGAR WEST' 'OLD RAILWAY ROAD' 'EAST MUMBAI'\n",
" 'KANAKAPURA ROAD BANGLORE' 'KOCHI KAKKANAD' 'KALYAN'\n",
" 'NEAR RAMLILA GROUND' 'SERILINGAMPALLY' 'HIMAYATH NAGAR' 'NALLALA STREET'\n",
" 'ANNA SALAI' 'OLD DELHI' 'WAGLE ESTATE' '1ST STAGE' 'KOCHI-16'\n",
" 'KOCHI INTERNATIONAL AIRPORT VIP ROAD' 'FIRST STREET' 'CHENN AI'\n",
" '6 & 7 - 4/64 SUBHASH NAGAR' '1ST TAVAREKERE' 'PERAMBUR'\n",
" 'VAISHALI GHAZIABAD' 'THANISANDRA' 'BLOCK F' 'SECTOR 7 DWARKA'\n",
" 'OPPOSITE BARATHI GAS COMPANY' 'VADAPALANI' 'KONDAPUR.' 'BADLAPUR WEST.'\n",
" 'KALAMASSERY' 'PALAVAKKAM' 'TCS SYNERGY PARK' 'BTM 1ST STAGE'\n",
" 'MAHADEVPURA' 'NEW BEL ROAD 560054'\n",
" 'VELIAVEETIL HOUSE VIVEKANANDA NAGAR ELAMAKKARA' 'SHOLINGANALLUR'\n",
" 'MAHARASHTRA 400102' 'LOWER PAREL WEST' 'TRIPUNITHURA' 'MOGAPPAIR'\n",
" 'TELANGANA 500070' 'JP NAGAR' 'NAVI-MUMBAI' 'ASHOK NAGAR' 'MARATHAHALLI'\n",
" 'HARIDWAR APARTMENTS' 'KERALA 682001 INDIA' 'KARNATAKA 560037'\n",
" 'KERALA 683585' 'CHENNAI. (NEAR HOTEL MATSHYA)' 'INDIRAPURAM'\n",
" 'BEGUMPET HYDERABAD' 'MANIKONDA'\n",
" 'BANGALORE LAND MARK ABOVE MAHAVEER HARD WARE' 'KERALA 682304'\n",
" 'RAJARAJESHWARI NAGAR BANGALORE' 'GST ROAD' 'FORT KOCHI'\n",
" 'LAHARI APARTMENTS' 'RAMANTHAPUR' 'MULUND WEST' 'GURGAON HARYANA INDIA'\n",
" 'NEW DELHI..NEAR BY SBI BANK' 'KOCHI ALUVA 102' 'PHASE 1 BANGALORE'\n",
" 'HYDERABAD MANIKONDA'\n",
" 'MUMBAI THIS IS A DELIVERY & TAKE-AWAY RESTAURANT ONLY.' '10TH AVENUE'\n",
" 'UPPAL' 'NEW DELHI 110075' 'NIZAMPET' 'ULSOO' 'BANGALORE 560076'\n",
" 'PVR PLAZA CINEMA BUILDING CONNAUGHT PLACE' 'GURGAON HARYANA' 'CHROMEPET'\n",
" 'KERALA 682024' 'JANAKPURI' 'SECUNDERABAD.'\n",
" 'B.B.M.P EAST (KARNATAKA) - 560049' 'TAMBARAM' 'MALLESHWARAM BANGALORE'\n",
" 'VADAPALANI.' 'DIST. CENTER NEW DELHI' 'BANGALORE ROAD' 'KOCHI.'\n",
" 'THANE MUMBAI' 'KADUBESANAHALLI BANGALORE' 'VASAI WEST'\n",
" 'MIG HOUSING SOCIETY' 'HARYANA' 'BORIVALI WEST.' 'GOLF COURSE ROAD'\n",
" 'KHAR MUMBAI' 'NEAR JYOTHINIVAS COLLEGE' 'ANNA NAGAR EAST' 'MASAB TANK'\n",
" 'VASAI MUMBAI' 'PANATHUR MAIN ROAD' 'NEAR ANDHERI WEST STATION'\n",
" 'OPPOSITE TO WESTERN SIDE OF ITPL SERVICE GATE' 'KALKAJI' 'APR CHAMBERS'\n",
" 'TAMIL NADU 600102' 'MAHARASHTRA.' 'GANDHINAGAR RD'\n",
" 'NEAR ANDHERI EAST STATION' 'WHITEFIELD' 'KERALA 682036'\n",
" 'MIRA ROAD THANE MUMBAI' 'INDIA GATE NEW DELHI' 'BANGALORE - 560095'\n",
" 'SHOLINGANALLUR. CHENNAI' 'CHENNAI (ABOVE BOMBAY BRASSERIE)' 'CHENNAI 37'\n",
" '682024' 'GIRGAUM' 'GREATER KAILASH 1 (GK 1) NEW DELHI' 'KURLA (W)'\n",
" 'MUMBAI 400015' 'THANE WEST THANE WEST' 'KOCHI PANAMPILLY NAGAR' 'MARAD'\n",
" 'MAHARASHTRA 400092' 'NEAR SECTOR 34' 'MEHDIPATNAM HYDERABAD'\n",
" 'NALLAGANDLA' 'VANDALUR' 'CHENNAI 40' 'SECUNDERBAD' 'MM NAGAR'\n",
" 'MUMBAI 400070' 'CHITTETHUKKARA' 'BTM' 'DOMBIVLI' 'SAHAKARA NAGAR'\n",
" 'MOHAMMAD ALI ROAD MUMBAI' 'CHENNAI 600040' 'TAVAREKERE MAIN ROAD'\n",
" 'COMMUNITY CENTRE' 'KERALA 682022' 'DELH.' 'SECTOR-6 NOIDA 201301'\n",
" 'KAARAIKUDI COMPLEX' 'THIRUVANMIYUR (OPP EUROKIDS LB ROAD)'\n",
" 'VIRAR MUMBAI' 'TOLICHOWKI' 'HYDERABA' 'KERALA 682305' 'ALWARPET'\n",
" 'KERALA 682015' 'MUMBAI VEERA DESAI AREA' 'KERALA 682018' 'KERALA 682028'\n",
" 'SURARAM' 'CHENNAI VELACHERY'\n",
" 'FORUM SUJANA MALL OPPOSITE TO MALAYSIAN TOWNSHIP' 'OLD HAFEEZPET'\n",
" 'YOUSUFGUDA' 'CHENNAI-600008' 'MUMBAI ULHASNAGAR'\n",
" 'JOGESHWARI WEST MUMBAI' 'CHEPAUK' 'CHOWPATTY' 'CHURCH STREET'\n",
" 'BALAVINAYAGAR NAGAR CHENNAI' 'T-NAGAR CHENNAI' 'RA PURAM'\n",
" 'HYDERABAD.STAR HYPERMARKET OPPOSITE SIDE SERVICE ROAD'\n",
" 'CHENNAI INJAMBAKKAM' 'MUMBAI MUMBRA' 'HABSIGUDA' 'KURLA MUMBAI'\n",
" 'TELANGANA 500027' 'CHENNA' 'KERALA 682021' 'KANDIVALI WEST'\n",
" 'CHENNAI-119' 'NOIDA EXTENTION' 'SHIHAB THANGAL ROAD' 'NEW DELHI 110011'\n",
" 'MIUMBAI' 'BORIVALI (W) MUMBAI: 400 092.' 'VANASTHALIPURAM' 'KK ROAD'\n",
" 'CHENNAI - 600018' 'OPPOSITE ELLORA BUILDING']\n",
"\n",
"\n",
"Number of Unique Localities (Including NOT AVAILABLE) : 1611\n",
"\n",
"\n",
"Unique Localities:\n",
" ['DOMBIVALI EAST' 'RAMAPURAM' 'SALIGRAMAM' ... 'OFF CARTER ROAD'\n",
" 'SRM BACK GATE' 'PERRY CROSS ROAD']\n"
],
"name": "stdout"
}
]
},
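{
"metadata": {
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"*Note (added sketch, not part of the original solution):* the loop-based analysis above can also be written with vectorized pandas string methods. The cell below assumes the combined `data_temp` frame from the previous cell and a pandas version that provides `Series.explode`.\n"
]
},
{
"metadata": {
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"# Sketch: vectorized alternative to the loops above (assumes data_temp exists).\n",
"# str.split(',', expand=True) yields one column per item, so the column count\n",
"# equals the maximum number of titles listed in a single cell.\n",
"print(\"Maximum Titles in a Cell : \", data_temp['TITLE'].str.split(',', expand=True).shape[1])\n",
"\n",
"# explode() flattens the per-row lists of items into one Series for unique counting.\n",
"unique_titles = (data_temp['TITLE'].str.split(',')\n",
"                 .explode().str.strip().str.upper().unique())\n",
"print(\"Number of Unique Titles : \", len(unique_titles))\n"
],
"execution_count": 0,
"outputs": []
},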
{
"metadata": {
"id": "XVvyPUIKg8-t",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"## Data Cleaning"
]
},
{
"metadata": {
"id": "0OYvuR8TWIte",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"After the analysis, we proceed to cleaning the training and test sets.\n",
"\n",
"Some of the steps are the same as in the Data Analysis part above. There, we created a temporary dataset just to analyze the data; in the code block below, we apply the same steps to the actual training and test sets.\n",
"\n",
"The following steps are performed for both training_set and test_set:\n",
"\n",
"* Splitting TITLE and CUISINES into new feature sets and replacing NaNs/empty cells with the text \"NONE\".\n",
"* Replacing NaNs in CITY and LOCALITY with the text \"NOT AVAILABLE\".\n",
"* Converting RATING to type float.\n",
"* Cleaning the VOTES column and converting it to integers.\n",
"* Storing the cleaned features in new datasets (new_data_train & new_data_test).\n",
"\n",
"\n",
"\n",
"\n"
]
},
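{
"metadata": {
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"*Note (added sketch, not part of the original solution):* the repeated `try/except` splitting in the next cell can be expressed more compactly with pandas; one possible equivalent is sketched below.\n"
]
},
{
"metadata": {
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"# Sketch: split CUISINES into 8 columns in one call, padding missing\n",
"# positions with 'NONE' -- equivalent to the try/except loop below.\n",
"cuisine_cols = (training_set['CUISINES'].str.split(',', expand=True)\n",
"                .reindex(columns=range(8)))\n",
"cuisine_cols = cuisine_cols.apply(lambda c: c.str.strip().str.upper()).fillna('NONE')\n",
"cuisine_cols.columns = ['CUISINE%d' % (i + 1) for i in range(8)]\n"
],
"execution_count": 0,
"outputs": []
},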
{
"metadata": {
"id": "YtWf3Vyxu8gA",
"colab_type": "code",
"outputId": "199982b7-edb9-41b7-8233-e97cb0802287",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 799
}
},
"cell_type": "code",
"source": [
"###############################################################################################################################################\n",
"\n",
"# Data Cleaning\n",
"\n",
"###############################################################################################################################################\n",
"\n",
"\n",
"# Cleaning Training Set\n",
"#______________________\n",
"\n",
"# TITLE\n",
"\n",
"\n",
"titles = list(training_set['TITLE'])\n",
"\n",
"# Since the maximum number of titles in a cell is 2, we will split TITLE into 2 columns\n",
"T1 = []\n",
"T2 = []\n",
"\n",
"for i in titles:\n",
" T1.append(i.split(',')[0].strip().upper())\n",
" try :\n",
" T2.append(i.split(',')[1].strip().upper())\n",
" except :\n",
" T2.append('NONE')\n",
"\n",
"# appending NONE to Unique titles list\n",
"all_titles.append('NONE')\n",
"\n",
"#Cleaning CUISINES \n",
"\n",
"cuisines = list(training_set['CUISINES'])\n",
" \n",
"# Since the maximum number of cuisines in a cell is 8, we will split CUISINES into 8 columns\n",
" \n",
"C1 = []\n",
"C2 = []\n",
"C3 = []\n",
"C4 = []\n",
"C5 = []\n",
"C6 = []\n",
"C7 = []\n",
"C8 = []\n",
"\n",
"\n",
"for i in cuisines:\n",
" try :\n",
" C1.append(i.split(',')[0].strip().upper())\n",
" except :\n",
" C1.append('NONE')\n",
" try :\n",
" C2.append(i.split(',')[1].strip().upper())\n",
" except :\n",
" C2.append('NONE')\n",
" try :\n",
" C3.append(i.split(',')[2].strip().upper())\n",
" except :\n",
" C3.append('NONE')\n",
" try :\n",
" C4.append(i.split(',')[3].strip().upper())\n",
" except :\n",
" C4.append('NONE')\n",
" try :\n",
" C5.append(i.split(',')[4].strip().upper())\n",
" except :\n",
" C5.append('NONE')\n",
" try :\n",
" C6.append(i.split(',')[5].strip().upper())\n",
" except :\n",
" C6.append('NONE')\n",
" try :\n",
" C7.append(i.split(',')[6].strip().upper())\n",
" except :\n",
" C7.append('NONE')\n",
" try :\n",
" C8.append(i.split(',')[7].strip().upper())\n",
" except :\n",
" C8.append('NONE')\n",
"\n",
"# appending NONE to Unique cuisines list\n",
"all_cuisines.append('NONE')\n",
"\n",
"# Cleaning CITY\n",
"\n",
"cities = list(training_set['CITY'])\n",
"\n",
"for i in range(len(cities)):\n",
" if type(cities[i]) == float:\n",
" cities[i] = 'NOT AVAILABLE'\n",
" cities[i] = cities[i].strip().upper()\n",
" \n",
"\n",
"# Cleaning LOCALITY\n",
"\n",
"localities = list(training_set['LOCALITY'])\n",
"\n",
"for i in range(len(localities)):\n",
" if type(localities[i]) == float:\n",
" localities[i] = 'NOT AVAILABLE'\n",
" localities[i] = localities[i].strip().upper() \n",
" \n",
"\n",
"#Cleaning Rating\n",
"\n",
"rates = list(training_set['RATING'])\n",
"\n",
"for i in range(len(rates)) :\n",
" try:\n",
" rates[i] = float(rates[i])\n",
" except :\n",
" rates[i] = np.nan\n",
"\n",
"\n",
"# Votes\n",
" \n",
"votes = list(training_set['VOTES'])\n",
"\n",
"for i in range(len(votes)) :\n",
" try:\n",
" votes[i] = int(votes[i].split(\" \")[0].strip())\n",
" except :\n",
" pass \n",
" \n",
" \n",
"\n",
"new_data_train = {}\n",
"\n",
"new_data_train['TITLE1'] = T1\n",
"new_data_train['TITLE2'] = T2\n",
"new_data_train['RESTAURANT_ID'] = training_set[\"RESTAURANT_ID\"]\n",
"new_data_train['CUISINE1'] = C1\n",
"new_data_train['CUISINE2'] = C2\n",
"new_data_train['CUISINE3'] = C3\n",
"new_data_train['CUISINE4'] = C4\n",
"new_data_train['CUISINE5'] = C5\n",
"new_data_train['CUISINE6'] = C6\n",
"new_data_train['CUISINE7'] = C7\n",
"new_data_train['CUISINE8'] = C8\n",
"new_data_train['CITY'] = cities\n",
"new_data_train['LOCALITY'] = localities\n",
"new_data_train['RATING'] = rates\n",
"new_data_train['VOTES'] = votes\n",
"new_data_train['COST'] = training_set[\"COST\"]\n",
"\n",
"new_data_train = pd.DataFrame(new_data_train)\n",
"#______________________\n",
"\n",
"\n",
"\n",
"#______________________\n",
"# Cleaning Test Set\n",
"#______________________\n",
"\n",
"# TITLE\n",
"\n",
"titles = list(test_set['TITLE'])\n",
"\n",
"# Since the maximum number of titles in a cell is 2, we split TITLE into 2 columns\n",
"T1 = []\n",
"T2 = []\n",
"\n",
"for i in titles:\n",
" T1.append(i.split(',')[0].strip().upper())\n",
" try :\n",
" T2.append(i.split(',')[1].strip().upper())\n",
" except :\n",
" T2.append('NONE')\n",
"\n",
"\n",
"#Cleaning CUISINES \n",
"\n",
"cuisines = list(test_set['CUISINES'])\n",
" \n",
"# Since the maximum number of cuisines in a cell is 8, we split CUISINES into 8 columns\n",
"\n",
"cuisine_cols = [[] for _ in range(8)]\n",
"\n",
"for i in cuisines:\n",
"    # Non-string entries (NaN) yield an empty list, so every column gets 'NONE'\n",
"    parts = i.split(',') if isinstance(i, str) else []\n",
"    for j in range(8):\n",
"        try:\n",
"            cuisine_cols[j].append(parts[j].strip().upper())\n",
"        except IndexError:\n",
"            cuisine_cols[j].append('NONE')\n",
"\n",
"C1, C2, C3, C4, C5, C6, C7, C8 = cuisine_cols\n",
"\n",
"\n",
"# Cleaning CITY\n",
"\n",
"cities = list(test_set['CITY'])\n",
"\n",
"for i in range(len(cities)):\n",
" if type(cities[i]) == float:\n",
" cities[i] = 'NOT AVAILABLE'\n",
" cities[i] = cities[i].strip().upper()\n",
" \n",
"\n",
"# Cleaning LOCALITY\n",
"\n",
"localities = list(test_set['LOCALITY'])\n",
"\n",
"for i in range(len(localities)):\n",
" if type(localities[i]) == float:\n",
" localities[i] = 'NOT AVAILABLE'\n",
" localities[i] = localities[i].strip().upper() \n",
" \n",
"\n",
"#Cleaning Rating\n",
"\n",
"rates = list(test_set['RATING'])\n",
"\n",
"for i in range(len(rates)) :\n",
" try:\n",
" rates[i] = float(rates[i])\n",
" except :\n",
" rates[i] = np.nan\n",
"\n",
"\n",
"# Cleaning VOTES : 'NN votes' -> NN (non-numeric entries are left as-is and filled later)\n",
" \n",
"votes = list(test_set['VOTES'])\n",
"\n",
"for i in range(len(votes)) :\n",
" try:\n",
" votes[i] = int(votes[i].split(\" \")[0].strip())\n",
" except :\n",
" pass \n",
" \n",
" \n",
"\n",
"new_data_test = {}\n",
"\n",
"new_data_test['TITLE1'] = T1\n",
"new_data_test['TITLE2'] = T2\n",
"new_data_test['RESTAURANT_ID'] = test_set[\"RESTAURANT_ID\"]\n",
"new_data_test['CUISINE1'] = C1\n",
"new_data_test['CUISINE2'] = C2\n",
"new_data_test['CUISINE3'] = C3\n",
"new_data_test['CUISINE4'] = C4\n",
"new_data_test['CUISINE5'] = C5\n",
"new_data_test['CUISINE6'] = C6\n",
"new_data_test['CUISINE7'] = C7\n",
"new_data_test['CUISINE8'] = C8\n",
"new_data_test['CITY'] = cities\n",
"new_data_test['LOCALITY'] = localities\n",
"new_data_test['RATING'] = rates\n",
"new_data_test['VOTES'] = votes\n",
"\n",
"new_data_test = pd.DataFrame(new_data_test)\n",
"\n",
"print(\"\\n\\nnew_data_train: \\n\", new_data_train.head())\n",
"print(\"\\n\\nnew_data_test: \\n\", new_data_test.head())\n",
"\n",
"#______________________\n"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"\n",
"\n",
"new_data_train: \n",
" TITLE1 TITLE2 RESTAURANT_ID CUISINE1 CUISINE2 \\\n",
"0 CASUAL DINING NONE 9438 MALWANI GOAN \n",
"1 CASUAL DINING BAR 13198 ASIAN MODERN INDIAN \n",
"2 CASUAL DINING NONE 10915 NORTH INDIAN CHINESE \n",
"3 QUICK BITES NONE 6346 TIBETAN CHINESE \n",
"4 DESSERT PARLOR NONE 15387 DESSERTS NONE \n",
"\n",
" CUISINE3 CUISINE4 CUISINE5 CUISINE6 CUISINE7 CUISINE8 CITY \\\n",
"0 NORTH INDIAN NONE NONE NONE NONE NONE THANE \n",
"1 JAPANESE NONE NONE NONE NONE NONE CHENNAI \n",
"2 BIRYANI HYDERABADI NONE NONE NONE NONE CHENNAI \n",
"3 NONE NONE NONE NONE NONE NONE MUMBAI \n",
"4 NONE NONE NONE NONE NONE NONE MUMBAI \n",
"\n",
" LOCALITY RATING VOTES COST \n",
"0 DOMBIVALI EAST 3.6 49.0 1200 \n",
"1 RAMAPURAM 4.2 30.0 1500 \n",
"2 SALIGRAMAM 3.8 221.0 800 \n",
"3 BANDRA WEST 4.1 24.0 800 \n",
"4 LOWER PAREL 3.8 165.0 300 \n",
"\n",
"\n",
"new_data_test: \n",
" TITLE1 TITLE2 RESTAURANT_ID CUISINE1 CUISINE2 CUISINE3 \\\n",
"0 CASUAL DINING NONE 4085 NORTH INDIAN CHINESE MUGHLAI \n",
"1 QUICK BITES NONE 12680 SOUTH INDIAN FAST FOOD PIZZA \n",
"2 CASUAL DINING NONE 1411 NORTH INDIAN SEAFOOD BIRYANI \n",
"3 NONE NONE 204 BIRYANI NONE NONE \n",
"4 QUICK BITES NONE 13453 SOUTH INDIAN KERALA NONE \n",
"\n",
" CUISINE4 CUISINE5 CUISINE6 CUISINE7 CUISINE8 CITY LOCALITY \\\n",
"0 KEBAB NONE NONE NONE NONE NOIDA SECTOR 18 \n",
"1 NORTH INDIAN NONE NONE NONE NONE MUMBAI GRANT ROAD \n",
"2 CHINESE NONE NONE NONE NONE MUMBAI MARINE LINES \n",
"3 NONE NONE NONE NONE NONE FARIDABAD NIT \n",
"4 NONE NONE NONE NONE NONE KOCHI KALOOR \n",
"\n",
" RATING VOTES \n",
"0 4.3 564.0 \n",
"1 4.2 61.0 \n",
"2 3.8 350.0 \n",
"3 3.8 1445.0 \n",
"4 3.6 23.0 \n"
],
"name": "stdout"
}
]
},
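{
"metadata": {
"id": "split_pad_demo",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"The splitting logic above can be sketched on a toy row (an invented string, not from the dataset): each comma-separated value is stripped and upper-cased, and the row is padded with `'NONE'` up to 8 slots.\n",
"\n",
"```python\n",
"row = 'malwani, Goan , NORTH INDIAN'\n",
"parts = [p.strip().upper() for p in row.split(',')]\n",
"padded = parts + ['NONE'] * (8 - len(parts))\n",
"print(padded)  # ['MALWANI', 'GOAN', 'NORTH INDIAN', 'NONE', 'NONE', 'NONE', 'NONE', 'NONE']\n",
"```\n"
]
},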
{
"metadata": {
"id": "nSx2rYBzco1o",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"**Note :**\n",
"\n",
"* I have chosen not to use the TIME feature in my approach.\n",
"\n",
"---\n",
"\n"
]
},
{
"metadata": {
"id": "Vxi-g4WihDGJ",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"##Data Preprocessing"
]
},
{
"metadata": {
"id": "SN-AmpMpZdaK",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"The below code deals with the following tasks:\n",
"\n",
"* Dealing with missing values\n",
"* Encoding categorical features\n",
"* Feature Scaling\n",
"\n"
]
},
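{
"metadata": {
"id": "labelenc_demo",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"As a quick illustration of the encoding step (toy labels, not the real vocabulary), `LabelEncoder` assigns each distinct string an integer code based on sorted order:\n",
"\n",
"```python\n",
"from sklearn.preprocessing import LabelEncoder\n",
"\n",
"le = LabelEncoder()\n",
"le.fit(['CHINESE', 'GOAN', 'NONE'])\n",
"print(le.transform(['GOAN', 'NONE', 'CHINESE']))  # [1 2 0]\n",
"```\n",
"\n",
"These codes are arbitrary integers: tree ensembles such as gradient boosting can split on them, but they should not be read as an ordinal scale.\n"
]
},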
{
"metadata": {
"id": "GrMK_fZSvAg4",
"colab_type": "code",
"outputId": "3ccb3ae1-b641-459d-eb46-0987388a1863",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 1241
}
},
"cell_type": "code",
"source": [
"###############################################################################################################################################\n",
"\n",
"# Data Preprocessing\n",
"\n",
"###############################################################################################################################################\n",
"\n",
"# Missing Values\n",
"#_______________\n",
"\n",
"# Training Set\n",
"\n",
"print(\"\\n\\nMissing Values in Training Set\\n\",\"#\"*60)\n",
"print(\"\\nContains NaN/Empty cells : \", new_data_train.isnull().values.any())\n",
"print(\"\\nTotal empty cells by column\\n\",\"_\"*60,\"\\n\", new_data_train.isnull().sum())\n",
"\n",
"new_data_train.fillna(0, inplace = True)\n",
"\n",
"print(\"\\n\\nAfter Filling 0:\\n\",\"_\"*60,\"\\n\")\n",
"print(\"\\nContains NaN/Empty cells : \", new_data_train.isnull().values.any())\n",
"\n",
"# Test Set\n",
"\n",
"print(\"\\n\\nMissing Values in Test Set \\n\",\"#\"*60)\n",
"print(\"\\nContains NaN/Empty cells : \", new_data_test.isnull().values.any())\n",
"print(\"\\nTotal empty cells by column\\n\",\"_\"*60,\"\\n\", new_data_test.isnull().sum())\n",
"\n",
"\n",
"new_data_test.fillna(0, inplace = True)\n",
"\n",
"print(\"\\n\\nAfter Filling 0 :\\n\",\"_\"*60,\"\\n\")\n",
"print(\"\\nContains NaN/Empty cells : \", new_data_test.isnull().values.any())\n",
"print(\"\\n\\n\")\n",
"\n",
"\n",
"# Encoding Categorical Variables\n",
"#_______________________________\n",
"\n",
"\n",
"from sklearn.preprocessing import LabelEncoder\n",
"\n",
"le_titles = LabelEncoder()\n",
"le_cuisines = LabelEncoder()\n",
"\n",
"le_city = LabelEncoder()\n",
"\n",
"le_locality = LabelEncoder()\n",
"\n",
"\n",
"le_titles.fit(all_titles)\n",
"le_cuisines.fit(all_cuisines)\n",
"\n",
"le_city.fit(all_cities)\n",
"le_locality.fit(all_localities)\n",
"\n",
"\n",
"\n",
"# Training Set \n",
"\n",
"new_data_train['TITLE1'] = le_titles.transform(new_data_train['TITLE1'])\n",
"new_data_train['TITLE2'] = le_titles.transform(new_data_train['TITLE2'])\n",
"\n",
"\n",
"new_data_train['CUISINE1'] = le_cuisines.transform(new_data_train['CUISINE1'])\n",
"new_data_train['CUISINE2'] = le_cuisines.transform(new_data_train['CUISINE2'])\n",
"new_data_train['CUISINE3'] = le_cuisines.transform(new_data_train['CUISINE3'])\n",
"new_data_train['CUISINE4'] = le_cuisines.transform(new_data_train['CUISINE4'])\n",
"new_data_train['CUISINE5'] = le_cuisines.transform(new_data_train['CUISINE5'])\n",
"new_data_train['CUISINE6'] = le_cuisines.transform(new_data_train['CUISINE6'])\n",
"new_data_train['CUISINE7'] = le_cuisines.transform(new_data_train['CUISINE7'])\n",
"new_data_train['CUISINE8'] = le_cuisines.transform(new_data_train['CUISINE8'])\n",
"\n",
"\n",
"new_data_train['CITY'] = le_city.transform(new_data_train['CITY'])\n",
"new_data_train['LOCALITY'] = le_locality.transform(new_data_train['LOCALITY'])\n",
"\n",
"# Test Set\n",
"\n",
"new_data_test['TITLE1'] = le_titles.transform(new_data_test['TITLE1'])\n",
"new_data_test['TITLE2'] = le_titles.transform(new_data_test['TITLE2'])\n",
"\n",
"\n",
"new_data_test['CUISINE1'] = le_cuisines.transform(new_data_test['CUISINE1'])\n",
"new_data_test['CUISINE2'] = le_cuisines.transform(new_data_test['CUISINE2'])\n",
"new_data_test['CUISINE3'] = le_cuisines.transform(new_data_test['CUISINE3'])\n",
"new_data_test['CUISINE4'] = le_cuisines.transform(new_data_test['CUISINE4'])\n",
"new_data_test['CUISINE5'] = le_cuisines.transform(new_data_test['CUISINE5'])\n",
"new_data_test['CUISINE6'] = le_cuisines.transform(new_data_test['CUISINE6'])\n",
"new_data_test['CUISINE7'] = le_cuisines.transform(new_data_test['CUISINE7'])\n",
"new_data_test['CUISINE8'] = le_cuisines.transform(new_data_test['CUISINE8'])\n",
"\n",
"\n",
"new_data_test['CITY'] = le_city.transform(new_data_test['CITY'])\n",
"new_data_test['LOCALITY'] = le_locality.transform(new_data_test['LOCALITY'])\n",
"\n",
"\n",
"# Classifying Independent and Dependent Features\n",
"#_______________________________________________\n",
"\n",
"# Dependent Variable\n",
"Y_train = new_data_train.iloc[:, -1].values \n",
"\n",
"# Independent Variables\n",
"X_train = new_data_train.iloc[:,0 : -1].values\n",
"\n",
"# Independent Variables for Test Set\n",
"X_test = new_data_test.iloc[:,:].values\n",
"\n",
"\n",
"# Feature Scaling\n",
"#________________\n",
"\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"sc = StandardScaler()\n",
"\n",
"X_train = sc.fit_transform(X_train)\n",
"\n",
"X_test = sc.transform(X_test)\n",
"\n",
"\n",
"# Note: refitting the same scaler on Y_train means sc now holds the COST statistics,\n",
"# so sc.inverse_transform() can later map predictions back to the original cost scale\n",
"Y_train = Y_train.reshape((len(Y_train), 1)) \n",
"\n",
"Y_train = sc.fit_transform(Y_train)\n",
"\n",
"Y_train = Y_train.ravel()\n",
"\n",
"\n",
"\n"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"\n",
"\n",
"Missing Values in Training Set\n",
" ############################################################\n",
"\n",
"Contains NaN/Empty cells : True\n",
"\n",
"Total empty cells by column\n",
" ____________________________________________________________ \n",
" TITLE1 0\n",
"TITLE2 0\n",
"RESTAURANT_ID 0\n",
"CUISINE1 0\n",
"CUISINE2 0\n",
"CUISINE3 0\n",
"CUISINE4 0\n",
"CUISINE5 0\n",
"CUISINE6 0\n",
"CUISINE7 0\n",
"CUISINE8 0\n",
"CITY 0\n",
"LOCALITY 0\n",
"RATING 1204\n",
"VOTES 1204\n",
"COST 0\n",
"dtype: int64\n",
"\n",
"\n",
"After Filling 0:\n",
" ____________________________________________________________ \n",
"\n",
"\n",
"Contains NaN/Empty cells : False\n",
"\n",
"\n",
"Missing Values in Test Set \n",
" ############################################################\n",
"\n",
"Contains NaN/Empty cells : True\n",
"\n",
"Total empty cells by column\n",
" ____________________________________________________________ \n",
" TITLE1 0\n",
"TITLE2 0\n",
"RESTAURANT_ID 0\n",
"CUISINE1 0\n",
"CUISINE2 0\n",
"CUISINE3 0\n",
"CUISINE4 0\n",
"CUISINE5 0\n",
"CUISINE6 0\n",
"CUISINE7 0\n",
"CUISINE8 0\n",
"CITY 0\n",
"LOCALITY 0\n",
"RATING 402\n",
"VOTES 402\n",
"dtype: int64\n",
"\n",
"\n",
"After Filling 0 :\n",
" ____________________________________________________________ \n",
"\n",
"\n",
"Contains NaN/Empty cells : False\n",
"\n",
"\n",
"\n"
],
"name": "stdout"
},
{
"output_type": "stream",
"text": [
"/Users/aim/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py:595: DataConversionWarning: Data with input dtype int64 was converted to float64 by StandardScaler.\n",
" warnings.warn(msg, DataConversionWarning)\n",
"/Users/aim/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py:595: DataConversionWarning: Data with input dtype int64 was converted to float64 by StandardScaler.\n",
" warnings.warn(msg, DataConversionWarning)\n"
],
"name": "stderr"
}
]
},
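{
"metadata": {
"id": "scaler_roundtrip_demo",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"Because the target is standardized too, the model's predictions come out in scaled units and must be mapped back to costs. A minimal round-trip sketch (toy numbers):\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"sc = StandardScaler()\n",
"y = np.array([[300.0], [800.0], [1200.0]])\n",
"y_scaled = sc.fit_transform(y)            # zero mean, unit variance\n",
"y_back = sc.inverse_transform(y_scaled)   # back to original cost units\n",
"print(np.allclose(y_back, y))  # True\n",
"```\n"
]
},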
{
"metadata": {
"id": "qvyeb_YOPyF5",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"##Modelling"
]
},
{
"metadata": {
"id": "CwZ6e-JLaFl-",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"We will use a `GradientBoostingRegressor` to predict the values of the COST feature for the test set."
]
},
{
"metadata": {
"id": "9GNjZM5JPvs6",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"from sklearn.ensemble import GradientBoostingRegressor\n",
"\n",
"gbr = GradientBoostingRegressor(loss='huber', learning_rate=0.001, n_estimators=350,\n",
"                                max_depth=6, subsample=1, verbose=False,\n",
"                                random_state=126)\n",
"\n",
"# Leaderboard score : 0.8364249755816828 with random_state=126, n_estimators=350, max_depth=6\n",
"\n",
"gbr.fit(X_train,Y_train)\n",
"\n",
"# sc was last fit on Y_train, so inverse_transform maps predictions back to cost units\n",
"y_pred_gbr = sc.inverse_transform(gbr.predict(X_test))\n",
"\n"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "GhE_VVECWQeP",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"y_pred_gbr = pd.DataFrame(y_pred_gbr, columns = ['COST']) # Converting to dataframe\n",
"print(y_pred_gbr)"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "vQNdfqMKQQGu",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"y_pred_gbr.to_excel(\"GradientBoostingRegressor.xlsx\", index = False) # Saving the predictions to an Excel file"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "zqE3flcqf3IK",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"MachineHack leaderboard score for the above solution is 0.83642.\n",
"\n"
]
}
]
}