{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"_You are currently looking at **version 1.1** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-text-mining/resources/d9pwm) course resource._\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Assignment 1\n",
"\n",
"In this assignment, you'll be working with messy medical data and using regex to extract relevant infromation from the data. \n",
"\n",
"Each line of the `dates.txt` file corresponds to a medical note. Each note has a date that needs to be extracted, but each date is encoded in one of many formats.\n",
"\n",
"The goal of this assignment is to correctly identify all of the different date variants encoded in this dataset and to properly normalize and sort the dates. \n",
"\n",
"Here is a list of some of the variants you might encounter in this dataset:\n",
"* 04/20/2009; 04/20/09; 4/20/09; 4/3/09\n",
"* Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;\n",
"* 20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009\n",
"* Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009\n",
"* Feb 2009; Sep 2009; Oct 2010\n",
"* 6/2008; 12/2009\n",
"* 2009; 2010\n",
"\n",
"Once you have extracted these date patterns from the text, the next step is to sort them in ascending chronological order accoring to the following rules:\n",
"* Assume all dates in xx/xx/xx format are mm/dd/yy\n",
"* Assume all dates where year is encoded in only two digits are years from the 1900's (e.g. 1/5/89 is January 5th, 1989)\n",
"* If the day is missing (e.g. 9/2009), assume it is the first day of the month (e.g. September 1, 2009).\n",
"* If the month is missing (e.g. 2010), assume it is the first of January of that year (e.g. January 1, 2010).\n",
"* Watch out for potential typos as this is a raw, real-life derived dataset.\n",
"\n",
"With these rules in mind, find the correct date in each note and return a pandas Series in chronological order of the original Series' indices.\n",
"\n",
"For example if the original series was this:\n",
"\n",
" 0 1999\n",
" 1 2010\n",
" 2 1978\n",
" 3 2015\n",
" 4 1985\n",
"\n",
"Your function should return this:\n",
"\n",
" 0 2\n",
" 1 4\n",
" 2 0\n",
" 3 1\n",
" 4 3\n",
"\n",
"Your score will be calculated using [Kendall's tau](https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient), a correlation measure for ordinal data.\n",
"\n",
"*This function should return a Series of length 500 and dtype int.*"
]
},
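{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch only (not part of the graded function): Kendall's tau\n",
"# rewards orderings that agree pairwise with the true chronological order.\n",
"# The toy example above (years 1999, 2010, 1978, 2015, 1985) sorts to the\n",
"# index order 2, 4, 0, 1, 3; an identical prediction scores tau = 1.0.\n",
"import pandas as pd\n",
"truth = pd.Series([2, 4, 0, 1, 3])\n",
"prediction = pd.Series([2, 4, 0, 1, 3])\n",
"print(truth.corr(prediction, method='kendall'))  # 1.0\n"
]
},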
{
"cell_type": "code",
"execution_count": 113,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 03/25/93 Total time of visit (in minutes):\\n\n",
"1 6/18/85 Primary Care Doctor:\\n\n",
"2 sshe plans to move as of 7/8/71 In-Home Services: None\\n\n",
"3 7 on 9/27/75 Audit C Score Current:\\n\n",
"4 2/6/96 sleep studyPain Treatment Pain Level (Numeric Scale): 7\\n\n",
"dtype: object"
]
},
"execution_count": 113,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"from functools import reduce\n",
"pd.options.display.max_colwidth=1000\n",
"pd.options.display.max_rows = 500\n",
"\n",
"doc = []\n",
"with open('dates.txt') as file:\n",
" for line in file:\n",
" doc.append(line)\n",
"\n",
"df = pd.Series(doc)\n",
"df.head(5)"
]
},
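{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch only: Series.str.extract with named groups returns a\n",
"# DataFrame with one column per group (NaN where the pattern does not match),\n",
"# which is what the extraction rules in the next cells rely on.\n",
"import pandas as pd\n",
"demo = pd.Series(['seen on 03/25/93', 'visit 6/18/85', 'no date here'])\n",
"print(demo.str.extract(r'(?P<month>\\d{1,2})/(?P<day>\\d{1,2})/(?P<year>\\d{2})', expand=True))\n"
]
},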
{
"cell_type": "code",
"execution_count": 247,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"04/20/2009; 04/20/09; 4/20/09; 4/3/09\n",
"Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;\n",
"20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009\n",
"Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009\n",
"Feb 2009; Sep 2009; Oct 2010\n",
"6/2008; 12/2009\n",
"2009; 2010\n",
"\n",
"rulex\n",
"[('', '9', '1984')]\n",
"rule2\n",
"[('mar', '20', '2009'), ('mar', '20', '2009'), ('march', '20', '2009'), ('mar', '20', '2009'), ('mar', '20', '2009')]\n",
"rule3\n",
"[('20', 'mar', '2009'), ('20', 'march', '2009'), ('20', 'mar', '2009'), ('20', 'march', '2009')]\n",
"rule4\n",
"[('mar', '20', '2009'), ('mar', '21', '2009'), ('mar', '22', '2009')]\n",
"rule5\n",
"[('', 'feb', '2009'), ('', 'sep', '2009'), ('', 'oct', '2010')]\n",
"rule6\n",
"[('', '6', '2008'), ('', '12', '2009')]\n",
"rule7\n",
"[('', '', '2009'), ('', '', '2010')]\n"
]
}
],
"source": [
"#play ground\n",
"\n",
"rule1 = r'(?P<month>\\d{1,2})[/-](?P<day>\\d{1,2})[/-](?P<year>\\d{2}|\\d{4})'\n",
"rule2 = r'(?P<month>(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\\w*)[.]?[ -](?P<day>\\d{1,2})(?:\\w{2})?[,]?[ -](?P<year>\\d{4})'\n",
"rule3 = r'(?P<day>\\d{1,2}) (?P<month>(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\\w*)[.,]? (?P<year>\\d{4})'\n",
"rule4 = rule2\n",
"rule5 = r'(?P<day>)(?P<month>(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\\w*) (?P<year>\\d{4})'\n",
"rule6 = r'(?P<day>)(?P<month>\\d{1,2})/(?P<year>\\d{4})'\n",
"rule7 = r'(?P<day>)(?P<month>)(?:\\D|^)(?P<year>\\d{4})(?:\\D|$)'\n",
"\n",
"text = '''\n",
"04/20/2009; 04/20/09; 4/20/09; 4/3/09\n",
"Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;\n",
"20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009\n",
"Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009\n",
"Feb 2009; Sep 2009; Oct 2010\n",
"6/2008; 12/2009\n",
"2009; 2010\n",
"'''\n",
"import re\n",
"print (text)\n",
"\n",
"print ('rulex')\n",
"print (re.findall(rule6, \"\"\"sChesterfield 9/1984 for 3 weeks for dual diagnosis alcohol and PTSDHx of Outpatient Treatment: Yes\\\n",
"\"\"\".lower()))\n",
"\n",
"print ('rule2')\n",
"print (re.findall(rule2, 'Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009'.lower()))\n",
"\n",
"print ('rule3')\n",
"print (re.findall(rule3, '20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009'.lower()))\n",
"\n",
"print ('rule4')\n",
"print (re.findall(rule4, 'Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009'.lower()))\n",
"\n",
"print ('rule5')\n",
"print (re.findall(rule5, 'Feb 2009; Sep 2009; Oct 2010'.lower()))\n",
"\n",
"print ('rule6')\n",
"print (re.findall(rule6, '6/2008; 12/2009'.lower()))\n",
"\n",
"print ('rule7')\n",
"print (re.findall(rule7, '2009; 2010'.lower()))\n",
"# pd.DataFrame(\n",
"# df.str.extract(rule2),df\n",
"# )"
]
},
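{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Side note (illustrative): regex alternation takes the first branch that\n",
"# matches, so the year group must try four digits before two. With the order\n",
"# reversed, '2009' yields only '20' (and then '09' as a second match), which\n",
"# would later be normalized to 1920 by the two-digit-year rule.\n",
"import re\n",
"print(re.findall(r'(\\d{4}|\\d{2})', '2009'))  # ['2009']\n",
"print(re.findall(r'(\\d{2}|\\d{4})', '2009'))  # ['20', '09']\n"
]
},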
{
"cell_type": "code",
"execution_count": 243,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def date_sorter(df):\n",
" \n",
" # Your code here\n",
" mon2int = dict([(m, i+1) for i, m in enumerate('jan,feb,mar,apr,may,jun,jul,aug,sep,oct,nov,dec'.split(\",\"))])\n",
" rule1 = r'(?P<month>\\d{1,2})[/-](?P<day>\\d{1,2})[/-](?P<year>\\d{2}|\\d{4})'\n",
" rule2 = r'(?P<month>(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\\w*)[.]?[ -](?P<day>\\d{1,2})(?:\\w{2})?[,]?[ -](?P<year>\\d{4})'\n",
" rule3 = r'(?P<day>\\d{1,2}) (?P<month>(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\\w*)[.,]? (?P<year>\\d{4})'\n",
" rule4 = rule2\n",
" rule5 = r'(?P<day>)(?P<month>(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\\w*) (?P<year>\\d{4})'\n",
" rule6 = r'(?P<day>)(?P<month>\\d{1,2})/(?P<year>\\d{4})'\n",
" rule7 = r'(?P<day>)(?P<month>)(?:\\D|^)(?P<year>\\d{4})(?:\\D|$)'\n",
" rule_list = [rule1, rule2, rule3, rule4, rule5, rule6, rule7]\n",
" extract_df_list = [df.str.lower().str.extract(rule) for rule in rule_list]\n",
" date_df = reduce(lambda x,y: x.fillna(y),extract_df_list)\n",
" date_df['text'] = df\n",
" date_df['month'] = date_df['month'].apply(lambda x: mon2int.get(str(x)[:3],x))\n",
" date_df['month'] = date_df['month'].fillna('1').apply(lambda x: '1' if x=='' else x)\n",
" date_df['year'] = date_df['year'].apply(lambda x: '19'+x if len(x)==2 else x)\n",
" date_df['day'] = date_df['day'].fillna('1').apply(lambda x: '1' if x=='' else x)\n",
" date_df = date_df[[\"year\",\"month\",\"day\"]].astype(int)\n",
" index_df = date_df.sort_values(['year','month','day'], axis=0, ascending=False).reset_index()\n",
" index_df = index_df['index']\n",
" return index_df"
]
},
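{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional cross-check (illustrative): pandas can assemble datetimes directly\n",
"# from year/month/day columns, so sorting that Series should agree with\n",
"# sort_values on the three integer columns used in date_sorter above.\n",
"import pandas as pd\n",
"demo = pd.DataFrame({'year': [1999, 1978, 1985], 'month': [1, 7, 3], 'day': [5, 8, 1]})\n",
"print(pd.to_datetime(demo).sort_values().index.tolist())  # [1, 2, 0]\n"
]
},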
{
"cell_type": "code",
"execution_count": 244,
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/opt/conda/lib/python3.6/site-packages/ipykernel/__main__.py:13: FutureWarning: currently extract(expand=None) means expand=False (return Index/Series/DataFrame) but in a future version of pandas this will be changed to expand=True (return DataFrame)\n"
]
},
{
"data": {
"text/plain": [
"0 413\n",
"1 161\n",
"2 186\n",
"3 141\n",
"4 231\n",
"5 427\n",
"6 253\n",
"7 464\n",
"8 235\n",
"9 152\n",
"10 257\n",
"11 475\n",
"12 401\n",
"13 255\n",
"14 439\n",
"15 366\n",
"16 381\n",
"17 463\n",
"18 198\n",
"19 279\n",
"20 431\n",
"21 244\n",
"22 286\n",
"23 480\n",
"24 383\n",
"25 320\n",
"26 139\n",
"27 208\n",
"28 243\n",
"29 220\n",
"30 282\n",
"31 393\n",
"32 460\n",
"33 290\n",
"34 227\n",
"35 217\n",
"36 229\n",
"37 316\n",
"38 170\n",
"39 240\n",
"40 467\n",
"41 242\n",
"42 364\n",
"43 408\n",
"44 411\n",
"45 176\n",
"46 448\n",
"47 287\n",
"48 273\n",
"49 472\n",
"50 268\n",
"51 288\n",
"52 379\n",
"53 386\n",
"54 389\n",
"55 297\n",
"56 371\n",
"57 491\n",
"58 360\n",
"59 400\n",
"60 293\n",
"61 271\n",
"62 396\n",
"63 183\n",
"64 187\n",
"65 445\n",
"66 497\n",
"67 337\n",
"68 428\n",
"69 132\n",
"70 341\n",
"71 160\n",
"72 289\n",
"73 264\n",
"74 325\n",
"75 277\n",
"76 391\n",
"77 252\n",
"78 314\n",
"79 410\n",
"80 490\n",
"81 173\n",
"82 307\n",
"83 433\n",
"84 270\n",
"85 496\n",
"86 348\n",
"87 346\n",
"88 344\n",
"89 250\n",
"90 339\n",
"91 498\n",
"92 451\n",
"93 146\n",
"94 302\n",
"95 126\n",
"96 207\n",
"97 281\n",
"98 425\n",
"99 241\n",
"100 350\n",
"101 414\n",
"102 188\n",
"103 291\n",
"104 306\n",
"105 484\n",
"106 144\n",
"107 356\n",
"108 362\n",
"109 234\n",
"110 452\n",
"111 469\n",
"112 184\n",
"113 449\n",
"114 128\n",
"115 262\n",
"116 446\n",
"117 304\n",
"118 494\n",
"119 394\n",
"120 367\n",
"121 377\n",
"122 147\n",
"123 195\n",
"124 159\n",
"125 125\n",
"126 328\n",
"127 457\n",
"128 374\n",
"129 443\n",
"130 158\n",
"131 190\n",
"132 417\n",
"133 392\n",
"134 420\n",
"135 260\n",
"136 305\n",
"137 329\n",
"138 456\n",
"139 201\n",
"140 145\n",
"141 266\n",
"142 407\n",
"143 212\n",
"144 384\n",
"145 327\n",
"146 432\n",
"147 376\n",
"148 321\n",
"149 471\n",
"150 120\n",
"151 251\n",
"152 387\n",
"153 79\n",
"154 12\n",
"155 224\n",
"156 20\n",
"157 343\n",
"158 222\n",
"159 18\n",
"160 338\n",
"161 459\n",
"162 114\n",
"163 333\n",
"164 101\n",
"165 353\n",
"166 87\n",
"167 213\n",
"168 365\n",
"169 468\n",
"170 166\n",
"171 131\n",
"172 121\n",
"173 324\n",
"174 124\n",
"175 4\n",
"176 142\n",
"177 479\n",
"178 292\n",
"179 169\n",
"180 90\n",
"181 326\n",
"182 203\n",
"183 181\n",
"184 60\n",
"185 248\n",
"186 110\n",
"187 236\n",
"188 372\n",
"189 56\n",
"190 388\n",
"191 140\n",
"192 193\n",
"193 238\n",
"194 311\n",
"195 483\n",
"196 406\n",
"197 409\n",
"198 42\n",
"199 359\n",
"200 133\n",
"201 47\n",
"202 196\n",
"203 308\n",
"204 163\n",
"205 300\n",
"206 450\n",
"207 477\n",
"208 331\n",
"209 357\n",
"210 172\n",
"211 86\n",
"212 192\n",
"213 249\n",
"214 0\n",
"215 354\n",
"216 272\n",
"217 298\n",
"218 478\n",
"219 113\n",
"220 301\n",
"221 143\n",
"222 310\n",
"223 442\n",
"224 100\n",
"225 119\n",
"226 269\n",
"227 52\n",
"228 211\n",
"229 62\n",
"230 138\n",
"231 399\n",
"232 487\n",
"233 66\n",
"234 322\n",
"235 46\n",
"236 67\n",
"237 453\n",
"238 33\n",
"239 51\n",
"240 91\n",
"241 97\n",
"242 226\n",
"243 461\n",
"244 122\n",
"245 245\n",
"246 206\n",
"247 151\n",
"248 194\n",
"249 22\n",
"250 412\n",
"251 157\n",
"252 177\n",
"253 233\n",
"254 482\n",
"255 435\n",
"256 7\n",
"257 26\n",
"258 149\n",
"259 89\n",
"260 210\n",
"261 373\n",
"262 385\n",
"263 440\n",
"264 202\n",
"265 218\n",
"266 71\n",
"267 115\n",
"268 265\n",
"269 312\n",
"270 476\n",
"271 180\n",
"272 174\n",
"273 216\n",
"274 256\n",
"275 390\n",
"276 37\n",
"277 330\n",
"278 462\n",
"279 76\n",
"280 116\n",
"281 130\n",
"282 416\n",
"283 63\n",
"284 41\n",
"285 349\n",
"286 32\n",
"287 68\n",
"288 284\n",
"289 485\n",
"290 54\n",
"291 423\n",
"292 437\n",
"293 29\n",
"294 168\n",
"295 261\n",
"296 404\n",
"297 368\n",
"298 96\n",
"299 334\n",
"300 230\n",
"301 276\n",
"302 8\n",
"303 352\n",
"304 199\n",
"305 492\n",
"306 135\n",
"307 228\n",
"308 185\n",
"309 280\n",
"310 447\n",
"311 1\n",
"312 178\n",
"313 10\n",
"314 274\n",
"315 136\n",
"316 489\n",
"317 426\n",
"318 48\n",
"319 275\n",
"320 421\n",
"321 455\n",
"322 99\n",
"323 175\n",
"324 285\n",
"325 107\n",
"326 61\n",
"327 209\n",
"328 247\n",
"329 438\n",
"330 35\n",
"331 295\n",
"332 137\n",
"333 294\n",
"334 397\n",
"335 205\n",
"336 358\n",
"337 470\n",
"338 179\n",
"339 88\n",
"340 429\n",
"341 112\n",
"342 103\n",
"343 44\n",
"344 70\n",
"345 454\n",
"346 127\n",
"347 25\n",
"348 16\n",
"349 458\n",
"350 403\n",
"351 74\n",
"352 263\n",
"353 215\n",
"354 246\n",
"355 444\n",
"356 80\n",
"357 430\n",
"358 355\n",
"359 197\n",
"360 39\n",
"361 134\n",
"362 466\n",
"363 221\n",
"364 267\n",
"365 361\n",
"366 441\n",
"367 340\n",
"368 78\n",
"369 150\n",
"370 347\n",
"371 499\n",
"372 259\n",
"373 21\n",
"374 167\n",
"375 75\n",
"376 254\n",
"377 296\n",
"378 5\n",
"379 398\n",
"380 424\n",
"381 495\n",
"382 313\n",
"383 378\n",
"384 164\n",
"385 434\n",
"386 65\n",
"387 81\n",
"388 200\n",
"389 6\n",
"390 336\n",
"391 105\n",
"392 148\n",
"393 239\n",
"394 318\n",
"395 369\n",
"396 493\n",
"397 189\n",
"398 72\n",
"399 232\n",
"400 117\n",
"401 19\n",
"402 123\n",
"403 419\n",
"404 309\n",
"405 283\n",
"406 395\n",
"407 303\n",
"408 488\n",
"409 93\n",
"410 27\n",
"411 315\n",
"412 258\n",
"413 204\n",
"414 342\n",
"415 23\n",
"416 237\n",
"417 465\n",
"418 219\n",
"419 363\n",
"420 50\n",
"421 3\n",
"422 370\n",
"423 382\n",
"424 165\n",
"425 418\n",
"426 40\n",
"427 319\n",
"428 11\n",
"429 49\n",
"430 317\n",
"431 473\n",
"432 223\n",
"433 155\n",
"434 214\n",
"435 278\n",
"436 351\n",
"437 182\n",
"438 332\n",
"439 156\n",
"440 108\n",
"441 73\n",
"442 402\n",
"443 154\n",
"444 162\n",
"445 299\n",
"446 104\n",
"447 436\n",
"448 481\n",
"449 57\n",
"450 345\n",
"451 380\n",
"452 375\n",
"453 422\n",
"454 323\n",
"455 405\n",
"456 36\n",
"457 335\n",
"458 415\n",
"459 486\n",
"460 191\n",
"461 171\n",
"462 31\n",
"463 225\n",
"464 111\n",
"465 98\n",
"466 129\n",
"467 13\n",
"468 153\n",
"469 474\n",
"470 28\n",
"471 53\n",
"472 2\n",
"473 84\n",
"474 9\n",
"475 58\n",
"476 83\n",
"477 109\n",
"478 102\n",
"479 34\n",
"480 118\n",
"481 43\n",
"482 92\n",
"483 106\n",
"484 15\n",
"485 94\n",
"486 69\n",
"487 17\n",
"488 55\n",
"489 59\n",
"490 85\n",
"491 64\n",
"492 38\n",
"493 24\n",
"494 82\n",
"495 14\n",
"496 95\n",
"497 30\n",
"498 45\n",
"499 77\n",
"Name: index, dtype: int64"
]
},
"execution_count": 244,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"date_sorter(df)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"coursera": {
"course_slug": "python-text-mining",
"graded_item_id": "LvcWI",
"launcher_item_id": "krne9",
"part_id": "Mkp1I"
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}