# necessary imports
import pandas as pd
import glob
import datetime
import time
import matplotlib.pyplot as plt
import pathlib
from collections import ChainMap
import os
import re
import numpy as np
def didMoreThanThisManySecondsElapsedSinceEndOfThatDay(singleDate, seconds):
    '''
    Returns True if more than `seconds` seconds have elapsed between the end of the
    day given by singleDate (ie singleDate + datetime.timedelta(days=1), midnight at
    the start of the next day) and now.
    datetime.datetime.now() returns an object like this:
    datetime.datetime(2020, 8, 17, 9, 26, 21, 78730)
    '''
    return (datetime.datetime.now() - (singleDate + datetime.timedelta(days=1))).total_seconds() > seconds
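# Usage sketch (the date below is hypothetical, just for illustration):
# didMoreThanThisManySecondsElapsedSinceEndOfThatDay(datetime.datetime(2020, 7, 1), 3600)
# evaluates to True whenever it is run after 2020-07-02 01:00:00, ie more than an
# hour after the end of 2020-07-01.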
excelFile='/mnt/volume/jupyter/szokereso/vip_szotar_1.4.xlsx'
'''
This file looks like this: https://raw.githubusercontent.com/zabop/szokeresoDocs/master/howTheSzotarLooksLike.png
It has several tabs; each tab contains rows, and rows can have a varying number of non-empty columns.
The reason it is stored as an Excel file is that non-Python people need to access & modify it sometimes.
Keep an eye on the tab names: if a tab name includes a trailing space, ie "Momentum ",
that will not be obvious from looking at the file in Excel.
Don't put spaces there, but if coworkers accidentally do, that can cause problems which need to be looked into.
The current version of the file is ok, without extra spaces.
'''
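# A minimal sketch, not part of the original workflow (the function name is made up
# here): detect the trailing/leading-space pitfall in tab names described above.
def findTabNamesWithExtraSpaces(pathToExcelFile=excelFile):
    '''Return the tab names which have leading or trailing whitespace.'''
    return [name for name in pd.ExcelFile(pathToExcelFile).sheet_names if name != name.strip()]
# An empty return value means the tab names are clean.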
def getSearchListFromTabName(tabName, pathToExcelFile=excelFile):
    '''
    Example usage: getSearchListFromTabName('LMP') returns:
    [['LMP', 'Lehet Más A Politika'],
     ['Csárdi Antal', 'Csárdi'],
     ['Demeter Márta', 'Demeter'],
     ['Ungár Péter', 'Ungár']]
    The excel file on the LMP tab looks like this, compare:
    https://raw.githubusercontent.com/zabop/szokeresoDocs/master/howTheSzotarLooksLike.png
    pd.ExcelFile() reads in the whole excel file, with all tabs.
    dfs is a dict created with a dict comprehension.
    Keys are the tab names of the excel file.
    Values are lists of lists, like the one shown above.
    xl.parse(sheetname, header=None) returns a dataframe.
    If not all rows have an equal number of columns, nans will appear.
    For example: xl.parse('MSZP', header=None).values.tolist() returns:
    [['MSZP', 'Magyar Szocialista Párt', nan, 'Szocik'],
     ['Ujhelyi István', 'Ujhelyi', nan, nan],
     ['Kunhalmi Ágnes', 'Kunhalmi', nan, nan],
     ['Hiller István', 'Hiller', nan, nan],
     ['Molnár Gyula', nan, nan, nan],
     ['Bangóné Borbély Ildikó', 'Bangóné', nan, nan],
     ['Mesterházy Attila', 'Mesterházy', nan, nan],
     ['Tóth Bertalan', nan, nan, nan],
     ['Tóth Csaba', nan, nan, nan],
     ['Tüttő Kata', 'Tüttő', nan, nan]]
    The MSZP tab looks like this in the original excel file:
    https://raw.githubusercontent.com/zabop/szokeresoDocs/master/tabMSZP.PNG
    We can see that cell C1 is (somewhat unreasonably) left blank, so the first row includes a nan in the third place.
    There are other nans too, which we don't want to return.
    for person in xl.parse(sheetname, header=None).values.tolist() loops through each row list.
    Ie first item:
    ['MSZP', 'Magyar Szocialista Párt', nan, 'Szocik']
    second item:
    ['Ujhelyi István', 'Ujhelyi', nan, nan].
    [alias for alias in person if type(alias) is not float] loops through each such list.
    Ie first item:
    'MSZP'
    second item:
    'Magyar Szocialista Párt'
    third item:
    nan.
    The result of this inner list comprehension is a list without nans: they are
    filtered out by the type(alias) is not float check (nan is a float).
    dfs will contain this for every tab of the excel file. We only return one value of the dfs dict.
    getSearchListFromTabName('MSZP') will return:
    [['MSZP', 'Magyar Szocialista Párt', 'Szocik'],
     ['Ujhelyi István', 'Ujhelyi'],
     ['Kunhalmi Ágnes', 'Kunhalmi'],
     ['Hiller István', 'Hiller'],
     ['Molnár Gyula'],
     ['Bangóné Borbély Ildikó', 'Bangóné'],
     ['Mesterházy Attila', 'Mesterházy'],
     ['Tóth Bertalan'],
     ['Tóth Csaba'],
     ['Tüttő Kata', 'Tüttő']]
    We are doing more calculation than strictly needed:
    we return only one value from the dict, but create the whole dict.
    This isn't a performance-limiting issue though, as this operation is performed once per tab.
    The potential runtime gain from doing this better is tiny, so we don't care.
    (If bothered, try adding if tabName == sheetname at the end of the dict comprehension;
    this hasn't been tested.)
    The commented-out line after DONT DO THIS includes a .dropna(), which does not behave as desired here:
    it drops every row containing a nan, the entire row, not just the nans from it.
    '''
    xl = pd.ExcelFile(pathToExcelFile)
    # DONT DO THIS:
    # dfs = {sheetname: xl.parse(sheetname, header=None).dropna().values.tolist() for sheetname in xl.sheet_names}
    dfs = {sheetname: [[alias for alias in person if type(alias) is not float]
                       for person in xl.parse(sheetname, header=None).values.tolist()]
           for sheetname in xl.sheet_names}
    return dfs[tabName]
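# A hedged sketch of the untested optimization mentioned in the docstring above:
# parse only the requested tab instead of building the whole dfs dict. The function
# name is made up here; this is not used by the pipeline.
def getSearchListFromSingleTab(tabName, pathToExcelFile=excelFile):
    xl = pd.ExcelFile(pathToExcelFile)
    # same inner comprehension as in getSearchListFromTabName, applied to one sheet only
    return [[alias for alias in person if type(alias) is not float]
            for person in xl.parse(tabName, header=None).values.tolist()]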
def validalias(alias):
    '''
    The lines commented out were used at an earlier stage of the project.
    Some dictionaries contained aliases which we didn't want to use.
    (By dictionaries I mean lists of entities, each entity having different names,
    ie 'MSZP', 'Magyar Szocialista Párt', 'Szocik'.
    The different aliases in this case are 'MSZP', 'Magyar Szocialista Párt', 'Szocik';
    the entity is 'MSZP'.)
    We filtered them with these conditions:
    - an alias had to be longer than 5 characters without trailing and leading spaces
    - it had to be at least a bigram, ie at least one space between the first and last non-space characters
    'Ilénke' wouldn't have been a valid alias, 'Nagy Ilén' would've been.
    Left here in case it is needed in the future; for now everything is taken as a valid alias (that's why return True).
    '''
    return True
    # if len(str(alias).strip()) > 5 and len(str(alias).strip().split(' ')) > 1: return True
    # else: return False
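# A hedged sketch of the retired filter described above, kept as a separate function
# (hypothetical name, not in the original) in case the old behaviour is needed again:
def validalias_strict(alias):
    stripped = str(alias).strip()
    # longer than 5 characters, and at least a bigram (contains an inner space)
    return len(stripped) > 5 and len(stripped.split(' ')) > 1
# validalias_strict('Ilénke') is False; validalias_strict('Nagy Ilén') is True.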
def matchfinder(text, searchforthese):
    '''
    This function uses list comprehensions on nested lists.
    A good visualization of how that works: https://i.stack.imgur.com/0GoV5.gif
    From here: https://stackoverflow.com/a/45079294/8565438. I recommend understanding this answer.
    We iterate through searchforthese, which is a list of alias lists, ie:
    [['LMP', 'Lehet Más A Politika'],
     ['Csárdi Antal', 'Csárdi'],
     ['Demeter Márta', 'Demeter'],
     ['Ungár Péter', 'Ungár']]
    persondata_searchtarget is the individual alias list (ie ['LMP', 'Lehet Más A Politika'], then ['Csárdi Antal', 'Csárdi']).
    We iterate through these (ie take 'LMP', then 'Lehet Más A Politika') & check conditions.
    If the conditions are fulfilled, the alias which fulfilled them will be part of matches.
    These conditions:
    - the alias should be a valid alias (ie validalias(alias) should be True)
    - alias.lower() in str(text).lower() should be True.
    'ElEfÁnT'.lower() is 'elefánt': lowercasing both alias & text avoids inconsistencies in capitalization.
    str(np.nan) is 'nan'. The str() call is there in case the input data contains nans, which would otherwise break the function.
    Example:
    matchfinder('LMP, legyen meleg Péter, Csárdi Antal nem csárdi', [['LMP', 'Lehet Más A Politika'],
                                                                     ['Csárdi Antal', 'Csárdi'],
                                                                     ['Demeter Márta', 'Demeter'],
                                                                     ['Ungár Péter', 'Ungár']])
    Returns: ['LMP', 'Csárdi Antal', 'Csárdi'].
    We see that occasionally we detect a false positive: here, by returning 'Csárdi' for the lowercase 'csárdi'.
    If we don't lowercase everything, we'll have false negatives instead.
    Make your choices.
    '''
    matches = [alias
               for persondata_searchtarget in searchforthese
               for alias in persondata_searchtarget
               if validalias(alias) and alias.lower() in str(text).lower()]
    return matches
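# A hedged alternative sketch, not the original method: substring matching as above
# can also fire inside longer words (eg the alias 'Hiller' inside a hypothetical
# 'Schillerné'). Requiring regex word boundaries avoids that class of false
# positives, at the cost of missing suffixed/inflected forms of the alias.
def matchfinder_wholeword(text, searchforthese):
    return [alias
            for persondata_searchtarget in searchforthese
            for alias in persondata_searchtarget
            if validalias(alias)
            and re.search(r'\b' + re.escape(alias.lower()) + r'\b', str(text).lower())]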
class dictionary_class:
    '''
    Each dictionary has name, maxcolnum, searchlist and geo attributes.
    name: the name of the dictionary.
    maxcolnum: the maximum number of different entries we want to find in a text.
    If we find more, we say we haven't found any of them.
    (This is to tackle texts which were signed by a lot of people but are not about those people.)
    searchlist: the list of alias lists, ie:
    [['LMP', 'Lehet Más A Politika'],
     ['Csárdi Antal', 'Csárdi'],
     ['Demeter Márta', 'Demeter'],
     ['Ungár Péter', 'Ungár']]
    geo: a boolean indicating that the dictionary contains geographical names.
    Those are sometimes treated differently from people's names.
    '''
    def __init__(self, name, maxcolnum, searchlist=None, geo=False):
        self.name = name
        self.maxcolnum = maxcolnum
        self.searchlist = searchlist
        self.geo = geo
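# Usage sketch: build a dictionary_class instance for the LMP tab.
# maxcolnum=5 is an arbitrary illustrative value; geo keeps its default False.
# lmp = dictionary_class(name='LMP', maxcolnum=5, searchlist=getSearchListFromTabName('LMP'))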
def get_files_sorted_by_date_after_a_date(look_for_this_pattern, cutoffdate):
    '''
    look_for_this_pattern should be a wildcard, ie: '/mnt/volume/anagy/mediascraper/mediaScraper/output/data*csv'
    cutoffdate should be a list in the format of [year, month, day, 0], ie: [2020,7,1,0]
    csvs will be a list of files like this, with the example look_for_this_pattern wildcard above:
    ['/mnt/volume/anagy/mediascraper/mediaScraper/output/data_2020-08-17_16:01:34.csv',
     '/mnt/volume/anagy/mediascraper/mediaScraper/output/data_2020-07-27_06:00:42.csv',
     '/mnt/volume/anagy/mediascraper/mediaScraper/output/data_2020-07-26_22:00:58.csv',
     ...]
    These are the scraped input files.
    They are generated every two hours.
    data_2020-07-26_22:00:58.csv was generated (roughly) at 22h on 26 July 2020, & it took 00:58 to produce the file.
    re.findall(r'\d+', each) finds all the digit runs in the variable each.
    Source: https://stackoverflow.com/a/4289348/8565438
    Examples:
    re.findall(r'\d+', 'A 3 kismalac és az 1 farkas meséjének 10edik változata')
    returns:
    ['3', '1', '10']
    re.findall(r'\d+', '/mnt/volume/anagy/mediascraper/mediaScraper/output/data_2020-08-17_16:01:34.csv')
    returns:
    ['2020', '08', '17', '16', '01', '34']
    | !!!WARNING!!!:
    | DO NOT USE NUMERAL CHARACTERS IN THE PATH LEADING TO THE INPUT FILES.
    | For example:
    | re.findall(r'\d+', '/mnt/volume/anagy/mediascraper/mediaScraper/output2/data_2020-08-17_16:01:34.csv')
    | returns:
    | ['2', '2020', '08', '17', '16', '01', '34']
    | The '2' in the front is clearly not what we want.
    The inner list comprehension converts these to ints.
    The first 4 elements of this list (year, month, day, hour) are what we care about, hence the [:4].
    datetimes will be a list of datetime.datetime objects; for example, one element of the list can be:
    datetime.datetime(2020, 8, 17, 16, 0)
    The * used in the creation of the datetimes list unpacks argument lists, as described in this answer, for example:
    https://stackoverflow.com/a/36908/8565438
    dt_csvs_filtered will be a list of lists.
    Each inner list contains a datetime.datetime object & the csv filename it was created from.
    dt_csvs_filtered will only contain those pairs whose datetime.datetime object satisfies a condition.
    Condition: the datetime.datetime object must reference a point in time no earlier than datetime.datetime(*cutoffdate).
    sorted_filtered_csvs is the csv filenames from the dt_csvs_filtered list.
    They are sorted chronologically using the datetime.datetime objects within dt_csvs_filtered.
    More detailed description here: https://stackoverflow.com/a/6618543/8565438
    '''
    csvs = glob.glob(look_for_this_pattern)
    datetimes = [datetime.datetime(*[int(num) for num in re.findall(r'\d+', each)[:4]]) for each in csvs]
    dt_csvs_filtered = [[dt, csv] for dt, csv in zip(datetimes, csvs) if dt >= datetime.datetime(*cutoffdate)]
    sorted_filtered_csvs = [csv
                            for _, csv in sorted(
                                zip([eachpair[0] for eachpair in dt_csvs_filtered],
                                    [eachpair[1] for eachpair in dt_csvs_filtered]))]
    return sorted_filtered_csvs
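# Usage sketch, with the example wildcard & cutoff from the docstring above:
# csvs = get_files_sorted_by_date_after_a_date(
#     '/mnt/volume/anagy/mediascraper/mediaScraper/output/data*csv', [2020, 7, 1, 0])
# csvs is then chronologically ordered, oldest file first.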