Created April 20, 2016 21:33
Unfortunately, this no longer works in my setup:
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: de_DE.UTF-8
LOCALE: None.None
pandas: 0.20.1
pytest: 3.0.6
pip: 9.0.1
setuptools: 34.2.0
Cython: 0.25.2
numpy: 1.12.0
scipy: 0.19.0
xarray: 0.9.1
IPython: 5.2.2
sphinx: 1.5.2
patsy: 0.4.1
dateutil: 2.5.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.1
tables: 3.3.0
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.2
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.6.4
bs4: 4.5.3
html5lib: 0.999
sqlalchemy: 1.1.4
pymysql: 0.7.10.None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: 0.2.0
pandas_datareader: None
For anyone who wants a shorter version of the above (without using shelve, which gives me the complaint below):
File "---.py", line 104, in get_possible_values
with shelve.open(shelf_name, writeback=True) as shelf:
AttributeError: DbfilenameShelf instance has no attribute '__exit__'
import pandas as pd


def concat(dataframes, categorical_columns, ignore_index=False):
    """Concatenate dataframes with unordered categorical columns.

    Will mutate categorical columns of original dataframes.

    dataframes: list of dataframes.
    categorical_columns: list of names of unordered, categorical columns.
    ignore_index: same as from pd.concat.
    """
    # Get all possible values for all categorical_columns
    possible_values = {}
    for col in categorical_columns:
        possible_values[col] = set()
    for df in dataframes:
        for col in categorical_columns:
            for val in df[col]:
                possible_values[col].add(val)
    # Use pd.Categorical() to re-categorize the values in all columns
    for df in dataframes:
        for col in categorical_columns:
            # A set has no order, so pass the categories as a list
            df[col] = pd.Categorical(
                df[col], categories=list(possible_values[col]), ordered=False)
    return pd.concat(dataframes, axis=0, ignore_index=ignore_index)
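As an aside on the AttributeError quoted above: that message is typical of Python 2, where the object returned by shelve.open() cannot be used directly in a with statement (as far as I know, shelf objects only became context managers in Python 3.4). If you do want to keep the shelve-based approach, wrapping the shelf in contextlib.closing() is one possible workaround. This is only a sketch: store_possible_values and load_possible_values are made-up helper names, not functions from the original gist.

```python
import contextlib
import os
import shelve
import tempfile


def store_possible_values(shelf_name, column, values):
    """Add values to the set stored under `column` in a shelf file.

    Hypothetical helper, not part of the original gist.
    """
    # closing() calls shelf.close() when the block exits, so this also
    # works where the shelf itself has no __enter__/__exit__.
    with contextlib.closing(shelve.open(shelf_name, writeback=True)) as shelf:
        current = shelf.get(column, set())
        current.update(values)
        shelf[column] = current


def load_possible_values(shelf_name, column):
    """Return the set stored under `column`, or an empty set."""
    with contextlib.closing(shelve.open(shelf_name)) as shelf:
        return shelf.get(column, set())


# Usage: keep the shelf in a temporary directory
shelf_path = os.path.join(tempfile.mkdtemp(), "possible_values")
store_possible_values(shelf_path, "color", ["red", "blue"])
store_possible_values(shelf_path, "color", ["green"])
colors = load_possible_values(shelf_path, "color")
```

The same with block also runs on Python 3, so it should not need changing after an upgrade.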
PS: You won't need any of this if you are running pandas 0.19 or later. In my case I have to live with 0.18, and this saved my life today! Thank you, @tdhopper!
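For those on a newer pandas, here is a rough sketch of what the simpler route might look like. It assumes your pandas exposes union_categoricals under pandas.api.types (I believe it lived in a different module in 0.19), and concat_categorical is just a name I made up for illustration. Unlike the function above, it does not mutate the input dataframes.

```python
import pandas as pd
from pandas.api.types import union_categoricals


def concat_categorical(dataframes, categorical_columns, ignore_index=False):
    """Concatenate dataframes, keeping the named columns categorical.

    Hypothetical helper; assumes every named column has category dtype.
    """
    # Merge the categoricals column by column, in the same row order
    # that pd.concat will use.
    merged = {
        col: union_categoricals([df[col] for df in dataframes])
        for col in categorical_columns
    }
    result = pd.concat(dataframes, axis=0, ignore_index=ignore_index)
    for col, values in merged.items():
        # Assign the Categorical positionally; its length matches result.
        result[col] = values
    return result


# Usage: two frames whose categories only partly overlap
a = pd.DataFrame({"c": pd.Categorical(["x", "y"]), "n": [1, 2]})
b = pd.DataFrame({"c": pd.Categorical(["y", "z"]), "n": [3, 4]})
out = concat_categorical([a, b], ["c"], ignore_index=True)
```

Note that union_categoricals is meant for unordered categoricals like the ones discussed here; ordered ones with differing categories would need separate handling.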
There is a ticket about this here: pandas-dev/pandas#12699