Skip to content

Instantly share code, notes, and snippets.

@jlln
Last active November 9, 2023 19:59
Show Gist options
  • Star 96 You must be signed in to star a gist
  • Fork 27 You must be signed in to fork a gist
  • Save jlln/338b4b0b55bd6984f883 to your computer and use it in GitHub Desktop.
Save jlln/338b4b0b55bd6984f883 to your computer and use it in GitHub Desktop.
Efficiently split Pandas Dataframe cells containing lists into multiple rows, duplicating the other column's values.
def splitDataFrameList(df,target_column,separator):
''' df = dataframe to split,
target_column = the column containing the values to split
separator = the symbol used to perform the split
returns: a dataframe with each entry for the target column separated, with each element moved into a new row.
The values in the other columns are duplicated across the newly divided rows.
'''
def splitListToRows(row,row_accumulator,target_column,separator):
split_row = row[target_column].split(separator)
for s in split_row:
new_row = row.to_dict()
new_row[target_column] = s
row_accumulator.append(new_row)
new_rows = []
df.apply(splitListToRows,axis=1,args = (new_rows,target_column,separator))
new_df = pandas.DataFrame(new_rows)
return new_df
@JoaoCarabetta
Copy link

JoaoCarabetta commented Apr 11, 2018

I modified it a little bit.

It knows accepts columns with multiple types and also enforces an output type.

def split_data_frame_list(df, 
                       target_column,
                      output_type=float):
    ''' 
    Accepts a column with multiple types and splits list variables to several rows.

    df: dataframe to split
    target_column: the column containing the values to split
    output_type: type of all outputs
    returns: a dataframe with each entry for the target column separated, with each element moved into a new row. 
    The values in the other columns are duplicated across the newly divided rows.
    '''
    row_accumulator = []

    def split_list_to_rows(row):
        split_row = row[target_column]
        if isinstance(split_row, list):
          for s in split_row:
              new_row = row.to_dict()
              new_row[target_column] = output_type(s)
              row_accumulator.append(new_row)
        else:
          new_row = row.to_dict()
          new_row[target_column] = output_type(split_row)
          row_accumulator.append(new_row)
  
    df.apply(split_list_to_rows, axis=1)
    new_df = pd.DataFrame(row_accumulator)
  
    return new_df

@click-beetle
Copy link

This was handy! Thanks!

@Marshallmaxiao
Copy link

It works without using separator if the value in the column is a list (like [1,2,3])
Thanks!
Is there a way to make it faster? I have a larget dataset.

@nate-robo
Copy link

nate-robo commented Jul 17, 2018

Is there anyway to accomplish this taking into consideration multiple columns at once. For example instead of one column which is a comma delimited list I have multiple columns which correspond to each other.

 col a     |      col b     |      col c          |
  a, b, c          1, 2, 3,      2.0, 3.0, 4.0

Turns into
col a | col b | col c |
a, 1, 2.0
b, 2, 3.0
c, 3, 4.0

Copy link

ghost commented Aug 12, 2018

This variation might be a bit faster.

def split_data_frame_list(df, target_column):
    """
    Splits a column with lists into rows
    
    Keyword arguments:
        df -- dataframe
        target_column -- name of column that contains lists        
    """
    
    # create a new dataframe with each item in a seperate column, dropping rows with missing values
    col_df = pd.DataFrame(df[target_column].dropna().tolist(),index=df[target_column].dropna().index)

    # create a series with columns stacked as rows         
    stacked = col_df.stack()

    # rename last column to 'idx'
    index = stacked.index.rename(names="idx", level=-1)
    new_df = pd.DataFrame(stacked, index=index, columns=[target_column])
    return new_df

Example:

>>> df = pd.DataFrame({'name':['a','b','c'], "items":[['a1','a2','a3'],['b1','b2','b3'],['c1','c2','c3']]})`
>>> df
	name	items
0	a	[a1, a2, a3]
1	b	[b1, b2, b3]
2	c	[c1, c2, c3]

>>> split_data_frame_list(df, target_column="items")

		items
	idx	
0	0	a1
	1	a2
	2	a3
1	0	b1
	1	b2
	2	b3
2	0	c1
	1	c2
	2	c3

@zouweilin
Copy link

zouweilin commented Sep 5, 2018

Hey I made it so it can accept multiple columns and try to split on all of them at the same time

def split_dataframe_rows(df,column_selectors, row_delimiter):
    # we need to keep track of the ordering of the columns
    def _split_list_to_rows(row,row_accumulator,column_selector,row_delimiter):
        split_rows = {}
        max_split = 0
        for column_selector in column_selectors:
            split_row = row[column_selector].split(row_delimiter)
            split_rows[column_selector] = split_row
            if len(split_row) > max_split:
                max_split = len(split_row)
            
        for i in range(max_split):
            new_row = row.to_dict()
            for column_selector in column_selectors:
                try:
                    new_row[column_selector] = split_rows[column_selector].pop(0)
                except IndexError:
                    new_row[column_selector] = ''
            row_accumulator.append(new_row)

    new_rows = []
    df.apply(_split_list_to_rows,axis=1,args = (new_rows,column_selectors,row_delimiter))
    new_df = pd.DataFrame(new_rows, columns=df.columns)
    return new_df

image

into

image

@kleinias
Copy link

And here is some variation of @JoaoCarabetta's split function, that leaves additional columns as they are (no drop of columns) and sets list-columns with empty lists with None, while copying the other rows as they were.

def split_data_frame_list(df, 
                       target_column,
                      output_type=float):
    ''' 
    Accepts a column with multiple types and splits list variables to several rows.

    df: dataframe to split
    target_column: the column containing the values to split
    output_type: type of all outputs
    returns: a dataframe with each entry for the target column separated, with each element moved into a new row. 
    The values in the other columns are duplicated across the newly divided rows.
    '''
    row_accumulator = []
    def split_list_to_rows(row):
        split_row = row[target_column]
        if isinstance(split_row, list):
          for s in split_row:
              new_row = row.to_dict()
              new_row[target_column] = s
              row_accumulator.append(new_row)
          if split_row == []:
              new_row = row.to_dict()
              new_row[target_column] = None
              row_accumulator.append(new_row)
        else:
          new_row = row.to_dict()
          new_row[target_column] = split_row
          row_accumulator.append(new_row)
    df.apply(split_list_to_rows, axis=1)
    new_df = pd.DataFrame(row_accumulator)
    return new_df

>>> df = pd.DataFrame({'name':['a','b','c','d'], "items":[['a1','a2','a3'],['b1','b2','b3'],['c1','c2','c3'],[]],'leave me':range(4)})
>>> df
          items  leave me name
0  [a1, a2, a3]         0    a
1  [b1, b2, b3]         1    b
2  [c1, c2, c3]         2    c
3            []         3    d
>>> split_data_frame_list(df, target_column='items')
  items  leave me name
0    a1         0    a
1    a2         0    a
2    a3         0    a
3    b1         1    b
4    b2         1    b
5    b3         1    b
6    c1         2    c
7    c2         2    c
8    c3         2    c
9  None         3    d

@FernandoFavoretti
Copy link

helps a lot, thank you =D

@gnespatel1618
Copy link

@zouweilin 's extended version for lists

def split_dataframe_rows(df,column_selectors):
    # we need to keep track of the ordering of the columns
    def _split_list_to_rows(row,row_accumulator,column_selector):
        split_rows = {}
        max_split = 0
        for column_selector in column_selectors:
            split_row = row[column_selector]
            split_rows[column_selector] = split_row
            if len(split_row) > max_split:
                max_split = len(split_row)
            
        for i in range(max_split):
            new_row = row.to_dict()
            for column_selector in column_selectors:
                try:
                    new_row[column_selector] = split_rows[column_selector].pop(0)
                except IndexError:
                    new_row[column_selector] = ''
            row_accumulator.append(new_row)

    new_rows = []
    df.apply(_split_list_to_rows,axis=1,args = (new_rows,column_selectors))
    new_df = pd.DataFrame(new_rows, columns=df.columns)
    return new_df

Try this too:

def flatten_data(json = None):
    df = pd.DataFrame(json)
    list_cols = [col for col in df.columns if type(df.loc[0, col]) == list]
    for i in range(len(list_cols)):
        col = list_cols[i]
        meta_cols = [col for col in df.columns if type(df.loc[0, col]) != list] + list_cols[i+1:]
        json_data = df.to_dict('records')
        df = json_normalize(data=json_data, record_path=col, meta=meta_cols, record_prefix=col+str('_'), sep='_')
    return json_normalize(df.to_dict('records'))

ENJOY..!!!

@kanvesh
Copy link

kanvesh commented Feb 7, 2019

Was helpful. Thanks.

@harllos
Copy link

harllos commented Feb 13, 2019

Works just fine! Thank you.

@nitishkmr1989
Copy link

Can someone help me with the code for the below problem:

I have multiple columns with more than 1 value separated by delimiter. I need to create separate rows for those columns such that each value in the column will become a new row keeping the other values same.

I have attached the input and expected output in the excel sheet.

@akshay681
Copy link

Literally just made a github account right now so I could say thank you.

Thank you!

EXACTLY what I was looking for and worked like a charm.

@QtRoS
Copy link

QtRoS commented Jun 4, 2019

Big thanks, worked really well, much faster and cleaner than 1-st link solutions from Google.

@sfiso001
Copy link

So I'm trying to use code by when I call the function and insert my delimeter into the function (semicolon) I get an invalid syntax error. I might be doing something wrong. Sorry I'm new to programming.

splitDataFrameList(df, alert_rule, ;)

@barryhillier
Copy link

Thanks so much, using this saved me a ton of time!

@kaishwary
Copy link

Thank you , I modified it little bit to accomodate multiple delimiters by generating a regex pattern.

Like -
delimiters = ",","|"

Import re module for this
import re

def splitDataFrameList(df,target_column,delimiters):
    ''' df = dataframe to split,
    target_column = the column containing the values to split
    separator = the symbol used to perform the split
    returns: a dataframe with each entry for the target column separated, with each element moved into a new row. 
    The values in the other columns are duplicated across the newly divided rows.
    '''
    regexPattern = "|".join(map(re.escape,delimiters))
    def splitListToRows(row,row_accumulator,target_column,regexPattern):
        split_row = re.split(regexPattern,row[target_column])
        for s in split_row:
            new_row = row.to_dict()
            new_row[target_column] = s
            row_accumulator.append(new_row)
    new_rows = []
    df.apply(splitListToRows,axis=1,args = (new_rows,target_column,regexPattern))
    new_df = pd.DataFrame(new_rows)
    return new_df

@namanjh
Copy link

namanjh commented Sep 11, 2019

Hey I made it so it can accept multiple columns and try to split on all of them at the same time

def split_dataframe_rows(df,column_selectors, row_delimiter):
    # we need to keep track of the ordering of the columns
    def _split_list_to_rows(row,row_accumulator,column_selector,row_delimiter):
        split_rows = {}
        max_split = 0
        for column_selector in column_selectors:
            split_row = row[column_selector].split(row_delimiter)
            split_rows[column_selector] = split_row
            if len(split_row) > max_split:
                max_split = len(split_row)
            
        for i in range(max_split):
            new_row = row.to_dict()
            for column_selector in column_selectors:
                try:
                    new_row[column_selector] = split_rows[column_selector].pop(0)
                except IndexError:
                    new_row[column_selector] = ''
            row_accumulator.append(new_row)

    new_rows = []
    df.apply(_split_list_to_rows,axis=1,args = (new_rows,column_selectors,row_delimiter))
    new_df = pd.DataFrame(new_rows, columns=df.columns)
    return new_df

image

into

image

Thanks bro .. worked like a charm..

@Prateek180909
Copy link

Worked wonders, Thank you so much :D

@pjdw
Copy link

pjdw commented Mar 26, 2020

Thank you , I modified it little bit to accomodate multiple delimiters by generating a regex pattern.

Like -
delimiters = ",","|"

Import re module for this
import re

def splitDataFrameList(df,target_column,delimiters):
    ''' df = dataframe to split,
    target_column = the column containing the values to split
    separator = the symbol used to perform the split
    returns: a dataframe with each entry for the target column separated, with each element moved into a new row. 
    The values in the other columns are duplicated across the newly divided rows.
    '''
    regexPattern = "|".join(map(re.escape,delimiters))
    def splitListToRows(row,row_accumulator,target_column,regexPattern):
        split_row = re.split(regexPattern,row[target_column])
        for s in split_row:
            new_row = row.to_dict()
            new_row[target_column] = s
            row_accumulator.append(new_row)
    new_rows = []
    df.apply(splitListToRows,axis=1,args = (new_rows,target_column,regexPattern))
    new_df = pd.DataFrame(new_rows)
    return new_df

I have a dataframe that contains fieldnames and field content. There is no specific delimiter, but the fieldnames are limitative.
Have a somewhat clumsy way
https://stackoverflow.com/questions/60773067/split-string-no-delimiter-with-limitative-field-names-and-content
But this is far from optimal. Any advice?

@kpunyakoti
Copy link

kpunyakoti commented Apr 17, 2020

What a help! very handy, saved time and worked like a magic!! Thank you!

Exactly what I was looking for.

@sirishaditya
Copy link

Hey I made it so it can accept multiple columns and try to split on all of them at the same time

def split_dataframe_rows(df,column_selectors, row_delimiter):
    # we need to keep track of the ordering of the columns
    def _split_list_to_rows(row,row_accumulator,column_selector,row_delimiter):
        split_rows = {}
        max_split = 0
        for column_selector in column_selectors:
            split_row = row[column_selector].split(row_delimiter)
            split_rows[column_selector] = split_row
            if len(split_row) > max_split:
                max_split = len(split_row)
            
        for i in range(max_split):
            new_row = row.to_dict()
            for column_selector in column_selectors:
                try:
                    new_row[column_selector] = split_rows[column_selector].pop(0)
                except IndexError:
                    new_row[column_selector] = ''
            row_accumulator.append(new_row)

    new_rows = []
    df.apply(_split_list_to_rows,axis=1,args = (new_rows,column_selectors,row_delimiter))
    new_df = pd.DataFrame(new_rows, columns=df.columns)
    return new_df

image

into

image

Works great, exactly what I was looking for. Thanks!

@tvk66866
Copy link

tvk66866 commented May 4, 2020

Very useful. Thanks

@ShashwatShah24
Copy link

ShashwatShah24 commented Jun 11, 2020

Thanks A lot For this code

@DanielDaCosta
Copy link

Thanks! Very useful!

You could use 'pd.' instead of 'pandas.' :)

@pedrovgp
Copy link

From pandas 0.25 on, one can use explode

@aliheadou
Copy link

Thank you !

@tomasmetroy
Copy link

very helpful!

@Tusmijm
Copy link

Tusmijm commented Oct 2, 2022

Need some help. I am unable to run script provided by @namanjh and others above. I am using the same input data file separated by commas. Script errors due to new_df not being defined. Please help. Thanks.

Input data:

Contact Email email2 phone notes
adam,bob adam.con, bob.com, john.com adam.com2, bob.com2, john.com2 adamphone, bobphone, johnphone should be same for everyone
other contact don’t touch this asdf asdf don’t touch this
rachael, simone, snake rachael.com, simone.com rachael.com2, simone.com2, snake.com2 rachaelphone, simonephone should be same for everyone
other contact don’t touch this asdf asdf don’t touch this

Script:
#import numpy as np
import pandas as pd
from IPython.display import display

df = pd.DataFrame(pd.read_excel("file_path.xlsx"))
column_selectors = list(df)
row_delimiters = ','
#new_df = []
display(df)
display(column_selectors)

def split_dataframe_rows(df, column_selectors, row_delimiters):
def _split_list_to_rows(row, row_accumulator, column_selector, row_delimiter):
split_rows = {}
max_split = 0
for column_selector in column_selectors:
split_row = row[column_selector].split(row_delimiter)
split_rows[column_selector] = split_row
if len(split_row) > max_split:
max_split = len(split_row)

    for i in range(max_split):
        new_row = row.to_dict()
        for column_selector in column_selectors:
            try:
                new_row[column_selector] = split_rows[column_selector].pop(0)
            except IndexError:
                new_row[column_selector] = ''
        row_accumulator.append(new_row)
    
new_rows = []
df.apply(_split_list_to_rows, axis=1, args = (new_rows, column_selectors, row_delimiter))
new_df = pd.DataFrame(new_rows, column=df.columns)
return new_df

df2 = new_df
display(df2)

@john0305
Copy link

Super excited, this is my first time commenting with something I made, be gentle. I needed something similar when a number and a string were in one column separated by a comma (blanks as well). I modified my code a bit to hopefully work a little more universally.

The code checks how many times a delimiter is used in each column row, then repeats that line for each one.

import pandas as pd
from itertools import chain
import numpy as np

def chainer(df,col,sep,dtype):
  df = df.astype({col:dtype})
  lens = df[col].str.split(sep).map(len)
  dicts = {}
  for cols in df.columns:
    if cols == col:
      dicts[cols] = list(chain.from_iterable(df[cols].str.split(sep)))
    else:
      dicts[cols] = np.repeat(df[cols],lens)
  return pd.DataFrame.from_dict(dicts)

df = chainer(df,'Combined Column',',','str')

Added the astype because my column wouldn't convert a float or NaN, after using str worked like a champ.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment