@paddymul
Last active June 12, 2023 21:01

With the new release of Buckaroo, df.head() is obsolete. I have worked to make Buckaroo usable as the default table visualization for pandas DataFrames. It does this through sensible defaults and downsampling. The typical process for investigating a new dataset with pandas and Jupyter is to load a DataFrame from CSV, Parquet, or some other data source. The next step is df.head() or df.describe(); if you just type df, pandas will try to show the first 5 rows and last 5 rows, and possibly all of the columns. Pandas needs to limit the output to avoid overwhelming the notebook with text and causing performance issues. Soon you will find yourself looking up pd.options.display.width = 0 or pd.options.display.max_rows = 500, or writing

with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(df)

Eventually you will want to look at a subset of rows using slicing, then look up sorting… To find the rows with the highest or lowest values in a column, you could use a command like this

 print(df[df['trip_time'] == df['trip_time'].max()])

You could view the top 5 rows with df.loc[df['trip_time'].nlargest(5).index] (note the .loc — the plain df[...] version tries to interpret the index values as column labels and raises a KeyError). That's a lot of typing, even if you remember exactly what to type. It's slow, and it leads to messy notebooks.
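For comparison, here is what those incantations look like in plain pandas, using a small made-up DataFrame (the column names are just illustrative). nlargest with a column argument returns the rows directly, so no extra indexing is needed:

```python
import pandas as pd

df = pd.DataFrame({
    "trip_time": [120, 950, 430, 2210, 75, 640],
    "start_station": ["A", "B", "C", "D", "E", "F"],
})

# Rows holding the single largest trip_time
longest = df[df["trip_time"] == df["trip_time"].max()]

# Top 5 rows by trip_time -- nlargest returns the rows themselves,
# already sorted descending
top5 = df.nlargest(5, "trip_time")
```

Even with the tidiest version, you still have to remember and type something for every question you ask of the data.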

Buckaroo makes it better:

When you run import buckaroo, the BuckarooWidget becomes the default display method for every pandas DataFrame. This default display does a lot.

All columns are shown. If there are fewer than 5,000 rows, all rows are shown; if there are more, sampling is turned on. But not just any sampling: sampling that also includes the 5 largest and smallest values of each column. At that point you have around 5,000 rows in an interactive widget that can be sorted by any column, with summary stats a toggle away.
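The idea of a sample that never hides the extremes can be sketched in a few lines of pandas. This is just an illustration of the concept, not Buckaroo's actual implementation:

```python
import numpy as np
import pandas as pd

def extremes_sample(df, n=5000, k=5):
    """Random sample of about n rows that always keeps the k largest
    and k smallest values of every numeric column.
    A sketch of the idea -- not Buckaroo's actual code."""
    if len(df) <= n:
        return df
    keep = set()
    for col in df.select_dtypes("number").columns:
        keep.update(df[col].nlargest(k).index)
        keep.update(df[col].nsmallest(k).index)
    rest = df.index.difference(list(keep))
    sampled = np.random.choice(rest, size=max(0, n - len(keep)), replace=False)
    # preserve the original row order
    return df.loc[df.index.isin(keep) | df.index.isin(sampled)]
```

Because the extremes are pinned, sorting the sampled widget by any numeric column still shows the true largest and smallest values of the full dataset.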

It gets even better though:

There is also column reordering, which tries to put the most interesting columns at the far left where they are easily visible without scrolling. What makes a column interesting is complicated, but an uninteresting column is simple to describe: a column that offers no insight into the dataset, that you will not run a computation against, or that duplicates information from other columns. A column with a single value for all rows offers no additional actionable information, so those columns are ranked lowest. Next are duplicate columns. In my favorite dataset, there are two sets of four duplicated columns: each Citi Bike trip has a start station and an end station, and both come with 'station id', 'station name', 'longitude', and 'latitude'. For every row with 'station id' 359, 'station name' will always be 'E 47 St & Park Ave', 'latitude' will be 40.755, and 'longitude' will be -73.975. We only need one of those columns, and station name is the most descriptive, so 'station name' is put to the left and the other three columns are put to the right.

Finally, the Buckaroo command UI is available with a single click. The command interface lets you iterate through common data-cleaning operations with a GUI… while generating the Python code that performs those operations in a function.
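The two "uninteresting" cases above — constant columns and columns that are fully determined by another column — are easy to detect mechanically. Here is a rough sketch of that ranking, purely illustrative and not Buckaroo's algorithm:

```python
import pandas as pd

def boring_score(df):
    """Score columns by how uninteresting they are:
    0 = constant (no information), 1 = fully determined by some other
    column (duplicate information), 2 = potentially interesting.
    Illustrative sketch only -- not Buckaroo's actual algorithm."""
    scores = {}
    for col in df.columns:
        if df[col].nunique(dropna=False) <= 1:
            scores[col] = 0  # same value in every row
            continue
        # col is determined by `other` if every group of `other`
        # contains exactly one distinct value of col
        duplicated = any(
            df.groupby(other)[col].nunique().max() == 1
            for other in df.columns if other != col
        )
        scores[col] = 1 if duplicated else 2
    return scores
```

Note that functional dependence runs both ways for the station columns ('station id' determines 'station name' and vice versa), so a real implementation still needs a tie-break rule — the text above picks the most descriptive one, the station name.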

Buckaroo is already the fastest table widget for the Jupyter notebook in my testing. Being fast isn't just about bragging rights: to be usable as the default DataFrame display method, some performance guarantees are necessary. Having your kernel lock up for 30 seconds or longer is unacceptable, so the system has to make some decisions for you. This is why sampling is performed automatically for larger datasets.

What this means for your workflows:

You can visualize DataFrames and all of their important attributes with a single command. This encourages you to look at the data instead of making assumptions about it, and it leads to less cluttered notebooks.

Upcoming features in this area

The biggest feature in this area is making the summary stats and column-reordering algorithms pluggable, which will speed up my own feature development. I also want to experiment with running the initial analysis in a separate thread and dynamically sizing the sample, so I can ensure the table always loads in a reasonable and tunable timeframe. I will also be adding unit tests and integrating them with the pluggable stats algorithms (running a series of tests over user-supplied summary stats to check for exceptions). Performance improvements, histograms, and group colors are also coming.

Try Buckaroo

!pip install buckaroo
import buckaroo