@cigrainger
Created August 13, 2021 03:36

Exploring Explorer

Installation

Mix.install([
  {:explorer, "~> 0.1.0-dev",
   github: "amplifiedai/explorer", ref: "9b95d08679ecc97fcd4b5a9eedaaade95954bdfb"}
])

Reading and writing data

Currently, data can only be read in from delimited files. Support for JSON and Parquet is forthcoming. Your 'usual suspects' of options are available:

* `delimiter` - A single character used to separate fields within a record. (default: `","`)
* `dtypes` - A keyword list of `[column_name: dtype]`. If `nil`, dtypes are imputed from the first 1000 rows. (default: `nil`)
* `header?` - Does the file have a header of column names as the first row or not? (default: `true`)
* `max_rows` - Maximum number of lines to read. (default: `Inf`)
* `names` - A list of column names. Must match the width of the dataframe. (default: `nil`)
* `null_character` - The string that should be interpreted as a nil value. (default: `"NA"`)
* `skip_rows` - The number of lines to skip at the beginning of the file. (default: `0`)
* `with_columns` - A list of column names to keep. If present, only these columns are read into the dataframe. (default: `nil`)
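
To illustrate a few of the options above, here's a minimal sketch, assuming a `DataFrame.read_csv/2` counterpart to the `DataFrame.write_csv/2` used later in this notebook (the filename and option values are hypothetical):

```elixir
# Hypothetical example: read a tab-delimited file, treating "NA" as nil
# and stopping after 1000 rows. Assumes DataFrame.read_csv/2 exists in
# this dev version, mirroring write_csv/2.
df =
  Explorer.DataFrame.read_csv("data.tsv",
    delimiter: "\t",
    null_character: "NA",
    max_rows: 1000
  )
```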

Explorer also has a built-in dataset, with more to come. Let's use that.

df = Explorer.Datasets.fossil_fuels()

You'll notice that the output looks slightly different from many dataframe libraries. On this front, Explorer takes inspiration from `glimpse` in R. A benefit of this approach is that you will rarely need to elide columns.

I have an open issue for implementing the Kino.Render protocol and would love help on that front.

Writing delimited files is very similar. The options are a little more limited:

* `header?` - Should the column names be written as the first line of the file? (default: `true`)
* `delimiter` - A single character used to separate fields within a record. (default: `","`)

First, let's add some useful aliases:

alias Explorer.DataFrame
alias Explorer.Series

And then write to a file of your choosing:

filename = "filename" |> IO.gets() |> String.trim()
DataFrame.write_csv(df, filename)

Working with Series

Explorer, like Polars, works up from the concept of a Series. These are like vectors in R or series in Pandas.

For simplicity, Explorer uses the following Series dtypes:

* `:float` - 64-bit floating point number
* `:integer` - 64-bit signed integer
* `:boolean` - Boolean
* `:string` - UTF-8 encoded binary
* `:date` - Date type that unwraps to `Elixir.Date`
* `:datetime` - DateTime type that unwraps to `Elixir.NaiveDateTime`

Series can be constructed from Elixir basic types. For example:

s1 = Series.from_list([1, 2, 3])
s2 = Series.from_list(["a", "b", "c"])
s3 = Series.from_list([~D[2011-01-01], ~D[1965-01-21]])

You'll notice that the dtype and length of the Series are at the top of the printed value. You can get those programmatically as well.

Series.dtype(s3)
Series.length(s3)

And the printed values max out at 50:

1..100 |> Enum.to_list() |> Series.from_list()

Series are also nullable.

s = Series.from_list([1.0, 2.0, nil, nil, 5.0])

And you can fill in those missing values using one of the following strategies:

* `:forward` - replace nil with the previous value
* `:backward` - replace nil with the next value
* `:max` - replace nil with the series maximum
* `:min` - replace nil with the series minimum
* `:mean` - replace nil with the series mean

Series.fill_missing(s, :forward)

However, Series must all be of the same dtype or else you'll get an ArgumentError.

Series.from_list([1, 2, 3, 4.0])

One of the goals of Explorer is useful error messages. If you look at the error above, you get:

Cannot make a series from mismatched types. Type of 4.0 does not match inferred dtype integer.

Hopefully this makes it abundantly clear what's going on.

Series also implements the Access protocol. You can slice and dice in many ways:

s = 1..10 |> Enum.to_list() |> Series.from_list()
s[1]
s[-1]
s[0..4]
s[[0, 4, 4]]

And of course, you can convert back to an Elixir list.

Series.to_list(s)

Explorer supports comparisons.

s = 1..11 |> Enum.to_list() |> Series.from_list()
s1 = 11..1 |> Enum.to_list() |> Series.from_list()
Series.eq(s, s1)
Series.eq(s, 5)
Series.neq(s, 10)
Series.gt_eq(s, 4)

And arithmetic.

Series.add(s, s1)
Series.subtract(s, 4)
Series.multiply(s, s1)

Remember those helpful errors? I've tried to add those throughout. So if you try to do arithmetic with mismatching dtypes:

s = Series.from_list([1, 2, 3])
s1 = Series.from_list([1.0, 2.0, 3.0])
Series.add(s, s1)

Just kidding! Mixing integers and floats casts everything to floats. Let's try again:

s = Series.from_list([1, 2, 3])
s1 = Series.from_list(["a", "b", "c"])
Series.add(s, s1)

You can flip them around.

s = Series.from_list([1, 2, 3, 4])
Series.reverse(s)

And sort.

1..100 |> Enum.to_list() |> Enum.shuffle() |> Series.from_list() |> Series.sort()

Or argsort.

s = 1..100 |> Enum.to_list() |> Enum.shuffle() |> Series.from_list()
ids = Series.argsort(s)

Which you can pass to Explorer.Series.take/2 if you want the sorted values.

Series.take(s, ids)

You can calculate cumulative values.

s = 1..100 |> Enum.to_list() |> Series.from_list()
Series.cum_sum(s)

Or rolling ones.

Series.rolling_sum(s, 4)

You can count and list unique values.

s = Series.from_list(["a", "b", "b", "c", "c", "c"])
Series.distinct(s)
Series.n_distinct(s)

And you can even get a dataframe showing the counts for each distinct value.

Series.count(s)

Working with DataFrames

A DataFrame is really just a collection of Series of the same length, which is why you can create a DataFrame from a map.

DataFrame.from_map(%{a: [1, 2, 3], b: ["a", "b", "c"]})

Similarly to Series, the Inspect implementation prints some info at the top and to the left. At the top we see the shape of the dataframe (rows and columns) and then for each column we see the name, dtype, and first five values. We can see a bit more from that built-in dataset we loaded in earlier.

df

You will also see grouping information there, but we'll get to that later. You can get the info yourself directly:

DataFrame.names(df)
DataFrame.dtypes(df)
DataFrame.shape(df)
{DataFrame.n_rows(df), DataFrame.n_cols(df)}

We can grab the head.

DataFrame.head(df)

Or the tail. Let's get a few more values from the tail.

DataFrame.tail(df, 10)

Select

Let's jump right into it. We can select columns pretty simply.

DataFrame.select(df, ["year", "country"])

But Elixir gives us some superpowers. In R there's tidy-select. I don't think we need that in Elixir. Anywhere in Explorer where you need to pass a list of column names, you can instead pass a filtering callback on the column names: it's just an anonymous function applied via `df |> DataFrame.names() |> Enum.filter(callback)`.

DataFrame.select(df, &String.ends_with?(&1, "fuel"))

Want all but some columns? DataFrame.select/3 takes :keep or :drop as the last arg. It just defaults to :keep.

DataFrame.select(df, &String.ends_with?(&1, "fuel"), :drop)

Filter

The next verb we'll look at is filter. You can pass in a boolean mask series of the same length as the dataframe, but I find it's more handy to use callbacks on the dataframe to generate those masks.

DataFrame.filter(df, &Series.eq(&1["country"], "AFGHANISTAN"))

filtered_df =
  df
  |> DataFrame.filter(&Series.eq(&1["country"], "ALGERIA"))
  |> DataFrame.filter(&Series.gt(&1["year"], 2012))

DataFrame.filter(filtered_df, [true, false])

Remember those helpful error messages?

DataFrame.filter(df, &Series.eq(&1["cuontry"], "AFGHANISTAN"))

Mutate

A common task in data analysis is to add columns or change existing ones. Mutate is a handy verb.

DataFrame.mutate(df, new_column: &Series.add(&1["solid_fuel"], &1["cement"]))

Did you catch that? You can pass in new columns as keyword arguments. It also works to transform existing columns.

DataFrame.mutate(df,
  gas_fuel: &Series.cast(&1["gas_fuel"], :float),
  gas_and_liquid_fuel: &Series.add(&1["gas_fuel"], &1["liquid_fuel"])
)

DataFrame.mutate/2 is flexible though. You may not always want to use keyword arguments. Given that column names are String.t(), it may make more sense to use a map.

DataFrame.mutate(df, %{"gas_fuel" => &Series.subtract(&1["gas_fuel"], 10)})

DataFrame.transmute/2, which is DataFrame.mutate/2 that only retains the specified columns, is forthcoming.

Arrange

Sorting the dataframe is pretty straightforward.

DataFrame.arrange(df, ["year"])

But it comes with some tricks up its sleeve.

DataFrame.arrange(df, total: :asc, year: :desc)

Sort operations are applied left to right, and keyword-list args let you specify the direction for each column.

Distinct

Okay, as expected here too. Very straightforward.

DataFrame.distinct(df, columns: ["year", "country"])

You can specify whether to keep the other columns as well.

DataFrame.distinct(df, columns: ["country"], keep_all?: true)

Rename

Rename can take either a list of new names or a callback that is passed to Enum.map/2 against the names. You can also use a map or keyword args to rename specific columns.

DataFrame.rename(df, year: "year_test")
DataFrame.rename(df, &(&1 <> "_test"))
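
For completeness, here's a sketch of the map form, assuming it mirrors the keyword form above (the new name is made up):

```elixir
# Rename a single column via a map of old => new names; all other
# columns keep their existing names.
DataFrame.rename(df, %{"country" => "country_name"})
```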

Dummies

This is fun! We can get dummy variables for unique values.

DataFrame.dummies(df, ["year"])
DataFrame.dummies(df, ["country"])

Sampling

Random sampling can return either a percentage of rows or a specific number of rows, with or without replacement, and the function is seedable.

DataFrame.sample(df, 10)
DataFrame.sample(df, 0.4)

Trying for those helpful error messages again.

DataFrame.sample(df, 10000)
DataFrame.sample(df, 10000, with_replacement?: true)

Pull/slice/take

Slicing and dicing can be done with the Access protocol or with explicit pull/slice/take functions.

df["year"]
DataFrame.pull(df, "year")
df[["year", "country"]]
DataFrame.take(df, [1, 20, 50])

Negative offsets work for slice!

DataFrame.slice(df, -10, 5)
DataFrame.slice(df, 10, 5)

Pivot

We can pivot_longer/3 and pivot_wider/4. These are inspired by tidyr.

There are some shortcomings in pivot_wider/4 related to Polars: the values_from column must be a numeric type.

DataFrame.pivot_longer(df, ["year", "country"], value_cols: &String.ends_with?(&1, "fuel"))
DataFrame.pivot_wider(df, "country", "total", id_cols: ["year"])

Let's make those names look nicer!

tidy_names = fn name ->
  name
  |> String.downcase()
  |> String.replace(~r/\s/, " ")
  |> String.replace(~r/[^A-Za-z\s]/, "")
  |> String.replace(" ", "_")
end

df |> DataFrame.pivot_wider("country", "total", id_cols: ["year"]) |> DataFrame.rename(tidy_names)

Joins

Joining is fast and easy. You can specify the columns to join on and how to join. Polars even supports Cartesian (cross) joins, so Explorer does too.

df1 = DataFrame.select(df, ["year", "country", "total"])
df2 = DataFrame.select(df, ["year", "country", "cement"])
DataFrame.join(df1, df2)
df3 = df |> DataFrame.select(["year", "cement"]) |> DataFrame.slice(0, 500)
DataFrame.join(df1, df3, how: :left)
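
And a sketch of a cross join, assuming `how: :cross` is exposed the same way as `:left` (the small frames are made up for clarity):

```elixir
# Cartesian product: every row of left paired with every row of right,
# yielding 2 x 2 = 4 rows.
left = DataFrame.from_map(%{x: [1, 2]})
right = DataFrame.from_map(%{y: ["a", "b"]})
DataFrame.join(left, right, how: :cross)
```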

Grouping

Explorer supports group-by operations. They're limited to what's possible in Polars, but they cover most of what you need to do.

grouped = DataFrame.group_by(df, ["country"])

Notice that the Inspect call now shows groups as well as rows and columns. You can, of course, get them explicitly.

DataFrame.groups(grouped)

And you can ungroup explicitly.

DataFrame.ungroup(grouped)

But what we care about the most is aggregating! Let's see which country has the max per_capita value.

grouped |> DataFrame.summarise(per_capita: [:max]) |> DataFrame.arrange(per_capita_max: :desc)

Qatar it is. You can use the following aggregations:

  * `:min` - Take the minimum value within the group. See `Explorer.Series.min/1`.
  * `:max` - Take the maximum value within the group. See `Explorer.Series.max/1`.
  * `:sum` - Take the sum of the series within the group. See `Explorer.Series.sum/1`.
  * `:mean` - Take the mean of the series within the group. See `Explorer.Series.mean/1`.
  * `:median` - Take the median of the series within the group. See `Explorer.Series.median/1`.
  * `:first` - Take the first value within the group. See `Explorer.Series.first/1`.
  * `:last` - Take the last value within the group. See `Explorer.Series.last/1`.
  * `:count` - Count the number of rows per group.
  * `:n_unique` - Count the number of unique rows per group.

The API is similar to mutate: you can use keyword args or a map and specify aggregations to use.

grouped |> DataFrame.summarise(per_capita: [:max, :min], total: [:min])
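
And here's the map form, assuming it mirrors mutate (handy since column names are strings):

```elixir
# The same aggregation expressed with a map instead of keyword args.
DataFrame.summarise(grouped, %{"per_capita" => [:max, :min], "total" => [:min]})
```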

Speaking of mutate, it's 'group-aware'. As are arrange, distinct, and n_rows.

DataFrame.arrange(grouped, total: :desc)

That's it!

Well, not quite. This is certainly not exhaustive, but I hope it gives you a good idea of what can be done and what the 'flavour' of the API is like. I'd love contributors and issues raised where you find them!
