Skip to content

Instantly share code, notes, and snippets.

@patrickpang
Created September 5, 2019 06:45
Show Gist options
  • Save patrickpang/3fa82ab72b8456e8fb8d2d025db7c33a to your computer and use it in GitHub Desktop.
Save patrickpang/3fa82ab72b8456e8fb8d2d025db7c33a to your computer and use it in GitHub Desktop.
Data.table and Clojure

Data.table and Clojure

Background

This article compares the approaches of data manipulation using data.table in R and standard functions in Clojure, in order to evaluate the necessity of a Clojure library providing similar functionality of data.table. In the conceptual level, data.table provides a DSL in R specifically for data analysis, with a unified syntax similar to SQL for selecting, grouping, updating and joining tabular data. In contrast, the standard libraries of Clojure provide basic building blocks for general purpose, including persistent data structures (e.g. sequence, vector, set, map) and generic transformation functions (e.g. map, filter, reduce). This article aims to illustrate the impact of the two approaches on data manipulation using common use cases.

Load Dataset

The dataset used in this article is the NYC-flights14 data, which is On-Time flights data from the Bureau of Transporation Statistics for all the flights that departed from New York City airports in 2014.

In R, data.table library provides the fread function, which reads CSV file on disk, parses the data into a data.table with integers automatically casted, and stores the whole data.table in memory.

flights <- fread("https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv")
head(flights)

In Clojure, data.csv library provides the read-csv function, which focuses on parsing CSV data into lazy sequence of vectors. Therefore, we compose it with file reader function, and helper functions to transform the data. The composition is mainly done by thread macros for readability. For the illustration purpose, we load the entire CSV file into memory, which can be avoided as explained later.

(ns skirmish
  (:require
   [clojure.data.csv :as csv]
   [clojure.java.io :as io]
   [clojure.set :as set]
   [clojure.pprint :as pprint]))

(defn csv->map
  "Convert parsed CSV vectors into maps with headers as keys"
  [csv-data]
  (map zipmap
       (->> (first csv-data)
            (map keyword)
            repeat)
       (rest csv-data)))

(defn update-by-keys
  "Update values in a map (m) by applying function (f) on keys"
  [m keys f]
  (reduce (fn [m k] (update m k f)) m keys))

(defn parse-int
  "Parse integer from string. May return nil."
  [str]
  (try (Integer/parseInt str)
       (catch Exception e nil)))

(defn slurp-csv
  "Read CSV data into memory"
  [filename]
  (with-open [reader (io/reader filename)]
    (->> (csv/read-csv reader)
         csv->map
         doall)))

(def flights
  (->> (slurp-csv "data/flights14.csv")
       (map #(update-by-keys % [:day :hour :month :year :arr_delay :dep_delay :distance :air_time] parse-int))))

(pprint/pprint (take 6 flights))

Filter

In R, the filtering condition can be directly written in the bracket.

ans <- flights[origin == "JFK" & month == 6L]
head(ans)

In Clojure, the filtering condition is expressed using regular functions in prefix notation, and composed using thread macro.

(->> flights
     (filter #(and (= (:origin %) "JFK") (= (:month %) 6)))
     (take 6)
     pprint/print-table)

Slice

In R, the rows in a data.table can be accessed directly using range of indices.

ans <- flights[1:2]
ans

In Clojure, the transformation functions usually returns sequence, which only does not support random access. For the purpose of slicing, the drop and take function can be used together for consecutive range.

(->> flights
     (drop 5)
     (take 6)
     pprint/print-table)

If direct access by indices is required, the sequence can be converted into a vector, which can then be sliced by subvec.

(-> flights
    vec
    (subvec 5 11)
    pprint/print-table)

Sort

In R, rows in data.table is sorted by the order function, which supports multi-field sorting, in both ascending and descending order.

ans <- flights[order(-distance, air_time)]
head(ans)

In Clojure, the sort-by function sorts a sequence by the result of a function applied on each item, which can sort the sequence of maps with the fact that a keyword is also a function.

(->> flights
     (sort-by (juxt (comp - :distance) :air_time))
     (take 6)
     pprint/print-table)

Multi-field sorting in Clojure involves a customized comparator function for non-numeric fields in descending order.

(->> flights
     (sort-by (juxt :origin :dest)
              (fn [[origin-a dest-a] [origin-b dest-b]]
                (compare [origin-a dest-b] [origin-b dest-a])))
     (take 6)
     pprint/print-table)

Select

In R, specific columns in a data.table can be selected and renamed.

ans <- flights[, arr_delay]
head(ans)

ans <- flights[, .(arr_delay, dep_delay)]
head(ans)

ans <- flights[, .(delay_arr = arr_delay, delay_dep = dep_delay)]
head(ans)

In Clojure, the select-keys and rename-keys functions are used for selection and renaming of keys in maps.

(->> flights
     (map :arr_delay)
     (take 6)
     pprint/pprint)

(->> flights
     (map #(select-keys % [:arr_delay :dep_delay]))
     (take 6)
     pprint/print-table)

(->> flights
     (map #(select-keys % [:arr_delay :dep_delay]))
     (map #(set/rename-keys % {:arr_delay :delay_arr, :dep_delay :delay_dep}))
     (take 6)
     pprint/print-table)

Reduce

In R, computations can be carried out by considering each column of a data.table as a list.

ans <- flights[, sum( (arr_delay + dep_delay) < 0 )]
ans

In Clojure, the sequence of maps can be transformed by standard library functions such as reduce, count, etc.

(->> flights
     (filter #(< (+ (:arr_delay %) (:dep_delay %)) 0))
     count
     pprint/pprint)

Group

In R, the by expression is used to aggregate the rows in a data.table

ans <- flights[, .N, by = origin]
ans

In Clojure, the group-by function is used to aggregate the maps.

(->> flights
     (group-by :origin)
     (map (fn [[k v]] [k (count v)]))
     pprint/pprint)

Update

In R, due to the reference semantics, the := operator allows updating columns in-place, which means the data.table is modified as the expression is evaluated. The copy operator can be used to avoid the mutation with a copy of the data.table.

flights[, `:=`(speed = distance / (air_time/60),
               delay = arr_delay + dep_delay)]
head(flights)

flights[hour == 24L, hour := 0L]
flights[, sort(unique(hour))]

flights[, c("delay") := NULL]
head(flights)

In Clojure, due to the persistent data structures, the standard library functions are not mutating the underlying data structures directly. Instead, from the user's perspective, new copy of the data is returned while the old data remains unchanged. In the implementation level, structural sharing is used to minimize the memory consumption of persistent data structures.

(->> flights
     (map #(assoc % :speed (/ (:distance %) (/ (:air_time %) 60))
                  :delay (+ (:arr_delay %) (:dep_delay %))))
     (take 6)
     pprint/print-table)

(->> flights
     (map #(dissoc % :distance))
     (map (fn [m] (update m :hour #(if (= % 24) 0 %))))
     (map :hour)
     set
     sort
     pprint/pprint)

(->> flights
     (map #(dissoc % :distance))
     (take 6)
     pprint/print-table)

Join

For the join operations, the following datasets are used:

Employees:

Employee EmployeeName Department Salary
1 Alice 11 800
2 Bob 11 600
3 Carla 12 900
4 Daniel 12 1000
5 Evelyn 13 800
6 Ferdinand 21 700

Departments:

Department DepartmentName Manager
11 Production 1
12 Sales 4
13 Marketing 5
14 Research NA

In R, the merge function provides SQL-like join operations, including inner-join, left-join, right-join, full-join, outer-join, and rolling-join.

setkey(Employees, Department)
setkey(Departments, Department)

# inner join
Result <- Employees[Departments, nomatch=0]

# left join
Result <- merge(Employees, Departments, all.x=TRUE)

# right join
Result <- Employees[Departments]

# full join
Result <- merge(Employees, Departments, all=TRUE)

# outer join
Result <- merge(Employees, Departments, all=TRUE)
Result <- Result[is.na(EmployeeName) | is.na(DepartmentName)]

In Clojure, the join function in the clojure.set library only provides natural join. However, the implementation of other join operations can be done by composing utility functions.

(def employees
  (->> (slurp-csv "data/Employees.csv")
       (map #(update-by-keys % [:Department :Employee :Salary] parse-int))))

(def departments
  (->> (slurp-csv "data/Departments.csv")
       (map #(update-by-keys % [:Department :Manager] parse-int))))

(pprint/print-table (set/join employees departments))

(defn inner-join
  [left-keys right-keys]
  (set/intersection (set left-keys) (set right-keys)))

(defn outer-join
  [left-keys right-keys]
  (set/union (set/difference (set left-keys) (set right-keys)) (set/difference (set right-keys) (set left-keys))))

(defn full-join
  [left-keys right-keys]
  (set/union (set left-keys) (set right-keys)))

(defn left-join
  [left-keys right-keys]
  left-keys)

(defn right-join
  [left-keys right-keys]
  right-keys)

(defn map-combine [group1 group2 dummy-map]
  (for [map1 group1
        map2 group2]
    (merge-with #(or %2 %1) dummy-map map2 map1)))

(defn joiner [left-coll right-coll left-fn right-fn join-type]
  (let [left-idx (group-by left-fn left-coll)
        right-idx (group-by right-fn right-coll)
        join-keys (set (join-type (keys left-idx) (keys right-idx)))]
    (apply concat
           (map #(map-combine (get left-idx  % [{}])
                              (get right-idx % [{}])
                              (zipmap (set (set/union (keys (first left-coll)) (keys (first right-coll)))) (repeat nil))) join-keys))))

(pprint/print-table (joiner employees departments :Department :Department inner-join))
(pprint/print-table (joiner employees departments :Department :Department left-join))
(pprint/print-table (joiner employees departments :Department :Department right-join))
(pprint/print-table (joiner employees departments :Department :Department outer-join))
(pprint/print-table (joiner employees departments :Department :Department full-join))

The rolling-join operation is not investigated yet in this article.

Scalability

The data.table in R loads the entire dataset into memory, which makes it not applicable for dataset larger than RAM. The lazy sequence abstraction in Clojure allows memory-efficient data manipulation while having minimal impact on the expression of algorithms. Together with the concurrency primitives provided by Clojure, data processing on a single machine should be efficient in Clojure.

(with-open [reader (io/reader "data/flights14.csv")]
  (->> (csv/read-csv reader)
       csv->map
       (map #(update-by-keys % [:day :hour :month :year :arr_delay :dep_delay :distance :air_time] parse-int))
       (take 6)
       pprint/print-table))

However, when the algorithm is not suitable for operation in the row level, or the intermediate data grows out of RAM, the workload needs to be distributed in a cluster of computation units. In that field, while the Python ecosystem has Dask, there is no such Clojure library available yet. Onyx, which is a distributed streaming computation platform in Clojure, may provide some infrastructure for such purpose, but the details need further investigation.

Conclusion

As illustrated by the above use cases, the data.table package in R provides a convenient and expressive DSL for data analysis; while Clojure, as a general purpose language, provides the basic building blocks which is flexible and can be adapted to specific needs. Besides, in the language level, the concurrency primitives, the JVM platform, and the namespace system in Clojure provides a more robust foundation for system programming. Therefore, it is suggested that R can be used for exploratory data analysis, for its expressive DSL and its rich ecosystem; while Clojure can be used to develop data-based applications in the later stage. The emergence of polyglot platform such as GraalVM actually encourages such hybrid approach of development, which can be investigated further.

References

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment