patrickpang/dt-clj.md

## dt-clj.md

      
    Raw
  

              dt-clj.md
            
          
    Data.table and Clojure

Background

This article compares the approaches of data manipulation using data.table in R and standard functions in Clojure, in order to evaluate the necessity of a Clojure library providing similar functionality of data.table. In the conceptual level, data.table provides a DSL in R specifically for data analysis, with a unified syntax similar to SQL for selecting, grouping, updating and joining tabular data. In contrast, the standard libraries of Clojure provide basic building blocks for general purpose, including persistent data structures (e.g. sequence, vector, set, map) and generic transformation functions (e.g. map, filter, reduce). This article aims to illustrate the impact of the two approaches on data manipulation using common use cases.
Load Dataset

The dataset used in this article is the NYC-flights14 data, which is On-Time flights data from the Bureau of Transporation Statistics for all the flights that departed from New York City airports in 2014.
In R, data.table library provides the fread function, which reads CSV file on disk, parses the data into a data.table with integers automatically casted, and stores the whole data.table in memory.
flights <- fread("https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv")
head(flights)
In Clojure, data.csv library provides the read-csv function, which focuses on parsing CSV data into lazy sequence of vectors. Therefore, we compose it with file reader function, and helper functions to transform the data. The composition is mainly done by thread macros for readability. For the illustration purpose, we load the entire CSV file into memory, which can be avoided as explained later.
(ns skirmish
  (:require
   [clojure.data.csv :as csv]
   [clojure.java.io :as io]
   [clojure.set :as set]
   [clojure.pprint :as pprint]))

(defn csv->map
  "Convert parsed CSV vectors into maps with headers as keys"
  [csv-data]
  (map zipmap
       (->> (first csv-data)
            (map keyword)
            repeat)
       (rest csv-data)))

(defn update-by-keys
  "Update values in a map (m) by applying function (f) on keys"
  [m keys f]
  (reduce (fn [m k] (update m k f)) m keys))

(defn parse-int
  "Parse integer from string. May return nil."
  [str]
  (try (Integer/parseInt str)
       (catch Exception e nil)))

(defn slurp-csv
  "Read CSV data into memory"
  [filename]
  (with-open [reader (io/reader filename)]
    (->> (csv/read-csv reader)
         csv->map
         doall)))

(def flights
  (->> (slurp-csv "data/flights14.csv")
       (map #(update-by-keys % [:day :hour :month :year :arr_delay :dep_delay :distance :air_time] parse-int))))

(pprint/pprint (take 6 flights))
Filter

In R, the filtering condition can be directly written in the bracket.
ans <- flights[origin == "JFK" & month == 6L]
head(ans)
In Clojure, the filtering condition is expressed using regular functions in prefix notation, and composed using thread macro.
(->> flights
     (filter #(and (= (:origin %) "JFK") (= (:month %) 6)))
     (take 6)
     pprint/print-table)
Slice

In R, the rows in a data.table can be accessed directly using range of indices.
ans <- flights[1:2]
ans
In Clojure, the transformation functions usually returns sequence, which only does not support random access. For the purpose of slicing, the drop and take function can be used together for consecutive range.
(->> flights
     (drop 5)
     (take 6)
     pprint/print-table)
If direct access by indices is required, the sequence can be converted into a vector, which can then be sliced by subvec.
(-> flights
    vec
    (subvec 5 11)
    pprint/print-table)
Sort

In R, rows in data.table is sorted by the order function, which supports multi-field sorting, in both ascending and descending order.
ans <- flights[order(-distance, air_time)]
head(ans)
In Clojure, the sort-by function sorts a sequence by the result of a function applied on each item, which can sort the sequence of maps with the fact that a keyword is also a function.
(->> flights
     (sort-by (juxt (comp - :distance) :air_time))
     (take 6)
     pprint/print-table)
Multi-field sorting in Clojure involves a customized comparator function for non-numeric fields in descending order.
(->> flights
     (sort-by (juxt :origin :dest)
              (fn [[origin-a dest-a] [origin-b dest-b]]
                (compare [origin-a dest-b] [origin-b dest-a])))
     (take 6)
     pprint/print-table)
Select

In R, specific columns in a data.table can be selected and renamed.
ans <- flights[, arr_delay]
head(ans)

ans <- flights[, .(arr_delay, dep_delay)]
head(ans)

ans <- flights[, .(delay_arr = arr_delay, delay_dep = dep_delay)]
head(ans)
In Clojure, the select-keys and rename-keys functions are used for selection and renaming of keys in maps.
(->> flights
     (map :arr_delay)
     (take 6)
     pprint/pprint)

(->> flights
     (map #(select-keys % [:arr_delay :dep_delay]))
     (take 6)
     pprint/print-table)

(->> flights
     (map #(select-keys % [:arr_delay :dep_delay]))
     (map #(set/rename-keys % {:arr_delay :delay_arr, :dep_delay :delay_dep}))
     (take 6)
     pprint/print-table)
Reduce

In R, computations can be carried out by considering each column of a data.table as a list.
ans <- flights[, sum( (arr_delay + dep_delay) < 0 )]
ans
In Clojure, the sequence of maps can be transformed by standard library functions such as reduce, count, etc.
(->> flights
     (filter #(< (+ (:arr_delay %) (:dep_delay %)) 0))
     count
     pprint/pprint)
Group

In R, the by expression is used to aggregate the rows in a data.table
ans <- flights[, .N, by = origin]
ans
In Clojure, the group-by function is used to aggregate the maps.
(->> flights
     (group-by :origin)
     (map (fn [[k v]] [k (count v)]))
     pprint/pprint)
Update

In R, due to the reference semantics, the := operator allows updating columns in-place, which means the data.table is modified as the expression is evaluated. The copy operator can be used to avoid the mutation with a copy of the data.table.
flights[, `:=`(speed = distance / (air_time/60),
               delay = arr_delay + dep_delay)]
head(flights)

flights[hour == 24L, hour := 0L]
flights[, sort(unique(hour))]

flights[, c("delay") := NULL]
head(flights)
In Clojure, due to the persistent data structures, the standard library functions are not mutating the underlying data structures directly. Instead, from the user's perspective, new copy of the data is returned while the old data remains unchanged. In the implementation level, structural sharing is used to minimize the memory consumption of persistent data structures.
(->> flights
     (map #(assoc % :speed (/ (:distance %) (/ (:air_time %) 60))
                  :delay (+ (:arr_delay %) (:dep_delay %))))
     (take 6)
     pprint/print-table)

(->> flights
     (map #(dissoc % :distance))
     (map (fn [m] (update m :hour #(if (= % 24) 0 %))))
     (map :hour)
     set
     sort
     pprint/pprint)

(->> flights
     (map #(dissoc % :distance))
     (take 6)
     pprint/print-table)
Join

For the join operations, the following datasets are used:
Employees:


Employee
EmployeeName
Department
Salary


1
Alice
11
800


2
Bob
11
600


3
Carla
12
900


4
Daniel
12
1000


5
Evelyn
13
800


6
Ferdinand
21
700


Departments:


Department
DepartmentName
Manager


11
Production
1


12
Sales
4


13
Marketing
5


14
Research
NA


In R, the merge function provides SQL-like join operations, including inner-join, left-join, right-join, full-join, outer-join, and rolling-join.
setkey(Employees, Department)
setkey(Departments, Department)

# inner join
Result <- Employees[Departments, nomatch=0]

# left join
Result <- merge(Employees, Departments, all.x=TRUE)

# right join
Result <- Employees[Departments]

# full join
Result <- merge(Employees, Departments, all=TRUE)

# outer join
Result <- merge(Employees, Departments, all=TRUE)
Result <- Result[is.na(EmployeeName) | is.na(DepartmentName)]
In Clojure, the join function in the clojure.set library only provides natural join. However, the implementation of other join operations can be done by composing utility functions.
(def employees
  (->> (slurp-csv "data/Employees.csv")
       (map #(update-by-keys % [:Department :Employee :Salary] parse-int))))

(def departments
  (->> (slurp-csv "data/Departments.csv")
       (map #(update-by-keys % [:Department :Manager] parse-int))))

(pprint/print-table (set/join employees departments))

(defn inner-join
  [left-keys right-keys]
  (set/intersection (set left-keys) (set right-keys)))

(defn outer-join
  [left-keys right-keys]
  (set/union (set/difference (set left-keys) (set right-keys)) (set/difference (set right-keys) (set left-keys))))

(defn full-join
  [left-keys right-keys]
  (set/union (set left-keys) (set right-keys)))

(defn left-join
  [left-keys right-keys]
  left-keys)

(defn right-join
  [left-keys right-keys]
  right-keys)

(defn map-combine [group1 group2 dummy-map]
  (for [map1 group1
        map2 group2]
    (merge-with #(or %2 %1) dummy-map map2 map1)))

(defn joiner [left-coll right-coll left-fn right-fn join-type]
  (let [left-idx (group-by left-fn left-coll)
        right-idx (group-by right-fn right-coll)
        join-keys (set (join-type (keys left-idx) (keys right-idx)))]
    (apply concat
           (map #(map-combine (get left-idx  % [{}])
                              (get right-idx % [{}])
                              (zipmap (set (set/union (keys (first left-coll)) (keys (first right-coll)))) (repeat nil))) join-keys))))

(pprint/print-table (joiner employees departments :Department :Department inner-join))
(pprint/print-table (joiner employees departments :Department :Department left-join))
(pprint/print-table (joiner employees departments :Department :Department right-join))
(pprint/print-table (joiner employees departments :Department :Department outer-join))
(pprint/print-table (joiner employees departments :Department :Department full-join))
The rolling-join operation is not investigated yet in this article.
Scalability

The data.table in R loads the entire dataset into memory, which makes it not applicable for dataset larger than RAM. The lazy sequence abstraction in Clojure allows memory-efficient data manipulation while having minimal impact on the expression of algorithms. Together with the concurrency primitives provided by Clojure, data processing on a single machine should be efficient in Clojure.
(with-open [reader (io/reader "data/flights14.csv")]
  (->> (csv/read-csv reader)
       csv->map
       (map #(update-by-keys % [:day :hour :month :year :arr_delay :dep_delay :distance :air_time] parse-int))
       (take 6)
       pprint/print-table))
However, when the algorithm is not suitable for operation in the row level, or the intermediate data grows out of RAM, the workload needs to be distributed in a cluster of computation units. In that field, while the Python ecosystem has Dask, there is no such Clojure library available yet. Onyx, which is a distributed streaming computation platform in Clojure, may provide some infrastructure for such purpose, but the details need further investigation.
Conclusion

As illustrated by the above use cases, the data.table package in R provides a convenient and expressive DSL for data analysis; while Clojure, as a general purpose language, provides the basic building blocks which is flexible and can be adapted to specific needs. Besides, in the language level, the concurrency primitives, the JVM platform, and the namespace system in Clojure provides a more robust foundation for system programming. Therefore, it is suggested that R can be used for exploratory data analysis, for its expressive DSL and its rich ecosystem; while Clojure can be used to develop data-based applications in the later stage. The emergence of polyglot platform such as GraalVM actually encourages such hybrid approach of development, which can be investigated further.
References


Introduction to data.table
Reference semantics
Clojure API documentation
JOINing data in R using data.table
Clojure - Comparators Guide
SQL-like JOIN in Clojure
Embedded Interop Between Clojure, R, and Python With GraalVM - Squid's Blog
Employee	EmployeeName	Department	Salary
1	Alice	11	800
2	Bob	11	600
3	Carla	12	900
4	Daniel	12	1000
5	Evelyn	13	800
6	Ferdinand	21	700