@dkochmanski
Last active November 6, 2019 14:51

Polyclot 0.0.1 documentation

Overview

Polyclot is a tool for drawing interactive charts in CLIM. The purpose of this document is to explain how to plot data and how to create new types of charts. The tool may be used by other CLIM applications as a library, or as a standalone utility to render charts.

Developer guide

Coding conventions

  • class names are enclosed in angle brackets, e.g. <data-frame>
  • all classes should be defined with the utility define-class
  • types which are not classes are enclosed in percent characters, e.g. %index%

Utilities

(define-class <name> superclass slots &rest options)

Defines a class <name>, its constructor function <name> (a trampoline to make-instance) and a variable <name> which contains the class object. Accepts an additional option :stealth-mixin, which makes this class a superclass of the victim (the existing class named by the option).

(define-class <record-positions> (<data-frame>)
  ((ink :initform clim:+red+))
  (:stealth-mixin clim:output-record-history)
  (:documentation "OUTPUT-RECORD position scatterplot."))

The ability to mix into an existing class allows objects defined in unrelated libraries to be interpreted as, for example, a data frame.
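
The description above suggests roughly the following expansion. This is a minimal sketch and not Polyclot's actual macro: it assumes the options are plain defclass options and leaves out :stealth-mixin handling entirely. The class <point> and its readers point-x and point-y are hypothetical and exist only to exercise the macro.

```lisp
;; Sketch only: :stealth-mixin is NOT handled here (assumption, see above).
(defmacro define-class (name superclasses slots &rest options)
  `(progn
     (defclass ,name ,superclasses ,slots ,@options)
     ;; Constructor function: a trampoline to MAKE-INSTANCE.
     (defun ,name (&rest initargs)
       (apply #'make-instance ',name initargs))
     ;; Variable holding the class object.
     (defparameter ,name (find-class ',name))
     ',name))

;; Hypothetical class used only to try the macro out.
(define-class <point> ()
  ((x :initarg :x :reader point-x)
   (y :initarg :y :reader point-y))
  (:documentation "A 2D point."))

(point-x (<point> :x 3 :y 4)) ; => 3
```

The same symbol names the class, the constructor function and the variable, which works because Common Lisp keeps these in separate namespaces.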

Data frames

A data frame represents a set of data. Existing data is immutable, but it is possible to filter rows and columns with the function sel and to add rows and columns with the functions add-rows! and add-cols!.

Column names and row names are immutable strings which must be unique across the data frame. Data should be accessed with mapping functions map-data-frame, map-data-frame-rows and map-data-frame-cols and with a function ref which allows selecting a single element.

Components

<aesthetic> - mapping of a dataframe

<stat> - statistical transformation

<geom> - geometric object (design)

<mods> - collision modifiers (positional adjustment)

<scale> - mapping from data to aesthetic attributes

<coord> - mapping of the object’s position to the plot’s area

Standalone utility

Embedding in a CLIM application

As a frame

As a pane

As an output record

Extending Polyclot

Reference manual

Data Frames

Types

<data-frame>

A protocol class. Every class implementing this protocol must have it as a superclass.

<raw-data-frame>

<sel-data-frame>

%index%

Either an integer or a string. If it is an integer it must be a valid index of the row or the column starting from 0. If it is a string it must be an existing row or column name.
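
One way to express this type and to resolve an index against a vector of names is sketched below. Only the name %index% comes from this document; the helper resolve-index is hypothetical, and returning NIL for an unknown index is an assumption made to keep the sketch short (the protocol itself signals <invalid-index>).

```lisp
;; An index is a non-negative integer or a string.
(deftype %index% () '(or (integer 0) string))

;; Hypothetical helper: return the zero-based position denoted by INDEX
;; in the vector NAMES, or NIL when INDEX does not denote an element.
(defun resolve-index (index names)
  (etypecase index
    (string (position index names :test #'string=))
    ((integer 0) (and (< index (length names)) index))))

(resolve-index "Price" #("Price" "Max speed")) ; => 0
(resolve-index 1 #("Price" "Max speed"))       ; => 1
(resolve-index 7 #("Price" "Max speed"))       ; => NIL
```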

<invalid-slice> (error)

<invalid-index> (error)

<row-does-not-exist> (<invalid-index>)

<col-does-not-exist> (<invalid-index>)

<insert-error> (error)

<col-name-not-unique> (<insert-error>)

<row-name-not-unique> (<insert-error>)

<row-length-mismatch> (<insert-error>)
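
The hierarchy listed above could be written with define-condition as in the sketch below. The condition names and their parents are taken from the list; the slots spec and index and their readers are assumptions added for illustration.

```lisp
(define-condition <invalid-slice> (error)
  ((spec :initarg :spec :initform nil :reader invalid-slice-spec)))    ; hypothetical slot

(define-condition <invalid-index> (error)
  ((index :initarg :index :initform nil :reader invalid-index-index))) ; hypothetical slot

(define-condition <row-does-not-exist> (<invalid-index>) ())
(define-condition <col-does-not-exist> (<invalid-index>) ())

(define-condition <insert-error> (error) ())
(define-condition <col-name-not-unique> (<insert-error>) ())
(define-condition <row-name-not-unique> (<insert-error>) ())
(define-condition <row-length-mismatch> (<insert-error>) ())

;; A handler for the parent type also catches the more specific condition.
(handler-case (error '<row-does-not-exist> :index "Trabant")
  (<invalid-index> (c)
    (format nil "no such index: ~s" (invalid-index-index c))))
```

With this layout a caller can handle <invalid-index> without caring whether the row or the column was at fault.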

Accessors

dims <data-frame>

Returns the data frame's dimensions as two values: the number of rows and the number of columns.

cols <data-frame>

Returns the data frame's column names. The result type is (vector string).

rows <data-frame>

Returns the data frame's row names. The result type is (vector string).

ref <data-frame> row col

Selects a single element indexed by row and col. Row may be an opaque object taken from map-data-frame-rows, in which case the column is looked up in it. If either row or column is not part of the data frame, the consequences are undefined.

When row is an index, the function returns five values: value, column name, column index, row name and row index. Otherwise it returns three values: value, column name and column index.

(ref data-frame "Audi" "Max Speed")
(ref data-frame 42     "Max Speed")
(ref data-frame "Fiat" 42)

The function signals an <invalid-index> error for invalid indexes.

sel <data-frame> rows cols

Returns a <data-frame> which contains a slice of the original <data-frame> defined by rows and cols. Slice specifier:

T - select all rows/cols

(cons index index) - select elements between indexes

(cons (eql t) index) - select all elements up to the index

(cons index (eql t)) - select all elements starting from the index

(list s1 e1 s2 e2 …) - select union of slices (s1 . e1) (s2 . e2) …

(vector index) - select elements with specified indexes

Function returns a data frame which is a “window” to the original data frame. To have a flattened data frame use copy-data-frame on it.

Function signals an error <invalid-slice> for invalid slice specifiers.
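
To make the slice specifiers concrete, the sketch below expands a specifier into a list of zero-based positions within a vector of names. It is an illustration and not Polyclot's implementation. The function name slice-indices is hypothetical, and the sketch rests on assumptions the specification does not state: both ends of a range are inclusive, the result is sorted with duplicates removed, and a plain error stands in for <invalid-slice> and <invalid-index>.

```lisp
;; Hypothetical helper; inclusive bounds and sorted output are assumptions.
(defun slice-indices (spec names)
  "Expand slice specifier SPEC into a sorted list of positions in NAMES."
  (let ((n (length names)))
    (labels ((pos (index)
               ;; Resolve a string or integer index to a position.
               (or (typecase index
                     (string (position index names :test #'string=))
                     (integer (and (< -1 index n) index)))
                   (error "Invalid index: ~s" index)))
             (range (start end)
               ;; T means "open" on that side of the range.
               (loop for i from (if (eq start t) 0 (pos start))
                       to (if (eq end t) (1- n) (pos end))
                     collect i)))
      (sort (remove-duplicates
             (cond ((eq spec t) (range t t))
                   ((and (vectorp spec) (not (stringp spec)))
                    (map 'list #'pos spec))
                   ;; A dotted pair such as (10 . 20) or ("Fiat" . T).
                   ((and (consp spec) (cdr spec) (atom (cdr spec)))
                    (range (car spec) (cdr spec)))
                   ;; A proper list of start/end pairs: union of ranges.
                   ((consp spec)
                    (loop for (start end) on spec by #'cddr
                          append (range start end)))
                   (t (error "Invalid slice specifier: ~s" spec))))
            #'<))))

(slice-indices (cons "d" t) #("a" "b" "c" "d" "e"))       ; => (3 4)
(slice-indices (list 0 1 "d" t) #("a" "b" "c" "d" "e"))   ; => (0 1 3 4)
```

A dotted pair and a two-element list are distinguished by their cdr: (cons 10 20) has an atom there, while (list 10 20) has another cons.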

(let ((data-frame-1 (sel df (cons 10 20) #("Price" "Max speed")))
      (data-frame-2 (sel df (cons "Fiat" t) t))
      (data-frame-3 (sel df (list 10 20 "Fiat" t) #("Price")))
      (data-frame-4 (sel df t #(1 4 8))))
  #|do something|#)

Modifying the original data frame by means of add-rows! and add-cols! will change the content of the selected data frame if the corresponding row or column slice specifier is open-ended toward the end, that is T or (cons index (eql t)).

If add-rows! or add-cols! is invoked on a data frame that is a “window”, the function is invoked on the original data frame.

Mapping

map-data-frame <data-frame> function

Maps function over the data frame. Function should accept six arguments: row name, row index, data row, col name, col index and value.

(map-data-frame df (lambda (rname rind row cname cind val)
                     (declare (ignore rname row cname))
                     (format t "[~s,~s] ~a~%" rind cind val)))

map-data-frame-rows <data-frame> function

Maps function over the data frame's rows. Function should accept three arguments: row name, row index and data row (opaque object).

map-data-frame-cols data-row function

Maps function over the row columns. Function should accept three arguments: column name, column index and value.
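
The three mapping functions nest naturally: mapping over the whole frame amounts to mapping over rows and, for each row, over its columns. The sketch below shows this with toy stand-ins. The toy-* functions, the variable *toy-cells* and the list-based representation are assumptions made for illustration and are not Polyclot's implementation.

```lisp
;; Toy representation (assumption): a frame is (cols . rows), where COLS is a
;; vector of column names and ROWS is a list of (row-name . values-vector).
(defun toy-map-rows (frame function)
  (destructuring-bind (cols . rows) frame
    (loop for (name . values) in rows
          for index from 0
          ;; The "opaque" row carries the column names along with the values.
          do (funcall function name index (cons cols values)))))

(defun toy-map-cols (row function)
  (destructuring-bind (cols . values) row
    (loop for name across cols
          for value across values
          for index from 0
          do (funcall function name index value))))

;; The six-argument mapper is the composition of the two above.
(defun toy-map-frame (frame function)
  (toy-map-rows frame
                (lambda (rname rind row)
                  (toy-map-cols row
                                (lambda (cname cind value)
                                  (funcall function
                                           rname rind row cname cind value))))))

(defparameter *toy-cells* '())

(toy-map-frame (cons #("Max" "Min")
                     (list (cons "Audi" #(250 90))
                           (cons "Fiat" #(160 60))))
               (lambda (rname rind row cname cind value)
                 (declare (ignore rind row cind))
                 (push (list rname cname value) *toy-cells*)))

(reverse *toy-cells*)
;; => (("Audi" "Max" 250) ("Audi" "Min" 90) ("Fiat" "Max" 160) ("Fiat" "Min" 60))
```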

Destructive operators

add-rows! <data-frame> &rest name-row-pairs

Adds new data rows, given as alternating row names and row values. The function modifies the data frame and returns the modified object. To avoid modifying the original data frame, invoke the function on its copy.

(setq df (add-rows! (copy-data-frame df)
                    "Honda" '(42 15 22 :xxx "low")
                    "Audi"  '(10 12 44 :yyy "high")))

add-cols! <data-frame> &rest name-fun-pairs

Data frames are based on rows. A column is added by specifying a function which accepts the row name, row index and row data. FUN should return the column value for a row. The function modifies the data frame and returns the modified object.

(setq df (add-cols! df
                    "AVG" (lambda (row-name row-index row)
                            (declare (ignore row-name row))
                            (/ (+ (ref df row-index "Max")
                                  (ref df row-index "Min"))
                               2))
                    "TYP" (lambda (row-name row-index row)
                            (declare (ignore row-name row))
                            (if (> (ref df row-index "Seats") 3)
                                :comfort
                                :ergonomy))))

Constructors

make-data-frame cols &rest rows

Creates a data frame. Cols is a sequence of column names and rows are conses where car is the row name and cdr is a sequence of column values. Length of values must be the same as length of column names sequence.

(make-data-frame '(       "col1" "col2" "col3")
                 '("row1" value1 value2 value3)
                 '("row2" value1 value2 value3))

It is a thin wrapper to create a <data-frame> (exact class is not specified but it implements all necessary protocols).
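
A toy constructor following the same calling convention is sketched below, together with matching versions of dims and ref. Everything prefixed with toy- (and the variable *toy-cars*) is hypothetical, the vector-of-vectors storage is an assumption, and a plain error stands in for <row-length-mismatch> and <invalid-index>.

```lisp
(defstruct (toy-frame (:constructor %make-toy-frame))
  cols   ; vector of column names
  rows   ; vector of row names
  data)  ; vector of value vectors, one per row

(defun make-toy-frame (cols &rest rows)
  (let ((ncols (length cols)))
    ;; Every row must supply exactly one value per column.
    (dolist (row rows)
      (unless (= (length (cdr row)) ncols)
        (error "Row ~s has ~d values, expected ~d."
               (car row) (length (cdr row)) ncols)))
    (%make-toy-frame
     :cols (coerce cols 'vector)
     :rows (map 'vector #'car rows)
     :data (map 'vector (lambda (row) (coerce (cdr row) 'vector)) rows))))

(defun toy-dims (frame)
  (values (length (toy-frame-rows frame))
          (length (toy-frame-cols frame))))

(defun toy-ref (frame row col)
  (flet ((pos (index names)
           (or (typecase index
                 (string (position index names :test #'string=))
                 (integer (and (< -1 index (length names)) index)))
               (error "Invalid index: ~s" index))))
    (let ((r (pos row (toy-frame-rows frame)))
          (c (pos col (toy-frame-cols frame))))
      ;; Five values: value, column name, column index, row name, row index.
      (values (aref (aref (toy-frame-data frame) r) c)
              (aref (toy-frame-cols frame) c) c
              (aref (toy-frame-rows frame) r) r))))

(defparameter *toy-cars*
  (make-toy-frame '("Max" "Min")
                  '("Audi" 250 90)
                  '("Fiat" 160 60)))

(toy-dims *toy-cars*)             ; => 2, 2
(toy-ref *toy-cars* "Fiat" "Max") ; => 160, "Max", 0, "Fiat", 1
```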

copy-data-frame <data-frame>

Creates a new data frame with copied data (allocates new rows to store names and data).

(let ((new-df (copy-data-frame df)))
  (setq new-df (add-rows! new-df "Foo" '(1 2)))
  ;; add-rows! called on new-df doesn't modify df.
  (ref df "Bar" 0))

join-data-frame <data-frame> <data-frame> &rest args

This function is included for completeness but is left unspecified.

@sirherrbatka

First off, I know that this is supposed to be a protocol for data frames with row storage, so if some of these points are considered not applicable, I understand.

  1. Some may expect a quick and simple way not only to add a single column but also to join whole tables column-wise and row-wise.
  2. For efficiency's sake, add-col should perhaps permit adding multiple columns in one go. This can make a difference for very large frames.
  3. Ditto for add-row (?).
  4. row-name is mentioned in the map-* functions. There is a way to add a row with a name, but I can't see a way to obtain the name of the row at a given index. I recommend reconsidering the row name feature altogether, because it seems that the same can be achieved by adding an extra column with the name, if needed.
  5. Likewise, it would be useful to be able to translate a column name to a number, and a number to a column name.
  6. The ability to ask for the number of columns/rows in the frame appears to be missing.
  7. I urge you to add a specification of the signaled conditions for some of the functions as well.
  8. Why does copy-data-frame exist? Is there a use case for it? It appears to me that the whole protocol assumes that frames are in fact immutable, and therefore there shouldn't be a need for copying.

@dkochmanski

dkochmanski commented Nov 5, 2019

  • 1. join-frames: I'll think about it, thanks; frame join falls outside the scope of the protocol for now
  • 2. 3. add-rows/cols: done
  • 4. 5. row/col names are useful for selecting data without looking into the actual data with map. I've modified the mapping function continuation signatures to pass both name and index, and ref to return five values (value, names and indexes). I will also consider a function which returns rows and cols (in fact the prototype implementation has such functions)
  • 6. given we have the functions rows and cols (and that they return vectors), it will be (length (cols dataframe)) etc.
  • 7. I will, e.g. invalid index will be a common error which should be signalled
  • 8. for efficiency reasons add-rows and add-cols may modify the original data-frame (and if it is a data-frame which defines a slice of another data frame, they both may be modified, see the updated doc); copy-data-frame exists to keep the old data-frame. Instead of "always copying" we save some time. Also, copy-data-frame called on a slice will make a "raw" data frame based on it.

@dkochmanski

dkochmanski commented Nov 5, 2019

Note that a slice may be implemented by specializing the mapping and reference functions, without allocating a separate data frame with copied data.

@sirherrbatka

Yeah, I am aware of this interface-passing style. The only thing that worries me at this point is the mixing of destructive and pure operations. Maybe you could consider adding add-rows! and add-cols! as complementary destructive operations, while add-rows and add-cols would not be able to mutate?

PS
Length is sufficient if you assume that a data frame is actually always a vector. However, the protocol here can be extended to any other data structure used for representation. In fact, I think that the only thing missing is the ability to get the frame dimensions. It would be a shame to leave it this way.

@skempf

skempf commented Nov 6, 2019

I'm not as familiar with row based data frames, although it looks like there are a lot of similarities with the type of data that I work with. Given that caveat, here are some thoughts:

  1. Regarding add-cols!, perhaps one might add a column that is not a function of the other columns. Is this what the join-data-frame is meant for? (also in your example you are missing the (/ ) for the "AVG" column).

  2. For both the add-X!, would it be better to require an incoming list rather than using &rest? Is this just a better style?

  3. Most of the data that I work with is tabular, but the rows are not named, except implicitly by the first column, or just the row index itself. It would be nice to have a constructor that worked with that type of data. Perhaps something simple like: (make-data-frame-from-array cols 2darray). I would think this would be a frequent use-case for non-statisticians.

  4. In your specification of map-data-frame, do you mean <data-frame> to be the return type? So the spec would be (map-data-frame <data-frame -- as in type> function <data-frame -- as in the input data>)?

  5. Regarding the map-data-frame-rows, normally you would have a column selector and map over rows. Same comment for map-data-frame-cols, except a row selector.

@dkochmanski

(defgeneric map-data-frame (<data-frame> %row-slice% %col-slice% function)
  (:method ((df <raw-data-frame>) row-slice col-slice fun)
    (map-data-frame-rows
     df row-slice
     (lambda (row-index row)
       (map-data-frame-cols
        df row col-slice
        (lambda (col-index col-name value)
          (funcall fun row-index row col-index col-name value)))))))
