ijlyttle/01-proposed_changes.md

## 01-proposed_changes.md

      
### Motivation

I see our project as a set of two transformations: one that changes a ggplot2 object into a ggspec, then another that changes a ggspec into a Vega-Lite spec.
The analogy may not be exact, but I see these transformations in terms of linear algebra, where a ggplot2 object is a vector in "ggplot2" space, a ggspec is a vector in "ggspec" space, and a Vega-Lite spec is a vector in "Vega-Lite" space.
One of our goals is that "ggspec" space should be a faithful representation of "ggplot2" space. One way of doing this is to make sure that the "transformation matrix" is as close to diagonal as we can make it. As Haley likes to point out, the ggplot2 object is a "list of 9"; therefore the ggspec object will have no more than 9 elements (maybe it will not have a "theme"?).
If we wanted to (and we don't want to), we could reproduce the ggplot2 object using the ggspec.
### Elements of the spec

I have included below a proposal for a modification to ggspec - let me go through each of the elements.
#### Data

Our idea is that data will be "promoted" to the top of the spec from within the layers, have duplicates removed, and be named. The change here is that the metadata is now defined.
I propose metadata to be an object, where the names are the names in the particular dataset, and the values are objects themselves. This object would have a mandatory field, and two optional fields:

- `type` (mandatory): string, the Vega-Lite type, mapped from the R class:
  - `numeric`, `integer`: `"quantitative"`
  - `character`: `"nominal"`
  - `factor`: `"nominal"`
  - `ordered`: `"ordinal"`
  - `POSIXct`, `Date`: `"temporal"`
- `levels` (optional): array of strings with the levels of the `factor` or `ordered`
- `timezone` (optional): string, the timezone of the `POSIXct`
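The mapping above could be sketched in base R as follows. This is only an illustration, not the project's implementation: the names `col_metadata()` and `data_metadata()` are hypothetical, and stopping on unsupported classes (e.g. `logical`) is an assumption, since the proposal does not say how they should be handled.

```r
# Hypothetical helper: build the metadata entry for one column.
# "ordered" is checked before "factor" because an ordered factor
# inherits from both classes.
col_metadata <- function(x) {
  if (is.ordered(x)) {
    list(type = "ordinal", levels = levels(x))
  } else if (is.factor(x)) {
    list(type = "nominal", levels = levels(x))
  } else if (inherits(x, "POSIXct")) {
    meta <- list(type = "temporal")
    tz <- attr(x, "tzone")
    # only record a timezone if one is actually set
    if (!is.null(tz) && nzchar(tz[1])) meta$timezone <- tz[1]
    meta
  } else if (inherits(x, "Date")) {
    list(type = "temporal")
  } else if (is.numeric(x)) {
    list(type = "quantitative")
  } else if (is.character(x)) {
    list(type = "nominal")
  } else {
    # assumption: classes outside the proposed mapping are rejected
    stop("unsupported column class: ", class(x)[1])
  }
}

# Hypothetical helper: metadata object for a whole data frame,
# named by the variables in the dataset.
data_metadata <- function(df) lapply(df, col_metadata)

meta <- data_metadata(iris)
meta$Species
```

For `iris`, the four measurement columns come out as `"quantitative"`, and `Species` comes out as `"nominal"` with its three levels.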

Here is my understanding of how ggplot works with color-scales and factors: for "regular" factors, it uses a "nominal" scale; for ordered factors (has class "ordered"), it uses an ordinal scale. I think we should follow this.
Part of our goal here is to make sure that the Vega-Lite spec that we produce will be generalizable to new data. As such, I think that if the data arrives at ggplot as a factor, that means that the levels are the only possible values the variable can take, e.g. day-of-the-week.
Another use of factors is within ggplot, perhaps to order a variable according to the value of another variable. For example, consider a bar chart where we want to order cities by their population. In ggplot, we would use an "internal" factor, where the levels are determined when the plot is built. Here, the cities could change, and the population (hence ordering) could change.
In this situation, we could do a similar thing, by specifying city as a nominal variable. It remains to figure out how to "decode" something like forcats::fct_reorder() and how to denote that in the ggspec, but that can be a problem for later.
It remains to be determined how to deal with POSIXct and Date. I have some ideas, but they are not yet completely formed. The challenge is that R has a notion of timezones, while Vega-Lite (like native JavaScript) does not.
The observations would be the usual d3-format array of objects for the data-frame.
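As a rough sketch of that row-wise format, a data frame can be turned into a list with one named entry per row, using base R only. The name `observations()` is hypothetical, and converting factors to plain strings is an assumption that follows from keeping the levels in the metadata instead.

```r
# Hypothetical helper: convert a data frame into a list of rows,
# each row being a named list (an "array of objects" once serialized).
observations <- function(df) {
  # assumption: factor values are written as strings; their levels
  # are kept in the metadata, not in the observations
  df[] <- lapply(df, function(x) if (is.factor(x)) as.character(x) else x)
  lapply(seq_len(nrow(df)), function(i) as.list(df[i, , drop = FALSE]))
}

obs <- observations(head(iris, 2))
str(obs[[1]])
```

A JSON serializer would still be needed to write this list out; that step is left out here.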
The ggspec data object would contain all the datasets from the ggplot object in one place. It would reserve data-00 for the dataset in the ggplot2 data element, then data-01, data-02, ..., for data-frames specified in the layers.
In summary, the ggspec data element would be a function of the ggplot2 data element and the ggplot2 layers element.
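The promotion and naming of datasets described above might look like the sketch below. It does not touch a real ggplot2 object: `plot_data` stands in for the ggplot2 data element and `layer_data` for the data found in each layer, and both names, like `collect_data()` itself, are hypothetical. Detecting duplicates with `identical()` is also an assumption.

```r
# Hypothetical helper: gather the plot-level and layer-level data frames
# into one named list, dropping duplicates.
# Entries of `layer_data` that are not data frames (e.g. NULL) are taken
# to mean "this layer inherits the plot-level data".
collect_data <- function(plot_data, layer_data = list()) {
  frames <- Filter(is.data.frame, layer_data)
  keep <- list()
  for (df in frames) {
    # skip a layer data frame that duplicates the plot data or an earlier layer
    seen <- c(list(plot_data), keep)
    if (!any(vapply(seen, identical, logical(1), df))) {
      keep <- c(keep, list(df))
    }
  }
  # data-00 is reserved for the plot-level data
  out <- list(`data-00` = plot_data)
  for (i in seq_along(keep)) {
    out[[sprintf("data-%02d", i)]] <- keep[[i]]
  }
  out
}

datasets <- collect_data(iris, list(NULL, mtcars, iris, mtcars))
names(datasets)
```

Here the second and third layers repeat data that is already present, so only `data-00` (`iris`) and `data-01` (`mtcars`) remain.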
#### Mapping

At present, we are not concerned with the ggplot2 mapping object, as our thought is to support, initially, mappings that are defined in the layers.
#### Layers

Here, layers is an array of layer objects. I think each ggspec layer object will be a function of the ggplot layer and the data. It is a function of the data only so that it can include the name of the dataset. With apologies to Wenyu, this is a significant change from the previous proposal: we propose not to include the type here; instead, it would be provided in the metadata in the ggspec data object.
If it is OK with Wenyu, he could determine the type of the particular Vega-Lite encoding using the type from the metadata, to be overridden by the ggspec scales if need be. However, I think it would be OK, initially, just to use the value from metadata.
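The lookup described above could be sketched like this. The names `encoding_type()` and `layer_encoding()` are hypothetical, and the `override` argument is an assumption, since the proposal does not yet say how a ggspec scale would carry a type.

```r
# Hypothetical helper: Vega-Lite type for one mapped field.
# `metadata` is the metadata object of the layer's dataset; `override`
# is an optional type coming from a ggspec scale (assumed, not specified).
encoding_type <- function(field, metadata, override = NULL) {
  if (!is.null(override)) return(override)
  type <- metadata[[field]]$type
  if (is.null(type)) stop("no metadata for field: ", field)
  type
}

# Hypothetical helper: add the type to every entry of a layer mapping.
layer_encoding <- function(mapping, metadata) {
  lapply(mapping, function(m) {
    list(field = m$field, type = encoding_type(m$field, metadata))
  })
}

metadata <- list(
  Petal.Width  = list(type = "quantitative"),
  Petal.Length = list(type = "quantitative")
)
mapping <- list(
  x = list(field = "Petal.Width"),
  y = list(field = "Petal.Length")
)
enc <- layer_encoding(mapping, metadata)
enc$x
```

Initially, `override` would simply be left as `NULL`, so the type always comes from the metadata.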
It may not be necessary to change the type to match the scale. Consider this example (paste it into the Vega editor), which keeps the "ordinal" type while asking for a categorical color range:
```json
{
  "$schema": "https://vega.github.io/schema/vega-lite/v3.json",
  "data": {"url": "data/cars.json"},
  "mark": "point",
  "encoding": {
    "x": {"field": "Horsepower", "type": "quantitative"},
    "y": {"field": "Miles_per_Gallon", "type": "quantitative"},
    "color": {
      "field": "Cylinders",
      "type": "ordinal",
      "scale": {"range": "category"}
    }
  }
}
```
#### Scales

Here, scales is an array of scale objects; each ggspec scale would be a function of the ggplot2 scale. We introduce the name field to this proposal.
#### Labels

The ggspec labels is a function only of the ggplot2 labels. For Wenyu, if the scale for an aesthetic/encoding is named, we can use that name; otherwise we can look for it in labels.
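A minimal sketch of that fallback, using R lists shaped like the proposed spec. The name `resolve_title()` is hypothetical, and it assumes that each scale's `aesthetics` is available as a character vector.

```r
# Hypothetical helper: title for one aesthetic/encoding.
# Use the name of a scale that covers the aesthetic, if it has one;
# otherwise fall back to the entry in labels (NULL if there is none).
resolve_title <- function(aesthetic, scales, labels) {
  for (scale in scales) {
    if (aesthetic %in% scale$aesthetics && !is.null(scale$name)) {
      return(scale$name)
    }
  }
  labels[[aesthetic]]
}

scales <- list(
  list(name = "petal length", aesthetics = c("y", "ymin", "ymax"))
)
labels <- list(title = "Hello World", x = "Petal.Width", y = "Petal.Length")

resolve_title("y", scales, labels)  # "petal length", from the named scale
resolve_title("x", scales, labels)  # "Petal.Width", from labels
```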

  
## 02-ggplot.R
library("ggplot2")

ggplot(iris) +
  geom_point(aes(x = Petal.Width, y = Petal.Length), color = "red") +
  scale_y_log10("petal length") +
  labs(title = "Hello World")

## 03-proposed_spec.json
{
  "data": {
    "data-00": {
      "metadata": {
        "Sepal.Length": {"type": "quantitative"},
        "Sepal.Width": {"type": "quantitative"},
        "Petal.Length": {"type": "quantitative"},
        "Petal.Width": {"type": "quantitative"},
        "Species": {
          "type": "nominal",
          "levels": ["setosa", "versicolor", "virginica"]
        }
      },
      "observations": [
        {
          "Sepal.Length": 5.1,
          "Sepal.Width": 3.5,
          "Petal.Length": 1.4,
          "Petal.Width": 0.2,
          "Species": "setosa"
        }
      ]
    }
  },
  "layers": [
    {
      "data": "data-00",
      "geom": {"class": "GeomPoint"},
      "mapping": {
        "x": {"field": "Petal.Width"},
        "y": {"field": "Petal.Length"}
      },
      "aes_params": {
        "colour": {"value": "red"}
      }
    }
  ],
  "scales": [
    {
      "name": "petal length",
      "class": "ScaleContinuousPosition",
      "aesthetics": ["y", "ymin", "ymax", "yend", "yintercept", "ymin_final", "ymax_final", "lower", "middle", "upper"],
      "transform": {"type": "log", "base": 10}
    }
  ],
  "labels": {
    "title": "Hello World",
    "x": "Petal.Width",
    "y": "Petal.Length"
  }
}