@AjayRamanathan
Last active February 16, 2017 15:48
Grammar of Graphics Gsoc Proposal

#Layered Grammar of Graphics

###Short description:

The aim of this project is to implement a layered Grammar of Graphics in Haskell, using Diagrams as the backend. The core idea is to start with the raw data and think about all the transformations, statistics, etc. that go into graphing it. With a good framework, this can help us see connections between different graphs and create new ones. You’ll realize that a pie chart is basically just a stacked bar chart plotted in polar coordinates, with bar height mapped to pie-slice angle… and that can get you thinking.

The GoG project has had several purposes. One, of course, is to develop statistical graphics systems that are exceptionally powerful and flexible. Another is to understand the steps we all use when we generate charts and graphs. This understanding leads to a formalization of the problem that helps to integrate the miscellaneous set of techniques that have comprised the field of statistical graphics over several centuries.

#Project

###Idea

The Grammar of Graphics, or GoG, denotes a system with seven orthogonal components. By orthogonal, I mean that there are seven graphical component sets whose elements are aspects of the general system, and that every combination of aspects in the product of all these sets is meaningful. A consequence of the orthogonality of such a graphic system is a high degree of expressiveness: it can produce a huge variety of graphical forms (chart types).

The dataflow is a chain that describes the sequence of mappings needed to produce a statistical graphic from a set of data. The first component, Variables, maps data to an object called a dataset (a set of variables). The next three components, Algebra, Scales and Statistics, are transformations on datasets. The next component, Geometry, maps a dataset to a graph, and Coordinates then embeds the graph in a coordinate space. The last component, Aesthetics, maps a graph to a visible or perceivable display called a graphic.

[Figure: Data Flow]
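The chain of seven components can be read as plain function composition. The sketch below is only an illustration of that reading: every type and function name in it (Dataset, Graph, Graphic, renderPipeline and the stage functions) is a placeholder invented for this example, not part of Diagrams or of the planned API.

```haskell
-- All names here are hypothetical placeholders for illustration.
data Dataset = Dataset { xs :: [Double], ys :: [Double] } deriving (Eq, Show)
newtype Graph   = Graph [(Double, Double)] deriving (Eq, Show)
newtype Graphic = Graphic [String] deriving (Eq, Show)

-- Variables: raw rows become a dataset of named variables.
variables :: [(Double, Double)] -> Dataset
variables rows = Dataset (map fst rows) (map snd rows)

-- Algebra: a real implementation would cross/nest/blend variables here.
algebra :: Dataset -> Dataset
algebra = id

-- Scales: here, a base-10 log scale on y.
scales :: Dataset -> Dataset
scales d = d { ys = map (logBase 10) (ys d) }

-- Statistics: the identity statistic.
statistics :: Dataset -> Dataset
statistics = id

-- Geometry: a dataset becomes a graph (a set of points).
geometry :: Dataset -> Graph
geometry d = Graph (zip (xs d) (ys d))

-- Coordinates: embed the graph in a flipped Cartesian space.
coordinates :: Graph -> Graph
coordinates (Graph ps) = Graph [ (y, x) | (x, y) <- ps ]

-- Aesthetics: turn the graph into something displayable (text, in this sketch).
aesthetics :: Graph -> Graphic
aesthetics (Graph ps) = Graphic [ "point at " ++ show p | p <- ps ]

renderPipeline :: [(Double, Double)] -> Graphic
renderPipeline =
  aesthetics . coordinates . geometry . statistics . scales . algebra . variables
```

Swapping a single stage (for example, replacing coordinates with a polar embedding) changes the chart type without touching the other stages, which is the orthogonality described above.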

###Contents

  • Data Layer
  • Co-ordinate Layer
  • Theme Layer
  • Faceting Layer
  • Geom Layer
    • Statistics
    • Geoms
    • AesScales
    • Legends

###Data Layer

The Data Layer will consist of the basic dataset to be operated on. Attributes in the Data Layer can be combined using the algebraic functions cross (*), nest (/) and blend (+). The domain of each attribute in the Data Layer will be calculated and later used for the aesthetic scales.

    -- An attribute pairs an accessor function with a display name.
    newtype Attribute' s a b = MkAttribute ((a -> b), s)
    type Attribute a b = Attribute' String a b

    dataset :: [(Int, Int, Int, Int, String)]
    dataset = [ (1,5,4,2,"A")
              , (9,0,4,2,"A")
              , (7,2,3,2,"B")
              , (6,1,5,2,"B")
              -- ...
              -- ...
              , (0,6,4,0,"C")
              ]

    attr1, attr2, attr3 :: Attribute (Int, Int, Int, Int, String) Int
    attr1 = MkAttribute ((\ (i,_,_,_,_) -> i), "Name Attribute one")
    attr2 = MkAttribute ((\ (_,i,_,_,_) -> i), "Name Attribute two")
    attr3 = MkAttribute ((\ (_,_,i,_,_) -> i), "Name Attribute three")

or

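    -- Each column is a name paired with its values.
    type DataString = (String, [String])
    type DataInt    = (String, [Double])

    -- The dataset is then a tuple of such columns:
    -- dataset :: (DataString, DataInt, ...)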
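A minimal sketch of the three algebra operators and the domain calculation, under a simplified reading: cross pairs two columns case by case, blend appends one column to another, and nest groups the values of one column under the tags of another. The function names and the list-based column types are assumptions made for illustration, not the final API.

```haskell
import Data.List (nub)

-- cross (*): pair the values of two columns case by case.
cross :: [a] -> [b] -> [(a, b)]
cross = zip

-- blend (+): stack two columns of the same type into one.
blend :: [a] -> [a] -> [a]
blend = (++)

-- nest (/): group the values of the first column under each distinct tag
-- of the second column, in order of first appearance.
nest :: Eq b => [a] -> [b] -> [(b, [a])]
nest as bs = [ (b, [ a | (a, b') <- zip as bs, b' == b ]) | b <- nub bs ]

-- Domain of a continuous attribute: its (minimum, maximum), if it has values.
domain :: Ord a => [a] -> Maybe (a, a)
domain [] = Nothing
domain vs = Just (minimum vs, maximum vs)
```

With the first four rows of the example dataset, nesting the first column under the label column groups 1 and 9 under "A" and 7 and 6 under "B".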

###Co-ordinate Layer

The most popular types of charts employ Cartesian coordinates. The same real tuples in the graphs underlying these graphics can, however, be embedded in many other coordinate systems. There are several reasons for displaying graphics in different coordinate systems. One is to simplify: for example, coordinate transformations can change some curvilinear graphics to linear ones. Another is to reshape graphics so that important variation or covariation is more salient or more accurately perceived. Each co-ordinate system will have its own mapping values (for Cartesian, x and y; for Polar, theta and radius). It will also have a Scale (xlim, ylim, xtrans and ytrans), which can be used to produce log and semi-log scales. The Co-ordinate Layer will also contain a Boolean, Flip; a Fractional, Ratio, which decides the output size of the graph; and a String, Title.

[Figure: Co-ordinate Layer]
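As a rough sketch of how the scale transforms and the coordinate embedding could fit together: the scale is applied to the mapped values first, and the coordinate system then places the result on the page. The field names xtrans and ytrans come from the description above; the types Trans, Coord and Scale and the functions applyTrans, scalePoint and embed are assumptions made for this example.

```haskell
-- Hypothetical types for this sketch.
data Trans = Linear | Log10 deriving (Eq, Show)
data Coord = Cartesian | Polar deriving (Eq, Show)

data Scale = Scale { xtrans :: Trans, ytrans :: Trans } deriving (Eq, Show)

applyTrans :: Trans -> Double -> Double
applyTrans Linear v = v
applyTrans Log10  v = logBase 10 v

-- Apply the per-axis transforms to one mapped pair of values.
scalePoint :: Scale -> (Double, Double) -> (Double, Double)
scalePoint s (a, b) = (applyTrans (xtrans s) a, applyTrans (ytrans s) b)

-- Embed a mapped pair into page x/y.
-- For Cartesian the pair is (x, y); for Polar it is (theta in radians, radius).
embed :: Coord -> (Double, Double) -> (Double, Double)
embed Cartesian (x, y)     = (x, y)
embed Polar     (theta, r) = (r * cos theta, r * sin theta)
```

In this reading, a semi-log plot is `Scale Linear Log10` and a log-log plot is `Scale Log10 Log10`, with the same embed function used for both.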

###Faceting Layer

When dealing with multivariate data, we often want to display plots for specific subsets of the data, laid out in a panel. These plots are often referred to as small-multiple plots. They are very useful in practice, since you only need to take your user through one of the plots in the panel and can leave them to interpret the others in terms of it. All facets of the same graph will have the same Geom Layers.

The input for Faceting would be a list of tuple pairs, the first element denoting the x axis and the second the y axis.

Facet [(None,A)] is the same as facet_grid(. ~ A); Facet [(B,None)] is the same as facet_grid(B ~ .); and Facet [(B,A)] is the same as facet_grid(B ~ A) from ggplot2.
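The core of faceting is splitting the rows into one subset per panel. A minimal sketch, assuming rows are plain values and each facet axis is given by a key function; facetGrid and noFacet are hypothetical names, and noFacet plays the role of None above.

```haskell
import Data.List (nub)

-- One panel per combination of distinct x-key and y-key values.
-- Combinations with no matching rows still get an (empty) panel,
-- so the result always forms a full grid.
facetGrid :: (Eq kx, Eq ky)
          => (row -> kx) -> (row -> ky) -> [row] -> [((kx, ky), [row])]
facetGrid fx fy rows =
  [ ((kx, ky), panel)
  | kx <- nub (map fx rows)
  , ky <- nub (map fy rows)
  , let panel = [ r | r <- rows, fx r == kx, fy r == ky ]
  ]

-- Use in place of a key function when an axis is not faceted.
noFacet :: row -> ()
noFacet = const ()
```

Each panel would then be drawn with the same Geom Layers, applied to its own subset of rows.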

###Theme Layer

The theme layer acts as the aesthetics for the graph and the co-ordinate system. It would include values for colour, line width, tick length, etc.

    data Theme v n = Theme
    { _Colours :: [Colour Double]
    , _LineStyle :: Colour Double -> Style v n
    , _MarkerStyle :: Colour Double -> Style v n
    , _FillStyle :: Colour Double -> Style v n
    , _markerPaths :: [Path v n]
    }

###Geom Layer

####Statistics

A Geom Layer would consist of a list of tuples, [([Statistics],Geom)]. The non-aesthetic values in the Geom Layer are scaled according to the co-ordinate scale values; the statistical transforms are then applied, and the result is passed to the Geom.

  -  bin : Divide a continuous range into bins and count the number of points in each
  -  contour : Calculate contour lines
  -  density : Compute 1D density
  -  jitter : Jitter values by adding small random values
  -  identity : Identity function; F(x) = x
  -  smooth : Smoothed conditional mean of y with respect to x
  -  summary : Aggregate values of y given x
  -  others : such as unique, boxplot, qq, etc.
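To make the bin statistic from the list above concrete, here is a small sketch using equal-width bins over the range of the data. The name statBin, its signature and the choice to put the maximum value in the last bin are assumptions for illustration.

```haskell
-- Split the range of the data into n equal-width bins and count the
-- points in each. Returns (lower edge of bin, count) pairs.
statBin :: Int -> [Double] -> [(Double, Int)]
statBin n vs
  | n <= 0 || null vs = []
  | otherwise = [ (lo + fromIntegral i * width, count i) | i <- [0 .. n - 1] ]
  where
    lo    = minimum vs
    hi    = maximum vs
    width = (hi - lo) / fromIntegral n
    -- Index of the bin a value falls in; the maximum goes into the last bin.
    binOf v
      | width == 0 = 0
      | otherwise  = min (n - 1) (floor ((v - lo) / width))
    count i = length (filter ((== i) . binOf) vs)
```

Feeding the resulting pairs to a bar Geom would give a histogram; the identity statistic would simply pass the values through unchanged.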

####Geom

Geoms are the visible layer of the plot; by combining Geoms we can generate a wide variety of graphs. Each Geom has some plotting variables and some aesthetic variables, and each aesthetic variable is mapped to its respective AesScale. For example, GeomRibbon, which accepts x, y_min and y_max, becomes an area plot when y_min is set to zero.

    data GeomPoint dataset b a = GeomPoint
      { _geomPointX     :: Maybe (dataset b a)
      , _geomPointY     :: Maybe (dataset b a)
      -- Aesthetics
      , _geomPointSize  :: Maybe (dataset b a)
      , _geomPointColor :: Maybe (dataset b a)
      , _geomPointAlpha :: Maybe (dataset b a)
      }

    data GeomHLine dataset b a = GeomHLine
      { _geomHLineY          :: Maybe (dataset b a)
      , _geomHLineCoordinate :: Double
      }

    data GeomVLine dataset b a = GeomVLine
      { _geomVLineX          :: Maybe (dataset b a)
      , _geomVLineCoordinate :: Double
      }

[Figure: Geom]
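The ribbon-to-area relationship mentioned above can be sketched with plain lists. This is a simplified stand-in, not the record layout planned for the library; Ribbon, area and outline are hypothetical names.

```haskell
-- Hypothetical simplified ribbon: parallel lists of x, lower and upper y.
data Ribbon = Ribbon
  { ribbonX    :: [Double]
  , ribbonYMin :: [Double]
  , ribbonYMax :: [Double]
  } deriving (Eq, Show)

-- An area plot is a ribbon whose lower edge is fixed at zero.
area :: [Double] -> [Double] -> Ribbon
area xs ys = Ribbon xs (map (const 0) xs) ys

-- Vertices of the filled region: along the upper edge, then back
-- along the lower edge.
outline :: Ribbon -> [(Double, Double)]
outline (Ribbon xs lo hi) = zip xs hi ++ reverse (zip xs lo)
```

The outline could then be handed to the backend as a closed path and filled according to the colour and alpha aesthetics.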

####AesScales

Aesthetic scales are the scales for the aesthetic layer. If an aesthetic value is linked to an attribute from the dataset, a scale is applied that corresponds to the values in the domain of that attribute. Some aesthetic scales are only meant for discrete domains; others, like color and size/width, can be used for both continuous and discrete domains. Predefined AesScales like scalegrayscale can be very useful when making plots. AesScales will take the data and produce a value of type Diagram B R2, which can be rendered to the screen.

    import Diagrams.Prelude

    -- Map a shape code to a marker path.
    shape :: Char -> Path R2
    shape s
      | s == 'o'  = circle 0.07
      | s == 'a'  = eqTriangle 0.1
      | s == 'b'  = square 0.1
      | s == 'c'  = plus 0.07
      | s == 'd'  = star (StarSkip 2) (pentagon 0.1)
      | s == 'e'  = cross 0.07
      | otherwise = circle 0.07

    -- Diagonal cross: two line segments through the origin.
    cross :: Double -> Path R2
    cross x = fromVertices [ x^&(-x) , (-x)^&x ]
           <> fromVertices [ x^&x , (-x)^&(-x) ]

    -- Upright plus: the diagonal cross rotated by 45 degrees.
    plus :: Double -> Path R2
    plus x = cross x # rotate (45 @@ deg)

[Figure: AesScale]
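The difference between a discrete and a continuous aesthetic scale can be sketched without any drawing code. Both functions below are illustrative assumptions: discreteScale assigns palette entries to distinct domain values in order of first appearance, and continuousScale maps a numeric domain linearly onto a range such as marker size.

```haskell
import Data.List (elemIndex, nub)

-- Discrete scale: the i-th distinct domain value gets the i-th palette
-- entry, cycling when the palette is shorter than the domain.
-- Returns Nothing for values outside the domain or an empty palette.
discreteScale :: Eq a => [b] -> [a] -> a -> Maybe b
discreteScale palette values v
  | null palette = Nothing
  | otherwise    = do
      i <- elemIndex v (nub values)
      pure (palette !! (i `mod` length palette))

-- Continuous scale: linear map from the domain (lo, hi) to the range (a, b).
continuousScale :: (Double, Double) -> (Double, Double) -> Double -> Double
continuousScale (lo, hi) (a, b) v
  | hi == lo  = a
  | otherwise = a + (v - lo) / (hi - lo) * (b - a)
```

For example, a palette of shape codes such as "oab" could be used with discreteScale to pick a marker code per category, which is then turned into a path.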

####Legend

If an aesthetic is mapped to more than one value (to a domain), it would produce a legend. A legend is linked directly to the scale that is set. In addition, legends would have user-set values such as the position of the legend, spacing, text font/color and orientation.

    type Dia = Diagram B R2

    discreteColorLegend   :: Show b => DScale b (Colour Double) -> [String] -> Dia
    sizeScaleLegend       :: Dia -> IntervalScale Double Double -> [String] -> Dia
    continuousColorLegend :: Show b => DScale b (Colour Double) -> [String] -> Dia

    -- Legend settings (pseudocode: Int marks a field's type, the other
    -- right-hand sides are default values)
    Legend
    { _legendPosition = Top
    , _legendSpacing = Int
    , _legendTextWidth = Int
    ...
    , _legendTextStyle = mempty & _fontSizeR .~ fmap Recommend (output 12)
    , _legendOrientation = Vertical
    }
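The rule that a legend only appears when an aesthetic is mapped to more than one value can be sketched as follows. legendEntries is a hypothetical helper that returns the label and scaled aesthetic value for each distinct domain value; laying the entries out as a diagram is left out.

```haskell
import Data.List (nub)

-- Label/value pairs for a legend, given the scale function and the raw
-- attribute values. A constant attribute (one distinct value, or none)
-- produces no legend entries.
legendEntries :: (Eq a, Show a) => (a -> b) -> [a] -> [(String, b)]
legendEntries scale values
  | length distinct > 1 = [ (show v, scale v) | v <- distinct ]
  | otherwise           = []
  where
    distinct = nub values
```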

#Schedule

###April 27 - May 11:

I do not know everybody in the community yet, and I have not had enough time to get to know the developers. This will be a great time to get to know everybody, including my fellow students. My summer vacation starts on the 29th of April. This is the familiarization period, in which I will read the documentation for ggplot2 and Diagrams and the theory behind the Grammar of Graphics. It will also involve reading and understanding the source code of existing GoG implementations in other languages. Brainstorming; begin working on the idea.

###May 12 - May 25:

During this period I will write code snippets and pseudocode for the algorithms, which will be useful once I start coding. I will also make a rough sketch of the software architecture and decide on details such as types, classes and functions. I will discuss changes to the original idea with the Mentor, then edit and finalize the idea.

###May 26 - June 9:

Coding starts. My work will be an extension of the Plotting Library by cchalmers. During this period I will start working on the dataset type, data DataFrame = DataDouble String [Double] (Double,Double) | DataString String [String] [String], the basic variable to be used. I will implement cross (*), nest (/) and blend (+), and the functions necessary to manipulate the raw dataset. I will also work on the co-ordinate systems (Cartesian and Polar) and their scales (Linear, Semi-Log, Log). At the end of this period it will be possible to generate simple scatter plots in both Cartesian and polar co-ordinate systems.

###June 10 - June 24:

Work on basic (non-geomlayer-specific) statistical functions like statsbin, statsjitter, statsidentity and statssmooth, which will later be very helpful in implementing the Geometric Layer. I will start building the Geometric Layer, starting with ABLine, VLine, HLine, Bar, Ribbon/Area and Point and their respective aesthetics. I will also spend a lot of time documenting, adding examples and adding tests, as the basic Grammar of Graphics framework will be done. At the end of this period it will be possible to produce a bar graph, area graph, line plot, etc. As Geom objects act like Lego blocks, multiple layers can be added to form complex plots.

###June 24 - July 2:

This period will allow me to add more geomlayers; as the number of geomlayers increases, combining them will lead to much more versatile plotting. During this period I will work on group aesthetics for geomline and geomribbon. I will also do some work on Legends, including options such as legend placement, and add more scales to map aesthetic layers.

###July 3: Midterm Evaluation.

###July 4 - July 18:

Most of this time will be spent writing a few more geometric layers, such as Heatmaps/Tiles, creating new datasets and writing a few new examples. I will also add support for the position adjustments stack, fill and dodge.

###July 19 - August 2:

During this period I will work on the Faceting Layer. At the end of this period it will be possible to generate faceted scatter plots in both co-ordinate systems. I will also write the necessary API for ease of use.

###August 3 - August 12:

During this period I plan to work on a theme system, which can be used to manipulate bgcolor, ticklength and other theme properties with ease. I will write tests and documentation. At the end of the period it will be possible to make contour maps and boxplots. This will mark the completion of the project. I will spend a large amount of time polishing and optimizing the code.

###August 13 - August 21: Buffer Period

  • Fixing issues/bugs
  • Detailed documentation of the project
  • Add high level API for ease of use

###August 21 - August 28: Pencil down.

  • Spend time cleaning/polishing the code
  • Writing documentation
  • Adding examples

###August 28: Submit Code.

###Post Gsoc:

  • 3D Co-ordinate System
  • Spherical Co-ordinate System
  • Support for Maps and Networks
  • Add more Geometry Layer

#Personal:

##University and Department

Hi, I'm Ajay Ramanathan, a resident of India (GMT +5:30) and a student of Ocean Engineering and Naval Architecture at the Indian Institute of Technology, Kharagpur, West Bengal. This is my first time working with a functional programming language; I usually prefer coding in Python. I’m quite fascinated by Big Data and neural networks, and I have been working primarily in AI, natural language processing and machine learning for the past few months. I love working on UI/UX design and illustrations, as I started out as a designer. I also have in-depth knowledge of C and C++, especially with Gtk and other UI libraries.

I can work on this project with dedication, since I currently have no other project or work requiring my immediate attention (for at least four months). I can work 5-8 hours a day, averaging 30 hours over the weekdays (Monday through Friday). On weekends my work time can increase to up to 10 hours a day, giving a total of around 50 hours a week. I usually work late at night till early morning.

My personal aim during the project is to develop professional functional programming skills to complement my design skills, and to have fun. The Grammar of Graphics by Leland Wilkinson gave me a completely different perspective on charts and graphs, and I believe implementing it in Haskell will be a really fun task. I am new to Haskell, but I believe that what I lack in Haskell coding skills I can make up for with my knowledge of design principles, enthusiasm and a little hard work. I also have some veteran Haskell friends to help me out if I find myself in a tight spot.

The deliverable is a well-documented application in Haskell, built on Diagrams, that can be used to efficiently and accurately plot any graph from a varied dataset. I will maintain a public GitHub repository where the Mentor will be able to track my progress. I plan to submit a detailed progress report each fortnight, with some level of documentation, to allow discussion of important details and necessary changes to the code. I use the nickname "chinu" everywhere and can easily be found on IRC.

##Previous Work

###Gsoc 2013

Implementation of the combined selection tool under GIMP: I implemented a unified selection tool that combines the functionality of the rectangle/ellipse selection tool, fuzzy select, select by color, foreground selection tool, free selection tool and intelligent scissors into three new tools that are easier and faster to use.

Others

  • Faceting System
  • Thesaurus parser
  • Small game implementations in pygame (Python) and phaser.js (JavaScript).
  • Extensive work on image analysis and manipulation using OpenCV (C++) and SimpleCV (Python).
  • Currently I am working on machine learning and neural networks using nupic (Python) and scikit-learn (Python).

##Contact:

##Important Links:
