@wangyuchen
Last active August 29, 2015 13:57
GSoC 2014 proposal

Project title: Turning R objects into Pandoc's markdown [1]

Project short title (30 characters): pander

Bio of Student

I'm currently a first-year graduate student at the University of Missouri, Columbia. I majored in statistics as an undergraduate and minored in computer science for two and a half years. After I learned the basic theory of mathematical statistics and gained some preliminary programming experience, R became a natural choice as my language for statistical computing. R's ability to quickly implement statistical models and integrate with other languages is a major advantage over other statistical software.

Luckily, I had the chance to become a research assistant in my senior year. I was assigned to a project on statistical process control. It's a small area, but it requires a lot of knowledge from other areas of statistics, and it gave me real data to work with. I have used R on a daily basis for the past year and a half during the project. I developed jtrans and several other packages; jtrans is on CRAN, and most of my other public packages are on my GitHub.

My primary interest in R lies in its application with modern web technology. During the past two years, I've used knitr to combine my Markdown reports with R code and output, and I've worked with googleVis and recharts (an interface to Baidu's ECharts) to produce interactive plots based on JavaScript. Because R is an open-source scripting language, it's easy to combine R with other scripting languages and visualization systems, and I think this is definitely a big trend.

With my educational background in statistics and computer science and my extensive programming experience with real data in R, I'd like to contribute my summer to R's development to thank the community for providing us with such a great tool.

CONTACT INFORMATION

Student name: Yuchen Wang

Link_id: teary0712

Student postal address: 2500 S Old Hwy 63 Unit 823, Columbia, Missouri, U.S., 65201

Telephone(s): (573) 239-5737

Email(s): ycwang0712@gmail.com or yuchen.wang@mizzou.edu

Other communications channels: Skype/Google+, etc. : Google talk: ycwang0712@gmail.com

Student affiliation

Institution: University of Missouri, Columbia

Program: Master of Arts in Statistics

Stage of completion: First year

Contact to verify: Kathleen Maurer umcasstat@missouri.edu

Schedule Conflicts:

I had already applied for several summer internships before I decided to apply for Google Summer of Code 2014, but I haven't heard back from any of them yet. I won't apply for any other internships until the result of my GSoC application comes out.

Some personal matters may come up during the summer, such as a family vacation or small tasks for research projects, but none of them will take longer than one week. That said, I can commit to working an average of 40 hours per week during GSoC 2014.

MENTORS

Mentor names: Gergely Daróczi and László Szakács

Mentor emails: daroczig@rapporter.net and cocinerox@gmail.com

Mentor link_ids: Don't know yet, needs update.

Have you been in touch with the mentors? When and how?

Yes, I have. After viewing the ideas page and picking the most attractive project (this one), I started working on the test problem right away. I spent about four hours on the output of the test problem to make it as similar to the given answer as possible.

I finished the test problem at about 4 AM on March 12 and immediately made a pull request to the mentor's repository on GitHub so that he could see it as soon as possible. When I woke up the next morning, I had received his reply telling me that the code was fine and that I should start writing this proposal.

CODING PLAN & METHODS

Describe perceived obstacles

  1. Understand the internals of pander.pander
  2. Reproduce some features of other packages (like xtable and stargazer)
  3. Refactor the current codebase, especially the forked brew function
  4. Improve the error handling and logging facilities

These four are the greatest obstacles I see now, and I'd like to make them the quarter milestones of my plan. The first two, understanding the whole package and adding support for new classes, are a kind of expansion: they add functionality. The other two are a kind of self-improvement: they reorganize the code to make it better.

Class support and the implementation of global options will be the main focus before the midterm. After that, the focus will turn to internal error handling and logging.

Personally, I have an idea for rendering R functions and expressions as LaTeX equations. It would make it easy for report writers to turn an equation in LaTeX directly into an R function, or vice versa. I may incorporate that if we have time.

TIMELINE

April 22 - May 4

  • Read introductory documentation and get familiar with the source code of the project and the coding style. Compare pander with other tools, such as xtable and knitr. Identify the advantages and disadvantages of each package.
  • Get to know and discuss with mentors. Get feedback to help me evaluate the time needed for each step in my work.

May 5 - May 18

  • Exchange ideas with other users about how to improve the pander package.
  • Add simple features to the package, fix simple bugs, and do some early code hacking.

May 19 - June 1 (Coding - Phase 1)

  • Try out new methods for R classes that are not yet supported. Work around some limitations of markdown by supporting row and column spanning.
  • Start digging into a markdown version of stargazer.

June 2 - June 22 (Coding - Phase 2)

  • Implement new global options for tables and plots, for example making pandoc.table support configurable column widths.
  • Prepare for the midterm submission.

June 23: Submit mid-term evaluations

June 23 - July 6 (Coding - Phase 3)

  • As the project moves into the next phase, the focus will be on improving error handling, documentation and so on. By that time I expect to have experience with and a deep understanding of pander internals, so we'll set specific goals for how to improve these in the following month.

July 7 - July 20 (Coding - Phase 4)

  • Refactor current codebase, especially the forked brew function

July 21 - August 3 (Coding - Phase 5)

  • Improve error handling and logging facilities

August 4 - August 18 (pencils down)

  • Documentation, tutorial and project website.
  • Finish the whole task before the 'pencils down' date. I plan to spend two weeks scrubbing code, writing tests, improving documentation and polishing other aspects of the project.

August 22

  • Submit required code samples to Google after ensuring all the tests pass.

MANAGEMENT OF CODING PROJECT

How do you propose to ensure code is submitted / tested?

I will use GitHub extensively during the project. GitHub already provides great tools for managing code and for discussions with mentors as well as users. Issues and milestones will be set up so that I can track my progress by solving tasks one by one. The mentors are also GitHub users, so they can view or comment on my code whenever they want. Furthermore, project pages can be a great place for documentation and tutorials (given the nature of pander, it would be great to provide some online tutorials). Nowadays, many popular packages (knitr, ggplot2, rCharts) use GitHub Pages to build a project site for documentation and discussion, so I think my experience with GitHub Pages and Jekyll will be helpful.

How often do you plan to commit? What changes in commit behavior would indicate a problem?

I plan to commit at least once a day. I think related changes should be grouped into a single commit, so that the purpose of each commit is clear and its description is not messy. The number of commits may not be a good indicator of how much work is being done. However, if I cannot commit for two or three days while there is a specific piece of functionality to add, then there is a problem. In that situation, I will discuss the problem with the mentors directly as soon as possible.

TEST

Describe the qualification test that you have submitted to your project mentors. If feasible, include code, details, output, and examples of similar coding problems that you have solved.

What's tested

The test problem asks for an S3 method for CrossTable objects in the pander package on GitHub. In my opinion, it mainly tests the following abilities:

  1. The ability to use git and GitHub: how to fork an existing source package, how to commit changes, and how to send a pull request back to the origin so that the original developer can see your modifications.
  2. The ability to develop a package: how to build a package from source, how to use roxygen2 to generate documentation, and how to control the namespace.
  3. Object-oriented programming in R: how to add a new method to an existing generic function and how to work with objects.

Example code to show the above abilities

I'm quite familiar with GitHub, so forking the package and starting the test was routine for me. The final pull request is at Rapporter/pander#63.

As you can see in the pull request, I made two commits to separate modifications with different purposes. The first commit is for personal settings: since I used RStudio as my development environment, there are some configuration files that should be ignored in the repository. In the next commit, I added an S3 method 'pander.CrossTable()' for 'CrossTable' objects to the generic function 'pander()'. The function was added to the file 'S3.R', where all the S3 methods for 'pander' reside.

A roxygen2 comment is added so that the namespace file is automatically updated to export the new S3 method.

#' @S3method pander CrossTable

As for the function itself, it receives a CrossTable as input and prints out a Markdown table. After inspecting the package, I found that other pander() methods that produce a table usually use the pandoc.table() function to output table objects, so there is no need to worry about how to print the header or footer. Therefore, the basic structure of pander.CrossTable() should look like this:

pander.CrossTable <- function(x, ...) {
  # convert CrossTable x to table res
  pandoc.table(res, ...)
} 

The ... parameter is used to pass arguments on to pandoc.table(). Then I just need to inspect the CrossTable object, convert it to a table object and pass it to pandoc.table(). The CrossTable object turns out to be a list, so I need to collect and reorganize different components of the list. At first, I considered using the print method for CrossTable that the descr package provides. However, print.CrossTable() actually does more than pander.CrossTable() is supposed to do, and using it would require the package to depend on descr. I think that's not necessary.
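The dispatch-and-delegate pattern described above can be sketched in isolation. This is only an illustration under stated assumptions: render(), render_table() and the mock_crosstab class are hypothetical stand-ins for pander(), pandoc.table() and CrossTable, not the package's actual code.

```r
# Hypothetical generic standing in for pander()
render <- function(x, ...) UseMethod("render")

# Hypothetical stand-in for pandoc.table(): returns the table as lines of text
render_table <- function(tab, caption = NULL, ...) {
  lines <- utils::capture.output(print(tab))
  if (!is.null(caption)) lines <- c(lines, paste("Table:", caption))
  lines
}

# Method for a mock class: pull the relevant component out of the list,
# then delegate, forwarding any extra arguments through `...`
render.mock_crosstab <- function(x, ...) {
  res <- x$counts
  render_table(res, ...)
}

obj <- structure(list(counts = matrix(1:4, nrow = 2)), class = "mock_crosstab")
out <- render(obj, caption = "demo")
```

Because the method forwards `...` unchanged, a caller can set options of the table writer (here the hypothetical caption argument) without the method having to know about them.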

The major problem in converting a CrossTable object lies in the undetermined number of levels of the factor variable: the number of sub-tables in a CrossTable varies. So at first, I only implemented a function to deal with a single sub-table. The following function returns the summary sub-table for one level of the CrossTable.

sum_table <- function(t) {
  # `x` and `digits` are not arguments; they are taken from the enclosing scope
  level <- row.names(t)[1]

  # first row holds the counts, the remaining rows hold proportions
  t[1, ] <- round(t[1, ], digits = digits)
  t[-1, ] <- sapply(t[-1, ], round_to_percent)
  rsum <- c(round(x$rs[level]), rep(NA, 2),
            round_to_percent(x$rs[level] / x$total.n))

  # add row sum as the last column
  t <- cbind(t, Total = rsum)

  # add a blank row above and below
  t <- rbind(NA, t, NA)

  # add row labels as the first column
  t <- cbind(a = c(level, 'N', 'Row (%)', 'Column (%)', ' ', ' '), t)
  colnames(t)[1] <- ' '

  return(t)
}

The round_to_percent() function rounds a decimal value to a percentage. sum_table() returns one sub-table of the CrossTable, as shown below.

> s[[1]]
             3         4 5
0   15.0000000 4.0000000 0
0.1  0.7894737 0.2105263 0
0.2  1.0000000 0.3333333 0
0.3  0.4687500 0.1250000 0

> sum_table(s[[1]])
                  3    4    5 Total
1            0 <NA> <NA> <NA>  <NA>
0            N   15    4    0    19
0.1    Row (%)  79%  21%   0%  <NA>
0.2 Column (%) 100%  33%   0%  <NA>
0.3             47%  12%   0%   59%
6              <NA> <NA> <NA>  <NA>
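The definition of round_to_percent() is not reproduced in this proposal. A minimal sketch that is consistent with the percentages in the output above might look like the following; the digits argument is an assumption, and missing values are not handled.

```r
# Hypothetical sketch of the helper: turn a proportion into a percentage string
round_to_percent <- function(p, digits = 0) {
  paste0(round(p * 100, digits = digits), "%")
}

pct <- round_to_percent(c(0.7894737, 0.2105263, 0))
# pct is c("79%", "21%", "0%")
```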

Then, after splitting the original CrossTable and applying this function to each piece, a total summary table is available. I also made small modifications such as adding a digits parameter, but the mentor then told me I can use the digits option in the package directly. I believe that after spending some time with pander, I will become familiar with using the options the package already provides.
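The split-and-apply step can be sketched without the descr package. The example below is an illustration only: it uses the built-in mtcars data as a stand-in for the cross-tabulated variables, and sub_table() is a hypothetical helper, not the function from the pull request.

```r
# Stand-in for the pieces of a CrossTable: counts plus row, column and total proportions
counts   <- table(mtcars$am, mtcars$gear)
prop_row <- prop.table(counts, 1)
prop_col <- prop.table(counts, 2)
prop_tbl <- prop.table(counts)

# Hypothetical helper: stack the four statistics for one row level
sub_table <- function(level) {
  rbind(N      = counts[level, ],
        Row    = prop_row[level, ],
        Column = prop_col[level, ],
        Total  = prop_tbl[level, ])
}

# One sub-table per level, however many levels there are
sub_tables <- lapply(rownames(counts), sub_table)
names(sub_tables) <- rownames(counts)

# Stack the sub-tables into one summary matrix
combined <- do.call(rbind, sub_tables)
```

Using lapply() over the level names keeps the code independent of the number of levels, which is the difficulty described above.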

Footnotes

  1. This proposal is available at this gist.
