@chrismedrela
Last active March 15, 2016 00:14
"Improving numerical routines in Scala Breeze" GSoC 2014 proposal.

Abstract

Breeze is a great numerical processing library, but it has two shortcomings. First, it lacks some high-level functions that you can find in other libraries like SciPy. Second, its documentation is sparse, which raises the entry barrier for new contributors.

My proposal is to revamp the documentation and to introduce interpolation and integration facilities.

My main principle will be to lower the entry barrier as much as possible and to get people excited about Breeze so that it gains a lot of new contributors. In my opinion, reaching critical mass is the most important thing at this moment.

Improving documentation

At the beginning, I'm going to improve the documentation of the existing Breeze modules and then revamp the tutorial. This is a chance for me to get into the internals of Breeze. I'm going to spend about 3-4 weeks on this step.

I believe that good documentation consists of three components:

  • Step-by-step tutorial shows the key concepts (like vectors and matrices) and how Breeze "feels"; it should be quick and easy.
  • Topical guides are for those who have read the tutorial. In Breeze there is a natural one-to-one relation between modules and topics, so this is equivalent to high-level module documentation.
  • Low-level deep-dive reference included at the end of topical guides.

If time permits, I will investigate whether it's possible to treat code snippets in the documentation as doctests. With doctests, we would no longer have to worry about out-of-date snippets.
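A first step towards doctests could be pulling the Scala snippets out of the Markdown sources. The sketch below is only an illustration: `SnippetExtractor` and `scalaSnippets` are hypothetical names, and it assumes that snippets are written as fenced blocks tagged `scala`. Compiling or running the extracted snippets is not covered.

```scala
object SnippetExtractor {
  // Three backticks, built programmatically so this example can itself sit in a fenced block.
  private val fence = "`" * 3

  // (?s) lets '.' match newlines; the lazy group captures everything up to the next fence.
  private val scalaBlock = ("(?s)" + fence + "scala\\n(.*?)" + fence).r

  /** Returns the body of every Scala fenced block found in a Markdown document. */
  def scalaSnippets(markdown: String): List[String] =
    scalaBlock.findAllMatchIn(markdown).map(_.group(1)).toList
}
```

Each extracted snippet could then be written to a test source file or fed to the Scala REPL as part of the build.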

The main breeze page

The main page, that is, the first page users see, should be as short as possible. Everything that can be moved to other pages (e.g. installation and contributors) will be moved out. People don't have time to read minor details. First they need to know whether Breeze is what they are looking for and to get excited about the project.

The first paragraph should contain the most important information like:

  • what Breeze is;
  • why you should care about it;
  • why it's worth the effort to learn it;
  • what license Breeze uses;
  • what the latest release is;
  • and what the other components are (breeze-viz, breeze-learn, breeze-process).

The second paragraph will be a bunch of links. Everything should be easy to reach, ideally in two steps: go to the home page and follow the appropriate link. There won't be too many links (installation and contribution; tutorial and full documentation; bug tracker and source code, both on GitHub).

The next paragraph should show the power of Breeze and get people excited, so a dead-simple way to start playing with Breeze is a must:

$ sbt
set libraryDependencies ++= Seq("org.scalanlp" % "breeze_2.10" % "0.7-SNAPSHOT")
set libraryDependencies ++= Seq("org.scalanlp" % "breeze-natives_2.10" % "0.7-SNAPSHOT")
set resolvers ++= Seq("Sonatype Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots/")
set resolvers ++= Seq("Sonatype Releases" at "https://oss.sonatype.org/content/repositories/releases/")
set scalaVersion := "2.10.3"
console
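The same settings could also be kept in a `build.sbt` file in an otherwise empty directory, so that a new user only has to run `sbt console`. This is a minimal sketch, assuming sbt 0.13 or later; it reuses the versions and resolvers from the commands above and adds nothing else.

```scala
// build.sbt -- only downloads Breeze and opens it on the console classpath.
scalaVersion := "2.10.3"

libraryDependencies ++= Seq(
  "org.scalanlp" % "breeze_2.10" % "0.7-SNAPSHOT",
  "org.scalanlp" % "breeze-natives_2.10" % "0.7-SNAPSHOT"
)

resolvers ++= Seq(
  "Sonatype Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots/",
  "Sonatype Releases" at "https://oss.sonatype.org/content/repositories/releases/"
)
```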

Then we will show the most impressive things you can do in Breeze with very little code. This part will consist mostly of code examples, with no deep description. The first impression is very important. At the end, there will be a link to the tutorial.

Switching to Github Pages and Jekyll

The current documentation consists of wiki pages on GitHub. This causes two problems. First, the wiki and the code live in separate repositories, so there are two different workflows for working on documentation and on code. There is also no link between the two, so you can't tell which version of the code a given documentation page describes.

The second problem is that people would get more excited if Breeze had its own webpage instead of using GitHub wiki pages. There is an initial effort in this direction for Breeze and Epic (see the ScalaNLP webpage). My goal would be to enhance it and to move all documentation to this site.

GitHub Pages is a good choice because it's free and integrated with Jekyll. That means the HTML documentation pages will be generated directly from Markdown files. The technology is also very easy to use.

Improving numerical routines

After revamping the existing documentation I will focus on the main part of this proposal -- introducing interpolation and integration modules.

My goal is not to implement all possible facilities. Instead, for each family of algorithms I will implement only one and design an interface that all algorithms from that family must fulfill. Together with good documentation, this will lower the entry barrier. I find this better in the long term: at this moment, attracting more contributors matters more than completeness, and the new contributors can implement the other algorithms.

I'm going to write documentation and/or tests before the implementation so that other people can see what the API will look like and can comment on it.

This project carries little risk. The code can be merged into the master branch after each family of algorithms is implemented, so an iterative approach suits this project very well.

I'm currently working on implementing linear interpolation, so you can get a feel for what I'd like to do this summer.

Brief plan

I will start by implementing interpolation. Univariate linear interpolation is already in progress. Then I will implement 1-d splines of degree at most 3. After that, I will move to multivariate interpolation and implement both n-d linear interpolation and 2-d splines (of degree at most 3). If time permits, I will also implement other interpolators, such as barycentric and Krogh interpolators.
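To make the idea of "one interface per family, one implementation to start with" more concrete, here is a minimal sketch of univariate linear interpolation in plain Scala. It is not the code from my branch and not Breeze's API: the names `UnivariateInterpolator` and `LinearInterpolator` are hypothetical, and the sketch assumes strictly increasing nodes and no extrapolation.

```scala
/** Hypothetical common interface for 1-d interpolators (an assumption, not Breeze's API). */
trait UnivariateInterpolator {
  def apply(x: Double): Double
}

/** Piecewise-linear interpolation through the points (xs(i), ys(i)).
  * Assumes xs is strictly increasing and has at least two elements.
  */
class LinearInterpolator(xs: IndexedSeq[Double], ys: IndexedSeq[Double])
    extends UnivariateInterpolator {
  require(xs.length == ys.length, "xs and ys must have the same length")
  require(xs.length >= 2, "at least two nodes are required")
  require(xs.sliding(2).forall(p => p(0) < p(1)), "xs must be strictly increasing")

  def apply(x: Double): Double = {
    require(x >= xs.head && x <= xs.last, "x is outside the interpolation range")
    // Index of the last node that is <= x, clamped so that i + 1 is still a valid index.
    val i = math.min(xs.lastIndexWhere(_ <= x), xs.length - 2)
    // Relative position of x inside the segment [xs(i), xs(i + 1)].
    val t = (x - xs(i)) / (xs(i + 1) - xs(i))
    ys(i) + t * (ys(i + 1) - ys(i))
  }
}
```

For example, `new LinearInterpolator(Vector(0.0, 1.0, 3.0), Vector(0.0, 2.0, 6.0))` evaluated at `0.5` gives `1.0`. A spline interpolator could later implement the same trait, so user code would not need to change when switching methods.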

The rest of the time I will focus on integration. Again, I will start with single integrals. I will implement the trapezoid and Simpson methods with equidistant nodes. Then I will move to n-d integrals and implement the Monte Carlo method.
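The three integration methods mentioned above can be sketched in a few lines of plain Scala. This is an illustration under stated assumptions, not a proposed API: the object name `Quadrature` and all signatures are hypothetical, the 1-d rules use `n` equal subintervals, and the Monte Carlo estimator samples uniformly from an axis-aligned box.

```scala
import scala.util.Random

object Quadrature {
  /** Composite trapezoid rule on [a, b] with n equal subintervals (n >= 1). */
  def trapezoid(f: Double => Double, a: Double, b: Double, n: Int): Double = {
    require(n >= 1, "n must be positive")
    val h = (b - a) / n
    val interior = (1 until n).map(i => f(a + i * h)).sum
    h * (0.5 * (f(a) + f(b)) + interior)
  }

  /** Composite Simpson rule on [a, b] with n equal subintervals (n even, n >= 2). */
  def simpson(f: Double => Double, a: Double, b: Double, n: Int): Double = {
    require(n >= 2 && n % 2 == 0, "n must be even and at least 2")
    val h = (b - a) / n
    val odd = (1 until n by 2).map(i => f(a + i * h)).sum   // weight 4
    val even = (2 until n by 2).map(i => f(a + i * h)).sum  // weight 2
    h / 3.0 * (f(a) + 4.0 * odd + 2.0 * even + f(b))
  }

  /** Plain Monte Carlo estimate of the integral of f over the box
    * [lower(0), upper(0)] x ... x [lower(d-1), upper(d-1)].
    */
  def monteCarlo(f: Array[Double] => Double,
                 lower: Array[Double],
                 upper: Array[Double],
                 samples: Int,
                 rng: Random): Double = {
    require(lower.length == upper.length, "bounds must have the same dimension")
    require(samples >= 1, "at least one sample is required")
    val volume = lower.indices.map(i => upper(i) - lower(i)).product
    var total = 0.0
    var k = 0
    while (k < samples) {
      // Draw one point uniformly from the box.
      val point = Array.tabulate(lower.length) { i =>
        lower(i) + rng.nextDouble() * (upper(i) - lower(i))
      }
      total += f(point)
      k += 1
    }
    // Mean value of f times the volume of the box.
    volume * total / samples
  }
}
```

Simpson's rule is exact for polynomials up to degree 3, which makes it easy to test. The Monte Carlo error shrinks only like one over the square root of the number of samples, but that rate does not depend on the dimension, which is why it is the natural first choice for n-d integrals.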

If time permits, I will also focus on enhancing existing modules like signal processing, optimization and statistical functions.

About me

My name is Christopher Mędrela and I'm a student at the University of Science and Technology in Kraków (Poland). My time zone is UTC+01:00. My email address is chris.medrela+gsoc2014 at gmail.com. I have [a github account](https://github.com/chrismedrela).

I've been contributing to open source projects since 2011, mainly to Django, for which I've written a lot of patches. Last year I participated in GSoC and successfully revamped the Django check framework (proposal, merge).

I'm fluent in Python, so I can easily comprehend SciPy. I'm interested in other languages too. I first encountered Scala about one year ago. Before switching to Python, I was coding in Java, so Scala is not a completely new language for me. I'm familiar enough with Scala to manage this project, and the branch where I'm implementing linear interpolation proves that.

I can use the tools necessary to manage this project. During the last GSoC I mastered git. This proposal is written in Markdown, so I've already gotten started with it. I know the basics of sbt; otherwise I couldn't have written the pull request.

During the last GSoC it turned out that my English is good enough to talk in real time although it's not fluent.

I'd like to internally shift the GSoC dates to start on 21 April (one month earlier) and finish after 12 weeks. Google said: "We don't police what you deliver to your org and when, simply that you meet the milestones of the program as laid out." Since everything will be done earlier, the deadlines are not a problem. David Hall doesn't object to that either. The reason for the shift is that I'd like to do an internship in late summer, and this is the only way I can avoid a clash with GSoC.

During GSoC, I'm not going to have a job, holidays, or any other time-consuming activity apart from classes at university. I'm not going to apply for any internship that would collide with GSoC. I'd like to reserve one week to prepare for exams (23 - 27 June), so I will be able to work on GSoC for eleven weeks.

@dlwh

dlwh commented Mar 6, 2014

This looks basically good!

My biggest overarching comment is to try to shape the pitch not in terms of adding in SciPy functionality, even though I pitched it like that on the list. Can you maybe title it along the lines of "Improving Numerical Routines in Scala Breeze," or something like that? Most of the text can stay the same, I just want this to be more about enhancing Breeze, rather than "merely" importing in SciPy.

It makes more sense to have people start their own project that depends on Breeze rather than cloning and running. The startup time for compiling Breeze is unfortunately long.

I don't think graph routines are a good fit for Breeze, despite their presence in Scipy. Clustering will go in Nak. Signal Processing already exists, so probably falls into your "enhancing" category. Maybe just leave the enhancing out of the proposal? If there's time, we can think of other things to do.

FWIW, your (written) English seems great. I'm not sure about it being ok for you to do classes while you do GSOC, but I'll let the Google people and EPFL decide that. What's your load like?

@chrismedrela
Author

What about the new procedure to run Breeze?

What's wrong with taking classes during GSoC? This program is for students, so everybody has classes. My last class is on 24 June and is followed by the exam period. If I pass all exams on the first attempt, which is very likely (I have failed only one exam on the first attempt so far), I will end the academic year one week later.

BTW What does EPFL stand for?

My load: Every week there are 5 or 6 labs (one lab takes place every two weeks); two of them take 2h 15' each, the rest 1h 30' each. Plus 6 lectures, which are not obligatory and which I'm not going to attend. It's easier to read the lecture notes at home. Plus 2 English classes x 1h 30'. Some of the labs will finish after ten weeks, while the entire term is 15 weeks long. Three exams at the end of the term. So as you can see, I'm going to spend exactly 12 hours at university + some time at home. So the load is equivalent to a half-time job. That means that I can manage GSoC easily.

@dlwh

dlwh commented Mar 10, 2014

That sounds good! Maybe a simple template empty sbt project that just downloads the jars? Can also double as a script or something.

Americans usually don't take classes in the summer. If you've successfully juggled it before, I have no objections.

EPFL is http://epfl.ch/ . Scala was/is created there.

@chrismedrela
Author

The summer vacation is from 10 July to 30 September in my case. Generally speaking, summer holidays are in July, August and September in Poland. My plan is to prepare for exams and get as far ahead as possible before 21 April so that I won't be disturbed by classes and will be able to focus on GSoC.

Have you received any feedback from Google or EPFL yet?

@dlwh

dlwh commented Mar 17, 2014

I think there's nothing to worry about. If you successfully completed a GSoC doing basically the same thing last year, I see no problem.

@hubertp

hubertp commented Mar 20, 2014

How many and what exams do you have (Polish names are ok, I can understand them)? Which year are you currently in? I just want to get a rough idea on the workload (apart from the one you already provided).

@chrismedrela
Author

I'm a second-year student (major: Automatic Control and Robotics) and I will have three exams this term:

  • Basics of Automatic Control -- Podstawy Automatyki
  • Electrical Engineering -- Elektrotechnika
  • Automation Apparatus (no idea how to translate it into English) -- Aparatura Automatyzacji
