This page outlines my time at Google Summer of Code on the JuliaLang organization, working on Survey.jl from May 29, 2023 to August 28, 2023.
Contributor: @nadiaenh
Project GitHub: https://github.com/xKDR/Survey.jl
GSoC project page: https://summerofcode.withgoogle.com/programs/2023/projects/feWjPxcW
The bulk of my contributions can be found at the bottom of this page, or you can check out the PRs linked throughout this document for a more granular view.
Before starting GSoC, we expected that I would work on GLM support, raking support, adding or improving support for additional survey designs, improving documentation, and maybe imputation if time permitted. A glm() function was already partially defined with one test, but it was computing standard errors incorrectly for replicate design types. Additionally, I had met with the maintainers of the package and suggested that we restructure part of the code to use an abstract data structure for summary statistics, which would make the code more reusable - see PR #277 and PR #275. I had also contributed two documentation PRs - the larger PR #283 and the very small PR #285.
The three main deliverables from my time at GSoC ended up in PR #304, which includes:
- Support for design-based variance computation
- Improved support for domain-based statistical summary functions (mean, total, quantile, and ratio)
- Support for generalized linear models for Bootstrap and Jackknife replicate designs
All three new features included relevant tests, docstrings, and documentation.
I was also able to address Issue #298, Issue #300, Issue #295, Issue #301, and Issue #309.
I came into this experience unsure of what to expect, and I am leaving it with unique insights into the open-source community.
What I learned:
- Julia, obviously! I had started learning some Julia before GSoC, but I definitely learned a lot more over the summer - things like docstrings, doctests, comparisons with R, how to write a test suite, how package documentation is generated, CI/CD, etc.
- Speaking up when you think of an improvement or notice an issue is the name of the game. Whenever I suggested improvements like abstract data types for summary statistics, or having a separate file for all variance functions, I received engaged opinions and openness to those changes.
- Adaptability is a big part of open source. I had to be flexible and jump from one task to another. For example, I ended up doing a lot of work on the variance() functions because it was necessary for the proper functioning of glm().
- Organization is a life-saver! I kept detailed notes of every single meeting throughout the program, as well as what work I did each week. This came in handy when writing my final report, and for making sure I stayed on track with the tasks that needed to be done.
What I would do differently:
- Be more intentional with my communication, and reach out more frequently when I'm stuck instead of trying to figure it out on my own.
- Be more self-directed to have more ownership of my tasks, and create a bigger impact in the project.
- Prioritize when juggling many tasks, which would have helped with stress and some confusion at times.
My mentor Ayush Patnaik was always available for questions, willing to explain concepts or processes, sending a ton of helpful links and documentation, and helping me set a new direction whenever I was blocked. I'm very grateful for his continued assistance, understanding, and knowledgeability!
I also want to thank another maintainer on the package, Shikhar Mishra, who had a lot of insights during the pre-GSoC meetings, as well as Siddhant Chaudhary for reviewing my PRs!
Lastly, thanks to the GSoC, JSoC, and xKDR teams for setting up this opportunity 😄
A GLM function was defined for Replicate Design survey types, but it did not actually compute design-based standard errors - see PR #221. Here, the glm() calls in lines 10 and 11 used a Cholesky decomposition by default. My overarching task was going to be modifying the function so it would allow for design-based standard errors. A subtask was to allow for passing a decomposition type parameter, so we could use QR decomposition if desired and see if there were any changes between the two methods.
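To make the Cholesky versus QR distinction concrete, here is a minimal sketch that uses only Julia's LinearAlgebra standard library. It is not the code from Survey.jl or GLM.jl, and the data and variable names are made up for illustration. It fits the same ordinary least squares problem both ways and checks that the coefficients agree.

```julia
using LinearAlgebra

# Hypothetical design matrix (intercept + one predictor) and response.
X = [1.0 1.0; 1.0 2.0; 1.0 3.0; 1.0 4.0]
y = [2.1, 3.9, 6.2, 7.8]

# Cholesky route: factor the normal equations X'X * beta = X'y.
beta_chol = cholesky(Symmetric(X' * X)) \ (X' * y)

# QR route: factor X directly, which avoids forming X'X.
beta_qr = qr(X) \ y

# On a well-conditioned problem the two estimates agree to rounding error.
println(isapprox(beta_chol, beta_qr; atol = 1e-8))
```

Forming X'X squares the condition number of the problem, so the QR route is generally the more numerically robust of the two on ill-conditioned data, while Cholesky is usually faster. That trade-off is why being able to switch between them is useful.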
I also worked on incorporating Siddhant's newly merged PR #297 into the code I was writing, as his PR was supposed to allow for replicate-based variance calculation. I tried it out with simple random sample, stratified, and clustered survey designs, as well as different types of regression. For some reason, I was getting unexpected results, so I spent about a week trying to figure out why. After meeting with my mentor Ayush, we decided not to incorporate PR #297 into my code for now, as I might need to do more work on the variance() functions first so they can be used in my overarching task.
I spent a lot of time in this period familiarizing myself with the codebase, the workflows (CI/CD, CodeCov, etc.), and reviewing survey design and sampling concepts.
I had to write a tutorial in the Survey.jl documentation on how to use the new glm() function with some good examples. I was unsure how to properly export the necessary methods for glm() (types of link functions and distributions), so after discussing with Ayush, I opened Issue #305. I had some errors when running CI/CD due to missing packages, so I also worked on that.
I was still working on the documentation page for glm(), and started writing tests for src/reg.jl. I fixed some errors in my tests where I was not comparing my results to the results from R. Lastly, I picked up three small issues - Issue #298, Issue #300, and Issue #301.
In my meeting with Ayush, we discussed potentially working on a new Svyglm object that would contain some coefficients and standard errors, as well as a show() method. These were somewhat 'quality of life' improvements, in that they would make some manipulations easier to read. Meanwhile, I was working on a PR that would address some of the issues I picked up last week.
I pushed PR #307 which addressed Issue #298 and Issue #300 that I had picked up. I opened PR #304. After getting feedback from Ayush, I had a handful of new small tasks on that - some cleanup, adding a new lm() function, changing some tests, and documentation. I spent the rest of the week addressing the comments I received on that PR. This resulted in a bunch of separate commits to add docstrings, documentation, tests, and dependencies.
This week, as I worked on Issue #301, it became evident that I shouldn't (couldn't?) move forward with my ongoing task until I found a solution to the issue from the community bonding period - the variance() function. The work that I was going to do on that function would also address Issue #295, so I picked it up as well. On the side, I updated my tests for reg.jl again and patched a weird issue where Julia couldn't tell the difference between GLM.lm() and Survey.lm(). Lastly, way back in late February, before GSoC started, I had met with the maintainers and suggested having some abstract data types for our summary statistics to make the code more reusable. They had opened Issue #295 soon after, and I had opened PR #277 in response, but the issue was set aside at the time as it needed much more in-depth discussion. After meeting with Ayush this week, I was reassigned to this issue to potentially work on later.
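For readers unfamiliar with this kind of name clash, here is a small sketch of the general mechanism. It is not the actual fix from the PR: FakeGLM and FakeSurvey are hypothetical stand-in modules, not the real packages. When two modules both export a function with the same name, a bare call is ambiguous, and qualifying the call with the module name resolves it.

```julia
module FakeGLM          # hypothetical stand-in for a package that exports lm
export lm
lm(x) = "FakeGLM.lm called"
end

module FakeSurvey       # hypothetical stand-in for a second package that also exports lm
export lm
lm(x) = "FakeSurvey.lm called"
end

using .FakeGLM, .FakeSurvey

# A bare `lm(1)` would be ambiguous here, because both modules export `lm`.
# Qualifying the name with its module makes the intended function explicit.
glm_result = FakeGLM.lm(1)
survey_result = FakeSurvey.lm(1)
println(glm_result)
println(survey_result)
```

Another common way to avoid the clash is to import only the needed names from one of the modules (for example `using .FakeGLM: lm`) instead of bringing in every export.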
The variance() function took more work than expected so I kept working on my upcoming PR for that, which would address a couple of outstanding issues.
I had made enough progress on my variance() implementation to open PR #308, which addressed Issue #295 and Issue #301 that I had picked up two weeks ago. I also spent some time refactoring the GLM.glm() function without explicitly passing a formula in the function call. After that, I updated the variance() function so it would also support Jackknife survey designs, fixed some doctest issues in my PR, and updated some summary statistic functions so they would take 3 arguments instead of 2.
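As a rough illustration of what replicate-based variance means for a Jackknife design, here is a self-contained sketch in plain Julia. It is not the variance() implementation from PR #308: it assumes an unstratified, equally weighted sample, uses made-up data, and all names are hypothetical. The estimate is recomputed once per replicate with one unit left out, and the spread of those replicate estimates gives the variance.

```julia
using Statistics

# Hypothetical sample of survey responses (equal weights assumed).
y = [3.0, 5.0, 4.0, 6.0, 2.0, 7.0]
n = length(y)
theta_full = mean(y)

# Delete-one jackknife replicates: recompute the estimate with unit i left out.
theta_reps = [mean(y[setdiff(1:n, i)]) for i in 1:n]

# Unstratified jackknife variance: (n - 1) / n * sum of squared deviations.
var_jk = (n - 1) / n * sum((theta_reps .- theta_full) .^ 2)
se_jk = sqrt(var_jk)

# For the mean, this reproduces the textbook variance of the mean, var(y) / n.
println(isapprox(var_jk, var(y) / n))
```

The same pattern works for any estimator (total, ratio, regression coefficient): only the function applied to each replicate changes, which is what makes a shared variance routine reusable across summary statistics.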
I started working on updating the bydomain() function in src/by.jl so that it would behave like the other summary statistic functions - i.e. calling an inner function. When I met with Ayush later this week, my computer was unusually slow, so he helped me start setting up a remote SSH connection to the xKDR server. This took a couple of days for me to sort out.
My mentor Ayush was away for JuliaCon during this week, which was a good time for me to take a much needed mental health break from ongoing events in my personal life.
I updated all the summary statistic functions (mean, ratio, total, quantile) to:
- 1) use the variance() function I had created in PR #308.
- 2) have a dispatched form that would call the bydomain() function.
I spent this week writing test cases for the 4 functions I had refactored the previous week. I also picked up Issue #309.
After meeting with Ayush, we decided to define bydomain() only for Bootstrap Replicates for now instead of both types of replicate designs. Since there were about 2 weeks left in GSoC, I started focusing on closing my main outstanding PRs, PR #308 on variance() and PR #304 on glm(), and stopped working on by.jl for now.
I opened PR #312 to fix the issue I picked up last week. Lastly, I spent a number of hours working on my final evaluation report (this document).
Spent this week getting the big PR #308 on variance() ready for merging. This meant reviewing multiple docstrings, fixing multiple doctests that were failing due to recent changes, and updating many test cases across multiple files. Also, I fixed Issue #309.
This week was spent getting the other big PR #304 on glm() ready for merging, and submitting the final GSoC evaluation report. Also, PR #308 ended up getting merged into PR #304.