Final evaluation report for Google Summer of Code / Julia Summer of Code 2023

Introduction

This page outlines my time at Google Summer of Code with the JuliaLang organization, working on Survey.jl from May 29, 2023 to August 28, 2023.

Contributor: @nadiaenh
Project GitHub: https://github.com/xKDR/Survey.jl
GSoC project page: https://summerofcode.withgoogle.com/programs/2023/projects/feWjPxcW

The bulk of my contributions can be found at the bottom of this page, or you can check out the PRs linked throughout this document for a more granular view.

🛫 Starting point

Before starting GSoC, the plan was for me to work on GLM support, raking support, adding or improving support for additional survey designs, improving documentation, and possibly imputation if time permitted. A `glm()` function was already partially defined with one test, but it computed standard errors incorrectly for replicate design types. Additionally, I had met with the maintainers of the package and suggested that we restructure part of the code to use an abstract data structure for summary statistics, which would make the code more reusable - see PR #277 and PR #275. I had also contributed two documentation PRs - the larger PR #283 and the very small PR #285.
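To give a flavor of that idea, here is a purely hypothetical sketch (none of these names are from Survey.jl, and the design that was eventually merged differed):

```julia
# Hypothetical abstraction for summary statistics - illustrative only.
abstract type SummaryStatistic end

struct Mean  <: SummaryStatistic end
struct Total <: SummaryStatistic end

# With a common supertype, variance estimation and domain handling
# could be written once and dispatched per statistic.
estimate(stat::SummaryStatistic, x::Symbol, design) =
    error("estimate not implemented for $(typeof(stat))")
```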

🛬 Ending point

My main deliverable from GSoC ended up being PR #304, which includes three new features:

  • Support for design-based variance computation
  • Improved support for domain-based statistical summary functions (mean, total, quantile, and ratio)
  • Support for generalized linear models for Bootstrap and Jackknife replicate designs

All three new features included relevant tests, docstrings, and documentation.
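For a quick taste of the three features together, here is a minimal sketch using the API documented later in this report (the `apiclus1` data and design setup are borrowed from the docstring examples below):

```julia
using Survey, GLM

apiclus1 = load_data("apiclus1")
dclus1 = SurveyDesign(apiclus1; clusters = :dnum, weights = :pw)
bclus1 = bootweights(dclus1)

mean(:api00, :cname, dclus1)                       # domain-based summary statistics
total(:api00, bclus1)                              # design-based standard errors
svyglm(@formula(api00 ~ api99), bclus1, Normal())  # GLM on a Bootstrap replicate design
```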

I was also able to address Issue #298, Issue #300, Issue #295, Issue #301, and Issue #309.

💭 My takeaways

I came into this experience unsure of what to expect, and I am leaving it with unique insights into the open-source community.

What I learned:

  • Julia, obviously! I had started learning some Julia before GSoC, but I definitely learned a lot more over the summer - things like docstrings, doctests, comparisons with R, how to write a test suite, how package documentation is generated, CI/CD, etc.
  • Speaking up when you think of an improvement or notice an issue is the name of the game. Whenever I suggested improvements like abstract data types for summary statistics, or having a separate file for all variance functions, I received engaged opinions and openness to those changes.
  • Adaptability is a big part of open source. I had to be flexible and jump from one task to another; for example, I ended up doing a lot of work on the `variance()` functions because they were necessary for `glm()` to work properly.
  • Organization is a life-saver! I kept detailed notes of every single meeting throughout the program, as well as of the work I did each week. This came in handy when writing my final report and helped me stay on track with the tasks that needed to be done.

What I would do differently:

  • Be more intentional with my communication and reach out more frequently when I'm stuck, instead of trying to figure everything out on my own.
  • Be more self-directed, taking more ownership of my tasks and creating a bigger impact on the project.
  • Prioritize better when juggling many tasks, which would have reduced stress and occasional confusion.

🙏🏼 Acknowledgements

My mentor Ayush Patnaik was always available for questions, willing to explain concepts and processes, quick to send helpful links and documentation, and ready to help me set a new direction whenever I was blocked. I'm very grateful for his continued assistance, understanding, and expertise!

I also want to thank another maintainer on the package, Shikhar Mishra, who had a lot of insights during the pre-GSoC meetings, as well as Siddhant Chaudhary for reviewing my PRs!

Lastly, thanks to the GSoC, JSoC, and xKDR teams for setting up this opportunity 😄

Week-by-week breakdown

Community bonding period

A GLM function was defined for replicate design survey types, but it did not actually compute design-based standard errors - see PR #221. There, the `glm()` calls on lines 10 and 11 used a Cholesky decomposition by default. My overarching task was to modify the function so that it would allow for design-based standard errors. A subtask was to allow passing a decomposition type parameter, so we could use QR decomposition if desired and see whether the two methods produced different results.
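For context, the comparison we had in mind looks roughly like this (a sketch, not code from the PR; it assumes a GLM.jl version whose `glm` accepts a `method` keyword to select the factorization - if yours does not, treat this as pseudocode):

```julia
using GLM, DataFrames

df = DataFrame(x = randn(100))
df.y = 2 .* df.x .+ randn(100)

# Cholesky decomposition of X'X (the default)
m_chol = glm(@formula(y ~ x), df, Normal(), IdentityLink())

# QR decomposition, assuming the `method` keyword is available
m_qr = glm(@formula(y ~ x), df, Normal(), IdentityLink(); method = :qr)

coef(m_chol) ≈ coef(m_qr)  # the two factorizations should agree up to rounding
```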

I also worked on incorporating Siddhant's newly merged PR #297 into the code I was writing, as his PR was supposed to allow for replicate-based variance calculation. I tried it out with simple random sample, stratified, and clustered survey designs, as well as different types of regression. I was getting unexpected results, so I spent about a week trying to figure out why. After meeting with my mentor Ayush, we decided not to incorporate PR #297 into my code for the time being, as I might first need to do more work on the `variance()` functions so they could be used in my overarching task.

I spent a lot of time in this period familiarizing myself with the codebase, the workflows (CI/CD, CodeCov,...), and reviewing survey design and sampling concepts.

Week 1 - May 29 to June 4

I had to write a tutorial in the Survey.jl documentation on how to use the new `glm()` function, with some good examples. I was unsure how to properly export the necessary methods for `glm()` (types of link functions and distributions), so after discussing with Ayush, I opened Issue #305. I also worked through some CI/CD errors caused by missing packages.

Week 2 - June 5 to June 11

I was still working on the documentation page for `glm()`, and started writing tests for src/reg.jl. I fixed some errors in my tests where I was not comparing my results against the results from R. Lastly, I picked up three small issues - Issue #298, Issue #300, and Issue #301.

In my meeting with Ayush, we discussed potentially working on a new Svyglm object that would contain some coefficients and standard errors, as well as a `show()` method. These were somewhat 'quality-of-life' improvements, in that they would make some common manipulations easier. Meanwhile, I was working on a PR that would address some of the issues I had picked up the previous week.
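The shape we discussed looked roughly like the hypothetical sketch below (this struct never landed in this form; all names here are illustrative):

```julia
# Hypothetical Svyglm container - illustrative only, not the merged design.
struct Svyglm
    estimator::Vector{Float64}  # coefficient estimates
    SE::Vector{Float64}         # design-based standard errors
end

# A show() method would print coefficients and standard errors side by side.
function Base.show(io::IO, m::Svyglm)
    for (b, se) in zip(m.estimator, m.SE)
        println(io, round(b, digits = 6), " (SE: ", round(se, digits = 6), ")")
    end
end
```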

Week 3 - June 12 to June 18

I pushed PR #307, which addressed the Issue #298 and Issue #300 I had picked up, and opened PR #304. After getting feedback from Ayush, I had a handful of new small tasks on the latter - some cleanup, adding a new `lm()` function, changing some tests, and documentation. I spent the rest of the week addressing the review comments I received on that PR, which resulted in a series of separate commits adding docstrings, documentation, tests, and dependencies.

Week 4 - June 19 to June 25

This week, as I worked on Issue #301, it became evident that I shouldn't (couldn't?) move forward with my ongoing task until I solved the issue from the community bonding period - the `variance()` function. The work I was going to do on that function would also address Issue #295, so I picked it up as well. On the side, I updated my tests for reg.jl again and patched an odd issue where Julia couldn't tell the difference between `GLM.lm()` and `Survey.lm()`. Lastly, back before GSoC started in late February, I had met with the maintainers and suggested having some abstract data types for our summary statistics to make the code more reusable. They had opened Issue #295 soon after, to which I had responded with PR #277, but the issue was set aside at the time as it needed much more in-depth discussion. After meeting with Ayush this week, I was reassigned to this issue to potentially work on later.
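The usual way to resolve such a name clash is to qualify the call with the module name; a sketch (the `Survey.lm` call shown here assumes a formula-plus-design signature, for illustration):

```julia
using GLM, Survey  # both export `lm`, so a bare `lm(...)` is ambiguous

apisrs = load_data("apisrs")
bsrs = bootweights(SurveyDesign(apisrs, weights = :pw))

GLM.lm(@formula(api00 ~ api99), apisrs)   # explicitly GLM.jl's lm on a DataFrame
Survey.lm(@formula(api00 ~ api99), bsrs)  # explicitly Survey.jl's lm (assumed signature)
```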

Week 5 - June 26 to July 2

The `variance()` function took more work than expected, so I kept working on my upcoming PR for it, which would address a couple of outstanding issues.

Week 6 - July 3 to July 9

I had made enough progress on my `variance()` implementation to open PR #308, which addressed the Issue #295 and Issue #301 I had picked up two weeks earlier. I also spent some time refactoring the `GLM.glm()` call so that it would not require explicitly passing a formula. After that, I updated the `variance()` function to also support Jackknife survey designs, fixed some doctest issues in my PR, and updated some summary statistic functions so they would take 3 arguments instead of 2.
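Concretely, the inner estimator functions passed to `variance()` now take three arguments - the data, the column(s), and the name of a weights column - as in this example adapted from the `variance` docstring reproduced later in this document:

```julia
using Survey, StatsBase, DataFrames

apiclus1 = load_data("apiclus1")
dclus1 = SurveyDesign(apiclus1; clusters = :dnum, weights = :pw)
bclus1 = bootweights(dclus1)

# Inner estimators take (data, column, weights_column)
my_mean(df::DataFrame, column, weights) =
    StatsBase.mean(df[!, column], StatsBase.weights(df[!, weights]))

variance(:api00, my_mean, bclus1)  # 1×2 DataFrame with estimator and SE columns
```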

Week 7 - July 10 to July 16

I started working on updating the `bydomain()` function in src/by.jl so that it would behave like the other summary statistic functions - i.e., calling an inner function. When I met with Ayush later this week, my computer was unusually slow, so he helped me start setting up a remote SSH connection to the xKDR server. This took me a couple of days to sort out.

Week 8 😮‍💨 Break week

My mentor Ayush was away for JuliaCon during this week, which was a good time for me to take a much needed mental health break from ongoing events in my personal life.

Week 9 - July 24 to July 30

I updated all the summary statistic functions (`mean`, `ratio`, `total`, `quantile`) to:

  • use the `variance()` function I had created in PR #308, and
  • have a dispatched form that calls the `bydomain()` function (see the sketch below).
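In other words, one function name now covers both cases through dispatch (a minimal sketch; the data and design setup mirror the docstring examples below):

```julia
using Survey

apiclus1 = load_data("apiclus1")
bclus1 = bootweights(SurveyDesign(apiclus1; clusters = :dnum, weights = :pw))

# Dispatch on the design type: a ReplicateDesign routes through variance()
mean(:api00, bclus1)

# Dispatch on the extra domain argument: routes through bydomain()
mean(:api00, :cname, bclus1)
```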

Week 10 - July 31 to August 6

I spent this week writing test cases for the four functions I had refactored the previous week. I also picked up Issue #309.

Week 11 - August 7 to August 13

After meeting with Ayush, we decided to define `bydomain()` only for bootstrap replicates for now, instead of both types of replicate designs. Since there were about two weeks left in GSoC, I set by.jl aside and focused on closing my main outstanding PRs - PR #308 on `variance()` and PR #304 on `glm()`.

I opened PR #312 to fix the issue I picked up last week. Lastly, I spent a number of hours working on my final evaluation report (this document).

Week 12 - August 14 to August 20

I spent this week getting the big PR #308 on `variance()` ready for merging. This meant reviewing multiple docstrings, fixing doctests that were failing due to recent changes, and updating test cases across multiple files. I also fixed Issue #309.

Week 13 - 🏁 Final evaluation week

This week was spent getting the other big PR #304 on `glm()` ready for merging and submitting the final GSoC evaluation report. PR #308 also ended up getting merged into PR #304.

"""
mean(var, domain, design)
Estimate means of domains.
```jldoctest meanlabel; setup = :(apiclus1 = load_data("apiclus1"); dclus1 = SurveyDesign(apiclus1; clusters = :dnum, weights = :pw); bclus1 = dclus1 |> bootweights)
julia> mean(:api00, :cname, dclus1)
11×2 DataFrame
Row │ mean cname
│ Float64 String15
─────┼──────────────────────
1 │ 669.0 Alameda
2 │ 472.0 Fresno
3 │ 452.5 Kern
4 │ 647.267 Los Angeles
5 │ 623.25 Mendocino
6 │ 519.25 Merced
7 │ 710.563 Orange
8 │ 709.556 Plumas
9 │ 659.436 San Diego
10 │ 551.189 San Joaquin
11 │ 732.077 Santa Clara
```
Use the replicate design to compute standard errors of the estimated means.
```jldoctest meanlabel
julia> mean(:api00, :cname, bclus1)
11×3 DataFrame
Row │ mean SE cname
│ Float64 Float64 String15
─────┼───────────────────────────────
1 │ 732.077 NaN Santa Clara
2 │ 659.436 NaN San Diego
3 │ 519.25 NaN Merced
4 │ 647.267 NaN Los Angeles
5 │ 710.563 NaN Orange
6 │ 472.0 NaN Fresno
7 │ 709.556 NaN Plumas
8 │ 669.0 NaN Alameda
9 │ 551.189 NaN San Joaquin
10 │ 452.5 NaN Kern
11 │ 623.25 NaN Mendocino
```
"""
function mean(x::Symbol, domain, design::AbstractSurveyDesign)
df = bydomain(x, domain, design, mean)
return df
end
"""
mean(x::Symbol, design::ReplicateDesign)
Compute the standard error of the estimated mean using replicate weights.
# Arguments
- `x::Symbol`: Symbol representing the variable for which the mean is estimated.
- `design::ReplicateDesign`: Replicate design object.
# Returns
- `df`: DataFrame containing the estimated mean and its standard error.
# Examples
```jldoctest; setup = :(using Survey, StatsBase; apiclus1 = load_data("apiclus1"); dclus1 = SurveyDesign(apiclus1; clusters = :dnum, weights = :pw); bclus1 = dclus1 |> bootweights;)
julia> mean(:api00, bclus1)
1×2 DataFrame
Row │ mean SE
│ Float64 Float64
─────┼──────────────────
1 │ 644.169 23.4107
```
"""
function mean(x::Symbol, design::ReplicateDesign)
# Define an inner function to calculate the mean
function inner_mean(df::DataFrame, column, weights_column)
return StatsBase.mean(df[!, column], StatsBase.weights(df[!, weights_column]))
end
# Calculate the mean and variance
df = Survey.variance(x, inner_mean, design)
rename!(df, :estimator => :mean)
return df
end
"""
svyglm(formula::FormulaTerm, design::ReplicateDesign, args...; kwargs...)
Perform generalized linear modeling (GLM) using the survey design with replicates.
# Arguments
- `formula`: A `FormulaTerm` specifying the model formula.
- `design`: A `ReplicateDesign` object representing the survey design with replicates.
- `args...`: Additional arguments to be passed to the `glm` function.
- `kwargs...`: Additional keyword arguments to be passed to the `glm` function.
# Returns
A `DataFrame` containing the estimates for model coefficients and their standard errors.
# Example
```julia
apisrs = load_data("apisrs")
srs = SurveyDesign(apisrs)
bsrs = bootweights(srs, replicates = 2000)
result = svyglm(@formula(api00 ~ api99), bsrs, Normal())
```
"""
function svyglm(formula::FormulaTerm, design::ReplicateDesign, args...; kwargs...)
rhs_symbols = typeof(formula.rhs) == Term ? Symbol.(formula.rhs) : collect(Symbol.(formula.rhs))
lhs_symbols = Symbol.(formula.lhs)
columns = vcat(rhs_symbols, lhs_symbols)
function inner_svyglm(df::DataFrame, columns, weights_column, args...; kwargs...)
matrix = hcat(ones(nrow(df)), Matrix(df[!, columns[1:(length(columns)-1)]]))
model = glm(matrix, (df[!, columns[end]]), args...; wts=df[!, weights_column], kwargs...)
return coef(model)
end
# Compute variance of coefficients
variance(columns, inner_svyglm, design, args...; kwargs...)
end
"""
variance(x::Union{Symbol, Vector{Symbol}}, func::Function, design::ReplicateDesign{BootstrapReplicates}, args...; kwargs...)
Compute the standard error of the estimated mean using the bootstrap method.
# Arguments
- `x::Union{Symbol, Vector{Symbol}}`: Symbol or vector of symbols representing the variable(s) for which the mean is estimated.
- `func::Function`: Function used to calculate the mean.
- `design::ReplicateDesign{BootstrapReplicates}`: Replicate design object.
- `args...`: Additional arguments to be passed to the function.
- `kwargs...`: Additional keyword arguments.
# Returns
- `df`: DataFrame containing the estimated mean and its standard error.
The variance is calculated using the formula
```math
\\hat{V}(\\hat{\\theta}) = \\dfrac{1}{R}\\sum_{i = 1}^R(\\theta_i - \\hat{\\theta})^2
```
where above ``R`` is the number of replicate weights, ``\\theta_i`` is the estimator computed using the ``i``th set of replicate weights, and ``\\hat{\\theta}`` is the estimator computed using the original weights.
# Examples
```jldoctest; setup = :(using Survey, StatsBase, DataFrames; apiclus1 = load_data("apiclus1"); dclus1 = SurveyDesign(apiclus1; clusters = :dnum, weights = :pw); bclus1 = dclus1 |> bootweights;)
julia> mean(df::DataFrame, column, weights) = StatsBase.mean(df[!, column], StatsBase.weights(df[!, weights]));
julia> variance(:api00, mean, bclus1)
1×2 DataFrame
Row │ estimator SE
│ Float64 Float64
─────┼────────────────────
1 │ 644.169 23.4107
```
"""
function variance(x::Union{Symbol, Vector{Symbol}}, func::Function, design::ReplicateDesign{BootstrapReplicates}, args...; kwargs...)
# Compute the estimators
θs = func(design.data, x, design.weights, args...; kwargs...)
# Compute the estimators for each replicate
θts = [
func(design.data, x, "replicate_" * string(i), args...; kwargs...) for i in 1:design.replicates
]
# Convert θs and θts to a vector if they are not already
θs = (θs isa Vector) ? θs : [θs]
θts2 = Vector[]
if !(θts[1] isa Vector)
for θt in θts
push!(θts2, [θt])
end
θts = θts2
end
# Calculate variances for each estimator
variance = Float64[]
θts = hcat(θts...)
for i in 1:length(θs)
θ = θs[i]
θt = θts[i, :]
θt = filter(!isnan, θt)
num = sum((θt .- θ) .^ 2) / length(θt)
push!(variance, num)
end
return DataFrame(estimator = θs, SE = sqrt.(variance))
end
"""
variance(x::Symbol, func::Function, design::ReplicateDesign{JackknifeReplicates})
Compute variance of column `x` for the given `func` using the Jackknife method. The formula to compute this variance is the following.
```math
\\hat{V}_{\\text{JK}}(\\hat{\\theta}) = \\sum_{h = 1}^H \\dfrac{n_h - 1}{n_h}\\sum_{j = 1}^{n_h}(\\hat{\\theta}_{(hj)} - \\hat{\\theta})^2
```
Above, ``\\hat{\\theta}`` represents the estimator computed using the original weights, and ``\\hat{\\theta_{(hj)}}`` represents the estimator computed from the replicate weights obtained when PSU ``j`` from cluster ``h`` is removed.
# Examples
```jldoctest; setup = :(using Survey, StatsBase, DataFrames; apistrat = load_data("apistrat"); dstrat = SurveyDesign(apistrat; strata=:stype, weights=:pw); rstrat = jackknifeweights(dstrat);)
julia> mean(df::DataFrame, column, weights) = StatsBase.mean(df[!, column], StatsBase.weights(df[!, weights]));
julia> variance(:api00, mean, rstrat)
1×2 DataFrame
Row │ estimator SE
│ Float64 Float64
─────┼────────────────────
1 │ 662.287 9.53613
```
# Reference
pg 380-382, Section 9.3.2 Jackknife - Sharon Lohr, Sampling Design and Analysis (2010)
"""
function variance(x::Union{Symbol, Vector{Symbol}}, func::Function, design::ReplicateDesign{JackknifeReplicates}, args...; kwargs...)
df = design.data
stratified_gdf = groupby(df, design.strata)
# estimator from original weights
θs = func(design.data, x, design.weights, args...; kwargs...)
# ensure that θs is a vector
θs = (θs isa Vector) ? θs : [θs]
variance = zeros(length(θs))
replicate_index = 1
for subgroup in stratified_gdf
psus_in_stratum = unique(subgroup[!, design.cluster])
nh = length(psus_in_stratum)
cluster_variance = zeros(length(θs))
for psu in psus_in_stratum
# estimator from replicate weights
θhjs = func(design.data, x, "replicate_" * string(replicate_index), args...; kwargs...)
# update the cluster variance for each estimator
for i in 1:length(θs)
cluster_variance[i] += ((nh - 1)/nh) * (θhjs[i] - θs[i])^2
end
replicate_index += 1
end
# update the overall variance
variance .+= cluster_variance
end
return DataFrame(estimator = θs, SE = sqrt.(variance))
end
"""
quantile(var, domain, design)
Estimate a quantile of domains.
```jldoctest meanlabel; setup = :(apiclus1 = load_data("apiclus1"); dclus1 = SurveyDesign(apiclus1; clusters = :dnum, weights = :pw); bclus1 = dclus1 |> bootweights;)
julia> quantile(:api00, :cname, dclus1, 0.5)
11×2 DataFrame
Row │ 0.5th percentile cname
│ Float64 String15
─────┼───────────────────────────────
1 │ 669.0 Alameda
2 │ 474.5 Fresno
3 │ 452.5 Kern
4 │ 628.0 Los Angeles
5 │ 616.5 Mendocino
6 │ 519.5 Merced
7 │ 717.5 Orange
8 │ 699.0 Plumas
9 │ 657.0 San Diego
10 │ 542.0 San Joaquin
11 │ 718.0 Santa Clara
```
"""
function quantile(x::Symbol, domain, design::AbstractSurveyDesign, p::Real)
df = bydomain(x, domain, design, quantile, p)
return df
end
"""
quantile(x::Symbol, design::ReplicateDesign, p; kwargs...)
Compute the standard error of the estimated quantile using replicate weights.
# Arguments
- `x::Symbol`: Symbol representing the variable for which the quantile is estimated.
- `design::ReplicateDesign`: Replicate design object.
- `p::Real`: Quantile value to estimate, ranging from 0 to 1.
- `kwargs...`: Additional keyword arguments.
# Returns
- `df`: DataFrame containing the estimated quantile and its standard error.
# Examples
```jldoctest; setup = :(using Survey, StatsBase; apisrs = load_data("apisrs"); srs = SurveyDesign(apisrs; weights=:pw); bsrs = srs |> bootweights;)
julia> quantile(:api00, bsrs, 0.5)
1×2 DataFrame
Row │ 0.5th percentile SE
│ Float64 Float64
─────┼───────────────────────────
1 │ 659.0 14.9764
```
"""
function quantile(x::Symbol, design::ReplicateDesign, p::Real; kwargs...)
# Define an inner function to calculate the quantile
function inner_quantile(df::DataFrame, column, weights_column)
return Statistics.quantile(df[!, column], ProbabilityWeights(df[!, weights_column]), p)
end
# Calculate the quantile and variance
df = variance(x, inner_quantile, design)
rename!(df, :estimator => string(p) * "th percentile")
return df
end
"""
ratio(var, domain, design)
Estimate ratios of domains.
```jldoctest ratiolabel; setup = :(using Survey; apiclus1 = load_data("apiclus1"); dclus1 = SurveyDesign(apiclus1; clusters = :dnum, weights = :pw); bclus1 = dclus1 |> bootweights;)
julia> ratio([:api00, :api99], :cname, dclus1)
11×2 DataFrame
Row │ ratio cname
│ Float64 String15
─────┼──────────────────────
1 │ 1.09852 Alameda
2 │ 1.17779 Fresno
3 │ 1.11453 Kern
4 │ 1.06307 Los Angeles
5 │ 1.00565 Mendocino
6 │ 1.08121 Merced
7 │ 1.03628 Orange
8 │ 1.02127 Plumas
9 │ 1.06112 San Diego
10 │ 1.07331 San Joaquin
11 │ 1.05598 Santa Clara
```
Use the replicate design to compute standard errors of the estimated means.
```jldoctest ratiolabel
julia> ratio([:api00, :api99], :cname, bclus1)
11×3 DataFrame
Row │ estimator SE cname
│ Float64 Float64 String15
─────┼─────────────────────────────────
1 │ 1.05598 NaN Santa Clara
2 │ 1.06112 NaN San Diego
3 │ 1.08121 NaN Merced
4 │ 1.06307 NaN Los Angeles
5 │ 1.03628 NaN Orange
6 │ 1.17779 NaN Fresno
7 │ 1.02127 NaN Plumas
8 │ 1.09852 NaN Alameda
9 │ 1.07331 NaN San Joaquin
10 │ 1.11453 NaN Kern
11 │ 1.00565 NaN Mendocino
```
"""
function ratio(x::Vector{Symbol}, domain, design::AbstractSurveyDesign)
df = bydomain(x, domain, design, ratio)
return df
end
"""
ratio(x::Vector{Symbol}, design::ReplicateDesign)
Compute the standard error of the ratio using replicate weights.
# Arguments
- `variable_num::Symbol`: Symbol representing the numerator variable.
- `variable_den::Symbol`: Symbol representing the denominator variable.
- `design::ReplicateDesign`: Replicate design object.
# Returns
- `var`: Variance of the ratio.
# Examples
```jldoctest; setup = :(using Survey, StatsBase; apiclus1 = load_data("apiclus1"); dclus1 = SurveyDesign(apiclus1; clusters = :dnum, weights = :pw); bclus1 = bootweights(dclus1);)
julia> ratio([:api00, :api99], bclus1)
1×2 DataFrame
Row │ estimator SE
│ Float64 Float64
─────┼───────────────────────
1 │ 1.06127 0.00672259
```
"""
function ratio(x::Vector{Symbol}, design::ReplicateDesign)
variable_num, variable_den = x[1], x[2]
# Define an inner function to calculate the ratio
function inner_ratio(df::DataFrame, columns, weights_column)
return sum(df[!, columns[1]], StatsBase.weights(df[!, weights_column])) / sum(df[!, columns[2]], StatsBase.weights(df[!, weights_column]))
end
# Calculate the variance using the `variance` function with the inner function
var = variance([variable_num, variable_den], inner_ratio, design)
return var
end
"""
total(var, domain, design)
Estimate population totals of domains.
```jldoctest totallabel; setup = :(using Survey; apiclus1 = load_data("apiclus1"); dclus1 = SurveyDesign(apiclus1; clusters = :dnum, weights = :pw); bclus1 = dclus1 |> bootweights;)
julia> total(:api00, :cname, dclus1)
11×2 DataFrame
Row │ total cname
│ Float64 String15
─────┼─────────────────────────────
1 │ 3.7362e6 Alameda
2 │ 9.58547e5 Fresno
3 │ 459473.0 Kern
4 │ 2.46465e6 Los Angeles
5 │ 1.26571e6 Mendocino
6 │ 1.0545e6 Merced
7 │ 5.7721e6 Orange
8 │ 3.2422e6 Plumas
9 │ 9.20698e6 San Diego
10 │ 1.03541e7 San Joaquin
11 │ 3.22122e6 Santa Clara
```
Use the replicate design to compute standard errors of the estimated totals.
```jldoctest totallabel
julia> total(:api00, :cname, bclus1)
11×3 DataFrame
Row │ total SE cname
│ Float64 Float64 String15
─────┼────────────────────────────────────────
1 │ 3.22122e6 2.6143e6 Santa Clara
2 │ 9.20698e6 8.00251e6 San Diego
3 │ 1.0545e6 9.85983e5 Merced
4 │ 2.46465e6 2.15017e6 Los Angeles
5 │ 5.7721e6 5.40929e6 Orange
6 │ 9.58547e5 8.95488e5 Fresno
7 │ 3.2422e6 3.03494e6 Plumas
8 │ 3.7362e6 3.49184e6 Alameda
9 │ 1.03541e7 9.69862e6 San Joaquin
10 │ 459473.0 4.30027e5 Kern
11 │ 1.26571e6 1.18696e6 Mendocino
```
"""
function total(x::Symbol, domain, design::AbstractSurveyDesign)
df = bydomain(x, domain, design, total)
return df
end
"""
total(x::Symbol, design::ReplicateDesign)
Compute the standard error of the estimated total using replicate weights.
# Arguments
- `x::Symbol`: Symbol representing the variable for which the total is estimated.
- `design::ReplicateDesign`: Replicate design object.
# Returns
- `df`: DataFrame containing the estimated total and its standard error.
# Examples
```jldoctest; setup = :(using Survey; apiclus1 = load_data("apiclus1"); dclus1 = SurveyDesign(apiclus1; clusters = :dnum, weights = :pw); bclus1 = dclus1 |> bootweights;)
julia> total(:api00, bclus1)
1×2 DataFrame
Row │ total SE
│ Float64 Float64
─────┼──────────────────────
1 │ 3.98999e6 9.01611e5
```
"""
function total(x::Symbol, design::ReplicateDesign)
# Define an inner function to calculate the total
function inner_total(df::DataFrame, column, weights)
return StatsBase.wsum(df[!, column], StatsBase.weights(df[!, weights]))
end
# Calculate the total and variance
df = variance(x, inner_total, design)
rename!(df, :estimator => :total)
return df
end
function bydomain(x::Union{Symbol, Vector{Symbol}}, domain, design::Union{SurveyDesign, ReplicateDesign{BootstrapReplicates}}, func::Function, args...; kwargs...)
    # Group the data by domain and build a label for each group
    gdf = groupby(design.data, domain)
    domain_names = [join(collect(keys(gdf)[i]), "-") for i in 1:length(gdf)]

    # Apply the estimator to each domain subset
    vars = DataFrame[]
    for group in gdf
        push!(vars, func(x, subset(group, design), args...; kwargs...))
    end
    estimates = vcat(vars...)

    # Attach the domain labels to the estimates
    if isa(domain, Vector{Symbol})
        domain = join(domain, "_")
    end
    estimates[!, domain] = domain_names
    return estimates
end
# [Generalized Linear Models in Survey](@id manual)
The `svyglm()` function in the Julia Survey package is used to fit generalized linear models (GLMs) to survey data. It incorporates survey design information, such as sampling weights, stratification, and clustering, to produce estimates and standard errors that are valid for the given survey design.
As of June 2023, the [GLM.jl documentation](https://juliastats.org/GLM.jl/stable/) lists the supported distribution families and their link functions as:
```txt
Bernoulli (LogitLink)
Binomial (LogitLink)
Gamma (InverseLink)
InverseGaussian (InverseSquareLink)
NegativeBinomial (NegativeBinomialLink, often used with LogLink)
Normal (IdentityLink)
Poisson (LogLink)
```
Refer to the GLM.jl documentation for more information about the GLM package.
## Fitting a GLM to a Survey Design object
You can fit a GLM to a survey design object the same way you would fit it to a regular data frame. The only difference is that you need to specify the survey design object as the second argument to the `svyglm()` function.
```julia
using Survey
apisrs = load_data("apisrs")
apistrat = load_data("apistrat")
apiclus1 = load_data("apiclus1")

# Simple random sample survey
srs = SurveyDesign(apisrs, weights = :pw)

# Survey stratified by stype
dstrat = SurveyDesign(apistrat, strata = :stype, weights = :pw)

# Survey clustered by dnum
dclus1 = SurveyDesign(apiclus1, clusters = :dnum, weights = :pw)
```
Once you have a survey design object, convert it to a replicate design (for example with `bootweights`) and fit a GLM using the `svyglm()` function, specifying the model formula and the distribution family.
The `svyglm()` function supports all distribution families supported by GLM.jl, i.e. Bernoulli, Binomial, Gamma, Geometric, InverseGaussian, NegativeBinomial, Normal, and Poisson.
For example, to fit a GLM with a Normal distribution (using its default identity link) to the `srs` survey design object we created above:
```julia
formula = @formula(api00 ~ api99)
bsrs = bootweights(srs)
my_glm = svyglm(formula, bsrs, Normal())

# View the coefficients and standard errors
my_glm.estimator
my_glm.SE
```
## Examples
The examples below use the `api` datasets, which contain survey data collected about California schools. The datasets are included in the Survey.jl package and can be loaded by calling `load_data("name_of_dataset")`.
### Bernoulli with Logit Link
A school being eligible for the awards program (`awards`) is a binary outcome (0 or 1). Let's assume it follows a Bernoulli distribution. Suppose we want to predict `awards` based on the percentage of students eligible for subsidized meals (`meals`) and the percentage of English Language Learners (`ell`). We can fit this GLM using the code below:
```julia
using Survey
apisrs = load_data("apisrs")

# Convert yes/no to 1/0
apisrs.awards = ifelse.(apisrs.awards .== "Yes", 1, 0)

# Build the design and replicate weights
srs = SurveyDesign(apisrs, weights = :pw)
bsrs = bootweights(srs)

# Fit the model
model = svyglm(@formula(awards ~ meals + ell), bsrs, Bernoulli(), LogitLink())
```
### Poisson with Log Link
Let us assume that the number of students tested (`api_stu`) follows a Poisson distribution, which models counts of events (here, the number of students tested at a school). Suppose we want to predict the number of students tested based on the percentage of students eligible for subsidized meals (`meals`) and the percentage of English Language Learners (`ell`). We can fit this GLM using the code below:
```julia
using Survey
apisrs = load_data("apisrs")

# Rename api.stu to api_stu
rename!(apisrs, Symbol("api.stu") => :api_stu)

# Build the design and replicate weights
srs = SurveyDesign(apisrs, weights = :pw)
bsrs = bootweights(srs)

# Fit the model
model = svyglm(@formula(api_stu ~ meals + ell), bsrs, Poisson(), LogLink())
```
### Gamma with Inverse Link
Let us assume that the average parental education level (`avg_ed`) follows a Gamma distribution, which is suitable for modeling continuous, positive-valued variables with a skewed distribution. Suppose we want to predict the average parental education level based on the percentage of students eligible for subsidized meals (`meals`) and the percentage of English Language Learners (`ell`). We can fit this GLM using the code below:
```julia
using Survey
apisrs = load_data("apisrs")

# Rename avg.ed to avg_ed
rename!(apisrs, Symbol("avg.ed") => :avg_ed)

# Build the design and replicate weights
srs = SurveyDesign(apisrs, weights = :pw)
bsrs = bootweights(srs)

# Fit the model
model = svyglm(@formula(avg_ed ~ meals + ell), bsrs, Gamma(), InverseLink())
```
### More examples
Other examples for GLMs can be found in the [GLM.jl documentation](https://juliastats.org/GLM.jl/stable/).
@testset "GLM in bootstrap apisrs" begin
rename!(bsrs.data, Symbol("sch.wide") => :sch_wide)
bsrs.data.sch_wide = ifelse.(bsrs.data.sch_wide .== "Yes", 1, 0)
model = svyglm(@formula(sch_wide ~ meals + ell), bsrs, Binomial())
@test model.estimator[1] ≈ 1.523050676 rtol=STAT_TOL
@test model.estimator[2] ≈ 0.009754261 rtol=STAT_TOL
@test model.estimator[3] ≈ -0.020892044 rtol=STAT_TOL
@test model.SE[1] ≈ 0.369836 rtol=SE_TOL
@test model.SE[2] ≈ 0.009928 rtol=SE_TOL
@test model.SE[3] ≈ 0.012676 rtol=SE_TOL
# R code
# data(api)
# srs <- svydesign(id=~1, weights=~pw, data=apisrs)
# bsrs <- as.svrepdesign(srs, type="subbootstrap", replicates=10000)
# model = svyglm(sch.wide ~ meals + ell, bsrs, family=binomial())
# summary(model)
model = svyglm(@formula(api00 ~ api99), bsrs, Gamma(),InverseLink())
@test model.estimator[1] ≈ 2.915479e-03 rtol=STAT_TOL
@test model.estimator[2] ≈ -2.137187e-06 rtol=STAT_TOL
@test model.SE[1] ≈ 4.550e-05 rtol=SE_TOL
@test model.SE[2] ≈ 6.471e-08 rtol=SE_TOL
# R code
# data(api)
# srs <- svydesign(id=~1, weights=~pw, data=apisrs)
# bsrs <- as.svrepdesign(srs, type="subbootstrap", replicates=10000)
# model = svyglm(api00 ~ api99, bsrs, family = Gamma(link = "inverse"))
# summary(model)
model = svyglm(@formula(api00 ~ api99), bsrs, Normal())
@test model.estimator[1] ≈ 63.2830726 rtol=STAT_TOL
@test model.estimator[2] ≈ 0.9497618 rtol=STAT_TOL
@test model.SE[1] ≈ 9.63276 rtol=SE_TOL
@test model.SE[2] ≈ 0.01373 rtol=SE_TOL
# R code
# data(api)
# srs <- svydesign(id=~1, weights=~pw, data=apisrs)
# bsrs <- as.svrepdesign(srs, type="subbootstrap", replicates=10000)
# model = svyglm(api00 ~ api99, bsrs, family = gaussian())
# summary(model)
rename!(bsrs.data, Symbol("api.stu") => :api_stu)
model = svyglm(@formula(api_stu ~ meals + ell), bsrs, Poisson())
@test model.estimator[1] ≈ 6.229602881 rtol=STAT_TOL
@test model.estimator[2] ≈ -0.002038345 rtol=STAT_TOL
@test model.estimator[3] ≈ 0.002116465 rtol=STAT_TOL
@test model.SE[1] ≈ 0.093115 rtol=SE_TOL
@test model.SE[2] ≈ 0.002191 rtol=SE_TOL
@test model.SE[3] ≈ 0.002858 rtol=SE_TOL
# R code
# data(api)
# srs <- svydesign(id=~1, weights=~pw, data=apisrs)
# bsrs <- as.svrepdesign(srs, type="subbootstrap", replicates=10000)
# model = svyglm(api.stu ~ meals + ell, bsrs, family = poisson())
# summary(model)
end
@testset "GLM in jackknife apisrs" begin
rename!(jsrs.data, Symbol("sch.wide") => :sch_wide)
jsrs.data.sch_wide = ifelse.(jsrs.data.sch_wide .== "Yes", 1, 0)
model = svyglm(@formula(sch_wide ~ meals + ell), jsrs, Binomial())
@test model.estimator[1] ≈ 1.523051 rtol=STAT_TOL
@test model.estimator[2] ≈ 0.009754 rtol=1e-4 # This is a tiny bit off with 1e-5
@test model.estimator[3] ≈ -0.020892 rtol=STAT_TOL
@test model.SE[1] ≈ 0.359043 rtol=SE_TOL
@test model.SE[2] ≈ 0.009681 rtol=SE_TOL
@test model.SE[3] ≈ 0.012501 rtol=SE_TOL
# R code
# data(api)
# srs <- svydesign(id=~1, weights=~pw, data=apisrs)
# jsrs <- as.svrepdesign(srs, type="JK1", replicates=10000)
# model = svyglm(sch.wide ~ meals + ell, jsrs, family=binomial())
# summary(model)
model = svyglm(@formula(api00 ~ api99), jsrs, Gamma(), InverseLink())
@test model.estimator[1] ≈ 2.915479e-03 rtol=STAT_TOL
@test model.estimator[2] ≈ -2.137187e-06 rtol=STAT_TOL
@test model.SE[1] ≈ 5.234e-05 rtol=0.12 # failing at 1e-1
@test model.SE[2] ≈ 7.459e-08 rtol=0.12 # failing at 1e-1
# R code
# data(api)
# srs <- svydesign(id=~1, weights=~pw, data=apisrs)
# jsrs <- as.svrepdesign(srs, type="JK1", replicates=10000)
# model = svyglm(api00 ~ api99, jsrs, family = Gamma(link = "inverse"))
# summary(model)
model = svyglm(@formula(api00 ~ api99), jsrs, Normal())
@test model.estimator[1] ≈ 63.2830726 rtol=STAT_TOL
@test model.estimator[2] ≈ 0.9497618 rtol=STAT_TOL
@test model.SE[1] ≈ 9.69178 rtol=SE_TOL
@test model.SE[2] ≈ 0.01384 rtol=SE_TOL
# R code
# data(api)
# srs <- svydesign(id=~1, weights=~pw, data=apisrs)
# jsrs <- as.svrepdesign(srs, type="JK1", replicates=10000)
# model = svyglm(api00 ~ api99, jsrs, family = gaussian())
# summary(model)
rename!(jsrs.data, Symbol("api.stu") => :api_stu)
model = svyglm(@formula(api_stu ~ meals + ell), jsrs, Poisson())
@test model.estimator[1] ≈ 6.229602881 rtol=STAT_TOL
@test model.estimator[2] ≈ -0.002038345 rtol=STAT_TOL
@test model.estimator[3] ≈ 0.002116465 rtol=STAT_TOL
@test model.SE[1] ≈ 0.095444 rtol=SE_TOL
@test model.SE[2] ≈ 0.002317 rtol=SE_TOL
@test model.SE[3] ≈ 0.003085 rtol=SE_TOL
# R code
# data(api)
# srs <- svydesign(id=~1, weights=~pw, data=apisrs)
# jsrs <- as.svrepdesign(srs, type="JK1", replicates=10000)
# model = svyglm(api.stu ~ meals + ell, jsrs, family = poisson())
# summary(model)
end
@testset "GLM in bootstrap apiclus1" begin
rename!(dclus1_boot.data, Symbol("sch.wide") => :sch_wide)
dclus1_boot.data.sch_wide = ifelse.(dclus1_boot.data.sch_wide .== "Yes", 1, 0)
model = svyglm(@formula(sch_wide ~ meals + ell), dclus1_boot, Binomial())
@test model.estimator[1] ≈ 1.89955691 rtol=STAT_TOL
@test model.estimator[2] ≈ -0.01911468 rtol=STAT_TOL
@test model.estimator[3] ≈ 0.03992541 rtol=STAT_TOL
@test model.SE[1] ≈ 0.62928 rtol=SE_TOL
@test model.SE[2] ≈ 0.01209 rtol=SE_TOL
@test model.SE[3] ≈ 0.01533 rtol=SE_TOL
# R code
# data(api)
# srs <- svydesign(id=~dnum, weights=~pw, data=apiclus1)
# dclus1_boot <- as.svrepdesign(srs, type="subbootstrap", replicates=10000)
# model = svyglm(sch.wide ~ meals + ell, dclus1_boot, family=binomial())
# summary(model)
model = svyglm(@formula(api00 ~ api99), dclus1_boot, Gamma(),InverseLink())
@test model.estimator[1] ≈ 2.914844e-03 rtol=STAT_TOL
@test model.estimator[2] ≈ -2.180775e-06 rtol=STAT_TOL
@test model.SE[1] ≈ 8.014e-05 rtol=SE_TOL
@test model.SE[2] ≈ 1.178e-07 rtol=SE_TOL
# R code
# data(api)
# srs <- svydesign(id=~dnum, weights=~pw, data=apiclus1)
# dclus1_boot <- as.svrepdesign(srs, type="subbootstrap", replicates=10000)
# model = svyglm(api00 ~ api99, dclus1_boot, family = Gamma(link = "inverse"))
# summary(model)
model = svyglm(@formula(api00 ~ api99), dclus1_boot, Normal())
@test model.estimator[1] ≈ 95.28483 rtol=STAT_TOL
@test model.estimator[2] ≈ 0.90429 rtol=STAT_TOL
@test model.SE[1] ≈ 16.02015 rtol=SE_TOL
@test model.SE[2] ≈ 0.02522 rtol=SE_TOL
# R code
# data(api)
# srs <- svydesign(id=~dnum, weights=~pw, data=apiclus1)
# dclus1_boot <- as.svrepdesign(srs, type="subbootstrap", replicates=10000)
# model = svyglm(api00 ~ api99, dclus1_boot, family = gaussian())
# summary(model)
rename!(dclus1_boot.data, Symbol("api.stu") => :api_stu)
model = svyglm(@formula(api_stu ~ meals + ell), dclus1_boot, Poisson())
@test model.estimator[1] ≈ 6.2961803529 rtol=STAT_TOL
@test model.estimator[2] ≈ -0.0026906166 rtol=STAT_TOL
@test model.estimator[3] ≈ -0.0006054623 rtol=STAT_TOL
@test model.SE[1] ≈ 0.161459174 rtol=SE_TOL
@test model.SE[2] ≈ 0.003193577 rtol=0.11 # failing at 0.1
@test model.SE[3] ≈ 0.005879115 rtol=SE_TOL
# R code
# data(api)
# srs <- svydesign(id=~dnum, weights=~pw, data=apiclus1)
# dclus1_boot <- as.svrepdesign(srs, type="subbootstrap", replicates=10000)
# model = svyglm(api.stu ~ meals + ell, dclus1_boot, family = poisson())
# summary(model)
end
@testset "GLM in bootstrap apistrat" begin
rename!(bstrat.data, Symbol("sch.wide") => :sch_wide)
bstrat.data.sch_wide = ifelse.(bstrat.data.sch_wide .== "Yes", 1, 0)
model = svyglm(@formula(sch_wide ~ meals + ell), bstrat, Binomial())
@test model.estimator[1] ≈ 1.560408424 rtol=STAT_TOL
@test model.estimator[2] ≈ 0.003524761 rtol=STAT_TOL
@test model.estimator[3] ≈ -0.006831057 rtol=STAT_TOL
@test model.SE[1] ≈ 0.342669691 rtol=SE_TOL
@test model.SE[2] ≈ 0.009423733 rtol=SE_TOL
@test model.SE[3] ≈ 0.013934952 rtol=SE_TOL
# R code
# data(api)
# srs <- svydesign(id=~1, strata=~dnum, weights=~pw, data=apistrat)
# bstrat <- as.svrepdesign(srs, type="subbootstrap", replicates=10000)
# model = svyglm(sch.wide ~ meals + ell, bstrat, family=binomial())
# summary(model)
model = svyglm(@formula(api00 ~ api99), bstrat, Gamma(),InverseLink())
@test model.estimator[1] ≈ 2.873104e-03 rtol=STAT_TOL
@test model.estimator[2] ≈ -2.088791e-06 rtol=STAT_TOL
@test model.SE[1] ≈ 4.187745e-05 rtol=SE_TOL
@test model.SE[2] ≈ 5.970111e-08 rtol=SE_TOL
# R code
# data(api)
# srs <- svydesign(id=~1, strata=~dnum, weights=~pw, data=apistrat)
# bstrat <- as.svrepdesign(srs, type="subbootstrap", replicates=10000)
# model = svyglm(api00 ~ api99, bstrat, family = Gamma(link = "inverse"))
# summary(model)
end