@nalimilan
Created September 10, 2020 21:10
- # this constant defines which types of values returned by aggregation function
- # in combine are considered to produce multiple columns in the resulting data frame
- const MULTI_COLS_TYPE = Union{AbstractDataFrame, NamedTuple, DataFrameRow, AbstractMatrix}
-
- """
- groupby(df::AbstractDataFrame, cols; sort=false, skipmissing=false)
-
- Return a `GroupedDataFrame` representing a view of an `AbstractDataFrame` split
- into row groups.
-
- # Arguments
- - `df` : an `AbstractDataFrame` to split
- - `cols` : data frame columns to group by. Can be any column selector
- ($COLUMNINDEX_STR; $MULTICOLUMNINDEX_STR).
- - `sort` : whether to sort groups according to the values of the grouping columns
- `cols`; if all `cols` are `CategoricalVector`s then groups are always sorted
- irrespective of the value of `sort`
- - `skipmissing` : whether to skip groups with `missing` values in one of the
- grouping columns `cols`
-
- # Details
- An iterator over a `GroupedDataFrame` returns a `SubDataFrame` view
- into `df` for each group.
- Within each group, the order of rows in `df` is preserved.
-
- `cols` can be any valid data frame indexing expression.
- In particular if it is an empty vector then a single-group `GroupedDataFrame`
- is created.
-
- A `GroupedDataFrame` also supports
- indexing by groups, `map` (which applies a function to each group)
- and `combine` (which applies a function to each group
- and combines the result into a data frame).
-
- `GroupedDataFrame` also supports the dictionary interface. The keys are
- [`GroupKey`](@ref) objects returned by [`keys(::GroupedDataFrame)`](@ref),
- which can also be used to get the values of the grouping columns for each group.
- `Tuple`s and `NamedTuple`s containing the values of the grouping columns (in the
- same order as the `cols` argument) are also accepted as indices. Finally,
- an `AbstractDict` can be used to index into a grouped data frame where
- the keys are column names of the data frame. The order of the keys does
- not matter in this case.
-
- # See also
-
- [`combine`](@ref), [`select`](@ref), [`select!`](@ref), [`transform`](@ref), [`transform!`](@ref)
-
- # Examples
- ```julia
- julia> df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
- b = repeat([2, 1], outer=[4]),
- c = 1:8);
-
- julia> gd = groupby(df, :a)
- GroupedDataFrame with 4 groups based on key: a
- First Group (2 rows): a = 1
- │ Row │ a │ b │ c │
- │ │ Int64 │ Int64 │ Int64 │
- ├─────┼───────┼───────┼───────┤
- │ 1 │ 1 │ 2 │ 1 │
- │ 2 │ 1 │ 2 │ 5 │
- ⋮
- Last Group (2 rows): a = 4
- │ Row │ a │ b │ c │
- │ │ Int64 │ Int64 │ Int64 │
- ├─────┼───────┼───────┼───────┤
- │ 1 │ 4 │ 1 │ 4 │
- │ 2 │ 4 │ 1 │ 8 │
-
- julia> gd[1]
- 2×3 SubDataFrame
- │ Row │ a │ b │ c │
- │ │ Int64 │ Int64 │ Int64 │
- ├─────┼───────┼───────┼───────┤
- │ 1 │ 1 │ 2 │ 1 │
- │ 2 │ 1 │ 2 │ 5 │
-
- julia> last(gd)
- 2×3 SubDataFrame
- │ Row │ a │ b │ c │
- │ │ Int64 │ Int64 │ Int64 │
- ├─────┼───────┼───────┼───────┤
- │ 1 │ 4 │ 1 │ 4 │
- │ 2 │ 4 │ 1 │ 8 │
-
- julia> gd[(a=3,)]
- 2×3 SubDataFrame
- │ Row │ a │ b │ c │
- │ │ Int64 │ Int64 │ Int64 │
- ├─────┼───────┼───────┼───────┤
- │ 1 │ 3 │ 2 │ 3 │
- │ 2 │ 3 │ 2 │ 7 │
-
- julia> gd[Dict("a" => 3)]
- 2×3 SubDataFrame
- │ Row │ a │ b │ c │
- │ │ Int64 │ Int64 │ Int64 │
- ├─────┼───────┼───────┼───────┤
- │ 1 │ 3 │ 2 │ 3 │
- │ 2 │ 3 │ 2 │ 7 │
-
- julia> gd[(3,)]
- 2×3 SubDataFrame
- │ Row │ a │ b │ c │
- │ │ Int64 │ Int64 │ Int64 │
- ├─────┼───────┼───────┼───────┤
- │ 1 │ 3 │ 2 │ 3 │
- │ 2 │ 3 │ 2 │ 7 │
-
- julia> k = first(keys(gd))
- GroupKey: (a = 1)
-
- julia> gd[k]
- 2×3 SubDataFrame
- │ Row │ a │ b │ c │
- │ │ Int64 │ Int64 │ Int64 │
- ├─────┼───────┼───────┼───────┤
- │ 1 │ 1 │ 2 │ 1 │
- │ 2 │ 1 │ 2 │ 5 │
-
- julia> for g in gd
- println(g)
- end
- 2×3 SubDataFrame
- │ Row │ a │ b │ c │
- │ │ Int64 │ Int64 │ Int64 │
- ├─────┼───────┼───────┼───────┤
- │ 1 │ 1 │ 2 │ 1 │
- │ 2 │ 1 │ 2 │ 5 │
- 2×3 SubDataFrame
- │ Row │ a │ b │ c │
- │ │ Int64 │ Int64 │ Int64 │
- ├─────┼───────┼───────┼───────┤
- │ 1 │ 2 │ 1 │ 2 │
- │ 2 │ 2 │ 1 │ 6 │
- 2×3 SubDataFrame
- │ Row │ a │ b │ c │
- │ │ Int64 │ Int64 │ Int64 │
- ├─────┼───────┼───────┼───────┤
- │ 1 │ 3 │ 2 │ 3 │
- │ 2 │ 3 │ 2 │ 7 │
- 2×3 SubDataFrame
- │ Row │ a │ b │ c │
- │ │ Int64 │ Int64 │ Int64 │
- ├─────┼───────┼───────┼───────┤
- │ 1 │ 4 │ 1 │ 4 │
- │ 2 │ 4 │ 1 │ 8 │
- ```
- """
- function groupby(df::AbstractDataFrame, cols;
- sort::Bool=false, skipmissing::Bool=false)
0 _check_consistency(df)
0 idxcols = index(df)[cols]
- if isempty(idxcols)
- return GroupedDataFrame(df, Symbol[], ones(Int, nrow(df)),
- nothing, nothing, nothing, nrow(df) == 0 ? 0 : 1,
- nothing, Threads.ReentrantLock())
- end
96 sdf = select(df, idxcols, copycols=false)
-
80000080 groups = Vector{Int}(undef, nrow(df))
48 ngroups, rhashes, gslots, sorted =
- row_group_slots(ntuple(i -> sdf[!, i], ncol(sdf)), Val(false), groups, skipmissing)
-
288 gd = GroupedDataFrame(df, copy(_names(sdf)), groups, nothing, nothing, nothing, ngroups, nothing,
- Threads.ReentrantLock())
-
- # sort groups if row_group_slots hasn't already done that
0 if sort && !sorted
- # Find index of representative row for each group
0 idx = Vector{Int}(undef, length(gd))
0 fillfirst!(nothing, idx, 1:nrow(parent(gd)), gd)
0 group_invperm = invperm(sortperm(view(parent(gd)[!, gd.cols], idx, :)))
0 groups = gd.groups
0 @inbounds for i in eachindex(groups)
0 gix = groups[i]
0 groups[i] = gix == 0 ? 0 : group_invperm[gix]
- end
- end
-
0 return gd
- end
-
- const F_TYPE_RULES =
- """
- `fun` can return a single value, a row, a vector, or multiple rows.
- The type of the returned value determines the shape of the resulting `DataFrame`.
- There are four kind of return values allowed:
- - A single value gives a `DataFrame` with a single additional column and one row
- per group.
- - A named tuple of single values or a [`DataFrameRow`](@ref) gives a `DataFrame`
- with one additional column for each field and one row per group (returning a
- named tuple will be faster). It is not allowed to mix single values and vectors
- if a named tuple is returned.
- - A vector gives a `DataFrame` with a single additional column and as many rows
- for each group as the length of the returned vector for that group.
- - A data frame, a named tuple of vectors or a matrix gives a `DataFrame` with
- the same additional columns and as many rows for each group as the rows
- returned for that group (returning a named tuple is the fastest option).
- Returning a table with zero columns is allowed, regardless of the number of columns
- returned for other groups.
-
- `fun` must always return the same kind of object (out of four
- kinds defined above) for all groups, and with the same column names.
-
- Optimized methods are used when standard summary functions (`sum`, `prod`,
- `minimum`, `maximum`, `mean`, `var`, `std`, `first`, `last` and `length`)
- are specified using the `Pair` syntax (e.g. `:col => sum`).
- When computing the `sum` or `mean` over floating point columns, results will be
- less accurate than with the standard `sum` function (which uses pairwise
- summation). Use `col => x -> sum(x)` to avoid the optimized method and use the
- slower, more accurate one.
-
- Column names are automatically generated when necessary using the rules defined
- in [`select`](@ref) if the `Pair` syntax is used and `fun` returns a single
- value or a vector (e.g. for `:col => sum` the column name is `col_sum`); otherwise
- (if `fun` is a function or a return value is an `AbstractMatrix`) columns are
- named `x1`, `x2` and so on.
- """
-
- const F_ARGUMENT_RULES =
- """
-
- Arguments passed as `args...` can be:
-
- * Any index that is allowed for column indexing ($COLUMNINDEX_STR, $MULTICOLUMNINDEX_STR).
- * Column transformation operations using the `Pair` notation that is described below
- and vectors of such pairs.
-
- Transformations allowed using `Pair`s follow the rules specified for
- [`select`](@ref) and have the form `source_cols => fun`, `source_cols => fun
- => target_col`, or `source_col => target_col`. Function `fun` is passed
- `SubArray` views as positional arguments for each column specified to be
- selected, or a `NamedTuple` containing these `SubArray`s if `source_cols` is
- an `AsTable` selector. It can return a vector or a single value (defined
- precisely below). If automatic generation of target column
- name is required it respects the `renamecols` keyword argument following the
- rules described in [`select`](@ref).
-
- As a special case `nrow` or `nrow => target_col` can be passed without specifying
- input columns to efficiently calculate the number of rows in each group.
- If `nrow` is passed the resulting column name is `:nrow`.
-
- If multiple `args` are passed then return values of different `fun`s are allowed
- to mix single values and vectors. In this case single values will be
- broadcasted to match the length of columns specified by returned vectors.
- As a particular rule, values wrapped in a `Ref` or a `0`-dimensional `AbstractArray`
- are unwrapped and then broadcasted.
-
- If the first or last argument is `pair` then it must be a `Pair` following the
- rules for pairs described above, except that in this case the function defined
- by `fun` can return any of the return values defined below.
-
- If the first or last argument is a function `fun`, it is passed a [`SubDataFrame`](@ref)
- view for each group and can return any return value defined below.
- Note that this form is slower than `pair` or `args` due to type instability.
-
- If `gd` has zero groups then no transformations are applied.
- """
-
- const KWARG_PROCESSING_RULES =
- """
- If `keepkeys=true`, the resulting `DataFrame` contains all the grouping columns
- in addition to those generated. In this case if the returned
- value contains columns with the same names as the grouping columns, they are
- required to be equal.
- If `keepkeys=false` and some generated columns have the same name as grouping columns,
- they are kept and are not required to be equal to grouping columns.
-
- If `ungroup=true` (the default) a `DataFrame` is returned.
- If `ungroup=false` a `GroupedDataFrame` grouped using `groupcols(gd)` is returned.
-
- If `gd` has zero groups then transformations are applied to vectors of zero length.
- """
-
- """
- combine(gd::GroupedDataFrame, args...; keepkeys::Bool=true, ungroup::Bool=true,
- renamecols::Bool=true)
- combine(fun::Union{Function, Type}, gd::GroupedDataFrame;
- keepkeys::Bool=true, ungroup::Bool=true, renamecols::Bool=true)
- combine(pair::Pair, gd::GroupedDataFrame; keepkeys::Bool=true, ungroup::Bool=true,
- renamecols::Bool=true)
-
- Apply operations to each group in a [`GroupedDataFrame`](@ref) and return the combined
- result as a `DataFrame` if `ungroup=true` or `GroupedDataFrame` if `ungroup=false`.
-
- If an `AbstractDataFrame` is passed, operations are applied to the data frame
- as a whole and a `DataFrame` is always returned.
-
- $F_ARGUMENT_RULES
-
- $F_TYPE_RULES
-
- $KWARG_PROCESSING_RULES
-
- Ordering of rows follows the order of groups in `gd`.
-
- # See also
-
- [`groupby`](@ref), [`select`](@ref), [`select!`](@ref), [`transform`](@ref), [`transform!`](@ref)
-
- # Examples
- ```jldoctest
- julia> df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
- b = repeat([2, 1], outer=[4]),
- c = 1:8);
-
- julia> gd = groupby(df, :a);
-
- julia> combine(gd, :c => sum, nrow)
- 4×3 DataFrame
- │ Row │ a │ c_sum │ nrow │
- │ │ Int64 │ Int64 │ Int64 │
- ├─────┼───────┼───────┼───────┤
- │ 1 │ 1 │ 6 │ 2 │
- │ 2 │ 2 │ 8 │ 2 │
- │ 3 │ 3 │ 10 │ 2 │
- │ 4 │ 4 │ 12 │ 2 │
-
- julia> combine(gd, :c => sum, nrow, ungroup=false)
- GroupedDataFrame with 4 groups based on key: a
- First Group (1 row): a = 1
- │ Row │ a │ c_sum │ nrow │
- │ │ Int64 │ Int64 │ Int64 │
- ├─────┼───────┼───────┼───────┤
- │ 1 │ 1 │ 6 │ 2 │
- ⋮
- Last Group (1 row): a = 4
- │ Row │ a │ c_sum │ nrow │
- │ │ Int64 │ Int64 │ Int64 │
- ├─────┼───────┼───────┼───────┤
- │ 1 │ 4 │ 12 │ 2 │
-
- julia> combine(sdf -> sum(sdf.c), gd) # Slower variant
- 4×2 DataFrame
- │ Row │ a │ x1 │
- │ │ Int64 │ Int64 │
- ├─────┼───────┼───────┤
- │ 1 │ 1 │ 6 │
- │ 2 │ 2 │ 8 │
- │ 3 │ 3 │ 10 │
- │ 4 │ 4 │ 12 │
-
- julia> combine(gd) do d # do syntax for the slower variant
- sum(d.c)
- end
- 4×2 DataFrame
- │ Row │ a │ x1 │
- │ │ Int64 │ Int64 │
- ├─────┼───────┼───────┤
- │ 1 │ 1 │ 6 │
- │ 2 │ 2 │ 8 │
- │ 3 │ 3 │ 10 │
- │ 4 │ 4 │ 12 │
-
- julia> combine(gd, :c => (x -> sum(log, x)) => :sum_log_c) # specifying a name for target column
- 4×2 DataFrame
- │ Row │ a │ sum_log_c │
- │ │ Int64 │ Float64 │
- ├─────┼───────┼───────────┤
- │ 1 │ 1 │ 1.60944 │
- │ 2 │ 2 │ 2.48491 │
- │ 3 │ 3 │ 3.04452 │
- │ 4 │ 4 │ 3.46574 │
-
-
- julia> combine(gd, [:b, :c] .=> sum) # passing a vector of pairs
- 4×3 DataFrame
- │ Row │ a │ b_sum │ c_sum │
- │ │ Int64 │ Int64 │ Int64 │
- ├─────┼───────┼───────┼───────┤
- │ 1 │ 1 │ 4 │ 6 │
- │ 2 │ 2 │ 2 │ 8 │
- │ 3 │ 3 │ 4 │ 10 │
- │ 4 │ 4 │ 2 │ 12 │
-
- julia> combine(gd) do sdf # dropping group when DataFrame() is returned
- sdf.c[1] != 1 ? sdf : DataFrame()
- end
- 6×3 DataFrame
- │ Row │ a │ b │ c │
- │ │ Int64 │ Int64 │ Int64 │
- ├─────┼───────┼───────┼───────┤
- │ 1 │ 2 │ 1 │ 2 │
- │ 2 │ 2 │ 1 │ 6 │
- │ 3 │ 3 │ 2 │ 3 │
- │ 4 │ 3 │ 2 │ 7 │
- │ 5 │ 4 │ 1 │ 4 │
- │ 6 │ 4 │ 1 │ 8 │
-
- julia> combine(gd, :b => :b1, :c => :c1,
- [:b, :c] => +, keepkeys=false) # auto-splatting, renaming and keepkeys
- 8×3 DataFrame
- │ Row │ b1 │ c1 │ b_c_+ │
- │ │ Int64 │ Int64 │ Int64 │
- ├─────┼───────┼───────┼───────┤
- │ 1 │ 2 │ 1 │ 3 │
- │ 2 │ 2 │ 5 │ 7 │
- │ 3 │ 1 │ 2 │ 3 │
- │ 4 │ 1 │ 6 │ 7 │
- │ 5 │ 2 │ 3 │ 5 │
- │ 6 │ 2 │ 7 │ 9 │
- │ 7 │ 1 │ 4 │ 5 │
- │ 8 │ 1 │ 8 │ 9 │
-
- julia> combine(gd, :b, :c => sum) # passing columns and broadcasting
- 8×3 DataFrame
- │ Row │ a │ b │ c_sum │
- │ │ Int64 │ Int64 │ Int64 │
- ├─────┼───────┼───────┼───────┤
- │ 1 │ 1 │ 2 │ 6 │
- │ 2 │ 1 │ 2 │ 6 │
- │ 3 │ 2 │ 1 │ 8 │
- │ 4 │ 2 │ 1 │ 8 │
- │ 5 │ 3 │ 2 │ 10 │
- │ 6 │ 3 │ 2 │ 10 │
- │ 7 │ 4 │ 1 │ 12 │
- │ 8 │ 4 │ 1 │ 12 │
-
- julia> combine(gd, [:b, :c] .=> Ref)
- 4×3 DataFrame
- │ Row │ a │ b_Ref │ c_Ref │
- │ │ Int64 │ SubArra… │ SubArra… │
- ├─────┼───────┼──────────┼──────────┤
- │ 1 │ 1 │ [2, 2] │ [1, 5] │
- │ 2 │ 2 │ [1, 1] │ [2, 6] │
- │ 3 │ 3 │ [2, 2] │ [3, 7] │
- │ 4 │ 4 │ [1, 1] │ [4, 8] │
-
- julia> combine(gd, AsTable(:) => Ref)
- 4×2 DataFrame
- │ Row │ a │ a_b_c_Ref │
- │ │ Int64 │ NamedTuple… │
- ├─────┼───────┼──────────────────────────────────────┤
- │ 1 │ 1 │ (a = [1, 1], b = [2, 2], c = [1, 5]) │
- │ 2 │ 2 │ (a = [2, 2], b = [1, 1], c = [2, 6]) │
- │ 3 │ 3 │ (a = [3, 3], b = [2, 2], c = [3, 7]) │
- │ 4 │ 4 │ (a = [4, 4], b = [1, 1], c = [4, 8]) │
-
- julia> combine(gd, :, AsTable(Not(:a)) => sum, renamecols=false)
- 8×4 DataFrame
- │ Row │ a │ b │ c │ b_c │
- │ │ Int64 │ Int64 │ Int64 │ Int64 │
- ├─────┼───────┼───────┼───────┼───────┤
- │ 1 │ 1 │ 2 │ 1 │ 3 │
- │ 2 │ 1 │ 2 │ 5 │ 7 │
- │ 3 │ 2 │ 1 │ 2 │ 3 │
- │ 4 │ 2 │ 1 │ 6 │ 7 │
- │ 5 │ 3 │ 2 │ 3 │ 5 │
- │ 6 │ 3 │ 2 │ 7 │ 9 │
- │ 7 │ 4 │ 1 │ 4 │ 5 │
- │ 8 │ 4 │ 1 │ 8 │ 9 │
- ```
- """
- function combine(f::Base.Callable, gd::GroupedDataFrame;
- keepkeys::Bool=true, ungroup::Bool=true, renamecols::Bool=true)
- return combine_helper(f, gd, keepkeys=keepkeys, ungroup=ungroup,
- copycols=true, keeprows=false, renamecols=renamecols)
- end
-
- combine(f::typeof(nrow), gd::GroupedDataFrame;
- keepkeys::Bool=true, ungroup::Bool=true, renamecols::Bool=true) =
- combine(gd, [nrow => :nrow], keepkeys=keepkeys, ungroup=ungroup,
- renamecols=renamecols)
-
- function combine(p::Pair, gd::GroupedDataFrame;
- keepkeys::Bool=true, ungroup::Bool=true, renamecols::Bool=true)
- # move handling of aggregate to specialized combine
- p_from, p_to = p
-
- # verify if it is not better to use a fast path, which we achieve
- # by moving to combine(::GroupedDataFrame, ::AbstractVector) method
- # note that even if length(gd) == 0 we can do this step
- if isagg(p_from => (p_to isa Pair ? first(p_to) : p_to), gd) || p_from === nrow
- return combine(gd, [p], keepkeys=keepkeys, ungroup=ungroup, renamecols=renamecols)
- end
-
- if p_from isa Tuple
- cs = collect(p_from)
- # an explicit error is thrown as this was allowed in the past
- throw(ArgumentError("passing a Tuple $p_from as column selector is not supported" *
- ", use a vector $cs instead"))
- else
- cs = p_from
- end
- return combine_helper(cs => p_to, gd, keepkeys=keepkeys, ungroup=ungroup,
- copycols=true, keeprows=false, renamecols=renamecols)
- end
-
- combine(gd::GroupedDataFrame,
- cs::Union{Pair, typeof(nrow), ColumnIndex, MultiColumnIndex}...;
- keepkeys::Bool=true, ungroup::Bool=true, renamecols::Bool=true) =
- _combine_prepare(gd, cs..., keepkeys=keepkeys, ungroup=ungroup,
- copycols=true, keeprows=false, renamecols=renamecols)
-
- function _combine_prepare(gd::GroupedDataFrame,
- @nospecialize(cs::Union{Pair, typeof(nrow),
- ColumnIndex, MultiColumnIndex}...);
- keepkeys::Bool, ungroup::Bool, copycols::Bool,
- keeprows::Bool, renamecols::Bool)
- cs_vec = []
- for p in cs
- if p === nrow
- push!(cs_vec, nrow => :nrow)
- elseif p isa AbstractVector{<:Pair}
- append!(cs_vec, p)
- else
- push!(cs_vec, p)
- end
- end
- if any(x -> x isa Pair && first(x) isa Tuple, cs_vec)
- x = cs_vec[findfirst(x -> first(x) isa Tuple, cs_vec)]
- # an explicit error is thrown as this was allowed in the past
- throw(ArgumentError("passing a Tuple $(first(x)) as column selector is not supported" *
- ", use a vector $(collect(first(x))) instead"))
- for (i, v) in enumerate(cs_vec)
- if first(v) isa Tuple
- cs_vec[i] = collect(first(v)) => last(v)
- end
- end
- end
- cs_norm_pre = [normalize_selection(index(parent(gd)), c, renamecols) for c in cs_vec]
- seen_cols = Set{Symbol}()
- process_vectors = false
- for v in cs_norm_pre
- if v isa Pair
- out_col = last(last(v))
- if out_col in seen_cols
- throw(ArgumentError("Duplicate output column name $out_col requested"))
- end
- push!(seen_cols, out_col)
- else
- @assert v isa AbstractVector{Int}
- process_vectors = true
- end
- end
- processed_cols = Set{Symbol}()
- if process_vectors
- cs_norm = Pair[]
- for (i, v) in enumerate(cs_norm_pre)
- if v isa Pair
- push!(cs_norm, v)
- push!(processed_cols, last(last(v)))
- else
- @assert v isa AbstractVector{Int}
- for col_idx in v
- col_name = _names(gd)[col_idx]
- if !(col_name in processed_cols)
- push!(processed_cols, col_name)
- if col_name in seen_cols
- trans_idx = findfirst(cs_norm_pre) do p
- p isa Pair || return false
- last(last(p)) == col_name
- end
- @assert !isnothing(trans_idx) && trans_idx > i
- push!(cs_norm, cs_norm_pre[trans_idx])
- # it is safe to delete from cs_norm_pre
- # as we have not reached trans_idx index yet
- deleteat!(cs_norm_pre, trans_idx)
- else
- push!(cs_norm, col_idx => identity => col_name)
- end
- end
- end
- end
- end
- else
- cs_norm = collect(Pair, cs_norm_pre)
- end
- f = Pair[first(x) => first(last(x)) for x in cs_norm]
- nms = Symbol[last(last(x)) for x in cs_norm]
- return combine_helper(f, gd, nms, keepkeys=keepkeys, ungroup=ungroup,
- copycols=copycols, keeprows=keeprows, renamecols=renamecols)
- end
-
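- # gen_groups converts a vector of representative row indices (one entry per output
- # row, constant within a group and changing between consecutive groups) into
- # consecutive group numbers 1, 2, ... as stored in a GroupedDataFrame.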
- function gen_groups(idx::Vector{Int})
0 groups = zeros(Int, length(idx))
0 groups[1] = 1
- j = 1
0 last_idx = idx[1]
0 @inbounds for i in 2:length(idx)
0 cur_idx = idx[i]
0 j += cur_idx != last_idx
- last_idx = cur_idx
0 groups[i] = j
- end
0 return groups
- end
-
- function combine_helper(f, gd::GroupedDataFrame,
- nms::Union{AbstractVector{Symbol},Nothing}=nothing;
- keepkeys::Bool, ungroup::Bool,
- copycols::Bool, keeprows::Bool, renamecols::Bool)
16 if !ungroup && !keepkeys
0 throw(ArgumentError("keepkeys=false when ungroup=false is not allowed"))
- end
32 idx, valscat = _combine(f, gd, nms, copycols, keeprows, renamecols)
0 !keepkeys && ungroup && return valscat
0 keys = groupcols(gd)
0 for key in keys
0 if hasproperty(valscat, key)
0 if (keeprows && !isequal(valscat[!, key], parent(gd)[!, key])) ||
- (!keeprows && !isequal(valscat[!, key], view(parent(gd)[!, key], idx)))
0 throw(ArgumentError("column :$key in returned data frame " *
- "is not equal to grouping key :$key"))
- end
- end
- end
0 if keeprows
0 newparent = select(parent(gd), gd.cols, copycols=copycols)
- else
224 newparent = length(gd) > 0 ? parent(gd)[idx, gd.cols] : parent(gd)[1:0, gd.cols]
- end
16 added_cols = select(valscat, Not(intersect(keys, _names(valscat))), copycols=false)
384 hcat!(newparent, length(gd) > 0 ? added_cols : similar(added_cols, 0), copycols=false)
0 ungroup && return newparent
-
0 if length(idx) == 0 && !(keeprows && length(keys) > 0)
0 @assert nrow(newparent) == 0
0 return GroupedDataFrame(newparent, copy(gd.cols), Int[],
- Int[], Int[], Int[], 0, Dict{Any,Int}(),
- Threads.ReentrantLock())
0 elseif keeprows
0 @assert length(keys) > 0 || idx == gd.idx
- # in this case we are sure that the result GroupedDataFrame has the
- # same structure as the source except that grouping columns are at the start
0 return Threads.lock(gd.lazy_lock) do
0 return GroupedDataFrame(newparent, copy(gd.cols), gd.groups,
- getfield(gd, :idx), getfield(gd, :starts),
- getfield(gd, :ends), gd.ngroups,
- getfield(gd, :keymap), Threads.ReentrantLock())
- end
- else
0 groups = gen_groups(idx)
0 @assert groups[end] <= length(gd)
0 return GroupedDataFrame(newparent, copy(gd.cols), groups,
- nothing, nothing, nothing, groups[end], nothing,
- Threads.ReentrantLock())
- end
- end
-
- # Wrapping automatically adds column names when the value returned
- # by the user-provided function lacks them
- wrap(x::Union{AbstractDataFrame, NamedTuple, DataFrameRow}) = x
- wrap(x::AbstractMatrix) =
- NamedTuple{Tuple(gennames(size(x, 2)))}(Tuple(view(x, :, i) for i in 1:size(x, 2)))
- wrap(x::Any) = (x1=x,)
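- # For example, `wrap(1)` gives `(x1 = 1,)`, `wrap([1 2; 3 4])` gives a named tuple
- # of column views `(x1 = [1, 3], x2 = [2, 4])` with auto-generated names, and data
- # frames, named tuples and `DataFrameRow`s are passed through unchanged.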
-
- const ERROR_ROW_COUNT = "return value must not change its kind " *
- "(single row or variable number of rows) across groups"
-
- const ERROR_COL_COUNT = "function must return only single-column values, " *
- "or only multiple-column values"
-
- wrap_table(x::Any, ::Val) =
- throw(ArgumentError(ERROR_ROW_COUNT))
- function wrap_table(x::Union{NamedTuple{<:Any, <:Tuple{Vararg{AbstractVector}}},
- AbstractDataFrame, AbstractMatrix},
- ::Val{firstmulticol}) where firstmulticol
- if !firstmulticol
- throw(ArgumentError(ERROR_COL_COUNT))
- end
- return wrap(x)
- end
-
- function wrap_table(x::AbstractVector, ::Val{firstmulticol}) where firstmulticol
- if firstmulticol
- throw(ArgumentError(ERROR_COL_COUNT))
- end
- return wrap(x)
- end
-
- function wrap_row(x::Any, ::Val{firstmulticol}) where firstmulticol
- # NamedTuple is not possible in this branch
- if (x isa DataFrameRow) ⊻ firstmulticol
- throw(ArgumentError(ERROR_COL_COUNT))
- end
0 return wrap(x)
- end
-
- function wrap_row(x::Union{AbstractArray{<:Any, 0}, Ref},
- ::Val{firstmulticol}) where firstmulticol
- if firstmulticol
- throw(ArgumentError(ERROR_COL_COUNT))
- end
- return (x1 = x[],)
- end
-
- # note that also NamedTuple() is correctly captured by this definition
- # as it is more specific than the one below
- wrap_row(::Union{AbstractVecOrMat, AbstractDataFrame,
- NamedTuple{<:Any, <:Tuple{Vararg{AbstractVector}}}}, ::Val) =
- throw(ArgumentError(ERROR_ROW_COUNT))
-
- function wrap_row(x::NamedTuple, ::Val{firstmulticol}) where firstmulticol
- if any(v -> v isa AbstractVector, x)
- throw(ArgumentError("mixing single values and vectors in a named tuple is not allowed"))
- end
- if !firstmulticol
- throw(ArgumentError(ERROR_COL_COUNT))
- end
- return x
- end
-
- # idx, starts and ends are passed separately to avoid cost of field access in tight loop
- # Manual unrolling of Tuple is used as it turned out more efficient than @generated
- # for small number of columns passed.
- # For more than 4 columns `map` is slower than @generated
- # but this case is probably rare and if huge number of columns is passed @generated
- # has very high compilation cost
- function do_call(f::Any, idx::AbstractVector{<:Integer},
- starts::AbstractVector{<:Integer}, ends::AbstractVector{<:Integer},
- gd::GroupedDataFrame, incols::Tuple{}, i::Integer)
- f()
- end
-
- function do_call(f::Any, idx::AbstractVector{<:Integer},
- starts::AbstractVector{<:Integer}, ends::AbstractVector{<:Integer},
- gd::GroupedDataFrame, incols::Tuple{AbstractVector}, i::Integer)
620373392 idx = idx[starts[i]:ends[i]]
0 return f(view(incols[1], idx))
- end
-
- function do_call(f::Any, idx::AbstractVector{<:Integer},
- starts::AbstractVector{<:Integer}, ends::AbstractVector{<:Integer},
- gd::GroupedDataFrame, incols::NTuple{2, AbstractVector}, i::Integer)
- idx = idx[starts[i]:ends[i]]
- return f(view(incols[1], idx), view(incols[2], idx))
- end
-
- function do_call(f::Any, idx::AbstractVector{<:Integer},
- starts::AbstractVector{<:Integer}, ends::AbstractVector{<:Integer},
- gd::GroupedDataFrame, incols::NTuple{3, AbstractVector}, i::Integer)
- idx = idx[starts[i]:ends[i]]
- return f(view(incols[1], idx), view(incols[2], idx), view(incols[3], idx))
- end
-
- function do_call(f::Any, idx::AbstractVector{<:Integer},
- starts::AbstractVector{<:Integer}, ends::AbstractVector{<:Integer},
- gd::GroupedDataFrame, incols::NTuple{4, AbstractVector}, i::Integer)
- idx = idx[starts[i]:ends[i]]
- return f(view(incols[1], idx), view(incols[2], idx), view(incols[3], idx),
- view(incols[4], idx))
- end
-
- function do_call(f::Any, idx::AbstractVector{<:Integer},
- starts::AbstractVector{<:Integer}, ends::AbstractVector{<:Integer},
- gd::GroupedDataFrame, incols::Tuple, i::Integer)
- idx = idx[starts[i]:ends[i]]
- return f(map(c -> view(c, idx), incols)...)
- end
-
- function do_call(f::Any, idx::AbstractVector{<:Integer},
- starts::AbstractVector{<:Integer}, ends::AbstractVector{<:Integer},
- gd::GroupedDataFrame, incols::NamedTuple, i::Integer)
- idx = idx[starts[i]:ends[i]]
- return f(map(c -> view(c, idx), incols))
- end
-
- function do_call(f::Any, idx::AbstractVector{<:Integer},
- starts::AbstractVector{<:Integer}, ends::AbstractVector{<:Integer},
- gd::GroupedDataFrame, incols::Nothing, i::Integer)
- idx = idx[starts[i]:ends[i]]
- return f(view(parent(gd), idx, :))
- end
-
- _nrow(df::AbstractDataFrame) = nrow(df)
- _nrow(x::NamedTuple{<:Any, <:Tuple{Vararg{AbstractVector}}}) =
- isempty(x) ? 0 : length(x[1])
- _ncol(df::AbstractDataFrame) = ncol(df)
- _ncol(x::Union{NamedTuple, DataFrameRow}) = length(x)
-
- abstract type AbstractAggregate end
-
- struct Reduce{O, C, A} <: AbstractAggregate
- op::O
- condf::C
- adjust::A
- checkempty::Bool
- end
- Reduce(f, condf=nothing, adjust=nothing) = Reduce(f, condf, adjust, false)
-
- check_aggregate(f::Any, ::AbstractVector) = f
- check_aggregate(f::typeof(sum), ::AbstractVector{<:Union{Missing, Number}}) =
- Reduce(Base.add_sum)
- check_aggregate(f::typeof(sum∘skipmissing), ::AbstractVector{<:Union{Missing, Number}}) =
- Reduce(Base.add_sum, !ismissing)
- check_aggregate(f::typeof(prod), ::AbstractVector{<:Union{Missing, Number}}) =
- Reduce(Base.mul_prod)
- check_aggregate(f::typeof(prod∘skipmissing), ::AbstractVector{<:Union{Missing, Number}}) =
- Reduce(Base.mul_prod, !ismissing)
- check_aggregate(f::typeof(maximum),
- ::AbstractVector{<:Union{Missing, MULTI_COLS_TYPE, AbstractVector}}) = f
- check_aggregate(f::typeof(maximum), v::AbstractVector{<:Union{Missing, Real}}) =
- eltype(v) === Any ? f : Reduce(max)
- check_aggregate(f::typeof(maximum∘skipmissing),
- ::AbstractVector{<:Union{Missing, MULTI_COLS_TYPE, AbstractVector}}) = f
- check_aggregate(f::typeof(maximum∘skipmissing), v::AbstractVector{<:Union{Missing, Real}}) =
- eltype(v) === Any ? f : Reduce(max, !ismissing, nothing, true)
- check_aggregate(f::typeof(minimum),
- ::AbstractVector{<:Union{Missing, MULTI_COLS_TYPE, AbstractVector}}) = f
- check_aggregate(f::typeof(minimum), v::AbstractVector{<:Union{Missing, Real}}) =
- eltype(v) === Any ? f : Reduce(min)
- check_aggregate(f::typeof(minimum∘skipmissing),
- ::AbstractVector{<:Union{Missing, MULTI_COLS_TYPE, AbstractVector}}) = f
- check_aggregate(f::typeof(minimum∘skipmissing), v::AbstractVector{<:Union{Missing, Real}}) =
- eltype(v) === Any ? f : Reduce(min, !ismissing, nothing, true)
- check_aggregate(f::typeof(mean), ::AbstractVector{<:Union{Missing, Number}}) =
- Reduce(Base.add_sum, nothing, /)
- check_aggregate(f::typeof(mean∘skipmissing), ::AbstractVector{<:Union{Missing, Number}}) =
- Reduce(Base.add_sum, !ismissing, /)
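-
- # Illustration (hypothetical calls, not part of the source): `check_aggregate`
- # decides whether the fast single-pass reduction can be used for a `:col => fun` pair:
- #   check_aggregate(sum,  [1, 2, 3])   # -> Reduce(Base.add_sum): reduced in one pass over the column
- #   check_aggregate(mean, [1.0, 2.0])  # -> Reduce(Base.add_sum, nothing, /): sums adjusted by group counts
- #   check_aggregate(sum,  ["a", "b"])  # -> sum itself: no optimized method, the generic path is used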
-
- # Other aggregate functions which are not strictly reductions
- struct Aggregate{F, C} <: AbstractAggregate
- f::F
- condf::C
- end
- Aggregate(f) = Aggregate(f, nothing)
-
- check_aggregate(f::typeof(var), ::AbstractVector{<:Union{Missing, Number}}) =
- Aggregate(var)
- check_aggregate(f::typeof(var∘skipmissing), ::AbstractVector{<:Union{Missing, Number}}) =
- Aggregate(var, !ismissing)
- check_aggregate(f::typeof(std), ::AbstractVector{<:Union{Missing, Number}}) =
- Aggregate(std)
- check_aggregate(f::typeof(std∘skipmissing), ::AbstractVector{<:Union{Missing, Number}}) =
- Aggregate(std, !ismissing)
- check_aggregate(f::typeof(first), v::AbstractVector) =
- eltype(v) === Any ? f : Aggregate(first)
- check_aggregate(f::typeof(first),
- ::AbstractVector{<:Union{Missing, MULTI_COLS_TYPE, AbstractVector}}) = f
- check_aggregate(f::typeof(first∘skipmissing), v::AbstractVector) =
- eltype(v) === Any ? f : Aggregate(first, !ismissing)
- check_aggregate(f::typeof(first∘skipmissing),
- ::AbstractVector{<:Union{Missing, MULTI_COLS_TYPE, AbstractVector}}) = f
- check_aggregate(f::typeof(last), v::AbstractVector) =
- eltype(v) === Any ? f : Aggregate(last)
- check_aggregate(f::typeof(last),
- ::AbstractVector{<:Union{Missing, MULTI_COLS_TYPE, AbstractVector}}) = f
- check_aggregate(f::typeof(last∘skipmissing), v::AbstractVector) =
- eltype(v) === Any ? f : Aggregate(last, !ismissing)
- check_aggregate(f::typeof(last∘skipmissing),
- ::AbstractVector{<:Union{Missing, MULTI_COLS_TYPE, AbstractVector}}) = f
- check_aggregate(f::typeof(length), ::AbstractVector) = Aggregate(length)
-
- # SkipMissing does not support length
-
- # Find first value matching condition for each group
- # Optimized for situations where a matching value is typically encountered
- # among the first rows for each group
- function fillfirst!(condf, outcol::AbstractVector, incol::AbstractVector,
- gd::GroupedDataFrame; rev::Bool=false)
0 ngroups = gd.ngroups
- # Use group indices if they have already been computed
0 idx = getfield(gd, :idx)
0 if idx !== nothing && condf === nothing
0 v = rev ? gd.ends : gd.starts
0 @inbounds for i in 1:ngroups
0 outcol[i] = incol[idx[v[i]]]
- end
0 elseif idx !== nothing
- nfilled = 0
0 starts = gd.starts
0 @inbounds for i in eachindex(outcol)
0 s = starts[i]
0 offsets = rev ? (nrow(gd[i])-1:-1:0) : (0:nrow(gd[i])-1)
0 for j in offsets
0 x = incol[idx[s+j]]
0 if condf === nothing || condf(x)
- outcol[i] = x
- nfilled += 1
0 break
- end
- end
- end
0 if nfilled < length(outcol)
0 throw(ArgumentError("some groups contain only missing values"))
- end
- else # Finding first row is faster than computing all group indices
0 groups = gd.groups
0 if rev
0 r = length(groups):-1:1
- else
0 r = 1:length(groups)
- end
0 filled = fill(false, ngroups)
- nfilled = 0
0 @inbounds for i in r
0 gix = groups[i]
0 x = incol[i]
0 if gix > 0 && (condf === nothing || condf(x)) && !filled[gix]
0 filled[gix] = true
0 outcol[gix] = x
0 nfilled += 1
0 nfilled == ngroups && break
- end
- end
0 if nfilled < length(outcol)
0 throw(ArgumentError("some groups contain only missing values"))
- end
- end
0 outcol
- end
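-
- # For illustration: `fillfirst!(nothing, out, col, gd)` fills `out[g]` with the first
- # value of `col` in group `g` (the last when `rev=true`); with `condf = !ismissing` it
- # stores the first non-missing value and throws if a group contains only missing values.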
-
- # Use a strategy similar to reducedim_init from Base to get the vector of the right type
- function groupreduce_init(op, condf, adjust,
- incol::AbstractVector{U}, gd::GroupedDataFrame) where U
- T = Base.promote_union(U)
-
- if op === Base.add_sum
- initf = zero
- elseif op === Base.mul_prod
- initf = one
- else
- throw(ErrorException("Unrecognized op $op"))
- end
-
- Tnm = nonmissingtype(T)
- if isconcretetype(Tnm) && applicable(initf, Tnm)
- tmpv = initf(Tnm)
- initv = op(tmpv, tmpv)
- if adjust isa Nothing
- x = Tnm <: AbstractIrrational ? float(initv) : initv
- else
- x = adjust(initv, 1)
- end
- if condf === !ismissing
- V = typeof(x)
- else
- V = U >: Missing ? Union{typeof(x), Missing} : typeof(x)
- end
- v = similar(incol, V, length(gd))
- fill!(v, x)
- return v
- else
- # do not try to determine the narrowest possible type nor starting value
- # as this is not possible to do correctly in general without processing
- # groups; it will get fixed later in groupreduce!; later we
- # will make use of the fact that this vector is filled with #undef
- # while above the vector is filled with a concrete value
- return Vector{Any}(undef, length(gd))
- end
- end
-
- for (op, initf) in ((:max, :typemin), (:min, :typemax))
- @eval begin
- function groupreduce_init(::typeof($op), condf, adjust,
- incol::AbstractVector{T}, gd::GroupedDataFrame) where T
- @assert isnothing(adjust)
- S = nonmissingtype(T)
- # !ismissing check is purely an optimization to avoid a copy later
- outcol = similar(incol, condf === !ismissing ? S : T, length(gd))
- # Comparison is possible only between CatValues from the same pool
- if incol isa CategoricalVector
- U = Union{CategoricalArrays.leveltype(outcol),
- eltype(outcol) >: Missing ? Missing : Union{}}
- outcol = CategoricalArray{U, 1}(outcol.refs, incol.pool)
- end
- # It is safe to use a non-missing init value
- # since missing will poison the result if present
- # we assume here that groups are non-empty (current design assures this)
- # + workaround for https://github.com/JuliaLang/julia/issues/36978
- if isconcretetype(S) && hasmethod($initf, Tuple{S}) && !(S <: Irrational)
- fill!(outcol, $initf(S))
- else
- fillfirst!(condf, outcol, incol, gd)
- end
- return outcol
- end
- end
- end
-
- function copyto_widen!(res::AbstractVector{T}, x::AbstractVector) where T
- @inbounds for i in eachindex(res, x)
- val = x[i]
- S = typeof(val)
- if S <: T || promote_type(S, T) <: T
- res[i] = val
- else
- newres = Tables.allocatecolumn(promote_type(S, T), length(x))
- return copyto_widen!(newres, x)
- end
- end
- return res
- end
-
- function groupreduce!(res::AbstractVector, f, op, condf, adjust, checkempty::Bool,
- incol::AbstractVector, gd::GroupedDataFrame)
- n = length(gd)
- if adjust !== nothing || checkempty
- counts = zeros(Int, n)
- end
- groups = gd.groups
- @inbounds for i in eachindex(incol, groups)
- gix = groups[i]
- x = incol[i]
- if gix > 0 && (condf === nothing || condf(x))
- # this check should be optimized out if U is not Any
- if eltype(res) === Any && !isassigned(res, gix)
- res[gix] = f(x, gix)
- else
- res[gix] = op(res[gix], f(x, gix))
- end
- if adjust !== nothing || checkempty
- counts[gix] += 1
- end
- end
- end
- # handle the case of an uninitialized reduction
- if eltype(res) === Any
- if op === Base.add_sum
- initf = zero
- elseif op === Base.mul_prod
- initf = one
- else
- initf = x -> throw(ErrorException("Unrecognized op $op"))
- end
- @inbounds for gix in eachindex(res)
- if !isassigned(res, gix)
- res[gix] = initf(nonmissingtype(eltype(incol)))
- end
- end
- end
- if adjust !== nothing
- res .= adjust.(res, counts)
- end
- if checkempty && any(iszero, counts)
- throw(ArgumentError("some groups contain only missing values"))
- end
- # Undo pool sharing done by groupreduce_init
- if res isa CategoricalVector && res.pool === incol.pool
- V = Union{CategoricalArrays.leveltype(res),
- eltype(res) >: Missing ? Missing : Union{}}
- res = CategoricalArray{V, 1}(res.refs, copy(res.pool))
- end
- if isconcretetype(eltype(res))
- return res
- else
- return copyto_widen!(Tables.allocatecolumn(typeof(first(res)), n), res)
- end
- end
-
- # function barrier works around type instability of groupreduce_init due to applicable
- groupreduce(f, op, condf, adjust, checkempty::Bool,
- incol::AbstractVector, gd::GroupedDataFrame) =
- groupreduce!(groupreduce_init(op, condf, adjust, incol, gd),
- f, op, condf, adjust, checkempty, incol, gd)
- # Avoids the overhead due to Missing when computing reduction
- groupreduce(f, op, condf::typeof(!ismissing), adjust, checkempty::Bool,
- incol::AbstractVector, gd::GroupedDataFrame) =
- groupreduce!(disallowmissing(groupreduce_init(op, condf, adjust, incol, gd)),
- f, op, condf, adjust, checkempty, incol, gd)
-
- (r::Reduce)(incol::AbstractVector, gd::GroupedDataFrame) =
- groupreduce((x, i) -> x, r.op, r.condf, r.adjust, r.checkempty, incol, gd)
-
- # this definition is missing in Julia 1.0 LTS and is required by aggregation for var
- # TODO: remove this when we drop 1.0 support
- if VERSION < v"1.1"
- Base.zero(::Type{Missing}) = missing
- end
-
- function (agg::Aggregate{typeof(var)})(incol::AbstractVector, gd::GroupedDataFrame)
- means = groupreduce((x, i) -> x, Base.add_sum, agg.condf, /, false, incol, gd)
- # !ismissing check is purely an optimization to avoid a copy later
- if eltype(means) >: Missing && agg.condf !== !ismissing
- T = Union{Missing, real(eltype(means))}
- else
- T = real(eltype(means))
- end
- res = zeros(T, length(gd))
- return groupreduce!(res, (x, i) -> @inbounds(abs2(x - means[i])), +, agg.condf,
- (x, l) -> l <= 1 ? oftype(x / (l-1), NaN) : x / (l-1),
- false, incol, gd)
- end
-
- function (agg::Aggregate{typeof(std)})(incol::AbstractVector, gd::GroupedDataFrame)
- outcol = Aggregate(var, agg.condf)(incol, gd)
- if eltype(outcol) <: Union{Missing, Rational}
- return sqrt.(outcol)
- else
- return map!(sqrt, outcol, outcol)
- end
- end
-
- for f in (first, last)
- function (agg::Aggregate{typeof(f)})(incol::AbstractVector, gd::GroupedDataFrame)
- n = length(gd)
- outcol = similar(incol, n)
- fillfirst!(agg.condf, outcol, incol, gd, rev=agg.f === last)
- if isconcretetype(eltype(outcol))
- return outcol
- else
- return copyto_widen!(Tables.allocatecolumn(typeof(first(outcol)), n), outcol)
- end
- end
- end
-
- function (agg::Aggregate{typeof(length)})(incol::AbstractVector, gd::GroupedDataFrame)
- if getfield(gd, :idx) === nothing
- lens = zeros(Int, length(gd))
- @inbounds for gix in gd.groups
- gix > 0 && (lens[gix] += 1)
- end
- return lens
- else
- return gd.ends .- gd.starts .+ 1
- end
- end
-
- isagg((col, fun)::Pair, gdf::GroupedDataFrame) =
- col isa ColumnIndex && check_aggregate(fun, parent(gdf)[!, col]) isa AbstractAggregate
-
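- # Map each entry of `idx` (the representative parent row of the group that produced
- # a given output row) to the position of that group in `idx_agg` (one representative
- # row per group, in the same group order), so that per-group aggregates can be
- # pseudo-broadcasted over multi-row results with a single forward scan.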
- function _agg2idx_map_helper(idx, idx_agg)
- agg2idx_map = fill(-1, length(idx))
- aggj = 1
- @inbounds for (j, idxj) in enumerate(idx)
- while idx_agg[aggj] != idxj
- aggj += 1
- @assert aggj <= length(idx_agg)
- end
- agg2idx_map[j] = aggj
- end
- return agg2idx_map
- end
-
- function prepare_idx_keeprows(idx::AbstractVector{<:Integer},
- starts::AbstractVector{<:Integer},
- ends::AbstractVector{<:Integer},
- nrowparent::Integer)
- idx_keeprows = Vector{Int}(undef, nrowparent)
- i = 0
- for (s, e) in zip(starts, ends)
- v = idx[s]
- for k in s:e
- i += 1
- idx_keeprows[i] = v
- end
- end
- @assert i == nrowparent
- return idx_keeprows
- end
-
- function _combine(f::AbstractVector{<:Pair},
- gd::GroupedDataFrame, nms::AbstractVector{Symbol},
- copycols::Bool, keeprows::Bool, renamecols::Bool)
- # here f should be normalized and in a form of source_cols => fun
- @assert all(x -> first(x) isa Union{Int, AbstractVector{Int}, AsTable}, f)
- @assert all(x -> last(x) isa Base.Callable, f)
-
- if isempty(f)
- if keeprows && nrow(parent(gd)) > 0 && minimum(gd.groups) == 0
- throw(ArgumentError("select and transform do not support " *
- "`GroupedDataFrame`s from which some groups have "*
- "been dropped (including skipmissing=true)"))
- end
- return Int[], DataFrame()
- end
-
- if keeprows
- if nrow(parent(gd)) > 0 && minimum(gd.groups) == 0
- throw(ArgumentError("select and transform do not support " *
- "`GroupedDataFrame`s from which some groups have "*
- "been dropped (including skipmissing=true)"))
- end
- idx_keeprows = prepare_idx_keeprows(gd.idx, gd.starts, gd.ends, nrow(parent(gd)))
- else
- idx_keeprows = nothing
- end
-
- idx_agg = nothing
- if length(gd) > 0 && any(x -> isagg(x, gd), f)
- # Compute indices of representative rows only once for all AbstractAggregates
- idx_agg = Vector{Int}(undef, length(gd))
- fillfirst!(nothing, idx_agg, 1:length(gd.groups), gd)
- elseif length(gd) == 0 || !all(x -> isagg(x, gd), f)
- # Trigger computation of indices
- # This can speed up some aggregates that would not trigger this on their own
- @assert gd.idx !== nothing
- end
- res = Vector{Any}(undef, length(f))
- parentdf = parent(gd)
- for (i, p) in enumerate(f)
- source_cols, fun = p
- if length(gd) > 0 && isagg(p, gd)
- incol = parentdf[!, source_cols]
- agg = check_aggregate(last(p), incol)
- outcol = agg(incol, gd)
- res[i] = idx_agg, outcol
- elseif keeprows && fun === identity && !(source_cols isa AsTable)
- @assert source_cols isa Union{Int, AbstractVector{Int}}
- @assert length(source_cols) == 1
- outcol = parentdf[!, first(source_cols)]
- res[i] = idx_keeprows, copycols ? copy(outcol) : outcol
- else
- if source_cols isa Int
- incols = (parentdf[!, source_cols],)
- elseif source_cols isa AsTable
- incols = Tables.columntable(select(parentdf,
- source_cols.cols,
- copycols=false))
- else
- @assert source_cols isa AbstractVector{Int}
- incols = ntuple(i -> parentdf[!, source_cols[i]], length(source_cols))
- end
- firstres = length(gd) > 0 ?
- do_call(fun, gd.idx, gd.starts, gd.ends, gd, incols, 1) :
- do_call(fun, Int[], 1:1, 0:0, gd, incols, 1)
- firstmulticol = firstres isa MULTI_COLS_TYPE
- if firstmulticol
- throw(ArgumentError("a single value or vector result is required when " *
- "passing multiple functions (got $(typeof(res)))"))
- end
- # if idx_agg was not computed yet it is nothing
- # in this case if we are not passed a vector compute it.
- if !(firstres isa AbstractVector) && isnothing(idx_agg)
- idx_agg = Vector{Int}(undef, length(gd))
- fillfirst!(nothing, idx_agg, 1:length(gd.groups), gd)
- end
- # TODO: if firstres is a vector we recompute idx for every function
- # this could be avoided - it could be computed only the first time
- # and later we could just check if lengths of groups match this first idx
-
- # the last argument passed to _combine_with_first informs it about precomputed
- # idx. Currently we do it only for single-row return values otherwise we pass
- # nothing to signal that idx has to be computed in _combine_with_first
- idx, outcols, _ = _combine_with_first(wrap(firstres), fun, gd, incols,
- Val(firstmulticol),
- firstres isa AbstractVector ? nothing : idx_agg)
- @assert length(outcols) == 1
- res[i] = idx, outcols[1]
- end
- end
- # if idx_agg === nothing then we have only functions that
- # returned multiple rows and idx_loc = 1
- idx_loc = findfirst(x -> x[1] !== idx_agg, res)
- if !keeprows && isnothing(idx_loc)
- @assert !isnothing(idx_agg)
- idx = idx_agg
- else
- idx = keeprows ? idx_keeprows : res[idx_loc][1]
- agg2idx_map = nothing
- for i in 1:length(res)
- if res[i][1] !== idx && res[i][1] != idx
- if res[i][1] === idx_agg
- # we perform pseudo broadcasting here
- # keep -1 as a sentinel for errors
- if isnothing(agg2idx_map)
- agg2idx_map = _agg2idx_map_helper(idx, idx_agg)
- end
- res[i] = idx_agg, res[i][2][agg2idx_map]
- elseif idx != res[i][1]
- if keeprows
- throw(ArgumentError("all functions must return vectors with " *
- "as many values as rows in each group"))
- else
- throw(ArgumentError("all functions must return vectors of the same length"))
- end
- end
- end
- end
- end
-
- # here first field in res[i] is used to keep track how the column was generated
- # a correct index is stored in idx variable
-
- for (i, (col_idx, col)) in enumerate(res)
- if keeprows && res[i][1] !== idx_keeprows # we need to reorder the column
- newcol = similar(col)
- # we can probably make it more efficient, but I leave it as an optimization for the future
- gd_idx = gd.idx
- for j in eachindex(gd.idx, col)
- newcol[gd_idx[j]] = col[j]
- end
- res[i] = (col_idx, newcol)
- end
- end
- outcols = map(x -> x[2], res)
- # this check is redundant given we check idx above
- # but it is safer to double check and it is cheap
- @assert all(x -> length(x) == length(outcols[1]), outcols)
- return idx, DataFrame(collect(AbstractVector, outcols), nms, copycols=false)
- end
-
- function _combine(fun::Base.Callable, gd::GroupedDataFrame, ::Nothing,
- copycols::Bool, keeprows::Bool, renamecols::Bool)
- @assert copycols && !keeprows
- # use `similar` as `gd` might have been subsetted
- firstres = length(gd) > 0 ? fun(gd[1]) : fun(similar(parent(gd), 0))
- idx, outcols, nms = _combine_multicol(firstres, fun, gd, nothing)
- valscat = DataFrame(collect(AbstractVector, outcols), nms)
- return idx, valscat
- end
-
- function _combine(p::Pair, gd::GroupedDataFrame, ::Nothing,
- copycols::Bool, keeprows::Bool, renamecols::Bool)
- # here p should not be normalized as we allow tabular return value from fun
- # map and combine should not dispatch here if p is isagg
0 @assert copycols && !keeprows
0 source_cols, (fun, out_col) = normalize_selection(index(parent(gd)), p, renamecols)
- parentdf = parent(gd)
- if source_cols isa Int
- incols = (parent(gd)[!, source_cols],)
- elseif source_cols isa AsTable
- incols = Tables.columntable(select(parentdf,
- source_cols.cols,
- copycols=false))
- else
- @assert source_cols isa AbstractVector{Int}
0 incols = ntuple(i -> parent(gd)[!, source_cols[i]], length(source_cols))
- end
16 firstres = length(gd) > 0 ?
- do_call(fun, gd.idx, gd.starts, gd.ends, gd, incols, 1) :
- do_call(fun, Int[], 1:1, 0:0, gd, incols, 1)
16 idx, outcols, nms = _combine_multicol(firstres, fun, gd, incols)
- # disallow passing target column name to genuine tables
0 if firstres isa MULTI_COLS_TYPE
0 if p isa Pair{<:Any, <:Pair{<:Any, <:SymbolOrString}}
- throw(ArgumentError("setting column name for tabular return value is disallowed"))
- end
- else
- # fetch auto-generated or passed target column name into nms, overwriting
- # what _combine_with_first produced
96 nms = [out_col]
- end
96 valscat = DataFrame(collect(AbstractVector, outcols), nms)
208 return idx, valscat
- end
-
- function _combine_multicol(firstres, fun::Any, gd::GroupedDataFrame,
- incols::Union{Nothing, AbstractVector, Tuple, NamedTuple})
- firstmulticol = firstres isa MULTI_COLS_TYPE
192 if !(firstres isa Union{AbstractVecOrMat, AbstractDataFrame,
- NamedTuple{<:Any, <:Tuple{Vararg{AbstractVector}}}})
50577616 idx_agg = Vector{Int}(undef, length(gd))
0 fillfirst!(nothing, idx_agg, 1:length(gd.groups), gd)
- else
- idx_agg = nothing
- end
0 return _combine_with_first(wrap(firstres), fun, gd, incols,
- Val(firstmulticol), idx_agg)
- end
-
- function _combine_with_first(first::Union{NamedTuple, DataFrameRow, AbstractDataFrame},
- f::Any, gd::GroupedDataFrame,
- incols::Union{Nothing, AbstractVector, Tuple, NamedTuple},
- firstmulticol::Val, idx_agg::Union{Nothing, AbstractVector{<:Integer}})
32 extrude = false
-
0 if first isa AbstractDataFrame
- n = 0
0 eltys = eltype.(eachcol(first))
256 elseif first isa NamedTuple{<:Any, <:Tuple{Vararg{AbstractVector}}}
- n = 0
0 eltys = map(eltype, first)
0 elseif first isa DataFrameRow
0 n = length(gd)
0 eltys = [eltype(parent(first)[!, i]) for i in parentcols(index(first))]
240 elseif firstmulticol == Val(false) && first[1] isa Union{AbstractArray{<:Any, 0}, Ref}
- extrude = true
0 first = wrap_row(first[1], firstmulticol)
0 n = length(gd)
0 eltys = (typeof(first[1]),)
- else # other NamedTuple giving a single row
0 n = length(gd)
0 eltys = map(typeof, first)
0 if any(x -> x <: AbstractVector, eltys)
0 throw(ArgumentError("mixing single values and vectors in a named tuple is not allowed"))
- end
- end
0 idx = isnothing(idx_agg) ? Vector{Int}(undef, n) : idx_agg
- local initialcols
- let eltys=eltys, n=n # Workaround for julia#15276
480 initialcols = ntuple(i -> Tables.allocatecolumn(eltys[i], n), _ncol(first))
- end
16 targetcolnames = tuple(propertynames(first)...)
288 if !extrude && first isa Union{AbstractDataFrame,
- NamedTuple{<:Any, <:Tuple{Vararg{AbstractVector}}}}
0 outcols, finalcolnames = _combine_tables_with_first!(first, initialcols, idx, 1, 1,
- f, gd, incols, targetcolnames,
- firstmulticol)
- else
48 outcols, finalcolnames = _combine_rows_with_first!(first, initialcols, 1, 1,
- f, gd, incols, targetcolnames,
- firstmulticol)
- end
272 return idx, outcols, collect(Symbol, finalcolnames)
- end
-
- function fill_row!(row, outcols::NTuple{N, AbstractVector},
- i::Integer, colstart::Integer,
- colnames::NTuple{N, Symbol}) where N
- if _ncol(row) != N
- throw(ArgumentError("return value must have the same number of columns " *
- "for all groups (got $N and $(length(row)))"))
- end
0 @inbounds for j in colstart:length(outcols)
0 col = outcols[j]
0 cn = colnames[j]
- local val
- try
202306672 val = row[cn]
- catch
0 throw(ArgumentError("return value must have the same column names " *
- "for all groups (got $colnames and $(propertynames(row)))"))
- end
- S = typeof(val)
- T = eltype(col)
- if S <: T || promote_type(S, T) <: T
0 col[i] = val
- else
0 return j
- end
- end
0 return nothing
- end
-
- function _combine_rows_with_first!(first::Union{NamedTuple, DataFrameRow},
- outcols::NTuple{N, AbstractVector},
- rowstart::Integer, colstart::Integer,
- f::Any, gd::GroupedDataFrame,
- incols::Union{Nothing, AbstractVector, Tuple, NamedTuple},
- colnames::NTuple{N, Symbol},
- firstmulticol::Val) where N
0 len = length(gd)
0 gdidx = gd.idx
0 starts = gd.starts
0 ends = gd.ends
-
- # handle empty GroupedDataFrame
0 len == 0 && return outcols, colnames
-
- # Handle first group
0 j = fill_row!(first, outcols, rowstart, colstart, colnames)
- @assert j === nothing # eltype is guaranteed to match
- # Handle remaining groups
0 @inbounds for i in rowstart+1:len
404608368 row = wrap_row(do_call(f, gdidx, starts, ends, gd, incols, i), firstmulticol)
303456672 j = fill_row!(row, outcols, i, 1, colnames)
0 if j !== nothing # Need to widen column type
- local newcols
0 let i = i, j = j, outcols=outcols, row=row # Workaround for julia#15276
0 newcols = ntuple(length(outcols)) do k
- S = typeof(row[k])
- T = eltype(outcols[k])
- U = promote_type(S, T)
- if S <: T || U <: T
- outcols[k]
- else
- copyto!(Tables.allocatecolumn(U, length(outcols[k])),
- 1, outcols[k], 1, k >= j ? i-1 : i)
- end
- end
- end
0 return _combine_rows_with_first!(row, newcols, i, j,
- f, gd, incols, colnames, firstmulticol)
- end
- end
32 return outcols, colnames
- end
-
- # This needs to be in a separate function
- # to work around a crash due to JuliaLang/julia#29430
- if VERSION >= v"1.1.0-DEV.723"
- @inline function do_append!(do_it, col, vals)
- do_it && append!(col, vals)
- return do_it
- end
- else
- @noinline function do_append!(do_it, col, vals)
- do_it && append!(col, vals)
- return do_it
- end
- end
-
- function append_rows!(rows, outcols::NTuple{N, AbstractVector},
- colstart::Integer, colnames::NTuple{N, Symbol}) where N
- if !isa(rows, Union{AbstractDataFrame, NamedTuple{<:Any, <:Tuple{Vararg{AbstractVector}}}})
- throw(ArgumentError(ERROR_ROW_COUNT))
- elseif _ncol(rows) != N
- throw(ArgumentError("return value must have the same number of columns " *
- "for all groups (got $N and $(_ncol(rows)))"))
- end
- @inbounds for j in colstart:length(outcols)
- col = outcols[j]
- cn = colnames[j]
- local vals
- try
- vals = getproperty(rows, cn)
- catch
- throw(ArgumentError("return value must have the same column names " *
- "for all groups (got $colnames and $(propertynames(rows)))"))
- end
- S = eltype(vals)
- T = eltype(col)
- if !do_append!(S <: T || promote_type(S, T) <: T, col, vals)
- return j
- end
- end
- return nothing
- end
-
- function _combine_tables_with_first!(first::Union{AbstractDataFrame,
- NamedTuple{<:Any, <:Tuple{Vararg{AbstractVector}}}},
- outcols::NTuple{N, AbstractVector},
- idx::Vector{Int}, rowstart::Integer, colstart::Integer,
- f::Any, gd::GroupedDataFrame,
- incols::Union{Nothing, AbstractVector, Tuple, NamedTuple},
- colnames::NTuple{N, Symbol},
- firstmulticol::Val) where N
- len = length(gd)
- gdidx = gd.idx
- starts = gd.starts
- ends = gd.ends
- # Handle first group
-
- @assert _ncol(first) == N
- if !isempty(colnames) && length(gd) > 0
- j = append_rows!(first, outcols, colstart, colnames)
- @assert j === nothing # eltype is guaranteed to match
- append!(idx, Iterators.repeated(gdidx[starts[rowstart]], _nrow(first)))
- end
- # Handle remaining groups
- @inbounds for i in rowstart+1:len
- rows = wrap_table(do_call(f, gdidx, starts, ends, gd, incols, i), firstmulticol)
- _ncol(rows) == 0 && continue
- if isempty(colnames)
- newcolnames = tuple(propertynames(rows)...)
- if rows isa AbstractDataFrame
- eltys = eltype.(eachcol(rows))
- else
- eltys = map(eltype, rows)
- end
- initialcols = ntuple(i -> Tables.allocatecolumn(eltys[i], 0), _ncol(rows))
- return _combine_tables_with_first!(rows, initialcols, idx, i, 1,
- f, gd, incols, newcolnames, firstmulticol)
- end
- j = append_rows!(rows, outcols, 1, colnames)
- if j !== nothing # Need to widen column type
- local newcols
- let i = i, j = j, outcols=outcols, rows=rows # Workaround for julia#15276
- newcols = ntuple(length(outcols)) do k
- S = eltype(rows isa AbstractDataFrame ? rows[!, k] : rows[k])
- T = eltype(outcols[k])
- U = promote_type(S, T)
- if S <: T || U <: T
- outcols[k]
- else
- copyto!(Tables.allocatecolumn(U, length(outcols[k])), outcols[k])
- end
- end
- end
- return _combine_tables_with_first!(rows, newcols, idx, i, j,
- f, gd, incols, colnames, firstmulticol)
- end
- append!(idx, Iterators.repeated(gdidx[starts[i]], _nrow(rows)))
- end
- return outcols, colnames
- end
-
- """
- select(gd::GroupedDataFrame, args...; copycols::Bool=true, keepkeys::Bool=true,
- ungroup::Bool=true, renamecols::Bool=true)
-
- Apply `args` to `gd` following the rules described in [`combine`](@ref).
-
- If `ungroup=true` the result is a `DataFrame`.
- If `ungroup=false` the result is a `GroupedDataFrame`
- (in this case the returned value retains the order of groups of `gd`).
-
- The `parent` of the returned value has as many rows as `parent(gd)`,
- in the same order, except when the returned value has no columns
- (in which case it has zero rows). If an operation in `args` returns
- a single value it is always broadcasted to have this number of rows.
-
- If `copycols=false` then do not perform copying of columns that are not transformed.
-
- $KWARG_PROCESSING_RULES
-
- # See also
-
- [`groupby`](@ref), [`combine`](@ref), [`select!`](@ref), [`transform`](@ref), [`transform!`](@ref)
-
- # Examples
- ```jldoctest
- julia> df = DataFrame(a = [1, 1, 1, 2, 2, 1, 1, 2],
- b = repeat([2, 1], outer=[4]),
- c = 1:8)
- 8×3 DataFrame
- │ Row │ a │ b │ c │
- │ │ Int64 │ Int64 │ Int64 │
- ├─────┼───────┼───────┼───────┤
- │ 1 │ 1 │ 2 │ 1 │
- │ 2 │ 1 │ 1 │ 2 │
- │ 3 │ 1 │ 2 │ 3 │
- │ 4 │ 2 │ 1 │ 4 │
- │ 5 │ 2 │ 2 │ 5 │
- │ 6 │ 1 │ 1 │ 6 │
- │ 7 │ 1 │ 2 │ 7 │
- │ 8 │ 2 │ 1 │ 8 │
-
- julia> gd = groupby(df, :a);
-
- julia> select(gd, :c => sum, nrow)
- 8×3 DataFrame
- │ Row │ a │ c_sum │ nrow │
- │ │ Int64 │ Int64 │ Int64 │
- ├─────┼───────┼───────┼───────┤
- │ 1 │ 1 │ 19 │ 5 │
- │ 2 │ 1 │ 19 │ 5 │
- │ 3 │ 1 │ 19 │ 5 │
- │ 4 │ 2 │ 17 │ 3 │
- │ 5 │ 2 │ 17 │ 3 │
- │ 6 │ 1 │ 19 │ 5 │
- │ 7 │ 1 │ 19 │ 5 │
- │ 8 │ 2 │ 17 │ 3 │
-
- julia> select(gd, :c => sum, nrow, ungroup=false)
- GroupedDataFrame with 2 groups based on key: a
- First Group (5 rows): a = 1
- │ Row │ a │ c_sum │ nrow │
- │ │ Int64 │ Int64 │ Int64 │
- ├─────┼───────┼───────┼───────┤
- │ 1 │ 1 │ 19 │ 5 │
- │ 2 │ 1 │ 19 │ 5 │
- │ 3 │ 1 │ 19 │ 5 │
- │ 4 │ 1 │ 19 │ 5 │
- │ 5 │ 1 │ 19 │ 5 │
- ⋮
- Last Group (3 rows): a = 2
- │ Row │ a │ c_sum │ nrow │
- │ │ Int64 │ Int64 │ Int64 │
- ├─────┼───────┼───────┼───────┤
- │ 1 │ 2 │ 17 │ 3 │
- │ 2 │ 2 │ 17 │ 3 │
- │ 3 │ 2 │ 17 │ 3 │
-
- julia> select(gd, :c => (x -> sum(log, x)) => :sum_log_c) # specifying a name for target column
- 8×2 DataFrame
- │ Row │ a │ sum_log_c │
- │ │ Int64 │ Float64 │
- ├─────┼───────┼───────────┤
- │ 1 │ 1 │ 5.52943 │
- │ 2 │ 1 │ 5.52943 │
- │ 3 │ 1 │ 5.52943 │
- │ 4 │ 2 │ 5.07517 │
- │ 5 │ 2 │ 5.07517 │
- │ 6 │ 1 │ 5.52943 │
- │ 7 │ 1 │ 5.52943 │
- │ 8 │ 2 │ 5.07517 │
-
- julia> select(gd, [:b, :c] .=> sum) # passing a vector of pairs
- 8×3 DataFrame
- │ Row │ a │ b_sum │ c_sum │
- │ │ Int64 │ Int64 │ Int64 │
- ├─────┼───────┼───────┼───────┤
- │ 1 │ 1 │ 8 │ 19 │
- │ 2 │ 1 │ 8 │ 19 │
- │ 3 │ 1 │ 8 │ 19 │
- │ 4 │ 2 │ 4 │ 17 │
- │ 5 │ 2 │ 4 │ 17 │
- │ 6 │ 1 │ 8 │ 19 │
- │ 7 │ 1 │ 8 │ 19 │
- │ 8 │ 2 │ 4 │ 17 │
-
- julia> select(gd, :b => :b1, :c => :c1,
- [:b, :c] => +, keepkeys=false) # multiple arguments, renaming and keepkeys
- 8×3 DataFrame
- │ Row │ b1 │ c1 │ b_c_+ │
- │ │ Int64 │ Int64 │ Int64 │
- ├─────┼───────┼───────┼───────┤
- │ 1 │ 2 │ 1 │ 3 │
- │ 2 │ 1 │ 2 │ 3 │
- │ 3 │ 2 │ 3 │ 5 │
- │ 4 │ 1 │ 4 │ 5 │
- │ 5 │ 2 │ 5 │ 7 │
- │ 6 │ 1 │ 6 │ 7 │
- │ 7 │ 2 │ 7 │ 9 │
- │ 8 │ 1 │ 8 │ 9 │
-
- julia> select(gd, :b, :c => sum) # passing columns and broadcasting
- 8×3 DataFrame
- │ Row │ a │ b │ c_sum │
- │ │ Int64 │ Int64 │ Int64 │
- ├─────┼───────┼───────┼───────┤
- │ 1 │ 1 │ 2 │ 19 │
- │ 2 │ 1 │ 1 │ 19 │
- │ 3 │ 1 │ 2 │ 19 │
- │ 4 │ 2 │ 1 │ 17 │
- │ 5 │ 2 │ 2 │ 17 │
- │ 6 │ 1 │ 1 │ 19 │
- │ 7 │ 1 │ 2 │ 19 │
- │ 8 │ 2 │ 1 │ 17 │
-
- julia> select(gd, :, AsTable(Not(:a)) => sum, renamecols=false)
- 8×4 DataFrame
- │ Row │ a │ b │ c │ b_c │
- │ │ Int64 │ Int64 │ Int64 │ Int64 │
- ├─────┼───────┼───────┼───────┼───────┤
- │ 1 │ 1 │ 2 │ 1 │ 3 │
- │ 2 │ 1 │ 1 │ 2 │ 3 │
- │ 3 │ 1 │ 2 │ 3 │ 5 │
- │ 4 │ 2 │ 1 │ 4 │ 5 │
- │ 5 │ 2 │ 2 │ 5 │ 7 │
- │ 6 │ 1 │ 1 │ 6 │ 7 │
- │ 7 │ 1 │ 2 │ 7 │ 9 │
- │ 8 │ 2 │ 1 │ 8 │ 9 │
- ```
- """
- select(gd::GroupedDataFrame, args...; copycols::Bool=true, keepkeys::Bool=true,
- ungroup::Bool=true, renamecols::Bool=true) =
- _combine_prepare(gd, args..., copycols=copycols, keepkeys=keepkeys,
- ungroup=ungroup, keeprows=true, renamecols=renamecols)
-
- """
- transform(gd::GroupedDataFrame, args...;
- copycols::Bool=true, keepkeys::Bool=true, ungroup::Bool=true, renamecols::Bool=true)
-
- An equivalent of
- `select(gd, :, args..., copycols=copycols, keepkeys=keepkeys, ungroup=ungroup, renamecols=renamecols)`
- but keeps the columns of `parent(gd)` in their original order.
-
- # See also
-
- [`groupby`](@ref), [`combine`](@ref), [`select`](@ref), [`select!`](@ref), [`transform!`](@ref)
- """
- function transform(gd::GroupedDataFrame, args...; copycols::Bool=true,
- keepkeys::Bool=true, ungroup::Bool=true, renamecols::Bool=true)
- res = select(gd, :, args..., copycols=copycols, keepkeys=keepkeys,
- ungroup=ungroup, renamecols=renamecols)
- # res can be a GroupedDataFrame based on DataFrame or a DataFrame,
- # so parent always gives a data frame
- select!(parent(res), propertynames(parent(gd)), :)
- return res
- end
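-
- # A hedged usage sketch (the data and column names are made up for illustration):
- #   df = DataFrame(a = [1, 1, 2], c = 1:3)
- #   transform(groupby(df, :a), :c => sum)
- #   # keeps :a and :c in their original order and appends :c_sum, with the
- #   # per-group sum broadcasted to every row of its group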
-
- """
- select!(gd::GroupedDataFrame{DataFrame}, args...; ungroup::Bool=true, renamecols::Bool=true)
-
- An equivalent of
- `select(gd, args..., copycols=false, keepkeys=true, ungroup=ungroup, renamecols=renamecols)`
- but updates `parent(gd)` in place.
-
- `gd` is updated to reflect the new rows of its updated parent.
- If there are independent `GroupedDataFrame` objects constructed
- using the same parent data frame, they might become corrupted.
-
- # See also
-
- [`groupby`](@ref), [`combine`](@ref), [`select`](@ref), [`transform`](@ref), [`transform!`](@ref)
- """
- function select!(gd::GroupedDataFrame{DataFrame}, args...;
- ungroup::Bool=true, renamecols::Bool=true)
- newdf = select(gd, args..., copycols=false, renamecols=renamecols)
- df = parent(gd)
- _replace_columns!(df, newdf)
- return ungroup ? df : gd
- end
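-
- # A hedged sketch of the in-place variant (illustrative data):
- #   df = DataFrame(a = [1, 1, 2], c = 1:3)
- #   gd = groupby(df, :a)
- #   select!(gd, :c => sum)   # df now has only the columns :a and :c_sum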
-
- """
- transform!(gd::GroupedDataFrame{DataFrame}, args...; ungroup::Bool=true, renamecols::Bool=true)
-
- An equivalent of
- `transform(gd, args..., copycols=false, keepkeys=true, ungroup=ungroup, renamecols=renamecols)`
- but updates `parent(gd)` in place
- and keeps the columns of `parent(gd)` in their original order.
-
- # See also
-
- [`groupby`](@ref), [`combine`](@ref), [`select`](@ref), [`select!`](@ref), [`transform`](@ref)
- """
- function transform!(gd::GroupedDataFrame{DataFrame}, args...;
- ungroup::Bool=true, renamecols::Bool=true)
- newdf = select(gd, :, args..., copycols=false, renamecols=renamecols)
- df = parent(gd)
- select!(newdf, propertynames(df), :)
- _replace_columns!(df, newdf)
- return ungroup ? df : gd
- end