svilupp/comparison_with_python.txt Secret

## comparison_with_python.txt
Comparisons
This section compares DataFrames.jl with other data manipulation frameworks in Python, R, and Stata.

A sample data set can be created using the following code:

using DataFrames
using Statistics

df = DataFrame(grp=repeat(1:2, 3), x=6:-1:1, y=4:9, z=[3:7; missing], id='a':'f')
df2 = DataFrame(grp=[1, 3], w=[10, 11])

Note
Some of the operations mutate the tables so every operation assumes that it is done on the original data frame.

Note that in the comparisons presented below predicates like x -> x >= 1 can be more compactly written as =>(1). The latter form has an additional benefit that it is compiled only once per Julia session (as opposed to x -> x >= 1 which defines a new anonymous function every time it is introduced).

Comparison with the Python package pandas
The following table compares the main functions of DataFrames.jl with the Python package pandas (version 1.1.0):

import pandas as pd
import numpy as np

df = pd.DataFrame({'grp': [1, 2, 1, 2, 1, 2],
                   'x': range(6, 0, -1),
                   'y': range(4, 10),
                   'z': [3, 4, 5, 6, 7, None]},
                   index = list('abcdef'))
df2 = pd.DataFrame({'grp': [1, 3], 'w': [10, 11]})

Because pandas supports multi-index, this example data frame is set up with a to f as row indices rather than a separate id column.

Accessing data
Operation	pandas	DataFrames.jl
Cell indexing by location	df.iloc[1, 1]	df[2, 2]
Row slicing by location	df.iloc[1:3]	df[2:3, :]
Column slicing by location	df.iloc[:, 1:]	df[:, 2:end]
Row indexing by label	df.loc['c']	df[findfirst(==('c'), df.id), :]
Column indexing by label	df.loc[:, 'x']	df[:, :x]
Column slicing by label	df.loc[:, ['x', 'z']]	df[:, [:x, :z]]
df.loc[:, 'x':'z']	df[:, Between(:x, :z)]
Mixed indexing	df.loc['c'][1]	df[findfirst(==('c'), df.id), 2]
Note that Julia uses 1-based indexing, inclusive on both ends. A special keyword end can be used to indicate the last index. Likewise, the begin keyword can be used to indicate the first index.

In addition, when indexing a data frame with the findfirst function, a single DataFrameRow object is returned. In the case that id is not unique, you can use the findall function or boolean indexing instead. It would then return a DataFrame object containing all matched rows. The following two lines of code are functionally equivalent:

df[findall(==('c'), df.id), :]
df[df.id .== 'c', :]

DataFrames.jl's indexing always produces a consistent and predictable return type. By contrast, pandas' loc function returns a Series object when there is exactly one 'c' value in the index, and it returns a DataFrame object when there are multiple rows having the index value of 'c'.

Common operations
Operation	pandas	DataFrames.jl
Reduce multiple values	df['z'].mean(skipna = False)	mean(df.z)
df['z'].mean()	mean(skipmissing(df.z))
df[['z']].agg(['mean'])	combine(df, :z => mean ∘ skipmissing)
Add new columns	df.assign(z1 = df['z'] + 1)	transform(df, :z => (v -> v .+ 1) => :z1)
Rename columns	df.rename(columns = {'x': 'x_new'})	rename(df, :x => :x_new)
Pick & transform columns	df.assign(x_mean = df['x'].mean())[['x_mean', 'y']]	select(df, :x => mean, :y)
Sort rows	df.sort_values(by = 'x')	sort(df, :x)
df.sort_values(by = ['grp', 'x'], ascending = [True, False])	sort(df, [:grp, order(:x, rev = true)])
Drop missing rows	df.dropna()	dropmissing(df)
Select unique rows	df.drop_duplicates()	unique(df)
Note that pandas skips NaN values in its analytic functions by default. By contrast, Julia functions do not skip NaN's. If necessary, you can filter out the NaN's before processing, for example, mean(Iterators.filter(!isnan, x)).

Pandas uses NaN for representing both missing data and the floating point "not a number" value. Julia defines a special value missing for representing missing data. DataFrames.jl respects general rules in Julia in propagating missing values by default. If necessary, the skipmissing function can be used to remove missing data. See the Missing Data section for more information.

In addition, pandas keeps the original column name after applying a function. DataFrames.jl appends a suffix to the column name by default. To keep it simple, the examples above do not synchronize the column names between pandas and DataFrames.jl (you can pass renamecols=false keyword argument to select, transform and combine functions to retain old column names).

Mutating operations
Operation	pandas	DataFrames.jl
Add new columns	df['z1'] = df['z'] + 1	df.z1 = df.z .+ 1
transform!(df, :z => (x -> x .+ 1) => :z1)
df.insert(1, 'const', 10)	insertcols!(df, 2, :const => 10)
Rename columns	df.rename(columns = {'x': 'x_new'}, inplace = True)	rename!(df, :x => :x_new)
Sort rows	df.sort_values(by = 'x', inplace = True)	sort!(df, :x)
Drop missing rows	df.dropna(inplace = True)	dropmissing!(df)
Select unique rows	df.drop_duplicates(inplace = True)	unique!(df)
Generally speaking, DataFrames.jl follows the Julia convention of using ! in the function name to indicate mutation behavior.

Grouping data and aggregation
DataFrames.jl provides a groupby function to apply operations over each group independently. The result of groupby is a GroupedDataFrame object which may be processed using the combine, transform, or select functions. The following table illustrates some common grouping and aggregation usages.

Operation	pandas	DataFrames.jl
Aggregate by groups	df.groupby('grp')['x'].mean()	combine(groupby(df, :grp), :x => mean)
Rename column after aggregation	df.groupby('grp')['x'].mean().rename("my_mean")	combine(groupby(df, :grp), :x => mean => :my_mean)
Add aggregated data as column	df.join(df.groupby('grp')['x'].mean(), on='grp', rsuffix='_mean')	transform(groupby(df, :grp), :x => mean)
...and select output columns	df.join(df.groupby('grp')['x'].mean(), on='grp', rsuffix='_mean')[['grp', 'x_mean']]	select(groupby(df, :grp), :id, :x => mean)
Note that pandas returns a Series object for 1-dimensional result unless reset_index is called afterwards. The corresponding DataFrames.jl examples return an equivalent DataFrame object. Consider the first example:

>>> df.groupby('grp')['x'].mean()
grp
1    4
2    3
Name: x, dtype: int64

For DataFrames.jl, it looks like this:

julia> combine(groupby(df, :grp), :x => mean)
2×2 DataFrame
 Row │ grp    x_mean
     │ Int64  Float64
─────┼────────────────
   1 │     1      4.0
   2 │     2      3.0

In DataFrames.jl, the GroupedDataFrame object supports an efficient key lookup. Hence, it performs well when you need to perform lookups repeatedly.

More advanced commands
This section includes more complex examples.

Operation	pandas	DataFrames.jl
Complex Function	df[['z']].agg(lambda v: np.mean(np.cos(v)))	combine(df, :z => v -> mean(cos, skipmissing(v)))
Aggregate multiple columns	df.agg({'x': max, 'y': min})	combine(df, :x => maximum, :y => minimum)
df[['x', 'y']].mean()	combine(df, [:x, :y] .=> mean)
df.filter(regex=("^x")).mean()	combine(df, names(df, r"^x") .=> mean)
Apply function over multiple variables	df.assign(x_y_cor = np.corrcoef(df.x, df.y)[0, 1])	transform(df, [:x, :y] => cor)
Row-wise operation	df.assign(x_y_min = df.apply(lambda v: min(v.x, v.y), axis=1))	transform(df, [:x, :y] => ByRow(min))
df.assign(x_y_argmax = df.apply(lambda v: df.columns[v.argmax()], axis=1))	transform(df, AsTable([:x, :y]) => ByRow(argmax))
DataFrame as input	df.groupby('grp').head(2)	combine(d -> first(d, 2), groupby(df, :grp))
DataFrame as output	df[['x']].agg(lambda x: [min(x), max(x)])	combine(df, :x => (x -> (x=[minimum(x), maximum(x)],)) => AsTable)
Note that pandas preserves the same row order after groupby whereas DataFrames.jl shows them grouped by the provided keys after the combine operation, but select and transform retain an original row ordering.

Joining data frames
DataFrames.jl supports join operations similar to a relational database.

Operation	pandas	DataFrames.jl
Inner join	pd.merge(df, df2, how = 'inner', on = 'grp')	innerjoin(df, df2, on = :grp)
Outer join	pd.merge(df, df2, how = 'outer', on = 'grp')	outerjoin(df, df2, on = :grp)
Left join	pd.merge(df, df2, how = 'left', on = 'grp')	leftjoin(df, df2, on = :grp)
Right join	pd.merge(df, df2, how = 'right', on = 'grp')	rightjoin(df, df2, on = :grp)
Semi join (filtering)	df[df.grp.isin(df2.grp)]	semijoin(df, df2, on = :grp)
Anti join (filtering)	df[~df.grp.isin(df2.grp)]	antijoin(df, df2, on = :grp)
For multi-column joins, both pandas and DataFrames.jl accept an array for the on keyword argument.

In the cases of semi joins and anti joins, the isin function in pandas can still be used as long as the join keys are combined in a tuple. In DataFrames.jl, it just works normally with an array of join keys specified in the on keyword argument.

Comparison with the R package dplyr
The following table compares the main functions of DataFrames.jl with the R package dplyr (version 1):

df <- tibble(grp = rep(1:2, 3), x = 6:1, y = 4:9,
             z = c(3:7, NA), id = letters[1:6])

Operation	dplyr	DataFrames.jl
Reduce multiple values	summarize(df, mean(x))	combine(df, :x => mean)
Add new columns	mutate(df, x_mean = mean(x))	transform(df, :x => mean => :x_mean)
Rename columns	rename(df, x_new = x)	rename(df, :x => :x_new)
Pick columns	select(df, x, y)	select(df, :x, :y)
Pick & transform columns	transmute(df, mean(x), y)	select(df, :x => mean, :y)
Pick rows	filter(df, x >= 1)	subset(df, :x => ByRow(x -> x >= 1))
Sort rows	arrange(df, x)	sort(df, :x)
As in dplyr, some of these functions can be applied to grouped data frames, in which case they operate by group:

Operation	dplyr	DataFrames.jl
Reduce multiple values	summarize(group_by(df, grp), mean(x))	combine(groupby(df, :grp), :x => mean)
Add new columns	mutate(group_by(df, grp), mean(x))	transform(groupby(df, :grp), :x => mean)
Pick & transform columns	transmute(group_by(df, grp), mean(x), y)	select(groupby(df, :grp), :x => mean, :y)
The table below compares more advanced commands:

Operation	dplyr	DataFrames.jl
Complex Function	summarize(df, mean(x, na.rm = T))	combine(df, :x => x -> mean(skipmissing(x)))
Transform several columns	summarize(df, max(x), min(y))	combine(df, :x => maximum, :y => minimum)
summarize(df, across(c(x, y), mean))	combine(df, [:x, :y] .=> mean)
summarize(df, across(starts_with("x"), mean))	combine(df, names(df, r"^x") .=> mean)
summarize(df, across(c(x, y), list(max, min)))	combine(df, ([:x, :y] .=> [maximum minimum])...)
Multivariate function	mutate(df, cor(x, y))	transform(df, [:x, :y] => cor)
Row-wise	mutate(rowwise(df), min(x, y))	transform(df, [:x, :y] => ByRow(min))
mutate(rowwise(df), which.max(c_across(matches("^x"))))	transform(df, AsTable(r"^x") => ByRow(argmax))
DataFrame as input	summarize(df, head(across(), 2))	combine(d -> first(d, 2), df)
DataFrame as output	summarize(df, tibble(value = c(min(x), max(x))))	combine(df, :x => (x -> (value = [minimum(x), maximum(x)],)) => AsTable)
Comparison with the R package data.table
The following table compares the main functions of DataFrames.jl with the R package data.table (version 1.14.1).

library(data.table)
df  <- data.table(grp = rep(1:2, 3), x = 6:1, y = 4:9,
                  z = c(3:7, NA), id = letters[1:6])
df2 <- data.table(grp=c(1,3), w = c(10,11))

Operation	data.table	DataFrames.jl
Reduce multiple values	df[, .(mean(x))]	combine(df, :x => mean)
Add new columns	df[, x_mean:=mean(x) ]	transform!(df, :x => mean => :x_mean)
Rename column (in place)	setnames(df, "x", "x_new")	rename!(df, :x => :x_new)
Rename multiple columns (in place)	setnames(df, c("x", "y"), c("x_new", "y_new"))	rename!(df, [:x, :y] .=> [:x_new, :y_new])
Pick columns as dataframe	df[, .(x, y)]	select(df, :x, :y)
Pick column as a vector	df[, x]	df[!, :x]
Remove columns	df[, -"x"]	select(df, Not(:x))
Remove columns (in place)	df[, x:=NULL]	select!(df, Not(:x))
Remove columns (in place)	df[, c("x", "y"):=NULL]	select!(df, Not([:x, :y]))
Pick & transform columns	df[, .(mean(x), y)]	select(df, :x => mean, :y)
Pick rows	df[ x >= 1 ]	filter(:x => >=(1), df)
Sort rows (in place)	setorder(df, x)	sort!(df, :x)
Sort rows	df[ order(x) ]	sort(df, :x)
Grouping data and aggregation
Operation	data.table	DataFrames.jl
Reduce multiple values	df[, mean(x), by=id ]	combine(groupby(df, :id), :x => mean)
Add new columns (in place)	df[, x_mean:=mean(x), by=id]	transform!(groupby(df, :id), :x => mean)
Pick & transform columns	df[, .(x_mean = mean(x), y), by=id]	select(groupby(df, :id), :x => mean, :y)
More advanced commands
Operation	data.table	DataFrames.jl
Complex Function	df[, .(mean(x, na.rm=TRUE)) ]	combine(df, :x => x -> mean(skipmissing(x)))
Transform certain rows (in place)	df[x<=0, x:=0]	df.x[df.x .<= 0] .= 0
Transform several columns	df[, .(max(x), min(y)) ]	combine(df, :x => maximum, :y => minimum)
df[, lapply(.SD, mean), .SDcols = c("x", "y") ]	combine(df, [:x, :y] .=> mean)
df[, lapply(.SD, mean), .SDcols = patterns("*x") ]	combine(df, names(df, r"^x") .=> mean)
df[, unlist(lapply(.SD, function(x) c(max=max(x), min=min(x)))), .SDcols = c("x", "y") ]	combine(df, ([:x, :y] .=> [maximum minimum])...)
Multivariate function	df[, .(cor(x,y)) ]	transform(df, [:x, :y] => cor)
Row-wise	df[, min_xy := min(x, y), by = 1:nrow(df)]	transform!(df, [:x, :y] => ByRow(min))
df[, argmax_xy := which.max(.SD) , .SDcols = patterns("*x"), by = 1:nrow(df) ]	transform!(df, AsTable(r"^x") => ByRow(argmax))
DataFrame as output	df[, .SD[1], by=grp]	combine(groupby(df, :grp), first)
DataFrame as output	df[, .SD[which.max(x)], by=grp]	combine(groupby(df, :grp), sdf -> sdf[argmax(sdf.x), :])
Joining data frames
Operation	data.table	DataFrames.jl
Inner join	merge(df, df2, on = "grp")	innerjoin(df, df2, on = :grp)
Outer join	merge(df, df2, all = TRUE, on = "grp")	outerjoin(df, df2, on = :grp)
Left join	merge(df, df2, all.x = TRUE, on = "grp")	leftjoin(df, df2, on = :grp)
Right join	merge(df, df2, all.y = TRUE, on = "grp")	rightjoin(df, df2, on = :grp)
Anti join (filtering)	df[!df2, on = "grp" ]	antijoin(df, df2, on = :grp)
Semi join (filtering)	merge(df1, df2[, .(grp)])	semijoin(df, df2, on = :grp)
Comparison with Stata (version 8 and above)
The following table compares the main functions of DataFrames.jl with Stata:

Operation	Stata	DataFrames.jl
Reduce multiple values	collapse (mean) x	combine(df, :x => mean)
Add new columns	egen x_mean = mean(x)	transform!(df, :x => mean => :x_mean)
Rename columns	rename x x_new	rename!(df, :x => :x_new)
Pick columns	keep x y	select!(df, :x, :y)
Pick rows	keep if x >= 1	subset!(df, :x => ByRow(x -> x >= 1))
Sort rows	sort x	sort!(df, :x)
Note that the suffix ! (i.e. transform!, select!, etc) ensures that the operation transforms the dataframe in place, as in Stata

Some of these functions can be applied to grouped data frames, in which case they operate by group:

Operation	Stata	DataFrames.jl
Add new columns	egen x_mean = mean(x), by(grp)	transform!(groupby(df, :grp), :x => mean)
Reduce multiple values	collapse (mean) x, by(grp)	combine(groupby(df, :grp), :x => mean)
The table below compares more advanced commands:

Operation	Stata	DataFrames.jl
Transform certain rows	replace x = 0 if x <= 0	transform(df, :x => (x -> ifelse.(x .<= 0, 0, x)) => :x)
Transform several columns	collapse (max) x (min) y	combine(df, :x => maximum, :y => minimum)
collapse (mean) x y	combine(df, [:x, :y] .=> mean)
collapse (mean) x*	combine(df, names(df, r"^x") .=> mean)
collapse (max) x y (min) x y	combine(df, ([:x, :y] .=> [maximum minimum])...)
Multivariate function	egen z = corr(x y)	transform!(df, [:x, :y] => cor => :z)
Row-wise	egen z = rowmin(x y)	transform!(df, [:x, :y] => ByRow(min) => :z)

## database_style_joins.txt
Database-Style Joins
Introduction to joins
We often need to combine two or more data sets together to provide a complete picture of the topic we are studying. For example, suppose that we have the following two data sets:

julia> using DataFrames

julia> people = DataFrame(ID=[20, 40], Name=["John Doe", "Jane Doe"])
2×2 DataFrame
 Row │ ID     Name
     │ Int64  String
─────┼─────────────────
   1 │    20  John Doe
   2 │    40  Jane Doe

julia> jobs = DataFrame(ID=[20, 40], Job=["Lawyer", "Doctor"])
2×2 DataFrame
 Row │ ID     Job
     │ Int64  String
─────┼───────────────
   1 │    20  Lawyer
   2 │    40  Doctor

We might want to work with a larger data set that contains both the names and jobs for each ID. We can do this using the innerjoin function:

julia> innerjoin(people, jobs, on = :ID)
2×3 DataFrame
 Row │ ID     Name      Job
     │ Int64  String    String
─────┼─────────────────────────
   1 │    20  John Doe  Lawyer
   2 │    40  Jane Doe  Doctor

In relational database theory, this operation is generally referred to as a join. The columns used to determine which rows should be combined during a join are called keys.

The following functions are provided to perform seven kinds of joins:

innerjoin: the output contains rows for values of the key that exist in all passed data frames.
leftjoin: the output contains rows for values of the key that exist in the first (left) argument, whether or not that value exists in the second (right) argument.
rightjoin: the output contains rows for values of the key that exist in the second (right) argument, whether or not that value exists in the first (left) argument.
outerjoin: the output contains rows for values of the key that exist in any of the passed data frames.
semijoin: Like an inner join, but output is restricted to columns from the first (left) argument.
antijoin: The output contains rows for values of the key that exist in the first (left) but not the second (right) argument. As with semijoin, output is restricted to columns from the first (left) argument.
crossjoin: The output is the cartesian product of rows from all passed data frames.
See the Wikipedia page on SQL joins for more information.

Here are examples of different kinds of join:

julia> jobs = DataFrame(ID=[20, 60], Job=["Lawyer", "Astronaut"])
2×2 DataFrame
 Row │ ID     Job
     │ Int64  String
─────┼──────────────────
   1 │    20  Lawyer
   2 │    60  Astronaut

julia> innerjoin(people, jobs, on = :ID)
1×3 DataFrame
 Row │ ID     Name      Job
     │ Int64  String    String
─────┼─────────────────────────
   1 │    20  John Doe  Lawyer

julia> leftjoin(people, jobs, on = :ID)
2×3 DataFrame
 Row │ ID     Name      Job
     │ Int64  String    String?
─────┼──────────────────────────
   1 │    20  John Doe  Lawyer
   2 │    40  Jane Doe  missing

julia> rightjoin(people, jobs, on = :ID)
2×3 DataFrame
 Row │ ID     Name      Job
     │ Int64  String?   String
─────┼────────────────────────────
   1 │    20  John Doe  Lawyer
   2 │    60  missing   Astronaut

julia> outerjoin(people, jobs, on = :ID)
3×3 DataFrame
 Row │ ID     Name      Job
     │ Int64  String?   String?
─────┼────────────────────────────
   1 │    20  John Doe  Lawyer
   2 │    40  Jane Doe  missing
   3 │    60  missing   Astronaut

julia> semijoin(people, jobs, on = :ID)
1×2 DataFrame
 Row │ ID     Name
     │ Int64  String
─────┼─────────────────
   1 │    20  John Doe

julia> antijoin(people, jobs, on = :ID)
1×2 DataFrame
 Row │ ID     Name
     │ Int64  String
─────┼─────────────────
   1 │    40  Jane Doe

Cross joins are the only kind of join that does not use a on key:

julia> crossjoin(people, jobs, makeunique = true)
4×4 DataFrame
 Row │ ID     Name      ID_1   Job
     │ Int64  String    Int64  String
─────┼───────────────────────────────────
   1 │    20  John Doe     20  Lawyer
   2 │    20  John Doe     60  Astronaut
   3 │    40  Jane Doe     20  Lawyer
   4 │    40  Jane Doe     60  Astronaut

Key value comparisons and floating point values
Key values from the two or more data frames are compared using the isequal function. This is consistent with the Set and Dict types in Julia Base.

It is not recommended to use floating point numbers as keys: floating point comparisons can be surprising and unpredictable. If you do use floating point keys, note that by default an error is raised when keys include -0.0 (negative zero) or NaN values. Here is an example:

julia> innerjoin(DataFrame(id=[-0.0]), DataFrame(id=[0.0]), on=:id)
ERROR: ArgumentError: Currently for numeric values `NaN` and `-0.0` in their real or imaginary components are not allowed. Such value was found in column :id in left data frame. Use CategoricalArrays.jl to wrap these values in a CategoricalVector to perform the requested join.

This can be overridden by wrapping the key values in a categorical vector.

Joining on key columns with different names
In order to join data frames on keys which have different names in the left and right tables, you may pass left => right pairs as on argument:

julia> a = DataFrame(ID=[20, 40], Name=["John Doe", "Jane Doe"])
2×2 DataFrame
 Row │ ID     Name
     │ Int64  String
─────┼─────────────────
   1 │    20  John Doe
   2 │    40  Jane Doe

julia> b = DataFrame(IDNew=[20, 40], Job=["Lawyer", "Doctor"])
2×2 DataFrame
 Row │ IDNew  Job
     │ Int64  String
─────┼───────────────
   1 │    20  Lawyer
   2 │    40  Doctor

julia> innerjoin(a, b, on = :ID => :IDNew)
2×3 DataFrame
 Row │ ID     Name      Job
     │ Int64  String    String
─────┼─────────────────────────
   1 │    20  John Doe  Lawyer
   2 │    40  Jane Doe  Doctor

Here is another example with multiple columns:

julia> a = DataFrame(City=["Amsterdam", "London", "London", "New York", "New York"],
                     Job=["Lawyer", "Lawyer", "Lawyer", "Doctor", "Doctor"],
                     Category=[1, 2, 3, 4, 5])
5×3 DataFrame
 Row │ City       Job     Category
     │ String     String  Int64
─────┼─────────────────────────────
   1 │ Amsterdam  Lawyer         1
   2 │ London     Lawyer         2
   3 │ London     Lawyer         3
   4 │ New York   Doctor         4
   5 │ New York   Doctor         5

julia> b = DataFrame(Location=["Amsterdam", "London", "London", "New York", "New York"],
                     Work=["Lawyer", "Lawyer", "Lawyer", "Doctor", "Doctor"],
                     Name=["a", "b", "c", "d", "e"])
5×3 DataFrame
 Row │ Location   Work    Name
     │ String     String  String
─────┼───────────────────────────
   1 │ Amsterdam  Lawyer  a
   2 │ London     Lawyer  b
   3 │ London     Lawyer  c
   4 │ New York   Doctor  d
   5 │ New York   Doctor  e

julia> innerjoin(a, b, on = [:City => :Location, :Job => :Work])
9×4 DataFrame
 Row │ City       Job     Category  Name
     │ String     String  Int64     String
─────┼─────────────────────────────────────
   1 │ Amsterdam  Lawyer         1  a
   2 │ London     Lawyer         2  b
   3 │ London     Lawyer         3  b
   4 │ London     Lawyer         2  c
   5 │ London     Lawyer         3  c
   6 │ New York   Doctor         4  d
   7 │ New York   Doctor         5  d
   8 │ New York   Doctor         4  e
   9 │ New York   Doctor         5  e

Handling of duplicate keys and tracking source data frame
Additionally, notice that in the last join rows 2 and 3 had the same values on on variables in both joined DataFrames. In such a situation innerjoin, outerjoin, leftjoin and rightjoin will produce all combinations of matching rows. In our example rows from 2 to 5 were created as a result. The same behavior can be observed for rows 4 and 5 in both joined DataFrames.

In order to check that columns passed as the on argument define unique keys (according to isequal) in each input data frame you can set the validate keyword argument to a two-element tuple or a pair of Bool values, with each element indicating whether to run check for the corresponding data frame. Here is an example for the join operation described above:

julia> innerjoin(a, b, on = [(:City => :Location), (:Job => :Work)], validate=(true, true))
ERROR: ArgumentError: Merge key(s) are not unique in both df1 and df2. df1 contains 2 duplicate keys: (City = "London", Job = "Lawyer") and (City = "New York", Job = "Doctor"). df2 contains 2 duplicate keys: (Location = "London", Work = "Lawyer") and (Location = "New York", Work = "Doctor").

Finally, using the source keyword argument you can add a column to the resulting data frame indicating whether the given row appeared only in the left, the right or both data frames. Here is an example:

julia> a = DataFrame(ID=[20, 40], Name=["John", "Jane"])
2×2 DataFrame
 Row │ ID     Name
     │ Int64  String
─────┼───────────────
   1 │    20  John
   2 │    40  Jane

julia> b = DataFrame(ID=[20, 60], Job=["Lawyer", "Doctor"])
2×2 DataFrame
 Row │ ID     Job
     │ Int64  String
─────┼───────────────
   1 │    20  Lawyer
   2 │    60  Doctor

julia> outerjoin(a, b, on=:ID, validate=(true, true), source=:source)
3×4 DataFrame
 Row │ ID     Name     Job      source
     │ Int64  String?  String?  String
─────┼─────────────────────────────────────
   1 │    20  John     Lawyer   both
   2 │    40  Jane     missing  left_only
   3 │    60  missing  Doctor   right_only

Note that this time we also used the validate keyword argument and it did not produce errors as the keys defined in both source data frames were unique.

Renaming joined columns
Often you want to keep track of the source data frame. This feature is supported with the renamecols keyword argument:

julia> innerjoin(a, b, on=:ID, renamecols = "_left" => "_right")
1×3 DataFrame
 Row │ ID     Name_left  Job_right
     │ Int64  String     String
─────┼─────────────────────────────
   1 │    20  John       Lawyer

In the above example we added the "_left" suffix to the non-key columns from the left table and the "_right" suffix to the non-key columns from the right table.

Alternatively it is allowed to pass a function transforming column names:

julia> innerjoin(a, b, on=:ID, renamecols = lowercase => uppercase)
1×3 DataFrame
 Row │ ID     name    JOB
     │ Int64  String  String
─────┼───────────────────────
   1 │    20  John    Lawyer

Matching missing values in joins
By default when you try to to perform a join on a key that has missing values you get an error:

julia> df1 = DataFrame(id=[1, missing, 3], a=1:3)
3×2 DataFrame
 Row │ id       a
     │ Int64?   Int64
─────┼────────────────
   1 │       1      1
   2 │ missing      2
   3 │       3      3

julia> df2 = DataFrame(id=[1, 2, missing], b=1:3)
3×2 DataFrame
 Row │ id       b
     │ Int64?   Int64
─────┼────────────────
   1 │       1      1
   2 │       2      2
   3 │ missing      3

julia> innerjoin(df1, df2, on=:id)
ERROR: ArgumentError: Missing values in key columns are not allowed when matchmissing == :error. `missing` found in column :id in left data frame.

If you would prefer missing values to be treated as equal pass the matchmissing=:equal keyword argument:

julia> innerjoin(df1, df2, on=:id, matchmissing=:equal)
2×3 DataFrame
 Row │ id       a      b
     │ Int64?   Int64  Int64
─────┼───────────────────────
   1 │       1      1      1
   2 │ missing      2      3

Alternatively you might want to drop all rows with missing values. In this case pass matchmissing=:notequal:

julia> innerjoin(df1, df2, on=:id, matchmissing=:notequal)
1×3 DataFrame
 Row │ id      a      b
     │ Int64?  Int64  Int64
─────┼──────────────────────
   1 │      1      1      1

Specifying row order in the join result
By default the order of rows produced by the join operation is undefined:

julia> df_left = DataFrame(id=[1, 2, 4, 5], left=1:4)
4×2 DataFrame
 Row │ id     left
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     4      3
   4 │     5      4

julia> df_right = DataFrame(id=[2, 1, 3, 6, 7], right=1:5)
5×2 DataFrame
 Row │ id     right
     │ Int64  Int64
─────┼──────────────
   1 │     2      1
   2 │     1      2
   3 │     3      3
   4 │     6      4
   5 │     7      5

julia> outerjoin(df_left, df_right, on=:id)
7×3 DataFrame
 Row │ id     left     right
     │ Int64  Int64?   Int64?
─────┼─────────────────────────
   1 │     2        2        1
   2 │     1        1        2
   3 │     4        3  missing
   4 │     5        4  missing
   5 │     3  missing        3
   6 │     6  missing        4
   7 │     7  missing        5

If you would like the result to keep the row order of the left table pass the order=:left keyword argument:

julia> outerjoin(df_left, df_right, on=:id, order=:left)
7×3 DataFrame
 Row │ id     left     right
     │ Int64  Int64?   Int64?
─────┼─────────────────────────
   1 │     1        1        2
   2 │     2        2        1
   3 │     4        3  missing
   4 │     5        4  missing
   5 │     3  missing        3
   6 │     6  missing        4
   7 │     7  missing        5

Note that in this case keys missing from the left table are put after the keys present in it.

Similarly order=:right keeps the order of the right table (and puts keys not present in it at the end):

julia> outerjoin(df_left, df_right, on=:id, order=:right)
7×3 DataFrame
 Row │ id     left     right
     │ Int64  Int64?   Int64?
─────┼─────────────────────────
   1 │     2        2        1
   2 │     1        1        2
   3 │     3  missing        3
   4 │     6  missing        4
   5 │     7  missing        5
   6 │     4        3  missing
   7 │     5        4  missing

In-place left join
A common operation is adding data from a reference table to some main table. It is possible to perform such an in-place update using the leftjoin! function. In this case the left table is updated in place with matching rows from the right table.

julia> main = DataFrame(id=1:4, main=1:4)
4×2 DataFrame
 Row │ id     main
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     3      3
   4 │     4      4

julia> leftjoin!(main, DataFrame(id=[2, 4], info=["a", "b"]), on=:id);

julia> main
4×3 DataFrame
 Row │ id     main   info
     │ Int64  Int64  String?
─────┼───────────────────────
   1 │     1      1  missing
   2 │     2      2  a
   3 │     3      3  missing
   4 │     4      4  b

Note that in this case the order and number of rows in the left table is not changed. Therefore, in particular, it is not allowed to have duplicate keys in the right table:

julia> leftjoin!(main, DataFrame(id=[2, 2], info_bad=["a", "b"]), on=:id)
ERROR: ArgumentError: duplicate rows found in right table

## rag_part1_by_hand.jl
# # Title: End-to-end RAG example - low-level
# This script runs an end-to-end RAG applications over text documents in `documents/` folder

# Package Imports
# using OpenAI # for interacting with the API, import if you want to skip the first part of tutorial
using JSON3
using HTTP
using LinearAlgebra

# ! Config
const MODEL_CHAT = "gpt-3.5-turbo" # default model for chat, fast and "cheap"
const MODEL_EMBEDDING = "text-embedding-ada-002" # default model for embedding test
const API_KEY = ENV["OPENAI_API_KEY"]; # do not commit, share or sync anywhere!

# Define shared abstract type for custom printing
abstract type AbstractBuildingBlock end

# Convenience for building "messages" for the chatbot faster (see the standard format in the API calls section...)
@kwdef struct PromptTemplate <: AbstractBuildingBlock
    system_prompt::Union{String, Nothing} = nothing
    user_prompt::String = ""
end

# Return type for the AI model
@kwdef struct AIMessage <: AbstractBuildingBlock
    content::AbstractString
    status::Union{Int, Nothing} = nothing
    tokens::Tuple{Int, Int} = (-1, -1)
    elapsed::Float64 = -1.0
end

# Stores document chunks and their embeddings
@kwdef struct ChunkIndex{T <: AbstractString} <: AbstractBuildingBlock
    id::Symbol = gensym("ChunkIndex")
    embeddings::Matrix{Float32}
    chunks::Vector{T}
    sources::Vector{<:AbstractString}
end

# Structured show method for easier reading (each kwarg on a new line)
function Base.show(io::IO, t::AbstractBuildingBlock)
    dump(IOContext(io, :limit => true), t, maxdepth = 1)
end

# Utitity to be able to combine indices from different sources/documents easily
function Base.vcat(i1::ChunkIndex{T}, i2::ChunkIndex{T}) where {T <: AbstractString}
    ChunkIndex(;
        embeddings = hcat(i1.embeddings, i2.embeddings),
        chunks = vcat(i1.chunks, i2.chunks),
        sources = vcat(i1.sources, i2.sources))
end

## Useful functions for API calls
# defined in OpenAI.jl (more sophisticated version of it...)
function create_chat(api_key::String,
        model::String,
        messages::Vector{Dict{String, String}};
        http_kwargs::NamedTuple = NamedTuple(), api_kwargs::NamedTuple = NamedTuple())
    url = "https://api.openai.com/v1/chat/completions"
    headers = Dict("Authorization" => "Bearer $api_key",
        "Content-Type" => "application/json")
    # JSON-encoded data (string)
    body = (; messages, model, api_kwargs...) |> JSON3.write
    resp = HTTP.request("POST", url; body, headers, http_kwargs...)
    return (;
        response = JSON3.read(resp.body),
        status = resp.status)
end

"Builds a history of the conversation (=messages) to provide the prompt to the API. All kwargs are passed as replacements such that `{{key}} => value` in the template.}}"
function render(prompt::PromptTemplate; kwargs...)
    conversation = Dict{String, String}[]
    !isnothing(prompt.system_prompt) &&
        push!(conversation, Dict("role" => "system", "content" => prompt.system_prompt))
    # Replace any handlebar-style placeholders like `{{key}}` in the user_prompt with user-provided kwargs
    user_prompt = replace(prompt.user_prompt, ["{{$(k)}}" => v for (k, v) in kwargs]...)
    push!(conversation, Dict("role" => "user", "content" => user_prompt))
    return conversation
end

"Builds the request to the API and waits for the response."
function aigenerate(template::PromptTemplate;
        api_key::String = API_KEY,
        model::String = MODEL_CHAT,
        # Let's use smart defaults because OpenAI is a bit fiddly...
        http_kwargs::NamedTuple = (;
            retry_non_idempotent = true,
            retries = 10,
            readtimeout = 30), api_kwargs::NamedTuple = NamedTuple(),
        kwargs...)
    ##
    conversation = render(template; kwargs...)
    time = @elapsed r = create_chat(api_key,
        model,
        conversation;
        http_kwargs,
        api_kwargs...)
    return AIMessage(; content = r.response[:choices][begin][:message][:content] |> strip,
        status = Int(r.status),
        tokens = (r.response[:usage][:prompt_tokens],
            r.response[:usage][:completion_tokens]),
        elapsed = time)
end

"Creates embeddings for `docs` (string or array of strings) and returns a normalized matrix (column-wise)"
function aiembed(docs::Union{AbstractString, Vector{<:AbstractString}},
        postprocess::F = normalize;
        api_key::String = API_KEY,
        model::String = MODEL_EMBEDDING,
        http_kwargs::NamedTuple = NamedTuple(), api_kwargs::NamedTuple = NamedTuple(),
        kwargs...) where {F <: Function}
    r = create_embeddings(api_key, docs, model; http_kwargs, api_kwargs...)
    return mapreduce(x -> postprocess(x[:embedding]), hcat, r.response.data)
end

"RAG wrapper that answers the given question and inject the context if needed from `index`"
function airag(index::ChunkIndex, rag_template::PromptTemplate;
        question::AbstractString, top_k::Int = 3, kwargs...)
    question_emb = aiembed(question)

    idxs = find_closest(index, question_emb; top_k)
    close_chunks = [join(index.chunks[max(begin, i - 2):min(end, i + 2)], "\n")
                    for i in idxs]
    return aigenerate(rag_template;
        question,
        context = join(close_chunks, "\n\n"),
        kwargs...)
end
## Useful functions for document chunk retrieval
"Splits `doc` into text chunks of size at most `max_size` (in characters), ie, it accumulates smaller chunks to match the desired size"
function build_chunks(doc::AbstractString;
        max_size::Int = 128,
        split_pattern::Union{String, Regex} = r"\n|\. ",
        join_key::String = "\n")
    ## shortcut if doc is too short
    length(doc) < max_size && return [doc]
    ## proceed
    texts = split(doc, split_pattern)
    doc_chunks = Vector{eltype(texts)}()
    start, counter = 1, 0
    # accumulate chunks until we reach the max size
    for i in eachindex(texts)
        l = length(texts[i])
        # if it doesn't fit, we push all preceeding docs, reset the counter and start a new chunk
        if l == 0 || (counter + l >= max_size)
            push!(doc_chunks, join(texts[start:max(i - 1, 1)], join_key))
            start = i # current text becomes the next chunk
            counter = 0
        end
        counter += l
    end
    # last chunk is never pushed in, so we need to do it manually
    push!(doc_chunks, join(texts[start:end], join_key))
    return doc_chunks
end

"Finds the indices of chunks (represented by embeddings in `emb`) that are closest (cosine similarity) to query embedding (`query_emb`). Returns only `top_k` closest indices."
function find_closest(emb::AbstractMatrix{<:Real},
        query_emb::AbstractVector{<:Real};
        top_k::Int = 100)
    query_emb' * emb |> vec |> sortperm |> reverse |> x -> first(x, top_k)
end
function find_closest(index::ChunkIndex,
        query_emb::AbstractVector{<:Real};
        top_k::Int = 100)
    find_closest(index.embeddings, query_emb; top_k)
end
# # Main
# ## Preparation

# Before you do anything, put together 10 questions that a user might ask and potential answers based on the sources you want to use.
# It helps you think through:
# - what are the sources you want to use? how is the data structured?
# - what are the questions you want to answer? how do you want to answer them?
#
# Moreover, there cannot be a RAG application without an evaluation set.
# So these questions will allow you to measure how well you're progressing.
# Ideally, copy the source text you used, so you can later on measure if your RAG app used the "right" context as a source for the answer.

# Now, let's go to the documentation of DataFrames.jl and let's download a few pages from there.
# Just copy & paste the text into a text file and save it in `documents/` folder.
# You might want to remove the useless data at the beginning and end to make your life easier - this is a tutorial afterall :)

# ## Load the data
# Example
dir_raw = joinpath("documents")
files = ["comparison_with_python.txt", "database_style_joins.txt", "what_is_dataframes.txt"]
labels = ["DataFrames-Python", "DataFrames-Joins", "DataFrames-WhatIs"]

# Embed documents and build ChunkIndex (bundle of embeddings, chunks and their respective sources)
indices = ChunkIndex[]
for (fn, lbl) in zip(files, labels)
    doc_raw = read(joinpath(dir_raw, fn), String)
    # split into chunks, if you want to start simple - just do `split(text,"\n\n")`
    doc_chunks = build_chunks(doc_raw; max_size = 256, split_pattern = "\n\n") |>
                 texts -> mapreduce(x -> build_chunks(x;
            max_size = 256,
            split_pattern = r"\n|\. "),
        vcat,
        texts)
    # Notice that we embed all doc_chunks at once, not one by one
    # OpenAI supports embedding multiple documents if they are short enough, ie, it's only because the documentation pages are small
    embeddings = aiembed(doc_chunks) .|> Float32
    index = ChunkIndex(;
        embeddings,
        chunks = doc_chunks,
        sources = fill(lbl, length(doc_chunks)))
    push!(indices, index)
end
index = reduce(vcat, indices) # combine it across several documents

# ## Ask questions - Step-by-Step
rag_template = PromptTemplate(;
    system_prompt = "Act as a world-class AI assistant and an expert in Julia language. Answer the question based only on the provided context. Be brief and concise.",
    user_prompt = """
      # Context

      {{context}}

      # Question

      {{question}}

      # Answer
      """)

question = "I like dplyr, what is the equivalent in Julia?"
question_emb = aiembed(question)

# Build the context of similar docs -- take the top 3 closest chunks
idxs = find_closest(index, question_emb; top_k = 3)
# We add 2 chunks before and after each of the closest chunk
close_chunks = [join(index.chunks[max(begin, i - 2):min(end, i + 2)], "\n")
                for i in idxs]
answer = aigenerate(rag_template;
    question,
    context = join(close_chunks, "\n\n"))
println(answer.content)

# ## Ask questions E2E
question = "I like dplyr, what is the equivalent in Julia?"
answer = airag(index, rag_template; question)
println(answer.content)

## what_is_dataframes.txt
Welcome to the DataFrames.jl documentation!

This resource aims to teach you everything you need to know to get up and running with tabular data manipulation using the DataFrames.jl package.

For more illustrations of DataFrames.jl usage, in particular in conjunction with other packages you can check-out the following resources (they are kept up to date with the released version of DataFrames.jl):

What is DataFrames.jl?
DataFrames.jl provides a set of tools for working with tabular data in Julia. Its design and functionality are similar to those of pandas (in Python) and data.frame, data.table and dplyr (in R), making it a great general purpose data science tool.

DataFrames.jl plays a central role in the Julia Data ecosystem, and has tight integrations with a range of different libraries. DataFrames.jl isn't the only tool for working with tabular data in Julia – as noted below, there are some other great libraries for certain use-cases – but it provides great data wrangling functionality through a familiar interface.

To understand the toolchain in more detail, have a look at the tutorials in this manual. New users can start with the First Steps with DataFrames.jl section.

You may find the DataFramesMeta.jl package or one of the other convenience packages discussed in the Data manipulation frameworks section of this manual helpful when writing more advanced data transformations, especially if you do not have a significant programming experience. These packages provide convenience syntax similar to dplyr in R.

If you use metadata when working with DataFrames.jl you might find the TableMetadataTools.jl package useful. This package defines several convenience functions for performing typical metadata operations.

DataFrames.jl and the Julia Data Ecosystem
The Julia data ecosystem can be a difficult space for new users to navigate, in part because the Julia ecosystem tends to distribute functionality across different libraries more than some other languages. Because many people coming to DataFrames.jl are just starting to explore the Julia data ecosystem, below is a list of well-supported libraries that provide different data science tools, along with a few notes about what makes each library special, and how well integrated they are with DataFrames.jl.

Statistics
StatsKit.jl: A convenience meta-package which loads a set of essential packages for statistics, including those mentioned below in this section and DataFrames.jl itself.
Statistics: The Julia standard library comes with a wide range of statistics functionality, but to gain access to these functions you must call using Statistics.
LinearAlgebra: Like Statistics, many linear algebra features (factorizations, inversions, etc.) live in a library you have to load to use.
SparseArrays are also in the standard library but must be loaded to be used.
FreqTables.jl: Create frequency tables / cross-tabulations. Tightly integrated with DataFrames.jl.
HypothesisTests.jl: A range of hypothesis testing tools.
GLM.jl: Tools for estimating linear and generalized linear models. Tightly integrated with DataFrames.jl.
StatsModels.jl: For converting heterogeneous DataFrame into homogeneous matrices for use with linear algebra libraries or machine learning applications that don't directly support DataFrames. Will do things like convert categorical variables into indicators/one-hot-encodings, create interaction terms, etc.
MultivariateStats.jl: linear regression, ridge regression, PCA, component analyses tools. Not well integrated with DataFrames.jl, but easily used in combination with StatsModels.
Machine Learning
MLJ.jl: if you're more of an applied user, there is a single package the pulls from all these different libraries and provides a single, scikit-learn inspired API: MLJ.jl. MLJ.jl provides a common interface for a wide range of machine learning algorithms.
ScikitLearn.jl: A Julia wrapper around the full Python scikit-learn machine learning library. Not well integrated with DataFrames.jl, but can be combined using StatsModels.jl.
AutoMLPipeline: A package that makes it trivial to create complex ML pipeline structures using simple expressions. It leverages on the built-in macro programming features of Julia to symbolically process, manipulate pipeline expressions, and makes it easy to discover optimal structures for machine learning regression and classification.
Deep learning: KNet.jl and Flux.jl.
Plotting
Plots.jl: Powerful, modern plotting library with a syntax akin to that of matplotlib (in Python) or plot (in R). StatsPlots.jl provides Plots.jl with recipes for many standard statistical plots.
Gadfly.jl: High-level plotting library with a "grammar of graphics" syntax akin to that of ggplot (in R).
AlgebraOfGraphics.jl: A "grammar of graphics" library build upon Makie.jl.
VegaLite.jl: High-level plotting library that uses a different "grammar of graphics" syntax and has an emphasis on interactive graphics.
Data Wrangling:
Impute.jl: various methods for handling missing data in vectors, matrices and tables.
DataFramesMeta.jl: A range of convenience functions for DataFrames.jl that augment select and transform to provide a user experience similar to that provided by dplyr in R.
DataFrameMacros.jl: Provides macro versions of the common DataFrames.jl functions similar to DataFramesMeta.jl, with convenient syntax for the manipulation of multiple columns at once.
Query.jl: Query.jl provides a single framework for data wrangling that works with a range of libraries, including DataFrames.jl, other tabular data libraries (more on those below), and even non-tabular data. Provides many convenience functions analogous to those in dplyr in R or LINQ.
You can find more information on these packages in the Data manipulation frameworks section of this manual.
And More!
Graphs.jl: A pure-Julia, high performance network analysis library. Edgelists in DataFrames can be easily converted into graphs using the GraphDataFrameBridge.jl package.
IO:
DataFrames.jl work well with a range of formats, including:
CSV files (using CSV.jl),
Apache Arrow (using Arrow.jl)
reading Stata, SAS and SPSS files (using ReadStatTables.jl; alternatively Queryverse users can choose StatFiles.jl),
Parquet files (using Parquet2.jl),
reading R data files (.rda, .RData) (using RData.jl).
While not all of these libraries are tightly integrated with DataFrames.jl, because DataFrames are essentially collections of aligned Julia vectors, so it is easy to (a) pull out a vector for use with a non-DataFrames-integrated library, or (b) convert your table into a homogeneously-typed matrix using the Matrix constructor or StatsModels.jl.

Other Julia Tabular Libraries
DataFrames.jl is a great general purpose tool for data manipulation and wrangling, but it's not ideal for all applications. For users with more specialized needs, consider using:

TypedTables.jl: Type-stable heterogeneous tables. Useful for improved performance when the structure of your table is relatively stable and does not feature thousands of columns.
JuliaDB.jl: For users working with data that is too large to fit in memory, we suggest JuliaDB.jl, which offers better performance for large datasets, and can handle out-of-core data manipulations (Python users can think of JuliaDB.jl as the Julia version of dask).
Note that most tabular data libraries in the Julia ecosystem (including DataFrames.jl) support a common interface (defined in the Tables.jl package). As a result, some libraries are capable or working with a range of tabular data structures, making it easy to move between tabular libraries as your needs change. A user of Query.jl, for example, can use the same code to manipulate data in a DataFrame, a Table (defined by TypedTables.jl), or a JuliaDB table.

Questions?
If there is something you expect DataFrames to be capable of, but cannot figure out how to do, please reach out with questions in Domains/Data on Discourse. Additionally you might want to listen to an introduction to DataFrames.jl on JuliaAcademy.

Please report bugs by opening an issue.

You can follow the source links throughout the documentation to jump right to the source files on GitHub to make pull requests for improving the documentation and function capabilities.

Please review DataFrames contributing guidelines before submitting your first PR!

Information on specific versions can be found on the Release page.

Package Manual
First Steps with DataFrames.jl
Setting up the Environment
Constructors and Basic Utility Functions
Getting and Setting Data in a Data Frame
Basic Usage of Transformation Functions
Getting Started
Installation
The DataFrame Type
Database-Style Joins
Introduction to joins
Key value comparisons and floating point values
Joining on key columns with different names
Handling of duplicate keys and tracking source data frame
Renaming joined columns
Matching missing values in joins
Specifying row order in the join result
In-place left join
The Split-Apply-Combine Strategy
Design of the split-apply-combine support
Examples of the split-apply-combine operations
Using GroupedDataFrame as an iterable and indexable object
Simulating the SQL where clause
Column-independent operations
Column-independent operations versus functions
Specifying group order in groupby
Reshaping and Pivoting Data
Sorting
Categorical Data
Missing Data
Comparisons
Comparison with the Python package pandas
Comparison with the R package dplyr
Comparison with the R package data.table
Comparison with Stata (version 8 and above)
Data manipulation frameworks
DataFramesMeta.jl
DataFrameMacros.jl
Query.jl
API
Only exported (i.e. available for use without DataFrames. qualifier after loading the DataFrames.jl package with using DataFrames) types and functions are considered a part of the public API of the DataFrames.jl package. In general all such objects are documented in this manual (in case some documentation is missing please kindly report an issue here).

Note
Breaking changes to public and documented API are avoided in DataFrames.jl where possible.

The following changes are not considered breaking:

specific floating point values computed by operations may change at any time; users should rely only on approximate accuracy;
in functions that use the default random number generator provided by Base Julia the specific random numbers computed may change across Julia versions;
if the changed functionality is classified as a bug;
if the changed behavior was not documented; two major cases are:
in its implementation some function accepted a wider range of arguments that it was documented to handle - changes in handling of undocumented arguments are not considered as breaking;
the type of the value returned by a function changes, but it still follows the contract specified in the documentation; for example if a function is documented to return a vector then changing its type from Vector to PooledVector is not considered as breaking;
error behavior: code that threw an exception can change exception type thrown or stop throwing an exception;
changes in display (how objects are printed);
changes to the state of global objects from Base Julia whose state normally is considered volatile (e.g. state of global random number generator).
All types and functions that are part of public API are guaranteed to go through a deprecation period before a breaking change is made to them or they would be removed.

The standard practice is that breaking changes are implemented when a major release of DataFrames.jl is made (e.g. functionalities deprecated in a 1.x release would be changed in the 2.0 release).

In rare cases a breaking change might be introduced in a minor release. In such a case the changed behavior still goes through one minor release during which it is deprecated. The situations where such a breaking change might be allowed are (still such breaking changes will be avoided if possible):

the affected functionality was previously clearly identified in the documentation as being subject to changes (for example in DataFrames.jl 1.4 release propagation rules of :note-style metadata are documented as such);
the change is on the border of being classified as a bug (in rare cases even if a behavior of some function was documented its consequences for certain argument combinations could be decided to be unintended and not wanted);
the change is needed to adjust DataFrames.jl functionality to changes in Base Julia.
Please be warned that while Julia allows you to access internal functions or types of DataFrames.jl these can change without warning between versions of DataFrames.jl. In particular it is not safe to directly access fields of types that are a part of public API of the DataFrames.jl package using e.g. the getfield function. Whenever some operation on fields of defined types is considered allowed an appropriate exported function should be used instead.
	Comparisons
	This section compares DataFrames.jl with other data manipulation frameworks in Python, R, and Stata.

	A sample data set can be created using the following code:

	using DataFrames
	using Statistics

	df = DataFrame(grp=repeat(1:2, 3), x=6:-1:1, y=4:9, z=[3:7; missing], id='a':'f')
	df2 = DataFrame(grp=[1, 3], w=[10, 11])

	Note
	Some of the operations mutate the tables so every operation assumes that it is done on the original data frame.

	Note that in the comparisons presented below predicates like x -> x >= 1 can be more compactly written as =>(1). The latter form has an additional benefit that it is compiled only once per Julia session (as opposed to x -> x >= 1 which defines a new anonymous function every time it is introduced).

	Comparison with the Python package pandas
	The following table compares the main functions of DataFrames.jl with the Python package pandas (version 1.1.0):

	import pandas as pd
	import numpy as np

	df = pd.DataFrame({'grp': [1, 2, 1, 2, 1, 2],
	'x': range(6, 0, -1),
	'y': range(4, 10),
	'z': [3, 4, 5, 6, 7, None]},
	index = list('abcdef'))
	df2 = pd.DataFrame({'grp': [1, 3], 'w': [10, 11]})

	Because pandas supports multi-index, this example data frame is set up with a to f as row indices rather than a separate id column.

	Accessing data
	Operation pandas DataFrames.jl
	Cell indexing by location df.iloc[1, 1] df[2, 2]
	Row slicing by location df.iloc[1:3] df[2:3, :]
	Column slicing by location df.iloc[:, 1:] df[:, 2:end]
	Row indexing by label df.loc['c'] df[findfirst(==('c'), df.id), :]
	Column indexing by label df.loc[:, 'x'] df[:, :x]
	Column slicing by label df.loc[:, ['x', 'z']] df[:, [:x, :z]]
	df.loc[:, 'x':'z'] df[:, Between(:x, :z)]
	Mixed indexing df.loc['c'][1] df[findfirst(==('c'), df.id), 2]
	Note that Julia uses 1-based indexing, inclusive on both ends. A special keyword end can be used to indicate the last index. Likewise, the begin keyword can be used to indicate the first index.

	In addition, when indexing a data frame with the findfirst function, a single DataFrameRow object is returned. In the case that id is not unique, you can use the findall function or boolean indexing instead. It would then return a DataFrame object containing all matched rows. The following two lines of code are functionally equivalent:

	df[findall(==('c'), df.id), :]
	df[df.id .== 'c', :]

	DataFrames.jl's indexing always produces a consistent and predictable return type. By contrast, pandas' loc function returns a Series object when there is exactly one 'c' value in the index, and it returns a DataFrame object when there are multiple rows having the index value of 'c'.

	Common operations
	Operation pandas DataFrames.jl
	Reduce multiple values df['z'].mean(skipna = False) mean(df.z)
	df['z'].mean() mean(skipmissing(df.z))
	df[['z']].agg(['mean']) combine(df, :z => mean ∘ skipmissing)
	Add new columns df.assign(z1 = df['z'] + 1) transform(df, :z => (v -> v .+ 1) => :z1)
	Rename columns df.rename(columns = {'x': 'x_new'}) rename(df, :x => :x_new)
	Pick & transform columns df.assign(x_mean = df['x'].mean())[['x_mean', 'y']] select(df, :x => mean, :y)
	Sort rows df.sort_values(by = 'x') sort(df, :x)
	df.sort_values(by = ['grp', 'x'], ascending = [True, False]) sort(df, [:grp, order(:x, rev = true)])
	Drop missing rows df.dropna() dropmissing(df)
	Select unique rows df.drop_duplicates() unique(df)
	Note that pandas skips NaN values in its analytic functions by default. By contrast, Julia functions do not skip NaN's. If necessary, you can filter out the NaN's before processing, for example, mean(Iterators.filter(!isnan, x)).

	Pandas uses NaN for representing both missing data and the floating point "not a number" value. Julia defines a special value missing for representing missing data. DataFrames.jl respects general rules in Julia in propagating missing values by default. If necessary, the skipmissing function can be used to remove missing data. See the Missing Data section for more information.

	In addition, pandas keeps the original column name after applying a function. DataFrames.jl appends a suffix to the column name by default. To keep it simple, the examples above do not synchronize the column names between pandas and DataFrames.jl (you can pass renamecols=false keyword argument to select, transform and combine functions to retain old column names).

	Mutating operations
	Operation pandas DataFrames.jl
	Add new columns df['z1'] = df['z'] + 1 df.z1 = df.z .+ 1
	transform!(df, :z => (x -> x .+ 1) => :z1)
	df.insert(1, 'const', 10) insertcols!(df, 2, :const => 10)
	Rename columns df.rename(columns = {'x': 'x_new'}, inplace = True) rename!(df, :x => :x_new)
	Sort rows df.sort_values(by = 'x', inplace = True) sort!(df, :x)
	Drop missing rows df.dropna(inplace = True) dropmissing!(df)
	Select unique rows df.drop_duplicates(inplace = True) unique!(df)
	Generally speaking, DataFrames.jl follows the Julia convention of using ! in the function name to indicate mutation behavior.

	Grouping data and aggregation
	DataFrames.jl provides a groupby function to apply operations over each group independently. The result of groupby is a GroupedDataFrame object which may be processed using the combine, transform, or select functions. The following table illustrates some common grouping and aggregation usages.

	Operation pandas DataFrames.jl
	Aggregate by groups df.groupby('grp')['x'].mean() combine(groupby(df, :grp), :x => mean)
	Rename column after aggregation df.groupby('grp')['x'].mean().rename("my_mean") combine(groupby(df, :grp), :x => mean => :my_mean)
	Add aggregated data as column df.join(df.groupby('grp')['x'].mean(), on='grp', rsuffix='_mean') transform(groupby(df, :grp), :x => mean)
	...and select output columns df.join(df.groupby('grp')['x'].mean(), on='grp', rsuffix='_mean')[['grp', 'x_mean']] select(groupby(df, :grp), :id, :x => mean)
	Note that pandas returns a Series object for 1-dimensional result unless reset_index is called afterwards. The corresponding DataFrames.jl examples return an equivalent DataFrame object. Consider the first example:

	>>> df.groupby('grp')['x'].mean()
	grp
	1 4
	2 3
	Name: x, dtype: int64

	For DataFrames.jl, it looks like this:

	julia> combine(groupby(df, :grp), :x => mean)
	2×2 DataFrame
	Row │ grp x_mean
	│ Int64 Float64
	─────┼────────────────
	1 │ 1 4.0
	2 │ 2 3.0

	In DataFrames.jl, the GroupedDataFrame object supports an efficient key lookup. Hence, it performs well when you need to perform lookups repeatedly.

	More advanced commands
	This section includes more complex examples.

	Operation pandas DataFrames.jl
	Complex Function df[['z']].agg(lambda v: np.mean(np.cos(v))) combine(df, :z => v -> mean(cos, skipmissing(v)))
	Aggregate multiple columns df.agg({'x': max, 'y': min}) combine(df, :x => maximum, :y => minimum)
	df[['x', 'y']].mean() combine(df, [:x, :y] .=> mean)
	df.filter(regex=("^x")).mean() combine(df, names(df, r"^x") .=> mean)
	Apply function over multiple variables df.assign(x_y_cor = np.corrcoef(df.x, df.y)[0, 1]) transform(df, [:x, :y] => cor)
	Row-wise operation df.assign(x_y_min = df.apply(lambda v: min(v.x, v.y), axis=1)) transform(df, [:x, :y] => ByRow(min))
	df.assign(x_y_argmax = df.apply(lambda v: df.columns[v.argmax()], axis=1)) transform(df, AsTable([:x, :y]) => ByRow(argmax))
	DataFrame as input df.groupby('grp').head(2) combine(d -> first(d, 2), groupby(df, :grp))
	DataFrame as output df[['x']].agg(lambda x: [min(x), max(x)]) combine(df, :x => (x -> (x=[minimum(x), maximum(x)],)) => AsTable)
	Note that pandas preserves the same row order after groupby whereas DataFrames.jl shows them grouped by the provided keys after the combine operation, but select and transform retain an original row ordering.

	Joining data frames
	DataFrames.jl supports join operations similar to a relational database.

	Operation pandas DataFrames.jl
	Inner join pd.merge(df, df2, how = 'inner', on = 'grp') innerjoin(df, df2, on = :grp)
	Outer join pd.merge(df, df2, how = 'outer', on = 'grp') outerjoin(df, df2, on = :grp)
	Left join pd.merge(df, df2, how = 'left', on = 'grp') leftjoin(df, df2, on = :grp)
	Right join pd.merge(df, df2, how = 'right', on = 'grp') rightjoin(df, df2, on = :grp)
	Semi join (filtering) df[df.grp.isin(df2.grp)] semijoin(df, df2, on = :grp)
	Anti join (filtering) df[~df.grp.isin(df2.grp)] antijoin(df, df2, on = :grp)
	For multi-column joins, both pandas and DataFrames.jl accept an array for the on keyword argument.

	In the cases of semi joins and anti joins, the isin function in pandas can still be used as long as the join keys are combined in a tuple. In DataFrames.jl, it just works normally with an array of join keys specified in the on keyword argument.

	Comparison with the R package dplyr
	The following table compares the main functions of DataFrames.jl with the R package dplyr (version 1):

	df <- tibble(grp = rep(1:2, 3), x = 6:1, y = 4:9,
	z = c(3:7, NA), id = letters[1:6])

	Operation dplyr DataFrames.jl
	Reduce multiple values summarize(df, mean(x)) combine(df, :x => mean)
	Add new columns mutate(df, x_mean = mean(x)) transform(df, :x => mean => :x_mean)
	Rename columns rename(df, x_new = x) rename(df, :x => :x_new)
	Pick columns select(df, x, y) select(df, :x, :y)
	Pick & transform columns transmute(df, mean(x), y) select(df, :x => mean, :y)
	Pick rows filter(df, x >= 1) subset(df, :x => ByRow(x -> x >= 1))
	Sort rows arrange(df, x) sort(df, :x)
	As in dplyr, some of these functions can be applied to grouped data frames, in which case they operate by group:

	Operation dplyr DataFrames.jl
	Reduce multiple values summarize(group_by(df, grp), mean(x)) combine(groupby(df, :grp), :x => mean)
	Add new columns mutate(group_by(df, grp), mean(x)) transform(groupby(df, :grp), :x => mean)
	Pick & transform columns transmute(group_by(df, grp), mean(x), y) select(groupby(df, :grp), :x => mean, :y)
	The table below compares more advanced commands:

	Operation dplyr DataFrames.jl
	Complex Function summarize(df, mean(x, na.rm = T)) combine(df, :x => x -> mean(skipmissing(x)))
	Transform several columns summarize(df, max(x), min(y)) combine(df, :x => maximum, :y => minimum)
	summarize(df, across(c(x, y), mean)) combine(df, [:x, :y] .=> mean)
	summarize(df, across(starts_with("x"), mean)) combine(df, names(df, r"^x") .=> mean)
	summarize(df, across(c(x, y), list(max, min))) combine(df, ([:x, :y] .=> [maximum minimum])...)
	Multivariate function mutate(df, cor(x, y)) transform(df, [:x, :y] => cor)
	Row-wise mutate(rowwise(df), min(x, y)) transform(df, [:x, :y] => ByRow(min))
	mutate(rowwise(df), which.max(c_across(matches("^x")))) transform(df, AsTable(r"^x") => ByRow(argmax))
	DataFrame as input summarize(df, head(across(), 2)) combine(d -> first(d, 2), df)
	DataFrame as output summarize(df, tibble(value = c(min(x), max(x)))) combine(df, :x => (x -> (value = [minimum(x), maximum(x)],)) => AsTable)
	Comparison with the R package data.table
	The following table compares the main functions of DataFrames.jl with the R package data.table (version 1.14.1).

	library(data.table)
	df <- data.table(grp = rep(1:2, 3), x = 6:1, y = 4:9,
	z = c(3:7, NA), id = letters[1:6])
	df2 <- data.table(grp=c(1,3), w = c(10,11))

	Operation data.table DataFrames.jl
	Reduce multiple values df[, .(mean(x))] combine(df, :x => mean)
	Add new columns df[, x_mean:=mean(x) ] transform!(df, :x => mean => :x_mean)
	Rename column (in place) setnames(df, "x", "x_new") rename!(df, :x => :x_new)
	Rename multiple columns (in place) setnames(df, c("x", "y"), c("x_new", "y_new")) rename!(df, [:x, :y] .=> [:x_new, :y_new])
	Pick columns as dataframe df[, .(x, y)] select(df, :x, :y)
	Pick column as a vector df[, x] df[!, :x]
	Remove columns df[, -"x"] select(df, Not(:x))
	Remove columns (in place) df[, x:=NULL] select!(df, Not(:x))
	Remove columns (in place) df[, c("x", "y"):=NULL] select!(df, Not([:x, :y]))
	Pick & transform columns df[, .(mean(x), y)] select(df, :x => mean, :y)
	Pick rows df[ x >= 1 ] filter(:x => >=(1), df)
	Sort rows (in place) setorder(df, x) sort!(df, :x)
	Sort rows df[ order(x) ] sort(df, :x)
	Grouping data and aggregation
	Operation data.table DataFrames.jl
	Reduce multiple values df[, mean(x), by=id ] combine(groupby(df, :id), :x => mean)
	Add new columns (in place) df[, x_mean:=mean(x), by=id] transform!(groupby(df, :id), :x => mean)
	Pick & transform columns df[, .(x_mean = mean(x), y), by=id] select(groupby(df, :id), :x => mean, :y)
	More advanced commands
	Operation data.table DataFrames.jl
	Complex Function df[, .(mean(x, na.rm=TRUE)) ] combine(df, :x => x -> mean(skipmissing(x)))
	Transform certain rows (in place) df[x<=0, x:=0] df.x[df.x .<= 0] .= 0
	Transform several columns df[, .(max(x), min(y)) ] combine(df, :x => maximum, :y => minimum)
	df[, lapply(.SD, mean), .SDcols = c("x", "y") ] combine(df, [:x, :y] .=> mean)
	df[, lapply(.SD, mean), .SDcols = patterns("*x") ] combine(df, names(df, r"^x") .=> mean)
	df[, unlist(lapply(.SD, function(x) c(max=max(x), min=min(x)))), .SDcols = c("x", "y") ] combine(df, ([:x, :y] .=> [maximum minimum])...)
	Multivariate function df[, .(cor(x,y)) ] transform(df, [:x, :y] => cor)
	Row-wise df[, min_xy := min(x, y), by = 1:nrow(df)] transform!(df, [:x, :y] => ByRow(min))
	df[, argmax_xy := which.max(.SD) , .SDcols = patterns("*x"), by = 1:nrow(df) ] transform!(df, AsTable(r"^x") => ByRow(argmax))
	DataFrame as output df[, .SD[1], by=grp] combine(groupby(df, :grp), first)
	DataFrame as output df[, .SD[which.max(x)], by=grp] combine(groupby(df, :grp), sdf -> sdf[argmax(sdf.x), :])
	Joining data frames
	Operation data.table DataFrames.jl
	Inner join merge(df, df2, on = "grp") innerjoin(df, df2, on = :grp)
	Outer join merge(df, df2, all = TRUE, on = "grp") outerjoin(df, df2, on = :grp)
	Left join merge(df, df2, all.x = TRUE, on = "grp") leftjoin(df, df2, on = :grp)
	Right join merge(df, df2, all.y = TRUE, on = "grp") rightjoin(df, df2, on = :grp)
	Anti join (filtering) df[!df2, on = "grp" ] antijoin(df, df2, on = :grp)
	Semi join (filtering) merge(df1, df2[, .(grp)]) semijoin(df, df2, on = :grp)
	Comparison with Stata (version 8 and above)
	The following table compares the main functions of DataFrames.jl with Stata:

	Operation Stata DataFrames.jl
	Reduce multiple values collapse (mean) x combine(df, :x => mean)
	Add new columns egen x_mean = mean(x) transform!(df, :x => mean => :x_mean)
	Rename columns rename x x_new rename!(df, :x => :x_new)
	Pick columns keep x y select!(df, :x, :y)
	Pick rows keep if x >= 1 subset!(df, :x => ByRow(x -> x >= 1))
	Sort rows sort x sort!(df, :x)
	Note that the suffix ! (i.e. transform!, select!, etc) ensures that the operation transforms the dataframe in place, as in Stata

	Some of these functions can be applied to grouped data frames, in which case they operate by group:

	Operation Stata DataFrames.jl
	Add new columns egen x_mean = mean(x), by(grp) transform!(groupby(df, :grp), :x => mean)
	Reduce multiple values collapse (mean) x, by(grp) combine(groupby(df, :grp), :x => mean)
	The table below compares more advanced commands:

	Operation Stata DataFrames.jl
	Transform certain rows replace x = 0 if x <= 0 transform(df, :x => (x -> ifelse.(x .<= 0, 0, x)) => :x)
	Transform several columns collapse (max) x (min) y combine(df, :x => maximum, :y => minimum)
	collapse (mean) x y combine(df, [:x, :y] .=> mean)
	collapse (mean) x* combine(df, names(df, r"^x") .=> mean)
	collapse (max) x y (min) x y combine(df, ([:x, :y] .=> [maximum minimum])...)
	Multivariate function egen z = corr(x y) transform!(df, [:x, :y] => cor => :z)
	Row-wise egen z = rowmin(x y) transform!(df, [:x, :y] => ByRow(min) => :z)
	Database-Style Joins
	Introduction to joins
	We often need to combine two or more data sets together to provide a complete picture of the topic we are studying. For example, suppose that we have the following two data sets:

	julia> using DataFrames

	julia> people = DataFrame(ID=[20, 40], Name=["John Doe", "Jane Doe"])
	2×2 DataFrame
	Row │ ID Name
	│ Int64 String
	─────┼─────────────────
	1 │ 20 John Doe
	2 │ 40 Jane Doe

	julia> jobs = DataFrame(ID=[20, 40], Job=["Lawyer", "Doctor"])
	2×2 DataFrame
	Row │ ID Job
	│ Int64 String
	─────┼───────────────
	1 │ 20 Lawyer
	2 │ 40 Doctor

	We might want to work with a larger data set that contains both the names and jobs for each ID. We can do this using the innerjoin function:

	julia> innerjoin(people, jobs, on = :ID)
	2×3 DataFrame
	Row │ ID Name Job
	│ Int64 String String
	─────┼─────────────────────────
	1 │ 20 John Doe Lawyer
	2 │ 40 Jane Doe Doctor

	In relational database theory, this operation is generally referred to as a join. The columns used to determine which rows should be combined during a join are called keys.

	The following functions are provided to perform seven kinds of joins:

	innerjoin: the output contains rows for values of the key that exist in all passed data frames.
	leftjoin: the output contains rows for values of the key that exist in the first (left) argument, whether or not that value exists in the second (right) argument.
	rightjoin: the output contains rows for values of the key that exist in the second (right) argument, whether or not that value exists in the first (left) argument.
	outerjoin: the output contains rows for values of the key that exist in any of the passed data frames.
	semijoin: Like an inner join, but output is restricted to columns from the first (left) argument.
	antijoin: The output contains rows for values of the key that exist in the first (left) but not the second (right) argument. As with semijoin, output is restricted to columns from the first (left) argument.
	crossjoin: The output is the cartesian product of rows from all passed data frames.
	See the Wikipedia page on SQL joins for more information.

	Here are examples of different kinds of join:

	julia> jobs = DataFrame(ID=[20, 60], Job=["Lawyer", "Astronaut"])
	2×2 DataFrame
	Row │ ID Job
	│ Int64 String
	─────┼──────────────────
	1 │ 20 Lawyer
	2 │ 60 Astronaut

	julia> innerjoin(people, jobs, on = :ID)
	1×3 DataFrame
	Row │ ID Name Job
	│ Int64 String String
	─────┼─────────────────────────
	1 │ 20 John Doe Lawyer

	julia> leftjoin(people, jobs, on = :ID)
	2×3 DataFrame
	Row │ ID Name Job
	│ Int64 String String?
	─────┼──────────────────────────
	1 │ 20 John Doe Lawyer
	2 │ 40 Jane Doe missing

	julia> rightjoin(people, jobs, on = :ID)
	2×3 DataFrame
	Row │ ID Name Job
	│ Int64 String? String
	─────┼────────────────────────────
	1 │ 20 John Doe Lawyer
	2 │ 60 missing Astronaut

	julia> outerjoin(people, jobs, on = :ID)
	3×3 DataFrame
	Row │ ID Name Job
	│ Int64 String? String?
	─────┼────────────────────────────
	1 │ 20 John Doe Lawyer
	2 │ 40 Jane Doe missing
	3 │ 60 missing Astronaut

	julia> semijoin(people, jobs, on = :ID)
	1×2 DataFrame
	Row │ ID Name
	│ Int64 String
	─────┼─────────────────
	1 │ 20 John Doe

	julia> antijoin(people, jobs, on = :ID)
	1×2 DataFrame
	Row │ ID Name
	│ Int64 String
	─────┼─────────────────
	1 │ 40 Jane Doe

	Cross joins are the only kind of join that does not use a on key:

	julia> crossjoin(people, jobs, makeunique = true)
	4×4 DataFrame
	Row │ ID Name ID_1 Job
	│ Int64 String Int64 String
	─────┼───────────────────────────────────
	1 │ 20 John Doe 20 Lawyer
	2 │ 20 John Doe 60 Astronaut
	3 │ 40 Jane Doe 20 Lawyer
	4 │ 40 Jane Doe 60 Astronaut

	Key value comparisons and floating point values
	Key values from the two or more data frames are compared using the isequal function. This is consistent with the Set and Dict types in Julia Base.

	It is not recommended to use floating point numbers as keys: floating point comparisons can be surprising and unpredictable. If you do use floating point keys, note that by default an error is raised when keys include -0.0 (negative zero) or NaN values. Here is an example:

	julia> innerjoin(DataFrame(id=[-0.0]), DataFrame(id=[0.0]), on=:id)
	ERROR: ArgumentError: Currently for numeric values `NaN` and `-0.0` in their real or imaginary components are not allowed. Such value was found in column :id in left data frame. Use CategoricalArrays.jl to wrap these values in a CategoricalVector to perform the requested join.

	This can be overridden by wrapping the key values in a categorical vector.

	Joining on key columns with different names
	In order to join data frames on keys which have different names in the left and right tables, you may pass left => right pairs as on argument:

	julia> a = DataFrame(ID=[20, 40], Name=["John Doe", "Jane Doe"])
	2×2 DataFrame
	Row │ ID Name
	│ Int64 String
	─────┼─────────────────
	1 │ 20 John Doe
	2 │ 40 Jane Doe

	julia> b = DataFrame(IDNew=[20, 40], Job=["Lawyer", "Doctor"])
	2×2 DataFrame
	Row │ IDNew Job
	│ Int64 String
	─────┼───────────────
	1 │ 20 Lawyer
	2 │ 40 Doctor

	julia> innerjoin(a, b, on = :ID => :IDNew)
	2×3 DataFrame
	Row │ ID Name Job
	│ Int64 String String
	─────┼─────────────────────────
	1 │ 20 John Doe Lawyer
	2 │ 40 Jane Doe Doctor

	Here is another example with multiple columns:

	julia> a = DataFrame(City=["Amsterdam", "London", "London", "New York", "New York"],
	Job=["Lawyer", "Lawyer", "Lawyer", "Doctor", "Doctor"],
	Category=[1, 2, 3, 4, 5])
	5×3 DataFrame
	Row │ City Job Category
	│ String String Int64
	─────┼─────────────────────────────
	1 │ Amsterdam Lawyer 1
	2 │ London Lawyer 2
	3 │ London Lawyer 3
	4 │ New York Doctor 4
	5 │ New York Doctor 5

	julia> b = DataFrame(Location=["Amsterdam", "London", "London", "New York", "New York"],
	Work=["Lawyer", "Lawyer", "Lawyer", "Doctor", "Doctor"],
	Name=["a", "b", "c", "d", "e"])
	5×3 DataFrame
	Row │ Location Work Name
	│ String String String
	─────┼───────────────────────────
	1 │ Amsterdam Lawyer a
	2 │ London Lawyer b
	3 │ London Lawyer c
	4 │ New York Doctor d
	5 │ New York Doctor e

	julia> innerjoin(a, b, on = [:City => :Location, :Job => :Work])
	9×4 DataFrame
	Row │ City Job Category Name
	│ String String Int64 String
	─────┼─────────────────────────────────────
	1 │ Amsterdam Lawyer 1 a
	2 │ London Lawyer 2 b
	3 │ London Lawyer 3 b
	4 │ London Lawyer 2 c
	5 │ London Lawyer 3 c
	6 │ New York Doctor 4 d
	7 │ New York Doctor 5 d
	8 │ New York Doctor 4 e
	9 │ New York Doctor 5 e

	Handling of duplicate keys and tracking source data frame
	Additionally, notice that in the last join rows 2 and 3 had the same values on on variables in both joined DataFrames. In such a situation innerjoin, outerjoin, leftjoin and rightjoin will produce all combinations of matching rows. In our example rows from 2 to 5 were created as a result. The same behavior can be observed for rows 4 and 5 in both joined DataFrames.

	In order to check that columns passed as the on argument define unique keys (according to isequal) in each input data frame you can set the validate keyword argument to a two-element tuple or a pair of Bool values, with each element indicating whether to run check for the corresponding data frame. Here is an example for the join operation described above:

	julia> innerjoin(a, b, on = [(:City => :Location), (:Job => :Work)], validate=(true, true))
	ERROR: ArgumentError: Merge key(s) are not unique in both df1 and df2. df1 contains 2 duplicate keys: (City = "London", Job = "Lawyer") and (City = "New York", Job = "Doctor"). df2 contains 2 duplicate keys: (Location = "London", Work = "Lawyer") and (Location = "New York", Work = "Doctor").

	Finally, using the source keyword argument you can add a column to the resulting data frame indicating whether the given row appeared only in the left, the right or both data frames. Here is an example:

	julia> a = DataFrame(ID=[20, 40], Name=["John", "Jane"])
	2×2 DataFrame
	Row │ ID Name
	│ Int64 String
	─────┼───────────────
	1 │ 20 John
	2 │ 40 Jane

	julia> b = DataFrame(ID=[20, 60], Job=["Lawyer", "Doctor"])
	2×2 DataFrame
	Row │ ID Job
	│ Int64 String
	─────┼───────────────
	1 │ 20 Lawyer
	2 │ 60 Doctor

	julia> outerjoin(a, b, on=:ID, validate=(true, true), source=:source)
	3×4 DataFrame
	Row │ ID Name Job source
	│ Int64 String? String? String
	─────┼─────────────────────────────────────
	1 │ 20 John Lawyer both
	2 │ 40 Jane missing left_only
	3 │ 60 missing Doctor right_only

	Note that this time we also used the validate keyword argument and it did not produce errors as the keys defined in both source data frames were unique.

	Renaming joined columns
	Often you want to keep track of the source data frame. This feature is supported with the renamecols keyword argument:

	julia> innerjoin(a, b, on=:ID, renamecols = "_left" => "_right")
	1×3 DataFrame
	Row │ ID Name_left Job_right
	│ Int64 String String
	─────┼─────────────────────────────
	1 │ 20 John Lawyer

	In the above example we added the "_left" suffix to the non-key columns from the left table and the "_right" suffix to the non-key columns from the right table.

	Alternatively it is allowed to pass a function transforming column names:

	julia> innerjoin(a, b, on=:ID, renamecols = lowercase => uppercase)
	1×3 DataFrame
	Row │ ID name JOB
	│ Int64 String String
	─────┼───────────────────────
	1 │ 20 John Lawyer

	Matching missing values in joins
	By default when you try to to perform a join on a key that has missing values you get an error:

	julia> df1 = DataFrame(id=[1, missing, 3], a=1:3)
	3×2 DataFrame
	Row │ id a
	│ Int64? Int64
	─────┼────────────────
	1 │ 1 1
	2 │ missing 2
	3 │ 3 3

	julia> df2 = DataFrame(id=[1, 2, missing], b=1:3)
	3×2 DataFrame
	Row │ id b
	│ Int64? Int64
	─────┼────────────────
	1 │ 1 1
	2 │ 2 2
	3 │ missing 3

	julia> innerjoin(df1, df2, on=:id)
	ERROR: ArgumentError: Missing values in key columns are not allowed when matchmissing == :error. `missing` found in column :id in left data frame.

	If you would prefer missing values to be treated as equal pass the matchmissing=:equal keyword argument:

	julia> innerjoin(df1, df2, on=:id, matchmissing=:equal)
	2×3 DataFrame
	Row │ id a b
	│ Int64? Int64 Int64
	─────┼───────────────────────
	1 │ 1 1 1
	2 │ missing 2 3

	Alternatively you might want to drop all rows with missing values. In this case pass matchmissing=:notequal:

	julia> innerjoin(df1, df2, on=:id, matchmissing=:notequal)
	1×3 DataFrame
	Row │ id a b
	│ Int64? Int64 Int64
	─────┼──────────────────────
	1 │ 1 1 1

	Specifying row order in the join result
	By default the order of rows produced by the join operation is undefined:

	julia> df_left = DataFrame(id=[1, 2, 4, 5], left=1:4)
	4×2 DataFrame
	Row │ id left
	│ Int64 Int64
	─────┼──────────────
	1 │ 1 1
	2 │ 2 2
	3 │ 4 3
	4 │ 5 4

	julia> df_right = DataFrame(id=[2, 1, 3, 6, 7], right=1:5)
	5×2 DataFrame
	Row │ id right
	│ Int64 Int64
	─────┼──────────────
	1 │ 2 1
	2 │ 1 2
	3 │ 3 3
	4 │ 6 4
	5 │ 7 5

	julia> outerjoin(df_left, df_right, on=:id)
	7×3 DataFrame
	Row │ id left right
	│ Int64 Int64? Int64?
	─────┼─────────────────────────
	1 │ 2 2 1
	2 │ 1 1 2
	3 │ 4 3 missing
	4 │ 5 4 missing
	5 │ 3 missing 3
	6 │ 6 missing 4
	7 │ 7 missing 5

	If you would like the result to keep the row order of the left table pass the order=:left keyword argument:

	julia> outerjoin(df_left, df_right, on=:id, order=:left)
	7×3 DataFrame
	Row │ id left right
	│ Int64 Int64? Int64?
	─────┼─────────────────────────
	1 │ 1 1 2
	2 │ 2 2 1
	3 │ 4 3 missing
	4 │ 5 4 missing
	5 │ 3 missing 3
	6 │ 6 missing 4
	7 │ 7 missing 5

	Note that in this case keys missing from the left table are put after the keys present in it.

	Similarly order=:right keeps the order of the right table (and puts keys not present in it at the end):

	julia> outerjoin(df_left, df_right, on=:id, order=:right)
	7×3 DataFrame
	Row │ id left right
	│ Int64 Int64? Int64?
	─────┼─────────────────────────
	1 │ 2 2 1
	2 │ 1 1 2
	3 │ 3 missing 3
	4 │ 6 missing 4
	5 │ 7 missing 5
	6 │ 4 3 missing
	7 │ 5 4 missing

	In-place left join
	A common operation is adding data from a reference table to some main table. It is possible to perform such an in-place update using the leftjoin! function. In this case the left table is updated in place with matching rows from the right table.

	julia> main = DataFrame(id=1:4, main=1:4)
	4×2 DataFrame
	Row │ id main
	│ Int64 Int64
	─────┼──────────────
	1 │ 1 1
	2 │ 2 2
	3 │ 3 3
	4 │ 4 4

	julia> leftjoin!(main, DataFrame(id=[2, 4], info=["a", "b"]), on=:id);

	julia> main
	4×3 DataFrame
	Row │ id main info
	│ Int64 Int64 String?
	─────┼───────────────────────
	1 │ 1 1 missing
	2 │ 2 2 a
	3 │ 3 3 missing
	4 │ 4 4 b

	Note that in this case the order and number of rows in the left table is not changed. Therefore, in particular, it is not allowed to have duplicate keys in the right table:

	julia> leftjoin!(main, DataFrame(id=[2, 2], info_bad=["a", "b"]), on=:id)
	ERROR: ArgumentError: duplicate rows found in right table
	# # Title: End-to-end RAG example - low-level
	# This script runs an end-to-end RAG applications over text documents in `documents/` folder

	# Package Imports
	# using OpenAI # for interacting with the API, import if you want to skip the first part of tutorial
	using JSON3
	using HTTP
	using LinearAlgebra

	# ! Config
	const MODEL_CHAT = "gpt-3.5-turbo" # default model for chat, fast and "cheap"
	const MODEL_EMBEDDING = "text-embedding-ada-002" # default model for embedding test
	const API_KEY = ENV["OPENAI_API_KEY"]; # do not commit, share or sync anywhere!

	# Define shared abstract type for custom printing
	abstract type AbstractBuildingBlock end

	# Convenience for building "messages" for the chatbot faster (see the standard format in the API calls section...)
	@kwdef struct PromptTemplate <: AbstractBuildingBlock
	system_prompt::Union{String, Nothing} = nothing
	user_prompt::String = ""
	end

	# Return type for the AI model
	@kwdef struct AIMessage <: AbstractBuildingBlock
	content::AbstractString
	status::Union{Int, Nothing} = nothing
	tokens::Tuple{Int, Int} = (-1, -1)
	elapsed::Float64 = -1.0
	end

	# Stores document chunks and their embeddings
	@kwdef struct ChunkIndex{T <: AbstractString} <: AbstractBuildingBlock
	id::Symbol = gensym("ChunkIndex")
	embeddings::Matrix{Float32}
	chunks::Vector{T}
	sources::Vector{<:AbstractString}
	end

	# Structured show method for easier reading (each kwarg on a new line)
	function Base.show(io::IO, t::AbstractBuildingBlock)
	dump(IOContext(io, :limit => true), t, maxdepth = 1)
	end

	# Utitity to be able to combine indices from different sources/documents easily
	function Base.vcat(i1::ChunkIndex{T}, i2::ChunkIndex{T}) where {T <: AbstractString}
	ChunkIndex(;
	embeddings = hcat(i1.embeddings, i2.embeddings),
	chunks = vcat(i1.chunks, i2.chunks),
	sources = vcat(i1.sources, i2.sources))
	end

	## Useful functions for API calls
	# defined in OpenAI.jl (more sophisticated version of it...)
	function create_chat(api_key::String,
	model::String,
	messages::Vector{Dict{String, String}};
	http_kwargs::NamedTuple = NamedTuple(), api_kwargs::NamedTuple = NamedTuple())
	url = "https://api.openai.com/v1/chat/completions"
	headers = Dict("Authorization" => "Bearer $api_key",
	"Content-Type" => "application/json")
	# JSON-encoded data (string)
	body = (; messages, model, api_kwargs...) \|> JSON3.write
	resp = HTTP.request("POST", url; body, headers, http_kwargs...)
	return (;
	response = JSON3.read(resp.body),
	status = resp.status)
	end

	"Builds a history of the conversation (=messages) to provide the prompt to the API. All kwargs are passed as replacements such that `{{key}} => value` in the template.}}"
	function render(prompt::PromptTemplate; kwargs...)
	conversation = Dict{String, String}[]
	!isnothing(prompt.system_prompt) &&
	push!(conversation, Dict("role" => "system", "content" => prompt.system_prompt))
	# Replace any handlebar-style placeholders like `{{key}}` in the user_prompt with user-provided kwargs
	user_prompt = replace(prompt.user_prompt, ["{{$(k)}}" => v for (k, v) in kwargs]...)
	push!(conversation, Dict("role" => "user", "content" => user_prompt))
	return conversation
	end

	"Builds the request to the API and waits for the response."
	function aigenerate(template::PromptTemplate;
	api_key::String = API_KEY,
	model::String = MODEL_CHAT,
	# Let's use smart defaults because OpenAI is a bit fiddly...
	http_kwargs::NamedTuple = (;
	retry_non_idempotent = true,
	retries = 10,
	readtimeout = 30), api_kwargs::NamedTuple = NamedTuple(),
	kwargs...)
	##
	conversation = render(template; kwargs...)
	time = @elapsed r = create_chat(api_key,
	model,
	conversation;
	http_kwargs,
	api_kwargs...)
	return AIMessage(; content = r.response[:choices][begin][:message][:content] \|> strip,
	status = Int(r.status),
	tokens = (r.response[:usage][:prompt_tokens],
	r.response[:usage][:completion_tokens]),
	elapsed = time)
	end

	"Creates embeddings for `docs` (string or array of strings) and returns a normalized matrix (column-wise)"
	function aiembed(docs::Union{AbstractString, Vector{<:AbstractString}},
	postprocess::F = normalize;
	api_key::String = API_KEY,
	model::String = MODEL_EMBEDDING,
	http_kwargs::NamedTuple = NamedTuple(), api_kwargs::NamedTuple = NamedTuple(),
	kwargs...) where {F <: Function}
	r = create_embeddings(api_key, docs, model; http_kwargs, api_kwargs...)
	return mapreduce(x -> postprocess(x[:embedding]), hcat, r.response.data)
	end

	"RAG wrapper that answers the given question and inject the context if needed from `index`"
	function airag(index::ChunkIndex, rag_template::PromptTemplate;
	question::AbstractString, top_k::Int = 3, kwargs...)
	question_emb = aiembed(question)

	idxs = find_closest(index, question_emb; top_k)
	close_chunks = [join(index.chunks[max(begin, i - 2):min(end, i + 2)], "\n")
	for i in idxs]
	return aigenerate(rag_template;
	question,
	context = join(close_chunks, "\n\n"),
	kwargs...)
	end
	## Useful functions for document chunk retrieval
	"Splits `doc` into text chunks of size at most `max_size` (in characters), ie, it accumulates smaller chunks to match the desired size"
	function build_chunks(doc::AbstractString;
	max_size::Int = 128,
	split_pattern::Union{String, Regex} = r"\n\|\. ",
	join_key::String = "\n")
	## shortcut if doc is too short
	length(doc) < max_size && return [doc]
	## proceed
	texts = split(doc, split_pattern)
	doc_chunks = Vector{eltype(texts)}()
	start, counter = 1, 0
	# accumulate chunks until we reach the max size
	for i in eachindex(texts)
	l = length(texts[i])
	# if it doesn't fit, we push all preceeding docs, reset the counter and start a new chunk
	if l == 0 \|\| (counter + l >= max_size)
	push!(doc_chunks, join(texts[start:max(i - 1, 1)], join_key))
	start = i # current text becomes the next chunk
	counter = 0
	end
	counter += l
	end
	# last chunk is never pushed in, so we need to do it manually
	push!(doc_chunks, join(texts[start:end], join_key))
	return doc_chunks
	end

	"Finds the indices of chunks (represented by embeddings in `emb`) that are closest (cosine similarity) to query embedding (`query_emb`). Returns only `top_k` closest indices."
	function find_closest(emb::AbstractMatrix{<:Real},
	query_emb::AbstractVector{<:Real};
	top_k::Int = 100)
	query_emb' * emb \|> vec \|> sortperm \|> reverse \|> x -> first(x, top_k)
	end
	function find_closest(index::ChunkIndex,
	query_emb::AbstractVector{<:Real};
	top_k::Int = 100)
	find_closest(index.embeddings, query_emb; top_k)
	end
	# # Main
	# ## Preparation

	# Before you do anything, put together 10 questions that a user might ask and potential answers based on the sources you want to use.
	# It helps you think through:
	# - what are the sources you want to use? how is the data structured?
	# - what are the questions you want to answer? how do you want to answer them?
	#
	# Moreover, there cannot be a RAG application without an evaluation set.
	# So these questions will allow you to measure how well you're progressing.
	# Ideally, copy the source text you used, so you can later on measure if your RAG app used the "right" context as a source for the answer.

	# Now, let's go to the documentation of DataFrames.jl and let's download a few pages from there.
	# Just copy & paste the text into a text file and save it in `documents/` folder.
	# You might want to remove the useless data at the beginning and end to make your life easier - this is a tutorial afterall :)

	# ## Load the data
	# Example
	dir_raw = joinpath("documents")
	files = ["comparison_with_python.txt", "database_style_joins.txt", "what_is_dataframes.txt"]
	labels = ["DataFrames-Python", "DataFrames-Joins", "DataFrames-WhatIs"]

	# Embed documents and build ChunkIndex (bundle of embeddings, chunks and their respective sources)
	indices = ChunkIndex[]
	for (fn, lbl) in zip(files, labels)
	doc_raw = read(joinpath(dir_raw, fn), String)
	# split into chunks, if you want to start simple - just do `split(text,"\n\n")`
	doc_chunks = build_chunks(doc_raw; max_size = 256, split_pattern = "\n\n") \|>
	texts -> mapreduce(x -> build_chunks(x;
	max_size = 256,
	split_pattern = r"\n\|\. "),
	vcat,
	texts)
	# Notice that we embed all doc_chunks at once, not one by one
	# OpenAI supports embedding multiple documents if they are short enough, ie, it's only because the documentation pages are small
	embeddings = aiembed(doc_chunks) .\|> Float32
	index = ChunkIndex(;
	embeddings,
	chunks = doc_chunks,
	sources = fill(lbl, length(doc_chunks)))
	push!(indices, index)
	end
	index = reduce(vcat, indices) # combine it across several documents

	# ## Ask questions - Step-by-Step
	rag_template = PromptTemplate(;
	system_prompt = "Act as a world-class AI assistant and an expert in Julia language. Answer the question based only on the provided context. Be brief and concise.",
	user_prompt = """
	# Context

	{{context}}

	# Question

	{{question}}

	# Answer
	""")

	question = "I like dplyr, what is the equivalent in Julia?"
	question_emb = aiembed(question)

	# Build the context of similar docs -- take the top 3 closest chunks
	idxs = find_closest(index, question_emb; top_k = 3)
	# We add 2 chunks before and after each of the closest chunk
	close_chunks = [join(index.chunks[max(begin, i - 2):min(end, i + 2)], "\n")
	for i in idxs]
	answer = aigenerate(rag_template;
	question,
	context = join(close_chunks, "\n\n"))
	println(answer.content)

	# ## Ask questions E2E
	question = "I like dplyr, what is the equivalent in Julia?"
	answer = airag(index, rag_template; question)
	println(answer.content)
	Welcome to the DataFrames.jl documentation!

	This resource aims to teach you everything you need to know to get up and running with tabular data manipulation using the DataFrames.jl package.

	For more illustrations of DataFrames.jl usage, in particular in conjunction with other packages you can check-out the following resources (they are kept up to date with the released version of DataFrames.jl):

	What is DataFrames.jl?
	DataFrames.jl provides a set of tools for working with tabular data in Julia. Its design and functionality are similar to those of pandas (in Python) and data.frame, data.table and dplyr (in R), making it a great general purpose data science tool.

	DataFrames.jl plays a central role in the Julia Data ecosystem, and has tight integrations with a range of different libraries. DataFrames.jl isn't the only tool for working with tabular data in Julia – as noted below, there are some other great libraries for certain use-cases – but it provides great data wrangling functionality through a familiar interface.

	To understand the toolchain in more detail, have a look at the tutorials in this manual. New users can start with the First Steps with DataFrames.jl section.

	You may find the DataFramesMeta.jl package or one of the other convenience packages discussed in the Data manipulation frameworks section of this manual helpful when writing more advanced data transformations, especially if you do not have a significant programming experience. These packages provide convenience syntax similar to dplyr in R.

	If you use metadata when working with DataFrames.jl you might find the TableMetadataTools.jl package useful. This package defines several convenience functions for performing typical metadata operations.

	DataFrames.jl and the Julia Data Ecosystem
	The Julia data ecosystem can be a difficult space for new users to navigate, in part because the Julia ecosystem tends to distribute functionality across different libraries more than some other languages. Because many people coming to DataFrames.jl are just starting to explore the Julia data ecosystem, below is a list of well-supported libraries that provide different data science tools, along with a few notes about what makes each library special, and how well integrated they are with DataFrames.jl.

	Statistics
	StatsKit.jl: A convenience meta-package which loads a set of essential packages for statistics, including those mentioned below in this section and DataFrames.jl itself.
	Statistics: The Julia standard library comes with a wide range of statistics functionality, but to gain access to these functions you must call using Statistics.
	LinearAlgebra: Like Statistics, many linear algebra features (factorizations, inversions, etc.) live in a library you have to load to use.
	SparseArrays are also in the standard library but must be loaded to be used.
	FreqTables.jl: Create frequency tables / cross-tabulations. Tightly integrated with DataFrames.jl.
	HypothesisTests.jl: A range of hypothesis testing tools.
	GLM.jl: Tools for estimating linear and generalized linear models. Tightly integrated with DataFrames.jl.
	StatsModels.jl: For converting heterogeneous DataFrame into homogeneous matrices for use with linear algebra libraries or machine learning applications that don't directly support DataFrames. Will do things like convert categorical variables into indicators/one-hot-encodings, create interaction terms, etc.
	MultivariateStats.jl: linear regression, ridge regression, PCA, component analyses tools. Not well integrated with DataFrames.jl, but easily used in combination with StatsModels.
	Machine Learning
	MLJ.jl: if you're more of an applied user, there is a single package the pulls from all these different libraries and provides a single, scikit-learn inspired API: MLJ.jl. MLJ.jl provides a common interface for a wide range of machine learning algorithms.
	ScikitLearn.jl: A Julia wrapper around the full Python scikit-learn machine learning library. Not well integrated with DataFrames.jl, but can be combined using StatsModels.jl.
	AutoMLPipeline: A package that makes it trivial to create complex ML pipeline structures using simple expressions. It leverages on the built-in macro programming features of Julia to symbolically process, manipulate pipeline expressions, and makes it easy to discover optimal structures for machine learning regression and classification.
	Deep learning: KNet.jl and Flux.jl.
	Plotting
	Plots.jl: Powerful, modern plotting library with a syntax akin to that of matplotlib (in Python) or plot (in R). StatsPlots.jl provides Plots.jl with recipes for many standard statistical plots.
	Gadfly.jl: High-level plotting library with a "grammar of graphics" syntax akin to that of ggplot (in R).
	AlgebraOfGraphics.jl: A "grammar of graphics" library build upon Makie.jl.
	VegaLite.jl: High-level plotting library that uses a different "grammar of graphics" syntax and has an emphasis on interactive graphics.
	Data Wrangling:
	Impute.jl: various methods for handling missing data in vectors, matrices and tables.
	DataFramesMeta.jl: A range of convenience functions for DataFrames.jl that augment select and transform to provide a user experience similar to that provided by dplyr in R.
	DataFrameMacros.jl: Provides macro versions of the common DataFrames.jl functions similar to DataFramesMeta.jl, with convenient syntax for the manipulation of multiple columns at once.
	Query.jl: Query.jl provides a single framework for data wrangling that works with a range of libraries, including DataFrames.jl, other tabular data libraries (more on those below), and even non-tabular data. Provides many convenience functions analogous to those in dplyr in R or LINQ.
	You can find more information on these packages in the Data manipulation frameworks section of this manual.
	And More!
	Graphs.jl: A pure-Julia, high performance network analysis library. Edgelists in DataFrames can be easily converted into graphs using the GraphDataFrameBridge.jl package.
	IO:
	DataFrames.jl work well with a range of formats, including:
	CSV files (using CSV.jl),
	Apache Arrow (using Arrow.jl)
	reading Stata, SAS and SPSS files (using ReadStatTables.jl; alternatively Queryverse users can choose StatFiles.jl),
	Parquet files (using Parquet2.jl),
	reading R data files (.rda, .RData) (using RData.jl).
	While not all of these libraries are tightly integrated with DataFrames.jl, because DataFrames are essentially collections of aligned Julia vectors, so it is easy to (a) pull out a vector for use with a non-DataFrames-integrated library, or (b) convert your table into a homogeneously-typed matrix using the Matrix constructor or StatsModels.jl.

	Other Julia Tabular Libraries
	DataFrames.jl is a great general purpose tool for data manipulation and wrangling, but it's not ideal for all applications. For users with more specialized needs, consider using:

	TypedTables.jl: Type-stable heterogeneous tables. Useful for improved performance when the structure of your table is relatively stable and does not feature thousands of columns.
	JuliaDB.jl: For users working with data that is too large to fit in memory, we suggest JuliaDB.jl, which offers better performance for large datasets, and can handle out-of-core data manipulations (Python users can think of JuliaDB.jl as the Julia version of dask).
	Note that most tabular data libraries in the Julia ecosystem (including DataFrames.jl) support a common interface (defined in the Tables.jl package). As a result, some libraries are capable or working with a range of tabular data structures, making it easy to move between tabular libraries as your needs change. A user of Query.jl, for example, can use the same code to manipulate data in a DataFrame, a Table (defined by TypedTables.jl), or a JuliaDB table.

	Questions?
	If there is something you expect DataFrames to be capable of, but cannot figure out how to do, please reach out with questions in Domains/Data on Discourse. Additionally you might want to listen to an introduction to DataFrames.jl on JuliaAcademy.

	Please report bugs by opening an issue.

	You can follow the source links throughout the documentation to jump right to the source files on GitHub to make pull requests for improving the documentation and function capabilities.

	Please review DataFrames contributing guidelines before submitting your first PR!

	Information on specific versions can be found on the Release page.

	Package Manual
	First Steps with DataFrames.jl
	Setting up the Environment
	Constructors and Basic Utility Functions
	Getting and Setting Data in a Data Frame
	Basic Usage of Transformation Functions
	Getting Started
	Installation
	The DataFrame Type
	Database-Style Joins
	Introduction to joins
	Key value comparisons and floating point values
	Joining on key columns with different names
	Handling of duplicate keys and tracking source data frame
	Renaming joined columns
	Matching missing values in joins
	Specifying row order in the join result
	In-place left join
	The Split-Apply-Combine Strategy
	Design of the split-apply-combine support
	Examples of the split-apply-combine operations
	Using GroupedDataFrame as an iterable and indexable object
	Simulating the SQL where clause
	Column-independent operations
	Column-independent operations versus functions
	Specifying group order in groupby
	Reshaping and Pivoting Data
	Sorting
	Categorical Data
	Missing Data
	Comparisons
	Comparison with the Python package pandas
	Comparison with the R package dplyr
	Comparison with the R package data.table
	Comparison with Stata (version 8 and above)
	Data manipulation frameworks
	DataFramesMeta.jl
	DataFrameMacros.jl
	Query.jl
	API
	Only exported (i.e. available for use without DataFrames. qualifier after loading the DataFrames.jl package with using DataFrames) types and functions are considered a part of the public API of the DataFrames.jl package. In general all such objects are documented in this manual (in case some documentation is missing please kindly report an issue here).

	Note
	Breaking changes to public and documented API are avoided in DataFrames.jl where possible.

	The following changes are not considered breaking:

	specific floating point values computed by operations may change at any time; users should rely only on approximate accuracy;
	in functions that use the default random number generator provided by Base Julia the specific random numbers computed may change across Julia versions;
	if the changed functionality is classified as a bug;
	if the changed behavior was not documented; two major cases are:
	in its implementation some function accepted a wider range of arguments that it was documented to handle - changes in handling of undocumented arguments are not considered as breaking;
	the type of the value returned by a function changes, but it still follows the contract specified in the documentation; for example if a function is documented to return a vector then changing its type from Vector to PooledVector is not considered as breaking;
	error behavior: code that threw an exception can change exception type thrown or stop throwing an exception;
	changes in display (how objects are printed);
	changes to the state of global objects from Base Julia whose state normally is considered volatile (e.g. state of global random number generator).
	All types and functions that are part of public API are guaranteed to go through a deprecation period before a breaking change is made to them or they would be removed.

	The standard practice is that breaking changes are implemented when a major release of DataFrames.jl is made (e.g. functionalities deprecated in a 1.x release would be changed in the 2.0 release).

	In rare cases a breaking change might be introduced in a minor release. In such a case the changed behavior still goes through one minor release during which it is deprecated. The situations where such a breaking change might be allowed are (still such breaking changes will be avoided if possible):

	the affected functionality was previously clearly identified in the documentation as being subject to changes (for example in DataFrames.jl 1.4 release propagation rules of :note-style metadata are documented as such);
	the change is on the border of being classified as a bug (in rare cases even if a behavior of some function was documented its consequences for certain argument combinations could be decided to be unintended and not wanted);
	the change is needed to adjust DataFrames.jl functionality to changes in Base Julia.
	Please be warned that while Julia allows you to access internal functions or types of DataFrames.jl these can change without warning between versions of DataFrames.jl. In particular it is not safe to directly access fields of types that are a part of public API of the DataFrames.jl package using e.g. the getfield function. Whenever some operation on fields of defined types is considered allowed an appropriate exported function should be used instead.