Skip to content

Instantly share code, notes, and snippets.

Last active March 14, 2024 09:30
Show Gist options
  • Save svilupp/8f7d364e37650ba7520f9b4783482cb2 to your computer and use it in GitHub Desktop.
Save svilupp/8f7d364e37650ba7520f9b4783482cb2 to your computer and use it in GitHub Desktop.
Building a RAG Chatbot over DataFrames.jl Documentation
This section compares DataFrames.jl with other data manipulation frameworks in Python, R, and Stata.
A sample data set can be created using the following code:
using DataFrames
using Statistics
df = DataFrame(grp=repeat(1:2, 3), x=6:-1:1, y=4:9, z=[3:7; missing], id='a':'f')
df2 = DataFrame(grp=[1, 3], w=[10, 11])
Some of the operations mutate the tables so every operation assumes that it is done on the original data frame.
Note that in the comparisons presented below predicates like x -> x >= 1 can be more compactly written as =>(1). The latter form has an additional benefit that it is compiled only once per Julia session (as opposed to x -> x >= 1 which defines a new anonymous function every time it is introduced).
Comparison with the Python package pandas
The following table compares the main functions of DataFrames.jl with the Python package pandas (version 1.1.0):
import pandas as pd
import numpy as np
df = pd.DataFrame({'grp': [1, 2, 1, 2, 1, 2],
'x': range(6, 0, -1),
'y': range(4, 10),
'z': [3, 4, 5, 6, 7, None]},
index = list('abcdef'))
df2 = pd.DataFrame({'grp': [1, 3], 'w': [10, 11]})
Because pandas supports multi-index, this example data frame is set up with a to f as row indices rather than a separate id column.
Accessing data
Operation pandas DataFrames.jl
Cell indexing by location df.iloc[1, 1] df[2, 2]
Row slicing by location df.iloc[1:3] df[2:3, :]
Column slicing by location df.iloc[:, 1:] df[:, 2:end]
Row indexing by label df.loc['c'] df[findfirst(==('c'),, :]
Column indexing by label df.loc[:, 'x'] df[:, :x]
Column slicing by label df.loc[:, ['x', 'z']] df[:, [:x, :z]]
df.loc[:, 'x':'z'] df[:, Between(:x, :z)]
Mixed indexing df.loc['c'][1] df[findfirst(==('c'),, 2]
Note that Julia uses 1-based indexing, inclusive on both ends. A special keyword end can be used to indicate the last index. Likewise, the begin keyword can be used to indicate the first index.
In addition, when indexing a data frame with the findfirst function, a single DataFrameRow object is returned. In the case that id is not unique, you can use the findall function or boolean indexing instead. It would then return a DataFrame object containing all matched rows. The following two lines of code are functionally equivalent:
df[findall(==('c'),, :]
df[ .== 'c', :]
DataFrames.jl's indexing always produces a consistent and predictable return type. By contrast, pandas' loc function returns a Series object when there is exactly one 'c' value in the index, and it returns a DataFrame object when there are multiple rows having the index value of 'c'.
Common operations
Operation pandas DataFrames.jl
Reduce multiple values df['z'].mean(skipna = False) mean(df.z)
df['z'].mean() mean(skipmissing(df.z))
df[['z']].agg(['mean']) combine(df, :z => mean ∘ skipmissing)
Add new columns df.assign(z1 = df['z'] + 1) transform(df, :z => (v -> v .+ 1) => :z1)
Rename columns df.rename(columns = {'x': 'x_new'}) rename(df, :x => :x_new)
Pick & transform columns df.assign(x_mean = df['x'].mean())[['x_mean', 'y']] select(df, :x => mean, :y)
Sort rows df.sort_values(by = 'x') sort(df, :x)
df.sort_values(by = ['grp', 'x'], ascending = [True, False]) sort(df, [:grp, order(:x, rev = true)])
Drop missing rows df.dropna() dropmissing(df)
Select unique rows df.drop_duplicates() unique(df)
Note that pandas skips NaN values in its analytic functions by default. By contrast, Julia functions do not skip NaN's. If necessary, you can filter out the NaN's before processing, for example, mean(Iterators.filter(!isnan, x)).
Pandas uses NaN for representing both missing data and the floating point "not a number" value. Julia defines a special value missing for representing missing data. DataFrames.jl respects general rules in Julia in propagating missing values by default. If necessary, the skipmissing function can be used to remove missing data. See the Missing Data section for more information.
In addition, pandas keeps the original column name after applying a function. DataFrames.jl appends a suffix to the column name by default. To keep it simple, the examples above do not synchronize the column names between pandas and DataFrames.jl (you can pass renamecols=false keyword argument to select, transform and combine functions to retain old column names).
Mutating operations
Operation pandas DataFrames.jl
Add new columns df['z1'] = df['z'] + 1 df.z1 = df.z .+ 1
transform!(df, :z => (x -> x .+ 1) => :z1)
df.insert(1, 'const', 10) insertcols!(df, 2, :const => 10)
Rename columns df.rename(columns = {'x': 'x_new'}, inplace = True) rename!(df, :x => :x_new)
Sort rows df.sort_values(by = 'x', inplace = True) sort!(df, :x)
Drop missing rows df.dropna(inplace = True) dropmissing!(df)
Select unique rows df.drop_duplicates(inplace = True) unique!(df)
Generally speaking, DataFrames.jl follows the Julia convention of using ! in the function name to indicate mutation behavior.
Grouping data and aggregation
DataFrames.jl provides a groupby function to apply operations over each group independently. The result of groupby is a GroupedDataFrame object which may be processed using the combine, transform, or select functions. The following table illustrates some common grouping and aggregation usages.
Operation pandas DataFrames.jl
Aggregate by groups df.groupby('grp')['x'].mean() combine(groupby(df, :grp), :x => mean)
Rename column after aggregation df.groupby('grp')['x'].mean().rename("my_mean") combine(groupby(df, :grp), :x => mean => :my_mean)
Add aggregated data as column df.join(df.groupby('grp')['x'].mean(), on='grp', rsuffix='_mean') transform(groupby(df, :grp), :x => mean)
...and select output columns df.join(df.groupby('grp')['x'].mean(), on='grp', rsuffix='_mean')[['grp', 'x_mean']] select(groupby(df, :grp), :id, :x => mean)
Note that pandas returns a Series object for 1-dimensional result unless reset_index is called afterwards. The corresponding DataFrames.jl examples return an equivalent DataFrame object. Consider the first example:
>>> df.groupby('grp')['x'].mean()
1 4
2 3
Name: x, dtype: int64
For DataFrames.jl, it looks like this:
julia> combine(groupby(df, :grp), :x => mean)
2×2 DataFrame
Row │ grp x_mean
│ Int64 Float64
1 │ 1 4.0
2 │ 2 3.0
In DataFrames.jl, the GroupedDataFrame object supports an efficient key lookup. Hence, it performs well when you need to perform lookups repeatedly.
More advanced commands
This section includes more complex examples.
Operation pandas DataFrames.jl
Complex Function df[['z']].agg(lambda v: np.mean(np.cos(v))) combine(df, :z => v -> mean(cos, skipmissing(v)))
Aggregate multiple columns df.agg({'x': max, 'y': min}) combine(df, :x => maximum, :y => minimum)
df[['x', 'y']].mean() combine(df, [:x, :y] .=> mean)
df.filter(regex=("^x")).mean() combine(df, names(df, r"^x") .=> mean)
Apply function over multiple variables df.assign(x_y_cor = np.corrcoef(df.x, df.y)[0, 1]) transform(df, [:x, :y] => cor)
Row-wise operation df.assign(x_y_min = df.apply(lambda v: min(v.x, v.y), axis=1)) transform(df, [:x, :y] => ByRow(min))
df.assign(x_y_argmax = df.apply(lambda v: df.columns[v.argmax()], axis=1)) transform(df, AsTable([:x, :y]) => ByRow(argmax))
DataFrame as input df.groupby('grp').head(2) combine(d -> first(d, 2), groupby(df, :grp))
DataFrame as output df[['x']].agg(lambda x: [min(x), max(x)]) combine(df, :x => (x -> (x=[minimum(x), maximum(x)],)) => AsTable)
Note that pandas preserves the same row order after groupby whereas DataFrames.jl shows them grouped by the provided keys after the combine operation, but select and transform retain an original row ordering.
Joining data frames
DataFrames.jl supports join operations similar to a relational database.
Operation pandas DataFrames.jl
Inner join pd.merge(df, df2, how = 'inner', on = 'grp') innerjoin(df, df2, on = :grp)
Outer join pd.merge(df, df2, how = 'outer', on = 'grp') outerjoin(df, df2, on = :grp)
Left join pd.merge(df, df2, how = 'left', on = 'grp') leftjoin(df, df2, on = :grp)
Right join pd.merge(df, df2, how = 'right', on = 'grp') rightjoin(df, df2, on = :grp)
Semi join (filtering) df[df.grp.isin(df2.grp)] semijoin(df, df2, on = :grp)
Anti join (filtering) df[~df.grp.isin(df2.grp)] antijoin(df, df2, on = :grp)
For multi-column joins, both pandas and DataFrames.jl accept an array for the on keyword argument.
In the cases of semi joins and anti joins, the isin function in pandas can still be used as long as the join keys are combined in a tuple. In DataFrames.jl, it just works normally with an array of join keys specified in the on keyword argument.
Comparison with the R package dplyr
The following table compares the main functions of DataFrames.jl with the R package dplyr (version 1):
df <- tibble(grp = rep(1:2, 3), x = 6:1, y = 4:9,
z = c(3:7, NA), id = letters[1:6])
Operation dplyr DataFrames.jl
Reduce multiple values summarize(df, mean(x)) combine(df, :x => mean)
Add new columns mutate(df, x_mean = mean(x)) transform(df, :x => mean => :x_mean)
Rename columns rename(df, x_new = x) rename(df, :x => :x_new)
Pick columns select(df, x, y) select(df, :x, :y)
Pick & transform columns transmute(df, mean(x), y) select(df, :x => mean, :y)
Pick rows filter(df, x >= 1) subset(df, :x => ByRow(x -> x >= 1))
Sort rows arrange(df, x) sort(df, :x)
As in dplyr, some of these functions can be applied to grouped data frames, in which case they operate by group:
Operation dplyr DataFrames.jl
Reduce multiple values summarize(group_by(df, grp), mean(x)) combine(groupby(df, :grp), :x => mean)
Add new columns mutate(group_by(df, grp), mean(x)) transform(groupby(df, :grp), :x => mean)
Pick & transform columns transmute(group_by(df, grp), mean(x), y) select(groupby(df, :grp), :x => mean, :y)
The table below compares more advanced commands:
Operation dplyr DataFrames.jl
Complex Function summarize(df, mean(x, na.rm = T)) combine(df, :x => x -> mean(skipmissing(x)))
Transform several columns summarize(df, max(x), min(y)) combine(df, :x => maximum, :y => minimum)
summarize(df, across(c(x, y), mean)) combine(df, [:x, :y] .=> mean)
summarize(df, across(starts_with("x"), mean)) combine(df, names(df, r"^x") .=> mean)
summarize(df, across(c(x, y), list(max, min))) combine(df, ([:x, :y] .=> [maximum minimum])...)
Multivariate function mutate(df, cor(x, y)) transform(df, [:x, :y] => cor)
Row-wise mutate(rowwise(df), min(x, y)) transform(df, [:x, :y] => ByRow(min))
mutate(rowwise(df), which.max(c_across(matches("^x")))) transform(df, AsTable(r"^x") => ByRow(argmax))
DataFrame as input summarize(df, head(across(), 2)) combine(d -> first(d, 2), df)
DataFrame as output summarize(df, tibble(value = c(min(x), max(x)))) combine(df, :x => (x -> (value = [minimum(x), maximum(x)],)) => AsTable)
Comparison with the R package data.table
The following table compares the main functions of DataFrames.jl with the R package data.table (version 1.14.1).
df <- data.table(grp = rep(1:2, 3), x = 6:1, y = 4:9,
z = c(3:7, NA), id = letters[1:6])
df2 <- data.table(grp=c(1,3), w = c(10,11))
Operation data.table DataFrames.jl
Reduce multiple values df[, .(mean(x))] combine(df, :x => mean)
Add new columns df[, x_mean:=mean(x) ] transform!(df, :x => mean => :x_mean)
Rename column (in place) setnames(df, "x", "x_new") rename!(df, :x => :x_new)
Rename multiple columns (in place) setnames(df, c("x", "y"), c("x_new", "y_new")) rename!(df, [:x, :y] .=> [:x_new, :y_new])
Pick columns as dataframe df[, .(x, y)] select(df, :x, :y)
Pick column as a vector df[, x] df[!, :x]
Remove columns df[, -"x"] select(df, Not(:x))
Remove columns (in place) df[, x:=NULL] select!(df, Not(:x))
Remove columns (in place) df[, c("x", "y"):=NULL] select!(df, Not([:x, :y]))
Pick & transform columns df[, .(mean(x), y)] select(df, :x => mean, :y)
Pick rows df[ x >= 1 ] filter(:x => >=(1), df)
Sort rows (in place) setorder(df, x) sort!(df, :x)
Sort rows df[ order(x) ] sort(df, :x)
Grouping data and aggregation
Operation data.table DataFrames.jl
Reduce multiple values df[, mean(x), by=id ] combine(groupby(df, :id), :x => mean)
Add new columns (in place) df[, x_mean:=mean(x), by=id] transform!(groupby(df, :id), :x => mean)
Pick & transform columns df[, .(x_mean = mean(x), y), by=id] select(groupby(df, :id), :x => mean, :y)
More advanced commands
Operation data.table DataFrames.jl
Complex Function df[, .(mean(x, na.rm=TRUE)) ] combine(df, :x => x -> mean(skipmissing(x)))
Transform certain rows (in place) df[x<=0, x:=0] df.x[df.x .<= 0] .= 0
Transform several columns df[, .(max(x), min(y)) ] combine(df, :x => maximum, :y => minimum)
df[, lapply(.SD, mean), .SDcols = c("x", "y") ] combine(df, [:x, :y] .=> mean)
df[, lapply(.SD, mean), .SDcols = patterns("*x") ] combine(df, names(df, r"^x") .=> mean)
df[, unlist(lapply(.SD, function(x) c(max=max(x), min=min(x)))), .SDcols = c("x", "y") ] combine(df, ([:x, :y] .=> [maximum minimum])...)
Multivariate function df[, .(cor(x,y)) ] transform(df, [:x, :y] => cor)
Row-wise df[, min_xy := min(x, y), by = 1:nrow(df)] transform!(df, [:x, :y] => ByRow(min))
df[, argmax_xy := which.max(.SD) , .SDcols = patterns("*x"), by = 1:nrow(df) ] transform!(df, AsTable(r"^x") => ByRow(argmax))
DataFrame as output df[, .SD[1], by=grp] combine(groupby(df, :grp), first)
DataFrame as output df[, .SD[which.max(x)], by=grp] combine(groupby(df, :grp), sdf -> sdf[argmax(sdf.x), :])
Joining data frames
Operation data.table DataFrames.jl
Inner join merge(df, df2, on = "grp") innerjoin(df, df2, on = :grp)
Outer join merge(df, df2, all = TRUE, on = "grp") outerjoin(df, df2, on = :grp)
Left join merge(df, df2, all.x = TRUE, on = "grp") leftjoin(df, df2, on = :grp)
Right join merge(df, df2, all.y = TRUE, on = "grp") rightjoin(df, df2, on = :grp)
Anti join (filtering) df[!df2, on = "grp" ] antijoin(df, df2, on = :grp)
Semi join (filtering) merge(df1, df2[, .(grp)]) semijoin(df, df2, on = :grp)
Comparison with Stata (version 8 and above)
The following table compares the main functions of DataFrames.jl with Stata:
Operation Stata DataFrames.jl
Reduce multiple values collapse (mean) x combine(df, :x => mean)
Add new columns egen x_mean = mean(x) transform!(df, :x => mean => :x_mean)
Rename columns rename x x_new rename!(df, :x => :x_new)
Pick columns keep x y select!(df, :x, :y)
Pick rows keep if x >= 1 subset!(df, :x => ByRow(x -> x >= 1))
Sort rows sort x sort!(df, :x)
Note that the suffix ! (i.e. transform!, select!, etc) ensures that the operation transforms the dataframe in place, as in Stata
Some of these functions can be applied to grouped data frames, in which case they operate by group:
Operation Stata DataFrames.jl
Add new columns egen x_mean = mean(x), by(grp) transform!(groupby(df, :grp), :x => mean)
Reduce multiple values collapse (mean) x, by(grp) combine(groupby(df, :grp), :x => mean)
The table below compares more advanced commands:
Operation Stata DataFrames.jl
Transform certain rows replace x = 0 if x <= 0 transform(df, :x => (x -> ifelse.(x .<= 0, 0, x)) => :x)
Transform several columns collapse (max) x (min) y combine(df, :x => maximum, :y => minimum)
collapse (mean) x y combine(df, [:x, :y] .=> mean)
collapse (mean) x* combine(df, names(df, r"^x") .=> mean)
collapse (max) x y (min) x y combine(df, ([:x, :y] .=> [maximum minimum])...)
Multivariate function egen z = corr(x y) transform!(df, [:x, :y] => cor => :z)
Row-wise egen z = rowmin(x y) transform!(df, [:x, :y] => ByRow(min) => :z)
Database-Style Joins
Introduction to joins
We often need to combine two or more data sets together to provide a complete picture of the topic we are studying. For example, suppose that we have the following two data sets:
julia> using DataFrames
julia> people = DataFrame(ID=[20, 40], Name=["John Doe", "Jane Doe"])
2×2 DataFrame
Row │ ID Name
│ Int64 String
1 │ 20 John Doe
2 │ 40 Jane Doe
julia> jobs = DataFrame(ID=[20, 40], Job=["Lawyer", "Doctor"])
2×2 DataFrame
Row │ ID Job
│ Int64 String
1 │ 20 Lawyer
2 │ 40 Doctor
We might want to work with a larger data set that contains both the names and jobs for each ID. We can do this using the innerjoin function:
julia> innerjoin(people, jobs, on = :ID)
2×3 DataFrame
Row │ ID Name Job
│ Int64 String String
1 │ 20 John Doe Lawyer
2 │ 40 Jane Doe Doctor
In relational database theory, this operation is generally referred to as a join. The columns used to determine which rows should be combined during a join are called keys.
The following functions are provided to perform seven kinds of joins:
innerjoin: the output contains rows for values of the key that exist in all passed data frames.
leftjoin: the output contains rows for values of the key that exist in the first (left) argument, whether or not that value exists in the second (right) argument.
rightjoin: the output contains rows for values of the key that exist in the second (right) argument, whether or not that value exists in the first (left) argument.
outerjoin: the output contains rows for values of the key that exist in any of the passed data frames.
semijoin: Like an inner join, but output is restricted to columns from the first (left) argument.
antijoin: The output contains rows for values of the key that exist in the first (left) but not the second (right) argument. As with semijoin, output is restricted to columns from the first (left) argument.
crossjoin: The output is the cartesian product of rows from all passed data frames.
See the Wikipedia page on SQL joins for more information.
Here are examples of different kinds of join:
julia> jobs = DataFrame(ID=[20, 60], Job=["Lawyer", "Astronaut"])
2×2 DataFrame
Row │ ID Job
│ Int64 String
1 │ 20 Lawyer
2 │ 60 Astronaut
julia> innerjoin(people, jobs, on = :ID)
1×3 DataFrame
Row │ ID Name Job
│ Int64 String String
1 │ 20 John Doe Lawyer
julia> leftjoin(people, jobs, on = :ID)
2×3 DataFrame
Row │ ID Name Job
│ Int64 String String?
1 │ 20 John Doe Lawyer
2 │ 40 Jane Doe missing
julia> rightjoin(people, jobs, on = :ID)
2×3 DataFrame
Row │ ID Name Job
│ Int64 String? String
1 │ 20 John Doe Lawyer
2 │ 60 missing Astronaut
julia> outerjoin(people, jobs, on = :ID)
3×3 DataFrame
Row │ ID Name Job
│ Int64 String? String?
1 │ 20 John Doe Lawyer
2 │ 40 Jane Doe missing
3 │ 60 missing Astronaut
julia> semijoin(people, jobs, on = :ID)
1×2 DataFrame
Row │ ID Name
│ Int64 String
1 │ 20 John Doe
julia> antijoin(people, jobs, on = :ID)
1×2 DataFrame
Row │ ID Name
│ Int64 String
1 │ 40 Jane Doe
Cross joins are the only kind of join that does not use a on key:
julia> crossjoin(people, jobs, makeunique = true)
4×4 DataFrame
Row │ ID Name ID_1 Job
│ Int64 String Int64 String
1 │ 20 John Doe 20 Lawyer
2 │ 20 John Doe 60 Astronaut
3 │ 40 Jane Doe 20 Lawyer
4 │ 40 Jane Doe 60 Astronaut
Key value comparisons and floating point values
Key values from the two or more data frames are compared using the isequal function. This is consistent with the Set and Dict types in Julia Base.
It is not recommended to use floating point numbers as keys: floating point comparisons can be surprising and unpredictable. If you do use floating point keys, note that by default an error is raised when keys include -0.0 (negative zero) or NaN values. Here is an example:
julia> innerjoin(DataFrame(id=[-0.0]), DataFrame(id=[0.0]), on=:id)
ERROR: ArgumentError: Currently for numeric values `NaN` and `-0.0` in their real or imaginary components are not allowed. Such value was found in column :id in left data frame. Use CategoricalArrays.jl to wrap these values in a CategoricalVector to perform the requested join.
This can be overridden by wrapping the key values in a categorical vector.
Joining on key columns with different names
In order to join data frames on keys which have different names in the left and right tables, you may pass left => right pairs as on argument:
julia> a = DataFrame(ID=[20, 40], Name=["John Doe", "Jane Doe"])
2×2 DataFrame
Row │ ID Name
│ Int64 String
1 │ 20 John Doe
2 │ 40 Jane Doe
julia> b = DataFrame(IDNew=[20, 40], Job=["Lawyer", "Doctor"])
2×2 DataFrame
Row │ IDNew Job
│ Int64 String
1 │ 20 Lawyer
2 │ 40 Doctor
julia> innerjoin(a, b, on = :ID => :IDNew)
2×3 DataFrame
Row │ ID Name Job
│ Int64 String String
1 │ 20 John Doe Lawyer
2 │ 40 Jane Doe Doctor
Here is another example with multiple columns:
julia> a = DataFrame(City=["Amsterdam", "London", "London", "New York", "New York"],
Job=["Lawyer", "Lawyer", "Lawyer", "Doctor", "Doctor"],
Category=[1, 2, 3, 4, 5])
5×3 DataFrame
Row │ City Job Category
│ String String Int64
1 │ Amsterdam Lawyer 1
2 │ London Lawyer 2
3 │ London Lawyer 3
4 │ New York Doctor 4
5 │ New York Doctor 5
julia> b = DataFrame(Location=["Amsterdam", "London", "London", "New York", "New York"],
Work=["Lawyer", "Lawyer", "Lawyer", "Doctor", "Doctor"],
Name=["a", "b", "c", "d", "e"])
5×3 DataFrame
Row │ Location Work Name
│ String String String
1 │ Amsterdam Lawyer a
2 │ London Lawyer b
3 │ London Lawyer c
4 │ New York Doctor d
5 │ New York Doctor e
julia> innerjoin(a, b, on = [:City => :Location, :Job => :Work])
9×4 DataFrame
Row │ City Job Category Name
│ String String Int64 String
1 │ Amsterdam Lawyer 1 a
2 │ London Lawyer 2 b
3 │ London Lawyer 3 b
4 │ London Lawyer 2 c
5 │ London Lawyer 3 c
6 │ New York Doctor 4 d
7 │ New York Doctor 5 d
8 │ New York Doctor 4 e
9 │ New York Doctor 5 e
Handling of duplicate keys and tracking source data frame
Additionally, notice that in the last join rows 2 and 3 had the same values on on variables in both joined DataFrames. In such a situation innerjoin, outerjoin, leftjoin and rightjoin will produce all combinations of matching rows. In our example rows from 2 to 5 were created as a result. The same behavior can be observed for rows 4 and 5 in both joined DataFrames.
In order to check that columns passed as the on argument define unique keys (according to isequal) in each input data frame you can set the validate keyword argument to a two-element tuple or a pair of Bool values, with each element indicating whether to run check for the corresponding data frame. Here is an example for the join operation described above:
julia> innerjoin(a, b, on = [(:City => :Location), (:Job => :Work)], validate=(true, true))
ERROR: ArgumentError: Merge key(s) are not unique in both df1 and df2. df1 contains 2 duplicate keys: (City = "London", Job = "Lawyer") and (City = "New York", Job = "Doctor"). df2 contains 2 duplicate keys: (Location = "London", Work = "Lawyer") and (Location = "New York", Work = "Doctor").
Finally, using the source keyword argument you can add a column to the resulting data frame indicating whether the given row appeared only in the left, the right or both data frames. Here is an example:
julia> a = DataFrame(ID=[20, 40], Name=["John", "Jane"])
2×2 DataFrame
Row │ ID Name
│ Int64 String
1 │ 20 John
2 │ 40 Jane
julia> b = DataFrame(ID=[20, 60], Job=["Lawyer", "Doctor"])
2×2 DataFrame
Row │ ID Job
│ Int64 String
1 │ 20 Lawyer
2 │ 60 Doctor
julia> outerjoin(a, b, on=:ID, validate=(true, true), source=:source)
3×4 DataFrame
Row │ ID Name Job source
│ Int64 String? String? String
1 │ 20 John Lawyer both
2 │ 40 Jane missing left_only
3 │ 60 missing Doctor right_only
Note that this time we also used the validate keyword argument and it did not produce errors as the keys defined in both source data frames were unique.
Renaming joined columns
Often you want to keep track of the source data frame. This feature is supported with the renamecols keyword argument:
julia> innerjoin(a, b, on=:ID, renamecols = "_left" => "_right")
1×3 DataFrame
Row │ ID Name_left Job_right
│ Int64 String String
1 │ 20 John Lawyer
In the above example we added the "_left" suffix to the non-key columns from the left table and the "_right" suffix to the non-key columns from the right table.
Alternatively it is allowed to pass a function transforming column names:
julia> innerjoin(a, b, on=:ID, renamecols = lowercase => uppercase)
1×3 DataFrame
Row │ ID name JOB
│ Int64 String String
1 │ 20 John Lawyer
Matching missing values in joins
By default when you try to to perform a join on a key that has missing values you get an error:
julia> df1 = DataFrame(id=[1, missing, 3], a=1:3)
3×2 DataFrame
Row │ id a
│ Int64? Int64
1 │ 1 1
2 │ missing 2
3 │ 3 3
julia> df2 = DataFrame(id=[1, 2, missing], b=1:3)
3×2 DataFrame
Row │ id b
│ Int64? Int64
1 │ 1 1
2 │ 2 2
3 │ missing 3
julia> innerjoin(df1, df2, on=:id)
ERROR: ArgumentError: Missing values in key columns are not allowed when matchmissing == :error. `missing` found in column :id in left data frame.
If you would prefer missing values to be treated as equal pass the matchmissing=:equal keyword argument:
julia> innerjoin(df1, df2, on=:id, matchmissing=:equal)
2×3 DataFrame
Row │ id a b
│ Int64? Int64 Int64
1 │ 1 1 1
2 │ missing 2 3
Alternatively you might want to drop all rows with missing values. In this case pass matchmissing=:notequal:
julia> innerjoin(df1, df2, on=:id, matchmissing=:notequal)
1×3 DataFrame
Row │ id a b
│ Int64? Int64 Int64
1 │ 1 1 1
Specifying row order in the join result
By default the order of rows produced by the join operation is undefined:
julia> df_left = DataFrame(id=[1, 2, 4, 5], left=1:4)
4×2 DataFrame
Row │ id left
│ Int64 Int64
1 │ 1 1
2 │ 2 2
3 │ 4 3
4 │ 5 4
julia> df_right = DataFrame(id=[2, 1, 3, 6, 7], right=1:5)
5×2 DataFrame
Row │ id right
│ Int64 Int64
1 │ 2 1
2 │ 1 2
3 │ 3 3
4 │ 6 4
5 │ 7 5
julia> outerjoin(df_left, df_right, on=:id)
7×3 DataFrame
Row │ id left right
│ Int64 Int64? Int64?
1 │ 2 2 1
2 │ 1 1 2
3 │ 4 3 missing
4 │ 5 4 missing
5 │ 3 missing 3
6 │ 6 missing 4
7 │ 7 missing 5
If you would like the result to keep the row order of the left table pass the order=:left keyword argument:
julia> outerjoin(df_left, df_right, on=:id, order=:left)
7×3 DataFrame
Row │ id left right
│ Int64 Int64? Int64?
1 │ 1 1 2
2 │ 2 2 1
3 │ 4 3 missing
4 │ 5 4 missing
5 │ 3 missing 3
6 │ 6 missing 4
7 │ 7 missing 5
Note that in this case keys missing from the left table are put after the keys present in it.
Similarly order=:right keeps the order of the right table (and puts keys not present in it at the end):
julia> outerjoin(df_left, df_right, on=:id, order=:right)
7×3 DataFrame
Row │ id left right
│ Int64 Int64? Int64?
1 │ 2 2 1
2 │ 1 1 2
3 │ 3 missing 3
4 │ 6 missing 4
5 │ 7 missing 5
6 │ 4 3 missing
7 │ 5 4 missing
In-place left join
A common operation is adding data from a reference table to some main table. It is possible to perform such an in-place update using the leftjoin! function. In this case the left table is updated in place with matching rows from the right table.
julia> main = DataFrame(id=1:4, main=1:4)
4×2 DataFrame
Row │ id main
│ Int64 Int64
1 │ 1 1
2 │ 2 2
3 │ 3 3
4 │ 4 4
julia> leftjoin!(main, DataFrame(id=[2, 4], info=["a", "b"]), on=:id);
julia> main
4×3 DataFrame
Row │ id main info
│ Int64 Int64 String?
1 │ 1 1 missing
2 │ 2 2 a
3 │ 3 3 missing
4 │ 4 4 b
Note that in this case the order and number of rows in the left table is not changed. Therefore, in particular, it is not allowed to have duplicate keys in the right table:
julia> leftjoin!(main, DataFrame(id=[2, 2], info_bad=["a", "b"]), on=:id)
ERROR: ArgumentError: duplicate rows found in right table
# # Title: End-to-end RAG example - low-level
# This script runs an end-to-end RAG applications over text documents in `documents/` folder
# Package Imports
# using OpenAI # for interacting with the API, import if you want to skip the first part of tutorial
using JSON3
using HTTP
using LinearAlgebra
# ! Config
const MODEL_CHAT = "gpt-3.5-turbo" # default model for chat, fast and "cheap"
const MODEL_EMBEDDING = "text-embedding-ada-002" # default model for embedding test
const API_KEY = ENV["OPENAI_API_KEY"]; # do not commit, share or sync anywhere!
# Define shared abstract type for custom printing
abstract type AbstractBuildingBlock end
# Convenience for building "messages" for the chatbot faster (see the standard format in the API calls section...)
@kwdef struct PromptTemplate <: AbstractBuildingBlock
system_prompt::Union{String, Nothing} = nothing
user_prompt::String = ""
# Return type for the AI model
@kwdef struct AIMessage <: AbstractBuildingBlock
status::Union{Int, Nothing} = nothing
tokens::Tuple{Int, Int} = (-1, -1)
elapsed::Float64 = -1.0
# Stores document chunks and their embeddings
@kwdef struct ChunkIndex{T <: AbstractString} <: AbstractBuildingBlock
id::Symbol = gensym("ChunkIndex")
# Structured show method for easier reading (each kwarg on a new line)
function, t::AbstractBuildingBlock)
dump(IOContext(io, :limit => true), t, maxdepth = 1)
# Utitity to be able to combine indices from different sources/documents easily
function Base.vcat(i1::ChunkIndex{T}, i2::ChunkIndex{T}) where {T <: AbstractString}
embeddings = hcat(i1.embeddings, i2.embeddings),
chunks = vcat(i1.chunks, i2.chunks),
sources = vcat(i1.sources, i2.sources))
## Useful functions for API calls
# defined in OpenAI.jl (more sophisticated version of it...)
function create_chat(api_key::String,
messages::Vector{Dict{String, String}};
http_kwargs::NamedTuple = NamedTuple(), api_kwargs::NamedTuple = NamedTuple())
url = ""
headers = Dict("Authorization" => "Bearer $api_key",
"Content-Type" => "application/json")
# JSON-encoded data (string)
body = (; messages, model, api_kwargs...) |> JSON3.write
resp = HTTP.request("POST", url; body, headers, http_kwargs...)
return (;
response =,
status = resp.status)
"Builds a history of the conversation (=messages) to provide the prompt to the API. All kwargs are passed as replacements such that `{{key}} => value` in the template.}}"
function render(prompt::PromptTemplate; kwargs...)
conversation = Dict{String, String}[]
!isnothing(prompt.system_prompt) &&
push!(conversation, Dict("role" => "system", "content" => prompt.system_prompt))
# Replace any handlebar-style placeholders like `{{key}}` in the user_prompt with user-provided kwargs
user_prompt = replace(prompt.user_prompt, ["{{$(k)}}" => v for (k, v) in kwargs]...)
push!(conversation, Dict("role" => "user", "content" => user_prompt))
return conversation
"Builds the request to the API and waits for the response."
function aigenerate(template::PromptTemplate;
api_key::String = API_KEY,
model::String = MODEL_CHAT,
# Let's use smart defaults because OpenAI is a bit fiddly...
http_kwargs::NamedTuple = (;
retry_non_idempotent = true,
retries = 10,
readtimeout = 30), api_kwargs::NamedTuple = NamedTuple(),
conversation = render(template; kwargs...)
time = @elapsed r = create_chat(api_key,
return AIMessage(; content = r.response[:choices][begin][:message][:content] |> strip,
status = Int(r.status),
tokens = (r.response[:usage][:prompt_tokens],
elapsed = time)
"Creates embeddings for `docs` (string or array of strings) and returns a normalized matrix (column-wise)"
function aiembed(docs::Union{AbstractString, Vector{<:AbstractString}},
postprocess::F = normalize;
api_key::String = API_KEY,
model::String = MODEL_EMBEDDING,
http_kwargs::NamedTuple = NamedTuple(), api_kwargs::NamedTuple = NamedTuple(),
kwargs...) where {F <: Function}
r = create_embeddings(api_key, docs, model; http_kwargs, api_kwargs...)
return mapreduce(x -> postprocess(x[:embedding]), hcat,
"RAG wrapper that answers the given question and inject the context if needed from `index`"
function airag(index::ChunkIndex, rag_template::PromptTemplate;
question::AbstractString, top_k::Int = 3, kwargs...)
question_emb = aiembed(question)
idxs = find_closest(index, question_emb; top_k)
close_chunks = [join(index.chunks[max(begin, i - 2):min(end, i + 2)], "\n")
for i in idxs]
return aigenerate(rag_template;
context = join(close_chunks, "\n\n"),
## Useful functions for document chunk retrieval
"Splits `doc` into text chunks of size at most `max_size` (in characters), ie, it accumulates smaller chunks to match the desired size"
function build_chunks(doc::AbstractString;
max_size::Int = 128,
split_pattern::Union{String, Regex} = r"\n|\. ",
join_key::String = "\n")
## shortcut if doc is too short
length(doc) < max_size && return [doc]
## proceed
texts = split(doc, split_pattern)
doc_chunks = Vector{eltype(texts)}()
start, counter = 1, 0
# accumulate chunks until we reach the max size
for i in eachindex(texts)
l = length(texts[i])
# if it doesn't fit, we push all preceeding docs, reset the counter and start a new chunk
if l == 0 || (counter + l >= max_size)
push!(doc_chunks, join(texts[start:max(i - 1, 1)], join_key))
start = i # current text becomes the next chunk
counter = 0
counter += l
# last chunk is never pushed in, so we need to do it manually
push!(doc_chunks, join(texts[start:end], join_key))
return doc_chunks
"Finds the indices of chunks (represented by embeddings in `emb`) that are closest (cosine similarity) to query embedding (`query_emb`). Returns only `top_k` closest indices."
function find_closest(emb::AbstractMatrix{<:Real},
top_k::Int = 100)
query_emb' * emb |> vec |> sortperm |> reverse |> x -> first(x, top_k)
function find_closest(index::ChunkIndex,
top_k::Int = 100)
find_closest(index.embeddings, query_emb; top_k)
# # Main
# ## Preparation
# Before you do anything, put together 10 questions that a user might ask and potential answers based on the sources you want to use.
# It helps you think through:
# - what are the sources you want to use? how is the data structured?
# - what are the questions you want to answer? how do you want to answer them?
# Moreover, there cannot be a RAG application without an evaluation set.
# So these questions will allow you to measure how well you're progressing.
# Ideally, copy the source text you used, so you can later on measure if your RAG app used the "right" context as a source for the answer.
# Now, let's go to the documentation of DataFrames.jl and let's download a few pages from there.
# Just copy & paste the text into a text file and save it in `documents/` folder.
# You might want to remove the useless data at the beginning and end to make your life easier - this is a tutorial afterall :)
# ## Load the data
# Example
dir_raw = joinpath("documents")
files = ["comparison_with_python.txt", "database_style_joins.txt", "what_is_dataframes.txt"]
labels = ["DataFrames-Python", "DataFrames-Joins", "DataFrames-WhatIs"]
# Embed documents and build ChunkIndex (bundle of embeddings, chunks and their respective sources)
indices = ChunkIndex[]
for (fn, lbl) in zip(files, labels)
doc_raw = read(joinpath(dir_raw, fn), String)
# split into chunks, if you want to start simple - just do `split(text,"\n\n")`
doc_chunks = build_chunks(doc_raw; max_size = 256, split_pattern = "\n\n") |>
texts -> mapreduce(x -> build_chunks(x;
max_size = 256,
split_pattern = r"\n|\. "),
# Notice that we embed all doc_chunks at once, not one by one
# OpenAI supports embedding multiple documents if they are short enough, ie, it's only because the documentation pages are small
embeddings = aiembed(doc_chunks) .|> Float32
index = ChunkIndex(;
chunks = doc_chunks,
sources = fill(lbl, length(doc_chunks)))
push!(indices, index)
index = reduce(vcat, indices) # combine it across several documents
# ## Ask questions - Step-by-Step
rag_template = PromptTemplate(;
system_prompt = "Act as a world-class AI assistant and an expert in Julia language. Answer the question based only on the provided context. Be brief and concise.",
user_prompt = """
# Context
# Question
# Answer
question = "I like dplyr, what is the equivalent in Julia?"
question_emb = aiembed(question)
# Build the context of similar docs -- take the top 3 closest chunks
idxs = find_closest(index, question_emb; top_k = 3)
# We add 2 chunks before and after each of the closest chunk
close_chunks = [join(index.chunks[max(begin, i - 2):min(end, i + 2)], "\n")
for i in idxs]
answer = aigenerate(rag_template;
context = join(close_chunks, "\n\n"))
# ## Ask questions E2E
question = "I like dplyr, what is the equivalent in Julia?"
answer = airag(index, rag_template; question)
Welcome to the DataFrames.jl documentation!
This resource aims to teach you everything you need to know to get up and running with tabular data manipulation using the DataFrames.jl package.
For more illustrations of DataFrames.jl usage, in particular in conjunction with other packages you can check-out the following resources (they are kept up to date with the released version of DataFrames.jl):
What is DataFrames.jl?
DataFrames.jl provides a set of tools for working with tabular data in Julia. Its design and functionality are similar to those of pandas (in Python) and data.frame, data.table and dplyr (in R), making it a great general purpose data science tool.
DataFrames.jl plays a central role in the Julia Data ecosystem, and has tight integrations with a range of different libraries. DataFrames.jl isn't the only tool for working with tabular data in Julia – as noted below, there are some other great libraries for certain use-cases – but it provides great data wrangling functionality through a familiar interface.
To understand the toolchain in more detail, have a look at the tutorials in this manual. New users can start with the First Steps with DataFrames.jl section.
You may find the DataFramesMeta.jl package or one of the other convenience packages discussed in the Data manipulation frameworks section of this manual helpful when writing more advanced data transformations, especially if you do not have a significant programming experience. These packages provide convenience syntax similar to dplyr in R.
If you use metadata when working with DataFrames.jl you might find the TableMetadataTools.jl package useful. This package defines several convenience functions for performing typical metadata operations.
DataFrames.jl and the Julia Data Ecosystem
The Julia data ecosystem can be a difficult space for new users to navigate, in part because the Julia ecosystem tends to distribute functionality across different libraries more than some other languages. Because many people coming to DataFrames.jl are just starting to explore the Julia data ecosystem, below is a list of well-supported libraries that provide different data science tools, along with a few notes about what makes each library special, and how well integrated they are with DataFrames.jl.
StatsKit.jl: A convenience meta-package which loads a set of essential packages for statistics, including those mentioned below in this section and DataFrames.jl itself.
Statistics: The Julia standard library comes with a wide range of statistics functionality, but to gain access to these functions you must call using Statistics.
LinearAlgebra: Like Statistics, many linear algebra features (factorizations, inversions, etc.) live in a library you have to load to use.
SparseArrays are also in the standard library but must be loaded to be used.
FreqTables.jl: Create frequency tables / cross-tabulations. Tightly integrated with DataFrames.jl.
HypothesisTests.jl: A range of hypothesis testing tools.
GLM.jl: Tools for estimating linear and generalized linear models. Tightly integrated with DataFrames.jl.
StatsModels.jl: For converting heterogeneous DataFrame into homogeneous matrices for use with linear algebra libraries or machine learning applications that don't directly support DataFrames. Will do things like convert categorical variables into indicators/one-hot-encodings, create interaction terms, etc.
MultivariateStats.jl: linear regression, ridge regression, PCA, component analyses tools. Not well integrated with DataFrames.jl, but easily used in combination with StatsModels.
Machine Learning
MLJ.jl: if you're more of an applied user, there is a single package the pulls from all these different libraries and provides a single, scikit-learn inspired API: MLJ.jl. MLJ.jl provides a common interface for a wide range of machine learning algorithms.
ScikitLearn.jl: A Julia wrapper around the full Python scikit-learn machine learning library. Not well integrated with DataFrames.jl, but can be combined using StatsModels.jl.
AutoMLPipeline: A package that makes it trivial to create complex ML pipeline structures using simple expressions. It leverages on the built-in macro programming features of Julia to symbolically process, manipulate pipeline expressions, and makes it easy to discover optimal structures for machine learning regression and classification.
Deep learning: KNet.jl and Flux.jl.
Plots.jl: Powerful, modern plotting library with a syntax akin to that of matplotlib (in Python) or plot (in R). StatsPlots.jl provides Plots.jl with recipes for many standard statistical plots.
Gadfly.jl: High-level plotting library with a "grammar of graphics" syntax akin to that of ggplot (in R).
AlgebraOfGraphics.jl: A "grammar of graphics" library build upon Makie.jl.
VegaLite.jl: High-level plotting library that uses a different "grammar of graphics" syntax and has an emphasis on interactive graphics.
Data Wrangling:
Impute.jl: various methods for handling missing data in vectors, matrices and tables.
DataFramesMeta.jl: A range of convenience functions for DataFrames.jl that augment select and transform to provide a user experience similar to that provided by dplyr in R.
DataFrameMacros.jl: Provides macro versions of the common DataFrames.jl functions similar to DataFramesMeta.jl, with convenient syntax for the manipulation of multiple columns at once.
Query.jl: Query.jl provides a single framework for data wrangling that works with a range of libraries, including DataFrames.jl, other tabular data libraries (more on those below), and even non-tabular data. Provides many convenience functions analogous to those in dplyr in R or LINQ.
You can find more information on these packages in the Data manipulation frameworks section of this manual.
And More!
Graphs.jl: A pure-Julia, high performance network analysis library. Edgelists in DataFrames can be easily converted into graphs using the GraphDataFrameBridge.jl package.
DataFrames.jl work well with a range of formats, including:
CSV files (using CSV.jl),
Apache Arrow (using Arrow.jl)
reading Stata, SAS and SPSS files (using ReadStatTables.jl; alternatively Queryverse users can choose StatFiles.jl),
Parquet files (using Parquet2.jl),
reading R data files (.rda, .RData) (using RData.jl).
While not all of these libraries are tightly integrated with DataFrames.jl, because DataFrames are essentially collections of aligned Julia vectors, so it is easy to (a) pull out a vector for use with a non-DataFrames-integrated library, or (b) convert your table into a homogeneously-typed matrix using the Matrix constructor or StatsModels.jl.
Other Julia Tabular Libraries
DataFrames.jl is a great general purpose tool for data manipulation and wrangling, but it's not ideal for all applications. For users with more specialized needs, consider using:
TypedTables.jl: Type-stable heterogeneous tables. Useful for improved performance when the structure of your table is relatively stable and does not feature thousands of columns.
JuliaDB.jl: For users working with data that is too large to fit in memory, we suggest JuliaDB.jl, which offers better performance for large datasets, and can handle out-of-core data manipulations (Python users can think of JuliaDB.jl as the Julia version of dask).
Note that most tabular data libraries in the Julia ecosystem (including DataFrames.jl) support a common interface (defined in the Tables.jl package). As a result, some libraries are capable or working with a range of tabular data structures, making it easy to move between tabular libraries as your needs change. A user of Query.jl, for example, can use the same code to manipulate data in a DataFrame, a Table (defined by TypedTables.jl), or a JuliaDB table.
If there is something you expect DataFrames to be capable of, but cannot figure out how to do, please reach out with questions in Domains/Data on Discourse. Additionally you might want to listen to an introduction to DataFrames.jl on JuliaAcademy.
Please report bugs by opening an issue.
You can follow the source links throughout the documentation to jump right to the source files on GitHub to make pull requests for improving the documentation and function capabilities.
Please review DataFrames contributing guidelines before submitting your first PR!
Information on specific versions can be found on the Release page.
Package Manual
First Steps with DataFrames.jl
Setting up the Environment
Constructors and Basic Utility Functions
Getting and Setting Data in a Data Frame
Basic Usage of Transformation Functions
Getting Started
The DataFrame Type
Database-Style Joins
Introduction to joins
Key value comparisons and floating point values
Joining on key columns with different names
Handling of duplicate keys and tracking source data frame
Renaming joined columns
Matching missing values in joins
Specifying row order in the join result
In-place left join
The Split-Apply-Combine Strategy
Design of the split-apply-combine support
Examples of the split-apply-combine operations
Using GroupedDataFrame as an iterable and indexable object
Simulating the SQL where clause
Column-independent operations
Column-independent operations versus functions
Specifying group order in groupby
Reshaping and Pivoting Data
Categorical Data
Missing Data
Comparison with the Python package pandas
Comparison with the R package dplyr
Comparison with the R package data.table
Comparison with Stata (version 8 and above)
Data manipulation frameworks
Only exported (i.e. available for use without DataFrames. qualifier after loading the DataFrames.jl package with using DataFrames) types and functions are considered a part of the public API of the DataFrames.jl package. In general all such objects are documented in this manual (in case some documentation is missing please kindly report an issue here).
Breaking changes to public and documented API are avoided in DataFrames.jl where possible.
The following changes are not considered breaking:
specific floating point values computed by operations may change at any time; users should rely only on approximate accuracy;
in functions that use the default random number generator provided by Base Julia the specific random numbers computed may change across Julia versions;
if the changed functionality is classified as a bug;
if the changed behavior was not documented; two major cases are:
in its implementation some function accepted a wider range of arguments that it was documented to handle - changes in handling of undocumented arguments are not considered as breaking;
the type of the value returned by a function changes, but it still follows the contract specified in the documentation; for example if a function is documented to return a vector then changing its type from Vector to PooledVector is not considered as breaking;
error behavior: code that threw an exception can change exception type thrown or stop throwing an exception;
changes in display (how objects are printed);
changes to the state of global objects from Base Julia whose state normally is considered volatile (e.g. state of global random number generator).
All types and functions that are part of public API are guaranteed to go through a deprecation period before a breaking change is made to them or they would be removed.
The standard practice is that breaking changes are implemented when a major release of DataFrames.jl is made (e.g. functionalities deprecated in a 1.x release would be changed in the 2.0 release).
In rare cases a breaking change might be introduced in a minor release. In such a case the changed behavior still goes through one minor release during which it is deprecated. The situations where such a breaking change might be allowed are (still such breaking changes will be avoided if possible):
the affected functionality was previously clearly identified in the documentation as being subject to changes (for example in DataFrames.jl 1.4 release propagation rules of :note-style metadata are documented as such);
the change is on the border of being classified as a bug (in rare cases even if a behavior of some function was documented its consequences for certain argument combinations could be decided to be unintended and not wanted);
the change is needed to adjust DataFrames.jl functionality to changes in Base Julia.
Please be warned that while Julia allows you to access internal functions or types of DataFrames.jl these can change without warning between versions of DataFrames.jl. In particular it is not safe to directly access fields of types that are a part of public API of the DataFrames.jl package using e.g. the getfield function. Whenever some operation on fields of defined types is considered allowed an appropriate exported function should be used instead.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment