@dmbates
Created June 4, 2022 13:33
Code for benchmarking sort!(unique!(df)) versus sort!(df) on a real-world example
using CSV, DataFrames, Downloads, Tar
datadir = "biofast-data-v1"
tarball = "$datadir.tar.gz"
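if !isfile(tarball)
    # Join URL parts with "/" explicitly; joinpath inserts the OS path
    # separator, which is "\" on Windows and would produce an invalid URL
    dataurl = join(
        ("https://github.com/lh3/biofast/releases/download", datadir, tarball),
        "/",
    )
    Downloads.download(dataurl, tarball)
end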
isdir(datadir) || mkdir(datadir)
bedfilenames = ["ex-anno.bed", "ex-rna.bed"]
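if !all(nm -> isfile(joinpath(datadir, nm)), bedfilenames)
    # Extract only the .bed members into a temporary directory
    # (requires zcat on the PATH to decompress the tarball)
    tmpdir = Tar.extract(
        h -> endswith(h.path, ".bed"), `zcat ./$tarball`,
    )
    for nm in bedfilenames
        # mv needs the full destination path; passing the existing
        # directory datadir as the destination raises an error
        mv(joinpath(tmpdir, datadir, nm), joinpath(datadir, nm); force=true)
    end
end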
rnadf = CSV.read(
    joinpath(datadir, "ex-rna.bed"),
    DataFrame;
    delim='\t',
    types=[String, Int32, Int32],
    header=[:chr, :start, :stop],
)
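tmpdf = copy(rnadf)  # work on a copy because sort! and unique! mutate their argument
@time copy(rnadf);  # cost of the copy alone; negligible compared to sort! and unique!
sort!(unique!(tmpdf))  # this first call also compiles both methods, so the timings below exclude compilation
@time sort!(unique!(copy(rnadf)));  # deduplicate first, then sort
@time sort!(copy(rnadf));  # sort only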
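
# A minimal, self-contained sketch of the same comparison on synthetic data,
# for trying it out without downloading the tarball. The column names mirror
# rnadf; the row count, value ranges, and seed are assumptions chosen only to
# produce many duplicate rows. They are not properties of ex-rna.bed, so the
# timings here may differ considerably from those on the real file.
using DataFrames, Random
rng = MersenneTwister(42)
n = 200_000
synthdf = DataFrame(
    chr = rand(rng, ["chr1", "chr2", "chr3"], n),
    start = rand(rng, Int32(1):Int32(100), n),
    stop = rand(rng, Int32(1):Int32(100), n),
)
sort!(unique!(copy(synthdf)))  # warm-up so compilation is not timed
@time dedupsorted = sort!(unique!(copy(synthdf)));  # deduplicate, then sort
@time sorted = sort!(copy(synthdf));  # sort only
# Deduplicating before or after sorting should give the same table;
# note that unique! modifies sorted in place here
@assert dedupsorted == unique!(sorted)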