Skip to content

Instantly share code, notes, and snippets.

@hucsmn
Last active March 4, 2020 07:43
Show Gist options
  • Save hucsmn/1e7848bed0c09a843291ae5afe0cd895 to your computer and use it in GitHub Desktop.
Save hucsmn/1e7848bed0c09a843291ae5afe0cd895 to your computer and use it in GitHub Desktop.
Performance of suffix array contruction algorithms available in rust.
[package]
name = "rustsatest"
version = "0.1.0"
edition = "2018"
[dependencies]
suffix_array = { version = "0.4", default-features = false, features = ["parallel"] }
cdivsufsort = "*"
divsufsort = "*"
suffix = "*"
use std::env::args;
use std::fs;
use std::iter::FromIterator;
use std::time;
use cdivsufsort::sort as cdssort;
use divsufsort::sort as dssort;
use suffix::SuffixTable;
use suffix_array::SuffixArray;
fn timeit<F, T>(f: F) -> (T, time::Duration)
where
F: FnOnce() -> T,
{
let start = time::Instant::now();
let ret = f();
let dur = start.elapsed();
(ret, dur)
}
fn main() {
for path in args().skip(1) {
let data = fs::read(&path).unwrap();
println!("* file `{}` ({} bytes) *", &path, data.len());
{
let (sa, t) = timeit(|| cdssort(&data[..]).into_parts().1);
println!(
"cdivsufsort: {} indexes in {:.3}s",
sa.len(),
t.as_secs_f64()
);
}
{
let (sa, t) = timeit(|| dssort(&data[..]).into_parts().1);
println!(
"divsufsort: {} indexes in {:.3}s",
sa.len(),
t.as_secs_f64()
);
}
{
let (sa, t) = timeit(|| SuffixArray::new(&data[..]).into_parts().1);
println!(
"suffix_array: {} indexes in {:.3}s",
sa.len(),
t.as_secs_f64()
);
}
{
// tranform binary into utf8
let text = String::from_iter(data.iter().map(|&b| b as char));
let (sa, t) = timeit(|| SuffixTable::new(&text[..]).into_parts().1);
println!(
"suffix: {} indexes in {:.3}s",
sa.len(),
t.as_secs_f64()
);
}
{
// force to treat data as utf8
let text = unsafe { std::str::from_utf8_unchecked(&data[..]) };
let (sa, t) = timeit(|| SuffixTable::new(text).into_parts().1);
println!(
"suffix (force utf8): {} indexes in {:.3}s",
sa.len(),
t.as_secs_f64()
);
}
println!("");
}
}

Comparison of different SACAs available in rust.

name description
cdivsufsort C-binding to Yuta Mori's dissufsort, by Amos Wenger
divsufsort Amos Wenger's hand-ported divsufsort in rust
suffix_array a partially paralleled SACA-K for binary data, along with searching algorithms
suffix burntsushi's suffix, featuring utf-8 text indexing, using a SAIS variant with two additional bytes per character.

Test in a Intel Core i5-4200H CPU @ 2.80GHz (2 cores, 4 threads) machine with 12 GiB memory. Test data was obtained from Pizza&Chili Corpus.

Table:

name dna-50m midi-50m xml-50m code-50m en-50m pr-50m
cdivsufsort 5.973s 3.774s 4.222s 4.818s 5.618s 7.167s
divsufsort 7.617s 5.345s 5.991s 6.012s 7.844s 9.241s
suffix_array 14.999s 10.244s 13.203s 13.433s 17.660s 18.028s
suffix 24.251s 15.203s 19.828s 18.855s 27.751s 30.532s
suffix (force utf8) 24.333s 15.426s 19.289s 17.699s 27.779s 29.843s

Output log:

* file `dna-50m` (52428800 bytes) *
cdivsufsort: 52428800 indexes in 5.973s
divsufsort: 52428800 indexes in 7.617s
suffix_array: 52428801 indexes in 14.999s
suffix: 52428800 indexes in 24.251s
suffix (force utf8): 52428800 indexes in 24.333s

* file `midi-50m` (52428800 bytes) *
cdivsufsort: 52428800 indexes in 3.774s
divsufsort: 52428800 indexes in 5.345s
suffix_array: 52428801 indexes in 10.244s
suffix: 52446443 indexes in 15.203s
suffix (force utf8): 52428800 indexes in 15.426s

* file `xml-50m` (52428800 bytes) *
cdivsufsort: 52428800 indexes in 4.222s
divsufsort: 52428800 indexes in 5.991s
suffix_array: 52428801 indexes in 13.203s
suffix: 52428800 indexes in 19.828s
suffix (force utf8): 52428800 indexes in 19.289s

* file `code-50m` (52428800 bytes) *
cdivsufsort: 52428800 indexes in 4.818s
divsufsort: 52428800 indexes in 6.012s
suffix_array: 52428801 indexes in 13.433s
suffix: 52429385 indexes in 18.855s
suffix (force utf8): 52428800 indexes in 17.699s

* file `en-50m` (52428800 bytes) *
cdivsufsort: 52428800 indexes in 5.618s
divsufsort: 52428800 indexes in 7.844s
suffix_array: 52428801 indexes in 17.660s
suffix: 52445886 indexes in 27.751s
suffix (force utf8): 52428800 indexes in 27.779s

* file `pr-50m` (52428800 bytes) *
cdivsufsort: 52428800 indexes in 7.167s
divsufsort: 52428800 indexes in 9.241s
suffix_array: 52428801 indexes in 18.028s
suffix: 52428800 indexes in 30.532s
suffix (force utf8): 52428800 indexes in 29.843s
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment