Created
January 8, 2013 21:26
Handling Strings with Rcpp
---
title: Handling Strings with Rcpp
author: Kevin Ushey
license: GPL (>= 2)
tags: string vector
summary: Demonstrates how one might handle a vector of strings with `Rcpp`,
  in addition to returning output.
---
This is a quick example of how you might use Rcpp to pass R character vectors
('strings') to C++ and return the results to R. We'll demonstrate this with a
few operations.

Sort a String with R
-----

Note that we can already do this fairly quickly in plain R:
```{r, tidy=FALSE}
my_strings <- c("apples", "and", "cranberries")

R_str_sort <- function(strings) {
  sapply( strings, USE.NAMES=FALSE, function(x) {
    intToUtf8( sort( utf8ToInt( x ) ) )
  })
}

R_str_sort( my_strings )
```
Sort a String with C++/Rcpp
----

Let's see if we can re-create the output with Rcpp.
```{r, engine='Rcpp'}
#include <Rcpp.h>
#include <algorithm>  // std::sort
using namespace Rcpp;
using namespace std;

// [[Rcpp::export]]
CharacterVector cpp_str_sort( CharacterVector x ) {

  // copy the R character vector into a vector of C++ strings
  vector< string > strings = as< vector< string > >(x);

  // sort the bytes of each string in place
  int len = strings.size();
  for( int i=0; i < len; i++ ) {
    sort( strings[i].begin(), strings[i].end() );
  }

  // convert back to an R character vector
  return wrap(strings);
}
```
Note the main things we do here:

* We use `as` to convert our `CharacterVector` `x` to a `vector` of `std::string`s,
* We then call `std::sort`, which returns `void` and sorts each string in place,
* We then simply use Rcpp's `wrap` function to export `strings` back as a
  `CharacterVector`.

Now, let's test it, and let's benchmark it as well.
```{r}
cpp_str_sort( my_strings )

long_strings <- rep( paste( collapse="", sample( letters, 1E5, replace=TRUE ) ),
                     times=100 )

rbenchmark::benchmark( cpp_str_sort(long_strings),
                       R_str_sort(long_strings),
                       replications=3
                       )
```
Note that the C++ implementation is quite a bit faster (on my machine). However,
the C++ sort operates on the individual bytes of each `std::string`, so it will
not correctly handle strings containing multi-byte UTF-8 characters.
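
One way the UTF-8 limitation might be worked around is to sort whole characters
rather than bytes. The following is a minimal sketch in plain C++ (no Rcpp, so
it stands alone); `sort_utf8_chars` is a hypothetical helper name, and the code
assumes its input is valid UTF-8. It relies on the fact that, for valid UTF-8,
comparing the encoded bytes as unsigned values gives the same order as
comparing code points, which is also the order `sort( utf8ToInt( x ) )`
produces on the R side.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of Rcpp): sort the characters of a UTF-8
// string by code point. Assumes `s` is valid UTF-8.
std::string sort_utf8_chars(const std::string& s) {
  std::vector<std::string> chars;
  for (std::size_t i = 0; i < s.size(); ) {
    // a character is one lead byte plus any continuation bytes (10xxxxxx)
    std::size_t j = i + 1;
    while (j < s.size() && (static_cast<unsigned char>(s[j]) & 0xC0) == 0x80) {
      ++j;
    }
    chars.push_back(s.substr(i, j - i));
    i = j;
  }

  // std::string compares bytes as unsigned values, and for valid UTF-8 that
  // byte order agrees with code point order
  std::sort(chars.begin(), chars.end());

  // glue the sorted characters back together
  std::string out;
  out.reserve(s.size());
  for (std::size_t k = 0; k < chars.size(); ++k) out += chars[k];
  return out;
}
```

Inside `cpp_str_sort`, the call to `sort` could then be swapped for
`strings[i] = sort_utf8_chars( strings[i] );`. This sketch does not address how
the encoding of the result is marked on the R side, and it will be slower than
the byte-wise sort because it allocates one small string per character.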

Now, let's do something crazy: let's see if we can use Rcpp to perform an
operation that takes a vector of strings and returns a list of vectors of
strings (or, in R parlance, a list of character vectors).
We'll do a simple 'split', such that each string in the vector is split into
consecutive substrings of `n` characters.

Split a string at consecutive indices n
-----
```{r, engine='Rcpp'}
#include <Rcpp.h>
using namespace Rcpp;
using namespace std;

// [[Rcpp::export]]
List cpp_str_split( CharacterVector x, int n ) {

  // guard against division by zero below
  if( n < 1 ) stop("'n' must be a positive integer");

  vector< string > strings = as< vector< string > >(x);
  int num_strings = strings.size();

  // one inner vector of substrings per input string
  vector< vector< string > > out_strings;

  for( int i=0; i < num_strings; i++ ) {

    // integer division: trailing characters that do not fill a chunk are dropped
    int num_substr = strings[i].length() / n;
    vector< string > tmp;

    for( int j=0; j < num_substr; j++ ) {
      tmp.push_back( strings[i].substr( j*n, n ) );
    }

    out_strings.push_back( tmp );
  }

  return wrap(out_strings);
}
```
Main things to notice:

* We declare the output to be a `List`,
* We form a container like a 2D array of strings: a vector of vectors of strings,
* We construct the split strings one by one, then place them into our
  output container,
* The Rcpp `wrap` call still automagically coerces our vector of vectors of
  strings into a list of character vectors.
```{r}
cpp_str_split( c("abcd", "efgh", "ijkl"), 2 )
cpp_str_split( c("abc", "de"), 2 )
```
My solution is perhaps a bit deficient (bug or feature?) in that it drops any
trailing characters that do not fill a complete chunk of `n` characters; for
example, `"abc"` split with `n = 2` yields only `"ab"`. Ideally, we'd either
improve the C++ code or write an appropriate wrapper for the function in R (and
warn the user if truncation might occur).
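
As one possible improvement to the C++ side, the truncation could be avoided by
stepping through the string by position instead of precomputing the number of
full chunks. This is only a sketch in plain C++ (no Rcpp, so it stands alone);
`split_keep_remainder` is a hypothetical helper name, and like the original it
counts bytes rather than characters.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: split `s` into chunks of `n` bytes, keeping a shorter
// final chunk instead of dropping it.
std::vector<std::string> split_keep_remainder(const std::string& s, std::size_t n) {
  if (n == 0) throw std::invalid_argument("n must be positive");

  std::vector<std::string> chunks;
  for (std::size_t pos = 0; pos < s.size(); pos += n) {
    // substr clamps the requested length at the end of the string,
    // so the last chunk may be shorter than n
    chunks.push_back(s.substr(pos, n));
  }
  return chunks;
}
```

With this helper, `"abc"` split with `n = 2` would give `"ab"` and `"c"`. In
`cpp_str_split`, the inner loop could be replaced by
`out_strings.push_back( split_keep_remainder( strings[i], n ) );`.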

Hopefully this gives you a better idea how you might use Rcpp to perform more
extensive string manipulation with R character vectors.