Created January 8, 2013 21:26
Handling Strings with Rcpp
---
title: Handling Strings with Rcpp
author: Kevin Ushey
license: GPL (>= 2)
tags: string vector
summary: Demonstrates how one might handle a vector of strings with `Rcpp`,
in addition to returning output.
---
This is a quick example of how you might use Rcpp to pass R character vectors
('strings') to C++ and return the results to R. We'll demonstrate this with a
few operations.

Sort a String with R
-----
Note that we can sort the characters within each string in plain R fairly quickly:
```{r, tidy=FALSE}
my_strings <- c("apples", "and", "cranberries")

R_str_sort <- function(strings) {
  sapply( strings, USE.NAMES=FALSE, function(x) {
    intToUtf8( sort( utf8ToInt( x ) ) )
  })
}

R_str_sort( my_strings )
```
Sort a String with C++/Rcpp
----
Let's see if we can re-create the output with Rcpp.
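```{r, engine='Rcpp'}
#include <Rcpp.h>
#include <algorithm>  // std::sort
#include <string>
#include <vector>
using namespace Rcpp;
using namespace std;

// [[Rcpp::export]]
CharacterVector cpp_str_sort( CharacterVector x ) {
  // Copy the R character vector into a std::vector of std::string
  vector< string > strings = as< vector< string > >(x);
  int len = strings.size();
  for( int i=0; i < len; i++ ) {
    // Sort the characters (bytes) of each string in place
    sort( strings[i].begin(), strings[i].end() );
  }
  // Convert the result back into an R character vector
  return wrap(strings);
}
```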
Note the main things we do here:

* We use `as` to convert our `CharacterVector` `x` into a `vector` of `std::string`s,
* We then call `std::sort`, which returns `void` and sorts each string in place,
* We then simply use Rcpp's `wrap` function to export `strings` back as a
  `CharacterVector`.

Now, let's test it, and let's benchmark it as well.
```{r}
cpp_str_sort( my_strings )

long_strings <- rep( paste( collapse="", sample( letters, 1E5, replace=TRUE ) ),
                     times=100 )

rbenchmark::benchmark( cpp_str_sort(long_strings),
                       R_str_sort(long_strings),
                       replications=3
)
```
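Note that the C++ implementation is quite a bit faster (on my machine). However,
the C++ sort operates on bytes rather than characters, so it will scramble any
multi-byte UTF-8 characters; it is only safe for single-byte (e.g. ASCII) strings.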
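
If you do need multi-byte characters to survive the sort, one possible approach is sketched below in plain standard C++ (no Rcpp, so it can be compiled on its own). The helper name `utf8_sort` is hypothetical and not part of Rcpp or the code above. It assumes the input is valid UTF-8, groups the bytes of each character together, and sorts those groups. Byte-wise comparison of UTF-8 sequences follows code point order, so the result should match what the R version produces via `utf8ToInt`.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper (not part of Rcpp): sort the characters of a UTF-8
// encoded string by code point. Assumes `s` is valid UTF-8.
std::string utf8_sort(const std::string& s) {
    std::vector<std::string> chars;  // one entry per (possibly multi-byte) character
    for (std::size_t i = 0; i < s.size(); ++i) {
        unsigned char byte = static_cast<unsigned char>(s[i]);
        // Continuation bytes look like 10xxxxxx and belong to the previous character
        bool continuation = (byte & 0xC0) == 0x80;
        if (continuation && !chars.empty()) {
            chars.back() += s[i];
        } else {
            chars.push_back(std::string(1, s[i]));
        }
    }
    // std::string comparison is byte-wise (as unsigned char), which matches
    // code point order for valid UTF-8
    std::sort(chars.begin(), chars.end());

    std::string out;
    out.reserve(s.size());
    for (const std::string& c : chars) out += c;
    return out;
}
```

For plain ASCII input this behaves like the byte sort, e.g. `utf8_sort("apples")` gives `"aelpps"`. To use it from R, you could call it on each element inside an exported function, using the same `as`/`wrap` pattern as `cpp_str_sort`. Expect it to be slower than the byte sort, since it allocates a small string per character.
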
Now, let's do something crazy -- let's see if we can use Rcpp to perform an
operation that takes a vector of strings, and returns a list of vectors of
strings. (Or, in R parlance, a list of vectors of type character).
We'll do a simple 'split', such that each string in the vector is split into
pieces of `n` characters.

Split a string at consecutive indices n
-----
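```{r, engine='Rcpp'}
#include <Rcpp.h>
#include <string>
#include <vector>
using namespace Rcpp;
using namespace std;

// [[Rcpp::export]]
List cpp_str_split( CharacterVector x, int n ) {
  // Guard against division by zero (and nonsensical negative widths)
  if( n <= 0 ) stop("'n' must be a positive integer");

  vector< string > strings = as< vector< string > >(x);
  int num_strings = strings.size();
  vector< vector< string > > out_strings;  // one vector of pieces per input string

  for( int i=0; i < num_strings; i++ ) {
    // Number of complete pieces; any trailing characters are dropped
    int num_substr = strings[i].length() / n;
    vector< string > tmp;
    for( int j=0; j < num_substr; j++ ) {
      tmp.push_back( strings[i].substr( j*n, n ) );
    }
    out_strings.push_back( tmp );
  }
  // wrap() turns the vector of vectors into a list of character vectors
  return wrap(out_strings);
}
```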
Main things to notice:

* We declare the return type to be a `List`,
* We form a container like a 2D array of strings; a vector of vectors of strings,
* We construct the split strings one by one, then place them back into our
  output container,
* The Rcpp `wrap` call still automagically coerces our vector of vectors of
  strings into a list of character vectors.

```{r}
cpp_str_split( c("abcd", "efgh", "ijkl"), 2 )
cpp_str_split( c("abc", "de"), 2 )
```
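My solution is perhaps a bit deficient (bug or feature?) in that it drops any
trailing characters that do not fill a complete piece of length `n`: `"abc"`
with `n = 2` yields only `"ab"`, and a string shorter than `n` yields an empty
character vector. Ideally, we'd either improve the C++ code or form an
appropriate wrapper to the function in R (and warn the user if truncation
might occur).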
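
As one possible improvement, here is a minimal sketch in plain standard C++ that keeps the leftover characters as a shorter final piece. The helper name `split_every` is hypothetical, not something from Rcpp or the code above. It relies on `std::string::substr` clamping the requested length at the end of the string. Like the original, it counts bytes rather than characters, so it could cut a multi-byte UTF-8 character in half.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not part of Rcpp): split `s` into pieces of `n` bytes,
// keeping a shorter final piece instead of dropping the remainder.
std::vector<std::string> split_every(const std::string& s, std::size_t n) {
    if (n == 0) throw std::invalid_argument("n must be positive");

    std::vector<std::string> out;
    for (std::size_t pos = 0; pos < s.size(); pos += n) {
        // substr() returns at most n bytes and stops at the end of the string
        out.push_back(s.substr(pos, n));
    }
    return out;
}
```

With this version, `split_every("abc", 2)` returns the two pieces `"ab"` and `"c"`, and an empty string returns an empty vector. Inside `cpp_str_split`, you could replace the inner loop with a call to such a helper for each element and still `wrap` the resulting vector of vectors.
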
Hopefully this gives you a better idea of how you might use Rcpp to perform more
extensive string manipulation with R character vectors.