Skip to content

Instantly share code, notes, and snippets.

@rygorous
Created January 26, 2013 21:22
Embed
What would you like to do?
This is not an acceptable way to strip whitespace from a string.
void RemoveWhitespace(cString &szString)
{
// Remove leading whitespace
size_t nFirstIndex = szString.find_first_not_of(_L(' '));
if(nFirstIndex != cString::npos)
{
szString = szString.substr(nFirstIndex);
}
// Remove trailing newlines
size_t nLastIndex = szString.find_last_not_of(_L('\n'));
while(nLastIndex != szString.length()-1)
{
szString.erase(nLastIndex+1,1);
nLastIndex = szString.find_last_not_of(_L('\n'));
};
// Tabs
nLastIndex = szString.find_last_not_of(_L('\t'));
while(nLastIndex != szString.length()-1)
{
szString.erase(nLastIndex+1,1);
nLastIndex = szString.find_last_not_of(_L('\t'));
};
// Spaces
nLastIndex = szString.find_last_not_of(_L(' '));
while(nLastIndex != szString.length()-1)
{
szString.erase(nLastIndex+1,1);
nLastIndex = szString.find_last_not_of(_L(' '));
};
}
@darksylinc
Copy link

Ok, I cannot take it anymore. must not correct* ok I'll do it.... That code is horrible. Lots of veterans are laughing in horror, but unfortunately laughing won't make junior coders 'get it', so I'll proceed to explain. This comment aims to those young ones. Hopefully it will help someone:

First, erase has up to O( N ) complexity because it has to move everything that is after nLastIndex one place, because of that removal.
Worst case scenario (i.e. a string filled with tabs), this bit of code would need to shift N characters, then shift N-1 characters on the next iteration, then N-2, then N-3 and so forth.
The worst case scenario can be summarized as O( N * (N+1) / 2 ) which is very bad.

Add to this that the code tries to do it three times (newlines, tabs & spaces), you can see why it gets awful.
If the particular template implementation happens to be that after too many erase() calls it tries to "shrink to fit", it will have to reallocate memory, which will only trash memory and waste cycles here.

Second, find_last_not_of is a very useful function, but it's general use case is not optimized for this kind of abuse (searching for just one character inside a loop!). It should not be used here, specially if it's performance critical code.

Third, depending on the template implementation, there's a 99% chance substr will return a copy. This means if the string " started containing spaces at the beginning", the amount of memory used will be doubled. Ouch.
The memory being used itself is not much of a problem (unless the string is very large) but rather the waste of memory: using more memory can pollute & trash the cpu caches, allocating & deallocating memory isn't free, and it certainly hurts in a multithreaded environment unless the memory allocator is grabbing memory from a thread-local pool.

The following snippet is possible, much more efficient version:

void RemoveWhitespace( cString &szString )
{
    cString::iterator itor = szString.begin();
    cString::iterator end  = szString.end();

    cString::const_iterator it2 = szString.begin();

    // Advance until the first non-whitespace character
    while( it2 != end && *it2 == ' ' )
        ++it2;
    // Remove leading whitespace
    while( itor != end && it2 != end )
        *itor++ = *it2++;

    // Between itor & end there is now bogus data. We'll handle that during the erase
#ifdef USE_FIND_LAST_NOT_OF
    // Here's an acceptable usage because this is exactly intended usage.
    // We could do it by hand, but we're bordering microoptimization
    // sacrificing readability. Be sure to start from newSize, and not
    // from szString.size(), which now contains bogus data.
    size_t newSize = itor - szString.begin();
    const size_t nLastIndex = szString.find_last_not_of( _L(" \t\n"), newSize );
    if( nLastIndex != newSize )
        newSize = nLastIndex + 1;

    // Remove bogus data or since trailing whitespace, whatever comes first
    szString.erase( szString.begin() + newSize, szString.end() );
#else
    // Removing trailing spaces, more C'ish way, profiling will tell if it's faster
    // Note we're doing the same we did for leading whitespace, but in reverse direction
    size_t newSize = itor - szString.begin();
    // The paranthesis is very important! Otherwise it wouldn't be C++ standard compliant (because
    // the ritor iterator, could temporarily fall past end(), which is Undefined Behavior)
    // In this case it never wwill, but it's still a good practice & something to watch for
    cString::const_reverse_iterator ritor = szString.rbegin() + (szString.size() - newSize);
    cString::const_reverse_iterator rend  = szString.rend();

    while( ritor != rend && ( *ritor == ' ' || *ritor == '\t' || *ritor == '\n' ) )
        ++ritor;

    newSize = ritor.base() - szString.begin();
    szString.erase( szString.begin() + newSize, szString.end() );
#endif
}

This snippet is O( N ) in all scenarios, and doesn't waste ram at all, not a single reallocation needed; unless erase triggers a realloc, which isn't bad anyway (because it will be an intended shrink to fit)
We erase up until the end, so it takes O( 1 ), because there is nothing that comes after to move.
Much better than O( N * (N+1) / 2 ) multiple times, plus memory & cache trashing.
There are possibly more approaches to optimize it further.

I hope this serves as an example.
rygorous, feel free to correct me wherever you believe I'm wrong.
I've tested this code with multiple strings (empty strings, with & w/out leading spaces, w/ & w/out trailing spaces; 5 types of strings total) it compiles & works

I'm releasing this snippet to the public domain (where public domain is void, just use it as you like, commercial or purpose or not, whatever.). Released "AS IS" with no guarantees of any kind.
Credit / attribution is appreciated but not required.

IMPORTANT: My implementation is NOT functionally equivalent, because in the original snippet the strings:
RemoveWhitespace( "test\n\t " ) will output "test\n\t"
RemoveWhitespace( "test \t\n " ) will output "test"
while my snippet in all cases the result will be "test", which I presume was the intended behavior

Cheers
Matias Goldberg

@rygorous
Copy link
Author

For what it's worth, this code is used to read INI-like text files, and what I actually did to fix it was to stop copying strings around. My replacement function just takes a pair of iterators ("start" inclusive, "end" exclusive) and modifies it to strip whitespace:

static bool iswhite(int ch)
{
    return ch == _L(' ') || ch == _L('\t') || ch == _L('\n');
}

template<typename T>
static void RemoveWhitespace(T& start, T& end)
{
    while (start < end && iswhite(*start))
        ++start;

    while (end > start && iswhite(*(end - 1)))
        --end;
}

The code still eventually copies the extracted values into strings, but there's no reason to be creating tons of temporary strings in the meantime. For what it's worth, this is from a program that spent a significant amount of its (loading) time in RemoveWhitespace and its immediate callers, doing extremely inefficient text file parsing.

@dariomanesku
Copy link

It's probably not the topic, but C++ has a very elegant way of solving this problem. Here's how simple it is:

bool iswhite(int ch) 
{ 
    return ch == ' ' || ch == '\t' || ch == '\n';
}

void RemoveWhitespace(std::string &str)
{
    str.erase(std::remove_if(str.begin(), str.end(), iswhite), 
                  str.end());
}

@Poita
Copy link

Poita commented Jan 27, 2013

@dariomanesku: Your code removes all whitespace, not just leading and trailing.

@dariomanesku
Copy link

@Poita: Yes, that's exactly what it does, isn't that what's necessary? If not, I'm sorry, I haven't quite taken my time to look carefully at the code, I was just passing by. Also, then, the function name could be a little different.

@homeworkprod
Copy link

What about all the other whitespace characters? This targets only a rather small subset. :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment