[WordPress] Occasionally, when parsing data provided by third parties, I've run into hidden, non-printing characters that can muck up programmatically inserting the data into the database. Here's a simple regex for stripping out everything *except* ASCII letters, digits, and whitespace.
<?php
// Remove anything that is not 'a-z', 'A-Z', '0-9', or whitespace from the given $value
$value = preg_replace( "/[^a-zA-Z0-9\s]/", "", $value );
/*
 * Read the comments below to see some of the available functions WordPress provides for evaluating the validity of the characters in the input string.
 */
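
To see what the pattern above does in practice, here is a small sketch run against a made-up input string containing a few non-printing bytes. The sample value and variable names are illustrative assumptions, not data from the original import.

```php
<?php
// Hypothetical input: visible text with NUL, unit-separator, and DEL bytes mixed in.
$value = "Acme\x00 Corp\x1F\t2013\x7F!";

// Same pattern as the gist: keep ASCII letters, digits, and whitespace only.
$clean = preg_replace( '/[^a-zA-Z0-9\s]/', '', $value );

var_dump( $clean === "Acme Corp\t2013" ); // bool(true)
```

Note that `\s` keeps tabs and newlines as well as spaces, and that punctuation is removed along with the control characters.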
KingYes commented Apr 2, 2013

And what about Hebrew characters? Or Arabic, etc.?

thefuxia commented Apr 2, 2013

You should really fix the parser and leave the file content as it is.

Owner Author

tommcfarlin commented Apr 3, 2013

@KingYes: You're right - international characters shouldn't be ignored - but this particular regex was for a very simple, narrowly defined text file.

@Tascho: The thing is, I'm not sold that the file content being handed over was accurate.
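
On the question of Hebrew or Arabic characters raised above, one possible variant is to match on Unicode character properties instead of ASCII ranges. This is only a sketch: the function name `example_strip_non_alphanumeric_unicode` is hypothetical, and it assumes the input is meant to be UTF-8.

```php
<?php
/**
 * Hypothetical helper (not part of the gist): strip everything except
 * Unicode letters, combining marks, numbers, and whitespace.
 */
function example_strip_non_alphanumeric_unicode( $value ) {
    // The /u modifier makes PCRE treat the pattern and subject as UTF-8.
    $clean = preg_replace( '/[^\p{L}\p{M}\p{N}\s]/u', '', $value );

    // With /u, preg_replace() returns null when the subject is not valid UTF-8.
    return null === $clean ? '' : $clean;
}

$clean = example_strip_non_alphanumeric_unicode( "שלום\x00 world 123!" );
var_dump( $clean === 'שלום world 123' ); // bool(true)
```

Returning an empty string for invalid UTF-8 is a design choice made for this sketch; depending on the import, logging or rejecting the record may be more appropriate.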

tommcfarlin commented Apr 3, 2013

For those who are looking for a WordPress-based solution (which is what this particular gist was used for), there's a nice function that someone mentioned in this comment.

Specifically, wp_check_invalid_utf8, which can be found [in the source in Trac](http://core.trac.wordpress.org/browser/tags/3.5.1/wp-includes/formatting.php#L499).
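
As a rough illustration of how that function might be used before inserting a value, here is a sketch. The wrapper `example_has_valid_utf8` is a hypothetical name, and the fallback branch (a `preg_match` check with the `/u` modifier) is an assumption added so the snippet also runs outside WordPress. In WordPress, `wp_check_invalid_utf8( $string, $strip = false )` returns the string unchanged when it is valid and an empty string when it is not.

```php
<?php
/**
 * Hypothetical wrapper: report whether $value is valid UTF-8.
 */
function example_has_valid_utf8( $value ) {
    $value = (string) $value;

    // An empty string is trivially valid.
    if ( '' === $value ) {
        return true;
    }

    // Inside WordPress, defer to the core function.
    if ( function_exists( 'wp_check_invalid_utf8' ) ) {
        return '' !== wp_check_invalid_utf8( $value );
    }

    // Outside WordPress (assumption): an empty pattern with /u only
    // matches when the subject is valid UTF-8.
    return 1 === preg_match( '//u', $value );
}

var_dump( example_has_valid_utf8( "caf\xC3\xA9" ) ); // valid UTF-8 "café"
var_dump( example_has_valid_utf8( "caf\xE9" ) );     // Latin-1 byte, not valid UTF-8
```

Passing `true` as the second argument to `wp_check_invalid_utf8` asks it to try stripping the invalid sequences rather than rejecting the whole string.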
