Pamblam/isSafeQuery.md

## isSafeQuery.md

      
    Raw
  

              isSafeQuery.md
            
          
Note: This is posted as an answer to a StackOverflow question. Keeping it here for my own convenience.

I wanted a safer, more robust solution that didn't involve fully tokenizing query. Based on my experience writing SQL parsers (here, and here), I can say this solution is pretty bulletproof without having to use a full-featured query parser.
This is the best answer because:


It does not require a new SQL user with limited permissions
It does not require a new DB connection with limited permissions
It does not break if the query contains a string or a comment with the word "delete"
It allows complex queries with nested queries
It allows the user to input arbitrary SQL
It's safe

All the other answers as of the time of writing this have one or more of these limitations.
Here's how it works


Remove all inline and multi-line comments
Remove all single and double quoted strings
Remove all symbols and numbers
Create a unique array of the remaining (key)words
If any of the remaining keywords are INSERT, UPDATE, DELETE, RENAME, DROP, CREATE, TRUNCATE, ALTER, COMMIT, ROLLBACK, MERGE, CALL, EXPLAIN, LOCK, GRANT, REVOKE, SAVEPOINT, TRANSACTION, or SET then it is a "dangerous" query and should not be run.

Note:

This function does not validate the SQL, it only ensures that the SQL will not alter your database in any way. You will still need to run the query in a try/catch to make sure the query is valid.
/**
 * Determine if an SQL statement could potentially alter the database in any way.
 * @param string $sql - An SQL statement
 * @return boolean - True if query could alter the database, else false
 */
function isDangerousQuery($sql){
	$sql = trim($sql);

	// Irrelevant tokens to be parsed out of the query
	// A comment or string may contain a word like "drop"
	// so comments and strings need to be removed from the query
	$token_types = [
		[	'name' => 'Single-Line Comment',
			'start' => "--",
			'end' => "\n"	],
		[	'name' => 'Multi-Line Comment',
			'start' => "/*",
			'end' => "*/"	],
		[	'name' => 'Double-quoted String',
			'start' => "\"",
			'end' => "\""	],
		[	'name' => 'Single-quoted String',
			'start' => "'",
			'end' => "'"	]
	];
	
	// This array will contain every character that is not part
	// of one of the above described irrelevant tokens
	$keywords_buffer = [];
	
	// If we are currently parsing one of the above token types
	// it's index is held here, else this will be false
	$current_token_type_index = false;
	
	// Loop through each character and reconstruct the query without the 
	// irrelevant token types. We need to loop rather than use a regex
	// because there could be quotes nested in comments and things like that
	// that would "trick" our regex
	$length = strlen($sql);
	for ($index = 0; $index < $length; $index++) {
		$chunk = substr($sql, $index);

		// If the current char is an escape char, skip the next char
		if($sql[$index] === '\\'){
			$index++;
			continue;
		}

		// Looking for all starting tokens
		if(false === $current_token_type_index){

			foreach($token_types as $token_type_index => $token_type){
				if(0 === strpos($chunk, $token_type['start'])){
					$current_token_type_index = $token_type_index;
				}
			}

			if(false === $current_token_type_index){
				$keywords_buffer[] = $sql[$index];
			}

		// Looking for ending token
		}else if(0 === strpos($chunk, $token_types[$current_token_type_index]['end'])){

			$index += strlen($token_types[$current_token_type_index]['end']);
			if(strpos($token_types[$current_token_type_index]['end'], "\n") !== false) $keywords_buffer[] = "\n";
			$current_token_type_index = false;

		}

	}
	
	// Reconstruct the sql without the irrelevant tokens
	$sql_cleaned = implode('', $keywords_buffer);
	
	// Remove all symbols from the sql leaving only keywords and numbers
	$sql_keywords_only = preg_replace("/[^a-zA-Z_0-9\s]/", ' ', $sql_cleaned);
	
	// Create an array of unique keywords in upper-case
	$sql_keywords = array_unique(preg_split("/\s+/", strtoupper($sql_keywords_only)));
	
	// Filter out numbers and empty strings to get actual keywords
	$sql_keywords_filtered = [];
	foreach($sql_keywords as $keyword){
		if(!empty($keyword) && !is_numeric($keyword)){
			$sql_keywords_filtered[] = $keyword;
		}
	}
	
	// list of forbidden/dangerous keywords
	$dangerous_keywords = [
		'INSERT',
        'UPDATE',
        'DELETE',
        'RENAME',
        'DROP',
        'CREATE',
        'TRUNCATE',
        'ALTER',
        'COMMIT',
        'ROLLBACK',
        'MERGE',
        'CALL',
        'EXPLAIN',
        'LOCK',
        'GRANT',
        'REVOKE',
        'SAVEPOINT',
        'TRANSACTION',
        'SET'
	];
	
	// Contains an array of dangerous keywords found
	// If this array is empty, query is safe
	$found_dangerous_keywords = array_intersect($dangerous_keywords, $sql_keywords_filtered);
	
	return count($found_dangerous_keywords) > 0;
}