Created November 24, 2021 12:06
Little bare-bones fuzzer for text conversion via PHP's mbstring library
#!/usr/bin/env ruby

# Paths to the two PHP CLI binaries being compared (pre-changes and post-changes)
$oldphp = File.join(__dir__, '../../sapi/cli/php-9308974f8c')
$newphp = File.join(__dir__, '../../sapi/cli/php-a14a5ef07f')

require 'fileutils'
include FileUtils

# Encodings to fuzz; UTF-8 is currently left out of this list
encodings = %w{UTF-7 UTF-16 UTF-16BE UTF-16LE UTF-32 UTF-32BE UTF-32LE UCS-2 UCS-2BE UCS-2LE UCS-4 UCS-4BE UCS-4LE SJIS ISO-2022-JP EUC-JP}

# Convert `str` with both PHP binaries and return [old_result, new_result].
# Each binary is selected via the shebang line of a small generated PHP script.
def try_conversion(str, from, to)
  File.open('/tmp/oldphp-conv', 'w') { |f| f.write("#!#{$oldphp}\n<?php $data = stream_get_contents(STDIN); $str = mb_convert_encoding($data, '#{to}', '#{from}'); echo $str;") }
  File.open('/tmp/newphp-conv', 'w') { |f| f.write("#!#{$newphp}\n<?php $data = stream_get_contents(STDIN); $str = mb_convert_encoding($data, '#{to}', '#{from}'); echo $str;") }
  chmod('+x', '/tmp/oldphp-conv')
  chmod('+x', '/tmp/newphp-conv')

  old_result = new_result = nil

  # Feed the input on stdin, then read everything the script prints
  IO.popen('/tmp/oldphp-conv', 'r+') do |pipe|
    pipe.write(str)
    pipe.close_write
    old_result = pipe.read
  end
  IO.popen('/tmp/newphp-conv', 'r+') do |pipe|
    pipe.write(str)
    pipe.close_write
    new_result = pipe.read
  end

  return [old_result, new_result]
end

# True if both binaries produce the same output for this input
def test_conversion(str, from, to)
  result = try_conversion(str, from, to)
  return result[0] == result[1]
end

# Shrink a mismatching input by cutting bytes off the end or the start,
# halving the cut size whenever neither cut preserves the mismatch
def reduce_case(str, from, to)
  return str if str.empty?
  reduce_by = str.length / 2
  while reduce_by > 0
    # Drop the last `reduce_by` bytes
    shorter = str[0..-(reduce_by + 1)]
    if !test_conversion(shorter, from, to)
      return reduce_case(shorter, from, to)
    end
    # Drop the first `reduce_by` bytes
    shorter = str[reduce_by..-1]
    if !test_conversion(shorter, from, to)
      return reduce_case(shorter, from, to)
    end
    reduce_by /= 2
  end
  return str
end

# Mismatches caused by known problems with the old code
def known_case(str, from, to, new_result, old_result)
  return (to == 'UTF-7' && old_result == '+') ||
         (from == 'ISO-2022-JP' && str == "\e" && old_result == '')
end

if __FILE__ == $0
  1000.times do
    # Random byte string of random length
    len = rand(1000)
    str = len.times.collect { rand(256).chr }.join
    from = encodings.sample
    to = encodings.sample

    puts "#{len}-byte string, from #{from} to #{to}"
    old_result, new_result = try_conversion(str, from, to)

    if old_result != new_result
      # Shrink the input first, then re-run it to get the outputs to report
      str = reduce_case(str, from, to)
      old_result, new_result = try_conversion(str, from, to)
      if !known_case(str, from, to, new_result, old_result)
        puts "Result didn't match"
        puts "Input string:"
        p str
        puts "From old PHP:"
        p old_result
        puts "From new PHP:"
        p new_result
        exit
      end
    end
  end
end
`reduce_case` usually manages to reduce 'bad' inputs to the smallest string which exhibits a problem, but not always. It would be more robust if it tried more ways of reducing the size of a 'bad' input. But this is just a little throwaway script, and it's not the end of the world if a bad input is not reduced as much as it could be.
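One way to try "more different ways" of reducing is to remove chunks from anywhere in the string, not only from the two ends. The following is a rough sketch of that idea, not part of the original script. The name `reduce_by_chunks` and the `still_fails` callable are hypothetical; in the script, `still_fails` would wrap `!test_conversion(candidate, from, to)`. Inputs are assumed to be byte strings.

```ruby
# Hypothetical reducer: remove chunks of decreasing size from every position.
# `still_fails` is any callable returning true while the input still shows
# the mismatch (an assumption; it is not defined in the original script).
def reduce_by_chunks(str, still_fails)
  chunk = str.bytesize / 2
  while chunk > 0
    pos = 0
    while pos < str.bytesize
      # Candidate with `chunk` bytes removed starting at `pos`
      candidate = str.byteslice(0, pos) + str.byteslice(pos + chunk, str.bytesize).to_s
      if still_fails.call(candidate)
        str = candidate # keep the smaller input and retry the same position
      else
        pos += chunk    # this chunk is needed; move on
      end
    end
    chunk /= 2
  end
  str
end
```

For example, `reduce_by_chunks(str, ->(s) { !test_conversion(s, from, to) })` could be called in place of `reduce_case(str, from, to)`. It runs many more conversions per reduction, which may or may not be worth it for a throwaway script.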
As requested by @AustinLeath.
Notes:
- I checked out the version of PHP from before my changes, built it, and cp'd the binary to `php-<COMMITHASH>`. Then checked out my changed version, built it, and again cp'd the binary to `php-<COMMITHASH>`. Then put the paths of the two binaries (one pre-changes, the other post-changes) into `$oldphp` and `$newphp` at the top of the script. The idea of the fuzzer is to pump a bunch of random strings into the new and old versions and see if the output is the same or different. If it's different, that is suspect.
- The encodings listed in `encodings` are those affected by my changes.
- `known_case` filters out positives caused by known problems with the old code (i.e. cases where I know that the old and new code produce different results and believe that the new code is correct).
- UTF-8 is commented out for now; more such cases could be added to `known_case` and then UTF-8 could be added back without creating so much noise.
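If more known differences were to be added to `known_case`, one possible arrangement is a list of rules, so each new case is a single extra entry. This is only a sketch: `KNOWN_CASES` and `known_case?` are hypothetical names, and the two rules shown are the ones already in the script. No UTF-8 rules are included because the notes do not say what they would be.

```ruby
# Each rule receives the same arguments as known_case in the script and
# returns true if the mismatch is an already-understood one.
KNOWN_CASES = [
  # Existing rule: conversions to UTF-7 where the old code output '+'
  ->(str, from, to, new_result, old_result) { to == 'UTF-7' && old_result == '+' },
  # Existing rule: a lone ESC byte as ISO-2022-JP where the old code output nothing
  ->(str, from, to, new_result, old_result) { from == 'ISO-2022-JP' && str == "\e" && old_result == '' },
]

def known_case?(str, from, to, new_result, old_result)
  KNOWN_CASES.any? { |rule| rule.call(str, from, to, new_result, old_result) }
end
```

The argument order (`new_result` before `old_result`) follows the script's existing `known_case` so the call site would not need to change beyond the method name.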