@alexdowad
Created November 24, 2021 12:06
Little bare-bones fuzzer for text conversion via PHP's mbstring library
#!/usr/bin/env ruby

# Paths to the two PHP CLI binaries being compared (pre-change and post-change builds)
$oldphp = File.join(__dir__, '../../sapi/cli/php-9308974f8c')
$newphp = File.join(__dir__, '../../sapi/cli/php-a14a5ef07f')

require 'fileutils'
include FileUtils

# UTF-8 is deliberately left out of this list (see note 3 in the comments)
encodings = %w{UTF-7 UTF-16 UTF-16BE UTF-16LE UTF-32 UTF-32BE UTF-32LE UCS-2 UCS-2BE UCS-2LE UCS-4 UCS-4BE UCS-4LE SJIS ISO-2022-JP EUC-JP}

# Convert `str` from encoding `from` to encoding `to` with both PHP builds.
# Returns [old_result, new_result].
def try_conversion(str, from, to)
  # Write one small executable PHP script per binary; the shebang line selects the build
  File.open('/tmp/oldphp-conv', 'w') { |f| f.write("#!#{$oldphp}\n<?php $data = stream_get_contents(STDIN); $str = mb_convert_encoding($data, '#{to}', '#{from}'); echo $str;") }
  File.open('/tmp/newphp-conv', 'w') { |f| f.write("#!#{$newphp}\n<?php $data = stream_get_contents(STDIN); $str = mb_convert_encoding($data, '#{to}', '#{from}'); echo $str;") }
  chmod('+x', '/tmp/oldphp-conv')
  chmod('+x', '/tmp/newphp-conv')

  old_result = new_result = nil

  # Feed the input on stdin and capture whatever each build prints
  IO.popen('/tmp/oldphp-conv', 'r+') do |pipe|
    pipe.write(str)
    pipe.close_write
    old_result = pipe.read
  end
  IO.popen('/tmp/newphp-conv', 'r+') do |pipe|
    pipe.write(str)
    pipe.close_write
    new_result = pipe.read
  end

  return [old_result, new_result]
end

# True if both PHP builds produce identical output for this input
def test_conversion(str, from, to)
  result = try_conversion(str, from, to)
  return result[0] == result[1]
end
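
# Shrink a mismatching input by repeatedly chopping bytes off the end or the start,
# keeping any shorter string which still makes the two builds disagree
def reduce_case(str, from, to)
  return str if str.empty?
  reduce_by = str.length / 2
  while reduce_by > 0
    # Drop `reduce_by` bytes from the end
    shorter = str[0..-(reduce_by + 1)]
    if !test_conversion(shorter, from, to)
      return reduce_case(shorter, from, to)
    end
    # Drop `reduce_by` bytes from the start
    shorter = str[reduce_by..-1]
    if !test_conversion(shorter, from, to)
      return reduce_case(shorter, from, to)
    end
    reduce_by /= 2
  end
  return str
end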

# Mismatches caused by known problems in the old code (see note 4 in the comments)
def known_case(str, from, to, new_result, old_result)
  return (to == 'UTF-7' && old_result == '+') ||
         (from == 'ISO-2022-JP' && str == "\e" && old_result == '')
end

if __FILE__ == $0
  1000.times do
    len  = rand(1000)
    str  = len.times.collect { rand(256).chr }.join
    from = encodings.sample
    to   = encodings.sample
    puts "#{len}-byte string, from #{from} to #{to}"

    old_result, new_result = try_conversion(str, from, to)
    if old_result != new_result
      str = reduce_case(str, from, to)
      old_result, new_result = try_conversion(str, from, to)
      if !known_case(str, from, to, new_result, old_result)
        puts "Result didn't match"
        puts "Input string:"
        p str
        puts "From old PHP:"
        p old_result
        puts "From new PHP:"
        p new_result
        exit
      end
    end
  end
end

@alexdowad
Author

alexdowad commented Nov 24, 2021

As requested by @AustinLeath.

Notes:

  1. Before running this, I checked out master, built PHP, and cp'd the binary to a new file with a name like php-<COMMITHASH>. Then checked out my changed version, built it, and again cp'd the binary to php-<COMMITHASH>. Then put the paths of the two binaries (one pre-changes, the other post-changes) into $oldphp and $newphp at the top of the script. The idea of the fuzzer is to pump a bunch of random strings into the new and old versions and see if the output is the same or different. If it's different, that is suspect.
  2. The text encodings in encodings are those affected by my changes.
  3. I removed UTF-8 because I was getting too many false positives and wanted to see more results from other encodings. Cause of false positives: For UTF-8, the new code sometimes produces a different number of error markers on invalid input. I am aware of this and am not worried about it.
  4. known_case filters out positives caused by known problems with the old code (i.e. cases where I know that the old and new code produce different results and believe that the new code is correct).
  5. Probably the issue with different numbers of error markers being produced on invalid UTF-8 input could be added to known_case and then UTF-8 could be added back without creating so much noise.
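
Note 5 could be sketched roughly as follows. This is only an illustration, not something from the original script: the helper name `same_modulo_error_markers?` is hypothetical, and it assumes the error marker shows up in the output as a single `?` byte (mbstring's default substitution character), which only holds when the target encoding is ASCII-compatible.

```ruby
# Hypothetical helper, not part of the original script.
# Assumption: each error marker appears in the output as one '?' byte.
# Returns true if the two outputs differ only in how many consecutive
# markers they contain.
def same_modulo_error_markers?(old_result, new_result, marker = '?')
  # Work on raw bytes and collapse each run of markers to a single marker
  collapse = ->(s) { s.b.squeeze(marker) }
  collapse.call(old_result) == collapse.call(new_result)
end
```

With something like this, `known_case` might gain one more clause such as `(from == 'UTF-8' && same_modulo_error_markers?(old_result, new_result))`. The trade-off is that runs of genuine `?` characters in the data are collapsed too, so a real difference hidden inside such a run would be missed.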

@alexdowad
Author

  1. reduce_case usually manages to reduce 'bad' inputs to the smallest string which exhibits a problem, but not always, since it only ever chops bytes off the start or the end. It would be more robust if it tried more ways of shrinking a 'bad' input... but anyways, this is just a little throwaway script, and it's not the end of the world if a bad input is not reduced as much as it could be.
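
One possible way to try "more different ways" of shrinking an input is to delete chunks from anywhere in the string, not just the ends. The sketch below is an assumption about how that might look, not the author's method: `reduce_by_chunks` is a hypothetical name, and it takes a block which must return true whenever the candidate string still triggers the mismatch.

```ruby
# Hypothetical alternative reducer (not the original reduce_case).
# The block must return true if `candidate` still shows the problem.
def reduce_by_chunks(str)
  chunk = str.length / 2
  while chunk > 0
    pos = 0
    while pos < str.length
      # Candidate = current string with `chunk` bytes removed starting at `pos`
      candidate = str[0, pos] + str[(pos + chunk)..-1].to_s
      if yield(candidate)
        str = candidate   # still failing: keep the deletion, retry at the same position
      else
        pos += chunk      # these bytes are needed: move on to the next chunk
      end
    end
    chunk /= 2            # try finer-grained deletions
  end
  str
end
```

In the fuzzer it could be called as `reduce_by_chunks(str) { |s| !test_conversion(s, from, to) }`. It runs the two PHP binaries more often than `reduce_case` does, which is the price of the extra thoroughness.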