@williballenthin
Last active July 14, 2022 21:10
Extract ASCII and Unicode strings using Python.
import re
from collections import namedtuple

# Printable ASCII bytes (plus tab), escaped for use inside a regex character class.
# This is a bytes literal because the patterns are matched against bytes buffers.
ASCII_BYTE = rb" !\"#\$%&\'\(\)\*\+,-\./0123456789:;<=>\?@ABCDEFGHIJKLMNOPQRSTUVWXYZ\[\]\^_`abcdefghijklmnopqrstuvwxyz\{\|\}\\\~\t"

String = namedtuple("String", ["s", "offset"])


def ascii_strings(buf, n=4):
    # Match runs of at least n printable ASCII bytes.
    reg = rb"([%s]{%d,})" % (ASCII_BYTE, n)
    ascii_re = re.compile(reg)
    for match in ascii_re.finditer(buf):
        yield String(match.group().decode("ascii"), match.start())


def unicode_strings(buf, n=4):
    # Match runs of at least n printable ASCII characters stored as UTF-16LE,
    # i.e. each character byte followed by a NUL byte.
    reg = rb"((?:[%s]\x00){%d,})" % (ASCII_BYTE, n)
    uni_re = re.compile(reg)
    for match in uni_re.finditer(buf):
        try:
            yield String(match.group().decode("utf-16le"), match.start())
        except UnicodeDecodeError:
            pass


def main():
    import sys

    with open(sys.argv[1], 'rb') as f:
        b = f.read()

    for s in ascii_strings(b, n=4):
        print('0x{:x}: {:s}'.format(s.offset, s.s))

    for s in unicode_strings(b):
        print('0x{:x}: {:s}'.format(s.offset, s.s))


if __name__ == '__main__':
    main()
@williballenthin
Author

# Unicode (16-bit little-endian) strings from GNU strings
$ strings -n 4 -e l -a /bin/ls  | wc -l
1

# ASCII strings from GNU strings
$ strings -n 4 -a /bin/ls  | wc -l
1293

# ASCII and Unicode strings from this snippet
$ env/bin/python strings.py /bin/ls | wc -l
1294
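
To make the two passes concrete, here is a minimal sketch on a small in-memory buffer. It assumes a simplified character class (printable ASCII plus tab, written as a byte range) rather than the snippet's exact pattern, and the helper names find_ascii and find_utf16le are hypothetical.

```python
import re

# Simplified stand-in for the snippet's character class (an assumption):
# printable ASCII plus tab.
PRINTABLE = rb"[\x20-\x7e\t]"


def find_ascii(buf, n=4):
    # Runs of at least n printable single-byte characters.
    pattern = re.compile(rb"(?:%s){%d,}" % (PRINTABLE, n))
    return [(m.start(), m.group().decode("ascii")) for m in pattern.finditer(buf)]


def find_utf16le(buf, n=4):
    # Runs of at least n printable characters, each followed by a NUL byte.
    pattern = re.compile(rb"(?:%s\x00){%d,}" % (PRINTABLE, n))
    return [(m.start(), m.group().decode("utf-16le")) for m in pattern.finditer(buf)]


# Two junk bytes, an ASCII string, three junk bytes, a UTF-16LE string, one junk byte.
buf = b"\x01\x02" + b"ascii" + b"\x00\x00\xff" + "wide".encode("utf-16le") + b"\xff"

print(find_ascii(buf))    # [(2, 'ascii')]
print(find_utf16le(buf))  # [(10, 'wide')]
```

Each pass only sees its own encoding here: the wide string never contains four consecutive printable bytes, and the ASCII string is not interleaved with NUL bytes.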

@jedimasterbot

The issue is also present here: mandiant/flare-floss#347
All of the regex patterns need to be converted to bytes literals (the 'rb' prefix).
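
As a small illustration of why the prefix matters in Python 3, the sketch below applies a str pattern and a bytes pattern to the same bytes buffer. The buffer and the simplified pattern are made up for the example and are not taken from the gist.

```python
import re

buf = b"\x00\x01hello world\x00\xff"

# A str pattern cannot be applied to a bytes buffer in Python 3.
try:
    re.findall("[a-z ]{4,}", buf)
    str_pattern_ok = True
except TypeError:
    str_pattern_ok = False

# The same pattern as a bytes literal (note the rb prefix) works.
matches = re.findall(rb"[a-z ]{4,}", buf)

print(str_pattern_ok)  # False
print(matches)         # [b'hello world']
```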

@jedimasterbot

If you are running into the mandiant/flare-floss#347 issue, I have fixed the code.
Try this:

https://gist.github.com/jedimasterbot/39ef35bc4324e4b4338a210298526cd0