@magnetikonline
Last active March 9, 2022 05:38
Python function - test if given file is considered binary.

Python function - is file binary?

Function which determines if a given file is binary.

The test is based on the following algorithm (similar to the heuristic behind Perl's -T and -B file tests):

  • Empty files are considered text.
  • If not empty, read up to 512 bytes as a buffer. The file is considered binary if either:
    • A null byte is encountered.
    • 30% or more of the buffer consists of "non text" characters.
  • Otherwise, file is text.
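
The steps above can be sketched on an in-memory buffer, which makes the arithmetic easy to check by hand. This is a minimal illustration rather than the gist's own code: the function name `looks_binary` and the constant `TEXT_BYTES` are hypothetical, the set of "text" bytes is assumed to be printable ASCII plus backspace, form feed, newline, carriage return and tab, and the threshold comparison uses `>=` to follow the class in this gist.

```python
# Hypothetical helper illustrating the heuristic on a bytes buffer.
READ_BYTES = 512
CHAR_THRESHOLD = 0.3
# Assumed "text" bytes: printable ASCII plus a few control characters.
TEXT_BYTES = bytes(range(32, 127)) + b"\b\f\n\r\t"


def looks_binary(data: bytes) -> bool:
    # Only the first READ_BYTES bytes are inspected.
    chunk = data[:READ_BYTES]
    if not chunk:
        # Empty input is treated as text.
        return False
    if b"\x00" in chunk:
        # Any null byte marks the input as binary.
        return True
    # Delete every text byte; whatever remains is "non text".
    non_text = len(chunk.translate(None, TEXT_BYTES))
    return non_text / len(chunk) >= CHAR_THRESHOLD


print(looks_binary(b""))                    # empty buffer
print(looks_binary(b"hello, world\n"))      # plain ASCII
print(looks_binary(b"abc\x00def"))          # contains a null byte
print(looks_binary(bytes([0x80, 0x81, 0x82, 0x83]) + b"abcdef"))  # 4 of 10 bytes non-text
```

In the last call four of the ten bytes fall outside the text set, a ratio of 0.4, which is above the 0.3 threshold.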

Reference

#!/usr/bin/env python3


class IsFileBinary:
    READ_BYTES = 512
    CHAR_THRESHOLD = 0.3
    # printable ASCII plus backspace, form feed, newline, carriage return and tab
    TEXT_CHARACTERS = bytes(range(32, 127)) + b'\b\f\n\r\t'

    def test(self, file_path):
        # read chunk of file as raw bytes, so no text decoding is attempted
        with open(file_path, 'rb') as fh:
            file_data = fh.read(IsFileBinary.READ_BYTES)

        # store chunk length read
        data_length = len(file_data)
        if (not data_length):
            # empty files considered text
            return False

        if (b'\x00' in file_data):
            # file containing null bytes is binary
            return True

        # remove all text characters from file chunk, get remaining length
        binary_length = len(file_data.translate(None, IsFileBinary.TEXT_CHARACTERS))

        # if proportion of binary characters is at or above threshold, binary file
        return (
            (float(binary_length) / data_length) >=
            IsFileBinary.CHAR_THRESHOLD
        )


def main():
    is_file_binary = IsFileBinary()
    print('Is binary file: {0}'.format(is_file_binary.test('./first')))
    print('Is binary file: {0}'.format(is_file_binary.test('./second')))
    print('Is binary file: {0}'.format(is_file_binary.test('./third')))


if (__name__ == '__main__'):
    main()

shanewa commented Mar 9, 2022

if sys.version_info < (3, 0):
    fh = open(file_path, 'r')
else:
    fh = open(file_path, 'r', encoding="ISO-8859-1")
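
Some context on why that encoding helps under Python 3: ISO-8859-1 assigns a character to every byte value from 0 to 255, so decoding cannot fail, whereas strict UTF-8 decoding can raise `UnicodeDecodeError` on arbitrary bytes. The sketch below shows the difference on a made-up sample buffer (the variable names are illustrative only). Opening the file in binary mode (`'rb'`) would be another way to avoid decoding altogether.

```python
# Sample bytes that are not valid UTF-8 (made up for illustration).
raw = bytes([0x66, 0x6F, 0x6F, 0xFF, 0xFE, 0x80])

# ISO-8859-1 maps each byte to the code point with the same value,
# so decoding always succeeds and preserves the length.
text = raw.decode("ISO-8859-1")
print(len(text) == len(raw))
print([ord(ch) for ch in text] == list(raw))

# Strict UTF-8 decoding rejects the same bytes.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)
```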
