Basic File Processing in Python

File Types:

Two major types of files that are stored on a computer

  1. Text Files
  2. Binary Files

Text Files

  • Contains lines of characters.
  • Each line ends with an EOL (End Of Line) character.
  • In Python, a common EOL character is \n.
  • A text file can be read by a program like notepad.

Binary Files

  • All files that are not text files are binary files.
  • To process a binary file, the file format must be known by the application.
  • Examples of binary files include: pdf files, png files, doc files, exe files

Methods

Open() Method

To process a file (read or write to a file), the file must first be opened. open() requires at least one parameter, the name of the file. If the mode parameter is not supplied, the file is opened in read-only mode.

Example:

  open('assignmment.txt') # Default mode is Read mode.

file name: this is the actual name of the file on the disk, including its file extension.

mode: the mode is how you are going to process the file.

Here are the modes:

  • 'r' : to read a file (the default mode)
  • 'w' : to write to a file, the contents of the file are deleted before writing to the file
  • 'x' : open for exclusive writing. If the file already exists, the open() statement will fail.
  • 'a' : to append to a file, append will add data to the end of a file, keeping all of the original data
  • '+' : added to one of the modes above to read and write the same file; it cannot be used on its own
  • 'r+' : read and write an existing file without deleting its contents
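
The difference between 'w', 'a' and 'x' is easiest to see on a throwaway file. The following is a minimal sketch; the file name demo.txt and the temporary directory are assumptions made for illustration only.

```python
import os
import tempfile

# Work in a temporary directory so no real files are touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo.txt')  # hypothetical file name

    # 'w' creates the file, or empties it if it already exists.
    with open(path, 'w') as f:
        f.write('first\n')

    # 'a' keeps the existing contents and adds to the end.
    with open(path, 'a') as f:
        f.write('second\n')

    with open(path, 'r') as f:
        after_append = f.read()

    # 'x' refuses to open a file that already exists.
    try:
        with open(path, 'x') as f:
            f.write('never written\n')
        exclusive_failed = False
    except FileExistsError:
        exclusive_failed = True

    # 'w' again: the earlier contents are discarded.
    with open(path, 'w') as f:
        f.write('third\n')
    with open(path, 'r') as f:
        after_rewrite = f.read()
```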

CAUTION: binary files can be corrupted in a Windows environment, if the file is not opened as a binary file

For binary files:

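  • 'b' : added to another mode to open the file in binary mode (needed in a Windows environment)
  • 'rb', 'wb', 'r+b' are the usual combinations for binary files. In Python 3 they matter on every platform, because binary mode returns bytes instead of str.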
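
As a rough illustration of binary mode, the sketch below writes a few raw bytes with 'wb' and reads them back unchanged with 'rb'. The file name demo.bin and the byte values are assumptions chosen for the example.

```python
import os
import tempfile

# Byte values 10 and 13 are the newline characters that text mode may translate.
payload = bytes([0, 10, 13, 255])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo.bin')  # hypothetical file name

    # 'wb' writes the bytes exactly as given.
    with open(path, 'wb') as f:
        f.write(payload)

    # 'rb' returns a bytes object, with no newline translation or decoding.
    with open(path, 'rb') as f:
        data = f.read()
```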

For more detail, see python.org documentation file.

Syntax:
file_object = open(filename, mode)

file_object: the name that will be used in the program to process the file.

filename: a string that contains the name of the actual file on the disk

mode: a string such as 'r', 'w', 'a' or 'x', optionally combined with 'b' and/or '+'

Examples:

  • For Reading :
f_input = open('myData.txt', 'r')
  • For Writing:
f_output = open('newData.txt','w')
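  • For Reading and Writing:
f_output = open('newData.txt','r+')

If you pass only the '+' sign, open() raises a ValueError. Python 2 words the error as shown below; Python 3 raises the same exception type with a different message. Python 3 also rejects 'rw+', so use 'r+'.

ValueError: mode string must begin with one of 'r', 'w', 'a' or 'U', not '+'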
  • For Appending:
f_output = open('newData.txt','a')

close() Method

When you open a file, it must also be closed. Closing a file after you are finished with it is the same for reading, writing or appending.

file_object.close()

file_object: the name that will be used in the program to process the file.

This will close the f_data file object:

f_data.close()
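
If an error occurs between open() and close(), the close() call can be skipped. One common way to guard against that is try/finally, sketched below. The file name myData.txt is reused from the examples above and the temporary directory is an assumption for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'myData.txt')  # hypothetical location

    f_data = open(path, 'w')
    try:
        f_data.write('hello\n')
    finally:
        # Runs whether or not write() raised, so the file is always closed.
        f_data.close()

    closed_after = f_data.closed  # True once close() has run
    f_data.close()                # closing an already closed file does nothing
```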

Better File Handling in Python with the with Statement

with

with is a compound statement. All statements that are indented after the with line are part of the with block. When opening a file using a with statement, the file is automatically closed when the block of code is complete, even if an error occurs inside the block.

Therefore, there is no need to close the file after processing is complete. Calling close() on it again is harmless and does not raise an error.

  • This will open the myData.txt file as f_in for reading:

    with open('myData.txt', 'r') as f_in:
  • This will open the newData.txt file as f_out for writing:

    with open('newData.txt', 'w') as f_out:
  • This will open the myData.txt file as f_in for reading and newData.txt file as f_out for writing all within the same block of code:

with open('myData.txt', 'r') as f_in, open('newData.txt', 'w') as f_out:
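
The with lines above are only headers; they need an indented body to do anything. The following sketch fills one in, copying each line in upper case from one file to another. The file contents and the temporary directory are assumptions for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    src = os.path.join(tmp_dir, 'myData.txt')
    dst = os.path.join(tmp_dir, 'newData.txt')

    # Create some sample input (made-up data).
    with open(src, 'w') as f_out:
        f_out.write('alpha\nbeta\n')

    # Two files opened in one with statement; both are closed at the end of the block.
    with open(src, 'r') as f_in, open(dst, 'w') as f_out:
        for line in f_in:
            f_out.write(line.upper())

    files_closed = f_in.closed and f_out.closed

    with open(dst, 'r') as f_in:
        copied = f_in.read()
```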

Reading Text Files

File Object Methods of Reading a Text File:

  • .read(): read the file once into one long string

    When the read() method is called, it returns some data from the file. If read() is not given a parameter, the entire file is read and placed into the variable that is assigned to it. If the EOF (End Of File) has been reached, read() will return an empty string "" (or '', either a pair of single or double quotes can be used)

Read an entire file in to a single variable:

 fileToRead = 'textFile.txt'
 f_in = open(fileToRead,'r')
 print ('Reading file ' + fileToRead)
 data = f_in.read()
 # all the data in the file is now the variable data
 # you can use any valid variable name in place of data
 # this is where the data would be processed
 print(data)
 f_in.close()
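
read() also accepts an optional size, which is useful when a file is too large to hold in one string. The sketch below reads four characters at a time until read() returns an empty string. The chunk size and the sample contents are arbitrary assumptions.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'textFile.txt')  # hypothetical location
    with open(path, 'w') as f_out:
        f_out.write('abcdefghij')

    chunks = []
    with open(path, 'r') as f_in:
        chunk = f_in.read(4)       # read at most 4 characters
        while chunk != '':         # '' signals that the EOF has been reached
            chunks.append(chunk)
            chunk = f_in.read(4)
```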

Read an entire file in to a single variable removing the EOL characters at the same time:

# all EOL characters are removed and each line is placed in a list
fname = 'textFile.txt'
f = open(fname, 'r')
lines = f.read().splitlines()
print(lines)
f.close()
  • .readline(): read the file one line at a time

    When the .readline() method is called, it reads one line from a text file. The EOL (End Of Line) is represented by the \n character. Every line returned ends with an EOL character, except possibly the last line of the file. If the EOF (End Of File) has been reached, .readline() will return an empty string ""

This is a very traditional method of reading a file. The logic used here is called "priming the loop". The first line of the file is read before the while loop. The while loop checks to see if the EOF has been reached. Provided the EOF has not been reached, the data is processed and another line is read.

fname = 'textFile.txt'
f_input = open(fname,'r')
# f_input is the file object created by the open() statement
# read the first line of the file
one_line_of_data = f_input.readline()
while '' != one_line_of_data: # keep looping while the EOF has not been read
    print (one_line_of_data)
    one_line_of_data = f_input.readline()
print ('done')
f_input.close()

This method uses an implied readline(). The for statement does the reading and quits when the EOF is reached.

fname = 'textFile.txt'
f_input = open(fname,'r')
# implied reading line by line -- no readline required
# line is a variable and can be any valid variable name
# f_input is the file object created by the open() statement
for line in f_input:
   # process the data in line here
   # NOTICE that the data is printed double spaced. 
   # One of the EOL is in the file
   # the other is created by the print statement
   # the last line is not double spaced as the last line in the file
   # does not have an EOL character
   print (line)
print ('done')
f_input.close()
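
To avoid the double spacing described in the comments above, either strip the EOL from each line or tell print() not to add its own. A minimal sketch, assuming a small sample file whose last line has no EOL character:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    fname = os.path.join(tmp_dir, 'textFile.txt')  # hypothetical location
    with open(fname, 'w') as f_out:
        f_out.write('one\ntwo\nthree')  # last line has no EOL

    stripped_lines = []
    with open(fname, 'r') as f_input:
        for line in f_input:
            # Option 1: remove the EOL before processing the line.
            stripped_lines.append(line.rstrip('\n'))
            # Option 2: keep the EOL and stop print() from adding another.
            print(line, end='')
```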
  • .readlines(): read all lines into a list, with each line as an element in the list.

    When the .readlines() method is called, the entire file is read into a list with each line being one element of the list. If the EOF (End Of File) has already been reached, .readlines() will return an empty list []

fileToRead = 'textFile.txt'
f_in = open(fileToRead, 'r')
print('Reading file ' + fileToRead)
list_of_data = f_in.readlines()
# the data in the file can now be processed in the variable list_of_data
# the EOL \n is in the string
print(list_of_data)
print("done")
f_in.close()

Writing Text Files

  • .write(): when the .write() method is called, the contents of the string variable are written to the file. It expects a single string as its argument. If you provide a list of strings, it raises a TypeError.

Write keyboard input to a file. Stop input when no data is entered.

def writef(fname ='outfile.txt'):
    f = open(fname, 'w')
    line = input('Enter some data for a line:  ')
    while line != '':
        f.write(line+ '\n')
        line = input('Enter some data for a line:  ')
    f.close()

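write() takes exactly one string. Joining pieces with + as below works only when every piece is already a string; passing a list, or several separate arguments, raises a TypeError.

   textdoc.write(line1 + "\n" + line2 + ....) 

For a sequence of strings it is more convenient to use writelines()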

  • writelines(): expects an iterable as argument (an iterable object can be a tuple, a list, a string, or an iterator in the most general sense). Each item contained in the iterator is expected to be a string.
lines = ['line1', 'line2']
with open('filename.txt', 'w') as f:
    f.writelines("%s\n" % l for l in lines)

Gist

  • open: prepare a file for processing
  • close: close a file after processing
  • with: a code block to help with processing a file, closing a file object is implicit.
Copy unique content from one file to another
unQline = []        
with open('to_be_copied_file.txt','r')  as copyFile, open('new_created_file','w') as newFile:
    
    # Reading from a File
    for line in copyFile:            
        if line.rstrip('\n') not in unQline:
            unQline.append(line.rstrip('\n'))
    
    # Writing to another File
    for line in unQline:
        newFile.write(line + '\n')
        
print('Copied Unique Content')
Copied Unique Content
NOTE:

If we don't specify a mode when opening a file, read mode is the default and writing to the file is not possible.

  • rstrip() - strips \n (new line character from right hand side)
  • .write() - Write a line to File object
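
The list lookup in the example above gets slower as the number of unique lines grows. One possible variation keeps a set next to the list for faster membership tests while preserving the original order. The helper name unique_lines is hypothetical and is not part of the original example.

```python
def unique_lines(lines):
    """Return the lines without their EOL, dropping repeats but keeping first-seen order."""
    seen = set()      # fast membership test
    result = []       # preserves the original order
    for line in lines:
        stripped = line.rstrip('\n')
        if stripped not in seen:
            seen.add(stripped)
            result.append(stripped)
    return result


# Works on any iterable of strings, including an open file object.
sample = ['a\n', 'b\n', 'a\n', 'c']
deduplicated = unique_lines(sample)
```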

Explain all the file processing modes supported by Python ?

Python lets you open a file in one of four basic modes: read-only, write-only, append, and exclusive creation, selected with the flags r, w, a, and x respectively. Adding + to a flag (r+, w+, a+) opens the file for both reading and writing. There is no rw mode.

A text file is opened by adding the option t to any of these flags, giving rt, wt, r+t, and at. Text mode is the default, so the t can be left out.

A binary file is opened by adding the option b to any of these flags, giving rb, wb, r+b, and ab.
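
The practical difference between the t and b options is the type of data you get back. The sketch below writes one short string as text and then reads the same file both ways. The file name and the UTF-8 encoding are assumptions made for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo.txt')  # hypothetical file name

    # Text mode: str goes in and is encoded on the way to disk.
    with open(path, 'wt', encoding='utf-8') as f:
        f.write('caf\u00e9')

    # Text mode: bytes on disk are decoded back to str.
    with open(path, 'rt', encoding='utf-8') as f:
        text = f.read()

    # Binary mode: the raw bytes are returned untouched.
    with open(path, 'rb') as f:
        raw = f.read()
```

Here text holds four characters while raw holds five bytes, because the accented letter takes two bytes in UTF-8.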

Explain the shortest way to open a text file and display its contents?

The shortest way to open a text file is by using the with statement as follows:

with open("file-name", "r") as fp:
    fileData = fp.read()
# to print the contents of the file
print(fileData)

How to display the contents of text file in reverse order?

  1. convert the given file into a list.
  2. reverse the list by using reversed()
# the raw string (r'...') keeps the backslashes in the Windows path literal
with open(r'C:\Users\staml\Desktop\d.txt', 'r') as f:
    reversedLine = reversed(list(f))
for rline in reversedLine:
    print(rline)
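
A self-contained sketch of the same two steps on a throwaway file, using splitlines() so the EOL characters do not end up in the result. The sample contents are made up for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'd.txt')  # hypothetical location
    with open(path, 'w') as f:
        f.write('a\nb\nc\n')

    # Step 1: turn the file into a list of lines. Step 2: reverse it.
    with open(path, 'r') as f:
        reversed_lines = list(reversed(f.read().splitlines()))

for rline in reversed_lines:
    print(rline)
```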

Name the File-related modules in Python?

Python provides libraries / modules with functions that enable you to manipulate text files and binary files on file system. Using them you can create files, update their contents, copy, and delete files.

The libraries are : os, os.path, and shutil.

Here, os and os.path – modules include functions for accessing the filesystem

shutil – module enables you to copy and delete the files.
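
As a rough sketch of how these modules fit together, the example below creates a file, copies it with shutil.copy(), and deletes the original with os.remove(). The file names and the temporary directory are assumptions for illustration.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    original = os.path.join(tmp_dir, 'original.txt')    # hypothetical names
    duplicate = os.path.join(tmp_dir, 'duplicate.txt')

    with open(original, 'w') as f:
        f.write('data\n')

    shutil.copy(original, duplicate)                 # copy the file
    names_after_copy = sorted(os.listdir(tmp_dir))   # list the directory

    os.remove(original)                              # delete the original
    names_after_remove = sorted(os.listdir(tmp_dir))
```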

How do you check the file existence and their types in Python?

os.path.exists() – use this method to check for the existence of a path. It returns True if the path exists, False otherwise. Eg:

import os; os.path.exists('/etc/hosts')

os.path.isfile() – this method is used to check whether the given path references a file or not. It returns True if the path references a file, else it returns False. Eg:

import os; os.path.isfile('/etc/hosts')

os.path.isdir() – this method is used to check whether the given path references a directory or not. It returns True if the path references a directory, else it returns False. Eg:

import os; os.path.isdir('/etc')

os.path.getsize() – returns the size of the given file in bytes.

os.path.getmtime() – returns the last-modification time of the given path, in seconds since the epoch.
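
The checks above can be tried without relying on /etc/hosts being present. This minimal sketch runs them against a file created in a temporary directory; the file name and contents are assumptions.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'hosts.txt')  # hypothetical file name
    with open(path, 'w') as f:
        f.write('12345')

    file_exists = os.path.exists(path)    # the path is present
    is_file = os.path.isfile(path)        # and it is a regular file
    file_is_dir = os.path.isdir(path)     # a file is not a directory
    dir_is_dir = os.path.isdir(tmp_dir)   # the containing folder is
    size = os.path.getsize(path)          # size in bytes
    mtime = os.path.getmtime(path)        # seconds since the epoch, as a float

# The temporary directory has been removed, so the path no longer exists.
exists_afterwards = os.path.exists(path)
```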

Differentiate between .py and .pyc files?

A .py file holds the source code; a .pyc file holds the byte code compiled from it. Python generates the .pyc file automatically when a module is imported, so that later imports load faster. The byte code is platform independent, but it is specific to the Python version that produced it.

Note: there is no difference in speed when program is read from .pyc or .py file; the only difference is the load time.
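
Python normally writes .pyc files on its own at import time, but the standard py_compile module can produce one explicitly, which makes the relationship easy to see. The sketch below is only an illustration; the module name hello.py and the output location are assumptions.

```python
import os
import py_compile
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    source = os.path.join(tmp_dir, 'hello.py')    # hypothetical module
    target = os.path.join(tmp_dir, 'hello.pyc')

    # The .py file holds plain source text.
    with open(source, 'w') as f:
        f.write("print('hello')\n")

    # Compile the source to byte code; compile() returns the path it wrote.
    compiled_path = py_compile.compile(source, cfile=target)
    pyc_exists = os.path.isfile(compiled_path)
```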

If we want to remove leading and ending spaces, use str.strip():

sentence = ' hello  apple'
sentence.strip()
>>> 'hello  apple'  
    
# Here we were not able to remove in between spaces

If we want to remove duplicated spaces, use str.split():

    
    sentence = ' hello  apple'
    " ".join(sentence.split())
    >>> 'hello apple'
    

If we want to remove all spaces, use str.replace():

    
    sentence = ' hello  apple'
    sentence.replace(" ", "")
    >>> 'helloapple'
    

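Another way is to use str.translate(), optionally together with the string module.

Whitespace includes spaces, tabs and CR/LF. In Python 3, translate() takes a table built with str.maketrans(); the third argument lists the characters to delete. (The two-argument form translate(None, chars) only works in Python 2.)

    
    ' hello  apple'.translate(str.maketrans('', '', ' \n\t\r'))
    

Or

    
    import string
    ' hello  apple'.translate(str.maketrans('', '', string.whitespace))
    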
    

Using Regular Expressions

    import re    
    sentence = ' hello  apple'
    re.sub(' ','',sentence) # 'helloapple' (remove all spaces)
    re.sub('  ',' ',sentence) # ' hello apple' (collapse the double space; the leading space remains)
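
A pattern based on \s+ handles runs of any length and any kind of whitespace, not only single or double spaces. A minimal sketch using the same sample sentence:

```python
import re

sentence = ' hello  apple'

# Collapse every run of whitespace to one space, then trim the ends.
collapsed = re.sub(r'\s+', ' ', sentence).strip()

# Remove whitespace entirely.
no_whitespace = re.sub(r'\s+', '', sentence)
```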
Why am I unable to use a string for a newline in write() but I can use it in writelines()?

The idea is the following: if you want to write a single string you can do this with write(). If you have a sequence of strings you can write them all using writelines().

write(arg) expects a string as argument and writes it to the file. If you provide a list of strings, it will raise a TypeError.

writelines(arg) expects an iterable as argument (an iterable object can be a tuple, a list, a string, or an iterator in the most general sense). Each item contained in the iterable is expected to be a string. A tuple of strings satisfies that, so passing one works.

Neither function cares about the content of the string(s); each just writes to the file whatever you provide. The interesting part is that writelines() does not add newline characters on its own, so the method name can actually be quite confusing. It actually behaves like an imaginary method called write_all_of_these_strings(sequence).
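
The missing newlines are easy to confirm on a throwaway file. In this sketch the same list is written twice, once as it is and once with an EOL appended to each item. The temporary file is an assumption for illustration.

```python
import os
import tempfile

lines = ['line1', 'line2']

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'filename.txt')  # hypothetical location

    # writelines() writes the items back to back, with nothing in between.
    with open(path, 'w') as f:
        f.writelines(lines)
    with open(path, 'r') as f:
        without_newlines = f.read()

    # Adding the EOL yourself puts each item on its own line.
    with open(path, 'w') as f:
        f.writelines(line + '\n' for line in lines)
    with open(path, 'r') as f:
        with_newlines = f.read()
```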

What follows is an idiomatic way in Python to write a list of strings to a file while keeping each string in its own line:

lines = ['line1', 'line2']
with open('filename.txt', 'w') as f:
    f.write('\n'.join(lines))

This takes care of closing the file for you. The construct '\n'.join(lines) concatenates (connects) the strings in the list lines and uses the character '\n' as glue. It is more efficient than using the + operator.

Starting from the same lines sequence, ending up with the same output, but using writelines():

lines = ['line1', 'line2']
with open('filename.txt', 'w') as f:
    f.writelines("%s\n" % l for l in lines)

This makes use of a generator expression and dynamically creates newline-terminated strings. writelines() iterates over this sequence of strings and writes every item.

Edit: Another point you should be aware of:

write() and readlines() existed before writelines() was introduced. writelines() was introduced later as a counterpart of readlines(), so that one could easily write the file content that was just read via readlines():

outfile.writelines(infile.readlines())

Really, this is the main reason why writelines has such a confusing name. Also, today, we do not really want to use this method anymore. readlines() reads the entire file to the memory of your machine before writelines() starts to write the data. First of all, this may waste time. Why not start writing parts of data while reading other parts? But, most importantly, this approach can be very memory consuming. In an extreme scenario, where the input file is larger than the memory of your machine, this approach won't even work. The solution to this problem is to use iterators only. A working example:

with open('inputfile') as infile:
    with open('outputfile', 'w') as outfile:
        for line in infile:
            outfile.write(line)

This reads the input file line by line. As soon as one line is read, this line is written to the output file. Put schematically, there is only ever one line in memory at a time (compared to the entire file content being in memory with the readlines/writelines approach).
