@shubhamagarwal92
Created April 7, 2017 14:42
This script converts a string (a sequence of characters) into the sequence of integer Unicode code points of its characters.
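As a small illustration of the conversion described above, the sketch below maps one line of text to its code points and back. The helper names `line_to_code_points` and `code_points_to_line` are hypothetical and do not appear in the script; the inverse function is an addition for illustration only, and the sketch assumes Python 3.

```python
def line_to_code_points(line, join_character=" "):
    """Return the code points of the line's characters as one joined string."""
    # strip() drops leading and trailing whitespace, including the newline.
    return join_character.join(str(ord(ch)) for ch in line.strip())


def code_points_to_line(encoded, join_character=" "):
    """Hypothetical inverse: rebuild the text from the joined code points."""
    if not encoded:
        return ""
    return "".join(chr(int(token)) for token in encoded.split(join_character))


encoded = line_to_code_points("Hi \u00e9\n")
print(encoded)  # 72 105 32 233
print(code_points_to_line(encoded) == "Hi \u00e9")  # True
```

Because a space inside the text is itself written as the number 32, the separator stays unambiguous and the mapping can be reversed, apart from the whitespace that `strip()` removes at the ends of each line.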
# Shubham Agarwal, April 2017
# This script converts each line of a text file (a sequence of characters) to a
# space-separated sequence of the integer Unicode code points of its characters.
#
# To run this script, provide the input and output file paths, and optionally the
# file encoding.
# Run as:
# python convertTextToSequency.py --readFilePath='path/to/text/' --writeFilePath='path/to/output'
# The encoding defaults to 'utf-8'; to set it explicitly, use --fileEncoding='utf-8'
import os
import sys
import argparse
parser = argparse.ArgumentParser(
    description="Convert text file to sequence of integers.")
parser.add_argument(
    "--readFilePath",
    dest="readFilePath",
    type=str,
    help="Source file path")
parser.add_argument(
    "--writeFilePath",
    dest="writeFilePath",
    type=str,
    help="Output file path")
parser.add_argument(
    "--fileEncoding",
    dest="fileEncoding",
    type=str,
    default='utf-8',
    help="Encoding of file (default to utf-8)"
)
args = parser.parse_args()
def textToSeq(readFilePath, writeFilePath, fileEncoding):
    joinCharacter = ' '
    # Opening with an explicit encoding decodes each line on read (Python 3);
    # str objects have no .decode() there, so the manual decode is dropped.
    with open(readFilePath, 'r', encoding=fileEncoding) as readFile, \
            open(writeFilePath, 'w') as writeFile:
        for line in readFile:
            tokens = list(line.strip())
            # ord returns the integer Unicode code point of a character
            writeLine = joinCharacter.join([str(ord(ch)) for ch in tokens])
            writeFile.write(writeLine + '\n')


textToSeq(args.readFilePath, args.writeFilePath, args.fileEncoding)