@seLc7
Last active October 17, 2017 03:12
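Background: a fixed directory periodically receives a large number of small files. Each file contains one Chinese personal name per line, and for some reason the names may contain extraneous English characters. Requirement: monitor the directory for newly created files, process the files with multiple threads, strip the noise from the names and analyze them, and periodically output how many times each name appeared over a given period.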
import os
import sys
import time
import re
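def detector(directory, sec):
    """Scan the same directory twice, `sec` seconds apart, and return the new file names.

    Could be run in its own thread.
    """
    # next(os.walk(...))[2] lists the files directly inside the directory
    origin = set(next(os.walk(directory))[2])
    time.sleep(sec)
    final = set(next(os.walk(directory))[2])
    return final.difference(origin)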
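def handler(file_names, directory="."):
    """Strip non-Chinese characters from every line and count each name's occurrences."""
    # ASCII letters, digits, and punctuation to remove, e.g. u"中文bab#$%" -> u"中文"
    noise = re.compile(r"[A-Za-z0-9\[\]`~!@#$^&*()=|{}':;,.<>/?\\%]")
    name_dict = {}
    for file_name in file_names:
        try:
            # detector() returns bare file names, so join them with the directory.
            # Assumes Python 3 and UTF-8 encoded files.
            with open(os.path.join(directory, file_name), encoding="utf-8") as f:
                for row in f:
                    name = noise.sub("", row).strip()
                    if name:
                        # Add one to this name's count in the dictionary
                        name_dict[name] = name_dict.get(name, 0) + 1
        except Exception as e:
            print("something wrong: %s (%s)" % (file_name, e))
    return name_dict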
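
The description asks for multithreaded processing and periodic statistics, which the script above does not cover yet. One possible way to combine them is sketched below. It assumes Python 3 and UTF-8 input files. The names `NON_CHINESE`, `count_names`, `process_new_files`, and `watch` are hypothetical and not part of the original gist, and the pattern keeps only characters in the CJK Unified Ideographs range, which is an assumption about what counts as a Chinese name.

```python
import os
import re
import time
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Assumption: a name consists only of CJK Unified Ideographs (U+4E00..U+9FFF).
NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]")


def count_names(path):
    """Return a Counter of the cleaned names in one file (one name per line)."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            name = NON_CHINESE.sub("", line)  # also drops the trailing newline
            if name:
                counts[name] += 1
    return counts


def process_new_files(directory, seen, pool):
    """Count names in files not yet in `seen`; return (counts, current file set)."""
    current = {
        n for n in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, n))
    }
    new_paths = [os.path.join(directory, n) for n in sorted(current - seen)]
    total = Counter()
    # Each new file is handled by a worker thread; results are merged here.
    for counts in pool.map(count_names, new_paths):
        total.update(counts)
    return total, current


def watch(directory, interval, rounds, workers=4):
    """Poll `directory` every `interval` seconds and yield one Counter per round."""
    seen = set(os.listdir(directory))  # files that already exist are ignored
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(rounds):
            time.sleep(interval)
            total, seen = process_new_files(directory, seen, pool)
            yield total
```

A caller might loop over `watch("some_dir", 5, 3)` and print `counts.most_common(10)` for each round. Unlike `detector`, this sketch carries the set of seen files from one round to the next, so files created while a round is being processed are picked up in the following round. It does not guard against reading a file that is still being written.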