upepo/Data_Wrangling.md

## Data_Wrangling.md

      
Data Wrangling (data preprocessing)

Text Processing

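An introduction to the text preprocessing techniques needed for data mining. The focus is on using bash and Python in a Linux environment to shape text data into the form you need.

Recommended links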


Advanced Bash-Scripting Guide: a complete guide to shell scripting with Bash
(http://wiki.kldp.org/HOWTO/html/Adv-Bash-Scr-HOWTO/textproc.html)

Examples

Awk

# Write only the first column of input.txt to output_file.txt
$ cat input.txt | awk '{print $1}' > output_file.txt

# Print only columns 2 and 4 of lines whose first column contains 'apple' (tab delimiter)
$ cat input.txt | awk 'BEGIN {FS="\t"; OFS="\t"} { if (index($1, "apple") > 0) print $2, $4 }'
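For comparison, the awk filter above can be sketched in Python. This is a minimal illustration, not a replacement for the one-liner: the function name `select_columns` and its parameters are hypothetical, and rows too short to hold the requested columns are skipped here, whereas awk would print empty fields for them.

```python
import io

def select_columns(lines, needle="apple", match_col=0, out_cols=(1, 3), sep="\t"):
    """Yield the chosen columns of every row whose match column contains needle."""
    highest = max((match_col,) + tuple(out_cols))
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        # Assumption: skip rows that do not have every requested column.
        if len(fields) <= highest:
            continue
        if needle in fields[match_col]:
            yield sep.join(fields[i] for i in out_cols)

# In-memory sample standing in for input.txt (columns are 0-based in Python).
sample = io.StringIO("apple pie\t3\tx\tred\nbanana\t5\ty\tyellow\npineapple\t7\tz\tgold\n")
for row in select_columns(sample):
    print(row)
```

Because the function accepts any iterable of lines, the same code works on `sys.stdin` or an open file.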

sort

# -t$'\t' : tab delimiter
# -k2,2 : sort by column 2 only (-k2 alone would use everything from column 2 to the end of the line)
# -nr : numeric, reverse order
$ cat input.txt | sort -t$'\t' -k2,2 -nr
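The same ordering can be sketched in Python when the data fits in memory. The helper name `sort_by_numeric_column` is hypothetical. Unlike `sort -n`, which treats non-numeric keys as 0, this sketch raises `ValueError` on a non-numeric value, and unlike `sort` it does not spill to disk for large inputs.

```python
def sort_by_numeric_column(lines, col=1, sep="\t", reverse=True):
    """Return rows sorted by the numeric value of one column (0-based index)."""
    # Drop blank lines and trailing newlines before sorting.
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    return sorted(rows, key=lambda row: float(row.split(sep)[col]), reverse=reverse)

# Sample rows standing in for input.txt.
for row in sort_by_numeric_column(["a\t2\n", "b\t10\n", "c\t3\n"]):
    print(row)
```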
tee

Write to a file and print to STDOUT at the same time.
$ cat xxx | something.py | tee result.txt
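If the last stage of the pipeline is itself a Python script, the effect of `tee` can be approximated inside the script. The sketch below assumes a hypothetical helper named `tee`; an in-memory buffer stands in for the result file.

```python
import io
import sys

def tee(lines, *sinks):
    """Copy every line to each sink and return the number of lines written."""
    count = 0
    for line in lines:
        for sink in sinks:
            sink.write(line)
        count += 1
    return count

saved = io.StringIO()  # stands in for an open result file
tee(["row 1\n", "row 2\n"], sys.stdout, saved)
```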
Print a file skipping X lines in Bash

# Output starts at the given line number, so to skip X lines pass X+1
$ tail -n+<first line to print> <filename>
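In Python, `itertools.islice` gives the same behaviour without reading the skipped lines into a list. The function name `from_line` is hypothetical; the line number is 1-based to match `tail -n +N`.

```python
from itertools import islice

def from_line(lines, first_line):
    """Yield lines starting at the 1-based line number first_line."""
    # islice uses 0-based offsets, so line N starts at offset N - 1.
    return islice(lines, max(first_line - 1, 0), None)

for line in from_line(["header\n", "row 1\n", "row 2\n"], 2):
    print(line, end="")
```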
python

python basic

Recommended link: http://maxburstein.com/blog/python-shortcuts-for-the-python-beginner/

Using stdin, stdout, and stderr


Memory usage stays low even when the input and output files are large
Easy to test and use
Easy to run as multiple processes

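#!/usr/bin/env python3
import sys

total = 0
# Iterate over sys.stdin directly so only one line is held in memory at a time.
for line in sys.stdin:
    try:
        el = line.strip().split('\t')
        total += int(el[0])
    except Exception as e:
        # Report bad lines on stderr so they do not pollute the result on stdout.
        sys.stderr.write('Except: ' + str(e) + '\n')

print(total)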

# test & use 
$ head input.txt | ./test.py
$ cat input.txt | ./test.py > result.txt
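The "easy to test" point can be illustrated by moving the loop into a function that accepts any stream. This is a sketch under that assumption: the name `sum_first_column` is hypothetical, and `io.StringIO` objects stand in for stdin and stderr so that no input file is needed.

```python
import io
import sys

def sum_first_column(stream, errors=sys.stderr):
    """Sum the integers in the first tab-separated column, one line at a time."""
    total = 0
    for line in stream:
        try:
            total += int(line.split("\t")[0])
        except ValueError as exc:
            # Bad lines are reported, not counted.
            errors.write("Skipped line: %s\n" % exc)
    return total

fake_input = io.StringIO("3\ta\n4\tb\nnot-a-number\tc\n")
fake_errors = io.StringIO()
print(sum_first_column(fake_input, fake_errors))
```

In the real script the same function would be called as `sum_first_column(sys.stdin)`.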
parallel


parallel - build and execute shell command lines from standard input in parallel
http://www.gnu.org/software/parallel/man.html

Example

   # Each stage runs with 4 processes: decompress, run awk, then filter (via standard input/output)
   ls ./*log.gz | parallel --max-procs=4 gzip -dc {} \
   | parallel --pipe --max-procs=4 awk -f ./par.awk \
   | parallel --pipe --max-procs=4 ./pre_filter.py \
   > result.txt
option

-k : keep the output in the same order as the input (by default the order is not guaranteed)
$ parallel -j4 sleep {}\; echo {} ::: 2 1 4 3
$ parallel -j4 -k sleep {}\; echo {} ::: 2 1 4 3
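The difference between ordered and unordered output also shows up in Python's `concurrent.futures`, which can serve as a rough analogy (it is not how GNU parallel is implemented). In this sketch `Executor.map` returns results in input order, similar to `-k`, while `as_completed` yields them as the workers finish. The function names are hypothetical and the sleep times are scaled down.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def work(n):
    """Sleep for a duration proportional to n, then return n."""
    time.sleep(n * 0.05)
    return n

def run_ordered(values, workers=4):
    # map() returns results in the order of the inputs.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(work, values))

def run_as_completed(values, workers=4):
    # as_completed() yields futures in the order they finish.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(work, v) for v in values]
        return [f.result() for f in as_completed(futures)]

print(run_ordered([2, 1, 4, 3]))       # input order is preserved
print(run_as_completed([2, 1, 4, 3]))  # completion order, usually shortest sleep first
```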

Text Processing Tips


Know about sort and uniq (including uniq's -u and -d options).
Know about cut, paste, and join to manipulate text files. Many people use cut but forget about join.
It is remarkably helpful sometimes that you can do set intersection, union, and difference of text files via sort/uniq.
Suppose a and b are text files that are already uniqued.
This is fast, and works on files of arbitrary size, up to many gigabytes.
(Sort is not limited by memory, though you may need to use the -T option if /tmp is on a small root partition.)

# sort & uniq
$ cat a b | sort | uniq > c   # c is a union b
$ cat a b | sort | uniq -d > c   # c is a intersect b
$ cat a b b | sort | uniq -u > c   # c is set difference a - b
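For files small enough to fit in memory, the same three operations map directly onto Python sets. This is a sketch with a hypothetical helper `set_ops`; unlike the sort/uniq pipeline it loads both inputs fully, so it does not scale to multi-gigabyte files.

```python
def set_ops(a_lines, b_lines):
    """Return the sorted union, intersection, and difference (a - b) of two line collections."""
    a = {line.rstrip("\n") for line in a_lines}
    b = {line.rstrip("\n") for line in b_lines}
    return {
        "union": sorted(a | b),
        "intersection": sorted(a & b),
        "difference": sorted(a - b),
    }

result = set_ops(["x\n", "y\n", "z\n"], ["y\n", "w\n"])
for name, lines in result.items():
    print(name, lines)
```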

Know basic awk for simple data munging.
For example, summing all numbers in the third column of a text file: awk '{ x += $3 } END { print x }'.
This is probably 3X faster and 3X shorter than equivalent Python.
Know sort's options. Know how keys work (-t and -k).
In particular, watch out that you need to write -k1,1 to sort by only the first field; -k1 means sort according to the whole line.
Stable sort (sort -s) can be useful.
For example, to sort first by field 2, then secondarily by field 1, you can use sort -k1,1 | sort -s -k2,2
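Python's `sorted` is also stable, so the two-pass idea can be sketched the same way: sort by the secondary key first, then by the primary key. The function name `two_pass_sort` and the sample rows are illustrative only.

```python
def two_pass_sort(rows):
    """Sort by field 2, breaking ties by field 1, using two stable passes."""
    by_first = sorted(rows, key=lambda r: r[0])   # secondary key first
    return sorted(by_first, key=lambda r: r[1])   # stable pass on the primary key

rows = [("b", 2), ("a", 2), ("c", 1), ("a", 1)]
print(two_pass_sort(rows))
```

A single pass with a tuple key, `sorted(rows, key=lambda r: (r[1], r[0]))`, gives the same result; the two-pass form is shown because it mirrors the `sort | sort -s` pipeline.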