@max6cn
Last active August 29, 2015 14:18

Demo A. Text Search (Introduction to programming and parallel computing)

Summary:

Search for keywords in a text file

Tools used:

Hadoop (HDFS), Apache Spark, gcc

Language:

C, Shell Script, Python

Task I : Searching for a keyword in a file and counting its frequency

  • Old-school / standard way
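#include <stdio.h>
#include <string.h>

/* Count non-overlapping occurrences of word in the file at file_path.
   Returns -1 if the file cannot be opened. */
int wc(char* file_path, char* word){
    FILE *fp;
    int count = 0;
    int ch, len;
    if(NULL==(fp=fopen(file_path, "r")))
        return -1;
    len = (int)strlen(word);
    for(;;){
        int i;
        if(EOF==(ch=fgetc(fp))) break;
        if((char)ch != *word) continue;   /* first character must match */
        for(i=1;i<len;++i){
            if(EOF==(ch = fgetc(fp))) goto end;
            if((char)ch != word[i]){
                /* i characters were read after the candidate start; rewind
                   them so the scan resumes at the very next character */
                fseek(fp, -i, SEEK_CUR);
                goto next;
            }
        }
        ++count;   /* full match; scanning continues after the match */
        next: ;
    }
end:
    fclose(fp);
    return count;
}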

int main(void){ /* an input of "testestest" gives count 2 */
    char key[] = "test"; /* the string I am searching for */
    int wordcount = 0;

    wordcount = wc("input.txt", key);
    printf("%d\n",wordcount);
    return 0;
}
$ scp wc.c user@serverip:
$ scp input.txt user@serverip:
$ ssh user@serverip
$ gcc wc.c -o wc
$ ./wc
   xxxxx
  • Baseline
# Count the occurrences of a string in a file
def count_string_occurrence():
    string = "test"
    with open("result_file.txt") as f:
        contents = f.read()
    # str.count returns the number of non-overlapping occurrences
    print("Number of '%s' in file: %d" % (string, contents.count(string)))
  • Parallel implementation
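from pyspark import SparkContext, SparkConf
# Reuse the running context (the pyspark shell already defines sc) or create one;
# the application name is only a placeholder
sc = SparkContext.getOrCreate(SparkConf().setAppName("text-search"))
file = sc.textFile("hdfs:///user/mark/wiki.txt")
tests = file.filter(lambda line: "test" in line)
# Count all lines containing "test"
tests.count()
# Count the "test" lines that also mention "case"
tests.filter(lambda line: "case" in line).count()
# Fetch the "test" lines mentioning "case" as a list of strings
tests.filter(lambda line: "case" in line).collect()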
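
The Spark filter above counts lines that contain the keyword, while the C and Python baselines count occurrences, so the numbers can differ on the same input. The following is a minimal plain-Python sketch of that difference; the function names and the sample lines are made up for illustration and are not part of the demo.

```python
def count_matching_lines(lines, keyword):
    """Number of lines containing keyword (what filter(...).count() measures)."""
    return sum(1 for line in lines if keyword in line)


def count_occurrences(lines, keyword):
    """Total non-overlapping occurrences of keyword across all lines."""
    return sum(line.count(keyword) for line in lines)


# Hypothetical sample data, not taken from wiki.txt
sample = ["testestest", "a test case", "nothing here"]

print(count_matching_lines(sample, "test"))  # 2 lines contain "test"
print(count_occurrences(sample, "test"))     # 2 + 1 = 3 occurrences
```

To get an occurrence count out of Spark as well, one option would be to map each line to its per-line count and sum the results, rather than filtering and counting lines.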

Task II : Speed up processing using in-memory caching

tests.cache()
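
Caching helps when the same filtered data set is used by more than one action, because an uncached data set is recomputed from its source each time. The sketch below is a plain-Python analogy of that behaviour, not Spark code: `LazyLines`, `read_lines`, and the `scans` counter are hypothetical names invented for this illustration.

```python
class LazyLines:
    """Filters lines lazily; re-reads the source on every action unless cached."""

    def __init__(self, source, keyword):
        self.source = source      # callable returning a list of lines
        self.keyword = keyword
        self.scans = 0            # how many times the source was read
        self._use_cache = False
        self._cached = None

    def cache(self):
        # Only marks the data for caching; nothing is computed yet
        self._use_cache = True
        return self

    def _compute(self):
        if self._cached is not None:
            return self._cached
        self.scans += 1
        result = [line for line in self.source() if self.keyword in line]
        if self._use_cache:
            self._cached = result  # keep the result in memory for later actions
        return result

    def count(self):
        return len(self._compute())

    def collect(self):
        return list(self._compute())


def read_lines():
    # Stand-in for an expensive read from distributed storage
    return ["unit test case", "no match here", "another test"]


uncached = LazyLines(read_lines, "test")
uncached.count()
uncached.collect()

cached = LazyLines(read_lines, "test").cache()
cached.count()
cached.collect()

print(uncached.scans, cached.scans)  # prints "2 1"
```

As in the analogy, calling cache() on a Spark data set does not compute anything by itself; the data is kept in memory the first time an action runs.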

Task III : Word count in Python (with Spark)

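from pyspark import SparkContext, SparkConf
# Reuse the running context (the pyspark shell already defines sc) or create one;
# the application name is only a placeholder
sc = SparkContext.getOrCreate(SparkConf().setAppName("word-count"))
file = sc.textFile("hdfs:///user/mark/wiki.txt")
# Split lines into words, pair each word with 1, then sum the 1s per word
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
# saveAsTextFile writes a directory of part files at this path
counts.saveAsTextFile("hdfs:///user/mark/wc.txt")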
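
To see what each stage of the pipeline above does, here is a single-machine sketch of the same three steps in plain Python. It is an approximation for small inputs only; `word_count` and the sample lines are illustrative names and data, not part of the demo.

```python
from collections import defaultdict


def word_count(lines):
    # flatMap step: split every line into words and flatten into one list
    words = [word for line in lines for word in line.split(" ")]
    # map step: pair each word with the count 1
    pairs = [(word, 1) for word in words]
    # reduceByKey step: add up the counts that share the same word
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


# Hypothetical sample data
sample = ["to be or not to be", "to test"]
print(sorted(word_count(sample).items()))
# [('be', 2), ('not', 1), ('or', 1), ('test', 1), ('to', 3)]
```

Note that `split(" ")` produces empty strings when a line contains consecutive spaces, so the empty string can show up as a "word"; calling `split()` with no argument avoids that.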

Demo B. IPython Notebook System

Summary:

Python Notebook System

Tools used:

ipython notebook, matplotlib

Language:

Markdown, Python

Task B.I : Write Documents in Markdown

  • styling
  • image
  • code
  • equation
  • link

Task B.II : Execute code on Notebook, plot a sine wave

import numpy as np
import matplotlib.pyplot as plt

# 100 sample points give a smooth curve
x = np.linspace(0, 5, 100)
y = np.sin(x)
fig = plt.figure()
axes = fig.add_axes([0.1, 0.1, 0.8, 0.8]) # left, bottom, width, height (range 0 to 1)
axes.plot(x, y, 'r')
axes.set_xlabel('x')
axes.set_ylabel('y')
axes.set_title('title');

Demo C. Challenge!

  • Task : use word_cloud to generate an image like this:

Troyer-x commented Apr 7, 2015

Thanks, pretty interesting.