

# ceteri/0.textrank_init.py

Last active Sep 18, 2015
```python
TEXT = """ Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types systems and systems of mixed types. """

# construct a fully tagged document that is segmented into sentences
sent = sc.parallelize(TOKENIZER.tokenize(TEXT)).map(pos_tag).cache()

base = list(np.cumsum(np.array(sent.map(len).collect())))
base.insert(0, 0)
base.pop()

lens = sc.parallelize(base)
tagged_doc = lens.zip(sent).map(wrap_words).cache()
tagged_doc.collect()

# apply a sliding window to construct a graph of the relevant word pairs
tiled = tagged_doc.flatMap(lambda s: sliding_window(s, 3)).flatMap(link_words).filter(keep_pair)

t0 = tiled.map(lambda link: (link["stem"], link["stem"],))
t1 = tiled.map(lambda link: (link["stem"], link["stem"],))
neighbors = t0.union(t1)

# visualize in a graph
import matplotlib.pyplot as plt
import networkx as nx

G = nx.Graph()

for a, b in neighbors.collect():
    G.add_edge(a, b, weight=1.0)

edges = [(u, v) for (u, v, d) in G.edges(data=True)]
pos = nx.spring_layout(G)

nx.draw_networkx_nodes(G, pos, node_size=700)
nx.draw_networkx_edges(G, pos, edgelist=edges, width=6)
nx.draw_networkx_labels(G, pos, font_size=20, font_family='sans-serif')

plt.axis('off')
plt.savefig("weighted_graph.png")
plt.show()

# run TextRank, then extract the top-ranked keyphrases
rank = sc.parallelize(text_rank_graph(neighbors))
tags = tagged_doc.flatMap(lambda x: x).map(lambda w: (w["stem"], w,))

l = tags.join(rank).map(glean_rank).sortByKey().collect()

for rank, phrase in sorted(set(extract_phrases(l)), reverse=True):
    print "%0.2f %s" % (rank, phrase,)

## also try monitoring http://localhost:4040/ while this is running,
## to see the dynamics of Spark in action

## NLTK has some bugs in terms of word segmentation, and its PoS taggers
## could be much better, but this gives a general idea of how TextRank works

## In practice, a combination of TF-IDF and TextRank makes a good blend
```
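The gist above depends on Spark plus helpers defined elsewhere (`sliding_window`, `link_words`, `wrap_words`, `text_rank_graph`, `glean_rank`, `extract_phrases`), so it will not run on its own. A self-contained sketch of the same idea, without Spark or NLTK: build a co-occurrence graph with a sliding window, then run PageRank-style power iteration over it. The regex tokenizer, uniform edge weights, damping factor of 0.85, and all function names here are my own simplifications, not the gist's API.

```python
import re
from collections import defaultdict

def textrank_keywords(text, window=3, damping=0.85, iters=50):
    """Rank words by a simplified TextRank over a co-occurrence graph."""
    words = re.findall(r"[a-z]+", text.lower())
    # sliding window: each word is linked to the next (window - 1) words
    neighbors = defaultdict(set)
    for i, w in enumerate(words):
        for v in words[i + 1 : i + window]:
            if v != w:
                neighbors[w].add(v)
                neighbors[v].add(w)
    # power iteration with uniform initial scores
    rank = {w: 1.0 for w in neighbors}
    for _ in range(iters):
        rank = {
            w: (1 - damping)
            + damping * sum(rank[v] / len(neighbors[v]) for v in neighbors[w])
            for w in neighbors
        }
    # highest-scoring words first
    return sorted(rank.items(), key=lambda kv: -kv[1])

if __name__ == "__main__":
    sample = "compatibility of systems of linear constraints over natural numbers"
    for word, score in textrank_keywords(sample)[:5]:
        print("%0.2f %s" % (score, word))
```

The real gist also filters candidates by part-of-speech tag (`keep_pair`) and merges adjacent high-ranking words into phrases (`extract_phrases`); this sketch stops at single-word scores.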

### akovacsBK commented Aug 14, 2014

 \ \\\``````

### ceteri commented Aug 15, 2014

 Looks like a markdown for cuneiform? :)

### ashish2303 commented Jun 26, 2015

 Hi, can we do unsupervised sentiment analysis using the NLTK or TextBlob packages in Python over Spark, i.e. PySpark? Suppose I have different rows of sentences; after the usual preprocessing (tokenization, stop-word removal, PoS tagging, etc.), can we tag each sentence as positive, negative, or neutral in PySpark? Standalone, we can do this task in Python with TextBlob or NLTK. Will it be possible to do it over Spark with Python? Please reply if you have any idea.
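A minimal sketch of what that could look like: because the scoring function is pure (no shared state), the same function can be mapped over an RDD, e.g. `sc.parallelize(rows).map(label_sentence).collect()`. The toy lexicon below is a hypothetical stand-in for a real polarity scorer such as TextBlob's `TextBlob(sentence).sentiment.polarity`, just to keep the example self-contained.

```python
# Hypothetical toy lexicon; in real use, replace the score with
# TextBlob(sentence).sentiment.polarity or an NLTK-based scorer.
POSITIVE = {"good", "great", "excellent", "happy", "love"}
NEGATIVE = {"bad", "terrible", "awful", "sad", "hate"}

def label_sentence(sentence):
    """Tag a sentence as positive, negative, or neutral by lexicon counts."""
    tokens = sentence.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        label = "positive"
    elif score < 0:
        label = "negative"
    else:
        label = "neutral"
    return (sentence, label)

if __name__ == "__main__":
    rows = ["I love this product", "this is terrible", "it arrived on Tuesday"]
    # over Spark this would be: sc.parallelize(rows).map(label_sentence).collect()
    print([label_sentence(s) for s in rows])
```

The one caveat with third-party packages like TextBlob is that they must be installed on every Spark worker node, not just the driver.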