Playing around with Big Data!

Maziyar Panahi

Vivek Gupta Sep 2nd, 2020 at 10:02 AM
I am new to Spark NLP. I am writing a custom transformer that removes tokens of length <= 2 from text. The transformer works and does its job, but its output does not have the proper structure: it returns only an array of strings. I am struggling to get output in the following structure:
StructField("annotatorType", StringType(), False),
StructField("begin", IntegerType(), False),
StructField("end", IntegerType(), False),
StructField("result", StringType(), False),
StructField("metadata", MapType(StringType(), StringType()), True)
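One way to keep that structure is to have the filter return whole annotation records instead of only their text. The snippet below is a minimal sketch in plain Python, not the asker's transformer: the function name `keep_long_tokens`, the sample `tokens` list, and the inclusive `end` offsets are all assumptions made for illustration. Only the five field names come from the schema above.

```python
from typing import Any, Dict, List

# The five field names come from the schema quoted in the question.
ANNOTATION_FIELDS = ("annotatorType", "begin", "end", "result", "metadata")


def keep_long_tokens(
    annotations: List[Dict[str, Any]], min_len: int = 3
) -> List[Dict[str, Any]]:
    """Drop token annotations whose text is shorter than min_len characters.

    Returns full annotation records rather than bare strings, so every
    surviving token keeps the same five-field shape as the input.
    """
    kept = []
    for ann in annotations:
        if len(ann["result"]) >= min_len:
            # Copy exactly the schema fields so the record shape stays fixed.
            kept.append({name: ann[name] for name in ANNOTATION_FIELDS})
    return kept


# Hypothetical input: token annotations for the text "I am new to sparknlp".
# Inclusive `end` offsets are an assumption about the offset convention.
tokens = [
    {"annotatorType": "token", "begin": 0, "end": 0, "result": "I", "metadata": {"sentence": "0"}},
    {"annotatorType": "token", "begin": 2, "end": 3, "result": "am", "metadata": {"sentence": "0"}},
    {"annotatorType": "token", "begin": 5, "end": 7, "result": "new", "metadata": {"sentence": "0"}},
    {"annotatorType": "token", "begin": 9, "end": 10, "result": "to", "metadata": {"sentence": "0"}},
    {"annotatorType": "token", "begin": 12, "end": 19, "result": "sparknlp", "metadata": {"sentence": "0"}},
]

filtered = keep_long_tokens(tokens)
print([ann["result"] for ann in filtered])
```

If this logic is wrapped in a PySpark UDF, the declared return type decides the column schema. Declaring it as an `ArrayType` of a `StructType` built from the five `StructField` entries above, rather than as an array of strings, is presumably what yields the desired output structure.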
View wikipedia-iso-country-codes.csv
English short name lower case,Alpha-2 code,Alpha-3 code,Numeric code,ISO 3166-2
Afghanistan,AF,AFG,004,ISO 3166-2:AF
Åland Islands,AX,ALA,248,ISO 3166-2:AX
Albania,AL,ALB,008,ISO 3166-2:AL
Algeria,DZ,DZA,012,ISO 3166-2:DZ
American Samoa,AS,ASM,016,ISO 3166-2:AS
Andorra,AD,AND,020,ISO 3166-2:AD
Angola,AO,AGO,024,ISO 3166-2:AO
Anguilla,AI,AIA,660,ISO 3166-2:AI
Antarctica,AQ,ATA,010,ISO 3166-2:AQ
Created Sep 4, 2019 — forked from baraldilorenzo/
VGG-16 pre-trained model for Keras

## VGG16 model for Keras

This is the Keras model of the 16-layer network used by the VGG team in the ILSVRC-2014 competition.

It was obtained by directly converting the Caffe model provided by the authors.

Details about the network architecture can be found in the following arXiv paper:

Very Deep Convolutional Networks for Large-Scale Image Recognition
K. Simonyan, A. Zisserman
View zeppelin-pyspark-yarn.txt
DEBUG [2019-02-18 11:27:25,397] ({YARN application state monitor}[invoke]:249) - Call: getApplicationReport took 2ms
DEBUG [2019-02-18 11:27:25,878] ({FIFOScheduler-Worker-1}[processLine]:81) - Interpreter output:import org.apache.spark.sql.functions._
INFO [2019-02-18 11:27:25,931] ({pool-6-thread-2}[getStatus]:818) - job:null
DEBUG [2019-02-18 11:27:25,931] ({pool-6-thread-2}[getProperty]:204) - key: zeppelin.spark.concurrentSQL, value: false
INFO [2019-02-18 11:27:25,931] ({pool-6-thread-2}[getStatus]:818) - job:null
INFO [2019-02-18 11:27:25,931] ({pool-6-thread-2}[getStatus]:818) - job:null
INFO [2019-02-18 11:27:25,931] ({pool-6-thread-2}[getStatus]:818) - job:org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob@f7c36f41
INFO [2019-02-18 11:27:25,931] ({pool-6-thread-2} RemoteInterpreterServer.
View zeppelin-pyspark-yarn-client.txt
INFO [2019-02-06 22:23:16,364] ({main}[<init>]:148) - Starting remote interpreter server on port 0, intpEventServerAddress: IP_ADDRESS:36131
INFO [2019-02-06 22:23:16,384] ({main}[<init>]:175) - Launching ThriftServer at IP_ADDRESS:46727
INFO [2019-02-06 22:23:16,549] ({pool-6-thread-1}[createInterpreter]:333) - Instantiate interpreter org.apache.zeppelin.spark.SparkInterpreter
INFO [2019-02-06 22:23:16,553] ({pool-6-thread-1}[createInterpreter]:333) - Instantiate interpreter org.apache.zeppelin.spark.SparkSqlInterpreter
INFO [2019-02-06 22:23:16,556] ({pool-6-thread-1}[createInterpreter]:333) - Instantiate interpreter org.apache.zeppelin.spark.DepInterpreter
INFO [2019-02-06 22:23:16,560] ({pool-6-thread-1}[createInterpreter]:333) - Instantiate interpreter org.apache.zeppelin.spark.PySparkInterpreter
INFO [2019-02-06 22:23:16,563] ({pool
View yarn-cluster-error.txt
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:403)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:393)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2338)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:850)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:849)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
View gist:aee182aab3e320749fbc9a81031deab3
"error": {
"root_cause": [
"type": "mapper_parsing_exception",
"reason": "Root mapping definition has unsupported parameters: [namespace : {dynamic=false, properties={wiki={analyzer=keyword, type=text, index_options=docs}, name={analyzer=near_match_asciifolding, type=text, index_options=docs}}}] [archive : {dynamic=false, properties={wiki={analyzer=keyword, type=text, index_options=docs}, namespace={type=long}, title={search_analyzer=text_search, similarity=BM25, analyzer=text, position_increment_gap=10, type=text, fields={trigram={similarity=BM25, analyzer=trigram, type=text, index_options=docs}, prefix_asciifolding={search_analyzer=near_match_asciifolding, similarity=BM25, analyzer=prefix_asciifolding, type=text, index_options=docs}, plain={search_analyzer=plain_search, similarity=BM25, analyzer=plain, position_increment_gap=10, type=text}, prefix={search_analyzer=near_match, similarity=BM25, analyzer=prefix, type=text, index_options=docs}, keyword={s
View Spark-NLP-POS.scala
import com.johnsnowlabs.nlp.{DocumentAssembler, Finisher}
import com.johnsnowlabs.nlp.annotators.{Normalizer, Stemmer, Tokenizer}
import com.johnsnowlabs.nlp.annotator._
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.util.Benchmark
import org.apache.spark.ml.feature.{StopWordsRemover, IDF, HashingTF, CountVectorizer, Word2Vec}
maziyarpanahi / tours.json
Created Feb 4, 2018
JSON array of demo Tours for MongoDB
View tours.json
"tourBlurb" : "Big Sur is big country. The Big Sur Retreat takes you to the most majestic part of the Pacific Coast and show you the secret trails.",
"tourName" : "Big Sur Retreat",
"tourPackage" : "Backpack Cal",
"tourBullets" : "\"Accommodations at the historic Big Sur River Inn, Privately guided hikes through any of the 5 surrounding national parks, Picnic lunches prepared by the River Inn kitchen, Complimentary country breakfast, Admission to the Henry Miller Library and the Point Reyes Lighthouse \"",
"tourRegion" : "Central Coast",
"tourDifficulty" : "Medium",
"tourLength" : 3,
"tourPrice" : 750,
maziyarpanahi / top-500-enwiki.txt
Created Oct 22, 2017
Top 500 phrases in English Wikipedia
View top-500-enwiki.txt
Phrases were extracted by Stanford CoreNLP/Spark 2.2 (6 minutes) from English Wikipedia (+5 million pages)
+---------------------------+-----+
|value |count|
|square miles |59821|
|unique feature |46463|
|id form |46101|
|administrative district |45963|
|first time |41423|