Darin Plutchok dplutcho

dplutcho / pyspark_nlp_error_11_22_2019
Created November 25, 2019 22:36
Pyspark character issue
ipdb> n
> <ipython-input-73-5e4223e8e1f9>(31)tokenizer_unigram()
30
---> 31 print("Groubpy aggregat on mongoid.")
32 df_tokes = df_tokes.groupBy('_id').agg(collect_list('finished_tokes').alias('finished_tokes'))
ipdb> df_tokes.count()
*** py4j.protocol.Py4JJavaError: An error occurred while calling o2186.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 10 in stage 23.0 failed 1 times, most recent failure: Lost task 10.0 in stage 23.0 (TID 130, localhost, executor driver): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$dfAnnotate$1: (array<array<struct<annotatorType:string,begin:int,end:int,result:string,metadata:map<string,string>,embeddings:array<float>>>>) => array<struct<annotatorType:string,begin:int,end:int,result:string,metadata:map<string,string>,embeddings:array<float>>>)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)