@ebernhardson
Created January 15, 2019 19:53
MLR Pipeline Sequence Diagram
@startuml
== click log generation ==
oozie -> oozie: schedule label generation
note left
arrows signify the initiator
of communication, not
data flow
end note
activate oozie
database hdfs
oozie -> hdfs: retrieve click data
oozie -> hdfs: retrieve query data
oozie -> oozie: compute search click
oozie -> hdfs: store search click
deactivate oozie
== sampling and labeling ==
actor operator
operator -> "mjolnir (spark)": start mjolnir
activate "mjolnir (spark)"
"mjolnir (spark)" -> hdfs: retrieve search click
"mjolnir (spark)" -> "mjolnir (spark)": grouping queries (1st pass, stemming)
"mjolnir (spark)" -> "kafka": grouping queries (2nd pass, clustering)
"inactive search cluster (codfw)" -> "kafka": retrieve queries to be run
"inactive search cluster (codfw)" --> "kafka": send query results back
"mjolnir (spark)" -> "kafka": retrieve results of grouping queries
"mjolnir (spark)" -> "mjolnir (spark)": sampling
"mjolnir (spark)" -> "mjolnir (spark)": label generation\nwith DBN click model
== feature vector retrieval ==
database kafka
"mjolnir (spark)" -> kafka: send queries for feature vectors
"inactive search cluster (codfw)" -> kafka: retrieve queries to be analyzed
"inactive search cluster (codfw)" --> kafka: send feature vectors back
"mjolnir (spark)" -> kafka: retrieve feature vectors
"mjolnir (spark)" -> "mjolnir (spark)": feature selection
"mjolnir (spark)" -> hdfs: store query x feature vectors matrix\n(training data)
== machine learning ==
"mjolnir (spark)" -> hdfs: retrieve query x feature vectors matrix
"mjolnir (spark)" -> "mjolnir (spark)": create decision trees with xgboost
"mjolnir (spark)" -> operator: store decision trees
deactivate "mjolnir (spark)"
== upload to production ==
operator -> "elasticsearch\ncirrus": upload decision trees to production
note right
upload to production
isn't automated yet
end note
@enduml