psyyz10/KeystoneML

## KeystoneML
Transfer Baogang project to KeystoneML and Spark
====
Contents
--
[TOC]

Useful Links
--
[KeystoneML Source Code](https://github.com/amplab/keystone)
[An KeystoneML Example](https://github.com/amplab/keystone-example)
[Programming Guide](http://keystone-ml.org/programming_guide.html)

Some Concepts
--
### Piplelines
A Pipeline is a dataflow that takes some input data and maps it to some output data through a series of `nodes`.
```scala
package workflow

trait Pipeline[A, B] {
  // ...
  def apply(in: A): B
  def apply(in: RDD[A]): RDD[B]
  //...
  final def andThen[C](next: Pipeline[B, C]): Pipeline[A, C] = //...
}
```
### Nodes
Nodes come in two flavors: `Transformers` and `Estimators`.
#### Transformers
It takes an input, and deterministically transforms it into an output
``` scala
package workflow

abstract class Transformer[A, B : ClassTag] extends TransformerNode[B] with Pipeline[A, B] {
  def apply(in: A): B
  def apply(in: RDD[A]): RDD[B] = in.map(apply)
  //...
}
```
#### Estimators
That is `Estimator` takes in training data as an `RDD` to its `fit()` method, and outputs a Transformer
```scala
package workflow

abstract class Estimator[A, B] extends EstimatorNode {
  protected def fit(data: RDD[A]): Transformer[A, B]
  // ...
}
```
#### Chaining Nodes and Building Pipelines
```scala
val labels: RDD[Vector[Double]] = //...
val trainImages: RDD[Image] = //...

val pipe = GrayScaler andThen
  ImageVectorizer andThen
  (LinearMapEstimator(), trainImages, trainLabels) andThen
  MaxClassifier
```
Some useful packages for our project:
---
Nodes
: **images** nodes useful for image processing
: **learning** extends Estimator, as estimators for learning process
: **stats** nodes useful for statistics
: **nlp**
: **util** : provides some utility functions

Loaders ： load data (extends transformer)
:

Evaluation: some evaluation criterion
:

Pipelines (the running program)
:

Utils

:	images
	:  **ImageUtils** : some utility functions (we should write some functions in this class)

Pipelines
: **images** : provides some examples for doing image classification pipelines


Some Methods Implementation Tips
--
Fft2: breeze.signal.fourierTr
Ifft2: breeze.signal.IfourierTr
Convolution: breeze.signal.convolve, however it can only support vector convolution. It has matrix signature, but only in a to do list.
ConnectedComponentLabeler: Implement with using DFS
BinaryConverter: check if it is gray, if not, convert to gray then convert to binary.

**Pipeline problem**:
How can I the non-linear pipeline?
For  example, If I have an image call ‘im: [Image]’, and I put it to a pipeline ‘p: Pipeline[Image, Seq[Box]]’, then the output should be boxes:Seq[Box]. The Box means the coordinate group I want to crop from the Image. Then I want to use apply another image crop transformer, which need the im. How can I pass the im to the imageCroper?  In other words, how can I add an edge from the initial node to the ImageCropper node?

One way to accomplish this is to build your pipeline piecewise:

e.g.
val data: RDD[Image]
val prefix: Pipeline[Image, Image] //Call this some preprocessing steps - I'm assuming this is the logic you want to avoid duplicating.
val boxExtractor: Pipeline[Image, Seq[Box]] //This is your `p`.
val imageCropper: Transformer[Image, Image]
val pipe1 = prefix andThen boxExtractor
val pipe2 = prefix andThen imageCropper

val combinedPipeline: Pipeline[Image, Seq[Any]] = Pipeline.gather(Seq(pipe1, pipe2))

The result of running combinedPipeline on an input image will be a Seq[Any] which in this case will be a sequence of size 2 where the elements are a Seq[Box], and a (cropped) Image.


Pipeline Designe
--

### KeystoneML Baogang Pipeline Graph：

![KeystoneML Baogang Pipeline Graph](https://lh3.googleusercontent.com/-M_RzGusBS7c/V4RCiVAFtqI/AAAAAAAAAEE/jAKmKxrGxbIPba2zo2-5GrEMDyeGeNxAQCLcB/s0/Baogang.PNG "Baogang.PNG")

There are two pipelines in our project, in which, one is for training and the other one is for inference.

For the training part,  the Pseudocode is shown in  Code 1.1 below.

**Code 1.1**
```scala
object BaogangTraining extends Logging{
	        def run(sc: SparkContext, config: BaogangConfig): Pipeline[Image, Int] = {
                val numClasses = 2
                val trainData = BaogangLoader(sc, config.trainLocation).cache()

                val trainImages = ImageExtractor(trainData)
                val labelExtractor = LabelExtractor andThen
		                       ClassLabelIndicatorsFromIntLabels(numClasses) andThen
		                       new Cacher[DenseVector[Double]]
                val trainLabels = labelExtractor(trainData)

                val predictor = ImageReScaler(a,a) andThen
                                GrayScaler andThen
                                new Cacher[DenseVector[Double]] andThen
                                ImageVectorizer andThen
                                // (new StandardScaler, trainImages) andThen
                                ConvolutionalNormalization andThen
                                new Cacher[DenseVector[Double]] andThen
                                (ConvolutionalTrainer(conf), trainImages, trainLabels)


                val testData = ImageExtractor(BaogangLoader(sc, config.trainLocation)).cache()
                val processedTestImage = usefulImageExtractor.apply(testData)
                val testPredicted = predictor(testParsedImgs)
        }
}
```
It is easy to find that the wrokflow is:
ImageExtractor -> ImageScalar -> GrayScaler -> ImageVectorizer -> ConvolutioanalNormalization （-> StandardScaler) -> ConvolutionalNormalization -> ConvolutionalTrainer

The signature of the corresponding objects are shown in Code1.3.

For the inference part,  the Pseudocode is shown in  Code 1.2 below. There are two sub-pipelines in the image processing part.

**Code 1.2**
```scala
object BaogangInferrence extends Logging{
        def run(sc: SparkContext, config: BaogangConfig): Pipeline[Image, Int] = {
                val testImages = ImageExtractor(BaogangLoader(sc, config.testLocation)).cache()

                val scrapProcessor = new Container(GrayScaler andThen
                                new ImFilter("replicate", k1) andThen
                                new ImFilter("replicate", k2) andThen
                                new BinaryConvertor(0.1) andThen
                                ConnectedComponentLabeler andThen
                                new UsefulBoxExtractor) andThen
                                new ImageBoxCropper


                val acidProcessor = new Container(GrayScaler andThen
                                ImageReScaler andThen
                                ImageMatrixizer andThen
                                FFT2 andThen
                                ToSaltMaper andThen
                                MatToGrayConvertor andThen
                                new UsefulBoxExtractor) andThen
                                new ImageBoxCropper


                val processedTestImage = usefulImageExtractor.apply(testImages)
				val usefulImageExtractor = Pipeline.gather { scrapProcessor :: acidProcessor :: Nil} andThen Combiner

				val predictor = LoadBaogangPredictor()
                val predict = predictor(processedTestImage)
        }
}
```
The wrokflow is:
p1 = ImageExtractor -> ImFilter("replicate", k1) -> ImFilter("replicate", k2) -> BinaryConvertor(0.1) -> ConnectedComponentLabeler ->BoxExtractor -> ImageBoxCropper
p2 = GrayScaler ->  ImageMatrixizer -> FFT2 -> ToSaltMaper -> MatToGrayConvertor -> BoxExtractor -> ImageBoxCropper

p1 + p2 -> LoadBaogangPredictor-> predict

The signature of the corresponding objects are shown in Code1.3.

**Code 1.3**
``` scala
ImFilter(imName: String) extends Transformer[Image,Image]
BinaryConvertor(threshold: Int) extends Transformer[Image,Image]
ConnectedLabeler extends Transformer[Image, DenseMatrix[Double]]
BoxExtractor extends  Transformer[DenseMatrix[Double], Seq[BoundingBox]] {
        overide apply(labelMatrix: DenseMatrix[Double]) =
                boundingBoxGroups(labelMatrix)

        def boundingBoxGroups(labelMatrix: DenseMatrix[Double]) : Seq[BoundingBox]
}
LengthFilter(length: Int) extends  Transformer[Seq[BoundingBox], Seq[BoundingBox]]
ImageBoxCropper extends  Transformer[(Image, Seq[BoundingBox]), (Seq[Image], Seq[BoundingBox])] {
	def apply(in: (Image, Seq[BoundingBox]): Seq[Image] = {
		in => in._1.map(box => cropImage(box, in._2))
	}

	def cropImage(box: BoundingBox, image: Image) : Image
}

ImageReScaler extends  Transformer[Image,Image]
ImageMatrixizer extends  Transformer[Image,DenseMatrix[Double]]
case class FFT2 extends  Transformer[DenseMatrix[Double], DenseMatrix[Double]]
SaltMaper extends  Transformer[DenseMatrix[Double], DenseMatrix[Double]]
MatToGrayConvertor extends  Transformer[DenseMatrix[Double], Image]
Combiner extends  Transformer[Seq[Seq[Image]], Image]
class ContainerA(p1: Pipeline[A,B]) extends Pipeline{
	override def apply(in: RDD[A]) : RDD[(A,B)]={
		return (in.zip(p1.apply(in)))
	}
}

class ConvolutionalPredictor(Cofigurations ...) extends LabelEstimator[DenseVector[Double], DenseVector[Double], DenseVector[Double]]

class Displayer extends Transformer[(BoundingBox, Image, Int), Unit]


```


## keystoneml
Transfer Baogang project to KeystoneML and Spark
====
Contents
--
[TOC]

Useful Links
--
[KeystoneML Source Code](https://github.com/amplab/keystone)
[An KeystoneML Example](https://github.com/amplab/keystone-example)
[Programming Guide](http://keystone-ml.org/programming_guide.html)

Some Concepts
--
### Piplelines
A Pipeline is a dataflow that takes some input data and maps it to some output data through a series of `nodes`.
```scala
package workflow

trait Pipeline[A, B] {
  // ...
  def apply(in: A): B
  def apply(in: RDD[A]): RDD[B]
  //...
  final def andThen[C](next: Pipeline[B, C]): Pipeline[A, C] = //...
}
```
### Nodes
Nodes come in two flavors: `Transformers` and `Estimators`.
#### Transformers
It takes an input, and deterministically transforms it into an output
``` scala
package workflow

abstract class Transformer[A, B : ClassTag] extends TransformerNode[B] with Pipeline[A, B] {
  def apply(in: A): B
  def apply(in: RDD[A]): RDD[B] = in.map(apply)
  //...
}
```
#### Estimators
That is `Estimator` takes in training data as an `RDD` to its `fit()` method, and outputs a Transformer
```scala
package workflow

abstract class Estimator[A, B] extends EstimatorNode {
  protected def fit(data: RDD[A]): Transformer[A, B]
  // ...
}
```
#### Chaining Nodes and Building Pipelines
```scala
val labels: RDD[Vector[Double]] = //...
val trainImages: RDD[Image] = //...

val pipe = GrayScaler andThen
  ImageVectorizer andThen
  (LinearMapEstimator(), trainImages, trainLabels) andThen
  MaxClassifier
```
Some useful packages for our project:
---
Nodes
: **images** nodes useful for image processing
: **learning** extends Estimator, as estimators for learning process
: **stats** nodes useful for statistics
: **nlp**
: **util** : provides some utility functions

Loaders ： load data (extends transformer)
:

Evaluation: some evaluation criterion
:

Pipelines (the running program)
:

Utils

:	images
	:  **ImageUtils** : some utility functions (we should write some functions in this class)

Pipelines
: **images** : provides some examples for doing image classification pipelines


Some Methods Implementation Tips
--
Fft2: breeze.signal.fourierTr
Ifft2: breeze.signal.IfourierTr
Convolution: breeze.signal.convolve, however it can only support vector convolution. It has matrix signature, but only in a to do list.
ConnectedComponentLabeler: Implement with using DFS
BinaryConverter: check if it is gray, if not, convert to gray then convert to binary.

**Pipeline problem**:
How can I the non-linear pipeline?
For  example, If I have an image call ‘im: [Image]’, and I put it to a pipeline ‘p: Pipeline[Image, Seq[Box]]’, then the output should be boxes:Seq[Box]. The Box means the coordinate group I want to crop from the Image. Then I want to use apply another image crop transformer, which need the im. How can I pass the im to the imageCroper?  In other words, how can I add an edge from the initial node to the ImageCropper node?

One way to accomplish this is to build your pipeline piecewise:

e.g.
val data: RDD[Image]
val prefix: Pipeline[Image, Image] //Call this some preprocessing steps - I'm assuming this is the logic you want to avoid duplicating.
val boxExtractor: Pipeline[Image, Seq[Box]] //This is your `p`.
val imageCropper: Transformer[Image, Image]
val pipe1 = prefix andThen boxExtractor
val pipe2 = prefix andThen imageCropper

val combinedPipeline: Pipeline[Image, Seq[Any]] = Pipeline.gather(Seq(pipe1, pipe2))

The result of running combinedPipeline on an input image will be a Seq[Any] which in this case will be a sequence of size 2 where the elements are a Seq[Box], and a (cropped) Image.


Pipeline Designe
--

### KeystoneML Baogang Pipeline Graph：

![KeystoneML Baogang Pipeline Graph](https://lh3.googleusercontent.com/-M_RzGusBS7c/V4RCiVAFtqI/AAAAAAAAAEE/jAKmKxrGxbIPba2zo2-5GrEMDyeGeNxAQCLcB/s0/Baogang.PNG "Baogang.PNG")

There are two pipelines in our project, in which, one is for training and the other one is for inference.

For the training part,  the Pseudocode is shown in  Code 1.1 below.

**Code 1.1**
```scala
object BaogangTraining extends Logging{
	        def run(sc: SparkContext, config: BaogangConfig): Pipeline[Image, Int] = {
                val numClasses = 2
                val trainData = BaogangLoader(sc, config.trainLocation).cache()

                val trainImages = ImageExtractor(trainData)
                val labelExtractor = LabelExtractor andThen
		                       ClassLabelIndicatorsFromIntLabels(numClasses) andThen
		                       new Cacher[DenseVector[Double]]
                val trainLabels = labelExtractor(trainData)

                val predictor = ImageReScaler(a,a) andThen
                                GrayScaler andThen
                                new Cacher[DenseVector[Double]] andThen
                                ImageVectorizer andThen
                                // (new StandardScaler, trainImages) andThen
                                ConvolutionalNormalization andThen
                                new Cacher[DenseVector[Double]] andThen
                                (ConvolutionalTrainer(conf), trainImages, trainLabels)


                val testData = ImageExtractor(BaogangLoader(sc, config.trainLocation)).cache()
                val processedTestImage = usefulImageExtractor.apply(testData)
                val testPredicted = predictor(testParsedImgs)
        }
}
```
It is easy to find that the wrokflow is:
ImageExtractor -> ImageScalar -> GrayScaler -> ImageVectorizer -> ConvolutioanalNormalization （-> StandardScaler) -> ConvolutionalNormalization -> ConvolutionalTrainer

The signature of the corresponding objects are shown in Code1.3.

For the inference part,  the Pseudocode is shown in  Code 1.2 below. There are two sub-pipelines in the image processing part.

**Code 1.2**
```scala
object BaogangInferrence extends Logging{
        def run(sc: SparkContext, config: BaogangConfig): Pipeline[Image, Int] = {
                val testImages = ImageExtractor(BaogangLoader(sc, config.testLocation)).cache()

                val scrapProcessor = new Container(GrayScaler andThen
                                new ImFilter("replicate", k1) andThen
                                new ImFilter("replicate", k2) andThen
                                new BinaryConvertor(0.1) andThen
                                ConnectedComponentLabeler andThen
                                new UsefulBoxExtractor) andThen
                                new ImageBoxCropper


                val acidProcessor = new Container(GrayScaler andThen
                                ImageReScaler andThen
                                ImageMatrixizer andThen
                                FFT2 andThen
                                ToSaltMaper andThen
                                MatToGrayConvertor andThen
                                new UsefulBoxExtractor) andThen
                                new ImageBoxCropper


                val processedTestImage = usefulImageExtractor.apply(testImages)
				val usefulImageExtractor = Pipeline.gather { scrapProcessor :: acidProcessor :: Nil} andThen Combiner

				val predictor = LoadBaogangPredictor()
                val predict = predictor(processedTestImage)
        }
}
```
The wrokflow is:
p1 = ImageExtractor -> ImFilter("replicate", k1) -> ImFilter("replicate", k2) -> BinaryConvertor(0.1) -> ConnectedComponentLabeler ->BoxExtractor -> ImageBoxCropper
p2 = GrayScaler ->  ImageMatrixizer -> FFT2 -> ToSaltMaper -> MatToGrayConvertor -> BoxExtractor -> ImageBoxCropper

p1 + p2 -> LoadBaogangPredictor-> predict

The signature of the corresponding objects are shown in Code1.3.

**Code 1.3**
``` scala
ImFilter(imName: String) extends Transformer[Image,Image]
BinaryConvertor(threshold: Int) extends Transformer[Image,Image]
ConnectedLabeler extends Transformer[Image, DenseMatrix[Double]]
BoxExtractor extends  Transformer[DenseMatrix[Double], Seq[BoundingBox]] {
        overide apply(labelMatrix: DenseMatrix[Double]) =
                boundingBoxGroups(labelMatrix)

        def boundingBoxGroups(labelMatrix: DenseMatrix[Double]) : Seq[BoundingBox]
}
LengthFilter(length: Int) extends  Transformer[Seq[BoundingBox], Seq[BoundingBox]]
ImageBoxCropper extends  Transformer[(Image, Seq[BoundingBox]), (Seq[Image], Seq[BoundingBox])] {
	def apply(in: (Image, Seq[BoundingBox]): Seq[Image] = {
		in => in._1.map(box => cropImage(box, in._2))
	}

	def cropImage(box: BoundingBox, image: Image) : Image
}

ImageReScaler extends  Transformer[Image,Image]
ImageMatrixizer extends  Transformer[Image,DenseMatrix[Double]]
case class FFT2 extends  Transformer[DenseMatrix[Double], DenseMatrix[Double]]
SaltMaper extends  Transformer[DenseMatrix[Double], DenseMatrix[Double]]
MatToGrayConvertor extends  Transformer[DenseMatrix[Double], Image]
Combiner extends  Transformer[Seq[Seq[Image]], Image]
class ContainerA(p1: Pipeline[A,B]) extends Pipeline{
	override def apply(in: RDD[A]) : RDD[(A,B)]={
		return (in.zip(p1.apply(in)))
	}
}

class ConvolutionalPredictor(Cofigurations ...) extends LabelEstimator[DenseVector[Double], DenseVector[Double], DenseVector[Double]]

class Displayer extends Transformer[(BoundingBox, Image, Int), Unit]


```


## keystoneml.md

      
    Raw
  

              keystoneml.md
            
          
    Transfer Baogang project to KeystoneML and Spark

Contents

[TOC]
Useful Links

KeystoneML Source Code
An KeystoneML Example
Programming Guide
Some Concepts

Piplelines

A Pipeline is a dataflow that takes some input data and maps it to some output data through a series of nodes.
package workflow

trait Pipeline[A, B] {
  // ...
  def apply(in: A): B
  def apply(in: RDD[A]): RDD[B]
  //...
  final def andThen[C](next: Pipeline[B, C]): Pipeline[A, C] = //...
}
Nodes

Nodes come in two flavors: Transformers and Estimators.
Transformers

It takes an input, and deterministically transforms it into an output
package workflow

abstract class Transformer[A, B : ClassTag] extends TransformerNode[B] with Pipeline[A, B] {
  def apply(in: A): B
  def apply(in: RDD[A]): RDD[B] = in.map(apply)
  //...
}
Estimators

That is Estimator takes in training data as an RDD to its fit() method, and outputs a Transformer
package workflow

abstract class Estimator[A, B] extends EstimatorNode {
  protected def fit(data: RDD[A]): Transformer[A, B]
  // ...
}
Chaining Nodes and Building Pipelines

val labels: RDD[Vector[Double]] = //...
val trainImages: RDD[Image] = //...

val pipe = GrayScaler andThen 
  ImageVectorizer andThen 
  (LinearMapEstimator(), trainImages, trainLabels) andThen 
  MaxClassifier
Some useful packages for our project:

Nodes
: images nodes useful for image processing
: learning extends Estimator, as estimators for learning process
: stats nodes useful for statistics
: nlp
: util : provides some utility functions
Loaders ： load data (extends transformer)
:
Evaluation: some evaluation criterion
:
Pipelines (the running program)
:
Utils
:	images
:  ImageUtils : some utility functions (we should write some functions in this class)
Pipelines
: images : provides some examples for doing image classification pipelines
Some Methods we can use

Fft2: breeze.signal.fourierTr
Ifft2: breeze.signal.IfourierTr
...
What we need to write:

Image Processing:

A DataLoarder transformer using ImageLoaderUtils class which includes an ImageExtractor  and an LabelExtractor
angle： Angle phase of a complex matrix
im2bw：Convert image to binary image, based on threshold  (should be easy to implement)
bwlabel：Label connected components in 2-D binary image
RegiponGroups(imageLabel, ‘'BoundingBox'’) ：Returns the smallest rectangle containing the region
Ismember(A,B):  Array elements that are members of set array
Mat2gray : Convert matrix to grayscale image
strel
imerode
repmat :
Create Learning Estimator

In utils, create a WebScaleMLUtl, which shuld include breezeVectorToTensor, tensorToDenseBreeze
/* To Do */
Pipeline Designe

There are two pipelines in our project, in which, one is for training and the other one is for inference.
For the training part,  the Pseudocode is shown in  Code 1.1 below.
Code 1.1
object BaogangTraining extends Logging{ 
        def run(sc: SparkContext, config: BaogangConfig): Pipeline[Image, Int] = { 
                val numClasses = 2 
                val trainData = BaogangLoader(sc, config.trainLocation).cache() 
 
                val trainImages = ImageExtractor(trainData) 
                val labelExtractor = LabelExtractor andThen 
		                       ClassLabelIndicatorsFromIntLabels(numClasses) andThen 
		                       new Cacher[DenseVector[Double]] 
                val trainLabels = labelExtractor(trainData) 
 
                val predictor = ImageReScaler(a,a) andThen
                                GrayScaler andThen 
                                new Cacher[DenseVector[Double]] andThen 
                                ImageVectorizer andThen
                                // (new StandardScaler, trainImages) andThen 
                                ConvolutionalNormalization andThen
                                new Cacher[DenseVector[Double]] andThen 
                                (ConvolutionalTrainer(conf), trainImages, trainLabels) 
                 
               
                val testData = ImageExtractor(BaogangLoader(sc, config.trainLocation)).cache()           
                val processedTestImage = usefulImageExtractor.apply(testData)    
                val testPredicted = predictor(testParsedImgs)                                   
        } 
} 
It is easy to find that the wrokflow is:
ImageExtractor -> ImageScalar -> GrayScaler -> ImageVectorizer -> ConvolutioanalNormalization （-> StandardScaler) -> ConvolutionalNormalization -> ConvolutionalTrainer
The signature of the corresponding objects are shown in Code1.3.
For the inference part,  the Pseudocode is shown in  Code 1.2 below. There are two sub-pipelines in the image processing part.
Code 1.2
object BaogangInferrence extends Logging{ 
        def run(sc: SparkContext, config: BaogangConfig): Pipeline[Image, Int] = { 
                val usefulImageExtractor = Pipeline.gather {Seq(scrapProcessor, acidProcessor)} andThen Combiner 
                val testImages = ImageExtractor(BaogangLoader(sc, config.trainLocation)).cache()         
 
                val scrapProcessor = new Container(GrayScaler andThen 
                                new ImFilter("replicate", k1) andThen 
                                new ImFilter("replicate", k2) andThen 
                                new BinaryConvertor(0.1) andThen 
                                BwLabel andThen
                                new UsefulBoxExtractor) andThen  
                                new ImageBoxCropper
				
 
                val acidProcessor = new Container(GrayScaler andThen
                                ImageReScaler andThen
                                ImageMatrixizer andThen
                                FFT2 andThen
                                ToSaltMaper andThen
                                MatToGrayConvertor andThen
                                new UsefulBoxExtractor) andThen
                                new ImageBoxCropper


                val processedTestImage = usefulImageExtractor.apply(testImages)
				val usefulImageExtractor = Pipeline.gather { scrapProcessor :: acidProcessor :: Nil} andThen Combiner 

				val predictor = LoadBaogangPredictor()
                val predict = predictor(processedTestImage)
        }
}
The wrokflow is:
p1 = ImageExtractor -> ImFilter("replicate", k1) -> ImFilter("replicate", k2) -> BinaryConvertor(0.1) -> BwLabel ->UsefulBoxExtractor -> ImageBoxCropper
p2 = GrayScaler ->  ImageMatrixizer -> FFT2 -> ToSaltMaper -> MatToGrayConvertor -> UsefulBoxExtractor -> ImageBoxCropper
p1 + p2 -> LoadBaogangPredictor-> predict
The signature of the corresponding objects are shown in Code1.3.
Code 1.3
ImFilter(imName: String) extends Transformer[Image,Image]
BinaryConvertor(threshold: Int) extends Transformer[Image,Image]
BwLabel extends Transformer[Image, DenseMatrix[Double]]
UsefulBoxExtractor extends  Transformer[DenseMatrix[Double], Seq[BoundingBox]] {
        overide apply(labelMatrix: DenseMatrix[Double]) =
                boundingBoxGroups(labelMatrix)

        def boundingBoxGroups(labelMatrix: DenseMatrix[Double]) : Seq[BoundingBox]
}
FilterSize(length: Int) extends  Transformer[Seq[BoundingBox], Seq[BoundingBox]]
ImageBoxCropper extends  Transformer[(Image, Seq[BoundingBox]), (Seq[Image], Seq[BoundingBox])] {
	override apply(in: RDD[(Image, Seq[BoundingBox])]: Seq[Image] = {
		in.map(pair => pair._1.map(box => cropImage(box, pair._2)) 
	}

	def cropImage(box: BoundingBox, image: Image) 
}

ImageReScaler extends  Transformer[Image,Image]
ImageMatrixizer extends  Transformer[Image,DenseMatrix[Double]]
case class FFT2 extends  Transformer[DenseMatrix[Double], DenseMatrix[Double]]
SaltMaper extends  Transformer[DenseMatrix[Double], DenseMatrix[Double]]
MatToGrayConvertor extends  Transformer[DenseMatrix[Double], Image]
Combiner extends  Transformer[Seq[Seq[Image]], Image]
class ContainerA(p1: Pipeline[A,B]) extends Pipeline{	
	override def apply(in: RDD[A]) : B={
		return (in.zip(p1.apply(in)))
	}
}

class ContainerB(p1: Pipeline[A,B]) extends Pipeline{	
	override def apply(in: RDD[A]) : B={
		return (p1.apply(in), attachment)
	}
}

class ConvolutionalTrainer(Cofigurations ...) extends LabelEstimator[DenseVector[Double], DenseVector[Double], DenseVector[Double]]
	Transfer Baogang project to KeystoneML and Spark
	====
	Contents
	--
	[TOC]

	Useful Links
	--
	[KeystoneML Source Code](https://github.com/amplab/keystone)
	[An KeystoneML Example](https://github.com/amplab/keystone-example)
	[Programming Guide](http://keystone-ml.org/programming_guide.html)

	Some Concepts
	--
	### Piplelines
	A Pipeline is a dataflow that takes some input data and maps it to some output data through a series of `nodes`.
	```scala
	package workflow

	trait Pipeline[A, B] {
	// ...
	def apply(in: A): B
	def apply(in: RDD[A]): RDD[B]
	//...
	final def andThen[C](next: Pipeline[B, C]): Pipeline[A, C] = //...
	}
	```
	### Nodes
	Nodes come in two flavors: `Transformers` and `Estimators`.
	#### Transformers
	It takes an input, and deterministically transforms it into an output
	``` scala
	package workflow

	abstract class Transformer[A, B : ClassTag] extends TransformerNode[B] with Pipeline[A, B] {
	def apply(in: A): B
	def apply(in: RDD[A]): RDD[B] = in.map(apply)
	//...
	}
	```
	#### Estimators
	That is `Estimator` takes in training data as an `RDD` to its `fit()` method, and outputs a Transformer
	```scala
	package workflow

	abstract class Estimator[A, B] extends EstimatorNode {
	protected def fit(data: RDD[A]): Transformer[A, B]
	// ...
	}
	```
	#### Chaining Nodes and Building Pipelines
	```scala
	val labels: RDD[Vector[Double]] = //...
	val trainImages: RDD[Image] = //...

	val pipe = GrayScaler andThen
	ImageVectorizer andThen
	(LinearMapEstimator(), trainImages, trainLabels) andThen
	MaxClassifier
	```
	Some useful packages for our project:
	---
	Nodes
	: images nodes useful for image processing
	: learning extends Estimator, as estimators for learning process
	: stats nodes useful for statistics
	: nlp
	: util : provides some utility functions

	Loaders ： load data (extends transformer)
	:

	Evaluation: some evaluation criterion
	:

	Pipelines (the running program)
	:

	Utils

	: images
	: ImageUtils : some utility functions (we should write some functions in this class)

	Pipelines
	: images : provides some examples for doing image classification pipelines


	Some Methods Implementation Tips
	--
	Fft2: breeze.signal.fourierTr
	Ifft2: breeze.signal.IfourierTr
	Convolution: breeze.signal.convolve, however it can only support vector convolution. It has matrix signature, but only in a to do list.
	ConnectedComponentLabeler: Implement with using DFS
	BinaryConverter: check if it is gray, if not, convert to gray then convert to binary.

	Pipeline problem:
	How can I the non-linear pipeline?
	For example, If I have an image call ‘im: [Image]’, and I put it to a pipeline ‘p: Pipeline[Image, Seq[Box]]’, then the output should be boxes:Seq[Box]. The Box means the coordinate group I want to crop from the Image. Then I want to use apply another image crop transformer, which need the im. How can I pass the im to the imageCroper? In other words, how can I add an edge from the initial node to the ImageCropper node?

	One way to accomplish this is to build your pipeline piecewise:

	e.g.
	val data: RDD[Image]
	val prefix: Pipeline[Image, Image] //Call this some preprocessing steps - I'm assuming this is the logic you want to avoid duplicating.
	val boxExtractor: Pipeline[Image, Seq[Box]] //This is your `p`.
	val imageCropper: Transformer[Image, Image]
	val pipe1 = prefix andThen boxExtractor
	val pipe2 = prefix andThen imageCropper

	val combinedPipeline: Pipeline[Image, Seq[Any]] = Pipeline.gather(Seq(pipe1, pipe2))

	The result of running combinedPipeline on an input image will be a Seq[Any] which in this case will be a sequence of size 2 where the elements are a Seq[Box], and a (cropped) Image.


	Pipeline Designe
	--

	### KeystoneML Baogang Pipeline Graph：

	![KeystoneML Baogang Pipeline Graph](https://lh3.googleusercontent.com/-M_RzGusBS7c/V4RCiVAFtqI/AAAAAAAAAEE/jAKmKxrGxbIPba2zo2-5GrEMDyeGeNxAQCLcB/s0/Baogang.PNG "Baogang.PNG")

	There are two pipelines in our project, in which, one is for training and the other one is for inference.

	For the training part, the Pseudocode is shown in Code 1.1 below.

	Code 1.1
	```scala
	object BaogangTraining extends Logging{
	def run(sc: SparkContext, config: BaogangConfig): Pipeline[Image, Int] = {
	val numClasses = 2
	val trainData = BaogangLoader(sc, config.trainLocation).cache()

	val trainImages = ImageExtractor(trainData)
	val labelExtractor = LabelExtractor andThen
	ClassLabelIndicatorsFromIntLabels(numClasses) andThen
	new Cacher[DenseVector[Double]]
	val trainLabels = labelExtractor(trainData)

	val predictor = ImageReScaler(a,a) andThen
	GrayScaler andThen
	new Cacher[DenseVector[Double]] andThen
	ImageVectorizer andThen
	// (new StandardScaler, trainImages) andThen
	ConvolutionalNormalization andThen
	new Cacher[DenseVector[Double]] andThen
	(ConvolutionalTrainer(conf), trainImages, trainLabels)


	val testData = ImageExtractor(BaogangLoader(sc, config.trainLocation)).cache()
	val processedTestImage = usefulImageExtractor.apply(testData)
	val testPredicted = predictor(testParsedImgs)
	}
	}
	```
	It is easy to find that the wrokflow is:
	ImageExtractor -> ImageScalar -> GrayScaler -> ImageVectorizer -> ConvolutioanalNormalization （-> StandardScaler) -> ConvolutionalNormalization -> ConvolutionalTrainer

	The signature of the corresponding objects are shown in Code1.3.

	For the inference part, the Pseudocode is shown in Code 1.2 below. There are two sub-pipelines in the image processing part.

	Code 1.2
	```scala
	object BaogangInferrence extends Logging{
	def run(sc: SparkContext, config: BaogangConfig): Pipeline[Image, Int] = {
	val testImages = ImageExtractor(BaogangLoader(sc, config.testLocation)).cache()

	val scrapProcessor = new Container(GrayScaler andThen
	new ImFilter("replicate", k1) andThen
	new ImFilter("replicate", k2) andThen
	new BinaryConvertor(0.1) andThen
	ConnectedComponentLabeler andThen
	new UsefulBoxExtractor) andThen
	new ImageBoxCropper


	val acidProcessor = new Container(GrayScaler andThen
	ImageReScaler andThen
	ImageMatrixizer andThen
	FFT2 andThen
	ToSaltMaper andThen
	MatToGrayConvertor andThen
	new UsefulBoxExtractor) andThen
	new ImageBoxCropper


	val processedTestImage = usefulImageExtractor.apply(testImages)
	val usefulImageExtractor = Pipeline.gather { scrapProcessor :: acidProcessor :: Nil} andThen Combiner

	val predictor = LoadBaogangPredictor()
	val predict = predictor(processedTestImage)
	}
	}
	```
	The wrokflow is:
	p1 = ImageExtractor -> ImFilter("replicate", k1) -> ImFilter("replicate", k2) -> BinaryConvertor(0.1) -> ConnectedComponentLabeler ->BoxExtractor -> ImageBoxCropper
	p2 = GrayScaler -> ImageMatrixizer -> FFT2 -> ToSaltMaper -> MatToGrayConvertor -> BoxExtractor -> ImageBoxCropper

	p1 + p2 -> LoadBaogangPredictor-> predict

	The signature of the corresponding objects are shown in Code1.3.

	Code 1.3
	``` scala
	ImFilter(imName: String) extends Transformer[Image,Image]
	BinaryConvertor(threshold: Int) extends Transformer[Image,Image]
	ConnectedLabeler extends Transformer[Image, DenseMatrix[Double]]
	BoxExtractor extends Transformer[DenseMatrix[Double], Seq[BoundingBox]] {
	overide apply(labelMatrix: DenseMatrix[Double]) =
	boundingBoxGroups(labelMatrix)

	def boundingBoxGroups(labelMatrix: DenseMatrix[Double]) : Seq[BoundingBox]
	}
	LengthFilter(length: Int) extends Transformer[Seq[BoundingBox], Seq[BoundingBox]]
	ImageBoxCropper extends Transformer[(Image, Seq[BoundingBox]), (Seq[Image], Seq[BoundingBox])] {
	def apply(in: (Image, Seq[BoundingBox]): Seq[Image] = {
	in => in._1.map(box => cropImage(box, in._2))
	}

	def cropImage(box: BoundingBox, image: Image) : Image
	}

	ImageReScaler extends Transformer[Image,Image]
	ImageMatrixizer extends Transformer[Image,DenseMatrix[Double]]
	case class FFT2 extends Transformer[DenseMatrix[Double], DenseMatrix[Double]]
	SaltMaper extends Transformer[DenseMatrix[Double], DenseMatrix[Double]]
	MatToGrayConvertor extends Transformer[DenseMatrix[Double], Image]
	Combiner extends Transformer[Seq[Seq[Image]], Image]
	class ContainerA(p1: Pipeline[A,B]) extends Pipeline{
	override def apply(in: RDD[A]) : RDD[(A,B)]={
	return (in.zip(p1.apply(in)))
	}
	}

	class ConvolutionalPredictor(Cofigurations ...) extends LabelEstimator[DenseVector[Double], DenseVector[Double], DenseVector[Double]]

	class Displayer extends Transformer[(BoundingBox, Image, Int), Unit]



	```