@aslamplr
Last active July 10, 2019 12:53
The Deep Learning Project

Disclaimer: This is an unfinished document that has not been published yet, and parts of it have not been proofread. If you are reading this, you are either an author or a reviewer of this document. In that case, please review it and submit your comments, feedback, corrections, and additions in the channel where you received it. Also, please let the author know if you should be credited as an author of this document. This document might contain sensitive information; act at your own discretion.

==In Draft== ==In Progress==

Deep learning project

[TOC]

Introduction

Deep learning project, currently un-named (TODO: rename to whatever new name the team agrees on for this project). This project is a means to understand and learn deep learning in general, along with related tools, technologies, methodologies, and the interfaces between existing systems and technology stacks. The goals are outlined in the following section. The first epic feature to be delivered by this project is "Image captioning": in simple terms, a web application where a user can post/upload an image and the application provides a text caption. Since the project has multi-level goals, achieving one becomes a means to achieving the others.

About this document

This is the initial documentation before getting into the details.

Background and inception

The idea of this project came up in an informal discussion between @Rahees, @Sreenath and @Aslam on Friday, Jul 5th 2019.

@Sreenath was the one to point out that never-ending discussions on random AI and deep learning/ML topics were not helping us. Instead, we should do something solid and hands-on: some sort of real-world project. That would provide hands-on experience and allow us to understand the possibilities and limitations of deep learning. The group felt this was worth pursuing and decided to do it. @Aslam came up with the project of doing image captioning. By doing this project the group aims to understand several concepts such as feature extraction, encoder-decoder models, attention, and deploying a deep learning model for inference. The group agreed to spend 5 hours/week on this project, and @Aslam agreed to come up with an outline for the project, along with breakdowns and plans.

Scope

What are in the scope of this document?

  • This document briefs and covers mostly everything that could possibly be conceived at this point in time.
  • This document will not and should not elaborate or detail anything other than the overall scope of the project.
  • Name the project (This should be done after discussing with the team).
  • Provide a brief pointer into the background and inception of this project.
  • Identifying the goals of the project.
  • Define a baseline accuracy for the model we are going to develop/create.
  • Define the target accuracy for the model we are going to develop/create.
  • Briefing the requirements of the project.
  • Separating out the whole project into different tangible well scoped phases.
  • Technology stack.
  • Defining the minimum viable product.

What are not in the scope of this document?

  • Complete technical design documentation.
  • UI/UX design.
  • Deployment plan.
  • Elaborating the requirements of the project.
  • Project management and execution plan.

The items not in the scope of this document need to be addressed separately as deemed necessary. Once those documents are in place (if any), they can be referenced back from this document.

Goals

  • Learn the possibilities of deep learning by designing, writing, and deploying an application that would caption images. ==Must have==
  • Learn and understand different evaluation metrics such as BLEU, METEOR, ROUGE, and CIDEr. These metrics are more than just accuracy scores; a high-level understanding of them should be enough for the purpose of evaluation. ==Must have==
  • The above goal should be achieved incrementally in tangible well scoped phases. ==Must have==
  • Create and train a deep learning model, that will do image captioning. Integrate the model into a web application and deploy it for inference. ==Must have==
  • Use transfer learning to create a new model that will do image captioning for a specific domain, for example babies. ==Nice to have==
    • Prepare dataset for the purpose of transfer learning for baby image captioning.
  • Once the model is ready and the web interface is working, package the APIs and some additional UI/UX as a multi-tenant platform-as-a-service product. Platform users should be able to access the UI/UX, upload their own images and captions for training, and use the APIs in their mobile or web applications. (This again is a distant goal and may not be considered in any of the planning or as part of the minimum viable product.) ==Only in dreams for now==
  • Be able to repurpose the trained model for a specific image captioning domain, for example captioning images from a football game based on the action, just like a newspaper would. By then the model should be able to identify players, etc. (This is a dream goal and may not be considered in any of the planning or as part of the minimum viable product.) ==Only in dreams for now==
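To build intuition for metrics like BLEU, here is a toy sketch of BLEU-1 (clipped unigram precision with a brevity penalty) in pure Python. Real evaluation would use an established implementation, multiple references, and higher-order n-grams, so treat this only as an illustration:

```python
import math
from collections import Counter

def bleu1(reference, candidate):
    """Toy BLEU-1: clipped unigram precision times a brevity penalty."""
    ref_counts = Counter(reference)
    cand_counts = Counter(candidate)
    # Count candidate words that also appear in the reference, clipped so
    # a repeated word cannot be credited more often than it occurs.
    overlap = sum(min(count, ref_counts[word])
                  for word, count in cand_counts.items())
    precision = overlap / len(candidate)
    # Brevity penalty: penalise candidates shorter than the reference.
    if len(candidate) >= len(reference):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(reference) / len(candidate))
    return brevity * precision

reference = "a dog is running on the beach".split()
candidate = "a dog runs on the beach".split()
score = bleu1(reference, candidate)  # roughly 0.705
```

The clipping and brevity penalty are what make BLEU more than a plain accuracy score: a candidate cannot game the metric by repeating common words or by emitting a very short caption.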

Requirements in brief

A model can be considered a function that takes some input and returns some output. In our case, the model takes in an image file and returns a string, which is our target caption. The application is the part where the model is embedded and used. The application should have a nice UI and UX; its requirements should focus more on user interfacing and performance, while the training of the model should focus more on accuracy. Even though the application embeds the model, the two are developed separately and have different requirement focus points, not least because the technologies and the teams developing them may be different. For this reason the requirements section is split in two: deep learning model and web application. There may be some glue requirements, which can be added as part of either the model or the application requirements.

These requirements could not be described without technical details leaking in, since most of them are ultimately closely related to technology. Otherwise, we would end up without any details and with lots of room for interpretation, which could lead to deviations.
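As a minimal sketch of that "model as a function" view, the glue between application and model can be expressed as a pipeline. Every name below is a hypothetical placeholder, since none of these components exist yet:

```python
from typing import Callable, List

def caption_image(image_bytes: bytes,
                  vectorise: Callable[[bytes], list],
                  model: Callable[[list], List[int]],
                  decode: Callable[[List[int]], str]) -> str:
    """Glue contract: an image file goes in, a caption string comes out."""
    vector = vectorise(image_bytes)   # preprocessing / vectorisation glue
    token_ids = model(vector)         # the deep learning model proper
    return decode(token_ids)          # vectorised text -> string glue

# Toy stand-ins, just to show the data flow end to end:
vocab = {1: "a", 2: "dog", 3: "on", 4: "the", 5: "beach"}
caption = caption_image(
    b"fake-image-bytes",
    vectorise=lambda raw: [float(b) for b in raw],  # placeholder vectoriser
    model=lambda vec: [1, 2, 3, 4, 5],              # placeholder model
    decode=lambda ids: " ".join(vocab[i] for i in ids),
)
# caption == "a dog on the beach"
```

Splitting the pipeline into these three stages is also what makes the requirement split below workable: the application team can stub the model, and the model team can stub the application.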

Deep learning model

  • Create a deep learning model that takes in an image (a digital image file such as .jpg, .png, etc.) and outputs a meaningful caption relevant to the input image. A person (human) should be able to read the text output from the model and agree that the text could be a plausible caption for the image provided. (The short-term goal is to create a model that at least provides some caption, possibly not relevant enough, but at least then we know we are somewhere close to building our model right.) ==Phase 1==
    • The model takes an image file as input.
    • In reality the model takes in a vectorised form of the image.
    • The preprocessing and image vectorisation logic should be developed as part of the deep learning model requirement. This part of the code can be considered part of the glue between application and model. ==Phase 1==
      • In some cases, the application in which the model is going to be embedded is written in a language other than Python (in which the model is going to be developed). The image file preprocessing and vectorisation logic would then need to be re-implemented in the language of the application (this may be C# or JavaScript; in the case of Python, we may be able to re-use the logic as-is). Since the preprocessing and vectorisation logic are not heavy, this part of the code is duplicated across model development and application development.
    • Similar to the preprocessing and vectorisation logic, there needs to be logic to convert the output of the model (vectorised text) into a string. ==Phase 1==
      • This is again duplicated across model development and application development.
    • The first phase of the model will have a pre-trained image feature extraction part; for this purpose we could utilise existing image classification models trained on ImageNet, etc. The models to be considered include VGG, Inception, ResNet, DenseNet, etc. ==Phase 1==
    • Port the pre-trained image feature extraction part to JavaScript using TensorFlow.js and make it part of the browser-side workflow. The image file will be processed in the browser itself, and the server side will receive the vectorised input for further processing. This requirement is meant to further evaluate the possibilities of distributing and scaling the workflow and load. ==Phase 3==
    • Further, we could explore the possibility of using image semantic segmentation models to provide input for the caption model (this is not yet considered, and such research is not yet publicly available to us). This should be treated as something of a dream requirement! ==Phase 4==
  • While creating and training the above model, we could utilise datasets like Flickr8k and MS-COCO. ==Phase 1==
  • The model should achieve a target accuracy above 80%. This figure is not based on any accuracy metrics or scores as of now, but is a value chosen to represent a practical model that can be considered usable to some extent.
    • For ==Phase 1==, we could develop a model whose accuracy is below 80%. This first model will be considered the baseline. Even though the accuracy of this model is low, the interface should be working and final.
    • Later we need to work on BLEU, METEOR, ROUGE, and CIDEr scores to improve our model. ==Phase 2==
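The preprocessing/vectorisation glue mentioned above can be sketched without any framework. As one example, Inception-style preprocessing scales raw pixel values from [0, 255] to [-1, 1]; the exact scheme depends on which pre-trained feature extractor we choose (VGG-style models instead subtract per-channel means), so treat this as an assumption:

```python
def preprocess_pixels(pixels):
    """Scale raw 8-bit pixel values from [0, 255] to [-1, 1].

    Mirrors Inception-style preprocessing. The logic must match the
    chosen pre-trained feature extractor, and must be identical in the
    model code and in the application code that duplicates it.
    """
    return [p / 127.5 - 1.0 for p in pixels]

# A tiny fake "image": four grayscale pixel values.
vec = preprocess_pixels([0, 127.5, 255, 51])
# vec is approximately [-1.0, 0.0, 1.0, -0.6]
```

Because this logic is duplicated across the Python model code and the application code (possibly C# or JavaScript), keeping it this small and deterministic makes the duplication cheap to maintain and easy to test for parity.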

Web application

The web application is planned only in ==Phase 2== of development, so as not to lose focus on the ==Phase 1== development of the model.

  • Create a web application with a single page whose UI/UX lets anyone who can access the page upload an image file (.jpg, .png, etc. TODO: need to identify the necessary input validations) and post it to the model being served for inference; the text returned from the model (the caption) should be displayed back on the web page for the user to read. ==Phase 2==

Need to elaborate and add further points and requirements.

Technology stack

Deep learning model

  • Python
  • Tensorflow/Keras
  • Numpy/Pandas
  • OpenCV
  • Jupyter notebook
  • Google colab

Web application

  • Vue/React
  • Typescript
  • .NET(Web API/WPF)/Node(Express)/Python(Flask)
  • Tensorflow bindings (in chosen language).
  • Tensorflow.js

Further documentation

Once drafts of these documents are out, these tasks should be marked as complete, and links to the documents will be provided here.

Further documentations to be created -

  • Technical design document.
  • UI/UX design document.
  • Infrastructure design and architecture diagram.
  • Project management and execution related documents.

Action Items

Action items in the scope of this document.

  • Re-structure the document!

References

  1. Exploring Image Captioning Datasets
  2. MS-COCO Dataset
  3. How to Develop a Deep Learning Photo Caption Generator from Scratch
  4. How to Use Small Experiments to Develop a Caption Generation Model in Keras
  5. Image Captioning with Attention - Tensorflow tutorials
  6. COCO 2015 Image Captioning Task
  7. Microsoft COCO Captions: Data Collection and Evaluation Server [arxiv paper pdf]
  8. Show and Tell: A Neural Image Caption Generator [arxiv paper pdf]
  9. Show and Tell: Lessons learned from the 2015 MSCOCO Image Captioning Challenge [arxiv paper pdf]
  10. Can Active Memory Replace Attention? [arxiv paper pdf]

Leaderboard: Microsoft COCO Image Captioning Challenge



The original markdown source could be accessed from here

This markdown document was created using Typora and the GitHub theme.
