Notes for "Learning To Execute" paper

Learning To Execute

Problem Statement

  • Evaluating whether LSTMs can express and learn to evaluate short, simple programs (linear time, constant memory) in the sequence-to-sequence framework.
  • Link to paper: https://arxiv.org/abs/1410.4615

Approach

  • Formulate program evaluation task as a sequence-to-sequence learning problem using RNNs.
  • Train on short programs that can be evaluated in linear time and constant memory, since the RNN can perform only a single pass over the data and its memory is limited.
  • Two parameters control the difficulty of a program (see the sketch after this list):
    • length : Number of digits in the integers that appear in the program.
    • nesting : Number of times operations can be combined with each other.
  • The LSTM reads the input program one character at a time and produces the output one character at a time.
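To make the two difficulty parameters concrete, here is a minimal sketch of a toy program generator in the spirit of the paper. The grammar and the choice of operations are assumptions for illustration, not the authors' released code; the only constraint kept is that programs stay evaluable in a single pass.

```python
import random

def generate_program(length, nesting):
    """Toy sketch of a program generator (assumed grammar, not the authors' code).
    `length` is the number of digits in each literal; `nesting` is how many
    times operations are combined with each other."""
    def literal():
        return str(random.randint(10 ** (length - 1), 10 ** length - 1))

    def expr(depth):
        if depth == 0:
            return literal()
        op = random.choice(["+", "-", "*"])
        # Keep one operand flat so the program stays evaluable in a single pass.
        return "(" + expr(depth - 1) + op + literal() + ")"

    body = expr(nesting)
    return "print(" + body + ")", str(eval(body))

program, target = generate_program(length=4, nesting=2)
print(program, "->", target)   # both sides are consumed/emitted one character at a time
```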

Additional Learning Tasks

  • Addition Task - Given two numbers, the model learns to add them. This task serves as the baseline for comparing performance on the other tasks.
  • Memorization Task - Given a random number, the model memorizes it and outputs it. The following techniques improve the model's accuracy (sketched after this list):
    • Input reversing - Reversing the order of the input, while keeping the output fixed, introduces many short-term dependencies that help the LSTM learn.
    • Input doubling - Presenting the same input to the network twice improves performance, as the model gets to look at the input a second time before producing output.
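A minimal sketch of the two input transformations, assuming the input and output are plain digit strings; the separator character used in input doubling is a placeholder.

```python
def reverse_input(x, y):
    """Input reversing: reverse the input string, keep the target unchanged.
    The last input characters now sit close to the first output characters,
    introducing short-term dependencies."""
    return x[::-1], y

def double_input(x, y, sep=";"):
    """Input doubling: present the same input twice (the separator is an
    assumption), so the model sees the input a second time before answering."""
    return x + sep + x, y

x, y = "123456", "123456"      # memorization task: the target equals the input
print(reverse_input(x, y))     # ('654321', '123456')
print(double_input(x, y))      # ('123456;123456', '123456')
```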

Curriculum learning

Gradually increase the difficulty of the programs fed to the system; a sketch of the four sampling strategies follows the list.

  • No Curriculum (baseline) - Fixed length and fixed nesting programs are fed to the system.
  • Naive Curriculum - Start with length = 1 and nesting = 1 and keep increasing the values iteratively.
  • Mix Strategy - Randomly choose length and nesting to generate a mix of easy and difficult examples.
  • Combined Strategy - Each training example is drawn either from the naive curriculum strategy or from the mix strategy.
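The four strategies amount to different ways of sampling (length, nesting) for each training example. The sketch below is one interpretation of the descriptions above; the exact schedules and the 50/50 split in the combined strategy are assumptions.

```python
import random

def sample_difficulty(strategy, progress, target_length, target_nesting):
    """Sample (length, nesting) for one training example. `progress` stands in
    for how far training has advanced; the schedules are illustrative."""
    if strategy == "baseline":    # no curriculum: always the target difficulty
        return target_length, target_nesting
    if strategy == "naive":       # start at 1 and increase iteratively
        return (min(1 + progress, target_length),
                min(1 + progress, target_nesting))
    if strategy == "mix":         # random mix of easy and difficult examples
        return (random.randint(1, target_length),
                random.randint(1, target_nesting))
    if strategy == "combined":    # each example comes from naive or mix
        chosen = random.choice(["naive", "mix"])   # equal split is an assumption
        return sample_difficulty(chosen, progress, target_length, target_nesting)
    raise ValueError(strategy)
```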

Network Architecture

  • 2 layers, unrolled for 50 steps.
  • 400 cells per layer.
  • Parameters initialized uniformly in [-0.08, 0.08].
  • Minibatch size of 100.
  • Gradients rescaled so that their norm does not exceed 5.
  • Learning rate starts at 0.5 and is decayed by a factor of 0.8 after the target accuracy of 95% is reached (a PyTorch sketch of these settings follows).
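A sketch of these hyperparameters in PyTorch (the paper used its own LSTM implementation, so the framework, vocabulary size, loss, and exact decay schedule here are assumptions):

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 50   # placeholder: size of the character alphabet

# 2-layer LSTM with 400 cells per layer, plus a projection back to characters.
lstm = nn.LSTM(input_size=VOCAB_SIZE, hidden_size=400, num_layers=2)
proj = nn.Linear(400, VOCAB_SIZE)
params = list(lstm.parameters()) + list(proj.parameters())

# Parameters initialized uniformly in [-0.08, 0.08].
for p in params:
    nn.init.uniform_(p, -0.08, 0.08)

optimizer = torch.optim.SGD(params, lr=0.5)

def train_step(inputs, targets, reached_95_accuracy):
    """One minibatch of 100 sequences unrolled for 50 steps.
    `inputs`: (50, 100, VOCAB_SIZE) one-hot floats; `targets`: (50, 100) ints."""
    optimizer.zero_grad()
    hidden_states, _ = lstm(inputs)
    logits = proj(hidden_states)                      # (50, 100, VOCAB_SIZE)
    loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB_SIZE),
                                       targets.reshape(-1))
    loss.backward()
    nn.utils.clip_grad_norm_(params, max_norm=5.0)    # keep gradient norm <= 5
    optimizer.step()
    if reached_95_accuracy:                           # decay lr by 0.8 after 95% accuracy
        for group in optimizer.param_groups:
            group["lr"] *= 0.8
    return loss.item()
```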

Observations

The teacher forcing technique is used for computing accuracy, i.e., when predicting the i-th digit, the correct first i-1 digits of the output are provided as input to the LSTM (sketched below).
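A minimal sketch of this accuracy computation; `predict_next_char` is a hypothetical helper that returns the model's next-character prediction given the program and an output prefix.

```python
def teacher_forced_accuracy(program, target, predict_next_char):
    """Accuracy under teacher forcing: when predicting the i-th output digit,
    the correct first i-1 digits are fed back as input, regardless of what
    the model predicted earlier."""
    correct = 0
    for i in range(len(target)):
        prediction = predict_next_char(program, target[:i])  # correct prefix, not the model's own
        correct += int(prediction == target[i])
    return correct / len(target)
```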

The general trend is (combined, mix) > (naive, baseline).

In certain cases of program evaluation, the baseline performs better than the naive curriculum strategy. Intuitively, the model uses all of its memory to store patterns for inputs of a given size; when a larger input is then provided, the model has to restructure its memory patterns to learn the output for this new class of inputs, and this restructuring may be what degrades the performance of the naive strategy. The combined strategy interleaves the naive and mix strategies and hence reduces the need to restructure memory patterns.

While LSTMs can learn to map the character-level representation of simple programs to their correct output, the idea cannot extend to arbitrary programs due to the runtime limitations of conventional RNNs and LSTMs. Moreover, while curriculum learning is essential, the optimal curriculum strategy needs to be understood further.
