Evgeniy Polyakov bioothod

## Mountain Car solution
I tested various approaches and found that a properly tuned DQN combined with a cross-entropy pool
solves this problem the fastest.

By DQN+CE I mean the standard DQN technique, except that the batches sampled for experience replay
are drawn in proportion to how good their source episode was compared to
the worst one, which has a total reward of -200.
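
A minimal sketch of this kind of sampling, not the exact code used here. It assumes the replay pool is stored as a list of `(total_reward, transitions)` pairs, one per episode. The names `episode_weight` and `sample_batch` and the small `eps` offset (which keeps -200 episodes in the pool with a tiny probability) are illustrative assumptions.

```python
import random

# Worst possible episode return, as stated above.
WORST_RETURN = -200.0


def episode_weight(total_reward, eps=1e-3):
    """Weight grows with how far the episode's return is above the worst one.

    eps is an assumed offset so that the worst episodes are still sampled
    occasionally and the weights never sum to zero.
    """
    return (total_reward - WORST_RETURN) + eps


def sample_batch(episodes, batch_size, rng=random):
    """Draw batch_size transitions from the episode pool.

    episodes: list of (total_reward, transitions) pairs.
    An episode is picked with probability proportional to its weight,
    then one of its transitions is picked uniformly.
    """
    weights = [episode_weight(ret) for ret, _ in episodes]
    chosen = rng.choices(episodes, weights=weights, k=batch_size)
    return [rng.choice(transitions) for _, transitions in chosen]
```

Picking the episode first and the transition second is one possible design. Weighting every transition individually by its episode's return would be another, and would give long episodes a larger share of each batch.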

In the standard cross-entropy method we select only the best episodes and train the network
to predict the actions taken in those steps. This discards the experience from
wrong/non-existing steps and actions, which might be worth learning from too.
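
For comparison, the plain cross-entropy selection step could look roughly like the sketch below. It assumes each episode is a `(total_reward, steps)` pair where `steps` holds `(observation, action)` tuples. The function name `select_elite` and the 70th-percentile default are assumptions made for illustration.

```python
def select_elite(episodes, percentile=70):
    """Keep only the steps of episodes whose return reaches the cutoff.

    episodes: list of (total_reward, steps) pairs, where steps is a list
    of (observation, action) tuples.
    Returns (cutoff, elite_steps). Everything below the cutoff is dropped,
    which is exactly the experience the plain method throws away.
    """
    if not episodes:
        return None, []
    returns = sorted(ret for ret, _ in episodes)
    # Index of the return that sits at the requested percentile.
    cutoff_index = min(len(returns) - 1, int(len(returns) * percentile / 100))
    cutoff = returns[cutoff_index]
    elite_steps = []
    for ret, steps in episodes:
        if ret >= cutoff:
            elite_steps.extend(steps)
    return cutoff, elite_steps
```

The returned `(observation, action)` pairs would then serve as supervised training data for the policy network, with the action as the target label.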

## vagrant_elliptics
#!/usr/bin/env bash

set -x

apt-get update
apt-get install -y git-core devscripts gcc g++ equivs gdb

BASE_DIR=$(pwd)

ulimit -c unlimited

## gist:9940510
{
    "id": "8a5f4640935...",
    "csum": "a15cf7eee7ba4fd90f...",
    "filename": "/tmp/blob3/data-0.0",
    "size": 9,
    "offset-within-data-file": 144,
    "mtime": {
        "time": "2013-12-05 MSK 19:40:35.731166",
        "time-raw": "1386258035.731166"
    },