## Mongo queries for updating the new thumbs location with the old thumbs data
```
# Timeline of data migration and dual-writing thumbs data

        event 1                                event 2
__________________________________________________________________
| region 1        | region 2                  | region 3          |
|                 |                           |                   |
  [------------old data----------------------]
                  [------------new data--------------------]
```
`region 1` was a time when we were only writing to the OLD-DB. `region 2` was a
time when we were dual-writing to both the OLD-DB and the NEW-DB. `region 3` is
when we stopped writing to the OLD-DB and wrote only to the NEW-DB.

During `event 2` we migrated the old thumbs data to the NEW-DB, but not
everything was moved because the thumbs data is large. The task is therefore to
migrate the old data in its entirety to the NEW-DB. During that migration we
used the DataWarehouse (DWH), and it appears the DWH is also missing data (from
a certain point in history).

We can assume that since `event 2` all data has been correctly written to the
NEW-DB. Therefore all we are concerned with is the data in the OLD-DB that is
not in the NEW-DB, i.e. `region 1`.
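The region boundary above can be expressed as a tiny predicate: a document
belongs to `region 1` (and needs reconciling) only if it was last modified
before dual-writing began at `event 2`. This is a minimal sketch; the use of
epoch-millisecond timestamps and the parameter names are assumptions.

```scala
// Sketch, assuming last-modified timestamps in epoch millis.
// A document is in region 1 iff it was written before dual-writing
// began (event 2); only those documents need reconciling.
def isRegionOne(lastModified: Long, dualWriteStart: Long): Boolean =
  lastModified < dualWriteStart
```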
```
OLD {
  _id : sid
  uPIdH
  tuTrks
  tdTrks
}

NEW {
  sid
  cid
  type
  uPIdH
}
```
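The two document shapes above could be modeled as Scala case classes. This is
a hypothetical sketch: the field names come from the schemas, but all of the
types (strings, lists of cid strings) are assumptions.

```scala
// Hypothetical model of the OLD-DB document; types are assumptions.
case class OldThumbs(
    _id: String,          // station id (sid)
    uPIdH: String,        // user profile id hash
    tuTrks: List[String], // thumbs-up track cids
    tdTrks: List[String]  // thumbs-down track cids
)

// Hypothetical model of the NEW-DB document; one row per thumb.
case class NewThumb(
    sid: String,
    cid: String,
    `type`: String, // thumb type, e.g. "up" or "down" (assumed labels)
    uPIdH: String
)
```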
```
def main() {
  val OLD_DB = ...
  val NEW_DB = ...
  val TIME_BEFORE_DUAL_WRITE = ...
  val batchSize = 100
  var itemOffset = 0 // mutable: advanced after each page

  while (true) {
    // do a paged query so as not to pull all the data into memory
    val oldData = select data in OLD_DB
                    limit = batchSize
                    offset = (itemOffset * batchSize)

    // we are done when there is no more data to process
    if (oldData.isEmpty) {
      break
    }

    // keep only the items which have thumbs data
    val oldItemList = oldData.filter { data => data.thumbsData.exists }

    // iterate over `oldItemList` and see which thumbs don't exist in the
    // new data; those are the ones we want to insert into the new data
    for oldItem in oldItemList {
      // query the new data for items matching the stationID and uPIdH;
      // this contains all the new data we are concerned with
      val newDataBySid = select data in NEW_DB
                           where sid = oldItem._id // _id == sid in old data
                           and uPIdH = oldItem.uPIdH

      // build the set of cids that already exist in the new data
      // for this particular stationID
      val newItemCidSet = newDataBySid.map(_.cid).toSet

      // finally, keep only the old thumbs whose cid does not exist in the
      // new data; these were never migrated over, so we migrate them now
      val insertItems = (oldItem.tdTrks ++ oldItem.tuTrks)
        .filterNot { thumb => newItemCidSet.contains(thumb.cid) }
        .map { thumb => thumb.copy(lm = TIME_BEFORE_DUAL_WRITE) }

      NEW_DB.insert(insertItems)
    }
    itemOffset = itemOffset + 1
  }
}
```
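The core filtering step in `main` can be isolated as a pure function, which
makes it easy to test before running against a real database. A sketch,
assuming the track lists hold cid strings (in the pseudocode they are thumb
records; the simplification is mine):

```scala
// Given the old item's thumbs-down and thumbs-up cids and the set of
// cids already present in the NEW-DB for that (sid, uPIdH), return the
// cids that were never migrated and still need to be inserted.
def missingCids(
    tdTrks: List[String],
    tuTrks: List[String],
    newItemCidSet: Set[String]
): List[String] =
  (tdTrks ++ tuTrks).filterNot(newItemCidSet.contains)
```

For example, `missingCids(List("c"), List("a", "b"), Set("b"))` keeps `"c"`
and `"a"` but drops `"b"`, which the NEW-DB already has.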
Looks OK. We'll need to split this into thumbs up and thumbs down so we can create the proper `type`.
We have two options for proceeding: either we write this in JavaScript and test it out in non-prod against a user, or we create a Scala script and do it that way. It should be pretty straightforward either way, and is probably safer as a Scala script.
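The thumbs-up/thumbs-down split mentioned above could look roughly like this.
The row shape, field names, and the `"up"`/`"down"` labels are assumptions:

```scala
// Hypothetical NEW-DB row; `thumbType` stands in for the `type` field.
case class NewRow(sid: String, cid: String, thumbType: String, uPIdH: String)

// Expand one old document into NEW-DB rows, tagging each cid with the
// proper thumb type so the `type` field can be created on insert.
def toNewRows(
    sid: String,
    uPIdH: String,
    tuTrks: List[String],
    tdTrks: List[String]
): List[NewRow] =
  tuTrks.map(cid => NewRow(sid, cid, "up", uPIdH)) ++
    tdTrks.map(cid => NewRow(sid, cid, "down", uPIdH))
```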