Rakshith Vasudev rakshithvasudev

## new_approach.py
"""
This script demonstrates an optimized pipeline.
This is not full code, this is merely a snippet.

1. Gets the absolute list of filenames.
2. Builds a dataset from the list of filenames using from_tensor_slices()
3. Sharding is done ahead of time.
4. The dataset is shuffled during training.
5. The dataset is then interleaved in parallel: several files at a time (the number is set by cycle_length) are read and processed concurrently, and their records are combined into a single TFRecord dataset.
6. The dataset is then prefetched. buffer_size sets how many records are prefetched; it is usually set to the mini-batch size of the job.
"""
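The six steps above can be sketched roughly as follows. This is only an illustration, not the author's script: the function name `build_optimized_dataset`, its parameters, and their defaults are hypothetical, and TensorFlow 2.4 or later is assumed (for `tf.data.AUTOTUNE`). The sketch uses `Dataset.interleave` with `num_parallel_calls`, which is the current replacement for the deprecated `parallel_interleave` transformation.

```python
def build_optimized_dataset(filenames, num_shards=1, shard_index=0,
                            is_training=True, cycle_length=4, batch_size=32):
    """Hypothetical sketch of the file-level pipeline described above."""
    # Imported here so the module still loads on machines without TensorFlow.
    import tensorflow as tf

    # Steps 1-2: a dataset whose elements are file names, not records.
    dataset = tf.data.Dataset.from_tensor_slices(filenames)

    # Step 3: shard at the file level, so each worker only opens its own files.
    dataset = dataset.shard(num_shards, shard_index)

    # Step 4: shuffle the (small) list of file names during training.
    if is_training:
        dataset = dataset.shuffle(buffer_size=len(filenames))

    # Step 5: read cycle_length files concurrently and interleave their records.
    dataset = dataset.interleave(
        tf.data.TFRecordDataset,
        cycle_length=cycle_length,
        num_parallel_calls=tf.data.AUTOTUNE,
    )

    # Step 6: keep up to batch_size records ready ahead of the consumer.
    return dataset.prefetch(buffer_size=batch_size)
```

Sharding and shuffling file names rather than records keeps both operations cheap, since they touch a short list of strings instead of the decoded data.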

## old_approach.py
"""
This snippet demonstrates a non-optimized tf.data pipeline.
This is not full code, this is merely a snippet.

1. Gets the absolute list of filenames.
2. Builds a dataset from the list of filenames using TFRecordDataset()
3. Create a new dataset that loads and formats images by preprocessing them.
4. Shard the dataset.
5. Shuffle the dataset when training.
6. Repeat the dataset.
"""
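For comparison, the non-optimized steps might look something like the sketch below. The function name `build_baseline_dataset`, the `parse_fn` argument (standing in for the image loading and preprocessing, which the snippet does not show), and the default values are all assumptions made for illustration.

```python
def build_baseline_dataset(filenames, parse_fn, num_shards=1, shard_index=0,
                           is_training=True, shuffle_buffer=1000):
    """Hypothetical sketch of the record-level pipeline described above."""
    # Imported here so the module still loads on machines without TensorFlow.
    import tensorflow as tf

    # Steps 1-2: read every record from the files, one file after another.
    dataset = tf.data.TFRecordDataset(filenames)

    # Step 3: preprocess each record; without num_parallel_calls this runs sequentially.
    dataset = dataset.map(parse_fn)

    # Step 4: shard at the record level. Every worker still reads and
    # preprocesses all records, then keeps only 1/num_shards of them.
    dataset = dataset.shard(num_shards, shard_index)

    # Step 5: shuffle records during training.
    if is_training:
        dataset = dataset.shuffle(buffer_size=shuffle_buffer)

    # Step 6: repeat indefinitely.
    return dataset.repeat()
```

Because the dataset repeats forever, a caller would normally bound it with `take()` or a fixed number of training steps.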

## Map Reduce Design Patterns
Hadoop Commands
# test the mapper and reducer locally on a sample file
cat testfile | ./mapper.py | sort | ./reducer.py

# run a job (hs is assumed here to be a shell alias that submits a Hadoop Streaming job)
hs mapper.py reducer.py input_folder output_folder

# view the results
hadoop fs -cat output_folder/part-00000 | less
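The contents of `mapper.py` and `reducer.py` are not shown here, so the following is only a minimal sketch of what such a pair could look like, using word count as an assumed example. The function names are hypothetical. It mirrors the local test pipeline above: the mapper emits tab-separated key/value records, the records are sorted, and the reducer sums each run of identical keys.

```python
from itertools import groupby


def mapper(lines):
    """Emit one tab-separated (word, 1) record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_records):
    """Sum the counts for each run of identical keys (input must be sorted by key)."""
    parsed = (record.split("\t", 1) for record in sorted_records)
    for key, group in groupby(parsed, key=lambda pair: pair[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


def run_local(lines):
    """Stand-in for `cat testfile | ./mapper.py | sort | ./reducer.py`."""
    return list(reducer(sorted(mapper(lines))))


print(run_local(["a b a", "b c"]))  # ['a\t2', 'b\t2', 'c\t1']
```

In real streaming scripts, `lines` would be `sys.stdin` and each yielded record would be printed to standard output.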

## onehot-dataset
╔═════════════╦══════════════════╦════════╗
║ CompanyName ║ CategoricalValue ║ Price  ║
╠═════════════╬══════════════════╬════════╣
║ VW          ║        1         ║ 20000  ║
║ Acura       ║        2         ║ 10011  ║
║ Honda       ║        3         ║ 50000  ║
║ Honda       ║        3         ║ 10000  ║
╚═════════════╩══════════════════╩════════╝
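The table appears to show an integer (label) encoding of CompanyName. A rough sketch of how that encoding, and the one-hot encoding the title refers to, could be produced in plain Python is given below; the helper names `label_encode` and `one_hot` are hypothetical, and categories are numbered in order of first appearance to match the table.

```python
rows = [("VW", 20000), ("Acura", 10011), ("Honda", 50000), ("Honda", 10000)]
names = [name for name, _ in rows]


def label_encode(values):
    """Map each distinct value to an integer, numbered from 1 in order of first appearance."""
    codes = {}
    for value in values:
        codes.setdefault(value, len(codes) + 1)
    return codes


def one_hot(values):
    """Return one 0/1 vector per value, with one column per distinct category."""
    categories = list(label_encode(values))
    return [[1 if value == category else 0 for category in categories]
            for value in values]


print(label_encode(names))  # {'VW': 1, 'Acura': 2, 'Honda': 3}
print(one_hot(names))       # [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1]]
```

Unlike the integer codes, the one-hot vectors do not imply any ordering between companies (for example, that Honda is "greater than" VW).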

## RecursiveBinarySearch.py
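def search(numbers, target, first, last):
    """Return the index of target in sorted numbers[first..last], or -1 if absent."""
    if first > last:  # empty range: target is not present
        return -1
    mid = (first + last) // 2
    if target == numbers[mid]:
        return mid
    elif target < numbers[mid]:
        return search(numbers, target, first, mid - 1)  # continue in the left half
    else: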
        return search(numbers, target, mid + 1, last)
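As a point of comparison, the same halving logic can be written as a loop. The sketch below is an iterative variant under a hypothetical name, `search_iterative`; like the recursive version, it assumes `numbers` is sorted in ascending order and returns -1 when the target is absent.

```python
def search_iterative(numbers, target):
    """Binary search over a sorted list; returns the index of target or -1."""
    first, last = 0, len(numbers) - 1
    while first <= last:
        mid = (first + last) // 2
        if target == numbers[mid]:
            return mid
        if target < numbers[mid]:
            last = mid - 1   # continue in the left half
        else:
            first = mid + 1  # continue in the right half
    return -1


numbers = [1, 3, 5, 7, 9, 11]
print(search_iterative(numbers, 7))  # 3
print(search_iterative(numbers, 4))  # -1
```

The loop takes the whole list as its starting range, so callers do not need to pass `first` and `last` themselves.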