@rahilb
Created February 17, 2018 10:58
TensorFlow Notes

Make sure related data is put into the same bucket

  • Assign related items to the same data partition
    • e.g. chunks of the same file, or audio spoken by the same person

This ensures your network is not contaminated by seeing testing samples during the training phase.
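
A quick way to check an existing split for this kind of contamination is to look for groups that span more than one partition. The following is a minimal sketch; the helper name `find_leaking_groups` and the sample data are hypothetical and only serve as illustration:

```python
from collections import defaultdict


def find_leaking_groups(assignments):
    """Return groups whose samples landed in more than one partition.

    `assignments` is an iterable of (group_key, partition) pairs,
    one pair per sample. Hypothetical helper, not part of TensorFlow.
    """
    partitions_by_group = defaultdict(set)
    for group_key, partition in assignments:
        partitions_by_group[group_key].add(partition)
    # Keep only the groups that were split across partitions.
    return {
        group: sorted(parts)
        for group, parts in partitions_by_group.items()
        if len(parts) > 1
    }


samples = [
    ("speaker_a", "training"),
    ("speaker_a", "training"),
    ("speaker_b", "testing"),
    ("speaker_c", "training"),
    ("speaker_c", "testing"),  # same speaker in two partitions: leakage
]
print(find_leaking_groups(samples))  # {'speaker_c': ['testing', 'training']}
```

An empty result means every group sits entirely inside a single partition.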

For example, the following function from the TensorFlow examples assigns files to partitions stably, ignoring the part of the file name matched by a regex:

import hashlib
import os
import re

from tensorflow.python.util import compat

validation_percentage = 10.0
testing_percentage = 10.0
MAX_NUM_WAVS_PER_CLASS = 10000000000

def which_set(filename, validation_percentage, testing_percentage):
  base_name = os.path.basename(filename)
  # We want to ignore anything after '_nohash_' in the file name when
  # deciding which set to put a wav in, so the data set creator has a way of
  # grouping wavs that are close variations of each other.
  hash_name = re.sub(r'_nohash_.*$', '', base_name)
  # This looks a bit magical, but we need to decide whether this file should
  # go into the training, testing, or validation sets, and we want to keep
  # existing files in the same set even if more files are subsequently
  # added.
  # To do that, we need a stable way of deciding based on just the file name
  # itself, so we do a hash of that and then use that to generate a
  # probability value that we use to assign it.
  hash_name_hashed = hashlib.sha1(compat.as_bytes(hash_name)).hexdigest()
  percentage_hash = ((int(hash_name_hashed, 16) %
                      (MAX_NUM_WAVS_PER_CLASS + 1)) *
                     (100.0 / MAX_NUM_WAVS_PER_CLASS))
  if percentage_hash < validation_percentage:
    result = 'validation'
  elif percentage_hash < (testing_percentage + validation_percentage):
    result = 'testing'
  else:
    result = 'training'
  return result
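
To see the grouping behaviour without installing TensorFlow, the same idea can be sketched with the standard library only. This is an illustrative variant under stated assumptions, not the TensorFlow code itself: the name `partition_for` and the file names are hypothetical, and `str.encode` stands in for `compat.as_bytes`.

```python
import hashlib
import os
import re

MAX_NUM_WAVS_PER_CLASS = 10000000000  # same constant as in the notes above


def partition_for(filename, validation_percentage=10.0, testing_percentage=10.0):
    """Assign a file to a partition based only on its grouping key."""
    base_name = os.path.basename(filename)
    # Everything from '_nohash_' onwards is ignored, so variations of the
    # same recording share one key and therefore one partition.
    group_key = re.sub(r'_nohash_.*$', '', base_name)
    digest = hashlib.sha1(group_key.encode('utf-8')).hexdigest()
    percentage = ((int(digest, 16) % (MAX_NUM_WAVS_PER_CLASS + 1)) *
                  (100.0 / MAX_NUM_WAVS_PER_CLASS))
    if percentage < validation_percentage:
        return 'validation'
    if percentage < validation_percentage + testing_percentage:
        return 'testing'
    return 'training'


# Hypothetical file names: three clips sharing the prefix 'bobby'.
files = [
    'yes/bobby_nohash_0.wav',
    'yes/bobby_nohash_1.wav',
    'no/bobby_nohash_0.wav',
]
partitions = {partition_for(f) for f in files}
print(partitions)  # a single partition name, shared by all three files
```

Because the assignment depends only on the hashed key, adding more files later does not move existing files between partitions.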