rahilb/General.md

## General.md

      
Make sure related data is put into the same bucket


Assign related items to the same data partition.

For example, chunks of the same file, or audio clips spoken by the same person, should all land in one partition.


This makes sure your network has not been contaminated by seeing testing samples during the training phase.
For example, the following function from the TensorFlow examples assigns files to partitions stably, ignoring the part of the file name matched by a regex:
```python
import hashlib
import os
import re

from tensorflow.python.util import compat

validation_percentage = 10.0
testing_percentage = 10.0
MAX_NUM_WAVS_PER_CLASS = 10000000000


def which_set(filename, validation_percentage, testing_percentage):
  base_name = os.path.basename(filename)
  # We want to ignore the first run of digits in the file name when
  # deciding which set to put a wav in, so the data set creator has a way of
  # grouping wavs that are close variations of each other.
  # (The original TensorFlow example strips everything after '_nohash_'
  # instead, using the pattern r'_nohash_.*$'.)
  hash_name = re.sub(r'\d+', '', base_name, count=1)
  # This looks a bit magical, but we need to decide whether this file should
  # go into the training, testing, or validation sets, and we want to keep
  # existing files in the same set even if more files are subsequently
  # added.
  # To do that, we need a stable way of deciding based on just the file name
  # itself, so we do a hash of that and then use that to generate a
  # probability value that we use to assign it.
  hash_name_hashed = hashlib.sha1(compat.as_bytes(hash_name)).hexdigest()
  percentage_hash = ((int(hash_name_hashed, 16) %
                      (MAX_NUM_WAVS_PER_CLASS + 1)) *
                     (100.0 / MAX_NUM_WAVS_PER_CLASS))
  if percentage_hash < validation_percentage:
    result = 'validation'
  elif percentage_hash < (testing_percentage + validation_percentage):
    result = 'testing'
  else:
    result = 'training'
  return result
```
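
The same idea works when the grouping is known explicitly rather than encoded in the file name. Below is a minimal sketch, not part of the TensorFlow example, that hashes a group key (assumed here to be a speaker ID) and then checks that no group ended up in more than one partition. The names `assign_partition`, `find_leaks`, and the sample clips are hypothetical and only serve the illustration.

```python
import hashlib
from collections import defaultdict


def assign_partition(group_key, validation_percentage=10.0,
                     testing_percentage=10.0):
    """Map a group key (hypothetical: a speaker ID) to a partition name."""
    digest = hashlib.sha1(group_key.encode("utf-8")).hexdigest()
    # Scale the hash into [0, 100) so it can be compared with percentages.
    bucket = (int(digest, 16) % 10000) / 100.0
    if bucket < validation_percentage:
        return "validation"
    if bucket < validation_percentage + testing_percentage:
        return "testing"
    return "training"


def find_leaks(items):
    """Return the group keys that appear in more than one partition.

    `items` is an iterable of (group_key, partition) pairs.
    """
    seen = defaultdict(set)
    for group_key, partition in items:
        seen[group_key].add(partition)
    return {key for key, parts in seen.items() if len(parts) > 1}


# Hypothetical clips: (speaker_id, file name).
clips = [
    ("speaker_a", "yes_0.wav"),
    ("speaker_a", "no_0.wav"),
    ("speaker_b", "yes_1.wav"),
    ("speaker_b", "up_0.wav"),
]
# Partition by speaker, not by file, so a speaker never straddles two sets.
assigned = [(speaker, assign_partition(speaker)) for speaker, _ in clips]
leaks = find_leaks(assigned)
```

Because the partition depends only on the group key, `leaks` stays empty however many clips a speaker contributes, and adding new speakers later does not move existing ones. Running `find_leaks` over a split produced some other way is a cheap sanity check before training.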