@fhueske
Created June 7, 2017 19:23
Simple data generator with skew
import random

numRecords = 10000000
numProds = 100
numCust = 1000

# probability that a (product, customer) combination falls into each class
freqProb = 0.1
medClassProb = 0.35
# cumulative threshold for the medium class, compared against a uniform draw
medProb = freqProb + medClassProb
rareProb = 1 - medProb

# ratio of records in the table for each class
freqRatio = 0.5
medRatio = 0.35
rareRatio = 1 - freqRatio - medRatio

# number of records to generate for each distinct combination per class
freqCnt = int((numRecords * freqRatio) / (numProds * numCust * freqProb))
# divide by the medium class probability, not the cumulative threshold medProb
medCnt = int((numRecords * medRatio) / (numProds * numCust * medClassProb))
rareCnt = int((numRecords * rareRatio) / (numProds * numCust * rareProb))

# running record id, unique across all generated rows
tid = 0
for p in range(0, numProds):
    for c in range(0, numCust):
        # draw once per (product, customer) combination to pick its class
        r = random.uniform(0, 1.0)
        if r < freqProb:
            # frequent
            for i in range(0, freqCnt):
                print("{} {} {}".format(p, c, tid))
                tid += 1
        elif r < medProb:
            # medium
            for i in range(0, medCnt):
                print("{} {} {}".format(p, c, tid))
                tid += 1
        else:
            # rare
            for i in range(0, rareCnt):
                print("{} {} {}".format(p, c, tid))
                tid += 1
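
A quick sanity check of the skew parameters, as a sketch rather than part of the gist: the helpers below (`per_combination_count` and `expected_class_total`, both hypothetical names) redo the count arithmetic with integer percentages to sidestep floating-point rounding. They assume the class probabilities of 10%, 35%, and 55% implied by the thresholds in the loop above, and the record ratios of 50%, 35%, and 15%.

```python
def per_combination_count(num_records, num_combos, ratio_pct, prob_pct):
    """Records to emit for each combination that lands in a class.

    ratio_pct: share of all records the class should hold, in percent.
    prob_pct:  chance that a combination falls into the class, in percent.
    The two percent factors cancel, so plain integer division suffices.
    """
    return (num_records * ratio_pct) // (num_combos * prob_pct)


def expected_class_total(num_combos, prob_pct, count):
    """Expected number of records the class contributes overall."""
    return num_combos * prob_pct * count // 100


NUM_RECORDS = 10000000
NUM_COMBOS = 100 * 1000  # numProds * numCust

# class name -> (record ratio in percent, class probability in percent)
CLASSES = {
    "frequent": (50, 10),
    "medium": (35, 35),
    "rare": (15, 55),
}

counts = {
    name: per_combination_count(NUM_RECORDS, NUM_COMBOS, ratio, prob)
    for name, (ratio, prob) in CLASSES.items()
}
totals = {
    name: expected_class_total(NUM_COMBOS, CLASSES[name][1], counts[name])
    for name in CLASSES
}

for name in CLASSES:
    print("{}: {} per combination, {} expected records".format(
        name, counts[name], totals[name]))
print("expected total: {}".format(sum(totals.values())))
```

Because the per-combination counts are rounded down, the expected total falls slightly short of `numRecords`, and any single run of the generator will deviate further since class assignment is random.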