@blackwatertepes
Last active August 29, 2015 14:24
Job Report Sizes

Generating a distribution of report sizes

Jobs for the last # days

First, you'll need a list of job IDs for reports generated in the last # days. Run the following in Mixpanel, adjusting the date to set the window (note that the query filters on updated_at)...

SELECT job_id
FROM public.reports AS reports
WHERE reports.updated_at > '2015-06-29 00:00:00'
  AND reports.kind = 1 /* Full Reports == 1 */
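
The cutoff date in the query above is hard-coded. As a minimal sketch, assuming you would rather derive it from a number of days, the query text could be built in Ruby. The helper name `report_jobs_query` and its `today:` parameter are illustrative and not part of the original gist.

```ruby
require 'date'

# Hypothetical helper: returns the query above with the cutoff set to
# midnight `days` days before `today`.
def report_jobs_query(days, today: Date.today)
  cutoff = (today - days).strftime('%Y-%m-%d 00:00:00')
  <<~SQL
    SELECT job_id
    FROM public.reports AS reports
    WHERE reports.updated_at > '#{cutoff}'
      AND reports.kind = 1 /* Full Reports == 1 */
  SQL
end

puts report_jobs_query(30)
```

Paste the printed query into the query tool as before.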

Size of reports

The size of reports can be obtained using Fog, in Make's production environment...

  • SSH into make web, and open a console
  • Create a new fog storage object...

storage = Fog::Storage.new(provider: 'AWS', aws_access_key_id: ENV.fetch('AWS_ACCESS_KEY_ID'), aws_secret_access_key: ENV.fetch('AWS_SECRET_ACCESS_KEY'))

  • Copy the job IDs from Mixpanel and transform them into an array assigned to jobs in the console
  • Map the job IDs to [job_id, size in bytes] pairs with the following...

jobs.map{|job_id| [job_id, storage.get_object('crowdflower_prod', "f#{job_id}.csv.zip").headers["Content-Length"].to_i] }
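
For the copy-and-transform step above, here is a minimal sketch that assumes the query results are pasted as plain text with one job ID per line, possibly under a header row. `parse_job_ids` is a hypothetical helper, not something the gist defines.

```ruby
# Hypothetical helper: keeps only lines that consist of digits and
# converts them to integers, skipping headers and blank lines.
def parse_job_ids(pasted)
  pasted.scan(/^[ \t]*(\d+)[ \t]*$/).flatten.map(&:to_i)
end

# Illustrative input only; paste the real query output here.
jobs = parse_job_ids(<<~TEXT)
  job_id
  1234
  5678
TEXT
```

The resulting `jobs` array is what the `jobs.map` call above iterates over.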

Create a distribution

  • Copy the array of [job_id, size] pairs from the console, transform it into a string resembling the following, and save it as a text file named report_sizes.txt (the file name the script expects).

[1234, 1234], [1234, 1234], [1234, 1234]
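
Rather than editing the console output by hand, the string could be produced directly. This is a sketch under the assumption that the pairs are available as a Ruby array of `[job_id, bytes]`; `format_report_sizes` is a hypothetical helper and the values are placeholders.

```ruby
# Hypothetical helper: serializes [[job_id, bytes], ...] into the
# single-line "[id, size], [id, size]" format shown above.
def format_report_sizes(pairs)
  pairs.map { |job_id, size| "[#{job_id}, #{size}]" }.join(', ')
end

sizes = [[1234, 1234], [5678, 20480]] # illustrative values only
File.write('report_sizes.txt', format_report_sizes(sizes))
```

If the console is on a remote host, print the formatted string there and save it locally instead of calling `File.write`.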

  • Run report_size_distro.rb against the text file

DISTRO_WIDTH = 50 # number of rows in the distribution (100 / 50 = 2% steps)

# Parse "[job_id, bytes], [job_id, bytes]" pairs from every line of the file.
jobs = []
File.foreach('report_sizes.txt') do |line|
  pairs = line.strip.split(/\], ?\[/).map do |job|
    job.split(/, ?/).map { |n| n.delete('[]').to_i }
  end
  jobs.concat(pairs) # append, so a multi-line file does not overwrite earlier lines
end

# Sort ascending by size in bytes.
jobs.sort_by!(&:last)

puts "\n* All values are represented in Kb\n\n"
puts "Smallest report: #{jobs.first.last / 1000}"
puts "Largest report: #{jobs.last.last / 1000}"
puts "Median report: #{jobs[jobs.length / 2].last / 1000}"
puts "Average report: #{jobs.inject(0) { |acc, job| acc + job.last } / jobs.length / 1000}"

puts "\n-- Report Distro --\n\n"
1.upto(DISTRO_WIDTH) do |n|
  fraction = n.to_f / DISTRO_WIDTH
  # Size of the report at this percentile (nearest rank), in Kb.
  puts "#{(fraction * 100).to_i}% #{jobs[(jobs.length * fraction).ceil - 1].last / 1000}"
end

* All values are represented in Kb

Smallest report: 0
Largest report: 20211
Median report: 5
Average report: 255

-- Report Distro --

2% 0
4% 0
6% 0
8% 0
10% 0
12% 1
14% 1
16% 1
18% 1
20% 1
22% 1
24% 1
26% 1
28% 1
30% 1
32% 1
34% 1
36% 2
38% 2
40% 3
42% 3
44% 3
46% 4
48% 4
50% 5
52% 6
54% 7
56% 8
57% 9
60% 10
62% 13
64% 15
66% 18
68% 21
70% 23
72% 25
74% 29
76% 33
78% 38
80% 45
82% 52
84% 62
86% 81
88% 136
90% 214
92% 414
94% 564
96% 791
98% 2286
100% 20211
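
The 57% row in the listing above sits where 58% would be expected. This appears to come from float truncation in the label calculation, not from the data: 29 / 50 is not exactly representable as a float, and `to_i` truncates the product. A minimal sketch of the effect, with a hypothetical integer-based helper `percent_label` that avoids it:

```ruby
width = 50 # same value as DISTRO_WIDTH in the script

# Float arithmetic as in the script: the product lands just below 58.
float_label = ((29.to_f / width) * 100).to_i

# Hypothetical helper: integer arithmetic keeps the label exact.
def percent_label(step, width)
  step * 100 / width
end

puts "float: #{float_label}%, integer: #{percent_label(29, width)}%"
```

Only the label is affected; the size printed on that row is still the one selected by the script's index calculation.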