Below is the input file format (*.txt):
userID | month | date | hour | totalTW | totalQs | result |
---|---|---|---|---|---|---|
21535110 | 05 | 01 | 02 | 3 | 2 | 1 |
21535110 | 05 | 01 | 03 | 3 | 2 | 1 |
21535110 | 05 | 01 | 06 | 1 | 0 | 0 |
21535110 | 05 | 02 | 02 | 1 | 0 | 0 |
21535110 | 05 | 03 | 05 | 3 | 2 | 0 |
21535112 | 05 | 01 | 05 | 1 | 1 | 1 |
Each file has about 28,000,000 lines, and I have 6 such files.
The script should process the input data as follows:
for each user, sum up the values (totalTW, totalQs, result) that share the same month, day of the week, and hour.
For example, say there are lines like this (the year is 2012):
userID | month | date | hour | totalTW | totalQs | result |
---|---|---|---|---|---|---|
21535110 | 05 | 01 | 02 | 3 | 2 | 1 |
21535110 | 05 | 08 | 02 | 2 | 1 | 0 |
Then these 2 data points should be summed, since both fall on a Tuesday in May at hour 02:
userID | month | day | hour | totalTW | totalQs | result |
---|---|---|---|---|---|---|
21535110 | 05 | Tue | 02 | 5 | 3 | 1 |
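The grouping above can be sketched as a single pass with a dictionary keyed on (userID, month, weekday, hour). This is only a minimal illustration of the logic, not the actual week.py; the function name and output structure are my own assumptions:

```python
# Sketch of the grouping described above (not the actual week.py).
# Assumes the pipe-delimited format shown in the sample; the function
# name, weekday labels, and output structure are illustrative choices.
from collections import defaultdict
from datetime import date

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def aggregate(lines, year=2012):
    """Sum (totalTW, totalQs, result) per (userID, month, weekday, hour)."""
    totals = defaultdict(lambda: [0, 0, 0])
    for line in lines:
        fields = [f.strip() for f in line.split("|") if f.strip()]
        # Skip the header and "---" separator rows.
        if len(fields) != 7 or not fields[0].isdigit():
            continue
        uid, month, day, hour, tw, qs, res = fields
        # Map the calendar date to a day of the week (year is 2012).
        weekday = WEEKDAYS[date(year, int(month), int(day)).weekday()]
        key = (uid, month, weekday, hour)
        totals[key][0] += int(tw)
        totals[key][1] += int(qs)
        totals[key][2] += int(res)
    return totals
```

For the two sample rows above, this produces the key `('21535110', '05', 'Tue', '02')` with sums `[5, 3, 1]`, since 2012-05-01 and 2012-05-08 are both Tuesdays.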
The week.py script I added in this gist works; the problem is that it is too slow.
I have been running it on a lab server for ~20 hours, and it has only processed about 2,300,000 lines (roughly 10%!).
Is there any way to optimize this script?