

@jorgesancha
Last active March 21, 2024 00:06
Python code test - CARTO

What follows is a technical test for this job offer at CARTO: https://boards.greenhouse.io/cartodb/jobs/705852#.WSvORxOGPUI

Build the following and make it run as fast as you possibly can using Python 3 (vanilla). The faster it runs, the more you will impress us!

Your code should:

  • Count the number of lines in the file yellow_tripdata_2016-01.csv (available at https://s3.amazonaws.com/carto-1000x/data/yellow_tripdata_2016-01.csv)
  • Calculate the average value of the tip_amount column

All of that in the most efficient way you can come up with.

That's it. Make it fly!
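
For reference, here is a minimal baseline sketch using only the standard library, assuming a local copy of yellow_tripdata_2016-01.csv with a tip_amount header column; the comments below show faster and more creative approaches:

import csv
import time

# Assumed local copy of the dataset; adjust the path as needed.
FILENAME = 'yellow_tripdata_2016-01.csv'

t0 = time.time()
row_count = 0
tip_total = 0.0

with open(FILENAME, newline='') as f:
    for row in csv.DictReader(f):
        row_count += 1                          # count data rows (header excluded)
        tip_total += float(row['tip_amount'])   # accumulate tip_amount

print('Number of lines (excluding header):', row_count)
print('Average tip_amount:', tip_total / row_count)
print('Elapsed time:', time.time() - t0)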


ghost commented May 25, 2017

Hi @jorgesancha, I'm downloading it with the aws cli and it says it's 1.6 GB.
I'm downloading it with the command:
aws s3 cp s3://carto-1000x/data/yellow_tripdata_2016-01.csv .
Is there an error somewhere?

@guillemborrell

1.6 GB here as well

@midiadiaIT

import time
import pandas as pd

filename = 'data.csv'

# Count lines with plain Python (the count includes the header row)
t0 = time.time()
with open(filename) as f:
    n = sum(1 for line in f)
print('Number of lines: ', n)
print('Elapsed time : ', time.time() - t0)

# Average the tip_amount column with pandas
df = pd.read_csv(filename)
t0 = time.time()
print('Average of tip_amount column: ', df['tip_amount'].mean())
print('Elapsed time : ', time.time() - t0)

@ddelizia

ddelizia commented Jun 9, 2018

Can Pandas be used for this test? You said Python 3 (vanilla); does this mean we can use external libraries?

@thinkdifferent

What (if any) external libraries are allowed? Pandas?

@MiguelBarriosAl

Method 1

import pandas as pd
import time

data = pd.read_csv("2018_Yellow_Taxi_Trip_Data.csv")

t0 = time.time()  # start timer

row = data.shape[0]                            # total rows
meanColumn = data["tip_amount"].sum() / row    # mean of "tip_amount" column

t1 = time.time()  # stop timer
tiempo = round(t1 - t0, 3)  # total time in seconds

print("Number of rows: \n", row)
print("Mean of tip_amount: \n", meanColumn)
print("Total execution time: \n", tiempo)

Method 2

import pandas as pd
import time

t0 = time.time()  # start timer

data = pd.read_csv("2018_Yellow_Taxi_Trip_Data.csv")

n = 0
tipAmount1 = 0
for index, i in data.iterrows():
    tipAmount = i['tip_amount']
    tipAmount1 = tipAmount1 + tipAmount  # running total of "tip_amount"
    n = n + 1                            # row count

meanTipAmount = tipAmount1 / n  # mean of "tip_amount"
t1 = time.time()  # stop timer
tiempo = round(t1 - t0, 3)  # total time in seconds

print("Mean of tip_amount: \n", meanTipAmount)
print("Number of rows: \n", n)
print("Total execution time: \n", tiempo)

@manolinux

Vanilla Python; it takes more or less the same processing time as the previous Pandas solutions:

import time

file = '/home/manolinux/yellow_tripdata_2016-01.csv'
field = 'tip_amount'
no_more_data_to_process = False
separator = ','

# https://stackoverflow.com/questions/1883980/find-the-nth-occurrence-of-substring-in-a-string
def findnth(string, substring, n):
    parts = string.split(substring, n + 1)
    if len(parts) <= n + 1:
        return -1
    return len(string) - len(parts[-1]) - len(substring)

t0 = time.time()

# Get number of field 'tip_amount'
with open(file) as f:
    first_line = f.readline().strip()

l = [idx for idx, item in enumerate(first_line) if separator in item]
fieldNumber = 0

for commaPosition in l:
    # First position?
    if (field + separator) == first_line[0:len(field + separator)]:
        break
    # Last
    if (separator + field) == first_line[-len(separator + field):]:
        break
    # Middle
    if (separator + field) == first_line[commaPosition:commaPosition + len(separator + field)]:
        break
    fieldNumber += 1

# Processing
countLines = 0
accumulatedTips = 0

# Buffering option of open seems not to do much
with open(file, "r", 163840) as f:
    # skip first line
    f.readline()
    for line in f:
        posInLine = findnth(line, separator, fieldNumber)
        endField = posInLine + (line[posInLine + 1:].index(separator))
        countLines += 1
        # print(posInLine, endField, line[posInLine:])
        accumulatedTips += float(line[posInLine + 1:endField + 1])

print("Number of lines:", countLines)
print("Average tip:", accumulatedTips / countLines)
print('Elapsed time : ', time.time() - t0)

@boriel

boriel commented Aug 10, 2020

Simple (vanilla) Python (not even csv is needed):

#!/usr/bin/python3
import sys

SEPARATOR = ','  # Some CSV formats use other separators like tabs or ';'
generator = enumerate(open(sys.argv[1], 'rt', encoding='utf-8'))
index, header = next(generator)
field_index = header.split(SEPARATOR).index('tip_amount')
tip_amount_acc = 0

for i, line in generator:
    tip_amount_acc += float(line.split(SEPARATOR)[field_index])

print("Total lines: {}".format(i + 1))
print("Tip amount average: {}".format(tip_amount_acc / i))

Of note: SEPARATOR is configurable, and the field position is determined automatically (field_index).

@blackrez

blackrez commented Jul 17, 2021

Hello,

It was fun to play with; there are a lot of possible solutions, but I like these two.
I think streaming is the future of data, and I hate downloading big files.

Solution 1

import csv
import urllib.request
import codecs

url = "https://s3.amazonaws.com/carto-1000x/data/yellow_tripdata_2016-01.csv"

# Stream the CSV straight from S3 and aggregate row by row
stream = urllib.request.urlopen(url)
csvfile = csv.DictReader(codecs.iterdecode(stream, 'utf-8'))
count = 0
z = 0.0
for line in csvfile:
    z += float(line['tip_amount'])  # running total of tip_amount
    count += 1                      # row count

print("final")
print(z)
print(count)
avg = z / count
print(avg)

Also, I think people should be lazy and use someone else's computing capacity (I cheated: I used boto3, but I didn't have the time to rewrite an SDK), in this case AWS and S3.

Solution 2

import boto3

s3 = boto3.client('s3')

# Push the aggregation down to S3 Select: average and count in a single SQL query
resp = s3.select_object_content(
    Bucket='carto-1000x',
    Key='data/yellow_tripdata_2016-01.csv',
    ExpressionType='SQL',
    Expression="SELECT avg(cast(tip_amount as float)), count(1) FROM s3object s",
    InputSerialization={'CSV': {"FileHeaderInfo": "Use", 'FieldDelimiter': ',', 'RecordDelimiter': '\n'}, 'CompressionType': 'NONE'},
    OutputSerialization={'CSV': {}},
)
for event in resp['Payload']:
    if 'Records' in event:
        records = event['Records']['Payload'].decode('utf-8')
        print(records)

I think I will write an article on these solutions.

edit: I had forgotten the count in solution 2.
