@yfujieda
Last active October 3, 2019 18:06
python-dask-example
# using dask to handle the big dataset
import dask.dataframe as dd
# use numpy to convert the dask result to a NumPy array
import numpy as np
# path to the dataset file
file_name = 'sample_data_set.csv'
# number of rows to output
row_count = 10
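# read the csv file into a dask dataframe, skipping malformed lines
# (on_bad_lines='skip' needs pandas >= 1.3; older versions used error_bad_lines=False)
df = dd.read_csv(file_name, on_bad_lines='skip')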
# using nlargest to return the first n rows ordered by columns in descending order.
# in this case, use 'NUM_VALUE' column as a column to order by
# reference: http://docs.dask.org/en/latest/dataframe-api.html?highlight=nlargest#dask.dataframe.DataFrame.nlargest
df2 = df.nlargest(row_count, 'NUM_VALUE')
# get the UIDs of those n rows (still a lazy dask array at this point)
uids = df2['UID'].values
# np.array() triggers the computation and returns a NumPy array
uid_array = np.array(uids)
# output the UIDs
for uid in uid_array:
    print(uid)
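
To check the result on a small file without dask, the same top-n selection can be sketched with the standard library alone. This is only an illustrative sketch: the sample rows and the helper name `top_uids` are hypothetical, and it assumes the file has `UID` and `NUM_VALUE` columns as in the script above.

```python
import csv
import heapq
import io

# Hypothetical sample rows standing in for sample_data_set.csv
SAMPLE_CSV = """UID,NUM_VALUE
a1,5
b2,42
c3,17
d4,8
"""

def top_uids(csv_file, row_count):
    """Return the UIDs of the row_count rows with the largest NUM_VALUE."""
    reader = csv.DictReader(csv_file)
    # heapq.nlargest streams the rows and returns them in descending order of the key
    top_rows = heapq.nlargest(row_count, reader, key=lambda row: float(row['NUM_VALUE']))
    return [row['UID'] for row in top_rows]

# output the UIDs of the two rows with the largest NUM_VALUE
for uid in top_uids(io.StringIO(SAMPLE_CSV), 2):
    print(uid)
```

Passing an open file object instead of `io.StringIO` would apply the same selection to a real CSV file.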