Skip to content

Instantly share code, notes, and snippets.

@MarkDana
Last active June 22, 2022 14:04
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save MarkDana/e7d9663a26091585eb6882170108485e to your computer and use it in GitHub Desktop.
Save MarkDana/e7d9663a26091585eb6882170108485e to your computer and use it in GitHub Desktop.
count-unique-in-array-performance

Given a non-negative integers array a with different len(a) (scale) and different max(a) (amplitude), what is the most efficient way in Python to count the unique elements in a?

In this experiment, we fix the scale len(a) to be 20000, and focus specifically on the amplitude max(a) of the array.

Here is the code:

def get_runtime_stat(AMPLITUDE:int):
    def _test1():
        counts = np.zeros((AMPLITUDE + 1, ), dtype=np.int32)
        # np.int16 is already enough for sample size 20000 < 32767
        for ind in a: counts[ind] += 1
        counts = counts[counts != 0] # this is because some value in range may be missing
        return counts
    
    def _test2():
        uniq_vals, uniq_cnts = np.unique(a, return_counts=True)
        return uniq_cnts # uniq_vals are sorted

    def _test3():
        counts = np.bincount(a, minlength=AMPLITUDE+1)
        counts = counts[counts != 0]
        return counts

    TIMES = [[], [], []]
    for repeat in range(10):
        SCALE = 20000
        a = np.random.randint(0, AMPLITUDE, (SCALE,))

        tic = time.time(); ct1 = _test1(); tac = time.time() # or use tracemalloc.get_traced_memory() to get memory peak
        time1 = tac - tic
        tic = time.time(); ct2 = _test2(); tac = time.time()
        time2 = tac - tic
        tic = time.time(); ct3 = _test3(); tac = time.time()
        time3 = tac - tic
        assert np.all(ct1 == ct2) and np.all(ct1 == ct3)
        del ct1, ct2, ct3
        TIMES[0].append(time1)
        TIMES[1].append(time2)
        TIMES[2].append(time3)

    return TIMES

For AMPLITUDE in [5, 10, 50, 100, ..., 1e10], here is the results of time usage and memory usage:

time memory

So in my case, when AMPLITUDE is small (<1e5) I should use np.bincount, and otherwise np.unique.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment