Don't use Standard Loops

When transforming millions of rows, optimizing the code to minimize processing time becomes essential. On small datasets the difference is a matter of seconds, but in the world of Big Data it can be hours or days.

In Python, there are many ways to loop over a dataframe, dictionary or list. However, when there is a lot of data and many operations, not all of them are a good option.

Below, we compare several ways of performing an element-wise operation between two lists to see which method is the fastest. I promise that the final results will surprise you.

Create random lists

First, we create the data we will use: two NumPy arrays, each holding 10 million random values drawn from a uniform distribution on [-1, 1).

import numpy as np

numpy_list_1 = np.random.uniform(-1,1,size=10000000)
numpy_list_2 = np.random.uniform(-1,1,size=10000000)

Create function

The following function is just an example that uses two kinds of operation that increase the execution time:

  1. Exponential operation

  2. If-Else operation

def calc_fun(input_list_1, input_list_2):
    # Despite the parameter names, each call receives one element from each list.
    if input_list_1 > input_list_2:
        return (input_list_1*input_list_1)**(input_list_2*input_list_2)
    elif input_list_1 == input_list_2:
        return (input_list_1*input_list_2)**(input_list_1*input_list_2)
    else:
        return (input_list_2*input_list_2)**(input_list_1*input_list_1)
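As a quick sanity check, the function can be called on a few scalar pairs, one for each branch. The input values below are chosen only for illustration and are not part of the benchmark.

```python
def calc_fun(input_list_1, input_list_2):
    if input_list_1 > input_list_2:
        return (input_list_1*input_list_1)**(input_list_2*input_list_2)
    elif input_list_1 == input_list_2:
        return (input_list_1*input_list_2)**(input_list_1*input_list_2)
    else:
        return (input_list_2*input_list_2)**(input_list_1*input_list_1)

# Illustrative inputs (assumed values, one per branch)
greater = calc_fun(0.5, 0.25)   # first branch:  0.25 ** 0.0625
equal = calc_fun(0.5, 0.5)      # second branch: 0.25 ** 0.25
smaller = calc_fun(0.25, 0.5)   # third branch:  0.25 ** 0.0625

print(greater, equal, smaller)
```

Note that swapping the two arguments gives the same result, since the first and third branches mirror each other.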

Tests

We will apply the function above using the most common iteration methods. Each snippet was executed 1000 times using '%timeit' to reduce the effect of random variation.
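Outside a notebook, a similar measurement can be sketched with the standard library's timeit module. The array size (10,000), the seed, the number of repeats and the helper name standard_loop are assumptions chosen to keep the example quick; they are not the settings used for the results in this post.

```python
import timeit

import numpy as np


def calc_fun(input_list_1, input_list_2):
    if input_list_1 > input_list_2:
        return (input_list_1*input_list_1)**(input_list_2*input_list_2)
    elif input_list_1 == input_list_2:
        return (input_list_1*input_list_2)**(input_list_1*input_list_2)
    else:
        return (input_list_2*input_list_2)**(input_list_1*input_list_1)


# Small, seeded arrays so the sketch runs in well under a second (assumed sizes)
rng = np.random.default_rng(0)
a = rng.uniform(-1, 1, size=10_000)
b = rng.uniform(-1, 1, size=10_000)


def standard_loop():
    # Hypothetical wrapper around the standard loop so that timeit can call it
    out = []
    for i in range(len(a)):
        out.append(calc_fun(a[i], b[i]))
    return out


# Run the loop once per measurement and repeat the measurement five times
timings = timeit.repeat(standard_loop, number=1, repeat=5)
mean_time = sum(timings) / len(timings)
print(f"mean over {len(timings)} runs: {mean_time:.4f} s")
```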

Test 1: Standard Loop

This is the most widely used form because it resembles loops in other programming languages.

new_list = []
for i in range(len(numpy_list_1)):
    new_list.append(calc_fun(numpy_list_1[i], numpy_list_2[i]))

Time (mean): 8.55 s

Test 2: List Comprehension

This form is very good since it needs no external packages and gives us more freedom in choosing the format of the result.

new_list = [calc_fun(numpy_list_1[i], numpy_list_2[i]) for i in range(len(numpy_list_1))]

Time (mean): 7.57 s

Test 3: Zip

zip() is a built-in function that pairs up the elements of several lists or arrays, so we can iterate over both inputs at once without indexing.

new_list = [calc_fun(v1,v2) for v1, v2 in zip(numpy_list_1, numpy_list_2)]

Time (mean): 6.43 s
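To make the pairing behaviour of zip() concrete, here is a minimal sketch on tiny inputs. The values are made up for illustration.

```python
import numpy as np

a = np.array([0.1, 0.2, 0.3])
b = np.array([1.0, 2.0, 3.0])

# zip yields one tuple per position, taking one element from each input
pairs = list(zip(a, b))
print(pairs)

# With inputs of different lengths, zip stops at the shortest one
truncated = list(zip([1, 2, 3], [10, 20]))
print(truncated)
```

Because zip() stops silently at the shortest input, it is worth checking that both arrays have the same length before relying on it.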

Test 4: Map

map() is a built-in function that returns an iterator (a map object) which applies the given function to each group of elements and yields the results one at a time.

new_list = list(map(calc_fun, numpy_list_1, numpy_list_2))

Time (mean): 5.99 s
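The iterator returned by map() is lazy: nothing is computed until it is consumed, which is why the test wraps it in list(). A small sketch with a hypothetical helper, record, that logs each call:

```python
calls = []


def record(x, y):
    # Hypothetical helper: remembers every call so we can see when work happens
    calls.append((x, y))
    return x + y


lazy = map(record, [1, 2, 3], [10, 20, 30])
calls_before = len(calls)   # no calls have happened yet

results = list(lazy)        # consuming the iterator triggers the calls
calls_after = len(calls)

print(calls_before, calls_after, results)
```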

Test 5: Numpy Vectorize

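np.vectorize wraps a scalar function so that it can be called directly on whole arrays. According to the NumPy documentation, it is provided primarily for convenience and is implemented essentially as a for loop, so the function is still called once per element.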

vectorized_func = np.vectorize(calc_fun, cache=False)
new_list = list(vectorized_func(numpy_list_1, numpy_list_2))

Time (mean): 4.45 s
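To see how often np.vectorize calls the wrapped function, the sketch below counts the calls on a five-element input. The helper name counted and the explicit otypes argument are choices made for this illustration, not part of the benchmark above.

```python
import numpy as np

call_count = 0


def counted(x, y):
    # Hypothetical scalar function that counts how often it is invoked
    global call_count
    call_count += 1
    return x + y


# otypes fixes the output dtype up front instead of inferring it from a trial call
vec = np.vectorize(counted, otypes=[float])
out = vec(np.arange(5.0), np.arange(5.0))

print(out)
print("Python-level calls:", call_count)  # at least one call per element
```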

Analysis

As shown in the figure, every method we have tried improves on the processing time of the standard loop.

[Figure: histogram comparing the mean time of each method]

In addition, NumPy Vectorize cuts the time by nearly 50% compared with the standard loop (4.45 s versus 8.55 s).

[Figure: histogram of the decrease in time relative to the standard loop]
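The decrease relative to the standard loop can be recomputed from the mean times reported above. This is only arithmetic on those five numbers; the dictionary and variable names are chosen for illustration.

```python
# Mean times in seconds, copied from the tests above
times = {
    "Standard Loop": 8.55,
    "List Comprehension": 7.57,
    "Zip": 6.43,
    "Map": 5.99,
    "Numpy Vectorize": 4.45,
}

baseline = times["Standard Loop"]

# Percentage decrease of each method with respect to the standard loop
decrease = {
    name: round(100 * (baseline - t) / baseline, 1)
    for name, t in times.items()
}

for name, pct in decrease.items():
    print(f"{name}: {pct}% faster than the standard loop")
```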

Although it depends on the transformation you apply to your data, it's important to note that NumPy vectorization will in most cases give better performance. True vectorization, where the whole computation is expressed as operations on arrays, moves the loop out of the Python interpreter and into NumPy's compiled C code, and it benefits from locality of reference in memory. np.vectorize only goes part of the way: it still calls the Python function once per element, which may explain why the influence of the C implementation is not very noticeable in this test.
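For comparison, here is a sketch of the same calculation written purely with array operations, using np.where to choose between branches. The name calc_fun_arrays is hypothetical, and this variant was not part of the timed tests, so no timing is claimed for it; the example only checks that it matches calc_fun on a small seeded sample.

```python
import numpy as np


def calc_fun(input_list_1, input_list_2):
    if input_list_1 > input_list_2:
        return (input_list_1*input_list_1)**(input_list_2*input_list_2)
    elif input_list_1 == input_list_2:
        return (input_list_1*input_list_2)**(input_list_1*input_list_2)
    else:
        return (input_list_2*input_list_2)**(input_list_1*input_list_1)


def calc_fun_arrays(a, b):
    # Hypothetical array version of calc_fun.
    # When a == b, a*b equals a*a, so the "equal" branch gives the same
    # value as the "greater" branch and both can share one condition.
    a2 = a * a
    b2 = b * b
    return np.where(a >= b, a2 ** b2, b2 ** a2)


rng = np.random.default_rng(0)
a = rng.uniform(-1, 1, size=1_000)
b = rng.uniform(-1, 1, size=1_000)
b[:10] = a[:10]  # force a few equal pairs so every branch is exercised

reference = np.vectorize(calc_fun)(a, b)
fast = calc_fun_arrays(a, b)

print(np.allclose(reference, fast))
```

One design point to keep in mind: np.where evaluates both candidate arrays in full and then selects element by element, so it trades some extra arithmetic for the removal of the Python-level loop.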

Finally, it is recommended to use NumPy arrays instead of Python lists, firstly because of their lower memory use and higher speed, and secondly because they support vectorization. However, if you don't want to use functions like map, filter, reversed or zip, among others, it's recommended to at least replace the standard loop with a list comprehension.

 

