When we transform millions of rows, we need to optimize our code to keep the processing time down. On small datasets a slow approach costs only seconds, but in the world of Big Data it can cost hours or days.
In Python, there are many ways to loop over a dataframe, dictionary or list. However, when there is a lot of data and many operations, not all of them are a good option.
Next, we will compare several ways of performing an operation between two arrays to check which method is the fastest. I promise that the final results will surprise you.
Create random lists
First, we have to create the data that we will use. In this case, two NumPy arrays, each with 10 million random values drawn from a uniform distribution.
import numpy as np

numpy_list_1 = np.random.uniform(-1, 1, size=10000000)
numpy_list_2 = np.random.uniform(-1, 1, size=10000000)
The following function is just an example; it uses two types of operation that increase the execution time:
def calc_fun(input_list_1, input_list_2):
    if input_list_1 > input_list_2:
        return (input_list_1*input_list_1)**(input_list_2*input_list_2)
    elif input_list_1 == input_list_2:
        return (input_list_1*input_list_2)**(input_list_1*input_list_2)
    else:
        return (input_list_2*input_list_2)**(input_list_1*input_list_1)
We will apply the function defined above with the most commonly used iteration methods. In addition, each piece of code has been executed 1000 times using '%timeit' to average out random variation.
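Outside IPython, where '%timeit' is not available, a similar measurement can be sketched with the standard timeit module. This is only an illustration: the array size (10,000 values), the seed, and the repeat counts below are assumptions chosen so the snippet runs quickly, and the helper name run_map is hypothetical.

```python
import timeit

import numpy as np


def calc_fun(a, b):
    # Same scalar function as in the article, with shorter argument names.
    if a > b:
        return (a * a) ** (b * b)
    elif a == b:
        return (a * b) ** (a * b)
    else:
        return (b * b) ** (a * a)


# Assumption: a small sample instead of the article's 10 million values.
rng = np.random.default_rng(0)
small_1 = rng.uniform(-1, 1, size=10_000)
small_2 = rng.uniform(-1, 1, size=10_000)


def run_map():
    # Hypothetical helper wrapping one of the methods under test.
    return list(map(calc_fun, small_1, small_2))


calls_per_repeat = 3
# Each entry in timings is the total time of calls_per_repeat calls.
timings = timeit.repeat(run_map, repeat=5, number=calls_per_repeat)
mean_per_call = sum(timings) / len(timings) / calls_per_repeat
print(f"map: {mean_per_call:.4f} s per call (mean over 5 repeats)")
```

The absolute numbers will differ from those in this article, since they depend on the machine and on the array size.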
Test 1: Standard Loop
This is the most widely used form because it resembles loops in other programming languages.
new_list = []
for i in range(len(numpy_list_1)):
    new_list.append(calc_fun(numpy_list_1[i], numpy_list_2[i]))
Time (mean): 8.55 s
Test 2: List Comprehension
This form is very good since it does not need external packages and gives us more freedom over the output format: the same syntax also builds sets and dictionaries.
new_list = [calc_fun(numpy_list_1[i], numpy_list_2[i]) for i in range(len(numpy_list_1))]
Time (mean): 7.57 s
Test 3: Zip
Zip() is a built-in function that pairs up the elements of two or more lists or arrays, so we can iterate over them in parallel without indexing.
new_list = [calc_fun(v1,v2) for v1, v2 in zip(numpy_list_1, numpy_list_2)]
Time (mean): 6.43 s
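To make the pairing behaviour of zip() concrete, here is a tiny illustration with made-up values (not the benchmark data):

```python
xs = [1.0, 2.0, 3.0]
ys = [10.0, 20.0, 30.0]

# zip() yields one tuple per position, taking one element from each input.
pairs = list(zip(xs, ys))
print(pairs)  # [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)]

# With inputs of different lengths, zip() stops at the shortest one.
truncated = list(zip(xs, [10.0]))
print(truncated)  # [(1.0, 10.0)]
```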
Test 4: Map
Map() is a built-in function that applies a function to the elements of one or more iterables and returns an iterator (a map object) that produces the results one by one.
new_list = list(map(calc_fun, numpy_list_1, numpy_list_2))
Time (mean): 5.99 s
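The reason for wrapping the call in list() is that map() is lazy: nothing is computed until the iterator is consumed. A small sketch with a hypothetical helper add and made-up values:

```python
def add(a, b):
    # Hypothetical two-argument function, standing in for calc_fun.
    return a + b


lazy = map(add, [1, 2, 3], [10, 20, 30])

# Results are produced on demand, one element at a time.
first = next(lazy)
print(first)  # 11

# list() consumes whatever is left in the iterator.
rest = list(lazy)
print(rest)  # [22, 33]
```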
Test 5: Numpy Vectorize
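np.vectorize wraps a scalar function so that it accepts whole arrays and applies NumPy's broadcasting rules. Internally it still calls the Python function once per element, so it is provided mainly for convenience.
vectorized_func = np.vectorize(calc_fun, cache=False)
new_list = list(vectorized_func(numpy_list_1, numpy_list_2))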
Time (mean): 4.45 s
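Before comparing timings, it is worth confirming that the five approaches return the same values. A minimal sketch, assuming a much smaller sample (1,000 values) and a seeded generator, neither of which comes from the original benchmark:

```python
import numpy as np


def calc_fun(a, b):
    # Same scalar function as in the article, with shorter argument names.
    if a > b:
        return (a * a) ** (b * b)
    elif a == b:
        return (a * b) ** (a * b)
    else:
        return (b * b) ** (a * a)


rng = np.random.default_rng(42)
a = rng.uniform(-1, 1, size=1_000)
b = rng.uniform(-1, 1, size=1_000)

# Test 1: standard loop
loop_result = []
for i in range(len(a)):
    loop_result.append(calc_fun(a[i], b[i]))

# Tests 2 to 5: comprehension, zip, map, np.vectorize
others = {
    "comprehension": [calc_fun(a[i], b[i]) for i in range(len(a))],
    "zip": [calc_fun(v1, v2) for v1, v2 in zip(a, b)],
    "map": list(map(calc_fun, a, b)),
    "vectorize": list(np.vectorize(calc_fun, cache=False)(a, b)),
}

# allclose tolerates tiny floating-point differences between code paths.
all_match = all(np.allclose(loop_result, result) for result in others.values())
print("all methods agree:", all_match)
```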
As shown in the figure, each of the methods we have used improves on the processing time of the standard loop.
In particular, Numpy Vectorize cuts the time of the Standard Loop by about 48% (from 8.55 s to 4.45 s).
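The relative improvements can be worked out directly from the mean times reported above. The snippet below only does that arithmetic; it does not re-run the benchmark.

```python
# Mean times in seconds, copied from the five tests above.
times = {
    "Standard Loop": 8.55,
    "List Comprehension": 7.57,
    "Zip": 6.43,
    "Map": 5.99,
    "Numpy Vectorize": 4.45,
}

baseline = times["Standard Loop"]

# Percentage of time saved relative to the standard loop.
reductions = {name: (1 - t / baseline) * 100 for name, t in times.items()}

for name, t in times.items():
    print(f"{name}: {t:.2f} s, {reductions[name]:.1f}% less than the standard loop")
```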
Although it depends on the transformation you apply to your data, it's important to note that NumPy vectorization will perform better in most cases. True vectorization, where the whole operation is written with array expressions, moves the loop out of the Python interpreter into NumPy's compiled C code and works on contiguous memory, which gives good locality of reference. Note that np.vectorize is mainly a convenience wrapper: it still calls the Python function once per element, which helps explain why its gain here is limited to roughly a factor of two.
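For comparison, here is a minimal sketch of what a fully vectorized version of calc_fun might look like, written with array expressions and np.where instead of one Python call per element. The name calc_fun_arrays is hypothetical and not part of the original benchmark, and no timing is claimed for it here; the sample size and seed are also assumptions.

```python
import numpy as np


def calc_fun(a, b):
    # Scalar reference implementation from the article.
    if a > b:
        return (a * a) ** (b * b)
    elif a == b:
        return (a * b) ** (a * b)
    else:
        return (b * b) ** (a * a)


def calc_fun_arrays(x, y):
    # Hypothetical array version: every step operates on whole arrays.
    x2 = x * x
    y2 = y * y
    # np.where evaluates all branches for every element and then selects.
    # Where x == y, x * y equals x * x, so x2 is reused for that branch;
    # this keeps the base non-negative and avoids NaN warnings elsewhere.
    return np.where(
        x > y,
        x2 ** y2,
        np.where(x == y, x2 ** x2, y2 ** x2),
    )


rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=1_000)
y = rng.uniform(-1, 1, size=1_000)

expected = np.array([calc_fun(a, b) for a, b in zip(x, y)])
matches = np.allclose(calc_fun_arrays(x, y), expected)
print("array version matches scalar version:", matches)
```

The trade-off is that all branches are computed for every element, which costs extra arithmetic but removes the per-element Python overhead.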
Finally, it is recommended to use NumPy arrays rather than Python lists, firstly because of their lower memory usage and higher speed, and secondly because they support vectorization. However, if you don't want to use functions like Map, Filter, Reverse or Zip, among others, it's recommended to at least replace the standard loop with a list comprehension.
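The memory difference can be estimated with a short sketch. It assumes CPython and float64 values; the size (100,000 values) is arbitrary, and sys.getsizeof only gives an approximation of what a list of floats really occupies.

```python
import sys

import numpy as np

n = 100_000
values = np.random.default_rng(0).uniform(-1, 1, size=n)
as_list = values.tolist()  # plain Python list of Python floats

# A float64 array stores 8 bytes per value in one contiguous buffer.
array_bytes = values.nbytes

# A list stores pointers, and each float is a separate Python object.
list_bytes = sys.getsizeof(as_list) + sum(sys.getsizeof(v) for v in as_list)

print(f"NumPy array: {array_bytes} bytes")
print(f"Python list: {list_bytes} bytes (approximate)")
print(f"ratio: {list_bytes / array_bytes:.1f}x")
```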