I will explain the differences between Scala and PySpark to help you decide which language is better for you.
Surely, when getting into the world of Big Data, you have heard of frameworks such as Spark. These technologies can be programmed in Python (PySpark) or Scala, and recent versions also support R.
Which programming language is better? To answer this question, we must take into account different factors.
First, if our team needs to update Spark continually to incorporate new improvements as they appear, Scala is the choice. This is because the Spark source code is written in Scala, so both the community and external companies ship new features and fixes to the Scala API first.
Some examples of libraries that took months to appear for PySpark are XGBoost (a library that improves machine learning models using boosting) and CosmosDB (a library that allows asynchronous writes to CosmosDB).
Availability of packages
Although Scala allows us to use the latest Spark without breaking our code, it has far fewer libraries than PySpark. Since PySpark is based on Python, it has access to all the text-processing, deep learning, and visualization libraries that Scala lacks.
So, if you need libraries to avoid implementing each algorithm yourself, PySpark is the best choice.
Spark allows you to create custom UDFs to apply an arbitrary function over a DataFrame. But when the dataset is very large, performance is much worse in PySpark. This is because PySpark has to serialize each row of our DataFrame from the JVM to a Python worker, apply the function there, and serialize the result back. However, it always depends on the algorithm we want to apply, because some libraries operate only on Python objects anyway.
Also, most calls to the Spark cluster go through the SparkContext and the driver, which live in the JVM, so Python has to reach the SparkContext through a bridge to the JVM first. In Scala this extra hop is avoided and the context is used directly.
When it comes to performance, the choice is harder because it depends on which model or transformation we want to apply to the data. But for large amounts of data, Scala usually offers better performance, although the difference gets smaller with each release.
Python is an interpreted language that has become very popular. Because of this, many people know it and find it quick to read. As a result, when you start using PySpark the environment feels familiar, because the calls to Spark are made through functions very similar to those in plain Python.
In addition to readability itself, PySpark has a gentler learning curve, since it avoids the complex Scala syntax needed to define operators.
From a more personal point of view, I find PySpark easier to learn and read; for production, however, Scala is a much "prettier" programming language. For this reason, many people prototype in PySpark and, once all the code is validated, move to production in Scala.