In this post, I will show you the steps to design an architecture that optimizes resource use in Big Data problems, using Netflix as an example.
Surely you have often had to program an Extraction, Transformation and Loading (ETL) process and not known which tools to pick from the thousands available on the Internet.
This decision is very important: it can mean the failure or success of a project, because a poor architecture can turn a fast, cheap ETL into a slow, expensive one.
Although you should analyze in depth the origin of the data, the type of data, the update frequency, and so on, below are some steps that can be very useful in helping us make a decision.
We will use Netflix as an example of a real-world problem.
1. What is your project about?
The most important thing is to understand very well the project we are working on and which requirements must be met to ensure success.
Netflix has a recommendation system that continually keeps us engaged with its platform, suggesting movies or series that suit our tastes. You can see this very easily at the end of a video, when it lets us leave or start a newly recommended one.
What requirements do we need?
> A database to store all the movies or series each person has watched, along with their personal information.
> A recommendation system.
> An event streaming platform to capture every update.
> A stream processing tool.
Then we must choose the best tool for each requirement.
2. Select tools
2.1. Database
To select the database, take into account the CAP theorem, which states that a distributed database can fully guarantee at most two of three properties: consistency, availability, and partition tolerance. For example, a database that prioritizes consistency and availability cannot also tolerate network partitions.
Like Netflix, we want the recommendation system to be a product that stays available: even if a node fails at some point, it should still return results, because many people will be using the service at the same time. For this reason, consistency is not the priority. It is acceptable to show the user a recommendation that is slightly stale, since it is better to show a movie that is more or less similar to their tastes than to show nothing and lose the opportunity for the user (the client) to keep enjoying the movies.
The most widely used databases that favor availability and partition tolerance are CouchDB, Cassandra, Riak and DynamoDB. For more information, see Packt Hub.
My recommendation is to use Cassandra or DynamoDB, but my choice for this problem would be Cassandra, since it offers slightly higher availability.
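As a sketch of what storing viewing history in Cassandra could look like, here is an illustrative data model. The keyspace, table, and column names are invented for this example, not Netflix's real schema:

```python
# A minimal sketch of a Cassandra data model for viewing history.
# All names here are illustrative, not Netflix's actual schema.

CREATE_KEYSPACE = """
CREATE KEYSPACE IF NOT EXISTS netflix_demo
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
"""

# Partition by user_id so all of one user's views live in the same
# partition; cluster by watched_at DESC so recent views come first.
CREATE_VIEWS_TABLE = """
CREATE TABLE IF NOT EXISTS netflix_demo.watch_history (
    user_id    uuid,
    watched_at timestamp,
    title_id   uuid,
    device     text,
    PRIMARY KEY ((user_id), watched_at)
) WITH CLUSTERING ORDER BY (watched_at DESC);
"""

if __name__ == "__main__":
    # With a running cluster you would execute these statements via
    # the official Python driver (the `cassandra-driver` package):
    #   from cassandra.cluster import Cluster
    #   session = Cluster(["127.0.0.1"]).connect()
    #   session.execute(CREATE_KEYSPACE)
    #   session.execute(CREATE_VIEWS_TABLE)
    print(CREATE_VIEWS_TABLE.strip())
```

Modeling the table around the query (all views for one user, newest first) is the usual Cassandra design approach, as opposed to normalizing first and joining later.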
2.2. Recommendation System
There are many Python packages for coding a recommendation system; however, for large amounts of data it is advisable to use Hadoop or Spark. If you choose Hadoop, you have to program the recommendation algorithm from scratch, since it does not include one. On the other hand, if you choose Spark, you can use its Spark MLlib module to create this model.
To build this system, I would choose Spark for two main reasons.
> First, for how easy it is to create the model: Spark MLlib ships its own collaborative filtering implementation, which helps us reduce the preprocessing needed for a real-time system.
> Second, because ingesting data from the Cassandra database is very simple with the available connectors (e.g. the Spark Cassandra Connector).
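Spark MLlib's collaborative filtering is based on matrix factorization (ALS, alternating least squares). To make the idea concrete without a Spark cluster, here is a toy pure-Python latent-factor model; it uses gradient descent rather than ALS proper, and the ratings and hyperparameters are invented for illustration:

```python
import random

def factorize(ratings, n_users, n_items, rank=2, steps=2000, lr=0.01, reg=0.05):
    """Toy latent-factor model: the same idea Spark MLlib's ALS
    implements at scale, here fitted with plain gradient descent."""
    random.seed(0)
    U = [[random.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_users)]
    V = [[random.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_items)]
    for _ in range(steps):
        for u, i, r in ratings:
            pred = sum(U[u][k] * V[i][k] for k in range(rank))
            err = r - pred
            for k in range(rank):
                uk, vk = U[u][k], V[i][k]
                U[u][k] += lr * (err * vk - reg * uk)  # regularized update
                V[i][k] += lr * (err * uk - reg * vk)
    return U, V

def predict(U, V, u, i):
    return sum(U[u][k] * V[i][k] for k in range(len(U[0])))

# Ratings as (user, item, rating): users 1 and 2 share tastes with user 0.
ratings = [(0, 0, 5), (0, 1, 4), (1, 0, 5), (1, 2, 1), (2, 1, 4), (2, 2, 1)]
U, V = factorize(ratings, n_users=3, n_items=3)
# User 0 never rated item 2; similar users disliked it, so the
# model should score it well below item 0.
print(predict(U, V, 0, 0), predict(U, V, 0, 2))
```

In MLlib the equivalent is `pyspark.ml.recommendation.ALS`, which distributes this factorization across the cluster.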
2.3. Event Streaming
Kafka will be used to ingest the data into the database in real-time so that more accurate recommendations can be made. There are other options, such as Flume; however, Flume is harder to scale, since its topology must be changed completely, and, very importantly, it does not automatically replicate data across the other nodes.
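To see what Kafka's partitioned log buys us, here is a toy in-memory model of a topic. In a real system you would use a client library such as kafka-python or confluent-kafka against a broker; this sketch only illustrates the concept that events with the same key land in the same partition, preserving per-user ordering:

```python
import hashlib

class TopicLog:
    """Toy model of a Kafka topic: an append-only log split into
    partitions, with all events for one key in the same partition."""
    def __init__(self, n_partitions=3):
        self.partitions = [[] for _ in range(n_partitions)]

    def partition_for(self, key):
        # Hash the key to pick a partition deterministically.
        digest = hashlib.md5(key.encode()).hexdigest()
        return int(digest, 16) % len(self.partitions)

    def produce(self, key, value):
        p = self.partition_for(key)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1  # (partition, offset)

views = TopicLog()
for user, title in [("alice", "movie-1"), ("bob", "movie-2"), ("alice", "movie-3")]:
    views.produce(user, title)

# All of alice's events sit in one partition, in the order produced.
p = views.partition_for("alice")
print([v for k, v in views.partitions[p] if k == "alice"])  # → ['movie-1', 'movie-3']
```

Real Kafka adds replication of each partition across brokers, which is exactly the property the paragraph above notes Flume lacks.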
2.4. Stream Processing
With Kafka in place, real-time processing is needed to write the data into the database in the appropriate format for later analysis and model training. For this, Spark Streaming is the best option, as its integration with Kafka for stream processing is excellent.
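Spark Streaming processes a stream as a sequence of micro-batches. The following pure-Python sketch imitates that model on a small invented event stream, batching by count instead of by time interval to keep it simple:

```python
from collections import Counter
from itertools import islice

def micro_batches(events, batch_size):
    """Yield fixed-size micro-batches, the way Spark Streaming chops
    a stream into small chunks on each trigger interval."""
    it = iter(events)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

def process(batch, totals):
    # The per-batch transformation: here, update running view counts
    # that would then be written out to Cassandra.
    totals.update(title for _user, title in batch)

stream = [("alice", "movie-1"), ("bob", "movie-1"), ("alice", "movie-2"),
          ("carol", "movie-1"), ("bob", "movie-2")]
totals = Counter()
for batch in micro_batches(stream, batch_size=2):
    process(batch, totals)
print(totals.most_common(1))  # → [('movie-1', 3)]
```

In actual Spark Structured Streaming the source would be declared with `spark.readStream.format("kafka")` and the aggregation written out with a streaming sink, but the batch-then-aggregate shape is the same.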
3. Draw your architecture
It is very important to draw the system you want to implement, to check whether it makes sense, whether it is the best design, and whether it meets the objectives.
4. Implement and configure
Once the tools have been chosen for each module, each one must be installed and configured so that they all integrate with one another.
There are different techniques and tools for these configurations:
> You can use Docker to virtualize each tool and have greater control over its resources.
> You can use Ansible to automate the creation and modification of local/cloud resources.
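For example, a minimal docker-compose sketch could bring up the storage and streaming pieces locally. This is illustrative only: the image names are real public images, but each needs additional configuration (ports, environment variables, volumes) per its own documentation before it is usable beyond a quick experiment:

```yaml
# Illustrative sketch only - consult each image's docs for the
# required environment variables and production settings.
services:
  cassandra:
    image: cassandra:4.1
    ports:
      - "9042:9042"   # CQL native transport
  kafka:
    image: bitnami/kafka:latest
    ports:
      - "9092:9092"   # Kafka client port
```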
In this small post, we have used Netflix as an example, as if we were working there. In reality, the company uses many more types of services and databases to provide an optimal service; however, it does use Cassandra to store this type of data, in a model similar to the one defined above.
Choosing a good architecture for a service is an expensive job that should be done by a large team, since you need to know the variety of options for each resource and their functionality at a technical level.
If you start from the requirements your system needs and check which tools integrate best with it, the result will be well on the way to success.