Data is not an isolated entity. It needs to be collected from some application or system, stored in an efficient format, and, after a model is built on it, that model needs to be exposed as an API so it can integrate with other systems. Sometimes this API needs to be available within a specific latency around the globe. So there is much engineering involved in building an effective intelligent system, and in today's startup world, which is itself a billion-dollar sector, an organization cannot afford to hire many experts just to build an original feature in its product. So the data scientist needs to be a full-stack analytics professional in the startup world. In this chapter, we discuss some essential architectural patterns which every data scientist should know.
Potato Anti-Pattern:
Tom is hired as a data scientist at an online company to build a real-time analytics product. The very first step is to collect the data from their application. They make their storage auto-scale using the cloud, and from the application they push the data directly to the database. Everything looks beautiful in the test environment. They use a TCP connection to make sure there is no data loss. However, when they go live, the main application goes down, even though they have not made any change to it. The company faces a massive loss within half an hour, and Tom gets real-time feedback on the first step of his real-time analytics system: he is fired.
Now, the question is why the main application goes down when there is no change in it. From a classic computer science point of view, this is known as the busy consumer problem. Here the main application is the sender of data, and the database is the consumer. When the consumer is busy, which is a widespread scenario in any database with lots of queries running in it, it is unable to process the incoming data. Since the TCP connection guarantees delivery of the data, the sender sends the data again and again, and that load falls back on the sender, which here is the main application. The situation is very similar to one person handing a potato to another, the receiver handing it straight back, and this repeating iteratively. That is why it is called the Potato Anti-Pattern. The sequence diagram below explains the situation visually.
The problem has two aspects. If the data that flows between sender and receiver is not essential, then we can use the UDP protocol, which simply drops data it is unable to deliver. This is one reason why network monitoring protocols like SNMP and NetFlow are based on UDP: monitoring should not load the monitored device. However, if the data is essential, as in the financial sector, then we have to put a messaging queue between sender and receiver. It acts as a buffer that holds data when the receiver is unable to process it. However, if the queue memory becomes full, it either loses data or puts the load back on the sender. There is also ZeroMQ (ZMQ), a brokerless, socket-level messaging library that trades the safety of a full broker for speed.
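The two aspects can be sketched with Python's standard library: a bounded queue sits between a fast sender and a busy receiver, and when it fills up we must choose between dropping data (the UDP style) and bouncing the load back to the sender (the potato). The names and sizes here are illustrative, not from any particular broker.

```python
import queue

# A bounded buffer between a fast sender and a slow receiver.
# maxsize models the queue's finite memory.
buf = queue.Queue(maxsize=3)
dropped = []

def send(msg, lossy=True):
    """Enqueue a message without ever blocking the sender.

    lossy=True  -> drop the message when the buffer is full (UDP style).
    lossy=False -> raise, i.e. the load bounces back to the sender (the potato).
    """
    try:
        buf.put_nowait(msg)
    except queue.Full:
        if lossy:
            dropped.append(msg)   # monitoring-type data we can afford to lose
        else:
            raise                 # essential data: the sender must handle it

# The receiver is busy, so nothing is consumed while five messages arrive.
for i in range(5):
    send(f"metric-{i}")

print(buf.qsize(), len(dropped))  # 3 2
```

With `lossy=False` the third `queue.Full` would propagate to the sender, which is exactly the overload that took the main application down.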
There are many ready-made solutions in cloud platforms; we discuss them in detail in the chapter "Essential Cloud Patterns for Data Scientists." The Node JS code below is an example of a collector using RabbitMQ, exposed as a REST API to the sender, with Google BigQuery as the receiver.
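The Node JS listing is not reproduced in this excerpt; as a stand-in, here is a minimal Python sketch of the same collector shape, with the broker (RabbitMQ in the original design) and the warehouse (BigQuery in the original design) stubbed out by an in-process queue and a list, so treat it as the pattern rather than a working integration.

```python
import json
import queue

# Stand-in for the broker (RabbitMQ in the real design):
# the collector only enqueues, so the sender is acknowledged fast
# and never blocked by a busy receiver.
broker = queue.Queue()

# Stand-in for the slow receiver (Google BigQuery in the real design).
warehouse_rows = []

def collect(payload: str) -> dict:
    """What the REST endpoint would do: validate, enqueue, acknowledge."""
    record = json.loads(payload)   # reject malformed input before queuing
    broker.put(record)
    return {"status": "queued"}

def drain(batch_size: int = 100) -> int:
    """Consumer side: pull a batch off the queue and insert it into the sink.

    In the real system this would be a worker calling the warehouse's
    streaming-insert API; here the sink is just a list.
    """
    batch = []
    while not broker.empty() and len(batch) < batch_size:
        batch.append(broker.get())
    warehouse_rows.extend(batch)
    return len(batch)

collect('{"user": "u1", "event": "click"}')
collect('{"user": "u2", "event": "view"}')
print(drain())  # 2
```

The design point is the decoupling: the sender's call returns as soon as the record is queued, and the receiver drains at its own pace.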
Proxy Pattern and Layering:
Tom joins a new company. The company is big, so there is no job insecurity. Here he does not take the risk of collecting the data; it already sits in a MySQL server. Before this, Tom had no idea about databases. Very enthusiastically, he learns MySQL and writes many queries in his code. The database is owned by some other team, and their manager likes R&D very much. So every Monday Tom gets a call: the database changes from MySQL to Mongo, then from Mongo to SQL Server, and Tom has to make changes all over his code. Now Tom is not jobless, but every day he returns from the office at 12 o'clock at night.
Everyone will say the solution is to organize the code correctly. However, I think knowledge of the Proxy and Layering patterns is handy here. In the proxy pattern, instead of using a raw MySQL or Mongo connector in your code, you use a wrapper class as a proxy. In the layering pattern, you organize your code in multiple layers, where each layer uses methods only from the layer directly below it. In this case, database configuration belongs in the lowest, or core, layer. Above that sits the database utility layer, which contains the queries to the database. Above that sits the business entity layer, which uses those database queries. The Python code below gives you a clearer picture. Now Tom knows that if there is a change at the database level, he has to look into the core layer; if there is a change in a query, the database utility layer; and if there is a change in the business actors, the entity layer. So his life is easy now.
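A minimal sketch of the three layers, using SQLite from the standard library so it runs anywhere; the class and table names are illustrative. The point is that swapping MySQL for Mongo should only ever touch the core layer.

```python
import sqlite3

# --- Core layer: the only place that knows which database is in use. ---
class CoreDB:
    """Proxy over the raw connector. Swap sqlite3 for MySQL/Mongo here only."""
    def __init__(self, dsn=":memory:"):
        self._conn = sqlite3.connect(dsn)

    def execute(self, sql, params=()):
        cur = self._conn.execute(sql, params)
        self._conn.commit()
        return cur.fetchall()

# --- Database utility layer: all query text lives here. ---
class UserQueries:
    def __init__(self, core):
        self._core = core
        self._core.execute("CREATE TABLE IF NOT EXISTS users (name TEXT)")

    def insert_user(self, name):
        self._core.execute("INSERT INTO users VALUES (?)", (name,))

    def all_users(self):
        return self._core.execute("SELECT name FROM users")

# --- Business entity layer: business rules only; no SQL, no connector. ---
class UserService:
    def __init__(self, queries):
        self._queries = queries

    def register(self, name):
        if not name:
            raise ValueError("name required")
        self._queries.insert_user(name)

    def names(self):
        return [row[0] for row in self._queries.all_users()]

service = UserService(UserQueries(CoreDB()))
service.register("Tom")
print(service.names())  # ['Tom']
```

`UserService` never sees SQL, and `UserQueries` never sees a connection string, so each Monday's surprise is confined to exactly one layer.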
Before We End:
Before we end, we put in a footnote for Tom's manager on which database is suitable for which kind of scenario. When data is highly structured and entities have clear, strict relationships, a relational database (MySQL, Oracle, SQL Server) is a better choice. However, when data is unstructured and unorganized, Mongo is a better choice. When data has long textual fields and we fire lots of substring searches on them, Elasticsearch or Solr is a better choice. Elasticsearch also ships with a free data visualization tool, Kibana, and an ETL tool, Logstash, so it is becoming fashionable as a full-stack solution for data analytics. Sometimes data needs to be modeled as a graph; in that case, we require a graph database. Neo4j is very popular among graph databases, as it also provides a lot of utility tools at little cost. Sometimes we need the application to be speedy; in that case, we can use an in-memory database like SQLite. However, if you need to update your database from a remote host, SQLite does not support that. If you want more detail, please read the book "Advanced Data Analytics in Python" by Sayan Mukhopadhyay (link); it has a separate chapter with details of these databases.
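The in-memory option from the footnote is one line in Python's standard library; the table used here is just an illustrative cache. Note the trade-off the footnote mentions: the database lives in this process's RAM, so it is very fast but disappears when the process exits and cannot be reached from a remote host.

```python
import sqlite3

# ":memory:" keeps the whole database in process RAM: fast, private
# to this process, and gone when the process exits.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cache (key TEXT PRIMARY KEY, value TEXT)")
conn.execute("INSERT INTO cache VALUES (?, ?)", ("greeting", "hello"))
row = conn.execute("SELECT value FROM cache WHERE key = ?", ("greeting",)).fetchone()
value = row[0]
print(value)  # hello
```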