Essential Architectural Patterns for a Data Scientist

10/09/2019

Data is not an isolated entity. It has to be collected from some application or system and stored in an efficient format; after a model is built on it, that model has to be exposed as an API so that it can integrate with other systems. Sometimes this API must be available within a specific latency around the globe. So a great deal of engineering goes into building an effective intelligent system, and in today's startup world, which is itself a billion-dollar sector, an organization cannot afford to hire that many experts to build a single original feature in its product. The data scientist in a startup therefore needs to be a full-stack analytics professional. In this chapter, we discuss some essential architectural patterns that every data scientist should know.

Potato Anti-Pattern:

Tom is hired as a data scientist by an online company to build a real-time analytics product. The very first step is to collect data from the company's application. The team makes its storage auto-scale in the cloud and pushes data from the application directly to the database. Everything looks beautiful in the test environment. They use a TCP connection to make sure no data is lost. However, when they go live, the main application goes down, even though they have not changed anything in it. The company suffers a massive loss within half an hour, and Tom gets real-time feedback on the first step of his real-time analytics system: he is fired.

Now, the question is why the main application goes down when nothing in it has changed. From a classic computer science point of view, this is known as the busy consumer problem. Here the main application is the sender of the data, and the database is the consumer. When the consumer is busy, which is a very common scenario for any database with many queries running in it, it is unable to process the incoming data. Because a TCP connection guarantees delivery, the sender keeps sending the data again and again, which pushes the load back onto the sender, and here the sender is the main application. The situation is like one person handing a potato to another, the receiver handing it straight back, and the exchange repeating over and over. That is why it is called the Potato Anti-Pattern. The sequence diagram below explains the situation visually.

The problem has two aspects. If the data flowing between sender and receiver is not essential, we can use UDP, which simply drops data it is unable to deliver. This is one reason why network monitoring protocols such as SNMP and NetFlow are based on UDP: monitoring does not put extra load on the device. However, if the data is essential, as in the financial sector, we have to put a messaging queue between sender and receiver. It acts as a buffer that holds the data while the receiver is unable to process it. However, if the queue's memory fills up, it either loses data or puts the load back on the sender. There is also ZeroMQ (ZMQ), a brokerless messaging library that provides queue-like messaging over plain sockets without a separate broker process.
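The buffering idea can be sketched with nothing but Python's standard library. In the minimal sketch below, `queue.Queue` is an in-process stand-in for a real message broker, and the names `send`, `busy_consumer`, and `run_demo` are hypothetical, chosen only for illustration. It shows the two outcomes described above when the buffer fills up: either the record is dropped and the sender stays responsive, or the sender blocks and the load moves back to it.

```python
import queue
import threading
import time


def send(buffer, record, drop_when_full=True):
    """Hand one record to the buffer; return True if it was accepted."""
    try:
        if drop_when_full:
            buffer.put_nowait(record)  # never blocks the sender
        else:
            buffer.put(record)         # blocks: the load moves back to the sender
        return True
    except queue.Full:
        return False                   # record is lost, but the sender keeps running


def busy_consumer(buffer, stored, delay=0.01):
    """Drain the buffer slowly, imitating a database that is busy with queries."""
    while True:
        record = buffer.get()
        if record is None:             # sentinel: stop consuming
            break
        time.sleep(delay)              # assumed processing cost per record
        stored.append(record)


def run_demo(n_records=50, capacity=5):
    """Push n_records through a bounded buffer and report what survived."""
    buffer = queue.Queue(maxsize=capacity)
    stored = []
    worker = threading.Thread(target=busy_consumer, args=(buffer, stored))
    worker.start()
    accepted = sum(send(buffer, i) for i in range(n_records))
    buffer.put(None)                   # blocking put, so the sentinel is never dropped
    worker.join()
    return accepted, stored


if __name__ == "__main__":
    accepted, stored = run_demo()
    print(f"accepted {accepted} of 50 records, stored {len(stored)}")
```

With `drop_when_full=True` the sender behaves much like the UDP case: it never waits, and whatever does not fit in the buffer is lost. Passing `drop_when_full=False` keeps every record but makes the sender wait for the consumer, which is the trade-off to weigh when the data is essential.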

There are many ready-made solutions on cloud platforms; we discuss them in detail in our chapter “Essential Cloud Pattern for Data Scientist”. The Node.js code below is an example of a collector that uses RabbitMQ and is exposed as a REST API to the sender; here the receiver is Google BigQuery.

Proxy Pattern and Layering:

Tom joins a new company. The company is big, so there is no job insecurity. This time he does not take the risk of collecting the data himself. The data is in a MySQL server. Tom has had no exposure to databases before, so he learns MySQL very enthusiastically and writes many queries in his code. The database is owned by another team, whose manager is fond of R&D. So every Monday Tom gets a call: the database changes from MySQL to Mongo, then from Mongo to SQL Server, and Tom has to make changes all over his code. Tom is not jobless now, but every day he returns from the office at 12 o'clock at night.

I think everyone will say the solution is to organize the code correctly. However, I think knowledge of the Proxy and Layering patterns is what makes that advice concrete. In the proxy pattern, instead of using the raw MySQL or Mongo connector throughout your code, you use a wrapper class as a proxy. In the layering pattern, you organize your code in multiple layers, where each layer uses methods only from the layer directly below it. In this case, database configuration belongs in the lowest layer, the core layer. Above that sits the database utility layer, which contains the queries to the database. Above that sits the business entity layer, which uses those database queries. The Python code below gives a clearer picture. Now Tom knows that if anything changes at the database level, he has to look into the core layer; if a query changes, he has to look into the database utility layer; and if a business actor changes, he has to look into the entity layer. So his life is easy now.
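A minimal sketch of the three layers described above, under a few loud assumptions: Python's built-in `sqlite3` module stands in for the MySQL server so that the example runs anywhere, and the names `SqliteProxy`, `CustomerQueries`, and `Customer` are hypothetical, invented for illustration.

```python
import sqlite3


# --- Core layer: the only place that knows which database is in use ---
class SqliteProxy:
    """Proxy around the raw connector; no other layer imports sqlite3."""

    def __init__(self, path=":memory:"):
        self._conn = sqlite3.connect(path)

    def execute(self, sql, params=()):
        with self._conn:  # commits on success, rolls back on error
            return self._conn.execute(sql, params).fetchall()


# --- Database utility layer: every query lives here ---
class CustomerQueries:
    def __init__(self, db):
        self._db = db

    def create_schema(self):
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS customer (name TEXT, spend REAL)"
        )

    def insert(self, name, spend):
        self._db.execute(
            "INSERT INTO customer (name, spend) VALUES (?, ?)", (name, spend)
        )

    def total_spend(self, name):
        rows = self._db.execute(
            "SELECT COALESCE(SUM(spend), 0) FROM customer WHERE name = ?",
            (name,),
        )
        return rows[0][0]


# --- Business entity layer: uses the query layer, never SQL or connectors ---
class Customer:
    def __init__(self, name, queries):
        self.name = name
        self._queries = queries

    def buy(self, amount):
        self._queries.insert(self.name, amount)

    def is_premium(self, threshold=100.0):
        return self._queries.total_spend(self.name) >= threshold


if __name__ == "__main__":
    queries = CustomerQueries(SqliteProxy())
    queries.create_schema()
    tom = Customer("Tom", queries)
    tom.buy(60.0)
    tom.buy(55.5)
    print(tom.is_premium())
```

If the owning team moves to another SQL server, only the proxy class in the core layer needs a new connector. A move to a non-SQL store such as Mongo would also touch the query layer, but the entity layer would still stay as it is.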

Before We End:

Before we end, we add a footnote for Tom's manager on which database suits which kind of scenario. When data is highly structured and entities have clear and strict relationships, a relational database (MySQL, Oracle, SQL Server) is a better choice. However, when data is unstructured and unorganized, Mongo is a better choice. When data has long textual fields and we run many substring searches on them, Elasticsearch or Solr is a better choice. Elasticsearch also comes with a free data visualization tool, Kibana, and an ETL tool, Logstash, so it is a popular choice as a full-stack solution for data analytics. Sometimes data needs to be modeled as a graph. In that case, we require a graph database. Neo4j is very popular among graph databases, as it also provides many utility tools at little cost. Sometimes we need the application to be very fast. In that case, we can use a database that can run in memory, such as SQLite. However, SQLite has no server process, so it does not support updating the database from a remote host. If you want more detail, please read the book “Advanced Data Analytics Using Python” written by Sayan Mukhopadhyay; we have a separate chapter there with details of these databases.
