书名：Mastering Scala Machine Learning
作者名：Alex Kozlov
本章字数：565字
更新时间：2025-02-25 21:14:11

Chapter 2. Data Pipelines and Modeling

We have looked at basic hands-on tools for exploring the data in the previous chapter, thus we now can delve into more complex topics of statistical model building and optimal control or science-driven tools and problems. I will go ahead and say that we will only touch on some topics in optimal control since this book really is just about ML in Scala and not the theory of data-driven business management, which might be an exciting topic for a book on its own.

In this chapter, I will stay away from specific implementations in Scala and discuss the problem of building a data-driven enterprise at a high level. Later chapters will address how to solve these smaller pieces of the puzzle. A special emphasis will be given to handing uncertainty. Uncertainty usually comes in several favors: first, there can be noise in the information we are provided with. Secondly, the information can be incomplete. The system may have some degree of freedom in filling the missing pieces, which results in uncertainty. Finally, there may be variations in the interpretation of the models and the resulting metrics. The final point is subtle, as most classic textbooks assume that we can measure things directly. Not only the measurements may be noisy, but the definition of the measure may change in time—try measuring satisfaction or happiness. Certainly, we can avoid the ambiguity by saying that we can optimize only measurable metrics, as people usually do, but it will significantly limit the application domain in practice. Nothing prevents the scientific machinery from handling the uncertainty in the interpretation into account as well.

The predictive models are often built just for data understanding. From the linguistic derivation, model is a simplified representation of the actual complex buildings or processes for exactly the purpose of making a point and convincing people, one or another way. The ultimate goal for predictive modeling, the modeling I am concerned about in this book and this chapter specifically, is to optimize the business processes by taking the most important factors into account in order to make the world a better place. This was certainly a sentence with a lot of uncertainty entrenched, but at least it looks like a much better goal than optimizing a click-through rate.

Let's look at a traditional business decision-making process: a traditional business might involve a set of C-level executives making decisions based on information that is usually obtained from a set of dashboards with graphical representation of the data in one or several DBs. The promise of an automated data-driven business is to be able to automatically make most of the decisions provided the uncertainties eliminating human bias. This is not to say that we no longer need C-level executives, but the C-level executives will be busy helping the machines to make the decisions instead of the other way around.

In this chapter, we will cover the following topics:

Going through the basics of influence diagrams as a tool for decision making
Looking at variations of the pure decision making optimization in the context of adaptive Markov Decision making process and Kelly Criterion
Getting familiar with at least three different practical strategies for exploration-exploitation trade-off
Describing the architecture of a data-driven enterprise
Discussing major architectural components of a decision-making pipeline
Getting familiar with standard tools for building data pipelines