The True Meaning of Alignment: Decision Modeling in Data Science

This post has been republished via RSS; it originally appeared at: New blog articles in Microsoft Tech Community.

I. What does "Decision Modeling" mean?

Decision modeling is part of the data science process---or rather it should be, since overlooking it has populated the graveyard of data science failures. It is a mistake to limit the modeling scope to prediction alone, as the term "Predictive Analytics" suggests. Doing so leads to fundamental confusion about how to model a situation where decisions are involved---and invariably they are involved. This should be obvious to anyone who has studied organizational behavior or decision-making theory in economics, and it is a cautionary tale for anyone wanting to become a competent data scientist.

This article explains how to formulate a decision model from a predictive machine learning (ML) model. This is illustrated with a kind of causal diagram known as a Bayes network. A decision model needs to model---that is to align--- decisions, uncertainties, and outcomes. Data Scientists will (or should) find modeling uncertainties with probabilities familiar. The corresponding value model over outcomes is equally important, supported by an equally rich theory, but unfortunately it is not given the same attention. Often one is advised to "be sure to solve the business problem", and it's left at that. I'll make the case for value modeling, then show how, when decision variables are identified, the two models combine into one optimization model. Owning the decision model implies yet another competency required of a data scientist.

II. Kinds of variables: Decisions, Uncertainties, and Values

The decision modeling process begins by setting the problem scope. This makes it possible to ascertain which variables---quantities susceptible to measurement---to include. We partition variables into 3 types:

choice variables that make up a decision,
uncertainties that describe the world, and
values that quantify outcomes.

These three variable classes comprise the extent of a decision model. Modeling is the task of defining these variables and integrating them into one consistent model, as I will discuss. First let's define these three classes.

Decisions: actions, choices, and alternatives

By "decision" I mean a set of actions to choose from, identified by their tangible effects. Confusingly, the term gets bandied about; one may speak of "deciding on one's values", but that's not what I mean. I mean a commitment that affects money, people, and things of that sort. In the model, decisions refer to the variable or variables that are under your control. Identifying them is the first step in scoping the model---only those outcomes and values affected by one's decisions are relevant, and conversely, any alternatives that do not affect outcomes can be left out of the model, since they will have no effect.

Decision variables are inputs to the predictive model: Thinking that a model should "predict a decision" is a fundamental confusion. Since the decision is a variable under one's control it is an input, not an output of the model. Once I've mapped out how the parts of a decision model come together it will become clear exactly how decision variables fit in the complete model, and how this model, with prediction included, leads to recommended decisions---the policy. The important point is that one should start model formulation by asking "What are the variables under one's control?" and then move on to what can be modeled that is relevant.

Uncertainties and outcomes

The predictive model takes current known quantities and computes an estimate of outcomes. It is by the model's outcomes that values can be assigned, and the relative merits of decision alternatives imputed.

The danger is that the set of outcomes included does not cover all the considerations by which decisions need to be judged. Invariably there are tradeoffs to consider---as shown by the current attention paid to responsible and ethical AI. In business, at first glance it may seem we can just borrow the operation's current "Objectives and Key Results" (OKRs) as outcome variables. However, the value model needs to be comprehensive so that, for one, it captures concerns not apparent in an OKR. An OKR's purpose is simply to track progress toward a goal---and it is from the goal that the model outcomes are derived.

An outcome variable's uncertainty needs to be modeled as part of the prediction. It is the rare case where a prediction can be made with such accuracy that its implication is obvious, and any inaccuracy can be ignored. These cases do arise in modeling perception, of objects in images, or words in speech, but in business processes, the stochastic nature of the world must be taken into account. This is recognized in the widespread adoption of statistical methods in machine learning.

Model errors

Models are never clairvoyant. Machine learning models are statistical tools, and as a result, their predictions vary around the true value. But also, some processes are inherently stochastic---take for example network packet traffic, or customers arriving at a queue.

Consider making predictions for the "Newsvendor" problem---the single-period case of how to stock up on supplies before knowing the demand. As the name suggests, one needs to decide how many newspapers to order each day in anticipation of how many may actually be sold. A "newsvendor" predictor of the rate of demand may be entirely accurate, despite significant variation in the actual daily sales it models. Simply taking the model's advice and purchasing the most likely number of newspapers is rarely the best solution. One should hedge, for instance by stocking a few more when a missed sale costs more than an unsold paper. Exactly how many depends on knowing the variation around the model prediction, and the relative cost of overstocking versus understocking.
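To make the hedging argument concrete, here is a minimal sketch of the newsvendor calculation. It rests on assumptions of mine, not anything stated above: demand is Poisson with a known mean, unsold papers are worthless, and the prices and the helper names (`poisson_pmf`, `expected_profit`, `best_order`) are purely illustrative.

```python
import math


def poisson_pmf(k: int, rate: float) -> float:
    """P(demand == k) for Poisson demand, computed in log space to avoid overflow."""
    return math.exp(-rate + k * math.log(rate) - math.lgamma(k + 1))


def expected_profit(order: int, rate: float, price: float, cost: float,
                    max_demand: int = 150) -> float:
    """Expected profit of ordering `order` papers (assumes unsold papers are worthless)."""
    return sum(
        poisson_pmf(demand, rate) * (price * min(order, demand) - cost * order)
        for demand in range(max_demand + 1)
    )


def best_order(rate: float, price: float, cost: float, max_order: int = 150) -> int:
    """Order quantity with the highest expected profit, found by plain enumeration."""
    return max(range(max_order + 1),
               key=lambda q: expected_profit(q, rate, price, cost))


# Same demand forecast (mean of 50 papers a day), two different cost structures.
cheap_papers = best_order(rate=50, price=1.00, cost=0.25)
costly_papers = best_order(rate=50, price=1.00, cost=0.75)
print(cheap_papers, costly_papers)
```

With the same forecast, the cheap-paper case orders more than the mean demand and the costly-paper case orders fewer. The prediction did not change; only the value of the two kinds of error did.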

Since all models have errors, the fundamental design question is how to model both the extent and severity of the errors to come up with the best solution. It is only by knowing probabilities of the predictions together with values of outcomes that errors can be managed.

Probability

The calculus of uncertainty is probability, which offers a rich and complete toolset for manipulating and quantifying uncertainty. Uncertainty in values is expressed by probabilities of outcomes. Incidentally, these are not to be confused with "Confidences" as the term is used by statisticians. The field that makes probability a first-class concept is Bayesian statistics. There is a formal axiomatization of the probability calculus that enforces strict consistency among probabilities, both those generated statistically and those that express a person's judgment.

Any predicted value needs to come with a probability distribution: nothing in the real world is certain. Any ML method that does not associate a probability distribution over random variables to represent its error is incomplete. In decision making, uncertainty about outcomes plays a primary role as a basis for a decision. This is a challenge in machine learning, since many widely used tools lack probabilistic outputs.

III. The "Alignment Problem"

The so-called alignment problem, recently popularized by Brian Christian[1], arises when poorly specified goals for an AI system lead to unintended consequences. The apocryphal example[2] is an all-powerful AI system tasked to, say, make paperclips. Sounds silly, but the point is that such a system left to its own devices would cause rampant damage in other areas just to maximize its goal. Remember the story of the Sorcerer's Apprentice?

Such a system needs to incorporate human values. The alignment problem has been cast as an existential threat of AI. Theatrics aside, the grain of truth is that an autonomous AI system cannot succeed unless the objective it is maximizing is correct. This comes down to having a correct value model---even if the prediction model were completely accurate, the values it applies would still need to be aligned. You get exactly what you ask for, which is not necessarily what you want.

I propose that "alignment" needs to be considered among all three corresponding variable classes -- predictions, alternatives, and outcome variables. Models for each need to be aligned with the others for the decision model to work.

Now it's time to turn to the third set of variables---and how values are measured so they can be modeled.

IV. The value model

We define the value model as a function f(s) that maps outcomes into quantifiable values, using a common unit among all outcome variables. Trivially, if we know the price of different items, we know their value, and can choose among them. However, if the consequences of the choices are in the future, depend on a host of factors, and involve intangibles, then valuing these choices, and thus making a rational choice, is hard.

Familiar examples of value models are the economic models used in finance. Domains (e.g. "verticals") each have different models---in healthcare the value model may measure quality of life, for example. Despite the differences in domains, values modeling consists of a set of common principles that apply uniformly across them all.

Why is it necessary for Machine Learning?

In simple terms, a machine learning (ML) model predicts an event (Did the customer churn?) or a quantity (What will be the final score?). The model prediction is not the model value; these terms need to be distinguished carefully. As described, the prediction of the model is an uncertain variable. The value model associates preferences with predicted outcomes---what is desirable---and hence needs to be combined with the prediction. So there is both an ML model and a value or utility model, and the two are combined in the overall decision model.

What does the value function have to do with how decisions are modeled?

Clarifying the model's value outcomes also helps clarify which decision alternatives to include in the model. An example may make this clear. The recent fad in Sales Opportunity Scoring demonstrated that it's possible to predict with high accuracy the probability that a sale will close or not. From the salesperson's point of view this is a "so what?" Does a high probability mean the salesperson should put more effort into it, or that it can be left to itself and effort is best applied to lower-probability opportunities? The fact that it will eventually close is not her concern; her concern is what is best to do given the likely outcome. The model is useful only if it comprises actions that are available to the decision maker. Once it is clear how the range of actions affects the value of different outcomes, we see how the value model needs to be a function of both outcomes and actions, whose purpose is to match the best action to each outcome.

Overall, by identifying who makes the decision, we enforce consistency between the value function and the decision. Whose values are to be included in the model? If, for instance, it happens that a root cause analysis tool is more accurate, but increases the time and effort required of the troubleshooting engineer, have we truly addressed the engineer's needs?

So how does value modeling combine with modeling uncertainty?

So there is this inescapable duality between probability and value. As mentioned, every predicted outcome has two aspects, a predicted probability of occurrence and a value, which could be either positive or negative. Together they form a theory of measurement or estimation, although the terms are sometimes confused--"value theory" being also a term for the combination.

How does one make a choice under uncertainty? To make choices we need a way to reduce a distribution of values to an equivalent single number. So we need a concept of expected value, or a certain equivalent. We borrow the notion of expectation from probability theory---which is simply the sum, over all outcomes, of probability times value. If all outcomes were equally probable, then it reduces to just the average value.

Supervised learning has a perverse tendency to confuse this by trying to predict the value to be obtained, i.e. predicting expected values directly. This muddled thinking leads to confusion in the model formulation. Here's a fairly contrived example that makes the point---predicting water consumption. Future uncertainty depends on weather conditions. But variation due to policy choices---who gets water for what---reflects values, which are not determined by the weather.

Creating the value model

Value modeling is its own field. How do you assign a number to an outcome? An everyday example of reducing seeming immeasurables to a certain value occurs whenever we set a price. In one instance, a negotiation for purchase is a process by which two parties come to agreement on a certain number. Another case is when prices are set in a competitive market. There, no one person affects the price, but the magic of market clearing generates a single price.

One method is to elicit values by a structured interview technique applied to the consequences of the model's decision. One posits hypothetically the two most extreme outcomes, then calibrates the value of intermediate outcomes by eliciting the probability from an expert that makes them equivalent in value to the extremes. This is especially useful with "intangible" outcomes---those for which there are no objective measures.
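As a rough sketch of how the answers from such an interview turn into numbers: if the expert says an intermediate outcome is worth the same as a lottery that gives the best outcome with probability p and the worst otherwise, then its value is the expected value of that lottery. The outcome names, the elicited probabilities, and the function name below are hypothetical, chosen only for illustration.

```python
def value_from_indifference(p_best: float, v_worst: float = 0.0,
                            v_best: float = 1.0) -> float:
    """Value of an outcome judged equivalent to a lottery paying the best
    outcome with probability p_best and the worst outcome otherwise."""
    if not 0.0 <= p_best <= 1.0:
        raise ValueError("p_best must be a probability between 0 and 1")
    return p_best * v_best + (1.0 - p_best) * v_worst


# Hypothetical interview answers: the indifference probability the expert
# gives for each intermediate outcome, between "no outage" (best, value 1)
# and "week-long outage" (worst, value 0).
elicited = {"minor glitch": 0.9, "degraded service": 0.6, "one-day outage": 0.2}
values = {name: value_from_indifference(p) for name, p in elicited.items()}

for name, value in values.items():
    print(f"{name}: {value:.2f}")
```

Anchoring the extremes at 0 and 1 is a convenience; any two anchor values give the same ordering and the same decisions.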

An example of eliciting the value of an intangible outcome could be evaluating the cost of spurious alerts---the cost of "alert fatigue" by, say, comparing it to other work nuisances that are more tangible: "If you had to stay late because of a spurious alert, what is the cost of a spurious alert in equivalent minutes of staying late?" One can imagine creative approaches depending on the circumstances for finding a common unit of value among different outcomes.

What if there are multiple outcomes, each valued differently? For example, I can keep sick patients in the hospital longer to assure their cure, but at the risk of running out of hospital beds if hospital admissions increase. Perhaps the hardest part of building a value model is the necessity to weigh the tradeoff between competing outcomes, such as by coming up with a weighting that reduces multiple values to a common scale. In summary, elicitation interview techniques can resolve this.

Pairwise comparisons to map out preferences

These elicitation methods can be extended by posing pairs of hypothetical outcomes for the decision-maker to rank. This is a versatile tool that can be used to

order possible outcomes to create a single function from them,
come up with a relative weighting for otherwise incommensurable quantities,
quantify the value of intangible quantities,
compare the relative value based on the time when an outcome occurs,
determine how the level of risk affects value.
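A very simple way to turn pairwise answers into an ordering is to count how often each outcome is preferred. This is only a sketch of mine, not a method described above: the win-count rule assumes the decision-maker's answers are reasonably consistent, and the outcome labels are hypothetical. More principled preference models exist (the Bradley-Terry model is one), but the counting version shows the idea.

```python
from collections import Counter


def rank_by_wins(comparisons):
    """Order outcomes by how many pairwise comparisons they win.

    `comparisons` is a list of (preferred, other) pairs.
    Ties are broken alphabetically so the result is deterministic.
    """
    wins = Counter()
    for preferred, other in comparisons:
        wins[preferred] += 1
        wins[other] += 0  # register outcomes that never win
    return sorted(wins, key=lambda outcome: (-wins[outcome], outcome))


# Hypothetical answers from a decision-maker comparing growth options.
answers = [
    ("gain market share", "hold current revenue"),
    ("gain market share", "cut prices further"),
    ("hold current revenue", "cut prices further"),
]
ranking = rank_by_wins(answers)
print(ranking)
```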

For example, imagine valuing sales growth in different geographic areas, or for different new product introductions. One could assess the relative value of future growth rates versus current revenue. What is it worth to "buy" market share by pricing aggressively, compared to the loss of current revenue? The key point is that it is more important to include all factors that determine value, including "intangibles" that cannot be measured with high accuracy and require judgment, than to measure each one precisely. Better a model that is inclusive than one that avoids important factors on the presumption that they are too hard or too subjective to measure. Doug Hubbard's book argues this point.[3]

Errors in judgment

Since people don't always say, or even know, what their true preferences are, the human aspect of how values and probability are perceived plays an important role in elicitation techniques. Methods to de-bias responses have been studied in behavioral psychology under the term prospect theory, most notably by Kahneman and Tversky. This is worth its own article.

Utility models of time and risk preference

We use "utility model" specifically to express how time and risk preference affect values. If you're familiar with the economic literature, the terms "utility model" and "value model" tend to be used interchangeably there; however, we use utility model U() specifically as an additional function to which the value is passed to express time and risk preferences. Hence we can write V(sᵢ) = U(f(sᵢ)) to distinguish the parts of the model that express risk tolerance and time preference. Risk tolerance and time preference modify the value computed by the value model, and their effect is expressed as a discount factor. Even though both contribute to a discount factor, risk and time preferences are different phenomena. It suffices to say both can be elicited with the same techniques used for other preferences; the details deserve a separate article.
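Here is one possible shape for V(s) = U(f(s)), offered as a sketch rather than a recommendation. The choices are my assumptions: f() is net dollars, U() applies a fixed annual discount rate and an exponential risk-averse utility with a single "risk tolerance" parameter, and U() takes the time of the outcome as a second argument so that time preference has something to act on. All the numbers are made up.

```python
import math

RISK_TOLERANCE = 1000.0  # assumed, in dollars; larger means closer to risk-neutral
ANNUAL_DISCOUNT = 0.05   # assumed time preference


def f(outcome: dict) -> float:
    """Value model: maps an outcome to a common unit (net dollars here)."""
    return outcome["revenue"] - outcome["cost"]


def U(value: float, years: float = 0.0) -> float:
    """Utility model: discount for time, then apply exponential risk aversion."""
    present_value = value / (1.0 + ANNUAL_DISCOUNT) ** years
    return 1.0 - math.exp(-present_value / RISK_TOLERANCE)


def V(outcome: dict) -> float:
    """Combined model V(s) = U(f(s))."""
    return U(f(outcome), outcome.get("years", 0.0))


def certain_equivalent(expected_utility: float) -> float:
    """Sure dollar amount (today) with the same utility as the gamble."""
    return -RISK_TOLERANCE * math.log(1.0 - expected_utility)


# A 50/50 gamble between netting $1000 and netting nothing.
gamble = [(0.5, {"revenue": 1000.0, "cost": 0.0}),
          (0.5, {"revenue": 0.0, "cost": 0.0})]
expected_utility = sum(p * V(s) for p, s in gamble)
ce = certain_equivalent(expected_utility)
print(f"certain equivalent: ${ce:.2f} (expected dollars: $500.00)")
```

A risk-averse decision-maker values this gamble at less than its $500 expected payoff; the gap is the price of the risk. The same $1000 received five years out is also worth less than $1000 today, which is the time-preference part.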

Can the value model be learned?

In principle, the choices a person makes reveal their value model. Arguably, revealed value models avoid biases that arise when asking people for their values. This is still an area of active research, often identified with inverse reinforcement learning (sometimes loosely called reverse reinforcement learning) and the challenges that such a theory implies.

In summary, by extending the economic concept of value with the methods of elicitation just mentioned, we have a general approach for cases where conventional models of economic value, e.g. comparing dollars against dollars, are not adequate.

V. Computing expected value from a value function

Once the value model is in hand we can borrow the computation of "expected value" from Economics, which combines a value function V(sᵢ) and its probability distribution P(sᵢ) over the set of outcomes sᵢ. This is a way to reduce an uncertain value to a single number that can be used in comparisons.

The expected value for a discrete set of outcomes sᵢ, each with predicted probability P(sᵢ), is

$$E[V] = \sum_i P(s_i)\, V(s_i)$$

To choose between two sets of outcomes, one described by the distribution P(sᵢ), the other by P′(sᵢ), we should prefer the one with the greater expected value. Not just any percentages or fractions will do as probabilities; the comparison requires accurate probabilities from our model---the term is "calibrated." But we can go a step further: if both the probabilities and the values of a set of outcomes differ, then their expected values still make the comparison. How might they differ? As a function of the decision, of course.
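The comparison is easy to sketch in code. The outcomes, values, and probabilities below are invented for illustration, and `expected_value` is just a helper name of mine.

```python
def expected_value(probabilities: dict, values: dict) -> float:
    """Sum of P(s) * V(s) over all outcomes s."""
    if abs(sum(probabilities.values()) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(p * values[outcome] for outcome, p in probabilities.items())


# Hypothetical value function over two outcomes, in dollars.
V = {"sale closes": 1000.0, "sale lost": -50.0}

# Two alternatives, each leading to a different distribution over outcomes.
P = {"sale closes": 0.30, "sale lost": 0.70}
P_prime = {"sale closes": 0.40, "sale lost": 0.60}

ev = expected_value(P, V)
ev_prime = expected_value(P_prime, V)
preferred = "P'" if ev_prime > ev else "P"
print(ev, ev_prime, preferred)
```

If the probabilities fed into this comparison are not calibrated, the ranking of the two alternatives can flip, which is why calibration matters more here than raw classification accuracy.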

Making value tradeoffs is an optimization problem

To pull these various themes together, the combination of predictive and value models generates an expected value, by which each alternative can be compared so that all aspects that need to be considered are included in making a choice; in other words, in a decision model. The various components of the decision model are interpretable, so the effects of values, their tradeoffs, uncertainties, risk, and time preference are explainable.

A graphic illustration of a decision model in the case of a binary outcome is the application of ROC curves to find the optimal tradeoff between true positive and false positive rates. The ROC curve is the probability model. The value model is the tangent line whose slope represents the tradeoff, and the point where it touches the ROC curve sets the threshold that determines the value-maximizing decision policy.
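A numerical stand-in for the tangent-line construction is to try every candidate threshold on a labeled sample and keep the one with the highest total value. The sketch below does that. The scores, labels, and payoffs are made up, and the function name is mine; it enumerates thresholds rather than literally finding the tangent point.

```python
def best_threshold(scores, labels, v_tp, v_fp, v_fn=0.0, v_tn=0.0):
    """Return (threshold, total value) for the rule "act when score >= threshold".

    Every observed score is tried as a threshold, plus +inf ("never act").
    """
    best = None
    for threshold in sorted(set(scores)) + [float("inf")]:
        total = 0.0
        for score, label in zip(scores, labels):
            act = score >= threshold
            if act and label:
                total += v_tp      # true positive
            elif act:
                total += v_fp      # false positive
            elif label:
                total += v_fn      # false negative
            else:
                total += v_tn      # true negative
        if best is None or total > best[1]:
            best = (threshold, total)
    return best


# Hypothetical model scores and true labels.
scores = [0.1, 0.3, 0.4, 0.6, 0.8, 0.9]
labels = [0, 0, 1, 0, 1, 1]

mild = best_threshold(scores, labels, v_tp=10, v_fp=-4)     # false alarms are cheap
harsh = best_threshold(scores, labels, v_tp=10, v_fp=-25)   # false alarms are costly
print(mild, harsh)
```

The classifier and its ROC curve are identical in both runs. Only the value model changed, and with it the operating point: the threshold moves from 0.4 to 0.8 as false positives become more expensive.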

The ROC example can be generalized mathematically to the optimization of any decision model, where there may be multi-valued or continuous outcomes. These are solvable optimization problems constructed from two component models---of probability and value.

A Decision Model as a Bayes Network (or two)

Mathematically, the model for making a decision d against an uncertainty s can be written as an optimization (here written as maximization),

$$\max_{d} \sum_{s} P(s)\, V(d, s)$$

or equivalently, as this Bayes network. This may make it obvious if the math isn't. Decision nodes are shown as rectangles, probabilities (random variables) as ovals, and value nodes as diamonds. Arrows show dependencies. With discrete-valued variables, the math here is simple; just enumerate the expected value for all alternatives and choose the best.

This case is not very interesting: Where does the predictive model come in, you ask? A predictive model inputs a list of features, some of which may be known when making the decision. Thus we designate a subset of features as the information i known when the decision is made.

Introducing the predictive model, we condition the prediction and the decision on the information features:

$$\max_{d(i)} \sum_{s} P(s \mid i)\, V(d(i), s)$$

where V(d,s) is the value function, d(i) is the decision policy as a function of the observed features (think a lookup table) and P(s | i) is the model prediction of outcome s depending on the features i. The machine learning model is embedded in the decision model. This equation applies generally for any data science application.

To show this in a Bayes network we add the observed features as a node that conditions the random variable on which the value depends, as shown here:

"Solving" the decision model means finding the best choices---the policy function d(i)---that maximizes this expression, given P(s | i) and V(d,s): the prediction and value models. Written out, the equation says to take expectation over the predictive distribution of the ML model, conditional on the observed features, then for each combination of features make the choice with the highest expected value. In the fortunate case where the set of alternatives and observables are small enough to count on one hand, one can enumerate over all values to fill in entries in the policy table.

Similarly the just mentioned ROC curve example works when d is binary and s can be partitioned in two, which just reduces it to the discrete case.
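For the small discrete case, solving the decision model is a few lines of enumeration. The sketch below fills in the policy table d(i) by picking, for each value of the observed feature i, the decision with the highest expected value under P(s | i). The alert-triage scenario and all the numbers are hypothetical; in practice P(s | i) would come from the trained ML model and V(d, s) from the elicited value model.

```python
def solve_policy(P, V, decisions):
    """Policy table d(i): for each observation i, the decision that maximizes
    the expected value sum over s of P(s | i) * V(d, s).

    P maps i -> {s: probability}; V maps (d, s) -> value.
    """
    policy = {}
    for i, distribution in P.items():
        policy[i] = max(
            decisions,
            key=lambda d: sum(p * V[(d, s)] for s, p in distribution.items()),
        )
    return policy


# Hypothetical predictive model: probability an alert is real, given its score bucket.
P = {
    "high score": {"real incident": 0.60, "spurious": 0.40},
    "low score": {"real incident": 0.02, "spurious": 0.98},
}

# Hypothetical value model over (decision, outcome) pairs, in a common unit.
V = {
    ("page engineer", "real incident"): 100.0,
    ("page engineer", "spurious"): -20.0,   # cost of alert fatigue
    ("ignore", "real incident"): -500.0,
    ("ignore", "spurious"): 0.0,
}

policy = solve_policy(P, V, decisions=["page engineer", "ignore"])
print(policy)
```

Note that the prediction alone never says "page" or "ignore". The policy only appears once the predictive distribution and the value table are combined.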

Extensions: Economic Models

The equation for the decision model is for just a single decision-maker with one decision. What about multiple decisions and multiple decision-makers? Decision models are not limited to a single decision. They could be embellished with secondary decisions, to form a sequence, to collect more information, include other alternatives, and so on. The optimization and Economics literature are replete with more extensive models that extend this basic case:

There is a sequence of decisions, where earlier decisions may affect the information available at subsequent ones.
The decision-maker must work within resource constraints---"budgets"---that create dependencies among decisions, in addition to those created by the value function.
There are multiple decision-makers, either with a common value function, or with competing values.
Or some combination of the three.

VI. What can one hope to achieve by doing this?

The Data Scientist (should we say "Decision Scientist"?) needs to own the model formulation. The reason many Data Science models fail is simply that their decision model is wrong.
Data Science ethical concerns and model fairness fall entirely within value modeling. Conversely, some conundrums in ethics are best resolved by making the valuing of the tradeoffs explicit. Fairness is something that can be modeled.
Despite the universality of decision modeling, there are widely pursued areas of Data Science where the value analysis and the optimization problem are already well codified and implicit. In any area where A/B testing is widespread, think product recommendations or advertising, one assumes that the decision problem is settled, and that improvements can best be found simply by reducing the uncertainty around current predictions. However, when A/B testing fails to deliver results, it's likely that the decision model is wrong.

Additional thoughts

Rational choice as a founding principle---that one's actions should be driven by desired outcomes---has been a long-standing theme in AI. Decision modeling reduces this to practice. But for this to have effect, one also needs to drive adoption, and this means generating trust in the model. Visibility into the components of the model improves explainability, which helps with this.
These ideas are neither novel nor obscure. They have been around for half a century, in the economics, management science, and the behavioral psychology literature. Bad imitations of them reappear with depressing regularity in the popular business literature.
What about using AI to learn the value model, similarly to how the predictive model is learned? This comes under the topic of "inverse reinforcement learning" (sometimes loosely called reverse reinforcement learning), and is the basis of a proposal to humanize AI in Stuart Russell's well-regarded book on the future of AI, "Human Compatible".[4]
Decision modeling implies the data scientist needs to get involved in operational aspects of the business---yet another competency demanded of the data scientist! These "transformative" opportunities go beyond just looking for cost savings that can be obtained by introducing new tools and software, to disruptive (in a good way) value opportunities that affect the business model, but require intensive investments in analysis over much longer time-spans than a typical ML modeling project.

[1] Brian Christian, "The Alignment Problem" (2020) W. W. Norton.

[2] This example is attributed to Nick Bostrom, "Superintelligence", (2014) Oxford University Press.

[3] Douglas Hubbard, "How to Measure Anything: Finding the Value of Intangibles in Business" (2014) Wiley.

[4] Stuart Russell, "Human Compatible: Artificial Intelligence and the Problem of Control" (2019) Viking.

A copy of this posting is available at medium.com
