Your code may be right, but lack of transparency still hurts!

Suppose you're given a business problem that asks you to predict the outcome of a business process, such as:

"Whether an order will be fulfilled on time or not"

In other words, you're given an order with an order-date and an expected-delivery-date, and you need to predict whether the order will be delivered by its expected-delivery-date and, if possible, predict the actual-delivery-date.

Given the wording and nature of the problem statement, and what we have been reading and learning about predictive analysis these days, most of us tend to think along the following lines:

Solution to this problem involves coming up with a prediction, which seems to be a machine learning problem. And to solve a machine learning problem, we need to design model(s) and we need data to train the models.
Various regression and machine learning techniques are available to solve a predictive analysis problem.
The most popular tech stacks for implementing regression and machine learning solutions are based on R and Python. We may not have worked with these technologies before, but we must skill up on them quickly for two reasons:
- It is hard to implement foundational frameworks from scratch in your preferred technology (such as Java) while you are still learning both the basic concepts and how to apply them.
- It is hard to explain to business stakeholders why your engineering team can't spend the effort to skill up on the latest in-vogue frameworks or technologies, without running the risk of being replaced by those who already have that experience.

Even if you are not working on (or going to work on) a machine learning business problem, you are probably considering (if you haven't already) enrolling in training courses or certifications on machine learning. Training is usually in either R or Python, not necessarily in the technologies or platforms you have worked with before.

If you have delivered (or have simply worked on) various enterprise-grade solutions before, you may also be wondering about

Are these frameworks scalable and reliable enough?
Can you integrate them with existing enterprise solutions seamlessly?
Can you test whether what you deliver with these frameworks is of high quality and accuracy?

And most importantly, after you have delivered a solution

Can you maintain your source code (or models)? How (and on what - source code or a model) do you use version control here?
And finally, can you explain to the end user how your solution has reached a specific decision?

Now, suppose a business user is offered two solutions to the business problem stated above.

First Solution

The first solution claims that its output is highly accurate, but cannot explain how it has reached that specific output. For example, its output is along the lines of

Order Id	Probability/Confidence Score	Is On Time?
O1	0.7	Yes
O2	0.2	No
O3	0.3	No
O4	0.8	Yes

Presumably, the product team found the models release-worthy after testing and fine-tuning them against training and test data. However, business users face a dilemma that the output shown above leaves unaddressed:

How is this specific confidence score value calculated? How can I blindly trust the output coming out of this black-box?

If I don't know how the value is calculated, what kind of feedback can I provide (other than prediction is right or wrong) for the improvement of the system?

Let's understand why it is important for business users to know how a confidence score is calculated and what it signifies:

Business users may have to use the output of your system to take decisions such as how much to invest, in which area/region/category, with whom to start or continue a partnership, where to promote, etc.
Unless your business users own the enterprise to which you have sold this solution/platform, they may have to justify the decisions they take based on your system's recommendations.
Since your system makes recommendations based only on data already in the system, your models can't factor in scenarios whose data is not captured there, such as past weather, past events, or recommendations made to suppliers in the past.
A bad decision based on a bad recommendation may ruin someone's business. Your client may ask you to pay for such decisions by including certain clauses in the contract/statement of work, thereby increasing your legal exposure.

Please remember that recommendations made by your system will affect future values of the "Is On Time" column, since your client may act on them offline, for example by breaking up a large order into smaller ones. This means a bad recommendation from your system may lead to bad recommendations from your client to their suppliers, deepening their distrust of your black-box system.

So, even if your client is willing to consider your proposal knowing the explainability limitations of your system, they may still have a few doubts, such as:

How do I believe you have enough data to come to any conclusion? A confidence score by itself doesn't reveal which data points the system considered.
When should I be able to discard/ignore your recommendations?
If I tweak the models delivered by you to reach higher accuracy in one region/category, does it guarantee the same accuracy across all regions/categories?

Second Solution

Now imagine the second solution, which offers high explainability, potentially at the cost of some accuracy. It tells you how a confidence score is calculated, for example:

Order Id	Probability/Confidence Score	Is On Time?	Notes
O1	0.9	Yes	+0.4 10 of the last 10 orders delivered on time; +0.2 difference between the expected delivery date and the actual delivery date is moderate; +0.3 overall on-time delivery ratio is 75%
O2	0.2	No	+0.1 5 of the last 10 orders delivered on time; -0.2 difference between the expected delivery date and the actual delivery date is high negative; +0.3 overall on-time delivery ratio is 75%
O3	0.3	No	+0.2 5 of the last 10 orders delivered on time; -0.1 difference between the expected delivery date and the actual delivery date is low negative; +0.2 overall on-time delivery ratio is 50%
O4	0.8	Yes	+0.4 8 of the last 10 orders delivered on time; +0.2 difference between the expected delivery date and the actual delivery date is moderate; +0.2 overall on-time delivery ratio is 60%

Or something like

Order Id	Probability/Confidence Score	Is On Time?	Factors that affect positively	Factors that affect negatively
O1	0.9	Yes	10 of the last 10 orders delivered on time; difference between the expected delivery date and the actual delivery date is moderate; overall on-time delivery ratio is 75%	(none)
O2	0.2	No	5 of the last 10 orders delivered on time; overall on-time delivery ratio is 75%	Difference between the expected delivery date and the actual delivery date is high negative
O3	0.3	No	5 of the last 10 orders delivered on time; overall on-time delivery ratio is 50%	Difference between the expected delivery date and the actual delivery date is low negative
O4	0.8	Yes	8 of the last 10 orders delivered on time; difference between the expected delivery date and the actual delivery date is moderate; overall on-time delivery ratio is 60%	(none)

Can such a system be implemented?

The answer is yes: if the number and type of explanations/messages are finite, it can be implemented programmatically.

The following example shows how it can be implemented even with a tech stack you already have expertise in.

For the sake of illustration, in the example shown above, the confidence score is a simple sum of the values of these three parameters, each derived from historical data:

Performance in the last 10 orders - PTen
Difference between the expected delivery date and actual delivery date - Diff
Overall on-time-delivery ratio - POverall
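
As a minimal sketch of the simple-sum version, here is how the score and its "Notes" text could be produced together. Python is used only for brevity; the same logic ports directly to Java or any other stack. The contribution values are copied from row O1 of the table, while the `Contribution` class, the `score_order` function and the 0.5 cut-off for "Is On Time?" are assumptions made for illustration, not something the example above specifies.

```python
from dataclasses import dataclass

# Assumed cut-off: the article does not say how a score maps to Yes/No.
ON_TIME_THRESHOLD = 0.5


@dataclass(frozen=True)
class Contribution:
    parameter: str  # e.g. "PTen"
    value: float    # pre-calculated contribution to the score
    message: str    # human-readable text for the Notes column


def score_order(contributions):
    """Return (score, is_on_time, notes) where score is a plain sum."""
    score = round(sum(c.value for c in contributions), 2)
    notes = "; ".join(f"{c.value:+.1f} {c.message}" for c in contributions)
    return score, score >= ON_TIME_THRESHOLD, notes


# Row O1 from the table above.
o1 = [
    Contribution("PTen", 0.4, "10 of the last 10 orders delivered on time"),
    Contribution("Diff", 0.2, "difference between expected and actual delivery date is moderate"),
    Contribution("POverall", 0.3, "overall on-time delivery ratio is 75%"),
]

score, is_on_time, notes = score_order(o1)
print(score, "Yes" if is_on_time else "No")
print(notes)
```

Because every message is tied to one parameter, the set of possible explanations stays finite, which is what makes the Notes column practical to generate.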

Dozens of such parameters can be pre-calculated, and depending on the region, more (or less) weightage can be given to some of them, for example:

Confidence score = (50% × PTen) + (25% × Diff) + (25% × POverall)
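
A rough sketch of that weighted formula, with an optional per-region override, might look like the following. The default weights come from the formula above; the "APAC" region, its weights and the sample parameter values are made up for illustration, and the parameters are assumed to be already normalised to a common scale.

```python
# Default weights taken from the formula above.
DEFAULT_WEIGHTS = {"PTen": 0.50, "Diff": 0.25, "POverall": 0.25}

# Hypothetical per-region overrides; the region name and numbers are made up.
REGION_WEIGHTS = {
    "APAC": {"PTen": 0.40, "Diff": 0.40, "POverall": 0.20},
}


def weights_for(region=None):
    """Fall back to the default weights when a region has no override."""
    return REGION_WEIGHTS.get(region, DEFAULT_WEIGHTS)


def confidence_score(params, region=None):
    """Weighted sum of pre-calculated parameter values."""
    weights = weights_for(region)
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must add up to 100%")
    return round(sum(weights[name] * params[name] for name in weights), 4)


# Hypothetical pre-calculated values for one supplier.
params = {"PTen": 0.8, "Diff": -0.2, "POverall": 0.6}

default_score = confidence_score(params)
apac_score = confidence_score(params, region="APAC")
print(default_score, apac_score)
```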

More complex rules can be created, like:

If the PTen ratio is 1, there is no need to consider the rest of the calculations.
If Diff is high negative, increase its weightage and lower the weightage of the other parameters, etc.
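
The two rules could be layered on top of the weighted sum roughly as follows. This is only a sketch: the threshold for "high negative", the boosted weight of 50%, the proportional rescaling of the other weights and the choice to return 1.0 when PTen is 1 are all assumptions, since the rules above do not pin those details down.

```python
BASE_WEIGHTS = {"PTen": 0.50, "Diff": 0.25, "POverall": 0.25}

HIGH_NEGATIVE_DIFF = -0.5    # assumed threshold for "high negative"
BOOSTED_DIFF_WEIGHT = 0.50   # assumed weight for Diff once the rule fires


def adjust_weights(params, weights=BASE_WEIGHTS):
    """Rule 2: boost Diff when it is high negative, shrink the others."""
    if params["Diff"] > HIGH_NEGATIVE_DIFF:
        return dict(weights)
    remaining = 1.0 - BOOSTED_DIFF_WEIGHT
    others_total = sum(w for name, w in weights.items() if name != "Diff")
    adjusted = {
        name: w * remaining / others_total
        for name, w in weights.items()
        if name != "Diff"
    }
    adjusted["Diff"] = BOOSTED_DIFF_WEIGHT
    return adjusted


def confidence_score(params):
    """Return (score, reason) so the applied rule stays visible."""
    # Rule 1: a perfect record in the last 10 orders short-circuits the rest.
    if params["PTen"] >= 1.0:
        return 1.0, "PTen is 1, remaining parameters skipped"
    weights = adjust_weights(params)
    score = round(sum(weights[name] * params[name] for name in weights), 4)
    rule = "Diff weight boosted" if weights != BASE_WEIGHTS else "base weights"
    return score, rule


print(confidence_score({"PTen": 1.0, "Diff": -0.9, "POverall": 0.1}))
print(confidence_score({"PTen": 0.5, "Diff": -0.8, "POverall": 0.5}))
print(confidence_score({"PTen": 0.8, "Diff": -0.2, "POverall": 0.6}))
```

Returning the reason alongside the score keeps the rule that fired available for the explanation text.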

More parameters (greater in number and complexity) can be introduced into the equation without impacting the output of existing parameters, such as:

A positive or negative relationship between different attributes, such as the quantity ordered and the on-time-delivery ratio over a period of time, expressed on a scale of -5 to +5.
The performance of different attributes of Supplier, SKU and Order can be approximated to a number against various attributes over a period of time, to help in scenarios such as the introduction of a new supplier who doesn't have a history in your system.
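
One possible way to approximate a relationship between two attributes as a number from -5 to +5 is to scale a correlation coefficient. The sketch below uses the Pearson correlation, which is an assumption on my part (the text above does not name a method), and the sample quantities and ratios are invented.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 when either series is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def relationship_score(xs, ys):
    """Scale the correlation from [-1, 1] to an integer in [-5, 5]."""
    return round(5 * pearson(xs, ys))


# Invented monthly data: larger orders coincide with worse on-time ratios.
quantity_ordered = [100, 200, 300, 400, 500]
on_time_ratio = [0.9, 0.8, 0.7, 0.6, 0.5]

print(relationship_score(quantity_ordered, on_time_ratio))
```

The resulting integer is just another pre-calculated parameter, so it can enter the equation with its own weight without touching the existing ones.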

How does this approach help your system?

Observe that by pre-calculating the parameter values for all the suppliers, categories and regions, and reducing the calculation to a simple algebraic equation, you have taken the data out of your model and made your system scalable and extensible, since:

Pre-calculation of a parameter value is agnostic of the weightage it is given in the final equation that calculates the confidence score.
Since you approximate each relevant parameter (anything that may affect the likelihood of your supplier fulfilling the order on time) to a number, your model is now just an algebraic equation on top of already derived values. You can not only express this model in XML/JSON but also store it in a database. You can create different versions of it and compare their performance across regions and categories. You can analyze why certain factors/parameters are more important in those regions and categories.
You can do much faster what-if analysis by comparing different models, since the values they require are already pre-calculated.
Since the final confidence score can be calculated much faster, you can expose a live REST service that your client can call before even submitting the order to their supplier.
This solution can be implemented using the tech stack you already have expertise in. It can be implemented as a set of simple batch processes which query the database, pre-calculate the values in off-hours and save them back to the database, so that the final confidence score can be calculated on the go.
Explanations can easily be made multilingual, since all you need to do is attach a parameterized text to a condition on the approximated pre-calculated value.
Finally, you can test and maintain your code, since you now know exactly how the system is supposed to behave, even with test data. The responsibility of building the models can be delegated to actual data scientists. Your deliverables will simply include out-of-the-box queries for pre-calculated values, tools to design more such queries and equations, and tools to design/test different models, apart from a system to ingest data and deliver templated reports.
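
To make the "model as data" idea concrete, here is a small sketch in which two versions of the equation are expressed as JSON and evaluated against values that a batch job would already have stored. The JSON layout, the version names, the supplier/region keys and every number are hypothetical; an in-memory dictionary stands in for the database table.

```python
import json

# Two hypothetical model versions, as they might be stored in a database column.
MODELS_JSON = """
[
  {"version": "v1", "weights": {"PTen": 0.50, "Diff": 0.25, "POverall": 0.25}},
  {"version": "v2", "weights": {"PTen": 0.25, "Diff": 0.50, "POverall": 0.25}}
]
"""

# Stand-in for the table an off-hours batch job would fill; values are made up.
PRECALCULATED = {
    ("S1", "EU"): {"PTen": 0.8, "Diff": -0.4, "POverall": 0.6},
    ("S2", "EU"): {"PTen": 0.4, "Diff": 0.2, "POverall": 0.5},
}


def evaluate(model, params):
    """Apply one model version to one row of pre-calculated values."""
    return round(sum(w * params[name] for name, w in model["weights"].items()), 4)


models = json.loads(MODELS_JSON)
for model in models:
    for key, params in PRECALCULATED.items():
        print(model["version"], key, evaluate(model, params))
```

Since a model is only a handful of weights, versioning it, diffing two versions or evaluating it inside a request handler involves no training step at all.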

Does this approach ease up your client’s dilemma?

Let's also analyze how the second solution affects your client's dilemma.

How do I believe you have enough data to come to any conclusion?

Since the system has explained how the confidence score is calculated, the client can see which data points have been considered and, more importantly, which have not. The client can share those extra data points if they have them, or add more derived information into the equation.

When should I be able to discard/ignore your recommendations?

"Notes" can explain the difference between two confidence scores of 0, since a 0 can result from two parameter values offsetting each other, and also from your system having no history whatsoever for a certain region/category.

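A tiny sketch of that distinction follows. The `explain` function, the message wording and the sample contributions are illustrative assumptions; the point is only that the same score of 0 carries two very different notes.

```python
def explain(contributions):
    """contributions: list of (value, message) pairs; empty means no history."""
    if not contributions:
        return 0.0, "No order history for this region/category; treat the score as unknown"
    score = round(sum(value for value, _ in contributions), 2)
    notes = "; ".join(f"{value:+.1f} {message}" for value, message in contributions)
    return score, notes


# Case 1: two parameters cancel each other out.
offsetting = [
    (0.3, "overall on-time delivery ratio is 75%"),
    (-0.3, "difference between expected and actual delivery date is high negative"),
]

# Case 2: the system has never seen this region/category.
no_history = []

score_a, notes_a = explain(offsetting)
score_b, notes_b = explain(no_history)
print(score_a, notes_a)
print(score_b, notes_b)
```

In the first case the client can weigh the two factors themselves; in the second they know the recommendation can safely be ignored.
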
If I tweak the models delivered by you to reach higher accuracy in one region/category, does it guarantee the same accuracy across all regions/categories?

Still no, since you cannot claim that one equation will fit all regions/categories. But you can do faster what-if analysis of different models against different regions/categories and tell the system which equation is better suited to which regions/categories.
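
Such a what-if comparison could be sketched as below: each candidate equation is replayed against past orders per region, and the one with the best hit rate is selected for that region. The region names, model versions, historical rows and the 0.5 threshold are all invented for illustration.

```python
# Hypothetical candidate equations.
MODELS = {
    "v1": {"PTen": 0.50, "Diff": 0.25, "POverall": 0.25},
    "v2": {"PTen": 0.25, "Diff": 0.50, "POverall": 0.25},
}

THRESHOLD = 0.5  # assumed cut-off between "on time" and "late"

# Invented history: (pre-calculated parameters, was the order actually on time?)
HISTORY = {
    "North": [
        ({"PTen": 1.0, "Diff": -0.6, "POverall": 0.8}, True),
        ({"PTen": 0.2, "Diff": 0.4, "POverall": 0.4}, False),
    ],
    "South": [
        ({"PTen": 0.2, "Diff": 1.0, "POverall": 0.4}, True),
        ({"PTen": 1.0, "Diff": -0.8, "POverall": 0.6}, False),
    ],
}


def predict(weights, params):
    """True when the weighted score reaches the on-time threshold."""
    return sum(w * params[name] for name, w in weights.items()) >= THRESHOLD


def accuracy(weights, rows):
    """Share of historical orders the equation would have predicted correctly."""
    hits = sum(predict(weights, params) == actual for params, actual in rows)
    return hits / len(rows)


def best_model_per_region(models, history):
    return {
        region: max(models, key=lambda name: accuracy(models[name], rows))
        for region, rows in history.items()
    }


best = best_model_per_region(MODELS, HISTORY)
print(best)
```

Because the parameter values are already stored, replaying a new equation over history is a cheap loop rather than a retraining run.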

Conclusion

I am inclined to believe that the second system is more transparent and hackable. Because it is transparent about how it makes predictions and recommendations, it will earn business users' trust faster than the first system. That will have a long-term impact on how business users provide feedback to your product team, which in turn helps smooth the path to product maturity.

Please note that the significance of explainability versus accuracy may vary across business scenarios, such as Hotstar/Netflix trying to predict whether a user is likely to continue their membership, or what kind of videos should be promoted to a certain user. It may not be cost-effective (if not pointless) to pre-calculate more than a few parameters (factors that affect the likelihood of continuing a membership) for millions of users.

I hope this article has given you a different perspective, and a starting point, on what you can do to ride this new age-of-AI wave.

