Explaining what learned models predict: In which cases can we trust machine learning models and when is caution required?
We can liken machine learning algorithms to template-building instructions. Learned models, then, are ready-to-use templates built by observing current data and designed to generalize to previously unseen data. Building on this analogy, it is understandable that learned models can sometimes be effective and reliable. For instance, AlphaFold, a machine-learning system, has outperformed other methods at the structure-prediction part of the protein folding problem (Jumper et al., 2021). Enterprise adoption of machine learning methods also rose from 26% to 30.8% between 2018 and 2020, according to the 2020 Kaggle State of Data Science and Machine Learning Survey (Kaggle, 2020). The flip side is that learned models can also be ineffective and unreliable. Two telling case studies are medical expert systems and the use of ML algorithms in hiring. Concerning the former, in 2020 a GPT-3 medical chatbot advised a simulated patient to end their life (Rousseau, Baudelaire, & Riera, 2020). In the case of the latter, Williams, Brooks, & Shmargad (2018) shed light on the unreliability of machine learning models in making hiring decisions even when social category data is excluded: the exclusion should ideally reduce bias, but it only obfuscates it, because correlated proxy features can stand in for the excluded attribute (a toy sketch of this effect follows below). Sometimes we win, sometimes we lose.

Having developed some intuition about how learned models work, we can approach the subject of their reliability in two ways: the half-full-cup or optimistic approach, where we consider when it is safe to trust learned models, and the half-empty-cup or pessimistic approach, where we examine when caution is needed. This essay discusses both viewpoints.
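To see how that obfuscation can happen, here is a small, fully synthetic sketch. It is not the analysis from Williams et al. (2018), and every feature name and number in it is invented for illustration: a model trained without the protected attribute can still reproduce biased decisions through a correlated proxy such as a postcode.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5_000

# Synthetic data: a protected attribute we will *exclude* from training,
# and a "postcode" feature that is strongly correlated with it.
protected = rng.integers(0, 2, size=n)
postcode = np.where(rng.random(n) < 0.9, protected, 1 - protected)
experience = rng.normal(5, 2, size=n)

# Historical hiring decisions that were biased in favour of one group.
hired = (experience + 2 * protected + rng.normal(0, 1, n)) > 6

# Train WITHOUT the protected attribute: only postcode and experience.
X = np.column_stack([postcode, experience])
model = LogisticRegression().fit(X, hired)

# The model still treats the two groups very differently, because the
# postcode proxy carries the protected information.
for g in (0, 1):
    rate = model.predict(X[protected == g]).mean()
    print(f"predicted hire rate for group {g}: {rate:.2f}")
```

Dropping the column does not drop the signal; the bias survives, it just becomes harder to see.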
The Half-Full Viewpoint: When to Trust ML Models
The foremost condition for trusting deep learning models is that failure carries low risk. In such scenarios, a wrong prediction does not defeat the purpose of the system. A playful example of a low-risk scenario is learning to paint a rose: if a learned model produces a collage of stems and unintelligible petals, we can at least laugh the result off and try again. Low-risk settings are therefore the ideal place to trust learned models. To understand the idea of low risk, consider what is written of the renowned painter Michelangelo:
“Michelangelo also hired assistants to grind colors, mix paints, trim and clean brushes, and take care of other mundane tasks, and he recruited other artists and instructed them in the technique of fresco painting.” (Deseret News, 1988)

When dealing with parts of the Sistine ceiling that were strenuous and safely outsourceable, he would commission his assistants; the salient parts of a figure, the ones that effuse his artistic idiosyncrasy, he painted himself. Learned models fit this selective outsourcing principle: in the low-risk, outsourceable parts of a workflow they can be trusted well enough to be deployed, while the parts where errors matter most stay in human hands.

Additionally, we can trust the predictions of learned models **after adequate supervision and validation**. Howard & Gugger (2020), in their book ‘Deep Learning for Coders with fastai & PyTorch,’ explain the need for human supervision with a bear-detection system powered by computer vision. The system is trained on many images of bears but may fail in production when it meets conditions not seen during training, such as nighttime scenes or video input. They propose:

“Where possible, the first step is to use an entirely manual process, with your deep learning model approach running in parallel but not being used directly to drive any actions... The second step is to try to limit the scope of the model, and have it carefully supervised by people... Then, gradually increase the scope of your rollout.” (Howard & Gugger, 2020, pp. 88–89)

Human-in-the-loop systems of this kind help to mitigate risk and build overall trust; a minimal sketch of such a gate follows below. Also, machine learning practitioners can trust learned models **when they perform well on trust metrics**. Wong, Wang, & Hryniowski (2020) provide several metrics of trust that practitioners can use to gauge the reliability of learned models.
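To make the staged-rollout advice concrete, here is a minimal, hypothetical sketch of a human-in-the-loop gate: the model runs on every input, but its prediction only informs a human reviewer, who makes the final call. The function names and fields are illustrative, not taken from Howard & Gugger (2020).

```python
from dataclasses import dataclass

@dataclass
class Review:
    image_id: str
    model_label: str        # what the model predicted
    model_confidence: float
    final_label: str        # what the human decided
    overridden: bool        # did the human disagree with the model?

def shadow_mode_review(image_id, model, human_labeler):
    """Run the model in parallel, but let a human drive the action.

    `model` returns (label, confidence); `human_labeler` returns the
    label a person assigns after seeing the input (and, optionally,
    the model's suggestion). Both are stand-ins for real components.
    """
    label, confidence = model(image_id)
    final = human_labeler(image_id, suggestion=label)
    return Review(image_id, label, confidence, final, overridden=(final != label))

def agreement_rate(reviews):
    """Share of cases where the human kept the model's prediction.

    Tracking this over time indicates when the model may be ready
    for a wider, less closely supervised rollout.
    """
    if not reviews:
        return 0.0
    return sum(not r.overridden for r in reviews) / len(reviews)
```

The point of the design is that during the early stages the model can be wrong without any action being taken on its output.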
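As a rough illustration of what a trust metric can look like, the sketch below scores a classifier by rewarding confidence on answers it gets right and penalising confidence on answers it gets wrong, then averages the result. This is a simplified stand-in for the confidence-based trust quantification proposed by Wong, Wang, & Hryniowski (2020), not their exact formulation.

```python
import numpy as np

def simple_trust_score(probs, labels):
    """A toy, confidence-based trust score for a classifier.

    `probs` is an (n_samples, n_classes) array of predicted class
    probabilities; `labels` holds the true class indices. Correct,
    confident predictions raise the score; wrong, confident
    predictions lower it. The metrics in Wong et al. (2020) are more
    nuanced; this only conveys the flavour.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    predictions = probs.argmax(axis=1)
    confidence = probs.max(axis=1)
    correct = predictions == labels
    # Reward confidence when right, penalise it when wrong.
    per_sample = np.where(correct, confidence, 1.0 - confidence)
    return float(per_sample.mean())

# Tiny usage example with made-up probabilities.
probs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]
labels = [0, 1, 1]   # the second prediction is wrong
print(simple_trust_score(probs, labels))  # ~0.7
```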
The Half-Empty Viewpoint: When to Apply Caution
The first point made above likewise means that people who use learned models should exercise caution **in high-risk scenarios**. When the stakes are high, or slight mistakes have a heavy impact on a system, relying on learned models is unwise. In a review of 62 articles proposing novel machine learning methods to detect COVID-19, Roberts et al. (2021) found that none of the models was dependable enough for clinical practice, owing to “methodological flaws and/or underlying biases.” A good rule of thumb is that medical scenarios are never low-risk. Traffic is another example: a learned model such as the one studied by Eykholt et al. (2017), which misreads stop signs as ‘Speed Limit 45’ after small physical perturbations, should not be trusted, because traffic is volatile and accidents carry heavy consequences (a toy sketch of how such adversarial perturbations are crafted appears at the end of this section). Machine learning practitioners should also be cautious about using learned models, which are purely mathematical, for tasks that require emotionally guided action or strictly defined heuristics.

Furthermore, caution is required **when there is a negative history associated with applying learned models in a particular field**. To illustrate, Popken & Kent (2018) discuss YouTube’s recommendation system and how its characteristic filter bubbles amplified misinformation and swelled the ranks of conspiracy theorists. A similar track record exists in policing: Najibi (2020) points out two recurring demerits of facial recognition systems in police investigations, namely that they are often used without the knowledge or consent of the people being scanned, and that they carry a non-trivial history of racial bias. Knowing full well that learned models have a record of unreliability in applications like these, practitioners should apply extra care when deploying them.
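To give a feel for how such attacks work in the digital setting, here is a minimal sketch of the Fast Gradient Sign Method, one of the simplest ways to craft an adversarial input. The model, tensor shapes, and epsilon value are placeholders; Eykholt et al. (2017) go further and craft physical stickers that survive printing, distance, and viewing angle, so this only conveys the underlying idea.

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, image, true_label, epsilon=0.03):
    """Craft an adversarial image with the Fast Gradient Sign Method.

    Each pixel is nudged by +/- epsilon in whichever direction
    increases the classifier's loss, which is often enough to flip the
    predicted class while the change stays nearly invisible to people.
    `model` is any classifier returning logits for a (1, C, H, W) image.
    """
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), torch.tensor([true_label]))
    loss.backward()
    # Step in the sign of the gradient, then clamp back to a valid pixel range.
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0.0, 1.0).detach()
```

The ease of this kind of attack is a large part of why a stop-sign classifier deserves caution before being trusted in traffic.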
Conclusion
In spite of these guidelines on when to trust machine learning models and when to be cautious, one can argue that effectiveness, speed, or other factors may sometimes be worth trading against the transparency and interpretability of learned models. Nevertheless, Rudin (2019) argues that, for high-stakes decisions, it is preferable to design interpretable models from the outset than to work around unreliable black-box models.
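As a small illustration of that alternative, the sketch below fits a shallow decision tree, an inherently interpretable model, and prints its learned rules so that every prediction can be traced to explicit thresholds. The dataset and tree depth are arbitrary choices made for the example, not anything prescribed by Rudin (2019).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

# A shallow tree: possibly less accurate than a large black-box model,
# but every decision path can be read and audited by a person.
data = load_breast_cancer()
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(data.data, data.target)

# Print the learned rules as human-readable if/else thresholds.
print(export_text(tree, feature_names=list(data.feature_names)))
```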
References

Articles that were not cited but were studied for this essay have also been included here.

- Deseret News. (1988, May 15). Researcher says Michelangelo did Sistine standing up, and with help. Retrieved October 3, 2021, from Deseret News website: https://www.deseret.com/1988/5/15/18766067/researcher-says-michelangelo-did-sistine-standing-up-and-with-help
- Eykholt, K., Evtimov, I., Fernandes, E., Li, B., Rahmati, A., Xiao, C., … Song, D. (2017). Robust Physical-World Attacks on Deep Learning Models. Retrieved October 3, 2021, from arXiv.org website: https://arxiv.org/abs/1707.08945
- Heaven, D. (2019). Why deep-learning AIs are so easy to fool. Nature, 574(7777), 163–166. https://doi.org/10.1038/d41586-019-03013-5
- Howard, J., & Gugger, S. (2020). Deep Learning for Coders with fastai & PyTorch: AI Applications Without a PhD (pp. 88–89). S.l.: O’Reilly Media.
- Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., … Back, T. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873), 583–589. https://doi.org/10.1038/s41586-021-03819-2
- Kaggle. (2020). State of Data Science and Machine Learning 2020. Retrieved October 3, 2021, from Kaggle.com website: https://www.kaggle.com/kaggle-survey-2020
- Michelangelo Hated Painting the Sistine Chapel so much He Wrote a Poem About It. (2021, June 21). Retrieved October 3, 2021, from FYI website: https://vocal.media/fyi/michelangelo-hated-painting-the-sistine-chapel-so-much-he-wrote-a-poem-about-it
- Najibi, A. (2020, October 24). Racial Discrimination in Face Recognition Technology - Science in the News. Retrieved October 3, 2021, from Science in the News website: https://sitn.hms.harvard.edu/flash/2020/racial-discrimination-in-face-recognition-technology/
- Popken, B., & Kent, J. L. (2018, April 19). As algorithms take over, YouTube’s recommendations highlight a human problem. Retrieved October 3, 2021, from NBC News website: https://www.nbcnews.com/tech/social-media/algorithms-take-over-youtube-s-recommendations-highlight-human-problem-n867596
- Roberts, M., Driggs, D., Thorpe, M., Gilbey, J., Yeung, M., Ursprung, S., … Schönlieb, C.-B. (2021). Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans. Nature Machine Intelligence, 3(3), 199–217. https://doi.org/10.1038/s42256-021-00307-0
- Rousseau, A.-L., Baudelaire, C., & Riera, K. (2020, October 27). Doctor GPT-3: hype or reality? - Nabla. Retrieved October 3, 2021, from Nabla.com website: https://www.nabla.com/blog/gpt-3/
- Rudin, C. (2019, May). Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Retrieved October 3, 2021, from ResearchGate website: https://www.researchgate.net/publication/333069815_Stop_explaining_black_box_machine_learning_models_for_high_stakes_decisions_and_use_interpretable_models_instead
- Williams, B., Brooks, C., & Shmargad, Y. (2018). How Algorithms Discriminate Based on Data They Lack: Challenges, Solutions, and Policy Implications. Journal of Information Policy, 8, 78. https://doi.org/10.5325/jinfopoli.8.2018.0078
- Wong, A., Wang, X. Y., & Hryniowski, A. (2020). How Much Can We Really Trust You? Towards Simple, Interpretable Trust Quantification Metrics for Deep Neural Networks. Retrieved October 3, 2021, from arXiv.org website: https://arxiv.org/abs/2009.05835