Welcome back to WorkMinus, where we talk about what we need to drop from how we work, and quick pivots you can make today to get to a better future of work. Today our guest is Rich Caruana, principal researcher at Microsoft Research, and this episode is WorkMinus ‘Black Box’ Models. Hi Rich, how are you today?
Hi! How are you doing?
I’m doing excellent. Rich, we’re very excited to have you on the show. You are going to lead us into a much deeper level of technology than we’re used to on the show. We’re going to be talking about machine learning, which is Rich’s area of expertise. So Rich why don’t you start us off with the current state of machine learning and why the term is getting more mainstream attention now.
Sure. So all of you have probably heard about ‘deep learning,’ that’s the kind of machine learning that’s become very much in the forefront the last ten years or so. That’s a neural net technology. But the reason why machine learning is taking off these days is because we finally have large – in some cases massive – data sets. We’ve got a lot of computing horsepower and we’ve been working on machine learning for thirty-plus years and we finally got technology that’s really doing well on these massive data sets. Deep learning is what has made it particularly exciting, because in certain domains like speech recognition or recognizing images, faces, things like that, this stuff is achieving accuracies in the last five to ten years that many of us didn’t even think were possible fifteen years ago. So this new deep learning stuff is really doing incredibly well, and has made machine learning the thing to do right now.
So machine learning, obviously in the headlines, we’re learning about things becoming more accurate. But one of the things that you talk about is something called ‘black box’ models. When you call certain forms of machine learning ‘black box,’ what do you mean by that?
So, let’s talk about deep learning. These neural nets, now, have gotten to the point where they could be hundreds of layers deep and have a million parameters on every layer, so these deep neural nets, now, might have a hundred million or more parameters in them. And even though we understand how the neural net works – we constructed it – it doesn’t mean that after it has learned to fix those parameters by looking at a massive training set, that we actually understand what it has learned. So it has actually learned by looking through hundreds of thousands of images. It has learned to recognize images by setting those hundreds of millions of weights to particular values. And no human being can look at hundreds of millions of weights, that combine in a complex way, and understand it. So a deep neural net is what we consider to be a ‘black box’ model. It has learned something very sophisticated inside of it. We know how the learning worked – we’re the ones who created the algorithm. But we don’t know exactly what it learned. We’re not able to see it. And we’re not able to see it just because of scale. It’s so complicated that we’re not able to see exactly how it makes all the decisions it makes when it’s trying to make a prediction for a new case. So it’s really a problem that our own models have gotten so big and complex now that even we don’t understand what’s in them. So we call them ‘black boxes,’ because we can’t see what’s going on inside.
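Editor's note: the scale argument above can be checked with quick arithmetic. The sketch below uses only the round figures quoted in the answer; the layer count of 100 is the low end of "hundreds," and the reading speed of one weight per second is a made-up assumption for illustration.

```python
# Rough parameter count for a deep network, using the round figures quoted above.
layers = 100                    # low end of "hundreds of layers deep" (assumed)
params_per_layer = 1_000_000    # "a million parameters on every layer"

total_params = layers * params_per_layer
print(f"{total_params:,} parameters")

# Hypothetical reading speed: one weight per second, without a break.
seconds_per_year = 60 * 60 * 24 * 365
years_to_read = total_params / seconds_per_year
print(f"about {years_to_read:.1f} years just to look at each weight once")
```

Even under these generous assumptions, merely reading the weights once would take years, and reading them says nothing about how they combine.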
You’re the creator of this, you’re coding it, you’re programming it to do this, but it’s doing something far beyond what you’re capable of doing. What does that feel like, as a creator, as an inventor, as a programmer?
It’s interesting. I mean, sometimes these things learn things that surprise you. There are certain kinds of models, which we’ll talk about in a few minutes, which are more ‘white box,’ more open box, that you can understand. And sometimes we’re really impressed by the things that the models learn. Things which we should have suspected were in the data, but we didn’t know that the model was going to learn it so well. It’s really nice to have these things achieve accuracy sometimes that humans can’t even achieve. There are a few domains now, where we have so much data and we put so much effort into figuring out how to train the models, where the models are really doing as well as, or better than, humans. And that’s kind of impressive.
I’ve been working in this area now for thirty-plus years, and it’s exciting to have seen the day where, every now and then, technology we’ve created is even impressing us and going beyond our expectations.
So when it comes to exceeding expectations, I’m going to draw a framework here. Based on my understanding of machine learning, what you have are basically three different outcomes that can happen. One is that the machine is going to learn about what you think that a human could learn if you put them in the same situation. Obviously maybe on a more massive scale, but the outcomes are something that’s on par with human intelligence. A second might be, when the machine learns something greater than we could have figured out, so it learns something better than what we could have done on our own. And the third is when the machine learns something either we know to be inaccurate or that reflects a bias that we didn’t want it to learn. Can you unpack these three different elements, and give us some stories and examples of those?
Those are three great, great settings. Often what’s happening is the model that we train is actually doing all three of those at the same time. The important thing is to be happy when it learned stuff that’s like what we would have learned and have performance comparable to us; be thrilled when it has done something more accurate than us, or maybe has learned something new that we didn’t know – that’s fantastic; and then to be sort of cautious when it has learned things which are biases in the data that we need to repair before we go ahead and deploy the model and start using it.
Let me give some examples of these. So if you train a model on 100,000 images, it’ll often learn to recognize the parts of the image that you and I would think were important. If it’s learning to recognize birds, it’ll tend to recognize things like the beak, the kinds of feathers, the color, maybe the size of the bird, what the feet look like. Some of what it learns will be very similar to the kinds of things that you and I would use if we were trying to recognize different birds, or describe them to other people. Some of what it might learn would actually be better. It might learn things about the texture of the feathers, that somehow you and I wouldn’t have picked up on. So it might actually learn some things about the proportion of the shape of the curve of the back, or it might learn some things that you and I wouldn’t have noticed. Maybe an expert ornithologist would have known these things, but it really can learn some things that maybe you and I would have missed. And part of the reason why these models can be so sophisticated – well, first of all, they really do get trained, often, on hundreds of thousands or even millions of images. And if you think about it, no human being ever has the luxury of looking at hundreds of thousands or millions of classified images. It would take too long, and we’d get too bored with it. But the machine learning model really can do that. It can sit there for days, if necessary, going through these images trying to get it right.
The other thing is, the machine learning model is very good at looking at everything at the same time. So while you and I might focus on the beak, and the eye, and the foot, and the shape of the neck, or something like that, it can focus on that – and other things – all at the same time. It’s very democratic that way – it looks at everything in the image that it can find signal in. And it uses all of those things at the same time, if possible. So that’s why it can actually be more accurate than us. The bias though is, it can be too democratic. For example, there are image classifiers that have learned to recognize wolves. What they do is, they learn to recognize not just the wolf itself – the animal that’s in the image – but they also learn to recognize the background. The wolf will often be in a setting that has snow in it, whereas a dog would often be in a setting that doesn’t have snow in it. And what happened is, the wolf classifier will also use this background information as part of its classification for wolf versus dog. And when it sees a snowy setting, if you put a dog in a classic wolf setting, it might actually say: oh, that’s a wolf. Because wolves and dogs are vaguely similar, but the setting will make it look more like it’s a wolf to the machine learning model. Because the machine learning model is not smart enough the way you and I are, to say: oh, that looks like my dog. It just happens to be in the setting that looks more like a place where we would find wolves. So they’re not smart enough to separate what they should and shouldn’t learn from. They kind of learn from everything, and that’s both good and bad.
So to give you some other examples of that – so it turns out that when it recognizes cows, it kind of expects cows to be sitting in a farm scene. The typical scene that we would expect in the U.S. would have grass, maybe a barn, maybe a fence, that sort of thing. But if you take that same cow and you put it in a very different setting, for example on the sandy beach next to water, it might not recognize it as a cow. And that’s because even though it looks like a cow on the beach, it doesn’t have enough experience with cows on the beach, and it’s taking the background into account too much. And if you suddenly had that picture of the cow in space, orbiting the moon or something like that, it won’t recognize it at all because the background doesn’t make any sense. It’s never seen any images like that. Whereas you and I would say, oh that’s a cow in space – what’s a cow doing in space? So this is where the models are not so smart, because they take literally everything they can find signal in into account. And some of those things they shouldn’t take into account.
Then the third problem is these biases that can be in the data. For example, I think we all accept that there is a bias in any data that we would use for hiring right now. It would probably have race or gender bias in the training signals. Similarly, if we have data collected from the criminal justice system. The criminal justice system probably has biases in it because humans have bias, and the justice system is basically humans. So any kind of race or gender bias that would be in that data, the machine learning model doesn’t know what we consider to be correct or incorrect to learn. So the model will learn to be just as biased as the data.
Let me give you an example of this. I do a lot of work in machine learning for healthcare. We trained a model to try to predict if you’re high-risk or low-risk of dying from pneumonia. The idea being that, if you’re high-risk we’ll put you in the hospital, and if you’re low-risk we’ll give you antibiotics, chicken soup, and tell you to call us in a few days if you’re not feeling better. That really is the safest care if you’re a low-risk pneumonia patient. And the model learned a lot of things. It was the most accurate model we could train. This was a neural net, one of these ‘black box’ models. And we decided not to use it on real patients, because we didn’t know what it had learned. It’s a ‘black box’ model – we couldn’t understand it. And we thought that’s a little risky to deploy this thing. Part of the reason why we thought it was risky was, somebody else was training another model that, while being a lot less accurate, was completely intelligible. We could understand what that model had learned. And one of the rules it learned – it learned a lot of good things, by the way – but one of the rules it learned was that, if you have a history of asthma, you have less chance of dying from pneumonia. That is, asthma looks like it’s good for you if you have pneumonia. And that should seem a little weird, right?
You wouldn’t think that.
Yeah – why would asthma be good for you? Well it turns out, if you look more carefully at the data, the asthmatics actually do have a better chance of survival than the non-asthmatics. But, the reason why, after we looked into it more carefully – we’re pretty sure the reason for that is asthmatics pay more attention to how they’re breathing. So they notice their symptoms quickly. And then, they have a doctor who treats their asthma. The doctor says: Oh, that’s weird, it’s kind of weird that your meds didn’t work. So they get an appointment quickly. And then they get quickly diagnosed: Oh, it’s not asthma you’re having, it’s pneumonia. They’re considered to be high-risk patients, in fact, by doctors. So they get very aggressive treatment. They might get admitted to the hospital, they might get extra-strong meds, they could even end up in the ICU – in the critical care unit – if they look sick enough. So what happens, we think, is that asthmatics notice the symptoms quickly, get diagnosed accurately quickly, and then get very good treatment quickly. And there’s nothing better if you’ve got an infection like pneumonia than getting rapid diagnosis and high-quality treatment.
So this made us worried. We figured the neural net trained on the same data has probably learned that asthma is good for you. But if we’re going to use this model – suppose you’re a patient, and we might use this model to predict whether you’re high-risk or low-risk. And the model, because you have a history of asthma, says you actually look like you’re low-risk. Then we might not give you rapid treatment or as aggressive treatment. And now, because being asthmatic actually makes you higher risk, we’d be withholding the very treatment that makes you low-risk in the current system. So in fact, if we were to intervene in your care using this model that predicts that asthmatics are low-risk, it conceivably would hurt some asthmatics. So that’s a bad thing, that’s a risky use of machine learning.
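Editor's note: the asthma story above is a case where a pattern that is true in the observed data reverses once the model is used to change care. The simulation below is a minimal sketch of that reasoning. Every number in it is invented for illustration (the base risks, the effect of aggressive care, the sample size); none comes from the pneumonia study, and the names `BASE_RISK`, `AGGRESSIVE_CARE_FACTOR`, `simulate_patient`, and `death_rate` are hypothetical.

```python
import random

random.seed(0)

# Hypothetical death rates under ordinary care (illustrative, not study data).
BASE_RISK = {"asthma": 0.20, "no_asthma": 0.10}
# Assumption: aggressive care cuts a patient's risk to a quarter.
AGGRESSIVE_CARE_FACTOR = 0.25


def simulate_patient(has_asthma, aggressive_care):
    """Return True if the simulated patient dies."""
    risk = BASE_RISK["asthma" if has_asthma else "no_asthma"]
    if aggressive_care:
        risk *= AGGRESSIVE_CARE_FACTOR
    return random.random() < risk


def death_rate(has_asthma, aggressive_care, n=50_000):
    """Estimate the death rate for one group under one care policy."""
    deaths = sum(simulate_patient(has_asthma, aggressive_care) for _ in range(n))
    return deaths / n


# Current system: asthmatics get aggressive care, everyone else ordinary care.
observed_asthma = death_rate(True, aggressive_care=True)
observed_other = death_rate(False, aggressive_care=False)

# Model-driven system: asthmatics are labeled low-risk and get ordinary care.
intervened_asthma = death_rate(True, aggressive_care=False)

print(f"observed, asthma:            {observed_asthma:.3f}")
print(f"observed, no asthma:         {observed_other:.3f}")
print(f"asthma, care withheld:       {intervened_asthma:.3f}")
```

In the observed data asthmatics die less often than everyone else, so a model that learns "asthma lowers risk" is statistically correct. The third number shows what happens if that prediction is then used to withhold the care that produced the low rate in the first place.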
Now, the asthma problem, once we know it’s there, we probably can fix it. The important thing is, the only reason we knew this asthma problem existed in the data was because somebody else trained another model that wasn’t a ‘black box.’ That was very intelligible, and by looking at that model we recognized this problem. So even if we fix the asthma problem in the neural net, which might be hard, but maybe we could fix it, what else did the neural net learn that is similarly risky? The other, more transparent model didn’t learn these other things as well, so we don’t actually know whether we have other problems. It turns out now we have these much more transparent, very accurate models that we’ve been working on for years here at Microsoft.
And it turns out that there are other problems in the data. It turns out that having heart disease is also good for you, if you have pneumonia. In fact, it’s better for you than having asthma. And it turns out it’s exactly the same story. You had a heart attack, let’s say three years ago. You wake up in the morning and you notice a little tightness in your chest, a little trouble breathing. Within an hour you’re in the emergency room. You don’t waste any time. You just go right away. At the E.R. you get to the front of the line, you’re on a bed very quickly with probes attached to you. And then they have good news – oh, good news, you’re not having another heart attack, it’s just pneumonia. They consider you to be high-risk because you have a cardiovascular disease, so they give you great treatment, and it turns out that’s even better than having asthma because you’ve gotten to care even faster. So again, we wouldn’t want the model to – and sometimes the model is correct – heart disease patients or asthmatics truly are low-risk, if you let healthcare run the way it runs right now. That is, if these patients actually do get to care faster and get very aggressive treatment. But if you were to use this model’s prediction of low-risk to then intervene in these patients’ care, and possibly mean that they get care slower or less aggressive treatment, you’d actually be hurting the very patients that we otherwise would have been helping. So it very much depends on how you’re going to use the model.
So this is a case where a model has learned – I’m going to call it a bias that’s in the data, that asthma and heart disease problem. It’s a true statistical pattern in the data, just like race and gender bias might be true patterns in the data. The model has done its job. It has learned the statistics of the sample. And now though, given how you’re going to use the model, you recognize: it would be bad if it made this particular kind of prediction, this way. Just like we might decide: it would be bad if this model exhibited race bias or gender bias when it made a prediction. So we might want to fix that. So the risk of ‘black box’ models is that they learned all the good stuff and the bad stuff potentially at the same time. And you’d love to go in and find some of the bad stuff and fix it before you deploy the model. And that’s why a lot of us are working on ways of either taking a ‘black box’ model and opening it up, so that we can explain what’s going on inside, or developing other techniques that might be almost as accurate as ‘black box’ models, but which are intelligible from the beginning.
So in a lot of ways what you’re talking about, I keep thinking about raising children. In the sense of, you’re trying to figure out what they’re learning, where they picked that up from, are they smart enough to learn what they should or shouldn’t learn? In what ways is that a true analogy, in what ways is that a faulty analogy?
It’s interesting. I mean, children learn good stuff, they learn incorrect stuff, and they learn bad stuff, all at the same time. And because we’re sort of in the loop with children – as they’re learning and growing – hopefully we reinforce the good stuff, we correct the incorrect stuff and we try to suppress what we view as the bad stuff. And we’d like to do exactly the same thing with our machine learning models. But if the machine learning model is a ‘black box’ that trained for a few days on a massive data set, and we can’t understand what’s going on inside of the ‘black box,’ then it’s very hard for us to reward the good stuff, try to correct the errors, and suppress the stuff that we view as bad. So that’s why it’s important to have either explanation methods that let us see what’s going on inside the models, or different kinds of learning methods that are more ‘open box’ to begin with, where we actually can understand what they’re learning.
So talk to a business person now, who’s in a management position. They want to take advantage of machine learning. But what is the danger that ‘black box’ models pose to businesses who want to jump into machine learning?
So these problems I talked about, like the asthma and heart disease problem for pneumonia, or race and gender bias in data that has to do with hiring patterns, or with criminal justice. It turns out – every data set, no matter what it is – every data set has these kinds of issues in it. And when you train a model, you might think that the model is incredibly accurate. So what we often do in machine learning is, we have some data, we sort of cut it in half, we train on half of it and then test on the other half. And we hope that the accuracy on the test set, the part that it didn’t train on, looks really, really good. Possibly better than humans. It could have superhuman accuracy, which is what we’re always after. Now you might think, this model has this very high accuracy, therefore it’s better than humans or just as good as humans; it’s safe to deploy. Well, what happens is, the data – because our test data looks just like our training data – it has all the very same biases in it. So when we test this model on pneumonia using part of the training data held aside as the test set, it turns out it also has the statistical pattern in it that asthma and heart disease are good for you. Or, if there’s race or gender bias in the data, that’ll be in your test set as well. So the model will actually be rewarded on the test set, with very high accuracy for making predictions that we recognize are inappropriate, one way or another.
So the first thing you should be concerned about is: you train a model, it looks on test data like it’s extremely accurate, like it’s great, like it’s going to save you money, it’s going to help your company. You’re good to go. And in fact, what’s often the case is: some of what it has learned is extremely good, and appropriate. But some fraction, maybe five, ten percent of what it has learned is inappropriate, and it’s getting rewarded with extra high accuracy for those incorrect predictions, as well. Because your test set looks like your training set. So the real measure of the accuracy of these models, and whether they’re safe to use – one measure is, you have to deploy them. So you actually have to use them in the real world. You have to do something like a limited clinical trial. Where you very carefully use it for a small number of predictions. Look carefully at the predictions it makes, and you try to assess what its accuracy in the wild, in the real world, is. That accuracy can be incredibly different from the accuracy that you have in the lab.
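Editor's note: the point about held-out test sets sharing the training set's biases can be sketched with the wolf-versus-dog example from earlier in the conversation. This is a toy illustration, not anyone's actual classifier: the "model" sees only whether the background has snow, the probabilities are invented, and the function names (`make_examples`, `fit`, `accuracy`) are hypothetical.

```python
import random

random.seed(1)


def make_examples(n, p_snow_given_wolf, p_snow_given_dog):
    """Generate (has_snow, label) pairs; label is 'wolf' or 'dog'."""
    examples = []
    for _ in range(n):
        label = random.choice(["wolf", "dog"])
        p_snow = p_snow_given_wolf if label == "wolf" else p_snow_given_dog
        examples.append((random.random() < p_snow, label))
    return examples


def fit(train):
    """Learn the majority label for each background (snow / no snow)."""
    counts = {True: {"wolf": 0, "dog": 0}, False: {"wolf": 0, "dog": 0}}
    for has_snow, label in train:
        counts[has_snow][label] += 1
    return {snow: max(c, key=c.get) for snow, c in counts.items()}


def accuracy(model, examples):
    """Fraction of examples where the background-only model is right."""
    correct = sum(model[has_snow] == label for has_snow, label in examples)
    return correct / len(examples)


# Lab data (assumed): wolves mostly photographed in snow, dogs mostly not.
lab = make_examples(20_000, p_snow_given_wolf=0.95, p_snow_given_dog=0.05)
half = len(lab) // 2
train, test = lab[:half], lab[half:]   # "cut it in half"

# Real world (assumed): snow is equally likely behind either animal.
wild = make_examples(10_000, p_snow_given_wolf=0.5, p_snow_given_dog=0.5)

model = fit(train)
print(f"held-out test accuracy: {accuracy(model, test):.2f}")
print(f"real-world accuracy:    {accuracy(model, wild):.2f}")
```

The held-out test set rewards the snow shortcut because it was drawn from the same photos as the training set. Only data from a different setting exposes that the model never looked at the animal.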
To give you an analogy: right now, when they train self-driving cars to recognize pedestrians and roads and things like that, in the lab they get these things to the point where their accuracy is essentially perfect. In the lab they really get these things so that they’re as close to perfect as we can make them. They’re amazingly good. But then when you deploy the model by actually having it drive on the street, it turns out some things will happen in the real world that you never collected in your data, or you never thought of in your simulators. And the real world is basically more complicated than your laboratory set-up. Because of that, the self-driving car will end up in some settings that it really wasn’t adequately trained for, or tested on. So you’d like to do careful deployment of these self-driving cars, which is what they tend to do. Usually you have a human behind the wheel still, who is ultimately in charge to take control whenever the self-driving car looks like it’s going to do something wrong. And that’s because we recognize that the accuracy on a test set in a lab setting doesn’t reflect the true accuracy, yet, that we’re going to see in the real world.
So this is an important part of good practice. Get the thing to be as accurate as possible in your company, in the lab. If possible, try to understand what the model has learned. Have human experts look at the model, or look at its predictions. And then, do a very careful rollout of this model into the real world. Especially if it’s a critical domain, where people’s lives might be at risk if the model makes mistakes – as would be the case in healthcare, or in autonomous vehicles. And then if the model looks like it has superhuman accuracy in the real world, that’s fantastic. We’re all looking for the car that drives better than we do.
Excellent, excellent. Rich, I’ve been fascinated by all this. We really learned – you packed a lot of information into this short time. My big takeaway is that, I know that I’m still smarter than machines as long as I can recognize that it’s a cow in space. So, I’m happy about that, for now.
The thing you have is, you have a lot more real world experience. A lot more diverse experience. And you have this thing that we call common sense. And the machine – whenever I tell a human being that the pneumonia model has learned that asthma and heart disease are good for you, even if they know nothing about medicine, they’re like: oh, that doesn’t make sense, why would it do that? And the model is very smart. It’s probably a more accurate predictor of risk than you or I are, as non-experts. And yet, the model doesn’t have the common sense to question what it learned. It doesn’t know that: wow, why would asthma be good for you? So there’s definitely things that humans have, that these machine learning models do not have yet, even if the models look very accurate. The humans have a lot of background knowledge. And a lot of broader experience, and they know when to question what the model has learned or how it’s making predictions. In a way that the model doesn’t know to question itself yet.
Fantastic. Very, very fascinating talk. I loved learning more about this topic. Rich, how can people connect with you to learn more about your research?
I’m pretty easy to find. Fortunately my last name is fairly unique in machine learning, so if you just type my name into any browser and, say, machine learning – it will almost certainly come up with me. If you don’t type in machine learning, you’ll end up with a chess grandmaster which, unfortunately, is not me. So the easiest way to find me is just do a search for me. You’ll ultimately end up with one of several email addresses, and it turns out all of them will get to me.
That’s Rich Caruana, right?
That’s correct, yes.
Well Rich, thanks so much for being on WorkMinus, and we hope you have a great day.
Thank you very much, I enjoyed it.