I was super happy to have the opportunity to present at ML in PL, a world-class Machine Learning event in Warsaw, Poland, attended by people from research organizations all over the world. I had been looking forward to all of the deeply technical talks, but I was grateful that the organizers started the day by taking a step back and reflecting a bit on the ethics of what we do. It's an important topic, and it doesn't receive the attention it should.
As Machine Learning people, we work on technologies that are super powerful. They impact billions of people in their everyday lives. And after a certain point, the decisions are made mostly by non-technical people, like managers and politicians. The message of my talk is that the Machine Learning community must shape the world so that AI is built and implemented with a focus on the entire outcome for our society, and not just optimized for accuracy and/or profit.
I have researched this topic quite a lot, and along the way I have found a number of stories that made a huge impression on me. Here are ten lessons based on real-life examples that I think we should all remember as people working in machine learning, whether you're a researcher, engineer, or user.
I'll start with a story about the Space Shuttle Challenger, which launched for the tenth time in January 1986.
For the first time ever, a school teacher was among the crew, so unlike previous launches, this one was widely watched, especially by school children. Seventy-three seconds after liftoff, the shuttle broke apart. All seven astronauts on board died in the accident.
How could this happen? Obviously, there was a thorough investigation after the accident. Only then did we learn that the crew was doomed virtually from the very beginning of the launch. The cause was a leak in one of the two solid rocket boosters. These boosters were built from segments, and the joints between segments were sealed with rubber O-rings. During this disastrous launch, one of the seals didn't do its job, and extremely hot gases started to escape, causing more and more leaks. At 72 seconds after launch, the booster broke loose from the rest of the vehicle. This caused the entire vehicle to abruptly change course, which exposed it to forces it wasn't built to withstand. At this point the shuttle broke into pieces.
But why did this happen? It turns out that the day before the launch, NASA got a weather forecast saying that it would be an unusually cold morning in Florida, with a temperature of -1 degree Celsius. However, there was huge political and organizational pressure to launch, especially since the launch had already been postponed several times and NASA had promised a shuttle launch every month. The rocket boosters were provided by a subcontractor, Morton Thiokol. The O-rings I mentioned were designed for temperatures above 10 degrees Celsius; they had not even been tested below that. Engineers from the company had already raised concerns after previous launches in cold temperatures. And NASA procedures said that the launch couldn't go ahead if a contractor responsible for part of the shuttle recommended against it. NASA knew that, so it got Morton Thiokol on an emergency conference call the night before the launch.
The engineers who designed the rocket booster were shocked that anyone was even considering a launch at such a low temperature. They had no data showing that it was safe, and they were sure the temperature was too low: rubber loses its elasticity in the cold and stops being a good sealant. So during the conference call, Morton Thiokol engineers and managers recommended cancelling the launch. However, NASA was the company's most important client, and NASA really wanted a yes. So the company's general manager pressured his people to change their minds if possible.
At one point, two of the senior managers agreed, but there was still one person objecting. That’s when the general manager said to him: “You need to take off your engineering hat, and put on your manager hat.”
And that’s what happened. He changed his vote, and they agreed to support the launch.
I think this quote is very powerful in hindsight, and I would never want to be in that position. Of course, there were multiple layers of causes behind the disaster. But this decision cost seven astronauts their lives.
As ML people, we work on powerful technologies that affect not just seven people, but millions of people, throughout their lives, all the time. As CTO of Appsilon Data Science, I care about the models we build, but I also care about ethics. We need to build a culture where AI is built and implemented with the whole picture in view: the entire outcome, not just a single metric, accuracy, or profit.
What if your grandmother's healthcare plan were suddenly cut by an algorithm? Something like this happened in the United States (Rachel Thomas used this example and some of the following ones in her brilliant keynote, which I'll discuss later). Tammy Dobbs has cerebral palsy. She is mentally fully capable, but she uses a wheelchair and has stiffened hands. She lived in the state of Arkansas, where she was entitled to have someone come to her home three times a day to assist her. Without that help, she basically couldn't get by.
One day in 2016, a nurse came to her house for one of the periodic assessments, this time with a computer running a new algorithm recently approved by the state. At the end of the re-evaluation, Ms. Dobbs was notified that her allotment had been cut from 56 to 32 hours a week. That might not sound like a big deal, but for Ms. Dobbs it was a disaster, a life-damaging event even.
Only after lawsuits were filed and the case went to court did it come to light that Ms. Dobbs' particular condition had been improperly encoded in the algorithm.
This scenario shows several problems that affect many people. First, and one of the biggest, is the lack of any public explanation of how the system works. Even the nurse who did the assessment couldn't explain anything about the decision. Second, there was no appeals process in place; the only option was to file a lawsuit. Third, no one feels responsible. Several parties were involved: a software company, the researchers who created the initial model, and the state politicians who also made decisions.
The creator of the model was asked whether he felt responsible for the error, and he replied, "Yeah, I also should probably dust under my bed." He didn't feel responsible at all. I can sort of understand that. From a distance it seems ridiculous, but when you're in the middle of the work, focused on optimizing some metric and deep into coding problems, it's easy to lose sight of the bigger picture. That brings us to Lesson 1.
Lesson 1: As algorithm and model authors, we must take responsibility for how our work will be used. Responsibility tends to get diluted.
Maybe some of you are aware that in the US, there is software that gives judges suggestions by trying to predict whether an individual will re-offend. The intent is super good. It can help release people who made just one mistake and probably won't re-offend in the future, while detaining people who likely will. However, it's super biased! Among people who did not go on to re-offend, the algorithm wrongly labeled African American defendants as high risk about twice as often as White defendants.
Here is another famous example, and what I find shocking is how many years it persisted. For quite a while, when there were no context clues, Google Translate rendered the Turkish sentences "o bir doktor" and "o bir hemşire", which use the gender-neutral pronoun "o", as "He is a doctor" and "She is a nurse". So a bias from our society got encoded in a widely used product. Despite Google's huge budget, these translations remained for a long time.
Lesson 2: We need to deliberately work on and test our models to avoid bias, and interpret what a model has learned.
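To make "wrongly flagged about twice as often" concrete, here is a minimal Python sketch of the kind of per-group check that exposes such a disparity: the false positive rate (the share of people who did not re-offend but were still labeled high risk), computed separately for each group. The function name and the toy data are hypothetical and made up for illustration; they do not come from the real system or any audit of it.

```python
from collections import defaultdict


def false_positive_rates(y_true, y_pred, groups):
    """Compute the false positive rate separately for each group.

    FPR = FP / (FP + TN): among people who did NOT re-offend (y_true == 0),
    the share the model still flagged as high risk (y_pred == 1).
    """
    false_positives = defaultdict(int)
    negatives = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == 0:  # this person did not re-offend
            negatives[group] += 1
            if pred == 1:  # ...but was flagged as high risk anyway
                false_positives[group] += 1
    return {g: false_positives[g] / negatives[g] for g in negatives}


# Toy, made-up data: 1 = re-offended / flagged high risk, 0 = did not / low risk.
y_true = [0, 0, 0, 0, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 0, 0, 1]
groups = ["A"] * 5 + ["B"] * 5

rates = false_positive_rates(y_true, y_pred, groups)
print(rates)                    # {'A': 0.5, 'B': 0.25}
print(rates["A"] / rates["B"])  # 2.0
```

On this toy data, group A's false positive rate is twice group B's, even though both groups contain the same number of actual re-offenders. A single overall accuracy number (0.7 here) would not show that gap.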
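One way to deliberately test for the Turkish translation problem above is a template-based audit: feed the model gender-neutral sentences that differ only in the profession, and check whether the output introduces a gendered pronoun. The Python sketch below assumes a hypothetical `translate` callable standing in for a real translation service (it does not call Google Translate); the stub simply reproduces the two outputs described earlier.

```python
from typing import Callable, Dict, Iterable, List

# English pronouns that add a gender the Turkish source sentence does not have.
GENDERED_PRONOUNS = {"he": "male", "she": "female"}


def pronoun_audit(translate: Callable[[str], str],
                  professions: Iterable[str]) -> Dict[str, str]:
    """Translate 'o bir <profession>' and record the gender of the first word."""
    report = {}
    for profession in professions:
        output = translate(f"o bir {profession}")
        first_word = output.split()[0].lower()
        report[profession] = GENDERED_PRONOUNS.get(first_word, "neutral")
    return report


def flagged(report: Dict[str, str]) -> List[str]:
    """Professions whose gender-neutral source came back gendered."""
    return [p for p, gender in report.items() if gender != "neutral"]


def stub_translate(sentence: str) -> str:
    # Hypothetical stand-in that mimics the behavior described in the text.
    outputs = {"o bir doktor": "He is a doctor",
               "o bir hemşire": "She is a nurse"}
    return outputs[sentence]


report = pronoun_audit(stub_translate, ["doktor", "hemşire"])
print(report)           # {'doktor': 'male', 'hemşire': 'female'}
print(flagged(report))  # ['doktor', 'hemşire']
```

In practice you would run many professions and sentence templates and treat any gendered output for a gender-neutral input as a failing test; looking only at the first word is a deliberately crude heuristic.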
Another example comes from a book I recommend, Cathy O'Neil's "Weapons of Math Destruction." A woman named Katherine Taylor couldn't get a job; she kept getting rejected. When she applied to the Red Cross, she finally received a rejection that came with a clue: the Red Cross told her she was rejected because of her criminal past. It turned out that another woman with the same name and the same birthday did have a criminal past. I think most of us are more likely to work on systems like this one than on something like the recidivism software above.
Lesson 3: Even non-critical systems are critical for individual lives.
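At its core, the Katherine Taylor story is a record-linkage bug: matching people on name and birthday alone treats namesakes as the same person. Below is a minimal Python sketch contrasting that naive match with a more cautious one that only declares a match on a unique identifier and otherwise sends the case to a human. The `Person` fields, the `national_id` identifier, and all records are hypothetical; this is not a description of how the real background-check system worked.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Person:
    name: str
    birth_date: str                    # ISO date, e.g. "1980-05-17"
    national_id: Optional[str] = None  # hypothetical unique identifier


def _key(person: Person):
    return (person.name.strip().lower(), person.birth_date)


def naive_match(applicant: Person, records: List[Person]) -> List[Person]:
    """Match on name + birth date only, so namesakes collide."""
    return [r for r in records if _key(r) == _key(applicant)]


def cautious_check(applicant: Person, records: List[Person]) -> str:
    """Return 'match', 'no_match', or 'needs_review' instead of guessing."""
    candidates = naive_match(applicant, records)
    if not candidates:
        return "no_match"
    if applicant.national_id is None:
        return "needs_review"  # same name and birthday, but no way to tell
    ids = {r.national_id for r in candidates}
    if applicant.national_id in ids:
        return "match"
    if None in ids:
        return "needs_review"  # some candidate record lacks an ID
    return "no_match"  # a namesake with a different ID is a different person


criminal_records = [Person("Jane Doe", "1980-05-17", "ID-123")]
applicant = Person("Jane Doe", "1980-05-17", "ID-999")

print(len(naive_match(applicant, criminal_records)))  # 1 (a false match)
print(cautious_check(applicant, criminal_records))    # no_match
```

The design choice here is that ambiguity becomes "needs_review" rather than a silent rejection, so a person can check the case before it affects someone's life.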
I particularly like one of the observations Rachel Thomas makes: people treat algorithms differently from suggestions made by other people. Think about the judges and the recommendations from the recidivism system mentioned earlier. It's hard for judges to stay objective when presented with the results of the model; they're likely to treat the output of the algorithm as if it were objective and error-free. I highly recommend watching her keynote from PyBay 2018.
Human decision makers out there in the world have to consider the possibility that the algorithm or model has errors. So what is the appeals process for the people and groups affected by its outcomes? One must be established. That takes us to the next lesson.
Lesson 4: People trust algorithms more than we want or expect them to.
Do you have your own “lessons”? Add them in the comments section below.
Thanks for reading! Follow me on Twitter @marekog. Part 2 of this article will be posted later this week with six more lessons as well as some positive examples to follow in our daily practice.