It’s tough to read technology news these days without hearing about the wonders of Big Data and how it’s going to revolutionize our world. Apparently it will soon predict epidemics, prevent terrorist attacks, and boost farm production.
In truth, though, it’s not so clear that it’s a great thing. One of the problems with Big Data is a special case of a general problem in the ethics of technology: the kinds of things that can go wrong when the public perception of how well technology performs doesn’t match the truth. In particular: when the public thinks that technology performs way better than it does.
You will occasionally hear people talking about how algorithms are going to take our jobs, bring about the zombie apocalypse prematurely, etc. More commonly, technology gee-whizzers will tell you the opposite: that they will remove bias and introduce complete objectivity to sentencing guidelines, for instance. In fact, an algorithm is nothing more (or less) than a defined set of procedures. In the case of an algorithm for computing, it’s typically a set of calculations. An algorithm can’t be biased. It can’t be unbiased, either. The data, though: that can be biased.
An example from the interview: train an algorithm to evaluate resumes from applicants for jobs at an engineering firm. You could imagine training it with the resume of everyone who has ever been hired in the past, and the following piece of information for each person: whether or not they were a successful employee. If the engineering firm is a typical one, those previous hires are mostly going to have been males. Now the program learns the characteristics of a successful hire, and among other things, the program will conclude that a successful hire is going to be a male, since that’s almost all that it’s ever seen.
Is the algorithm biased? No. Is the person who programmed it biased? No. What’s biased? The data. Not biased in the way that a person is biased–rather, biased in the statistical sense: not every member of the population had an equal likelihood of being included in the training set.
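To make that concrete, here is a minimal Python sketch of the resume example. Everything in it is invented for illustration: the counts, the function names (`make_history`, `train`, `score`, `success_rate`), and the deliberately naive "model," which does nothing but count how often each gender shows up among past successful hires. A real resume screener would be far more complicated, but the failure could look much the same.

```python
from collections import Counter


def make_history():
    """Hypothetical hiring history as (gender, was_successful) pairs.

    The numbers are made up: 90 of the 100 past hires are men.
    """
    history = []
    history += [("male", True)] * 60 + [("male", False)] * 30
    history += [("female", True)] * 7 + [("female", False)] * 3
    return history


def train(history):
    """Learn a 'profile' of a successful hire by simple counting.

    Returns the share of each gender among the successful hires.
    """
    successes = [gender for gender, ok in history if ok]
    counts = Counter(successes)
    total = len(successes)
    return {gender: n / total for gender, n in counts.items()}


def score(profile, gender):
    """Score an applicant by how much they resemble past successes."""
    return profile.get(gender, 0.0)


def success_rate(history, gender):
    """How often hires of this gender actually turned out successful."""
    outcomes = [ok for g, ok in history if g == gender]
    return sum(outcomes) / len(outcomes)


history = make_history()
profile = train(history)
for gender in ("male", "female"):
    print(gender,
          "score:", round(score(profile, gender), 3),
          "actual success rate:", round(success_rate(history, gender), 3))
```

In this made-up history, the women were successful at a slightly higher rate than the men (7 of 10 versus 60 of 90), yet the learned profile scores a female applicant far lower, simply because so few women were in the data to begin with. The counting code is perfectly neutral; the skew comes entirely from who got into the training set.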
What seduces people about things with the Big Data label on them is the bigness. Most people know that the bigger your data set is, the more reliable the statistical model that comes out of it will be. A lot of people look at Big Data and think: there’s a LOT of data, so it’s GOT to be good. That’s where the trouble comes from: more data only helps if the data isn’t biased in the first place, and a huge biased data set is still biased.
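Here is a small simulation of that point, again as a sketch with invented numbers rather than anything from the interview. It assumes a hypothetical population split evenly into two groups, A and B, where some trait shows up in about 30% of group A and 70% of group B, so roughly 50% overall. It then compares a small sample in which everyone had an equal chance of being picked against a much bigger sample drawn only from group A.

```python
import random


def simulate(seed=0):
    """Compare a small random sample with a large biased one.

    Returns (true_rate, small_random_estimate, big_biased_estimate).
    All population figures are hypothetical.
    """
    rng = random.Random(seed)

    # 100,000 people: the trait is present in ~30% of group A, ~70% of group B.
    population = (
        [("A", rng.random() < 0.3) for _ in range(50_000)]
        + [("B", rng.random() < 0.7) for _ in range(50_000)]
    )
    true_rate = sum(has_trait for _, has_trait in population) / len(population)

    # Small sample, but every member of the population is equally likely to be in it.
    small = rng.sample(population, 500)
    small_estimate = sum(has_trait for _, has_trait in small) / len(small)

    # Big sample, but only group A ever had a chance of being included.
    group_a = [person for person in population if person[0] == "A"]
    big = rng.sample(group_a, 40_000)
    big_estimate = sum(has_trait for _, has_trait in big) / len(big)

    return true_rate, small_estimate, big_estimate


true_rate, small_estimate, big_estimate = simulate()
print("true rate:            ", round(true_rate, 3))
print("500 random people:    ", round(small_estimate, 3))
print("40,000 from group A:  ", round(big_estimate, 3))
```

The 500-person random sample should land close to the true rate of about one half, while the 40,000-person sample should stay stuck near 0.3 no matter how much bigger you make it. Eighty times more data, and a worse answer.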
I like this interview because it’s neither a gee-whiz-this-technology-is-so-great story, nor an ignorant oh-my-God-the-data-miners-are-going-to-kill-us story. The interviewee, Cathy O’Neil, knows what she’s talking about, and she explains it well. The unbiased sentencing program? It didn’t work out so great–see a very detailed story about it here.
Link to the interview with Cathy O’Neil:
- le big data: Big Data.
- les mégadonnées: Big Data.
- les données massives: Big Data.
- to sentence to (a punishment): to assign a punishment or penalty to someone. Examples: A 46-year-old man threw feces in a Clark County, Ohio, courtroom Wednesday after learning he was being sentenced to 40 years in prison for armed robbery. (Story here.) Alan Turing, the pioneering computer scientist and cryptanalyst who cracked the Nazis’ Enigma code, was sentenced to chemical castration as a punishment for his homosexuality.
- sentencing guidelines: instructions for how to determine the length of the jail or prison sentence of someone who has been convicted of a crime. How it was used in the post: More commonly, technology gee-whizzers will tell you the opposite: that they will remove bias and introduce complete objectivity to sentencing guidelines, for instance.