Tech-makers assuming their reality accurately represents the world create many different kinds of problems. The training data for ChatGPT is believed to include most or all of Wikipedia, pages linked from Reddit, a billion words grabbed off the internet. (It can’t include, say, e-book copies of everything in the Stanford library, as books are protected by copyright law.) The humans who wrote all those words online overrepresent white people. They overrepresent men. They overrepresent wealth. What’s more, we all know what’s out there on the internet: vast swamps of racism, sexism, homophobia, Islamophobia, neo-Nazism.
Tech companies do put some effort into cleaning up their models, often by filtering out chunks of speech that include any of the 400 or so words on “Our List of Dirty, Naughty, Obscene, and Otherwise Bad Words,” a list that was originally compiled by Shutterstock developers and uploaded to GitHub to automate the concern, “What wouldn’t we want to suggest that people look at?” OpenAI also contracted out what’s known as ghost labor: gig workers, including some in Kenya (a former British Empire state, where people speak Empire English) who make $2 an hour to read and tag the worst stuff imaginable — pedophilia, bestiality, you name it — so it can be weeded out. The filtering leads to its own issues. If you remove content with words about sex, you lose content of in-groups talking with one another about those things.
These things are not AI. Repeat after me: these things are not AI. All they do is statistically predict the best next sequence of words based on a corpus of texts. That’s it. I’m not worried about these things leading to SkyNet – I’m much more worried about smart people falling for the hype.