How Are We Really Training GenAI?
- Brittany Luckham
- Jun 13
- 4 min read
Part 2: Examining Generative AI
Last time, we covered Generative AI’s carbon footprint and energy usage. This time, I will look at how we train AI and what type of information is used to do so. I’ll also cover the gender and race biases that have arisen in the data.
LibGen and Pirated Books
As a writer, I was affronted when I found out that Meta had used LibGen, an online library full of pirated books, to train its AI. Victoria Aveyard, a New York Times bestselling author I follow on social media, was interviewed by WBUR after discovering that up to 50 of her works (including translations, since she has written only eight novels) may have been used by Meta without her permission or any compensation.
The article continues, stating that “a recent report from The Atlantic found tens of millions of books and research papers on LibGen were downloaded by Meta, without the use of a license, in order to train the tech company’s generative-AI models.”
Tens of millions. Books and research papers. This raises the question: what else is being used to train AI?
Types of Data
Potter Clarkson nicely broke down how AI is trained, so I’ll cite them here.
According to Potter Clarkson, there are three types of data used in the process of creating an AI model:
Training data — data used to train the AI model
Test data — data used to test the model and compare it to other models
Validation data — data used to validate the final AI model
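The three data roles above can be illustrated with a minimal split. The 70/15/15 proportions below are a common convention, not something Potter Clarkson prescribes, and the function name is my own.

```python
import random

def split_dataset(records, train=0.7, validation=0.15, seed=42):
    """Shuffle records and divide them into the three roles described
    above: training, validation, and test data."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    return {
        "training": shuffled[:n_train],                   # fits the model
        "validation": shuffled[n_train:n_train + n_val],  # validates the final model
        "test": shuffled[n_train + n_val:],               # compares it to other models
    }

splits = split_dataset(list(range(100)))
print({k: len(v) for k, v in splits.items()})
# {'training': 70, 'validation': 15, 'test': 15}
```

Every record lands in exactly one bucket; keeping the test data out of training is what makes the final comparison honest.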
Training data is either structured or unstructured. Structured data, for example, can be market data shown in tables; unstructured data includes audio, video, and images. This data may be customer data sourced internally by organizations, or it may come from external third parties, which can include “the government, research institutions, and companies for commercial purposes.”
Scientific American adds that “Web crawlers and scrapers can easily access data from just about anywhere that is not behind a login page. Social media profiles set to private aren’t included.”
What else is included, you ask?
Anything posted on Flickr
Online marketplaces
Voter-registration databases
Wikipedia
Reddit
Research repositories
News outlets
Academic institutions
Pirated content compilations and web archives
Customers’ Alexa conversations
And more
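The crawler behavior Scientific American describes can be sketched in a few lines. Well-behaved crawlers check a site’s robots.txt before fetching, but even those see everything a site exposes publicly; only pages behind a login are out of reach. The rules and URLs below are invented for illustration.

```python
import urllib.robotparser

def allowed_by_robots(robots_lines, url, user_agent="example-crawler"):
    """Return True if the given robots.txt rules permit fetching `url`.
    Note this is voluntary etiquette: a scraper that skips the check
    can still reach any page that isn't behind a login."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_lines)
    return rp.can_fetch(user_agent, url)

# Hypothetical robots.txt: the site closes off /private/ but nothing else.
rules = ["User-agent: *", "Disallow: /private/"]
print(allowed_by_robots(rules, "https://example.com/photos/cat.jpg"))  # True
print(allowed_by_robots(rules, "https://example.com/private/diary"))   # False
```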
Ben Zhao, a computer scientist at the University of Chicago, points to “one particularly striking example where an artist discovered that a private diagnostic medical image of her was included in the LAION [AI] database.” Additionally, “reporting from Ars Technica confirmed the artist’s account and that the same dataset contained medical-record photographs of thousands of other people as well.”
Data Bias
Meredith Broussard, a data journalist who researches AI at New York University, brings forward concerns about transparency and data bias. She warns that while there is “wonderful stuff on the internet, there is also extremely toxic material too.” Think white supremacist websites, hate speech, and fake news. She adds, “It’s bias in, bias out.”
A research study titled Ethics and discrimination in artificial intelligence-enabled recruitment practices aimed to “address the research gap on algorithmic discrimination caused by AI-enabled recruitment.” They found that “algorithmic bias results in discriminatory hiring practices based on gender, race, colour, and personality traits.”
How does this happen? Well, consider how AI is shaped. We, as humans, feed it data, but that data may already be unfair and contain discriminatory, prejudiced, and incomplete information.
For example, “In 2014, Amazon developed an ML-based hiring tool, but it exhibited gender bias. The bias stemmed from training the AI system on predominantly male employees’ CVs (Beneduce, 2020), thus, the recruitment algorithm perceived this biased model as indicative of success, resulting in discrimination against female applicants (Langenkamp et al., 2019).”
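The “bias in, bias out” mechanism behind the Amazon example is easy to reproduce with a toy model: if the historical records skew male and past decisions favored men, even a model that does nothing more than memorize historical hire rates reproduces the discrimination. The numbers below are invented for illustration, not Amazon’s actual data.

```python
# Invented historical hiring records: (gender, hired) pairs.
# The pool is mostly male, and past decisions favored men.
history = ([("M", True)] * 80 + [("M", False)] * 20 +
           [("F", True)] * 5 + [("F", False)] * 15)

def learned_hire_rate(records, gender):
    """A 'model' that just memorizes the historical hire rate per group --
    the simplest way biased training data becomes a biased predictor."""
    outcomes = [hired for g, hired in records if g == gender]
    return sum(outcomes) / len(outcomes)

print(learned_hire_rate(history, "M"))  # 0.8
print(learned_hire_rate(history, "F"))  # 0.25
```

No one wrote “prefer men” into the code; the preference lives entirely in the data the model was fed.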
Furthermore, Tay, Microsoft’s chatbot, learned to produce sexist and racist remarks on Twitter. It did this through “interacting with users on the platform, and absorbing the natural form of human language.” Unfortunately, the chatbot quickly adopted hate speech targeting women and black individuals. Tay was shut down shortly after.
Research has also indicated that when machines passively absorb human biases, they can reflect subconscious biases (Fernández and Fernández, 2019; Ong, 2019). For instance, “searches for names associated with Black individuals were more likely to be accompanied by advertisements featuring arrest records, even when no actual records existed.”
Data cannot exist without bias. We may think it can, but we humans are full of conscious and unconscious biases, so of course those biases end up in our data. If anything, these findings about AI models cast a harsh light on the state of online human interaction.
Final Thoughts
No data is safe from being used to train AI. Why? Because everyone is seeking to capitalize on it, including governments and research institutions. Everything from what we post on social media to our medical files is being used to train AI and GenAI. The results confirm the biases we know exist throughout the world, and the process is stealing humanity’s greatest achievement, the one thing that will outlast, and has outlasted, complete ruin: art and creativity.
There’s an argument to be made that AI creates “transformative works,” thereby not violating any copyright laws. However, there are two counterarguments to this. One, the underlying work will always be stolen intellectual property. No matter how you frame it, the books, research papers, and data sets are not, in most cases, willingly handed over. In fact, many of these companies are not even asking permission to use other people’s work.
Two, when I think of transformative works, I think first of fanfiction. Specifically, fanfiction that has been changed enough from the original work that it can be published as its own entity. My Life With The Walter Boys began as Vampire Diaries fanfiction. The After series by Anna Todd was once One Direction fanfic. If you weren’t aware of this before, you’d be hard pressed to connect some of those dots. However, those fanfictions were still originally written by people. The issue with GenAI is that every word, sentence, or line of dialogue was taken directly from others’ work.
Writing is a craft. It’s a skill that can be practiced and honed by reading and analyzing others’ books and stories, and by writing and experimenting with different genres, styles, and voices. I wonder how anyone will find their unique writing voice if they are only allowed to choose from existing voices.
I tackle this dilemma in the next part of this series, where I’ll examine GenAI’s effect on creativity and critical thinking skills. Stay tuned.