
What's Behind the ChatGPT History Change? How You Can Benefit + The 6 New Developments This Week


Chapters

0:00 Intro
0:59 What does this mean
2:44 Why this matters
3:56 How AI companies collect data
5:18 What's in the training set
6:52 Stack Overflow
10:51 The Future

Transcript

18 hours ago, Sam Altman put out this simple tweet: you can now disable chat history and training in ChatGPT, and they will offer ChatGPT Business in the coming months. But dig a little deeper and behind this tweet is a data controversy that could engulf OpenAI, jeopardize GPT-5 and shape the new information economy.

I will show you how you can benefit from this new feature, reveal how you can check if your personal info was likely used in GPT-4 training and investigate whether ChatGPT could be banned in the EU, Brazil, California and beyond. But first the announcement. OpenAI say that you can now turn off chat history in ChatGPT but that it's only conversations that were started after chat history is disabled that won't be used to train and improve their models.

Meaning that, by default, your existing conversations will still be used for training. So if you're unsure, you may want to turn off chat history in ChatGPT sooner rather than later. So how does it work, and what does this mean? What you need to do is click on the three dots at the bottom left of a ChatGPT conversation, then go to Settings, where you'll find the new toggle.

And here's where it starts to get interesting. They have linked together chat history and training. It's both or neither. They could have given two separate options. One to store your chat history so that you can look back over it later and another to opt out of training. But instead it's one button.

You either give them your data and keep your chats, or you don't give them your data and you don't keep your chats. If you opt not to give them your chat history, they still monitor the chats for what they call abuse, so bear that in mind. What if you want to keep your history on but disable model training?

Their answer: they are working on a new offering called ChatGPT Business. I'm going to talk about that in a moment, but clearly they don't want to make it easy to opt out of handing over your training data. Now, in fairness, they do offer an opt-out form. But if you go to the form, it says, cryptically, please know that in some cases this will limit the ability of our models to better address your specific use case.

So that's one big downside to this new announcement. But what's one secret upside? This export data button buried all the way down here. If you click it, you quite quickly get this email, which contains a link to download a data export of all your conversations. After you download the file and open it, you now have an easy way to search through all your previous conversations, literally all of them from the time you first started using ChatGPT to the present day.
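By the way, if you'd rather dig through that export programmatically than scroll the bundled HTML file, a few lines of Python will do it. This is a minimal sketch, assuming the export ZIP contains a conversations.json laid out as a list of conversations, each with a title and a mapping of message nodes; the exact field names may vary between export versions, so treat it as a starting point.

    # Minimal sketch: keyword-search a ChatGPT data export.
    # Assumption: the export ZIP contains conversations.json, structured as a list
    # of conversations, each with a "title" and a "mapping" of message nodes.
    # Field names may differ between export versions -- adjust as needed.
    import json
    import sys
    import zipfile

    def search_export(zip_path: str, keyword: str) -> None:
        keyword = keyword.lower()
        with zipfile.ZipFile(zip_path) as zf:
            conversations = json.loads(zf.read("conversations.json"))
        for convo in conversations:
            title = convo.get("title") or "(untitled)"
            for node in convo.get("mapping", {}).values():
                message = node.get("message") or {}
                parts = (message.get("content") or {}).get("parts") or []
                text = " ".join(p for p in parts if isinstance(p, str))
                if keyword in text.lower():
                    print(f"{title}: ...{text[:120]}...")
                    break  # one hit per conversation is enough

    if __name__ == "__main__":
        # e.g. python search_export.py chatgpt-export.zip "gdpr"
        search_export(sys.argv[1], sys.argv[2])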

That is a pretty great feature, I must admit. But going back to the announcement, they said that you need to upgrade to ChatGPT Business, available in the coming months, to ensure that your data won't be used to train their models by default. But why these announcements now? Why did Sam Altman tweet this just yesterday?

Well, this article, also from yesterday, in the MIT Technology Review by Melissa Heikkilä may explain why. It said that OpenAI has until the end of this week to comply with Europe's strict data protection regime, the GDPR, but that it will likely be impossible for the company to comply because of the way data for AI is collected.

Before you leave and say this is just about Europe, no, it's much bigger than that. The European Data Protection Supervisor said that the definition of hell might be coming for OpenAI, based on the potentially illegal way it collected data. If OpenAI cannot convince the authorities its data use practices are legal, it could be a big problem for the company.

So, let's get to the point. OpenAI could be banned not only in specific countries like Italy but across the entire EU; it could also face hefty fines and might even be forced to delete models and the data used to train them. The stakes could not be higher for OpenAI. The EU's GDPR is the world's strictest data protection regime, and it has been copied widely around the world.

Regulators everywhere from Brazil to California will be paying close attention to what happens next, and the outcome could fundamentally change the way AI companies are treated. So how do these companies collect your data in the first place? Well, two articles published this week tell us much more.

To take one example, they harvest pirated ebooks from the site formerly known as BookZZ, at least until it was seized by the FBI last year. Despite that, the contents of the site remain in the Common Crawl database. OpenAI won't reveal the dataset used to train GPT-4, but we know Common Crawl was used to train GPT-3.

Common Crawl is a very popular source of training data, and OpenAI may have also used the Pile, which Stability AI recently used for its new LLM, StableLM. The Pile contains more pirated ebooks, but also things like every internal email sent by Enron. And if you think that's strange, wait until you hear about the copyright takedown policy of the group that maintains the Pile.

I can't even read it out for the video. This article from the Washington Post reveals even more about the data that was likely used to train GPT-4. For starters, we have the exclusive content of Patreon, so presumably all my Patreon messages will be used to train GPT-5. But further down in the article, we have this search bar where you can look up whether your own website was included in Google's C4 dataset, a filtered snapshot of Common Crawl.
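As an aside, you don't strictly need the Post's search bar for this kind of check: Common Crawl publishes a public index you can query yourself. Below is a rough Python sketch; the crawl label CC-MAIN-2023-14 is only an example (each crawl has its own label, listed at index.commoncrawl.org), and bear in mind that appearing in Common Crawl does not prove a page ended up in any particular model's training set.

    # Rough sketch: ask the public Common Crawl CDX index whether a domain was
    # captured in a given crawl. The crawl label is only an example; pick a
    # current one from https://index.commoncrawl.org/.
    import json
    import urllib.error
    import urllib.parse
    import urllib.request

    CRAWL = "CC-MAIN-2023-14"  # assumption: substitute the crawl you care about

    def pages_in_common_crawl(domain: str, limit: int = 5) -> list:
        query = urllib.parse.urlencode(
            {"url": f"{domain}/*", "output": "json", "limit": limit}
        )
        url = f"https://index.commoncrawl.org/{CRAWL}-index?{query}"
        try:
            with urllib.request.urlopen(url) as resp:
                lines = resp.read().decode().splitlines()
        except urllib.error.HTTPError:
            return []  # the index answers 404 when a domain has no captures
        return [json.loads(line)["url"] for line in lines]

    if __name__ == "__main__":
        for page in pages_in_common_crawl("example.wordpress.com"):
            print(page)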

I even found my mum's WordPress family blog, so it's possible that GPT-5 will remember more about my childhood than I do. If you think that's kind of strange, wait until you hear that OpenAI themselves might not even know what's in their training set. This comes from the GPT-4 technical report, and in one of the footnotes it says that portions of the BIG-bench benchmark were inadvertently mixed into the training set.

That word inadvertently is rather startling. For the moment, let's not worry about how mixing in benchmarks might somewhat obscure our ability to test GPT-4. Let's just focus on that word inadvertently. Do they really not know entirely what's in their dataset? Whether they do or not, I want you to get ready to count the number of ways that OpenAI may soon have to pay for the data it once got for free.
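To see why that matters, here is a toy sketch of the kind of overlap check a lab might run to catch benchmark contamination before evaluating. The real decontamination procedures described in the GPT-3 and GPT-4 reports use n-gram matching at scale; this simplified substring version is purely illustrative.

    # Toy sketch of a contamination check: does any benchmark question appear
    # verbatim (after light normalization) inside the training corpus?
    # Real decontamination pipelines use n-gram overlap at scale; this is only
    # illustrative.
    import re

    def normalize(text: str) -> str:
        return re.sub(r"\s+", " ", text.lower()).strip()

    def contaminated(benchmark_items, training_docs):
        corpus = " ".join(normalize(doc) for doc in training_docs)
        return [item for item in benchmark_items if normalize(item) in corpus]

    if __name__ == "__main__":
        docs = ["The quick brown fox jumps over the lazy dog.", "Some Enron email..."]
        bench = ["quick brown fox jumps over the lazy", "What is the capital of France?"]
        print(contaminated(bench, docs))  # -> the first item only: an overlap hit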

First, Reddit. For earlier GPT models, OpenAI trawled Reddit for links shared in posts with three or more upvotes and included the linked pages in the training data. Now, this New York Times article says Reddit wants them to pay for the privilege. The founder and chief executive of Reddit said that the Reddit corpus of data is really valuable, but that they don't need to give all of that value to some of the largest companies in the world for free.

I agree, but my question is, will the users be paid? In fact, that's my question for all of the examples you have seen in this video and are about to see. Does the user actually get paid? If OpenAI is set to make trillions of dollars, as Sam Altman has said, will you get paid for helping to train it?

Apparently, Reddit is right now negotiating fees with OpenAI, but will its users get any of that money? What about the Wikipedia editors who spend thousands of hours making sure articles are accurate, only for GPT-4 or 5 to trawl all of that for free? Or what about Stack Overflow?

Stack Overflow, the Q&A site for programmers, is apparently also going to charge AI giants for training data. Its CEO said that users own the content they post on Stack Overflow under a Creative Commons license, but that the license requires anyone later using the data to mention where it came from.

But of course, GPT-4 doesn't mention where its programming tricks come from. Is it just me, or is there some irony in people being generous enough to give out answers to programming questions ending up training a model that may one day replace them, all the while getting no credit or compensation?

But now we must turn to lawsuits, because there are plenty of people getting ready to take this to court. Microsoft, GitHub, and OpenAI were recently sued, with the companies accused of scraping licensed code to build GitHub's AI-powered Copilot tool. And in an interesting response, Microsoft and GitHub said that the complaint has certain defects, including a lack of injury.

They argue that the plaintiffs rely on hypothetical events to make their claim, and say that the plaintiffs don't describe how they were personally harmed by the tool. That could be the big benchmark: these lawsuits may fail for now because no one can prove harm from GPT-4. But what does that bode for the future, when some people inevitably get laid off because they're simply not needed anymore, because GPT-4 or GPT-5 can do their jobs?

Would these lawsuits succeed then? When you can prove that you've lost a job because of a specific tool, one trained in part on your own data, then there is injury there that you could prove. But then, if you block GPT-4 or GPT-5, there will be millions of coders who can say that they're injured because their favorite tool has now been lost.

I have no idea how that's going to pan out in the courts. Of course, these are not the only lawsuits, with the CEO of Twitter weighing in, accusing OpenAI of illegally using Twitter data. And what about publishers, journalists, and newspapers, whose work might not be read as much because people can get their answers from GPT-4?

And don't forget their websites were also crawled to train the models. Well, the CEO of News Corp said that clearly they are using proprietary content, and there should obviously be some compensation for that. So it seems like there are lawsuits coming in from every direction. But Sam Altman has said in the past, "We're willing to pay a lot for very high quality data in certain domains such as science." Will that actually enrich scientists and mathematicians, or will it just add to the profits of the massive scientific publishers?

That's another scandal for another video, but I am wondering if OpenAI will be tempted to use some illicit sites instead, such as Sci-Hub, a shadow library that provides free access to millions of research papers without regard to copyright. It basically gets past the scientific publishers' paywalls, and apparently up to 50% of academics say that they use websites like Sci-Hub.

To be clear, just because academics use these sites doesn't mean OpenAI will use the same data. Either way, I think GPT-5 is going to break through some new science benchmarks. I just wish that the scientists whose work went into training it were compensated for helping it do so.

Just in case it seems like I'm picking on OpenAI, Google are just as secretive, and they were even accused by their own employees of training Bard with ChatGPT data. They have strenuously denied this, but that didn't stop Sam Altman from saying, "I'm not that annoyed at Google for training on ChatGPT output, but the spin is annoying." He, obviously, doesn't believe their denial.

And given all of this discussion of copyright and scraping data, I found this headline supremely ironic: OpenAI are trying to trademark the name GPT. Meaning all of those models that you've heard of, AutoGPT, MemoryGPT, HuggingGPT, might be stopped from using that name.

Imagine a world where they win all of their battles in court and they can use everyone's data, but no one can use their name GPT. But maybe this entire data issue won't be relevant for much longer. Sam Altman recently said that he predicts OpenAI data spend will go down as models get smarter.

I wonder if he means that the models might be able to generate their own synthetic training data and therefore not require as much outside data. Or, of course, he could be talking about simplifying the reinforcement learning from human feedback (RLHF) phase, where essentially the model gives itself feedback, reducing the need for human evaluators.

Wouldn't that be quite something, if GPT-4 was used to help train GPT-5? I fluctuate between being amazed, annoyed and deeply concerned about where all of this is going. Let me know in the comments what you think of it all, and have a