What's Behind the ChatGPT History Change? How You Can Benefit + The 6 New Developments This Week
Chapters
0:00 Intro
0:59 What does this mean
2:44 Why this matters
3:56 How AI companies collect data
5:18 What's in the training set
6:52 Stack Overflow
10:51 The Future
18 hours ago Sam Altman put out this simple tweet: that you can now disable chat history 00:00:06.080 |
and training in ChatGPT, and that they will offer ChatGPT Business in the coming months. 00:00:11.760 |
But dig a little deeper and behind this tweet is a data controversy that could engulf OpenAI, 00:00:18.340 |
jeopardize GPT-5 and shape the new information economy. I will show you how you can benefit 00:00:24.700 |
from this new feature, reveal how you can check if your personal info was likely used in GPT-4 00:00:30.540 |
training and investigate whether ChatGPT could be banned in the EU, Brazil, California and beyond. 00:00:37.080 |
But first the announcement. OpenAI say that you can now turn off chat history in ChatGPT but that 00:00:43.760 |
it's only conversations that were started after chat history is disabled that won't be used to 00:00:49.860 |
train and improve their models. Meaning that by default your existing conversations 00:00:54.420 |
will still be used for training. 00:00:54.680 |
So how does it work and what does this mean? What you need to do is click on the three dots 00:01:01.800 |
at the bottom left of a ChatGPT conversation, then go to settings. And here's where 00:01:07.520 |
it starts to get interesting. They have linked together chat history and training. It's both 00:01:12.580 |
or neither. They could have given two separate options. One to store your chat history so that 00:01:17.420 |
you can look back over it later and another to opt out of training. But instead it's one button. 00:01:21.980 |
You either give them your data and keep your chats, 00:01:24.660 |
or you don't give them your data and you don't keep your chats. If you opt not to give them your 00:01:29.360 |
chat history, they still monitor the chats for what they call abuse. So bear that in mind. 00:01:33.880 |
"What if I want to keep my history on but disable model training?" "We are working on a new offering 00:01:39.140 |
called ChatGPT Business." I'm going to talk about that in a moment, but clearly they don't want to 00:01:44.280 |
make it easy to opt out of giving over your training data. Now, in fairness, they do offer 00:01:49.500 |
an opt-out form. But if you go to the form, it says cryptically, "Please know that 00:01:54.640 |
in some cases this will limit the ability of our models to better address your specific use case." 00:02:00.280 |
So that's one big downside to this new announcement. But what's one secret upside? 00:02:04.720 |
This export data button buried all the way down here. If you click it, you quite quickly get this 00:02:10.920 |
email, which contains a link to download a data export of all your conversations. 00:02:16.160 |
After you download the file and open it, you now have an easy way to search through 00:02:20.500 |
all your previous conversations, literally all of them from the time you first 00:02:24.620 |
started using ChatGPT to the present day. That is a pretty great feature, I must admit. 00:02:29.660 |
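(For anyone who wants to dig through that export programmatically: the download is a zip that, at the time of writing, appears to contain a conversations.json file, so a few lines of Python are enough to search it. A rough sketch, with the file layout being an assumption that may change:)

```python
# A rough sketch for searching a ChatGPT data export.
# Assumes the export zip contains a conversations.json file holding a
# list of conversation objects; the exact layout may change over time.
import json
import zipfile

def search_export(zip_path: str, query: str) -> None:
    """Print the title of every conversation that mentions `query`."""
    with zipfile.ZipFile(zip_path) as zf:
        conversations = json.loads(zf.read("conversations.json"))

    needle = query.lower()
    for conv in conversations:
        # Serialize the whole conversation and search the text; crude,
        # but it avoids depending on the exact message schema.
        if needle in json.dumps(conv).lower():
            print(conv.get("title", "(untitled)"))

search_export("chatgpt-export.zip", "GDPR")
```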
But going back to the announcement, they said that you need to upgrade to ChatGPT Business, 00:02:33.920 |
available in the coming months, to ensure that your data won't be used to train their models by default. 00:02:39.920 |
But why these announcements now? Why did Sam Altman tweet this just yesterday? 00:02:44.060 |
Well, this article also from yesterday in the MIT Technology Review by Melissa Heikkilä may explain 00:02:51.240 |
why. It said that OpenAI has until the end of this 00:02:54.520 |
week to comply with Europe's strict data protection regime, the GDPR, but that it will likely be 00:03:00.860 |
impossible for the company to comply because of the way data for AI is collected. Before you leave 00:03:06.380 |
and say this is just about Europe, no, it's much bigger than that. The European Data Protection 00:03:10.840 |
Supervisor said that the definition of hell might be coming for OpenAI based on the potentially 00:03:17.100 |
illegal way it collected data. If OpenAI cannot convince the authorities its data use practices 00:03:23.100 |
are legal, it could be a big problem for the company. 00:03:24.500 |
OpenAI could not only be banned in specific countries like Italy or even the entire EU, it could also face hefty 00:03:30.080 |
fines and might even be forced to delete models and the data used to train them. The stakes could 00:03:36.480 |
not be higher for OpenAI. The EU's GDPR is the world's strictest data protection regime and it 00:03:42.820 |
has been copied widely around the world. Regulators everywhere from Brazil to California will be 00:03:49.040 |
paying close attention to what happens next and the outcome could fundamentally change the way AI 00:03:54.060 |
companies are going to be treated. 00:03:54.400 |
How do these companies collect your data? Well, two articles published this week tell us much more. 00:04:03.460 |
To take one example, they harvest pirated ebooks from the site formerly known as BookZZ, until that 00:04:10.140 |
was seized by the FBI last year. Despite that, contents of the site remain in the Common Crawl 00:04:16.300 |
database. OpenAI won't reveal the dataset used to train GPT-4, but we know the Common Crawl was used 00:04:23.060 |
to train GPT-3. 00:04:24.300 |
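(Incidentally, you don't have to take anyone's word on what's in Common Crawl: the project exposes a public CDX index at index.commoncrawl.org that you can query yourself. A rough sketch; the crawl ID below is just an example and the endpoint format could change:)

```python
# A rough sketch of querying Common Crawl's public CDX index to count
# captures of a domain. The crawl ID is just an example; check
# index.commoncrawl.org for the current crawl names.
import urllib.error
import urllib.parse
import urllib.request

def commoncrawl_captures(domain: str, crawl: str = "CC-MAIN-2023-14") -> int:
    params = urllib.parse.urlencode({"url": f"{domain}/*", "output": "json"})
    url = f"https://index.commoncrawl.org/{crawl}-index?{params}"
    try:
        with urllib.request.urlopen(url) as resp:
            # The API returns one JSON record per captured page, one per line.
            return sum(1 for line in resp if line.strip())
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no captures of this domain in this crawl
            return 0
        raise

print(commoncrawl_captures("example.com"))
```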
OpenAI may have also used the Pile, which was used recently by Stability AI for their new LLM, 00:04:31.120 |
StableLM. The Pile contains more pirated ebooks, but also things like every internal email sent 00:04:38.160 |
by Enron. And if you think that's strange, wait until you hear about the copyright takedown policy 00:04:44.280 |
of the group that maintains the Pile. I can't even read it out for the video. 00:04:48.960 |
This article from the Washington Post reveals even more about the data that was likely used 00:04:54.200 |
to train GPT-4. For starters, we have the exclusive content of Patreon, so presumably all my Patreon 00:05:00.520 |
messages will be used to train GPT-5. But further down in the article, we have this search bar where 00:05:05.720 |
you can look into whether your own website was used in the Common Crawl dataset. I even found my 00:05:11.320 |
mum's WordPress family blog, so it's possible that GPT-5 will remember more about my childhood than I 00:05:17.800 |
do. If you think that's kind of strange, wait until you hear that OpenAI themselves might not even know what's in their 00:05:24.100 |
training set. This comes from the GPT-4 technical report, and in one of the footnotes it says that 00:05:30.100 |
portions of the BIG-bench benchmark were inadvertently mixed into the training set. 00:05:35.700 |
That word inadvertently is rather startling. For the moment, let's not worry about how mixing in 00:05:41.300 |
benchmarks might somewhat obscure our ability to test GPT-4. Let's just focus on that word inadvertently. 00:05:48.180 |
Do they really not know entirely what's in their dataset? Whether they do or not, I want you to get ready to count 00:05:54.000 |
the number of ways that OpenAI may soon have to pay for the data it once got for free. First, 00:06:00.660 |
Reddit. They trawled Reddit for all posts that got three or more upvotes and included them in the 00:06:06.500 |
training data. Now, this New York Times article says, Reddit wants them to pay for the privilege. 00:06:11.940 |
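(That upvote filter is simple in spirit. Below is a toy sketch of the idea, assuming a JSON-lines dump of Reddit submissions with "score" and "url" fields; the real collection pipeline was far more involved than this:)

```python
# A toy illustration of the upvote filter described above: keep the
# outbound link from every Reddit post scoring 3 or more. Assumes a
# JSON-lines dump of submissions with "score" and "url" fields.
import json

def filter_links(dump_path: str, min_score: int = 3) -> list[str]:
    links = []
    with open(dump_path, encoding="utf-8") as f:
        for line in f:
            post = json.loads(line)
            if post.get("score", 0) >= min_score and post.get("url"):
                links.append(post["url"])
    return links

print(len(filter_links("reddit_submissions.jsonl")))
```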
The founder and chief executive of Reddit said that the Reddit corpus of data is really valuable, 00:06:17.200 |
but we don't need to give all of that value to some of the largest companies in the world for 00:06:23.900 |
free. But will users be paid? In fact, that's my question for all of the examples you have seen in this video 00:06:28.480 |
and are about to see. Does the user actually get paid? If OpenAI is set to make trillions of 00:06:33.960 |
dollars, as Sam Altman has said, will you get paid for helping to train it? Apparently, Reddit is 00:06:39.220 |
right now negotiating fees with OpenAI, but will its users get any of that money? What about the 00:06:44.480 |
Wikipedia editors that spend thousands of hours to make sure the article is accurate, and then GPT-4 00:06:50.260 |
or 5 just trawls all of that for free? Or what about 00:06:53.880 |
Stack Overflow, the Q&A site for programmers? Apparently, they are now going to 00:06:57.740 |
also charge AI giants for training data. The CEO said that users own the content that they post 00:07:04.000 |
on Stack Overflow under the Creative Commons license, but that that license requires anyone 00:07:09.500 |
later using the data to mention where it came from. But of course, GPT-4 doesn't mention where 00:07:14.540 |
its programming tricks come from. Is it just me, or is there some irony in people being 00:07:19.260 |
generous enough to give out answers to programming questions, thereby training a model 00:07:23.860 |
that may one day end up replacing them, all the while giving them no credit or compensation? 00:07:29.400 |
But now we must turn to lawsuits, because there are plenty of people getting ready to take this 00:07:34.380 |
to court. Microsoft, GitHub, and OpenAI were recently sued, with the companies accused of 00:07:39.760 |
scraping licensed code to build GitHub's AI-powered Copilot tool. And in an interesting response, 00:07:46.160 |
Microsoft and GitHub said that the complaint has certain defects, including a lack of injury. 00:07:53.840 |
They argue that the plaintiffs rely on hypothetical events to make their claim, and say that they 00:07:58.280 |
don't describe how they were personally harmed by the tool. That could be the big benchmark, 00:08:03.260 |
where these lawsuits fail currently because no one can prove harm from GPT-4. But how 00:08:09.060 |
does that bode for the future when some people inevitably get laid off because they're 00:08:14.000 |
simply not needed anymore because GPT-4 or GPT-5 can do their jobs? 00:08:18.400 |
Then would these lawsuits succeed? When you can prove that you've lost a job because of a specific tool, 00:08:23.820 |
which was trained using in part your own data, then there is injury there that you could prove. 00:08:29.340 |
But then if you block GPT-4 or GPT-5, there will be millions of coders who can then say that they're 00:08:35.880 |
injured because their favorite tool has now been lost. I have no idea how that's going to pan out 00:08:41.520 |
in the courts. Of course, these are not the only lawsuits with the CEO of Twitter weighing in, 00:08:46.600 |
accusing OpenAI of illegally using Twitter data. 00:08:49.680 |
And what about publishers, journalists, and newspapers, whose work might 00:08:53.800 |
not be read as much because people can get their answers from GPT-4? 00:08:57.640 |
And don't forget their websites were also crawled to train the models. 00:09:01.320 |
Well, the CEO of News Corp said that clearly they are using proprietary content. There should be, 00:09:07.400 |
obviously, some compensation for that. So it seems like there are lawsuits coming in from 00:09:12.280 |
every direction. But Sam Altman has said in the past, "We're willing to pay a lot for very high 00:09:18.120 |
quality data in certain domains such as science." Will that actually enrich scientists and mathematicians 00:09:23.780 |
or will it just add to the profits of the massive scientific publishers? 00:09:28.900 |
That's another scandal for another video, but I am wondering if OpenAI will be tempted to use some 00:09:34.580 |
illicit sites instead, such as Sci-Hub, a shadow library website that provides free access to 00:09:40.420 |
millions of research papers without regard to copyright. It basically gets past the scientific 00:09:46.260 |
publishers' paywall and apparently up to 50% of academics say that they use websites like Sci-Hub. 00:09:53.060 |
So I think GPT-5 is going to break through some 00:09:55.540 |
new science benchmarks. I just wish that the scientists whose work went into training it 00:10:00.620 |
were compensated for helping it do so. Just in case it seems like I'm picking on OpenAI, 00:10:05.780 |
Google are just as secretive and they were even accused by their own employees of training Bard 00:10:11.620 |
with ChatGPT data. They have strenuously denied this, but it didn't stop Sam Altman from saying, 00:10:17.540 |
"I'm not that annoyed at Google for training on ChatGPT output, but the spin is annoying." He, 00:10:23.340 |
obviously, doesn't believe their denial. And given all of this discussion 00:10:27.120 |
on copyright and scraping data, I found this headline supremely ironic. OpenAI are trying to 00:10:33.880 |
trademark the name GPT. Meaning all of those models that you've heard of, AutoGPT, MemoryGPT, 00:10:40.120 |
HuggingGPT, they might be stopped from using that name. Imagine a world where they win all of their 00:10:46.000 |
battles in court and they can use everyone's data, but no one can use their name GPT. 00:10:53.720 |
But this whole data question might not be relevant for much longer. Sam Altman recently said that he predicts OpenAI's data spend will go down as 00:11:00.860 |
models get smarter. I wonder if he means that the models might be able to generate their own synthetic 00:11:05.900 |
training data and therefore not require as much outside data. Or of course he could be talking 00:11:11.140 |
about simplifying the reinforcement learning from human feedback phase, where essentially the model 00:11:16.540 |
gives itself feedback, reducing the need for human evaluators. Wouldn't that be quite something if GPT-4 00:11:23.700 |
was used to train GPT-5? I fluctuate between being amazed, annoyed and deeply concerned about 00:11:38.580 |
where all of this is going. Let me know in the comments what you think of it all and have a wonderful day.