
What's Behind the ChatGPT History Change? How You Can Benefit + The 6 New Developments This Week


Chapters

0:00 Intro
0:59 What does this mean
2:44 Why this matters
3:56 How AI companies collect data
5:18 What's in the training set
6:52 Stack Overflow
10:51 The Future

Whisper Transcript | Transcript Only Page

00:00:00.000 | 18 hours ago Sam Altman put out this simple tweet that you can now disable chat history
00:00:06.080 | and training in ChatGPT and that we will offer ChatGPT Business in the coming months.
00:00:11.760 | But dig a little deeper and behind this tweet is a data controversy that could engulf OpenAI,
00:00:18.340 | jeopardize GPT-5 and shape the new information economy. I will show you how you can benefit
00:00:24.700 | from this new feature, reveal how you can check if your personal info was likely used in GPT-4
00:00:30.540 | training and investigate whether ChatGPT could be banned in the EU, Brazil, California and beyond.
00:00:37.080 | But first the announcement. OpenAI say that you can now turn off chat history in ChatGPT but that
00:00:43.760 | it's only conversations that were started after chat history is disabled that won't be used to
00:00:49.860 | train and improve their models. Meaning that by default your existing conversations
00:00:54.420 | will be used.
00:00:54.680 | So how does it work and what does this mean? What you need to do is click on the three dots
00:01:01.800 | at the bottom left of a ChatGPT conversation, then go to settings. And here's where
00:01:07.520 | it starts to get interesting. They have linked together chat history and training. It's both
00:01:12.580 | or neither. They could have given two separate options. One to store your chat history so that
00:01:17.420 | you can look back over it later and another to opt out of training. But instead it's one button.
00:01:21.980 | You either give them your data and keep your chats,
00:01:24.660 | or you don't give them your data and you don't keep your chats. If you opt not to give them your
00:01:29.360 | chat history, they still monitor the chats for what they call abuse. So bear that in mind.
00:01:33.880 | What if I want to keep my history on but disable model training? We are working on a new offering
00:01:39.140 | called ChatGPT Business. I'm going to talk about that in a moment, but clearly they don't want to
00:01:44.280 | make it easy to opt out of giving over your training data. Now, in fairness, they do offer
00:01:49.500 | an opt out form. But if you go to the form, it says cryptically, please know that
00:01:54.640 | in some cases this will limit the ability of our models to better address your specific use case.
00:02:00.280 | So that's one big downside to this new announcement. But what's one secret upside?
00:02:04.720 | This export data button buried all the way down here. If you click it, you quite quickly get this
00:02:10.920 | email, which contains a link to download a data export of all your conversations.
00:02:16.160 | After you download the file and open it, you now have an easy way to search through
00:02:20.500 | all your previous conversations, literally all of them from the time you first
00:02:24.620 | started using ChatGPT to the present day. That is a pretty great feature, I must admit.
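
As an aside, once that export is on disk it can be searched with a few lines of Python. This is only a sketch: it assumes the export contains a conversations.json file holding a list of conversation objects, each with a "title" field — an assumption about the export format, not a documented schema.

```python
import json

def search_conversations(path: str, keyword: str) -> list[str]:
    """Return titles of exported conversations that mention the keyword.

    Assumes `path` points at the conversations.json file from a ChatGPT
    data export: a JSON list of conversation objects with a "title"
    field (this structure is an assumption, not a documented schema).
    """
    with open(path, encoding="utf-8") as f:
        conversations = json.load(f)
    needle = keyword.lower()
    return [
        convo.get("title", "(untitled)")
        for convo in conversations
        # Crude but schema-agnostic: match against the whole serialized object
        if needle in json.dumps(convo).lower()
    ]
```

Matching against the serialized object is deliberately blunt — it keeps the sketch working even if the nested message structure differs from what is assumed here.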
00:02:29.660 | But going back to the announcement, they said that you need to upgrade to ChatGPT Business
00:02:33.920 | available in the coming months to ensure that your data won't be used to train our models by default.
00:02:39.920 | But why these announcements now? Why did Sam Altman tweet this just yesterday?
00:02:44.060 | Well, this article, also from yesterday, in the MIT Technology Review by Melissa Heikkilä may explain
00:02:51.240 | why. It said that OpenAI has until the end of this
00:02:54.520 | week to comply with Europe's strict data protection regime, the GDPR, but that it will likely be
00:03:00.860 | impossible for the company to comply because of the way data for AI is collected. Before you leave
00:03:06.380 | and say this is just about Europe, no, it's much bigger than that. The European Data
00:03:10.840 | Protection Supervisor said that the definition of hell might be coming for OpenAI based on the potentially
00:03:17.100 | illegal way it collected data. If OpenAI cannot convince the authorities its data use practices
00:03:23.100 | are legal, it could be a big problem for the company.
00:03:24.500 | OpenAI could be banned not only in specific countries like Italy but in the entire EU; it could also face hefty
00:03:30.080 | fines and might even be forced to delete models and the data used to train them. The stakes could
00:03:36.480 | not be higher for OpenAI. The EU's GDPR is the world's strictest data protection regime and it
00:03:42.820 | has been copied widely around the world. Regulators everywhere from Brazil to California will be
00:03:49.040 | paying close attention to what happens next and the outcome could fundamentally change the way AI
00:03:54.060 | companies are going to be treated. So, let's get to the point.
00:03:54.400 | How do these companies collect your data? Well, two articles published this week tell us much more.
00:04:03.460 | To take one example, they harvested pirated ebooks from the site formerly known as BookZZ, until that
00:04:10.140 | was seized by the FBI last year. Despite that, the contents of the site remain in the Common Crawl
00:04:16.300 | database. OpenAI won't reveal the dataset used to train GPT-4, but we know the Common Crawl was used
00:04:23.060 | to train GPT-3.
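
Incidentally, you don't have to take anyone's word on what Common Crawl holds: it publishes a public CDX index you can query yourself. A minimal sketch, assuming the index.commoncrawl.org endpoint; the crawl name "CC-MAIN-2023-14" below is just an example — real names are listed on that site. The sketch only builds the lookup URL, since actually fetching it needs network access.

```python
from urllib.parse import urlencode

# Common Crawl's public CDX index endpoint; {index} is a crawl name
# such as "CC-MAIN-2023-14" (an example, not a recommendation).
CDX_ENDPOINT = "https://index.commoncrawl.org/{index}-index"

def build_cdx_query(domain: str, index: str = "CC-MAIN-2023-14") -> str:
    """Build the CDX lookup URL for all captures under a domain."""
    params = urlencode({"url": f"{domain}/*", "output": "json"})
    return f"{CDX_ENDPOINT.format(index=index)}?{params}"

# Fetching this URL (e.g. with urllib.request.urlopen) returns one JSON
# record per captured page, or an HTTP 404 if the domain was not crawled.
print(build_cdx_query("example.com"))
```

Each returned record describes a single captured page (URL, timestamp, archive location), so even an empty result for one crawl doesn't rule out captures in another.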
00:04:24.300 | OpenAI may have also used the Pile, which was used recently by Stability AI for their new LLM,
00:04:31.120 | StableLM. The Pile contains more pirated ebooks, but also things like every internal email sent
00:04:38.160 | by Enron. And if you think that's strange, wait until you hear about the copyright takedown policy
00:04:44.280 | of the group that maintains the Pile. I can't even read it out for the video.
00:04:48.960 | This article from the Washington Post reveals even more about the data that was likely used
00:04:54.200 | to train GPT-4. For starters, we have the exclusive content of Patreon, so presumably all my Patreon
00:05:00.520 | messages will be used to train GPT-5. But further down in the article, we have this search bar where
00:05:05.720 | you can look into whether your own website was used in the Common Crawl dataset. I even found my
00:05:11.320 | mum's WordPress family blog, so it's possible that GPT-5 will remember more about my childhood than I
00:05:17.800 | do. If you think that's kind of strange, wait until you hear that OpenAI themselves might not even know what's in their
00:05:24.100 | training set. This comes from the GPT-4 technical report, and in one of the footnotes it says that
00:05:30.100 | portions of the BIG-bench benchmark were inadvertently mixed into the training set.
00:05:35.700 | That word inadvertently is rather startling. For the moment, let's not worry about how mixing in
00:05:41.300 | benchmarks might somewhat obscure our ability to test GPT-4. Let's just focus on that word inadvertently.
00:05:48.180 | Do they really not know entirely what's in their dataset? Whether they do or not, I want you to get ready to count
00:05:54.000 | the number of ways that OpenAI may soon have to pay for the data it once got for free. First,
00:06:00.660 | Reddit. They trawled Reddit for all posts that got three or more upvotes and included them in the
00:06:06.500 | training data. Now, this New York Times article says, Reddit wants them to pay for the privilege.
00:06:11.940 | The founder and chief executive of Reddit said that the Reddit corpus of data is really valuable,
00:06:17.200 | but we don't need to give all of that value to some of the largest companies in the world for
00:06:21.340 | free. I agree, but my question is, will the
00:06:23.900 | users be paid? In fact, that's my question for all of the examples you have seen in this video
00:06:28.480 | and are about to see. Does the user actually get paid? If OpenAI is set to make trillions of
00:06:33.960 | dollars, as Sam Altman has said, will you get paid for helping to train it? Apparently, Reddit is
00:06:39.220 | right now negotiating fees with OpenAI, but will its users get any of that money? What about the
00:06:44.480 | Wikipedia editors that spend thousands of hours making sure articles are accurate, and then GPT-4
00:06:50.260 | or 5 just trawls all of that for free? Or what about
00:06:53.880 | Stack Overflow, the Q&A site for programmers? Apparently, they are now going to
00:06:57.740 | also charge AI giants for training data. The CEO said that users own the content that they post
00:07:04.000 | on Stack Overflow under the Creative Commons license, but that that license requires anyone
00:07:09.500 | later using the data to mention where it came from. But of course, GPT-4 doesn't mention where
00:07:14.540 | its programming tricks come from. Is it just me, or is there some irony in people being
00:07:19.260 | generous enough to give out answers to programming questions actually training a model
00:07:23.860 | that may one day replace them, all the while giving them no credit or compensation?
00:07:29.400 | But now we must turn to lawsuits, because there are plenty of people getting ready to take this
00:07:34.380 | to court. Microsoft, GitHub, and OpenAI were recently sued, with the companies accused of
00:07:39.760 | scraping licensed code to build GitHub's AI-powered Copilot tool. And in an interesting response,
00:07:46.160 | Microsoft and GitHub said that the complaint has certain defects, including a lack of injury.
00:07:53.840 | They argue that the plaintiffs rely on hypothetical events to make their claim, and say that they
00:07:58.280 | don't describe how they were personally harmed by the tool. That could be the big benchmark:
00:08:03.260 | these lawsuits currently fail because no one can prove harm from GPT-4. But how
00:08:09.060 | does that bode for the future, when some people inevitably get laid off because they're
00:08:14.000 | simply not needed anymore, because GPT-4 or GPT-5 can do their jobs?
00:08:18.400 | Would these lawsuits then succeed? When you can prove that you've lost a job because of a specific tool,
00:08:23.820 | which was trained using in part your own data, then there is injury there that you could prove.
00:08:29.340 | But then if you block GPT-4 or GPT-5, there will be millions of coders who can then say that they're
00:08:35.880 | injured because their favorite tool has now been lost. I have no idea how that's going to pan out
00:08:41.520 | in the courts. Of course, these are not the only lawsuits with the CEO of Twitter weighing in,
00:08:46.600 | accusing OpenAI of illegally using Twitter data.
00:08:49.680 | And what about publishers, journalists, and newspapers, whose work might
00:08:53.800 | not be read as much because people can get their answers from GPT-4?
00:09:01.320 | And don't forget their websites were also crawled to train the models.
00:09:01.320 | Well, the CEO of News Corp said that clearly they are using proprietary content. There should be,
00:09:07.400 | obviously, some compensation for that. So it seems like there are lawsuits coming in from
00:09:12.280 | every direction. But Sam Altman has said in the past, "We're willing to pay a lot for very high
00:09:18.120 | quality data in certain domains such as science." Will that actually enrich scientists and mathematicians
00:09:23.780 | or will it just add to the profits of the massive scientific publishers?
00:09:28.900 | That's another scandal for another video, but I am wondering if OpenAI will be tempted to use some
00:09:34.580 | illicit sites instead, such as Sci-Hub, a shadow library website that provides free access to
00:09:40.420 | millions of research papers without regard to copyright. It basically gets past the scientific
00:09:46.260 | publishers' paywalls, and apparently up to 50% of academics say that they use websites like Sci-Hub.
00:09:53.060 | So I think GPT-5 is going to break through some
00:09:55.540 | new science benchmarks. I just wish that the scientists whose work went into training it
00:10:00.620 | were compensated for helping it do so. Just in case it seems like I'm picking on OpenAI,
00:10:05.780 | Google are just as secretive and they were even accused by their own employees of training Bard
00:10:11.620 | with ChatGPT data. They have strenuously denied this, but it didn't stop Sam Altman from saying,
00:10:17.540 | "I'm not that annoyed at Google for training on ChatGPT output, but the spin is annoying." He,
00:10:23.340 | obviously, doesn't believe their denial. And given all of this discussion
00:10:27.120 | on copyright and scraping data, I found this headline supremely ironic. OpenAI are trying to
00:10:33.880 | trademark the name GPT. Meaning all of those models that you've heard of, AutoGPT, MemoryGPT,
00:10:40.120 | HuggingGPT, they might be stopped from using that name. Imagine a world where they win all of their
00:10:46.000 | battles in court and they can use everyone's data, but no one can use their name GPT.
00:10:51.640 | But maybe this entire data issue won't be
00:10:53.720 | relevant for much longer. Sam Altman recently said that he predicts OpenAI data spend will go down as
00:11:00.860 | models get smarter. I wonder if he means that the models might be able to train their own synthetic
00:11:05.900 | data sets and therefore not require as much outside data. Or of course he could be talking
00:11:11.140 | about simplifying the reinforcement learning with human feedback phase, where essentially the model
00:11:16.540 | gives itself feedback, reducing the need for human evaluators. Wouldn't that be quite something if GPT-4
00:11:23.700 | was used to train GPT models? I fluctuate between being amazed, annoyed and deeply concerned about
00:11:38.580 | where all of this is going. Let me know in the comments what you think of it all and have a