
"OpenAI is Not God” - The DeepSeek Documentary on Liang Wenfeng, R1 and What's Next


Whisper Transcript | Transcript Only Page

00:00:00.000 | For early access to future documentaries and 30-plus exclusive ad-free videos,
00:00:04.960 | check out my Patreon, link in the description.
00:00:07.560 | DeepSeek wasn't meant to happen.
00:00:12.260 | The lines were well-rehearsed.
00:00:14.760 | The West had an ever-growing lead in AI.
00:00:18.100 | Language models were getting ever more expensive as they got more intelligent.
00:00:23.420 | And research was retreating behind a veil of competitive secrecy.
00:00:28.820 | But on the 20th of January, 2025, those reading those lines started to stutter.
00:00:36.780 | A model that visibly seemed to think before it spoke had been released, DeepSeek R1.
00:00:42.820 | It was unbelievably cheap, competitive with the best the West had to offer,
00:00:47.860 | and out in the open, available to anyone to download.
00:00:51.960 | Even OpenAI admit as much, arguing in March that DeepSeek shows,
00:00:57.180 | quote, that our lead is not wide and is narrowing.
00:01:01.020 | OpenAI even want models like DeepSeek R1 banned because they say, quote,
00:01:06.220 | DeepSeek could be compelled by the Chinese Communist Party to manipulate its models to cause harm.
00:01:12.460 | And because DeepSeek is simultaneously state-subsidized, state-controlled, and freely available,
00:01:19.340 | it will cost users their privacy and security.
00:01:22.900 | Now, while Google's Gemini 2.5 and the new ChatGPT image gen have wrested back the headlines at the beginning of April,
00:01:30.840 | DeepSeek is preparing to deliver yet another shock to the system,
00:01:34.840 | with DeepSeek R2 expected later in April or May.
00:01:39.380 | But truth be told, many of you will already know all of that.
00:01:42.960 | What you might not know, though, are the aims and beliefs expressed in disparate interviews by the secretive founder behind DeepSeek,
00:01:50.840 | billionaire Liang Wenfeng, a man who now has to hide from crowds of adoring fans in his own hometown,
00:01:58.440 | according to a friend he texted,
00:02:00.400 | and who has now fled his home province with his family to escape further attention.
00:02:04.400 | Nor will some of you know about the first AI operation that made Liang his money,
00:02:09.600 | and then went awry.
00:02:11.140 | Or the beauty of some of the technical innovations behind the Omega viral DeepSeek R1.
00:02:16.840 | Or just how the Western labs like OpenAI and Anthropic have fired back with their own narratives
00:02:23.200 | in the days and weeks since the release of R1.
00:02:26.520 | There is frankly so much that so many people don't know about the company DeepSeek and what it means.
00:02:32.840 | The truth is that DeepSeek is a whale caught in a net of narratives,
00:02:37.060 | most of which contradict each other.
00:02:39.180 | So let's get as close as we can to the truth behind the narratives,
00:02:43.480 | and what that truth says about where all of this is going.
00:02:46.500 | Because if Liang Wenfeng is correct,
00:02:49.000 | and artificial general intelligence is, quote,
00:02:52.460 | 10, 5, or even 2 years away,
00:02:55.260 | then this story is about far, far more than one man, one lab, or even one nation.
00:03:01.680 | Here then is what one of Liang's business partners said of the man who is thought to be 40.
00:03:07.620 | He was this very nerdy guy with a terrible hairstyle when they first met.
00:03:12.140 | Talking about building a 10,000 chip cluster to train his own AI models.
00:03:17.180 | We didn't take him seriously.
00:03:18.760 | Of course, there are many AI leaders with terrible hairstyles, so what sets Liang Wenfeng apart?
00:03:24.720 | He certainly wasn't always about solving intelligence and making it free.
00:03:28.360 | It's hard to become a billionaire that way, as you might well guess.
00:03:32.060 | No, to seek out the origin story here, we must switch to a first-hand account from the man himself.
00:03:37.980 | Before that, though, a few moments of background.
00:03:40.660 | Liang graduated university into a world that was falling apart.
00:03:45.240 | Some of you will be too young, of course, to remember the panic of September 2008,
00:03:50.240 | when the financial pyramid built on the sands of the US subprime housing market collapsed.
00:03:56.080 | Either way, you might be able to understand the drive Liang had to try to understand the patterns within the unfolding chaos,
00:04:03.760 | and predict what would come next.
00:04:05.640 | There were those who tried to tempt him into different directions while he operated out of a small flat in Chengdu, Sichuan.
00:04:12.260 | Not me, though I was there, actually, in Chengdu at the same time, learning Mandarin.
00:04:17.040 | No, no, no, it was the founder of what would become DJI, the world's preeminent drone maker,
00:04:23.200 | who tried to headhunt Liang, but to no avail.
00:04:26.420 | Liang had bigger ambitions.
00:04:28.320 | After getting a master's in information engineering in 2010,
00:04:31.960 | Liang went on a founding spree between 2013 and 2016,
00:04:36.600 | culminating in the establishment of the hedge fund High Flyer in February 2016.
00:04:42.200 | Each entity he started included the core goal of using machine learning
00:04:47.200 | to uncover the patterns behind microsecond or even nanosecond movements in the financial markets.
00:04:53.600 | Patterns and paradigms no humans could detect alone.
00:04:57.600 | Artificial intelligence, if you will.
00:04:59.600 | Before it was called that, of course.
00:05:01.280 | As late as May 2023, Liang was still describing his goal in financial terms.
00:05:07.500 | Our broader research aims to understand what kind of paradigms can fully describe the entire financial market,
00:05:14.260 | and whether there are simpler ways to express it.
00:05:17.140 | Anyway, it worked through attracting $9.4 billion in assets under management by the end of 2021,
00:05:23.440 | and providing returns that in some cases were 20 to 50 percentage points more than stock market benchmarks.
00:05:29.860 | Liang absolutely minted it.
00:05:32.140 | He was a billionaire by his mid-30s and on top of the world.
00:05:36.380 | All of High Flyer's market strategies used AI,
00:05:39.240 | and yes, they were calling it that,
00:05:41.140 | and they even had a supercomputer powered by 10,000 NVIDIA GPUs.
00:05:45.700 | He might not at this point be scaling up language models like a tiny American startup,
00:05:50.240 | OpenAI, had done the year earlier in 2020 with GPT-3.
00:05:53.960 | But had his AI truly solved the chaos of the financial markets?
00:05:59.140 | Had he done it?
00:06:00.780 | This is where the story starts to get interesting.
00:06:03.500 | Liang's AI system, built with a team of just over 100 individuals,
00:06:07.120 | had a troublesome personality quirk.
00:06:10.320 | It was frankly too much of a risk taker.
00:06:12.900 | It would double down on bets when it felt it was right, and that wasn't all.
00:06:16.800 | The hedge fund itself, High Flyer, had become hubristic.
00:06:20.420 | It was flying too close to the sun.
00:06:22.960 | Success as a hedge fund, as you might expect, attracts more investments.
00:06:26.760 | If you don't limit your fund size, and Liang didn't in time,
00:06:30.340 | then sometimes you have too much money to deploy in a smart way.
00:06:33.980 | Your trades get copied, your edge becomes less keen.
00:06:36.540 | So after seeing a sharp drawdown, High Flyer expressed its deep guilt in public,
00:06:42.560 | and took measures to further limit who could invest with them.
00:06:45.260 | Yes, in case you're curious, they did learn their lesson,
00:06:47.580 | and are still going as a hedge fund today with some degree of success.
00:06:51.120 | Actually, between 2018 and early 2024,
00:06:53.960 | High Flyer has outperformed the Chinese equivalent of the S&P index,
00:06:57.760 | albeit with some stumbles since then.
00:06:59.660 | And yes, as we know, Liang didn't give up on AI.
00:07:02.360 | He was rich now, and could afford an outfit dedicated to decoding not just financial systems,
00:07:08.020 | but the nature of general intelligence itself.
00:07:11.140 | The effort would be called DeepSeek,
00:07:13.720 | and it was first formed as a research body in April 2023.
00:07:18.580 | Any scars, perhaps, though, for Liang from his previous AI experience?
00:07:22.500 | Well, there is one that might have carried over into the paper DeepSeek produced
00:07:27.460 | on their first large language model, or chatbot.
00:07:30.140 | From his experience, Liang knew that AI could be fickle,
00:07:34.140 | and not always a reliable partner.
00:07:36.060 | So DeepSeek added this disclaimer for their first chatbot,
00:07:39.760 | DeepSeek V1, released in November 2023.
00:07:43.080 | We profoundly recognize the importance of safety for general artificial intelligence.
00:07:49.060 | The premise for establishing a truly helpful artificial intelligence model
00:07:53.440 | is that it possesses values consistent with those of humans,
00:07:56.620 | and exhibits friendliness towards humanity.
00:07:59.960 | Before I continue any further, though,
00:08:02.100 | let's not pretend that many of us in the West were paying much attention
00:08:05.340 | to any of the developments described so far.
00:08:07.980 | By then, of course, OpenAI were well onto GPT-4,
00:08:11.620 | which showed sparks of AGI.
00:08:14.060 | GPT-4 was released publicly in March 2023,
00:08:17.320 | well before DeepSeek was even officially founded in July of that year.
00:08:22.160 | But at least the stage had been set,
00:08:24.240 | a reclusive billionaire,
00:08:25.460 | one and a half decades deep into wielding artificial intelligence
00:08:29.360 | to understand the world.
00:08:30.320 | A man who had made his money,
00:08:32.300 | and was now, in his words,
00:08:33.660 | simply driven to explore.
00:08:35.660 | Quote,
00:08:36.220 | People, Liang said,
00:08:37.760 | may think there's some hidden business logic behind DeepSeek,
00:08:41.440 | but it's mainly driven by curiosity.
00:08:44.100 | Why did DeepSeek R1 capture the world's attention at the start of 2025?
00:08:50.940 | Why did it divide opinions and convulse markets?
00:08:54.860 | Was it that the wider world could see the thinking process
00:08:58.920 | of the language model before it gave its final answer?
00:09:01.600 | Was it that the DeepSeek model was so cheap?
00:09:03.520 | Or that the model and the methods behind it were so open and accessible?
00:09:07.860 | Or was it that such a performant model had come from China,
00:09:11.380 | which was supposed to be a year behind the Western frontier?
00:09:14.520 | We will investigate each of these possibilities,
00:09:17.160 | but there was one thing that was certain of the DeepSeek of summer 2023.
00:09:22.160 | It was, indeed, deeply behind Western AI labs.
00:09:26.460 | By then, don't forget,
00:09:27.460 | not only was GPT-4 out and about,
00:09:29.320 | but so was the first version of Claude from Anthropic
00:09:32.000 | and Bard from Google,
00:09:33.700 | and even Llama 2 from Meta.
00:09:36.260 | DeepSeek, by the way,
00:09:37.600 | paid particular attention to Llama 2.
00:09:40.640 | That model might not have been quite as smart on key benchmarks as GPT-4,
00:09:45.060 | but it was so-called open weights,
00:09:47.000 | which means almost anyone could download,
00:09:49.060 | tweak,
00:09:49.560 | and deploy the model as they saw fit.
00:09:51.480 | A model is, of course, nothing without its weight,
00:09:53.720 | or its billions of tweakable numerical values used to calculate outputs.
00:09:58.160 | To be clear,
00:09:58.740 | open weights isn't quite the same as open source,
00:10:01.500 | as to be open source,
00:10:03.040 | we would need to see the data that went into training the model,
00:10:06.280 | the source, so to speak,
00:10:07.560 | which we did not and still do not know.
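
To make "open weights" concrete: anyone with the hardware can download the published checkpoint and run it locally, license permitting. A minimal sketch using the Hugging Face transformers library follows; the model identifier is illustrative, so substitute any openly released checkpoint you have access to.

```python
# Minimal sketch of what "open weights" means in practice: pull the checkpoint, run it locally.
# The model id below is illustrative -- swap in any openly released checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # assumed/illustrative repo name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Briefly, what is a mixture-of-experts model?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```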
00:10:10.000 | Despite some models like Llama 2 being open weights at least,
00:10:13.560 | key leaders within Western AI labs were saying
00:10:16.500 | that the frontier would increasingly belong
00:10:19.520 | to those who kept secret the methodology
00:10:22.520 | behind their language model training,
00:10:24.520 | as OpenAI did.
00:10:25.800 | Here's Ilya Sutskever,
00:10:27.120 | at the time the chief scientist of OpenAI.
00:10:29.680 | He was saying
00:10:30.620 | there will always be a gap
00:10:32.740 | between the open models and the private models,
00:10:35.940 | and this gap may even be increasing.
00:10:38.680 | Sam Altman, CEO and co-founder of OpenAI, went further.
00:10:42.060 | It wasn't just that research secrets were becoming a moat,
00:10:44.700 | so too was money.
00:10:46.500 | In June of 2023 in India,
00:10:48.840 | Sam Altman replied to a question
00:10:50.480 | about whether a team with just $10 million
00:10:52.320 | could compete with OpenAI.
00:10:54.000 | His response for me became a wider comment
00:10:57.060 | on whether it was possible for any startup
00:10:59.280 | to enter the race
00:11:00.360 | and build a truly intelligent language model.
00:11:02.740 | Look, the way this works is we're going to tell you
00:11:04.980 | it's totally hopeless to compete with us
00:11:06.740 | on training foundation models you shouldn't try,
00:11:08.460 | and it's your job to, like, try anyway.
00:11:10.540 | And I believe both of those things.
00:11:12.980 | I think it is pretty hopeless, but...
00:11:18.100 | Not just this,
00:11:18.840 | a month earlier, in May,
00:11:20.480 | he had put it even more bluntly.
00:11:22.360 | There will be the hyperscalers'
00:11:24.060 | best closed source models,
00:11:26.660 | and there will be the progress
00:11:29.140 | that the open source community makes,
00:11:30.540 | and it'll be, you know,
00:11:31.400 | a few years behind or whatever,
00:11:32.980 | a couple years behind, maybe.
00:11:34.000 | As we learnt,
00:11:34.900 | a few weeks before these comments,
00:11:36.700 | Liang had launched what would become DeepSeek.
00:11:39.640 | In short, remember this context
00:11:41.580 | when you wonder at the reaction
00:11:43.660 | to DeepSeek R1 in January 2025.
00:11:46.760 | It wasn't supposed to be like this.
00:11:48.800 | Intelligence was supposed to come
00:11:50.060 | from the scale of the base model,
00:11:51.760 | measured not just in how many tens of thousands
00:11:54.220 | of NVIDIA GPUs were used to compute
00:11:56.640 | the parameters of that model,
00:11:57.840 | but on how much data it was trained on.
00:11:59.660 | It just made sense that no one could compete
00:12:01.860 | without the backing of multi-trillion dollar hyperscalers
00:12:05.380 | like Microsoft or Google.
00:12:07.240 | Liang Wenfeng was rich,
00:12:08.900 | but not that rich.
00:12:10.180 | Liang must have known
00:12:11.360 | that these Western lab leaders
00:12:12.940 | thought what he was about to attempt
00:12:14.860 | was impossible,
00:12:15.880 | but he tried anyway.
00:12:17.440 | Nor would he be distracted
00:12:18.900 | by the lure of quick monetization
00:12:21.140 | through routes like $20 subscriptions.
00:12:23.480 | Liang said in May of 2023,
00:12:25.840 | our goal is clear,
00:12:27.200 | to focus on research and exploration,
00:12:29.860 | rather than vertical domains and applications.
00:12:32.960 | So DeepSeek focused its recruitment efforts
00:12:35.500 | on those who were young,
00:12:36.700 | curious and, crucially, Chinese.
00:12:38.920 | By the way,
00:12:39.800 | not even Chinese returnees
00:12:41.760 | from the West were favored.
00:12:43.440 | Liang added,
00:12:44.420 | DeepSeek prioritizes capability over credentials,
00:12:47.480 | core technical roles
00:12:48.860 | are primarily filled by recent grads
00:12:51.260 | or those one to two years out.
00:12:52.900 | These intellectual foot soldiers
00:12:54.560 | would not be waylaid
00:12:56.120 | by the need to release on a schedule
00:12:57.960 | to compete with OpenAI.
00:12:59.140 | That was what had led Google
00:13:00.700 | to release a botched Bard
00:13:02.300 | and Microsoft a comically wayward Bing.
00:13:04.900 | Our evaluation standards
00:13:06.340 | are quite different
00:13:07.320 | from those of most companies.
00:13:08.580 | We don't have KPIs,
00:13:09.880 | key performance indicators,
00:13:10.920 | or so-called quotas.
00:13:12.380 | In our experience,
00:13:13.420 | innovation requires as little intervention
00:13:15.460 | and management as possible,
00:13:16.740 | giving everyone the space to explore
00:13:18.600 | and the freedom to make mistakes.
00:13:20.220 | All that said,
00:13:21.380 | DeepSeek's first pair of AI models
00:13:23.700 | released in November 2023
00:13:25.660 | were not exactly stunning in their originality.
00:13:29.420 | As I hinted at earlier,
00:13:31.000 | their V1 large language model
00:13:32.860 | drew heavily upon the innovations
00:13:35.040 | of Meta's Llama 2 LLM.
00:13:37.060 | And neither of their November releases,
00:13:39.380 | DeepSeek Coder or V1,
00:13:41.240 | made waves in the Western media.
00:13:43.160 | As attention at that time,
00:13:44.420 | you may remember,
00:13:45.120 | focused on Sam Altman
00:13:46.520 | being temporarily fired from OpenAI
00:13:48.860 | for lack of candor.
00:13:50.120 | But there were just a few signs
00:13:51.980 | that DeepSeek were indeed focused on
00:13:54.420 | long-termism,
00:13:55.780 | as each of their papers explicitly claim.
00:13:58.100 | For example,
00:13:58.800 | DeepSeek excluded multiple-choice questions
00:14:01.560 | from their bespoke training dataset,
00:14:03.900 | so that their models would not
00:14:05.540 | overperform on formal tests,
00:14:08.000 | but underwhelm in practice.
00:14:09.600 | And that's a lesson not learnt
00:14:11.160 | by all AI labs at the time,
00:14:12.960 | or even now.
00:14:13.700 | DeepSeek wrote,
00:14:14.620 | quote,
00:14:15.000 | overfitting to benchmarks
00:14:16.740 | would not contribute
00:14:17.960 | to achieving true intelligence
00:14:19.840 | in the model.
00:14:20.500 | By the beginning of 2024,
00:14:22.360 | the DeepSeek team
00:14:23.560 | was cooking with gas.
00:14:24.780 | In January,
00:14:25.440 | they pioneered a novel approach
00:14:27.160 | to getting more intelligence
00:14:28.300 | from their models for less.
00:14:29.920 | Bear in mind that models like Llama 2
00:14:32.060 | use their entire set of weights,
00:14:34.220 | often tens or hundreds of billions strong,
00:14:36.760 | to compute a response
00:14:38.160 | to a user prompt.
00:14:39.220 | That contrasted with
00:14:41.220 | the mixture of experts approach,
00:14:43.480 | which was not at all original
00:14:45.180 | to DeepSeek.
00:14:45.820 | The mixture of experts approach
00:14:47.460 | involves using a specialized subset
00:14:49.740 | of those weights,
00:14:50.940 | depending on the user input,
00:14:52.700 | thereby tapping into one or more
00:14:55.220 | of the set or mix of experts
00:14:57.560 | within the model,
00:14:58.420 | if you will.
00:14:59.000 | But think about it,
00:14:59.780 | because only a subset
00:15:01.200 | of the model weights
00:15:02.160 | would respond to each request,
00:15:03.880 | every expert within the model
00:15:06.380 | had to have a degree
00:15:07.640 | of common capability.
00:15:09.020 | A tiny bit like forcing Messi
00:15:11.100 | to spend hours a week
00:15:12.440 | practicing goalkeeping,
00:15:13.740 | and yes,
00:15:14.420 | I am talking about soccer
00:15:15.340 | if you are American.
00:15:16.400 | Could DeepSeek utilize
00:15:18.240 | the mixture of experts approach,
00:15:20.060 | which is highly efficient,
00:15:21.280 | without that key downside?
00:15:22.940 | You probably guessed the answer
00:15:24.080 | from my tone,
00:15:24.660 | but yes,
00:15:25.260 | in their
00:15:25.980 | Towards Ultimate Expert Specialization paper,
00:15:28.900 | here's the innovation.
00:15:29.780 | Certain expert subnetworks
00:15:31.960 | within the language model
00:15:32.920 | would always be activated
00:15:34.760 | in any response.
00:15:36.020 | Those guys could be
00:15:37.220 | the generalists.
00:15:38.180 | This meant that
00:15:39.100 | the remaining experts,
00:15:40.220 | like Messi,
00:15:40.760 | could truly focus
00:15:42.100 | on what they are good at.
00:15:43.420 | And yes,
00:15:43.780 | just in case you're thinking ahead,
00:15:45.160 | this is also one of the
00:15:46.880 | many secrets behind
00:15:47.920 | the base model
00:15:48.700 | that powers DeepSeek R1,
00:15:50.420 | the global phenomenon.
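
To make the shared-expert idea concrete, here is a toy sketch (not DeepSeek's code; expert counts, sizes and names are illustrative, and the real DeepSeekMoE layers add load-balancing losses and fused kernels on top): a couple of always-on generalist experts process every token, while a router sends each token only to its top-scoring specialists.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedExpertMoE(nn.Module):
    """Toy mixture-of-experts layer with always-on shared ("generalist") experts.
    Illustrative only; not DeepSeek's implementation."""
    def __init__(self, dim=512, n_shared=2, n_routed=8, top_k=2):
        super().__init__()
        def make_expert():
            return nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.shared = nn.ModuleList([make_expert() for _ in range(n_shared)])
        self.routed = nn.ModuleList([make_expert() for _ in range(n_routed)])
        self.router = nn.Linear(dim, n_routed)   # scores each specialist per token
        self.top_k = top_k

    def forward(self, x):                                 # x: (tokens, dim)
        out = sum(expert(x) for expert in self.shared)    # generalists see every token
        scores = F.softmax(self.router(x), dim=-1)
        top_vals, top_idx = scores.topk(self.top_k, dim=-1)
        routed_out = torch.zeros_like(out)
        for k in range(self.top_k):
            for idx, expert in enumerate(self.routed):
                mask = top_idx[:, k] == idx               # tokens routed to this specialist
                if mask.any():
                    routed_out[mask] = routed_out[mask] + top_vals[mask, k, None] * expert(x[mask])
        return out + routed_out

# Usage: layer = SharedExpertMoE(); y = layer(torch.randn(16, 512))
```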
00:15:51.520 | DeepSeek were just
00:15:52.780 | getting warmed up though.
00:15:53.660 | In April of 2024,
00:15:55.020 | they released DeepSeek Math,
00:15:57.000 | a tiny model
00:15:57.980 | that matched the performance
00:15:59.060 | in mathematics,
00:15:59.760 | at least,
00:16:00.220 | of GPT-4,
00:16:01.220 | a goliath of a model
00:16:02.540 | in comparison.
00:16:03.200 | What's the deal
00:16:04.040 | with DeepSeek Math then?
00:16:05.320 | Well,
00:16:05.580 | one of the secrets
00:16:06.400 | behind the model's success
00:16:07.500 | was the unassumingly named
00:16:09.200 | Group Relative
00:16:10.380 | Policy Optimization.
00:16:11.800 | A mouthful,
00:16:12.540 | but it's a training method
00:16:13.680 | later incorporated,
00:16:14.620 | you guessed it,
00:16:15.300 | by the celebrated
00:16:16.300 | DeepSeek R1.
00:16:17.340 | Here then is the TLDR
00:16:18.920 | on that beast
00:16:20.140 | of a training innovation.
00:16:21.300 | All language models
00:16:22.500 | need to do more
00:16:23.580 | than just predict
00:16:24.660 | the next word,
00:16:25.380 | which is what they learn
00:16:26.460 | in pre-training.
00:16:27.400 | They need
00:16:28.160 | post-training
00:16:29.320 | to move from
00:16:30.240 | predicting the most
00:16:31.060 | probable word
00:16:32.140 | to the most
00:16:32.980 | helpful sets of words
00:16:34.580 | as judged by humans.
00:16:35.900 | And ultimately,
00:16:36.940 | for mathematical reasoning
00:16:38.500 | or coding steps,
00:16:39.480 | the most
00:16:40.220 | correct word.
00:16:41.420 | Think of it like this,
00:16:42.380 | you can't be smarter
00:16:43.380 | than Twitter
00:16:43.900 | if all you do
00:16:45.120 | is train to predict
00:16:46.640 | the next tweet.
00:16:47.580 | This takes careful
00:16:48.880 | reinforcement
00:16:49.640 | of the weights
00:16:50.400 | of the model
00:16:50.920 | that produce
00:16:51.760 | these desired outputs.
00:16:53.580 | This was by mid-2024
00:16:55.320 | well-known,
00:16:55.920 | but what was the magic
00:16:56.760 | behind GRPO,
00:16:58.220 | DeepSeek's new flavor
00:16:59.640 | of reinforcement learning?
00:17:01.040 | Well,
00:17:01.640 | DeepSeek needed efficiency
00:17:03.380 | to fight the AI giants.
00:17:04.960 | Common reinforcement
00:17:06.100 | learning approaches
00:17:06.820 | at the time
00:17:07.540 | used chunky,
00:17:08.480 | clunky,
00:17:09.140 | critic models
00:17:10.300 | to assess answers
00:17:11.620 | as they were being generated
00:17:12.940 | to predict
00:17:13.600 | which ones
00:17:14.240 | were headed for success.
00:17:15.440 | DeepSeek dropped
00:17:16.700 | this memory-heavy critic
00:17:18.220 | and instead generated
00:17:19.480 | a group of answers
00:17:21.100 | in parallel,
00:17:21.700 | checked the yes-no accuracy
00:17:23.740 | of the final outputs,
00:17:25.060 | and then using
00:17:25.760 | the relative score
00:17:27.580 | of each answer
00:17:28.420 | above or below
00:17:29.420 | the average accuracy
00:17:30.420 | of the group of answers
00:17:31.400 | reinforced the successful weights
00:17:34.040 | in the model
00:17:34.520 | and down-weighted others.
00:17:36.000 | Group of answers,
00:17:37.200 | relative score,
00:17:38.440 | reinforcing
00:17:39.400 | the most successful weights.
00:17:41.320 | Group relative
00:17:42.420 | policy optimization.
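
As a minimal sketch of the group-relative scoring at the heart of GRPO (with the clipping and KL-penalty terms of the full objective left out, and shapes and names purely illustrative): each sampled answer is judged against the average of its own group, and the policy gradient pushes up the winners and down the losers.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantage: how far each answer's reward sits above or
    below the mean of its own group, normalized by the group's spread."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True).clamp_min(1e-6)
    return (rewards - mean) / std

def grpo_loss(logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """logprobs: (prompts, group) summed log-probability of each sampled answer.
    rewards:  (prompts, group) 0/1 correctness of each answer's final output.
    Simplified: no PPO-style clipping, no KL penalty toward a reference model."""
    adv = grpo_advantages(rewards).detach()
    return -(adv * logprobs).mean()   # reinforce above-average answers, suppress the rest

# Tiny worked example: one prompt, four sampled answers, only the last two correct.
logprobs = torch.tensor([[-5.0, -7.0, -4.0, -6.0]], requires_grad=True)
rewards = torch.tensor([[0.0, 0.0, 1.0, 1.0]])
grpo_loss(logprobs, rewards).backward()   # gradients favor the two correct answers
```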
00:17:43.580 | Stepping back then,
00:17:44.860 | each of these innovations
00:17:46.020 | was desperately essential
00:17:47.620 | to keep DeepSeek
00:17:48.780 | within reach
00:17:49.420 | of the resource behemoths
00:17:51.060 | behind ChatGPT,
00:17:52.380 | Claude and Gemini.
00:17:53.760 | By May of 2024,
00:17:55.100 | Liang's lab
00:17:55.880 | shipped DeepSeek V2
00:17:57.540 | with yet another
00:17:58.760 | efficiency miracle,
00:17:59.680 | multi-head latent attention.
00:18:01.460 | Now, don't worry,
00:18:02.480 | there's no deep dive
00:18:03.480 | coming on this one,
00:18:04.540 | but forgive me
00:18:05.180 | just a few words
00:18:06.480 | on how DeepSeek
00:18:07.420 | yet again reduced
00:18:08.560 | how big a model
00:18:09.680 | had to be
00:18:10.160 | to reach a similar
00:18:11.080 | level of performance.
00:18:12.040 | Think of multi-head
00:18:13.460 | latent attention
00:18:14.300 | as allowing
00:18:15.180 | multiple parts
00:18:16.400 | of the model
00:18:16.840 | to share common weights
00:18:18.180 | that are hidden
00:18:19.300 | or latent
00:18:20.180 | when they quote
00:18:21.340 | pay attention.
00:18:22.400 | If you're wondering,
00:18:23.360 | this attention mechanism
00:18:24.860 | is the process
00:18:25.880 | by which language models
00:18:26.980 | deduce which parts
00:18:28.180 | of the preceding text
00:18:29.680 | are most relevant
00:18:30.680 | for predicting
00:18:31.320 | the next word.
00:18:32.040 | Sharing those latent
00:18:33.340 | or hidden weights
00:18:34.200 | when paying attention
00:18:35.440 | meant that this model
00:18:37.000 | needed fewer
00:18:37.920 | of the weights overall.
00:18:39.340 | Shared weights,
00:18:40.160 | smaller model,
00:18:41.000 | greater efficiency.
00:18:41.940 | DeepSeek V2.
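
A heavily simplified sketch of that idea (dimensions and names illustrative; the real DeepSeek-V2 design adds details such as decoupled rotary embeddings that are omitted here): every head rebuilds its keys and values from one small shared latent per token, so only that latent needs to be stored and cached.

```python
import torch
import torch.nn as nn

class LatentKVAttention(nn.Module):
    """Toy attention block where K and V are rebuilt from a small shared latent
    per token -- the core of the multi-head latent attention idea, simplified."""
    def __init__(self, dim=512, n_heads=8, latent_dim=64):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.q_proj = nn.Linear(dim, dim)
        self.kv_down = nn.Linear(dim, latent_dim)   # shared compression: the only thing cached
        self.k_up = nn.Linear(latent_dim, dim)      # expand the latent back out for keys
        self.v_up = nn.Linear(latent_dim, dim)      # ...and for values
        self.out = nn.Linear(dim, dim)

    def forward(self, x):                           # x: (batch, seq, dim)
        b, s, _ = x.shape
        latent = self.kv_down(x)                    # (batch, seq, latent_dim)
        def split(t):                               # -> (batch, heads, seq, head_dim)
            return t.view(b, s, self.n_heads, self.head_dim).transpose(1, 2)
        q, k, v = split(self.q_proj(x)), split(self.k_up(latent)), split(self.v_up(latent))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        return self.out((attn @ v).transpose(1, 2).reshape(b, s, -1))

# Usage: LatentKVAttention()(torch.randn(2, 10, 512)).shape -> torch.Size([2, 10, 512])
```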
00:18:43.920 | Okay, we get it.
00:18:44.760 | The point has probably
00:18:45.720 | now been made
00:18:46.540 | that DeepSeek R1
00:18:47.840 | was not a creatio ex nihilo
00:18:49.980 | created from thin air.
00:18:51.520 | It was built
00:18:52.260 | on the back
00:18:53.020 | of painstaking innovations
00:18:54.800 | amassed over
00:18:55.820 | almost two years
00:18:57.040 | and made open
00:18:58.000 | to the world.
00:18:58.600 | Funded, of course,
00:18:59.400 | by a reclusive billionaire.
00:19:00.880 | But wait,
00:19:01.480 | why did Liang
00:19:02.260 | need so much efficiency?
00:19:03.620 | Because yes,
00:19:04.540 | DeepSeek had indeed
00:19:05.580 | secured 10,000
00:19:07.040 | Nvidia A100 GPUs
00:19:09.100 | for High Flyer's
00:19:10.080 | stock trading
00:19:10.600 | in 2021.
00:19:11.500 | But the US government
00:19:13.620 | did not want to let
00:19:14.920 | Chinese companies
00:19:15.820 | get their hands
00:19:16.600 | on more powerful chips.
00:19:18.500 | One after another,
00:19:19.900 | restrictions were introduced
00:19:21.280 | by the Biden administration
00:19:22.880 | to stop China
00:19:23.860 | getting the compute
00:19:25.300 | that it wanted.
00:19:26.100 | Nvidia tried to
00:19:27.400 | wriggle its way
00:19:28.120 | past these restrictions
00:19:29.200 | by inventing new chips
00:19:30.840 | that scraped
00:19:31.820 | under these limits.
00:19:32.780 | But each time
00:19:33.620 | a new restriction followed.
00:19:35.060 | As Liang himself said
00:19:36.840 | in the summer of 2024,
00:19:38.400 | money has never been
00:19:39.960 | the problem for us.
00:19:41.020 | Bans on shipments
00:19:42.580 | of advanced chips
00:19:43.320 | are the problem.
00:19:44.520 | That's the context.
00:19:45.620 | The march to more powerful AI
00:19:47.520 | was now being framed
00:19:48.760 | as a race,
00:19:49.820 | even a, quote,
00:19:51.500 | That perhaps inevitably
00:19:52.900 | kicked off a spree
00:19:54.340 | of smuggling
00:19:55.460 | worthy of a spy movie,
00:19:57.080 | with Singapore and Malaysia
00:19:58.860 | as focal points
00:20:00.140 | for Chinese companies
00:20:01.420 | getting chips
00:20:02.220 | past the new blockade.
00:20:03.680 | Think of this,
00:20:04.460 | some of the GPUs
00:20:06.300 | used in China
00:20:07.240 | to calculate R1's, say,
00:20:09.080 | recipe for Ratatouille,
00:20:10.700 | were apparently smuggled there
00:20:12.680 | in suitcases
00:20:13.800 | with, I would guess,
00:20:15.320 | little space left
00:20:16.480 | for spare socks.
00:20:17.560 | And this brings us
00:20:18.540 | to the end of 2024
00:20:20.360 | with the stage almost set.
00:20:22.180 | Liang Wenfeng
00:20:23.140 | toiling in his Hangzhou office,
00:20:25.060 | reputedly reading papers,
00:20:26.960 | writing code,
00:20:27.880 | and participating
00:20:28.900 | in group discussions,
00:20:29.900 | just like every other researcher
00:20:31.720 | at DeepSeek,
00:20:32.400 | well into the night.
00:20:33.320 | That company
00:20:34.140 | now in the line of sight
00:20:35.920 | of AI industry insiders,
00:20:37.620 | but virtually unknown
00:20:39.060 | to the public
00:20:39.840 | outside of China.
00:20:40.720 | A whale rising,
00:20:42.000 | but still just beneath the surface
00:20:44.240 | as a new year dawned.
00:20:45.820 | Liang Wenfeng
00:21:01.620 | was tired
00:21:02.760 | of the West
00:21:03.780 | inventing things
00:21:04.920 | and China
00:21:05.940 | swooping in
00:21:06.960 | to imitate
00:21:07.620 | and monetize
00:21:08.900 | those innovations.
00:21:09.760 | What's more surprising though
00:21:11.220 | is that he publicly
00:21:12.420 | said as much.
00:21:13.300 | China should
00:21:14.300 | gradually become
00:21:15.360 | a contributor
00:21:16.400 | instead of free riding,
00:21:17.540 | he said,
00:21:18.080 | in his last known
00:21:19.240 | media interview.
00:21:19.940 | He went on
00:21:20.720 | to directly cite
00:21:22.000 | the scaling law,
00:21:23.500 | an empirical finding
00:21:24.840 | first made
00:21:25.560 | in Silicon Valley
00:21:26.360 | that language models
00:21:27.440 | get predictably better
00:21:28.800 | the more parameters
00:21:29.600 | they have
00:21:30.100 | and the more
00:21:30.560 | high quality data
00:21:31.480 | they train on.
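
The empirical regularity being referred to is usually written as a power law. One commonly cited form, from the Chinchilla line of work (the constants are fitted empirically, and the exact values are beside the point here), is:

```latex
L(N, D) \;\approx\; E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

where N is the number of parameters, D the number of training tokens, L the loss the trained model reaches, and A, B, E, alpha, and beta are fitted constants: more parameters and more data push the loss predictably downward.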
00:21:32.340 | In the past
00:21:32.960 | 30 plus years
00:21:33.920 | of the IT wave,
00:21:35.060 | Liang said,
00:21:35.760 | of China,
00:21:36.340 | we basically
00:21:37.380 | didn't participate
00:21:38.280 | in real
00:21:39.380 | technological innovation.
00:21:40.580 | We're used to
00:21:41.700 | Moore's law
00:21:42.340 | falling out of the sky,
00:21:43.560 | lying at home
00:21:44.520 | waiting 18 months
00:21:45.520 | for better hardware
00:21:46.360 | and software
00:21:47.020 | to emerge.
00:21:47.800 | That is how
00:21:49.140 | the scaling law
00:21:50.640 | is being treated.
00:21:52.300 | Liang wanted
00:21:53.200 | DeepSeek
00:21:53.780 | to be a pioneer
00:21:55.100 | that gave away
00:21:56.460 | its research
00:21:57.260 | which others
00:21:58.440 | could then learn
00:21:59.140 | from and adapt.
00:22:00.480 | In the dying days
00:22:01.920 | of 2024,
00:22:02.840 | DeepSeek produced
00:22:04.060 | DeepSeek V3.
00:22:05.820 | It was the bringing
00:22:06.880 | together and scaling
00:22:08.120 | up of all the innovations
00:22:09.480 | you have already heard
00:22:10.580 | about as well as others.
00:22:12.220 | Why not throw in
00:22:13.140 | some mixed precision
00:22:14.460 | training wherein
00:22:15.300 | your obsession
00:22:16.000 | with efficiency
00:22:16.680 | has to reach
00:22:17.580 | such crack addict
00:22:18.740 | levels that you
00:22:19.680 | handwrite code
00:22:20.700 | to optimise instructions
00:22:22.160 | to the NVIDIA GPU
00:22:23.340 | itself rather than
00:22:24.780 | relying on the
00:22:25.720 | popular CUDA libraries
00:22:27.040 | that NVIDIA provides
00:22:28.140 | for you.
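
DeepSeek's V3 recipe goes much further than this (FP8 arithmetic and hand-tuned kernels written below the level of NVIDIA's stock libraries), but as a generic illustration of the mixed-precision idea, here is the standard PyTorch pattern rather than DeepSeek's code: run the forward pass in a cheaper, lower precision while keeping master weights and a loss scaler in full precision so small gradients don't vanish.

```python
import torch

# Requires a CUDA device; the model and data here are placeholders.
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()              # rescales the loss so fp16 gradients don't underflow

for _ in range(10):
    x = torch.randn(32, 1024, device="cuda")
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(x).pow(2).mean()             # forward pass runs in half precision
    optimizer.zero_grad(set_to_none=True)
    scaler.scale(loss).backward()                 # backward pass on the scaled loss
    scaler.step(optimizer)                        # unscales, then updates the fp32 master weights
    scaler.update()
```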
00:22:28.480 | With V3,
00:22:29.380 | DeepSeek's coal picks
00:22:30.700 | were almost worn
00:22:32.020 | blunt from finding
00:22:33.420 | nuggets of efficiency.
00:22:34.520 | And though the hour
00:22:35.880 | was late,
00:22:36.420 | Western labs
00:22:37.120 | were at last
00:22:38.020 | scrambling teams
00:22:39.040 | to study DeepSeek's
00:22:40.480 | breakthroughs.
00:22:41.120 | Dario Amodei,
00:22:41.900 | CEO of Anthropic,
00:22:42.900 | said that
00:22:43.700 | DeepSeek's V3
00:22:45.340 | was actually
00:22:46.140 | the real innovation
00:22:47.180 | and what
00:22:48.000 | should,
00:22:48.540 | he said,
00:22:49.100 | have made people
00:22:50.020 | take notice
00:22:50.660 | a month ago.
00:22:51.360 | We certainly did.
00:22:52.500 | DeepSeek knew
00:22:53.380 | to keep digging
00:22:54.020 | though because
00:22:54.560 | OpenAI had shown
00:22:55.740 | that there was
00:22:56.400 | gold just ahead.
00:22:57.540 | In September of
00:22:58.420 | 2024,
00:22:59.060 | OpenAI had showcased
00:23:00.380 | a new type
00:23:01.200 | of reinforcement
00:23:01.820 | learning that
00:23:02.580 | utilised the
00:23:03.500 | chains of thought
00:23:04.420 | a model produces
00:23:05.140 | before it submits
00:23:06.260 | a final answer.
00:23:07.040 | As we've seen,
00:23:07.780 | a model whose goal
00:23:08.780 | is to predict
00:23:09.300 | what a human on
00:23:10.080 | the web might say
00:23:10.900 | next will always
00:23:11.780 | be limited in
00:23:12.820 | capability.
00:23:13.300 | The O series
00:23:14.480 | from OpenAI
00:23:15.300 | showed that if
00:23:16.260 | instead you first
00:23:17.280 | induce the model
00:23:18.060 | to reason out loud,
00:23:19.140 | then apply brutal
00:23:20.620 | optimisation
00:23:21.280 | pressure in favour
00:23:22.260 | of those outputs
00:23:23.240 | that match verifiably
00:23:24.880 | correct answers
00:23:25.880 | in domains like
00:23:26.760 | mathematics and coding,
00:23:27.720 | you thereby optimise
00:23:29.660 | for the most
00:23:30.500 | technically accurate
00:23:32.080 | continuation
00:23:32.740 | and unveil a whole
00:23:34.680 | new terrain of
00:23:35.680 | reasoning progress
00:23:36.540 | to be explored.
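
"Verifiably correct" can be exactly as blunt as it sounds. A toy illustration, with the answer format invented purely for the example: the reward is 1 if the final boxed answer matches the reference and 0 otherwise, and that single bit is what the optimization pressure acts on.

```python
import re

def math_reward(model_output: str, ground_truth: str) -> float:
    """Blunt verifiable reward: 1.0 if the final \\boxed{...} answer matches the
    reference exactly, else 0.0. Real pipelines normalize expressions far more
    carefully; this only shows that no learned judge is required."""
    match = re.search(r"\\boxed\{([^}]*)\}", model_output)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

print(math_reward(r"...so the total is \boxed{42}", "42"))    # 1.0
print(math_reward(r"I think the answer is \boxed{41}", "42"))  # 0.0
```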
00:23:37.300 | Because of
00:23:38.400 | Liang Wenfeng,
00:23:39.500 | DeepSeek was there
00:23:40.840 | ready and waiting,
00:23:41.780 | pick in hand.
00:23:42.920 | Adding this
00:23:43.800 | think-out-loud
00:23:44.520 | reasoning innovation
00:23:45.400 | on top of their
00:23:46.360 | V3-based model
00:23:47.660 | produced DeepSeek
00:23:49.200 | R1-Zero,
00:23:52.120 | but the thoughts
00:23:52.760 | of that model
00:23:53.520 | could be a little
00:23:54.280 | wayward in language
00:23:55.360 | and style,
00:23:55.920 | so with some
00:23:56.640 | further tweaks
00:23:57.360 | and fine-tuning,
00:23:58.420 | DeepSeek could
00:23:59.580 | unveil DeepSeek R1,
00:24:02.060 | the AI that has
00:24:04.220 | billions of people
00:24:05.360 | talking.
00:24:05.800 | In many technical
00:24:06.700 | benchmarks,
00:24:07.180 | R1 narrowly
00:24:08.440 | surpassed the
00:24:09.520 | performance of the
00:24:10.380 | original O1 model
00:24:11.400 | from OpenAI
00:24:12.060 | in September,
00:24:12.720 | and in others
00:24:13.460 | it was not far
00:24:14.060 | behind.
00:24:14.480 | By being open
00:24:15.620 | with their research,
00:24:16.540 | DeepSeek showed
00:24:17.520 | the world how
00:24:18.380 | language models,
00:24:19.580 | under that
00:24:20.500 | unrelenting
00:24:21.120 | optimization pressure
00:24:22.200 | to produce
00:24:22.660 | correct answers,
00:24:23.480 | could sometimes
00:24:24.300 | backtrack and
00:24:25.560 | even correct
00:24:26.400 | themselves.
00:24:26.860 | It was an
00:24:28.140 | aha moment
00:24:29.180 | for the models
00:24:30.060 | and for the
00:24:31.020 | world realizing
00:24:31.900 | just how close
00:24:32.700 | a secretive
00:24:33.400 | Chinese lab was
00:24:34.320 | to household
00:24:34.820 | names like
00:24:35.540 | ChatGPT.
00:24:36.240 | Don't get me
00:24:36.860 | wrong,
00:24:37.120 | there were other
00:24:37.900 | innovations in the
00:24:38.980 | DeepSeek R1
00:24:39.680 | paper including
00:24:40.300 | how their biggest
00:24:41.200 | and smartest
00:24:41.820 | models could
00:24:42.420 | effectively distill
00:24:43.660 | much of their
00:24:44.400 | abilities into
00:24:45.140 | smaller models,
00:24:45.800 | saving those
00:24:46.560 | models much of
00:24:47.160 | the work to
00:24:47.780 | get up to
00:24:48.160 | scratch.
00:24:48.540 | The explain it
00:24:49.580 | like I'm 10
00:24:50.180 | of that innovation
00:24:51.060 | is that models
00:24:51.960 | that can fit
00:24:52.460 | onto phones
00:24:53.120 | and home
00:24:53.560 | computers or
00:24:54.380 | served at
00:24:54.880 | incredibly low
00:24:55.600 | cost from
00:24:56.020 | anywhere are
00:24:56.760 | now set in
00:24:57.520 | 2025 to be
00:24:58.880 | smarter than
00:24:59.660 | the smartest
00:25:00.340 | giant models
00:25:01.080 | of 2024.
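
For reference, the textbook form of distillation trains the small model to match the large model's softened output distribution. DeepSeek's R1 report actually distills by plain supervised fine-tuning on reasoning traces generated by R1, but the classic loss below conveys the teacher-to-student idea; names and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Classic soft-label distillation: the student is trained to match the
    teacher's softened next-token distribution (Hinton-style KL loss)."""
    t = temperature
    soft_targets = F.softmax(teacher_logits / t, dim=-1)
    log_student = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(log_student, soft_targets, reduction="batchmean") * (t * t)

# Toy example: a vocabulary of 5 tokens, 3 positions in the batch.
teacher = torch.randn(3, 5)
student = torch.randn(3, 5, requires_grad=True)
distillation_loss(student, teacher).backward()
```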
00:25:02.460 | But why the
00:25:03.420 | virality of
00:25:04.200 | DeepSeek R1?
00:25:05.100 | Was it the
00:25:05.780 | fact that you
00:25:06.280 | could see those
00:25:06.900 | thoughts in the
00:25:07.580 | DeepSeek chat
00:25:08.400 | that made the
00:25:09.080 | model so compelling?
00:25:09.820 | Or the fact that
00:25:10.720 | it was so cheap
00:25:11.460 | that caused
00:25:12.180 | Nvidia stocks to
00:25:13.240 | plunge by almost
00:25:14.360 | half a trillion
00:25:15.140 | dollars?
00:25:15.640 | Liang had said
00:25:16.600 | himself by the
00:25:17.340 | way that he
00:25:18.180 | quote didn't expect
00:25:19.220 | pricing to be so
00:25:20.220 | sensitive to
00:25:20.880 | everyone.
00:25:21.200 | Okay was it
00:25:22.200 | DeepSeek's openness
00:25:23.100 | that was so
00:25:23.600 | shocking?
00:25:23.960 | A hundred
00:25:24.780 | narratives have
00:25:25.760 | bloomed in the
00:25:26.620 | days and weeks
00:25:27.420 | after the release
00:25:28.280 | of DeepSeek R1
00:25:29.340 | but not all of
00:25:31.020 | them are as
00:25:31.940 | they seem.
00:25:32.500 | First let's
00:25:33.620 | address those
00:25:34.220 | chains of
00:25:34.640 | thought.
00:25:34.920 | In hindsight
00:25:35.620 | it might seem
00:25:36.520 | obvious that
00:25:37.500 | gaining privileged
00:25:38.420 | access to a
00:25:39.240 | model's thoughts
00:25:40.280 | was always going
00:25:41.140 | to stand out
00:25:41.940 | in a crowded
00:25:42.440 | market.
00:25:42.880 | OpenAI only
00:25:44.000 | gave sanitized
00:25:45.200 | summaries of its
00:25:46.260 | O1 model's
00:25:47.080 | thoughts after all.
00:25:48.080 | But wait,
00:25:48.560 | within hours of
00:25:49.700 | the R1 release
00:25:50.700 | Google had given
00:25:51.680 | us Gemini 2.0
00:25:53.360 | flash thinking,
00:25:54.160 | a model that
00:25:55.320 | showed its
00:25:56.040 | thoughts.
00:25:56.460 | That model's
00:25:57.380 | impact on the
00:25:58.080 | scene can best
00:25:58.920 | be described as
00:26:00.000 | a cute ripple
00:26:01.240 | next to the R1
00:26:02.500 | tsunami.
00:26:03.040 | So it must have
00:26:03.820 | been the price,
00:26:04.420 | right?
00:26:04.700 | By some
00:26:05.180 | metrics R1's
00:26:06.240 | 95% cheaper than
00:26:07.860 | competitively capable
00:26:08.940 | models from
00:26:09.600 | OpenAI.
00:26:10.000 | But wait,
00:26:10.860 | Gemini 2 flash is
00:26:12.620 | even cheaper.
00:26:13.260 | And again,
00:26:14.200 | just a polite
00:26:15.500 | round of applause.
00:26:16.280 | Okay,
00:26:16.840 | maybe it's the
00:26:17.380 | fact that the
00:26:17.880 | model cost just
00:26:18.900 | $6 million to
00:26:20.160 | train,
00:26:20.520 | which is a
00:26:21.340 | measly sum in
00:26:22.440 | the circumstances.
00:26:23.180 | Well,
00:26:24.000 | on that,
00:26:24.600 | let's take a
00:26:25.380 | moment to at
00:26:26.120 | least hear out
00:26:26.920 | the argument from
00:26:27.820 | the leaders of
00:26:28.580 | the Western
00:26:28.980 | labs on price,
00:26:29.900 | even if you have
00:26:30.760 | reason to doubt
00:26:31.720 | their motivation.
00:26:32.800 | Anthropic's CEO
00:26:33.680 | Dario Amodei
00:26:34.500 | first responded by
00:26:35.660 | describing how
00:26:36.560 | costs had already
00:26:37.640 | been consistently
00:26:38.520 | dropping 4x per
00:26:40.060 | year for the
00:26:41.160 | same amount of
00:26:41.860 | model capability.
00:26:42.600 | He even wrote a
00:26:43.720 | full article to
00:26:44.720 | in part clarify
00:26:45.720 | that,
00:26:46.280 | quote,
00:26:46.680 | even if you take
00:26:47.640 | DeepSeek's training
00:26:48.480 | costs at face
00:26:49.200 | value,
00:26:49.660 | they are on
00:26:50.880 | trend at best
00:26:51.860 | and probably not
00:26:52.760 | even that.
00:26:53.320 | He did admit that
00:26:54.420 | what was different
00:26:55.140 | in his words was
00:26:56.280 | that the company
00:26:57.380 | that was first to
00:26:58.320 | demonstrate the
00:26:59.140 | expected cost
00:26:59.980 | reductions was
00:27:00.980 | Chinese.
00:27:01.920 | DeepSeek's GPU
00:27:02.920 | investments alone
00:27:04.000 | account for more
00:27:05.160 | than $500 million
00:27:07.060 | even after
00:27:08.320 | considering export
00:27:09.240 | controls.
00:27:09.720 | Their total server
00:27:10.820 | capital expenditure
00:27:11.700 | is around $1.6
00:27:13.520 | billion.
00:27:13.800 | Even a $6
00:27:15.080 | million training run
00:27:16.360 | doesn't just appear
00:27:17.240 | from nowhere.
00:27:17.900 | Indeed,
00:27:18.480 | things are getting
00:27:19.480 | so costly for
00:27:20.500 | DeepSeek that even
00:27:21.580 | Liang's vast pockets
00:27:22.840 | are reaching their
00:27:23.600 | limits.
00:27:23.980 | According to reports
00:27:25.140 | from February of
00:27:26.120 | 2025,
00:27:26.740 | Liang is considering
00:27:28.100 | raising outside money
00:27:29.480 | for the first time,
00:27:30.520 | potentially from
00:27:31.100 | Alibaba Group
00:27:31.900 | and Chinese
00:27:32.740 | state-affiliated funds.
00:27:34.100 | Why would so much
00:27:35.480 | money be needed?
00:27:36.140 | Well,
00:27:36.600 | it's not just to
00:27:37.520 | serve the tens of
00:27:38.820 | millions of daily
00:27:39.500 | active users
00:27:40.160 | that DeepSeek now has.
00:27:41.280 | It's to scale
00:27:42.260 | model intelligence
00:27:43.200 | further,
00:27:43.640 | all the way to
00:27:45.700 | an artificial
00:27:46.540 | intelligence as
00:27:47.540 | general in
00:27:48.300 | applicability as
00:27:49.320 | our own.
00:27:49.700 | According to
00:27:50.480 | Altman and
00:27:51.080 | Amodei,
00:27:51.560 | tacking on that
00:27:52.720 | think-out-loud
00:27:53.520 | reasoning optimization
00:27:54.520 | to a great
00:27:55.800 | base model
00:27:56.440 | can yield
00:27:57.220 | outsized dividends
00:27:58.780 | at first,
00:27:59.700 | which have
00:28:00.460 | allowed DeepSeek
00:28:01.280 | to catch up.
00:28:01.960 | But to ride
00:28:03.200 | that upward
00:28:03.900 | curve into
00:28:04.940 | the vicinity
00:28:06.000 | of AGI,
00:28:07.020 | you'll need
00:28:08.540 | tens of
00:28:09.240 | billions of
00:28:09.680 | dollars worth
00:28:10.400 | of compute,
00:28:10.920 | they argue.
00:28:11.700 | Amodei wrote,
00:28:12.900 | We're therefore
00:28:13.480 | at an
00:28:13.900 | interesting
00:28:14.420 | crossover point
00:28:15.500 | where it is
00:28:16.660 | temporarily the
00:28:17.980 | case that
00:28:18.860 | several companies
00:28:19.740 | can produce
00:28:20.280 | good reasoning
00:28:20.980 | models.
00:28:21.360 | This will
00:28:22.220 | rapidly cease
00:28:23.360 | to be true
00:28:24.060 | as everyone
00:28:24.960 | moves further
00:28:25.880 | up the
00:28:26.680 | scaling curve
00:28:27.420 | on these
00:28:27.860 | models.
00:28:28.200 | Making AI,
00:28:29.240 | he said,
00:28:29.780 | that is smarter
00:28:30.540 | than almost all
00:28:31.520 | humans at
00:28:32.320 | almost all
00:28:33.020 | things will
00:28:34.180 | require millions
00:28:35.260 | of chips,
00:28:35.880 | tens of
00:28:36.860 | billions of
00:28:37.360 | dollars at
00:28:38.280 | least,
00:28:38.660 | and is most
00:28:39.740 | likely to happen
00:28:40.540 | in 2026,
00:28:42.480 | 2027.
00:28:43.440 | Even forgetting
00:28:44.300 | DeepSeek for a
00:28:45.060 | moment,
00:28:45.320 | that is quite
00:28:46.320 | the quote.
00:28:47.020 | If he's right
00:28:47.960 | though,
00:28:48.240 | and it's a
00:28:48.700 | big if,
00:28:49.480 | those Chinese
00:28:50.420 | corporate jet
00:28:51.100 | setters are
00:28:51.620 | going to have
00:28:51.940 | to smuggle
00:28:52.360 | god knows
00:28:53.040 | how many
00:28:53.360 | GPUs under
00:28:54.200 | their packed
00:28:54.600 | pajamas.
00:28:54.980 | But Amodei
00:28:56.440 | has a word
00:28:57.440 | on that.
00:28:58.100 | One billion
00:28:59.100 | dollars of
00:28:59.760 | economic activity
00:29:00.620 | can be hidden,
00:29:01.380 | but it's hard
00:29:02.300 | to hide
00:29:02.760 | a hundred
00:29:03.240 | billion or
00:29:03.920 | even ten
00:29:04.400 | billion.
00:29:04.700 | A million
00:29:05.420 | chips may
00:29:06.360 | also be
00:29:07.000 | physically
00:29:07.560 | difficult to
00:29:08.420 | smuggle.
00:29:08.760 | Without
00:29:09.280 | enough chips,
00:29:09.940 | according to
00:29:10.400 | this argument,
00:29:10.960 | DeepSeek's
00:29:11.740 | R2 and
00:29:12.400 | R3 can't
00:29:13.420 | help but
00:29:14.020 | fall behind.
00:29:14.680 | We simply
00:29:15.420 | do not
00:29:15.840 | know if
00:29:16.540 | the DeepSeq
00:29:17.260 | engineers can
00:29:18.140 | keep building
00:29:18.820 | at the pace
00:29:19.700 | of those
00:29:20.060 | working with
00:29:20.700 | billion dollar
00:29:21.420 | bricks.
00:29:21.860 | While we're
00:29:22.500 | on China,
00:29:23.020 | there is
00:29:23.440 | another narrative
00:29:24.120 | that I want
00:29:24.800 | to debunk.
00:29:25.580 | You may have
00:29:26.240 | been told that
00:29:27.180 | DeepSeek is a
00:29:28.060 | one-off and
00:29:28.740 | that China lacks
00:29:29.600 | the environment
00:29:30.260 | to properly
00:29:30.920 | foster innovation
00:29:31.880 | in AI.
00:29:32.660 | Well,
00:29:33.120 | even if you
00:29:33.800 | cast aside the
00:29:34.840 | text-to-image
00:29:35.520 | and text-to-video
00:29:36.340 | wonders produced
00:29:37.360 | by tools like
00:29:38.160 | Kling AI,
00:29:38.680 | you are still
00:29:39.600 | left with a
00:29:40.360 | landscape full
00:29:41.380 | of new models
00:29:41.980 | like Doobao
00:29:43.020 | like Doubao
00:29:44.500 | ByteDance,
00:29:45.200 | makers of
00:29:45.700 | TikTok,
00:29:46.000 | released within
00:29:46.860 | hours actually
00:29:47.700 | of R1.
00:29:48.800 | And a week
00:29:49.380 | before that,
00:29:50.020 | we got Spark
00:29:51.140 | Deep Reasoning
00:29:52.040 | X1 from
00:29:53.060 | iFlytek and
00:29:54.100 | Huawei,
00:29:54.480 | which beats
00:29:55.260 | Western models
00:29:55.960 | at Chinese
00:29:56.820 | technical exams
00:29:57.620 | and is used by
00:29:58.460 | almost 100
00:29:59.140 | million people
00:29:59.800 | already.
00:30:00.220 | And on
00:30:01.000 | January 20th,
00:30:02.220 | the literal
00:30:02.840 | same day that
00:30:03.740 | R1 was
00:30:04.360 | released,
00:30:04.820 | the Chinese
00:30:05.580 | research firm
00:30:06.360 | Moonshot
00:30:07.120 | AI launched
00:30:07.980 | the multimodal
00:30:08.920 | model Kimi
00:30:09.980 | K1.5,
00:30:11.220 | achieving 96.2%
00:30:13.360 | on a popular
00:30:14.360 | math benchmark.
00:30:15.740 | that's a better
00:30:16.380 | score than
00:30:16.920 | OpenAI's
00:30:17.980 | So anyone
00:30:18.600 | saying that
00:30:19.140 | R1 is the
00:30:20.020 | last we're
00:30:20.460 | going to hear
00:30:20.860 | from China
00:30:21.460 | for quite a
00:30:22.200 | while might
00:30:23.260 | well be
00:30:23.620 | getting nervous,
00:30:24.300 | especially with
00:30:25.480 | R2 apparently
00:30:26.800 | imminent.
00:30:27.960 | I can't
00:30:28.840 | cover present
00:30:29.700 | and future
00:30:30.380 | Chinese language
00:30:31.140 | models without
00:30:31.980 | mentioning another
00:30:33.020 | narrative that
00:30:33.920 | might need
00:30:34.360 | busting.
00:30:34.720 | That the
00:30:35.580 | open nature
00:30:36.420 | of the
00:30:37.120 | DeepSeek R1
00:30:37.880 | paper is
00:30:38.820 | reflected in
00:30:39.960 | the openness
00:30:40.560 | of the model
00:30:41.380 | itself.
00:30:41.960 | Because as
00:30:42.760 | many of you
00:30:43.360 | will know,
00:30:43.820 | the model is
00:30:44.620 | not free to
00:30:45.540 | return outputs
00:30:46.340 | on sensitive
00:30:47.140 | Chinese topics.
00:30:48.020 | Not that it
00:30:49.040 | doesn't know
00:30:49.880 | anything about
00:30:50.460 | them though.
00:30:50.960 | I asked a
00:30:51.780 | simple question,
00:30:52.380 | tell me about
00:30:53.020 | the Uyghurs,
00:30:53.600 | and got this
00:30:54.380 | intriguing set
00:30:55.440 | of thoughts.
00:30:56.000 | This has to
00:30:56.960 | lead to an
00:30:57.460 | illuminating and
00:30:58.340 | deeply reflective
00:30:59.220 | final answer.
00:30:59.980 | We're sure,
00:31:00.620 | right?
00:31:00.900 | Not so much.
00:31:03.560 | DeepSeek's
00:31:04.200 | R1 model
00:31:04.820 | was released
00:31:05.560 | under MIT
00:31:06.400 | license,
00:31:06.860 | so of course
00:31:07.420 | others have
00:31:08.520 | been quick to
00:31:09.240 | adapt the
00:31:09.740 | model to,
00:31:10.380 | well,
00:31:11.060 | speak its
00:31:11.960 | truth.
00:31:12.380 | Regardless
00:31:13.140 | though,
00:31:13.500 | I am sure
00:31:14.360 | that this
00:31:15.280 | is a topic
00:31:16.740 | that DeepSeek
00:31:16.740 | and Liang
00:31:17.340 | Wenfeng,
00:31:17.920 | if they're
00:31:18.440 | watching,
00:31:18.720 | are exceptionally
00:31:19.960 | keen for me
00:31:20.640 | to move on
00:31:21.140 | from.
00:31:21.380 | So let's
00:31:21.900 | turn to how
00:31:22.420 | OpenAI
00:31:22.920 | tried,
00:31:23.620 | briefly,
00:31:24.180 | to establish
00:31:25.020 | their own
00:31:25.700 | counter-narrative,
00:31:26.660 | which was that
00:31:28.900 | DeepSeek may
00:31:28.900 | have illicitly
00:31:30.200 | accessed the
00:31:31.320 | chains of
00:31:31.860 | thought of
00:31:32.540 | OpenAI's
00:31:33.280 | O1 model
00:31:34.000 | and trained
00:31:35.140 | on them.
00:31:35.480 | Think of that
00:31:36.040 | as effectively
00:31:36.680 | stealing the
00:31:37.600 | intelligence that
00:31:38.480 | had been so
00:31:39.280 | carefully cultivated
00:31:40.500 | by OpenAI.
00:31:41.360 | A spokesperson
00:31:42.480 | for OpenAI
00:31:43.440 | said,
00:31:43.940 | we know that
00:31:44.660 | groups in
00:31:45.160 | China are
00:31:45.860 | actively working
00:31:46.800 | to use
00:31:47.260 | methods,
00:31:47.780 | including what's
00:31:48.600 | known as
00:31:48.900 | distillation,
00:31:49.520 | to try to
00:31:50.300 | replicate advanced
00:31:51.380 | US AI models.
00:31:52.620 | We are aware
00:31:53.640 | of and reviewing
00:31:54.560 | indications that
00:31:56.180 | DeepSeek may
00:31:56.180 | have inappropriately
00:31:57.520 | distilled our
00:31:58.160 | models.
00:31:58.420 | We take
00:31:58.980 | aggressive,
00:31:59.500 | proactive
00:32:00.160 | countermeasures
00:32:00.800 | to protect our
00:32:01.760 | technology and
00:32:02.580 | will continue
00:32:03.260 | working closely
00:32:04.220 | with the US
00:32:04.880 | government to
00:32:05.520 | protect the
00:32:06.020 | most capable
00:32:06.560 | models being
00:32:07.300 | built here.
00:32:07.940 | Side note,
00:32:08.540 | speaking of
00:32:09.140 | working with the
00:32:09.680 | government,
00:32:10.020 | certain US
00:32:10.900 | lawmakers are
00:32:11.660 | proposing that
00:32:12.500 | US users be
00:32:13.560 | jailed if they
00:32:14.560 | use DeepSeek R1.
00:32:15.840 | Back to the
00:32:16.660 | counter-narrative,
00:32:17.220 | that died in the
00:32:19.140 | public imagination
00:32:19.940 | almost as soon as
00:32:21.260 | it was tried,
00:32:21.840 | for one obvious
00:32:22.840 | reason.
00:32:23.220 | OpenAI themselves
00:32:24.460 | are being sued
00:32:25.400 | by everyone,
00:32:26.100 | including my
00:32:26.880 | second cousin's
00:32:27.540 | estranged grandmother,
00:32:28.340 | for knowingly
00:32:29.500 | training on
00:32:30.340 | copyrighted works
00:32:31.380 | without compensation.
00:32:32.820 | One suspects
00:32:33.740 | few will have
00:32:34.960 | any sympathy
00:32:35.800 | for those
00:32:36.340 | companies if
00:32:37.100 | others,
00:32:37.580 | like DeepSeek,
00:32:38.480 | distill anything
00:32:39.540 | from ChatGPT,
00:32:40.660 | even if they
00:32:41.380 | needed to,
00:32:41.900 | which they
00:32:42.340 | probably didn't
00:32:43.060 | by the way.
00:32:43.540 | Regardless,
00:32:44.260 | reasoning is being
00:32:45.460 | automated at a
00:32:46.800 | breakneck pace.
00:32:47.820 | As hard to believe
00:32:49.500 | as the DeepSeek
00:32:50.720 | rise is,
00:32:51.480 | for me,
00:32:52.060 | it's actually only
00:32:53.220 | a pointer to a
00:32:54.680 | bigger story.
00:32:55.420 | We are actually
00:32:57.100 | entering an era of
00:32:58.760 | automated artificial
00:33:00.160 | intelligence.
00:33:00.880 | And no,
00:33:01.500 | the models will not
00:33:02.520 | always best be
00:33:03.440 | described as tools
00:33:04.880 | akin to a
00:33:05.780 | calculator.
00:33:06.240 | If an AI in
00:33:08.180 | three years time
00:33:09.060 | can do 95% of
00:33:10.800 | my job,
00:33:11.680 | or yours,
00:33:12.360 | at what point
00:33:13.380 | am I just
00:33:14.260 | a tool responsible
00:33:15.200 | only for clicking
00:33:16.440 | submit?
00:33:16.860 | Granted,
00:33:17.540 | we are very much
00:33:18.800 | not there yet,
00:33:19.740 | of course.
00:33:20.280 | It is DeepSeek
00:33:22.640 | after all.
00:33:23.380 | And yes,
00:33:24.080 | humans are still
00:33:25.320 | just about in the
00:33:26.600 | driving seat.
00:33:27.380 | Meaning,
00:33:28.100 | I guess,
00:33:28.660 | the only thing
00:33:29.600 | absolutely guaranteed
00:33:30.680 | is drama.
00:33:31.880 | That then was the
00:33:37.420 | DeepSeek story as
00:33:38.580 | best we know it.
00:33:39.300 | What is next
00:33:40.420 | though for the
00:33:41.240 | taciturn Liang Wenfeng
00:33:42.740 | and his team of
00:33:43.520 | wizards?
00:33:43.880 | The R1 paper hints
00:33:45.480 | that they are deep
00:33:46.720 | in the mine still,
00:33:47.520 | working on
00:33:48.460 | infinite context
00:33:49.760 | and a replacement
00:33:51.020 | for the legendary
00:33:51.920 | transformer architecture
00:33:53.220 | behind every famous
00:33:54.520 | language model.
00:33:55.060 | But just take
00:33:55.660 | infinite context,
00:33:56.740 | where we can
00:33:57.320 | imagine a model
00:33:58.020 | provided everything
00:33:59.000 | you have ever heard
00:33:59.960 | or seen or said
00:34:01.120 | and referencing any
00:34:02.300 | of it when it gives
00:34:03.060 | you its next answer.
00:34:04.360 | Will DeepSeq do it?
00:34:05.460 | Will they reach
00:34:06.520 | AGI first?
00:34:07.580 | Would they actually
00:34:08.760 | open source it if so?
00:34:09.980 | Will the world
00:34:11.040 | grasp even a fraction
00:34:12.760 | of what is happening
00:34:13.540 | before that day
00:34:14.960 | or only after?
00:34:16.280 | Well,
00:34:17.180 | it probably won't be
00:34:18.560 | long before we find out.
00:34:21.960 | Thank you.