AlexNet and ImageNet Explained
Chapters
0:00 Intro
1:06 Birth of Deep Learning
2:52 ImageNet
7:56 Lack of Readiness for Big Datasets
9:57 ImageNet Challenge (ILSVRC)
11:47 AlexNet
19:30 PyTorch Implementation
19:55 Data Preprocessing
27:06 Class Prediction with AlexNet
31:50 Goldfish Results
34:27 Closing Notes
Today we're going to talk about one of the most important events in the history of deep learning: what happened at ImageNet 2012, and how that launched the deep learning rocket ship we've been strapped to for the past decade. In short, we'll talk about ImageNet, where it came from, and why it was so important. Then we'll very briefly look at convolutional neural networks and at AlexNet, the model that triggered the massive growth of deep learning. I like to back everything up with code, so towards the end of the video we'll go through the PyTorch implementation of AlexNet and actually test it on a small ImageNet-like dataset. That will be quite useful because we'll see the image pre-processing steps and also how to perform inference with a convolutional neural network like AlexNet. So let's jump straight into it.
Today's deep learning revolution traces its roots back to the 30th of September 2012. On that day a deep, layered convolutional neural network won the ImageNet 2012 challenge, and it didn't just win: it completely destroyed the rest of the competition. That model, as you might have guessed, is called AlexNet, and the simple fact that it even used a convolutional neural network was novel. Convolutional neural networks had been around for a while, but using them had been deemed impractical; yet when AlexNet's results came in, it showed unparalleled performance on what was seen as one of the hardest computer vision challenges of the time. This event made AlexNet the first widely acknowledged successful implementation of deep learning, and the sheer performance improvement it demonstrated caught people's attention. Until this point, deep learning was unproven, simply a nice idea that most people had dismissed as impractical: we don't have enough data, we don't have enough compute. AlexNet showed that this was not the case, and that deep learning was now practical. Yet this surge of interest in deep learning was not solely thanks to AlexNet; ImageNet also played a big part. The foundation of applied deep learning was set by ImageNet and built upon by AlexNet. So let's begin with ImageNet.
Back in 2006, the world of computer vision was very different from how we know it now. It was pretty underfunded and didn't get much attention, yet there were a lot of researchers around the world focused on building better models. Year after year they saw progress, but it was slow. In that same year, Fei-Fei Li had just finished her computer vision PhD at Caltech and started working as a professor in computer science. She had noticed the field's focus on models and the corresponding lack of focus on data, and an idea came to her: maybe a dataset more representative of the world could improve the performance of the models being trained on it.
Around the same time there was another professor, Christiane Fellbaum, a co-developer of a dataset from the 1980s called WordNet. WordNet consists of a large number of English-language terms organized into an ontological structure. For example, the term Siberian husky sits within a tree: above Siberian husky you have working dog, above working dog you have dog, above dog you have canine, then carnivore, and so on. So there's a tree structure of terms and how they relate to each other.
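As a rough illustration of that structure (this isn't from the video), here's a minimal sketch using NLTK's copy of WordNet to walk up the hypernym chain; the exact synset name is an assumption about how NLTK labels it:

```python
# Sketch: walking up WordNet's hypernym tree with NLTK.
# Assumes: pip install nltk, plus nltk.download('wordnet') on first run.
from nltk.corpus import wordnet as wn

node = wn.synset('Siberian_husky.n.01')  # assumed synset name
while node:
    print(node.lemma_names()[0])  # Siberian_husky, working_dog, dog, canine, ...
    hypernyms = node.hypernyms()
    node = hypernyms[0] if hypernyms else None
```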
In 2007, Li and Fellbaum met, and Fellbaum discussed an idea she had at the time of adding a reference image to each of the terms within WordNet. The intention was not to create an image dataset; it was simply to add a reference image so people could more easily understand what a particular term was about. And this inspired an idea in Li that would kick-start the world of computer vision and deep learning.
computer vision and deep learning. So soon after Li put together a team to build what would become 00:05:11.120 |
the largest labeled data set of images in the world called ImageNet. The idea 00:05:18.720 |
behind ImageNet was that a large ontological based data set like WordNet but for images 00:05:27.600 |
could be the key behind building more advanced content based image retrieval, object recognition, 00:05:36.080 |
scene recognition and better visual understanding in computer vision models. And just two years 00:05:43.360 |
later the first version of ImageNet was released with 12 million labeled images. These were all 00:05:51.680 |
structured and labeled in line with the WordNet ontology. Yet if we consider the sheer size of 00:05:58.160 |
that, the 12 million images, if one person had spent literally every single day labeling one 00:06:06.960 |
image per minute and did literally nothing else in that time, they didn't eat, they didn't sleep, 00:06:12.800 |
just labeled images, it would have taken them 22 years and 10 months. Which obviously is a very 00:06:21.920 |
Which is obviously a very long time; there's an insane number of images to label here. So how did they do it? The team was not huge, and they didn't have infinite money to pay researchers and students. What they eventually settled on was Amazon's Mechanical Turk, a crowdsourcing platform where people from around the globe perform tasks, such as labeling images, for a set amount of money. The sheer scale of people around the world doing this at competitive prices is what made ImageNet possible, with a few adjustments to the labeling process. Because in reality, if you have random people around the world labeling your dataset, some of them will try to take advantage of that and game the system, so you need checks and balances in place; there's a little bit of system design in there. But that is how they built the dataset.
On its release, ImageNet was the largest publicly available labeled image dataset in the world. Yet there was very little interest in it, which seems pretty crazy in hindsight, because now we know we want more data to make our models better. But at the time things were different. ImageNet came with those 12 million images distributed across 22,000 categories, and there were the odd other image datasets using a similar structure and idea. For example, the ESP dataset was built from the ESP game, where players labeled images for the dataset. Reportedly they had far more images, but the dataset wasn't fully released: only 60,000 of those images were made public. A couple of years later, a few papers looked at the ESP game and its dataset and concluded it was probably not that useful, because you could often guess the right answer without even looking at the image. So there were questions around that dataset's usability. All this to say: ImageNet was by far the biggest publicly available, and probably the most accurate, computer vision dataset of its time.
despite its huge size is that people just assumed that it could not work for their models. 00:09:31.040 |
You have to think back then they were training models on much smaller datasets which had like 00:09:39.440 |
12 categories of images for example. And the models would struggle with that. So when ImageNet 00:09:46.640 |
comes along and it's like hey I have 22,000 categories here people are just like well I 00:09:53.040 |
can't deal with 12 so I'm not going to even try 22,000. That's crazy. So there was a lack of 00:09:59.200 |
interest in ImageNet at the time. It just wasn't really received that warmly. So the ImageNet team 00:10:08.800 |
The ImageNet team decided to push it a bit more. By the next year, 2010, they had organized a challenge around the dataset: initially a classification challenge, though it grew into different tasks over the years. The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) was first hosted in 2010. Competitors had to correctly classify images from 1,000 categories, so not the full set of terms in the ImageNet ontology, and whoever produced the model with the lowest error rate won. There were a few entrants, but not a huge number; something like four to six in each of 2010, 2011, and 2012. Eventually this challenge would become the primary benchmark of progress in computer vision, but it took some time, and that really started in 2012.
2012 was not like the previous years for ImageNet. On the 30th of September 2012, the latest challenge results were released, and one of those results was far better than all the others. It came from a model that most people had thought was simply impractical: AlexNet. AlexNet was the first model to score a sub-25% error rate, and that year the nearest competitor was 9.8 percentage points behind it. The authors had done this with a deep, layered convolutional neural network, which at the time people were not taking seriously.
Now, to understand AlexNet it's best we quickly cover what a convolutional neural network actually is. A convolutional neural network, or CNN, is a neural network that contains a special type of layer called a convolutional layer. Today these models are known for computer vision; for a long time they were the undisputed champions of computer vision. That has changed a little in literally the past couple of years, but right now they're still pretty dominant. Unlike a lot of the models from 2012 and earlier, CNNs did not need much manual feature extraction or image pre-processing before data was fed into the model; they could deal with that themselves. CNNs use several convolutional layers stacked on top of each other, and what you find is that the deeper the network, the better it can identify complex concepts or objects in images. With the first few layers, you're probably just identifying edges, maybe a circle, simple shapes, and some textures. As the network gets deeper and you add more layers, it starts to abstract those features and identify higher-level ideas. A deeper network will be able to identify that something is a living thing; deeper still, it can identify mammals, then dogs, then Siberian huskies. So as the model gets deeper, its ability to identify more nuanced things improves.
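As a minimal sketch of that idea (this is not AlexNet, just an arbitrary stack showing the shape of a CNN):

```python
# Sketch: stacked convolutional layers building increasingly abstract features.
import torch
import torch.nn as nn

stack = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # low level: edges, colors
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # mid level: shapes, textures
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1),  # higher level: object parts
    nn.ReLU(),
)

x = torch.randn(1, 3, 224, 224)  # a dummy batch of one RGB image
print(stack(x).shape)            # torch.Size([1, 64, 56, 56])
```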
At the time, these models were overlooked because, essentially, to get good performance from one it needs to be really deep. That means a lot of parameters, and the more parameters you have, the longer your model takes to train, if you can train it at all and it even fits in your machine's memory. The more parameters it has, the more data it also needs to see before it can produce any good performance. As a result, they were simply overlooked. Yet the authors of AlexNet won the ImageNet challenge in 2012, and it turns out they were the right people in the right place at the right time. Several pieces came together from different places to make this happen. ImageNet provided the massive amount of data needed to train one of these deep, layered convolutional neural networks. A few years earlier, in 2007, NVIDIA had released CUDA, which you may recognize: an API that gave software access to the low-level, highly parallel processing abilities of CUDA-enabled NVIDIA GPUs. And GPU power itself was reaching a point where training these big models was becoming possible, although it wasn't quite there yet for a single GPU. AlexNet was by no means small, and because of that the authors had to solve a lot of problems to get everything working.
AlexNet consists of five convolutional layers followed by three fully connected linear layers. The final layer, which produces the 1,000 classifications required by ImageNet, is a 1,000-node layer that uses a softmax activation function to create a probability distribution over all of those classes.
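In code, that final step looks roughly like this sketch; 4096 is the width of AlexNet's penultimate fully connected layer:

```python
# Sketch: a 1,000-way classification head like AlexNet's final layer.
import torch
import torch.nn as nn

head = nn.Linear(4096, 1000)           # final fully connected layer
logits = head(torch.randn(1, 4096))    # raw scores for the 1,000 classes
probs = torch.softmax(logits, dim=-1)  # probability distribution over classes
print(probs.sum())                     # tensor(1.0000, ...)
```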
Now, a key conclusion from AlexNet was that the depth of the network was essential to the performance they achieved. And that depth, as mentioned before, creates a lot of parameters that need to be trained, making training the model either impractically slow or simply impossible, at least if you train it on CPU. So they had to turn to GPUs, but at the time high-end GPUs only had about three gigabytes of memory, which was not enough for AlexNet. To make it work, they distributed AlexNet across two GPUs, essentially splitting the layers in two, with half of the network on one GPU and half on the other, plus a couple of connection points where information could pass between the two halves before they came together in the final classification layer.
Another important factor is that they swapped the more typical saturating activation functions of the time, tanh and sigmoid, for the rectified linear unit, or ReLU, activation function, which further improved the efficiency of the model. It also meant they didn't require the normalization you would usually need with tanh or sigmoid. With both of those activation functions, over many layers you can get what's called saturation in your activations, where the activations in your neurons get pushed towards the two limits of the function; with sigmoid, for example, all your activations end up pushed towards one or zero. They did still use another type of normalization, called local response normalization, which isn't really used anymore, but for AlexNet it was a critical component.
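A quick toy demonstration of the saturation problem (not from the video): gradients through a sigmoid vanish at the extremes, while ReLU's don't for positive inputs.

```python
# Toy demo: gradients through a saturating activation vs. ReLU.
import torch

x = torch.linspace(-10, 10, 5, requires_grad=True)
torch.sigmoid(x).sum().backward()
print(x.grad)  # ~0 at the extremes: the sigmoid has saturated

y = torch.linspace(-10, 10, 5, requires_grad=True)
torch.relu(y).sum().backward()
print(y.grad)  # 0 for negatives, 1 for positives: no saturation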
Another super important technique that AlexNet introduced, and that is still used today, is overlapping pooling. Pooling was already used in convolutional networks: it summarizes a window of information from one layer into a single activation value in the next layer. Overlapping pooling does the same thing, but there's an overlap between consecutive windows, so each window always sees a little bit of the previous one. They found that this reduces overfitting and improves performance. So that is how they got AlexNet to work, and a few of the details behind how it worked and why it worked so well at the time.
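In PyTorch terms, AlexNet's overlapping pooling is a 3x3 window moved with a stride of 2, versus traditional pooling where the window size and stride match:

```python
# Sketch: overlapping vs. non-overlapping max pooling.
import torch
import torch.nn as nn

overlapping = nn.MaxPool2d(kernel_size=3, stride=2)      # windows overlap by 1
non_overlapping = nn.MaxPool2d(kernel_size=2, stride=2)  # traditional: no overlap

x = torch.randn(1, 1, 8, 8)
print(overlapping(x).shape)      # torch.Size([1, 1, 3, 3])
print(non_overlapping(x).shape)  # torch.Size([1, 1, 4, 4])
```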
Now, it's great to talk about all this, but as I said at the start, I think it's better to back everything up with a little bit of code. So we'll go through a notebook; you can find a link to the Colab version in the description below, or if you're reading this on the Pinecone article page, it will be in the resources section at the bottom.
Okay, so we're going to start by downloading and pre-processing our ImageNet dataset. We're not using the actual ImageNet itself; we're using a much smaller hosted version of ImageNet that we can find on Hugging Face. To use it, we'll need to pip install a few things: datasets, which is how we're going to load the ImageNet data, and, since we'll use them later on, torch and torchvision as well. The dataset we're going to use is the Maysee Tiny ImageNet dataset, and we'll take the validation split, which contains 10,000 labeled images (the training split contains 100,000). Looking at a single record, every image is stored as a PIL image object with a label. This one has label 0; we don't necessarily know what that means right now, but later on we'll see how to figure it out. We can check the type of the object, and when we're in a notebook like this, we can just call the image to display it. You can see it's a goldfish, so we can probably guess that label 0 actually means goldfish.
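A sketch of those first steps; the dataset id is what I believe it is on Hugging Face, so treat it as an assumption:

```python
# Load the Tiny ImageNet validation split from Hugging Face.
# Assumes: pip install datasets torch torchvision
from datasets import load_dataset

imagenet = load_dataset('Maysee/tiny-imagenet', split='valid')
print(imagenet.num_rows)  # 10000
print(imagenet[0])        # {'image': <PIL.Image ...>, 'label': 0}
```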
There are a few pre-processing steps we need to go through. We need to convert all images into RGB format, so each has three color channels; resize them to fit the dimensions AlexNet expects; convert them into tensors for PyTorch; normalize the values; and finally, when we have multiple images, stack them all into a single tensor to create our batch. We start with RGB. AlexNet, as mentioned, assumes all images have three color channels: red, green, and blue. But PIL objects support many other formats, and we'll see that image 201 in this dataset is grayscale, indicated by the mode 'L', so we need to be aware of those other formats. It's, I think, an alligator in grayscale. We convert it into RGB, and it will still display as grayscale; that's fine. Before conversion it had just one channel, essentially a brightness channel, whereas now it has three channels of equal values, so it still looks grayscale even though it's in RGB format.
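Continuing from the snippet above, the conversion itself is one call on the PIL object:

```python
# Image 201 in the dataset is grayscale ('L' mode in the notebook).
img = imagenet[201]['image']
print(img.mode)           # 'L' - a single brightness channel
img = img.convert('RGB')
print(img.mode)           # 'RGB' - three (equal-valued) channels
```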
That's how we handle the RGB part, but we also need to resize the images to fit the dimensionality AlexNet expects. For AlexNet, and for a lot of other computer vision models, the height and width of the input images must be at least 224 pixels. These are very small images, and they're not necessarily the 224-by-224 square we need, so we first resize them to be bigger and then use a center crop to cut away the edges and produce a square image of the right size. We do that using the transforms module from torchvision, which is a very good way of pre-processing image data; it has a lot of functions, and we'll use a couple more of them very soon. If we look at our first image, the goldfish, it's now a bit bigger, and the crop has removed some of it, but we still get the idea of what's in the image. Comparing it to the original, we could previously see its eye at the front and more of its head, whereas now that's almost cropped off.
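Something like the following; 256 is the intermediate size used on the PyTorch AlexNet page, and the notebook's exact value may differ:

```python
# Resize, then center-crop to the 224x224 square AlexNet expects.
from torchvision import transforms

resize = transforms.Compose([
    transforms.Resize(256),      # upscale the short side first
    transforms.CenterCrop(224),  # then crop the centre to a square
])
print(resize(imagenet[0]['image'].convert('RGB')).size)  # (224, 224)
```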
Another thing we need to do is normalize the values in these images. RGB arrays tend to be in the range 0 to 255, and we need them in the range 0 to 1, normalized using specific values: a per-channel mean of roughly 0.4 and standard deviation of roughly 0.2. These are specific to the AlexNet implementation from PyTorch; if you go to the AlexNet PyTorch page, it says the images have to be loaded into a range of 0 to 1 and normalized using exactly those values, which is why we're using them. So we create a process function, push our image through it, and check the size: the final result is the normalized tensor we want, with the correct dimensionality. That's perfect.
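Here's the full single-image pipeline as a sketch, with the normalization constants from the PyTorch AlexNet page:

```python
# Full single-image pre-processing pipeline for AlexNet.
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),  # PIL image -> float tensor in [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
tensor = preprocess(imagenet[0]['image'].convert('RGB'))
print(tensor.shape)  # torch.Size([3, 224, 224])
```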
Now we want to put all this together, and we don't want to do it one image at a time; we'll run it over a mini-batch of images. We'll go with the first 50 images, because they're all goldfish, which lets us easily check AlexNet's performance on a single class. We redefine the pre-processing pipeline using everything we've just done: resize, center crop, convert to tensor (we have to do this because PyTorch expects a tensor object, and the image needs to be in tensor format before we normalize it, otherwise we get an error), and then normalize. We go through each of the first 50 images, first converting any that are grayscale to RGB, pre-process them, and append them to a list. Then we stack that list into a single tensor, giving us our final mini-batch of 50 images with the dimensionality you can see here.
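Continuing from the `preprocess` pipeline above, the batching step looks roughly like this:

```python
# Build the mini-batch: pre-process the first 50 images and stack them.
import torch

inputs = []
for i in range(50):
    img = imagenet[i]['image'].convert('RGB')  # handles grayscale images too
    inputs.append(preprocess(img))

batch = torch.stack(inputs)  # list of tensors -> one tensor
print(batch.shape)           # torch.Size([50, 3, 224, 224])
```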
With all that done, we're now ready to move on to the inference step: predicting the class labels for our images with AlexNet. The first thing we want to do is download the model, which is hosted by PyTorch. We import torch and use torch.hub.load, pointing at the PyTorch vision repository and pinning the version we're using, to get AlexNet. We're not going to train AlexNet, which would take a while; we're going to use the pre-trained model weights, so this version of AlexNet has already been trained. Then we set it to evaluation mode for our inference, because by default the model is in train mode, and we don't want to train it.
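That step, following the usage shown on the PyTorch AlexNet page (the version tag is the one used there):

```python
# Download the pretrained AlexNet from PyTorch Hub and switch to eval mode.
import torch

model = torch.hub.load('pytorch/vision:v0.10.0', 'alexnet', pretrained=True)
model.eval()  # inference mode: disables dropout, etc.
```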
We can then see the model structure. First is the features section, where the image features are created: several convolutional layers, each followed by a ReLU activation function, with max pooling layers in between. With each of these, the model creates a more abstract tensor representing the information from the image. You can imagine the abstraction building up: early on, the extracted features are things like straight edges and curved edges; a little further in, it's something like "this is an animal" or "this is a fish"; and by the time we get to the end, it's, hopefully, "this is a goldfish". Then we move on to the classifier part, the final three linear layers. There's also dropout, which was added to reduce the chance of overfitting and improve the model's ability to generalize. The fully connected linear layers produce the final 1,000 activations, and the highest of these activations represents the class the model predicts for the image it saw.
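You can inspect those two halves directly; the comments below summarize torchvision's AlexNet layout:

```python
# Inspect the two halves of the network described above.
print(model.features)    # 5x Conv2d, each followed by ReLU, with MaxPool2d
                         # after the 1st, 2nd, and 5th convolutions
print(model.classifier)  # Dropout -> Linear(9216, 4096) -> ReLU -> Dropout
                         # -> Linear(4096, 4096) -> ReLU -> Linear(4096, 1000)
```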
So that's the model. We initialize it, and if we can, it's better to move the model over to a CUDA GPU, if available, or, more recently, Apple silicon: if you're on a Mac with Apple silicon, you want to use MPS. That's the case for me, so I'll run all of this on MPS. We move the inputs to the device and the model to the device; note that moving the model happens in place, so unlike the inputs we don't need to reassign it. Then we run the model inside torch.no_grad, since we don't need to calculate gradients; those are for training, and we're just performing inference. We get our outputs, detach them, and check the shape: 50 vectors of 1,000 values, i.e. 50 sets of activations across all of our 1,000 classes.
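Continuing from the `batch` and `model` snippets above, a sketch of that inference step:

```python
# Move everything to the best available device and run inference.
device = (
    'cuda' if torch.cuda.is_available()
    else 'mps' if torch.backends.mps.is_available()
    else 'cpu'
)
inputs = batch.to(device)
model.to(device)  # in place, no reassignment needed

with torch.no_grad():  # no gradients: we're not training
    output = model(inputs).detach().cpu()
print(output.shape)  # torch.Size([50, 1000])
```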
Now, these values are not normalized. If you want to calculate probabilities from them, you use the softmax function, which maps everything to a probability distribution; from there you could get, say, the probabilities of the top five classes. But we don't necessarily need that for what we're doing here, so we could skip the probability step and just use the raw output, and we'd get the same result, because all we're taking is the index position of the maximum value out of those 1,000 classes. Here we get 1.
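In code, using the `output` tensor from above:

```python
# Optional softmax, then take the highest-scoring class per image.
probs = torch.softmax(output, dim=-1)  # probability distribution per image
preds = output.argmax(dim=-1)          # same indices with or without softmax
print(preds[:5])                       # mostly 1s if these are all goldfish
```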
Now, if you remember, the label we had for the goldfish in this dataset was 0. The reason these differ is that the dataset uses a different set of labels from the model; it's not actually the same. But if we look at the label file, a text file where each class is separated by a newline character, number 0 is a tench and number 1 is a goldfish. And there are a lot of 1s, i.e. goldfish predictions, in our output, which is a good sign. We can import those classes and create prediction labels by splitting the file's contents on newline characters; printing prediction label 1 gives us "goldfish", the text label for that prediction. Our first 50 images are all goldfish, and printing out the last three predictions, they're all goldfish too. If the model is performing well, we'd expect all of these predictions to be goldfish, and for the most part that is the case.
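A sketch of that mapping; the URL is the commonly used class list from the pytorch/hub repo, and the notebook's exact source may differ:

```python
# Map class indices to human-readable ImageNet labels.
import requests

url = 'https://raw.githubusercontent.com/pytorch/hub/master/imagenet_classes.txt'
prediction_labels = requests.get(url).text.split('\n')

print(prediction_labels[0])  # tench
print(prediction_labels[1])  # goldfish
print([prediction_labels[i] for i in preds[-3:].tolist()])  # last three predictions
```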
Now, if we calculate the accuracy here, we get 72%, which represents a top-1 error rate of 28%. That beats the top-1 error rate reported for the AlexNet model on the ImageNet 2012 challenge, which was 37.5%.
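The check itself is one line, using the `preds` tensor from above (goldfish is class 1 in AlexNet's label set):

```python
# Accuracy over the 50 goldfish images.
accuracy = (preds == 1).float().mean().item()
print(accuracy)  # ~0.72 in the notebook, i.e. a 28% top-1 error rate
```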
However, I will point out that this is just a single class, a single label: goldfish, and the model performs better on goldfish than on some other things. If we tested across the whole dataset, we'd first have to map all the labels between the AlexNet model and this dataset, because the label sets don't line up, so it takes a bit of extra work; and if we did that, the performance would not be as good. Nonetheless, I think that's a pretty good result.
So that's it for this video: our overview of one of the most significant events in computer vision and deep learning. The ImageNet challenge was hosted annually until 2017, and by then 29 of 38 contestants had an error rate of less than five percent; over those years, progress in computer vision just went crazy. AlexNet ended up being superseded by even more powerful convolutional neural networks; in 2015, Microsoft Research Asia's entry surpassed even human-level performance on the challenge, and since then many state-of-the-art convolutional networks have come and gone. More recently, there's the possibility of other architectures, such as transformer models, disrupting the dominance of convolutional neural networks in computer vision. I'll leave you with the final paragraph of the AlexNet paper, because it almost seems like the authors saw the future of deep learning. They noted that they did not use any unsupervised pre-training, even though they expected it would help, and that their results had improved as they made their network larger, but that they still had many orders of magnitude to go in order to match the inferotemporal pathway of the human visual system, that is, to match human-level performance. Now we know that unsupervised pre-training and ever larger, ever deeper models were really the key to the improvements we've seen in deep learning over the past decade. I hope this has been useful. Thank you very much for watching, and I will see you again in the next one. Bye.