
AlexNet and ImageNet Explained


Chapters

0:00 Intro
1:06 Birth of Deep Learning
2:52 ImageNet
7:56 Lack of Readiness for Big Datasets
9:57 ImageNet Challenge (ILSVRC)
11:47 AlexNet
19:30 PyTorch Implementation
19:55 Data Preprocessing
27:06 Class Prediction with AlexNet
31:50 Goldfish Results
34:27 Closing Notes

Whisper Transcript

00:00:00.000 | Today we're going to talk about one of the most important events in the history of deep learning
00:00:05.600 | we're going to talk about what happened at ImageNet 2012 and how that launched the sort of
00:00:14.560 | deep learning rocket ship that we've been strapped to for the past decade. So in short we're going to
00:00:21.040 | talk about ImageNet and where it came from why it was so important and then we're going to have a
00:00:25.360 | brief look at convolutional neural networks and AlexNet, which
00:00:31.120 | is the model that triggered the massive growth of deep learning and for me I like to back everything
00:00:39.200 | up with code so what we'll do is towards the end of the video we're going to go through
00:00:43.840 | the PyTorch implementation of AlexNet and we're actually going to test it on a small ImageNet
00:00:52.480 | like data set and that'll be quite useful because we can see sort of image pre-processing steps and
00:00:58.880 | also how to perform inference with a convolutional neural network like AlexNet. So let's jump
00:01:05.840 | straight into it. Today's deep learning revolution traces its roots back to the 30th of September
00:01:12.080 | 2012. On this day a deep layered convolutional neural network won the ImageNet 2012 challenge
00:01:20.960 | and this convolutional neural network didn't just win it completely destroyed the rest of
00:01:27.120 | the competition. Now this model you might have guessed is called AlexNet and the simple fact
00:01:32.560 | that it even used a convolutional neural network was very new. Convolutional neural networks had been
00:01:37.920 | around for a while but using them had kind of been deemed impractical yet when AlexNet's results came
00:01:46.640 | in it proved sort of unparalleled performance on what was seen as one of the hardest challenges of
00:01:54.800 | the time for computer vision. So this event made AlexNet the first widely acknowledged successful
00:02:03.360 | implementation of deep learning and the sheer performance improvement that it showed caught
00:02:12.320 | people's attention. Until this point deep learning was unproven it was simply a nice idea that most
00:02:19.680 | people just decided okay it's impractical we don't have enough data we don't have enough compute to
00:02:25.760 | do anything like this but AlexNet showed that this was not the case and that deep learning
00:02:30.720 | was now practical. Yet this sort of surge of interest in deep learning was not solely you
00:02:39.040 | know thanks to AlexNet. ImageNet also played a big part in this. The foundation of applied deep
00:02:45.760 | learning was set by ImageNet and built upon by AlexNet. So let's begin with ImageNet. Back in
00:02:55.840 | 2006 the world of computer vision was a lot different to how we know it now. It was pretty
00:03:03.840 | underfunded it didn't really get that much attention yet there were a lot of researchers
00:03:09.040 | around the world focused on building better models, and year after year they saw progress, but
00:03:15.040 | it was slow. In that same year a woman called Fei-Fei Li had just finished her computer vision
00:03:23.760 | PhD at Caltech and had started working as a professor in computer science and had noticed
00:03:32.080 | this sort of focus in the field of computer vision on the models and the subsequent lack of focus on
00:03:41.760 | data and an idea came to Li that maybe a data set that was more representative of the world
00:03:50.240 | could improve the performance of the models being trained on it. Around the same time there
00:03:56.160 | was another professor called Christiane Fellbaum and she was a co-developer of a dataset from the
00:04:03.040 | 1980s called WordNet. Now WordNet consisted of a pretty large number of English language terms
00:04:11.120 | organized into an ontological structure. So for example for the term Siberian Husky
00:04:16.480 | that would be within a tree structure and above Siberian Husky you would have a working dog,
00:04:23.040 | above working dog you would have dog, above dog you'd have canine, carnivore and so on. So there's
00:04:28.480 | like that tree structure of different terms and how they relate to each other.
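As a quick aside, you can explore that same WordNet tree in code. Here's a minimal sketch using NLTK's WordNet interface (NLTK is my choice for illustration; the video itself doesn't use it):

```python
# Minimal sketch of WordNet's ontological tree structure using NLTK
# (assumes: pip install nltk, plus the corpus download below).
import nltk
nltk.download("wordnet", quiet=True)
from nltk.corpus import wordnet as wn

dog = wn.synset("dog.n.01")
# hypernym_paths() returns chains from the root ("entity") down to "dog";
# reversing one chain walks upward: dog -> canine -> carnivore -> ...
for synset in reversed(dog.hypernym_paths()[0]):
    print(synset.name().split(".")[0])
```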
00:04:37.120 | In 2007, Li and Fellbaum met, and Fellbaum discussed her idea at the time of adding just a reference
00:04:45.280 | image to each of the terms within WordNet. So the intention was not to create an image dataset but
00:04:53.600 | it was simply to add like a reference image so people could more easily understand what that
00:04:57.680 | particular term was about and this inspired an idea from Li that would kick start the world of
00:05:04.080 | computer vision and deep learning. So soon after Li put together a team to build what would become
00:05:11.120 | the largest labeled data set of images in the world called ImageNet. The idea
00:05:18.720 | behind ImageNet was that a large, ontologically structured dataset like WordNet but for images
00:05:27.600 | could be the key behind building more advanced content based image retrieval, object recognition,
00:05:36.080 | scene recognition and better visual understanding in computer vision models. And just two years
00:05:43.360 | later the first version of ImageNet was released with 12 million labeled images. These were all
00:05:51.680 | structured and labeled in line with the WordNet ontology. Yet if we consider the sheer size of
00:05:58.160 | that, the 12 million images, if one person had spent literally every single day labeling one
00:06:06.960 | image per minute and did literally nothing else in that time, they didn't eat, they didn't sleep,
00:06:12.800 | just labeled images, it would have taken them 22 years and 10 months, which obviously is a very long time.
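As a quick sanity check of that figure:

```python
# Back-of-envelope check: 12 million images, one labeled per minute, non-stop.
images = 12_000_000
minutes_per_year = 60 * 24 * 365
print(round(images / minutes_per_year, 1))  # ~22.8 years, i.e. 22 years and 10 months
```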
00:06:21.920 | There's just an insane number of images to be labeled here. So how did they do it? Because
00:06:30.160 | the team was not huge, they didn't have an infinite amount of money to pay other researchers and
00:06:35.520 | students to do this. So what they eventually settled on was a platform called Amazon's
00:06:42.000 | Mechanical Turk. Mechanical Turk is a crowdsourcing platform where people from around the globe will
00:06:49.920 | perform tasks such as labeling images for a set amount of money. Because there's just the insane
00:06:59.840 | scale of people around the world doing this at competitive prices, that made ImageNet possible
00:07:08.080 | with a few adjustments to the labeling process. Because in reality if you just have random people
00:07:14.960 | around the world labeling your dataset, some of those people are going to try and take advantage
00:07:19.840 | of that, maybe game the system. So you have to have some checks and balances in place against
00:07:26.640 | that. So there's a little bit of system design in there. But that is how they built the dataset.
00:07:36.480 | Now on its release, ImageNet was the largest publicly available labeled dataset of images
00:07:44.800 | in the world. Yet there was very little interest in the dataset. Which seems pretty crazy when we
00:07:52.320 | look back on that in hindsight. Because now we know, okay we want more data for our models to
00:07:56.960 | make them better. But at the time things were different. So ImageNet came with these 12 million
00:08:06.080 | images distributed across 22,000 categories. And at the time there was the odd other image
00:08:14.880 | dataset that used a similar sort of structure and idea. So for example the ESP dataset used
00:08:21.200 | something called the ESP game. And people would play the ESP game and label images for the dataset.
00:08:28.240 | Now reportedly they had way more images. But it wasn't publicly released. They only publicly
00:08:34.880 | released 60,000 of those images. And a couple of years later there were a few papers that kind of
00:08:44.080 | looked at the ESP game and ESP dataset and said okay it's probably not actually that useful.
00:08:52.240 | Because you can kind of guess the right answer most of the time without even looking at the image.
00:08:57.200 | So there were some questions around the usability of that dataset. So all this to say ImageNet
00:09:04.880 | was by far the biggest, at least publicly available, and most sort of accurate dataset
00:09:15.280 | for computer vision at the time. So the reason that there was very little interest in ImageNet
00:09:21.680 | despite its huge size is that people just assumed that it could not work for their models.
00:09:31.040 | You have to think back then they were training models on much smaller datasets which had like
00:09:39.440 | 12 categories of images for example. And the models would struggle with that. So when ImageNet
00:09:46.640 | comes along and it's like hey I have 22,000 categories here people are just like well I
00:09:53.040 | can't deal with 12 so I'm not going to even try 22,000. That's crazy. So there was a lack of
00:09:59.200 | interest in ImageNet at the time. It just wasn't really received that warmly. So the ImageNet team
00:10:08.800 | decided to try and push it a bit more. So by the next year 2010 they had managed to organize a
00:10:18.800 | challenge with the dataset, a classification challenge initially. And they grew into
00:10:23.680 | different things over the years but initially it's just a classification challenge. So the ImageNet
00:10:29.680 | large-scale visual recognition challenge was first hosted in 2010. And competitors had to
00:10:37.680 | correctly classify images from 1,000 categories. So not a full set of terms in the ontology of
00:10:48.960 | ImageNet but they had 1,000 categories instead. And whoever produced a model with the lowest error
00:10:57.600 | rate won. And there were a few entrants. There was not a huge number of entrants. I think it's
00:11:02.560 | something like 4, 5, 6 entrants in 2010, 2011, 2012. Now eventually this challenge would become
00:11:12.960 | the primary benchmark in computer vision progress. But it took some time. And that really started
00:11:21.440 | in 2012. So 2012 was not like the previous years for ImageNet. On the 30th of September 2012 the
00:11:30.560 | latest challenge results were released. And one of those results was a lot better than any of the
00:11:38.320 | other results. And it came from a model that most people thought was just not practical. And that
00:11:46.080 | was AlexNet. AlexNet was the first model to score a sub 25% error rate. And that same year the
00:11:54.320 | nearest competitor was 9.8 percentage points behind AlexNet. And they had done this with a
00:12:01.920 | deep layered convolutional neural network which at the time people were not really taking seriously.
00:12:07.360 | Now to understand AlexNet it's probably best we very quickly cover a little bit of what a
00:12:12.080 | convolutional neural network actually is. So a convolutional neural network or CNN is a neural
00:12:17.680 | network that has a special type of layer called a convolutional layer. And today these models are
00:12:26.560 | known for computer vision. They have been for quite a long time sort of undisputed champions
00:12:34.560 | of computer vision. And actually you know that has changed a little bit in literally like the
00:12:40.960 | past couple of years. But right now they're still pretty dominant. And unlike a lot of the models
00:12:47.760 | back in 2012 and earlier these did not need too much sort of manual feature extraction or too much
00:12:57.600 | image pre-processing before feeding data into the model. They could just kind of deal with that
00:13:04.080 | themselves. CNNs use several of these convolutional layers stacked on top of each other. And what you
00:13:10.640 | find is that the deeper the network is, the better it can identify more sort of complex
00:13:18.400 | concepts or objects in images. So for example the first with the first few layers you're probably
00:13:26.560 | going to just kind of identify okay this is an edge, this is a circle maybe, this is this shape
00:13:33.520 | and maybe some textures. As the network gets deeper and you add more layers to it it starts to
00:13:40.640 | abstract those features and identify more abstract ideas. So a deeper network will be able to
00:13:49.520 | identify okay this is like a living thing and then you go you build a deeper network and it can
00:13:56.000 | identify mammals and then it can identify dogs and then it can identify Siberian huskies. So as the
00:14:04.720 | model gets deeper its performance and its ability to identify more nuanced things gets better.
00:14:14.400 | So at the time these models were overlooked because essentially to train these to get good
00:14:23.840 | performance from one of these models they need to be really deep. Which means that they have a lot
00:14:28.640 | of parameters, okay, and the more parameters you have the longer it's going to take your model
00:14:34.240 | to train, if you can train it at all, if it's too big and doesn't even fit in the memory
00:14:38.800 | on your computer. And also the more parameters it has the more data it has to see before it
00:14:46.080 | can produce sort of any good performance. As a result of this they were simply
00:14:53.040 | overlooked. Yet the authors of AlexNet won the ImageNet challenge in 2012 and it turns out that
00:15:00.720 | they were the right people in the right place at the right time. Several pieces came together from
00:15:07.040 | different places to create this. ImageNet provided a massive amount of data needed to train one of
00:15:14.800 | these deep layered convolutional neural networks. A few years earlier in 2007 NVIDIA had released
00:15:20.880 | CUDA which you may recognize the name of. So an API that allowed software access to the lower level
00:15:30.560 | highly parallel processing abilities of CUDA enabled GPUs from NVIDIA. And GPU power in itself
00:15:38.080 | was reaching a point where this you know training these big models was becoming possible. Although
00:15:45.680 | it wasn't quite there yet at the time for a single GPU. So AlexNet was by no means small
00:15:52.480 | and because of that the authors had to solve a lot of problems to get all this working. So AlexNet
00:15:57.760 | consisted of five convolutional layers followed by three fully connected linear layers. The final
00:16:05.920 | layer to produce the 1000 classifications required by ImageNet was a 1000 node layer that used a
00:16:15.840 | softmax activation function to create this probability distribution over all of those
00:16:21.040 | classes.
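To make that concrete, here's a rough sketch of that classifier head, with the layer sizes taken from the torchvision implementation (dropout layers omitted for brevity):

```python
import torch
import torch.nn as nn

# AlexNet's three fully connected layers, ending in 1000 logits
# (sizes as in torchvision; dropout omitted in this sketch).
head = nn.Sequential(
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1000),  # one output per ImageNet class
)
logits = head(torch.randn(1, 256 * 6 * 6))
probs = torch.softmax(logits, dim=1)  # probability distribution over classes
print(probs.sum())  # sums to 1
```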
00:16:29.280 | Now a key conclusion from AlexNet was that the depth of the network was key to getting the performance that they got. And that depth, as I mentioned before, creates a lot of parameters
00:16:36.560 | that need to be trained making training the model either impractically slow or just simply impossible.
00:16:43.040 | Or at least that was the case if you're going to train it on CPU. So they had to turn to GPUs but
00:16:49.840 | at the time the high-end GPUs only had a memory of about three gigabytes which was not enough for
00:16:59.600 | AlexNet. So to make it work they had to distribute AlexNet across two GPUs and they did this by
00:17:06.640 | pretty much splitting the layers in two and having you know half of the network on one GPU half the
00:17:15.040 | network on the other GPU and having a couple of connections between the layers. So they had a
00:17:22.080 | couple of points where the information could be passed between those two halves and then at the
00:17:27.920 | end they came together into the final classification layer. Another important factor is that they
00:17:34.240 | swapped the more typical sigmoid and tanh activation functions of the time for the rectified
00:17:41.600 | linear unit, or ReLU, activation function, which again further improved the efficiency of the model
00:17:49.200 | and also meant that they didn't require normalization that you would usually have to
00:17:55.440 | do if you had tanh or sigmoid. Because with both of those activation functions, over many layers you
00:18:02.560 | can get what's called a saturation in your activations which means the activations in your
00:18:08.480 | neurons kind of push towards the two limits of one of those activation functions. So for
00:18:14.160 | example, with sigmoid you'd end up with all your activations pushed towards one or zero.
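A toy example makes the saturation point visible:

```python
import torch

x = torch.tensor([-10.0, -1.0, 0.0, 1.0, 10.0])
print(torch.sigmoid(x))  # saturates: large inputs pinned near 0 and 1
print(torch.tanh(x))     # saturates: large inputs pinned near -1 and 1
print(torch.relu(x))     # no upper saturation: 0 for negatives, linear above
```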
00:18:19.440 | Nonetheless they did use another type of normalization called local response normalization
00:18:24.880 | but that's not really used anymore. Nonetheless for AlexNet that was still a critical component.
00:18:33.600 | Now another super important thing that is still used today that AlexNet introduced was the use of
00:18:40.480 | overlapping in the pooling layers. Now pooling is already used in convolutional networks and
00:18:48.560 | it essentially just summarizes a window of information from one layer into a single
00:18:55.120 | activation value in the next layer. Now overlapping pooling does the same thing, but there's
00:18:59.920 | an overlap between the windows that get passed along from the preceding layer. So each window
00:19:06.080 | always sees a little bit of the previous window. Okay and they found that this reduces overfitting
00:19:12.400 | of the model and improves the performance.
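In PyTorch terms, overlapping pooling is just a max-pooling layer whose stride is smaller than its kernel size, as in this small sketch:

```python
import torch
import torch.nn as nn

# AlexNet-style overlapping pooling: 3x3 windows moved 2 pixels at a time,
# so adjacent windows share a row/column of pixels.
overlapping_pool = nn.MaxPool2d(kernel_size=3, stride=2)
# A non-overlapping alternative would be kernel_size=2, stride=2.

x = torch.randn(1, 1, 8, 8)
print(overlapping_pool(x).shape)  # torch.Size([1, 1, 3, 3])
```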
00:19:20.720 | So that is how they got AlexNet to work, and a few of the details behind how it worked and why it worked so well at the time.
00:19:27.040 | Now I think it's great to talk about all this but as I said at the start I think it's better
00:19:32.240 | to go through everything or back everything up with a little bit of code. So we'll go through
00:19:37.520 | a notebook; you can find a link to the Colab version of this notebook in the description below
00:19:45.120 | or if you're reading this on the Pinecone article page it will be in the resources section at the
00:19:50.960 | bottom and yeah we'll start going through that. Okay so we're going to start by downloading and
00:19:58.160 | pre-processing our ImageNet dataset. So we're not using the actual ImageNet itself we're using
00:20:05.600 | another hosted version of ImageNet which is much smaller that we can find on Hugging Face.
00:20:12.160 | So to use this we will need to pip install a few things. So pip install datasets which is where
00:20:19.600 | how we're going to use the ImageNet dataset, and later on we're also going to be using
00:20:25.680 | Torch and TorchVision so install those as well. So this is the dataset we're going to use so we're
00:20:32.320 | using this Maysee Tiny ImageNet dataset. Now this is a validation split and that contains I think
00:20:40.720 | it's 10,000 we can see here yeah 10,000 labeled images. Okay and then we can see a single record
00:20:48.320 | in there, so we see every image is stored as a PIL image object, and they have these labels, so this
00:20:56.800 | one has label zero; we don't necessarily know what that means right now but later on we'll see
00:21:02.480 | how we can actually figure that out.
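Sketched in code, that looks roughly like this (the dataset id and split name here are my reading of the notebook, so treat them as assumptions):

```python
# pip install datasets torch torchvision
from datasets import load_dataset

# Tiny ImageNet hosted on Hugging Face; validation split of 10,000 images.
imagenet = load_dataset("Maysee/tiny-imagenet", split="valid")
print(len(imagenet))  # 10000
print(imagenet[0])    # {'image': <PIL.Image ...>, 'label': 0}
```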
00:21:11.920 | So this one is referring to the actual training dataset; the training split of this dataset does contain 100,000 of those labeled
00:21:17.520 | images. Now we can check the type, it's the PIL object, and we can see, okay, so when we're in a
00:21:26.720 | notebook like this we can just call this ImageNet and this is just how we show that in the
00:21:33.760 | notebook so you can see it's a goldfish so we can probably guess that label zero actually means
00:21:38.560 | goldfish. So there are a few pre-processing steps that we need to go through so we need to convert
00:21:46.880 | all images into an RGB format so it will have three color channels. We need to resize all these
00:21:53.920 | images to fit the expected dimensions of AlexNet. We need to convert into a tensor for PyTorch.
00:22:02.080 | We need to normalize those values and stack so when we have multiple images we're going to stack
00:22:08.800 | them all into a single tensor, okay, to create our batch. So we start with RGB: AlexNet, as I
00:22:16.800 | mentioned assumes all images have three color channels red green and blue but there are many
00:22:22.320 | other formats that are supported by PIL objects so we'll see here that we have grayscale, okay, so
00:22:28.960 | this is number 201, this is a grayscale image because we have this L; there are other formats as well so we need to be
00:22:35.280 | aware of those and we can see it's I think it's an alligator yeah an alligator in grayscale okay
00:22:43.680 | so we convert into red green and blue and we'll see okay it's still grayscale, that's fine,
00:22:50.560 | it will still be shown as grayscale but in reality this only has one color channel which is actually
00:22:57.840 | just like a brightness channel whereas this now has three color channels red green and blue but
00:23:03.120 | they're all of equal values across those three channels, so it actually still shows as being
00:23:08.480 | grayscale even though it is in an RGB format.
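A rough sketch of that check-and-convert step (index 201 follows the example shown in the video):

```python
img = imagenet[201]["image"]  # the grayscale alligator example
print(img.mode)               # 'L': a single brightness channel

img = img.convert("RGB")      # three channels, all with equal values,
print(img.mode)               # 'RGB': so it still displays as grayscale
```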
00:23:17.120 | This is how we handle the RGB part, but we also need to resize the image to fit the expected dimensionality for AlexNet. So for AlexNet and for a lot of other
00:23:27.840 | computer vision models the height and width of the input images is expected to be at least
00:23:35.840 | 224 pixels, so we need to do that, and we can by using this: first we resize the images
00:23:46.080 | because these are very small images and they're not necessarily all going to be the square that
00:23:50.400 | we need the 224 by 224 so we resize them to be bigger and then we use this center crop to crop
00:23:59.040 | out any edges and make sure that is now a square image of that dimensionality and yeah we're doing
00:24:07.200 | that using this transforms module from TorchVision, which is a very good way of pre-processing
00:24:15.920 | your image data; it has a lot of functions and we'll see we'll use a couple more of those very soon.
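A sketch of those two transforms; the intermediate resize target is an assumption, as the notebook may use slightly different sizes:

```python
from torchvision import transforms

# Resize up first, then center-crop to the 224x224 square AlexNet expects.
resize = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
])
goldfish = imagenet[0]["image"].convert("RGB")
print(resize(goldfish).size)  # (224, 224)
```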
00:24:20.960 | So if we have a look at our first image, the goldfish image, you see it's now a bit bigger and
00:24:25.280 | we can also see it's kind of cropped some of it as well but we still get the you know we still get
00:24:30.240 | the idea of what is in that image so if we compare that to this here we can kind of see its eye at
00:24:36.240 | the front there and more of its head whereas here it's kind of almost chopped off. Now another thing
00:24:44.000 | we need to do is normalize all the values in these images, so RGB arrays tend to be in the
00:24:51.840 | range of 0 up to 255 and we need them to be in the range of 0 to 1 and we need to normalize them
00:25:00.240 | using these values that you see here so this mean of 0.4 and so on and the standard deviation of 0.2
00:25:07.360 | and so on this is specific to the AlexNet implementation from PyTorch so we go on the
00:25:15.200 | AlexNet PyTorch page, you can go down and, here we go, so the images have to be loaded into a range
00:25:22.400 | of 0 to 1 and normalize using the values I just showed you so that's why we're using those and
00:25:28.480 | yeah so we create this process function and then we process our image through it and then we can
00:25:35.520 | check the size, so the final result here is going to be that normalized tensor that we want,
00:25:41.760 | and it has the correct dimensionality that we need as well,
00:25:46.160 | so yeah, that's perfect.
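Those constants come straight from the PyTorch AlexNet page, so the full per-image pipeline looks roughly like this:

```python
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),  # PIL image -> float tensor in [0, 1]
    transforms.Normalize(   # values documented for PyTorch's AlexNet
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225],
    ),
])
tensor = preprocess(goldfish)
print(tensor.shape)  # torch.Size([3, 224, 224])
```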
00:25:52.080 | Now we want to put all this together, and we don't want to do it for every single image like this; we're just going to put it all together for a mini-batch of images,
00:25:57.360 | so we're going to go with the first 50 images, because they're all goldfish and we can
00:26:02.960 | easily check AlexNet's performance on that single object, so I'm going to redefine
00:26:11.600 | that pre-processing pipeline using everything we've just done, so we resize, we crop, we convert it
00:26:16.720 | to a tensor, we have to do this by the way because PyTorch is expecting a tensor object,
00:26:22.960 | and before we normalize it, it needs to be in that tensor format, otherwise we're going to get
00:26:28.400 | an error and then yeah we normalize it so we go through every image in the first 50 images and we
00:26:36.720 | first convert any that are grayscale to RGB, and we pre-process them, okay, and just append
00:26:48.640 | them to a list. Now, from that list, we want to stack all those together into a single tensor, so we do that
00:26:54.880 | here and we get this final mini-batch of our images, so we have a mini-batch of 50 and we have
00:27:02.640 | those images that you can see with the dimensionality here.
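Putting that loop together, roughly:

```python
import torch

batch = []
for record in imagenet.select(range(50)):  # the first 50 are all goldfish
    img = record["image"]
    if img.mode != "RGB":                  # convert any grayscale images
        img = img.convert("RGB")
    batch.append(preprocess(img))

inputs = torch.stack(batch)  # stack into one tensor
print(inputs.shape)          # torch.Size([50, 3, 224, 224])
```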
00:27:08.800 | So with all that done, we're now ready to move on to the inference step, so the prediction of the class labels for our images
00:27:14.960 | with AlexNet so the first thing we're going to want to do is download the model which is going
00:27:22.000 | to be hosted by PyTorch so we can do that here so let me so you can see a bit better
00:27:32.160 | we import torch, that's PyTorch, and we just do torch.hub.load, okay, pytorch/vision, let's just see the
00:27:41.040 | version that we're using and then we have AlexNet and we're not going to train AlexNet it would
00:27:46.800 | take a bit of time so we're going to use the pre-trained model weights so this version of
00:27:52.800 | AlexNet has already been trained, and then we just set it to evaluation mode for our inference, so
00:27:59.760 | for the predictions we don't want to train it by default I think it is in train mode which
00:28:05.280 | looks like this; we want it in evaluation mode.
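That download-and-eval step, as documented on the PyTorch hub page for AlexNet:

```python
import torch

# Load the pre-trained AlexNet weights from PyTorch hub.
model = torch.hub.load("pytorch/vision:v0.10.0", "alexnet", pretrained=True)
model.eval()  # switch from the default train mode to evaluation mode
```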
00:28:12.880 | And then we can see the model structure here as well, so you can see AlexNet, and this is
00:28:20.240 | where we're creating the image features, so there are many of these convolutional layers
00:28:27.520 | followed by the ReLU activation function, followed by the max pooling layer, and with each of those
00:28:34.480 | the model creates a more abstract tensor that represents the sort of information from that image
00:28:44.320 | so you can sort of imagine here the abstraction, so the feature that's been extracted
00:28:49.680 | is like, okay, there's some straight edges here and some curved edges; we go a little further
00:28:55.760 | and this is like, okay, this is an animal or this is a fish, and then by the time we get to
00:29:00.320 | here it's like okay this is a goldfish hopefully and then we move on to the classifier part so the
00:29:09.600 | classifier is these three layers, so we have dropout; this dropout was added to
00:29:15.760 | reduce the chance of overfitting and improve the ability of the model to generalize and yeah we
00:29:24.320 | have these linear, linear, linear, okay, so these are the fully connected linear layers
00:29:31.280 | that produce the final 1000 activations and the highest of these activations represents the class
00:29:39.840 | that the model is predicting as being the class that identifies the image it saw.
00:29:48.000 | So that's the model; we initialize it, and if we can, it's better that we move the model
00:29:55.760 | over to either a cuda gpu if available or more recently we have the apple silicon chips
00:30:03.760 | so if you are on a Mac with Apple silicon you want to use MPS, okay, so that's the case for me,
00:30:13.440 | I'm going to run all this on MPS. So we move the inputs to the device and we move the model
00:30:21.840 | to the device now when we move the model to the device it does this in place so we don't need to
00:30:26.320 | do it like we did here where it's inputs equals; we just write this, and then we run the model. So we set
00:30:34.320 | torch.no_grad, so we don't need to calculate the gradients, because we only do that for training the
00:30:38.480 | model we're just performing inference so we get our outputs we detach them from the model and then
00:30:45.840 | we can sort of see the shape so we have these 50 vectors of 1000 items so that's 50 activations
00:30:55.920 | across all of our 1000 classes, and we can see those here.
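A sketch of the device handling and the forward pass:

```python
# Pick the best available device: CUDA GPU, Apple silicon (MPS), or CPU.
device = (
    "cuda" if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available()
    else "cpu"
)
model.to(device)            # moves the model in place
inputs = inputs.to(device)  # returns a new tensor, so reassign

with torch.no_grad():       # inference only, no gradients needed
    output = model(inputs)

output = output.detach().cpu()
print(output.shape)  # torch.Size([50, 1000])
```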
00:31:04.800 | Now these are not normalized, so if you want to calculate the probability from this we use the softmax function, so we would
00:31:12.240 | do that like this, okay, that would map everything to a probability distribution and
00:31:19.360 | you'll be able to get the probability of, say, the top five classes for example, but we
00:31:26.800 | don't necessarily need to do that for what we're doing here so we could actually skip this so up
00:31:32.160 | here we are getting the output, so we could skip this, the probability part, and just replace that
00:31:38.080 | with output and we will get the same result for what we're doing here which is taking the
00:31:42.560 | value or the index position of the maximum value out of those 1000 classes so here we're getting
00:31:51.760 | one.
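In code, both routes give the same argmax, since softmax is monotonic:

```python
probs = torch.softmax(output, dim=1)  # optional: logits -> probabilities
preds = probs.argmax(dim=1)           # index of the highest value per image

# Taking argmax of the raw logits gives the identical result:
assert torch.equal(preds, output.argmax(dim=1))
print(preds[:5])  # mostly 1s, i.e. class index 1
```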
00:31:57.680 | Okay, now if you remember, earlier on the label that we had in this dataset was zero for the goldfish, and the reason these are different is because the dataset actually
00:32:04.640 | uses a different set of labels, so it's not actually the same, but if we do this, so
00:32:13.040 | over here let me open this and show you so over here we have a text file where each class is
00:32:25.280 | separated by a newline character so this is number zero a tench and number one is a goldfish right
00:32:38.560 | so if we get that information, so number one, we can see there's a lot of goldfish predictions here,
00:32:48.400 | which is a good sign. We can import those classes and we can create prediction labels
00:32:48.400 | by just splitting the response we get from this by newline characters and then what we do is if
00:33:03.920 | you print out prediction labels one, we get goldfish, okay, so the text label for that prediction.
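The class list used here is the standard imagenet_classes.txt file that the PyTorch hub docs point to:

```python
import requests

url = ("https://raw.githubusercontent.com/pytorch/hub/"
       "master/imagenet_classes.txt")
prediction_labels = requests.get(url).text.split("\n")
print(prediction_labels[0])  # 'tench'
print(prediction_labels[1])  # 'goldfish'
```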
00:33:13.360 | And yeah, so we have the first 50 images, all of those are goldfish, and we can see here,
00:33:13.360 | so i'm just printing out the last three you see they're all goldfish so we would expect everything
00:33:19.360 | all these predictions to be goldfish if the model is performing well okay and yeah we see for the
00:33:25.760 | for the most part that is the case now if we calculate the performance or the accuracy here
00:33:34.400 | we get 72%, so that represents a top-1 error rate of 28%, which beats the reported error rate of the
00:33:43.840 | AlexNet model in 2012 on the ImageNet challenge, which was 37.5% for the top-1.
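The accuracy calculation itself is a one-liner, assuming the preds tensor from the sketch above (every true label in this slice is goldfish, class index 1):

```python
# All 50 images are goldfish, which is index 1 in AlexNet's label set.
accuracy = (preds == 1).float().mean().item()
print(accuracy)  # ~0.72 in the video, i.e. a 28% top-1 error on this slice
```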
00:33:51.920 | However, I will point out that this is just a single class, this is a single label,
00:34:00.000 | goldfish right and the model will perform better on goldfish than other things okay so when we test
00:34:07.760 | this across the whole dataset, we first have to map all of the different
00:34:12.960 | labels between the AlexNet model and the dataset that we have here, because the
00:34:17.920 | labels have kind of been messed up, so it takes a bit of extra work, but if we do that the performance
00:34:22.800 | will not be as good nonetheless i think that is a pretty good result so that's it for this video
00:34:29.680 | that's our overview of one of the most significant events in computer vision and deep learning
00:34:36.720 | the ImageNet challenge was hosted annually until 2017; by then, 29 of 38 contestants had an error rate
00:34:46.160 | of less than five percent, so, you know, over those years the progress in computer
00:34:53.120 | vision just kind of went crazy. AlexNet ended up being superseded by even more powerful convolutional
00:34:59.360 | neural networks; Microsoft Research Asia won the 2015 challenge with their deep residual networks,
00:35:07.360 | and since then there have been many other sort of state-of-the-art convolutional networks that
00:35:14.160 | have come and gone. And even more recently there is the possibility of other networks coming in,
00:35:20.560 | such as transformer models, and disrupting the dominance of convolutional neural networks in
00:35:26.880 | computer vision. Now I'll leave you with the final paragraph of the AlexNet paper because it
00:35:32.160 | almost seems like they saw the future of deep learning they noted that they did not use any
00:35:40.240 | unsupervised pre-training even though they expect it will help and our results have improved as we
00:35:47.520 | make our network larger, but we still have many orders of magnitude to go in order to match the
00:35:53.840 | infero-temporal pathway of the human visual system, so to match human-level performance. Now we know
00:36:01.040 | that unsupervised pre-training and ever greater models ever deeper models were really sort of the
00:36:07.760 | key to all the improvement gains that we've got in deep learning in the past decade so i hope that
00:36:14.400 | has been useful thank you very much for watching and i will see you again in the next one bye