AlexNet and ImageNet Explained
Chapters
0:00 Intro
1:06 Birth of Deep Learning
2:52 ImageNet
7:56 Lack of Readiness for Big Datasets
9:57 ImageNet Challenge (ILSVRC)
11:47 AlexNet
19:30 PyTorch Implementation
19:55 Data Preprocessing
27:06 Class Prediction with AlexNet
31:50 Goldfish Results
34:27 Closing Notes
Today we're going to talk about one of the most important events in the history of deep learning: what happened at ImageNet 2012, and how that launched the deep learning rocket ship we've been strapped to for the past decade. In short, we'll talk about ImageNet, where it came from, and why it was so important. Then we'll very briefly look at convolutional neural networks and at AlexNet, the model that triggered the massive growth of deep learning. I like to back everything up with code, so towards the end of the video we'll go through the PyTorch implementation of AlexNet and actually test it on a small ImageNet-like dataset. That will be quite useful because we'll see the image pre-processing steps and also how to perform inference with a convolutional neural network like AlexNet. So let's jump straight into it.
Today's deep learning revolution traces its roots back to the 30th of September 2012. On that day a deep, layered convolutional neural network won the ImageNet 2012 challenge, and it didn't just win: it completely destroyed the rest of the competition. That model, as you might have guessed, is called AlexNet, and the simple fact that it even used a convolutional neural network was novel. Convolutional neural networks had been around for a while, but using them had been deemed impractical; yet when AlexNet's results came in, it showed unparalleled performance on what was seen as one of the hardest computer vision challenges of the time. This event made AlexNet the first widely acknowledged successful implementation of deep learning, and the sheer performance improvement it demonstrated caught people's attention. Until this point, deep learning was unproven, simply a nice idea that most people had dismissed as impractical: we don't have enough data, we don't have enough compute. AlexNet showed that this was not the case, and that deep learning was now practical. Yet this surge of interest in deep learning was not solely thanks to AlexNet; ImageNet also played a big part. The foundation of applied deep learning was set by ImageNet and built upon by AlexNet. So let's begin with ImageNet.
Back in 2006, the world of computer vision was very different from how we know it now. It was pretty underfunded and didn't get much attention, yet there were a lot of researchers around the world focused on building better models. Year after year they saw progress, but it was slow. In that same year, Fei-Fei Li had just finished her computer vision PhD at Caltech and started working as a professor in computer science. She had noticed the field's focus on models and the corresponding lack of focus on data, and an idea came to her: maybe a dataset more representative of the world could improve the performance of the models being trained on it.
Around the same time there was another professor, Christiane Fellbaum, a co-developer of a dataset from the 1980s called WordNet. WordNet consists of a large number of English-language terms organized into an ontological structure. For example, the term Siberian husky sits within a tree: above Siberian husky you have working dog, above working dog you have dog, above dog you have canine, then carnivore, and so on. So there's a tree structure of terms and how they relate to each other.
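As a rough illustration of that structure (this isn't from the video), here's a minimal sketch using NLTK's copy of WordNet to walk up the hypernym chain; the exact synset name is an assumption about how NLTK labels it:

```python
# Sketch: walking up WordNet's hypernym tree with NLTK.
# Assumes: pip install nltk, plus nltk.download('wordnet') on first run.
from nltk.corpus import wordnet as wn

node = wn.synset('Siberian_husky.n.01')  # assumed synset name
while node:
    print(node.lemma_names()[0])  # Siberian_husky, working_dog, dog, canine, ...
    hypernyms = node.hypernyms()
    node = hypernyms[0] if hypernyms else None
```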
In 2007, Li and Fellbaum met, and Fellbaum discussed an idea she had at the time of adding a reference image to each of the terms within WordNet. The intention was not to create an image dataset; it was simply to add a reference image so people could more easily understand what a particular term was about. And this inspired an idea in Li that would kick-start the world of computer vision and deep learning.
computer vision and deep learning. So soon after Li put together a team to build what would become 00:05:11.120 |
the largest labeled data set of images in the world called ImageNet. The idea 00:05:18.720 |
behind ImageNet was that a large ontological based data set like WordNet but for images 00:05:27.600 |
could be the key behind building more advanced content based image retrieval, object recognition, 00:05:36.080 |
scene recognition and better visual understanding in computer vision models. And just two years 00:05:43.360 |
later the first version of ImageNet was released with 12 million labeled images. These were all 00:05:51.680 |
structured and labeled in line with the WordNet ontology. Yet if we consider the sheer size of 00:05:58.160 |
that, the 12 million images, if one person had spent literally every single day labeling one 00:06:06.960 |
image per minute and did literally nothing else in that time, they didn't eat, they didn't sleep, 00:06:12.800 |
just labeled images, it would have taken them 22 years and 10 months. Which obviously is a very 00:06:21.920 |
Which is obviously a very long time; there's an insane number of images to label here. So how did they do it? The team was not huge, and they didn't have infinite money to pay researchers and students. What they eventually settled on was Amazon's Mechanical Turk, a crowdsourcing platform where people from around the globe perform tasks, such as labeling images, for a set amount of money. The sheer scale of people around the world doing this at competitive prices is what made ImageNet possible, with a few adjustments to the labeling process. Because in reality, if you have random people around the world labeling your dataset, some of them will try to take advantage of that and game the system, so you need checks and balances in place; there's a little bit of system design in there. But that is how they built the dataset.
On its release, ImageNet was the largest publicly available labeled image dataset in the world. Yet there was very little interest in it, which seems pretty crazy in hindsight, because now we know we want more data to make our models better. But at the time things were different. ImageNet came with those 12 million images distributed across 22,000 categories, and there were the odd other image datasets using a similar structure and idea. For example, the ESP dataset was built from the ESP game, where players labeled images for the dataset. Reportedly they had far more images, but the dataset wasn't fully released: only 60,000 of those images were made public. A couple of years later, a few papers looked at the ESP game and its dataset and concluded it was probably not that useful, because you could often guess the right answer without even looking at the image. So there were questions around that dataset's usability. All this to say: ImageNet was by far the biggest publicly available, and probably the most accurate, computer vision dataset of its time.
despite its huge size is that people just assumed that it could not work for their models. 00:09:31.040 |
You have to think back then they were training models on much smaller datasets which had like 00:09:39.440 |
12 categories of images for example. And the models would struggle with that. So when ImageNet 00:09:46.640 |
comes along and it's like hey I have 22,000 categories here people are just like well I 00:09:53.040 |
can't deal with 12 so I'm not going to even try 22,000. That's crazy. So there was a lack of 00:09:59.200 |
interest in ImageNet at the time. It just wasn't really received that warmly. So the ImageNet team 00:10:08.800 |
The ImageNet team decided to push it a bit more. By the next year, 2010, they had organized a challenge around the dataset: initially a classification challenge, though it grew into different tasks over the years. The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) was first hosted in 2010. Competitors had to correctly classify images from 1,000 categories, so not the full set of terms in the ImageNet ontology, and whoever produced the model with the lowest error rate won. There were a few entrants, but not a huge number; something like four to six in each of 2010, 2011, and 2012. Eventually this challenge would become the primary benchmark of progress in computer vision, but it took some time, and that really started in 2012.
2012 was not like the previous years for ImageNet. On the 30th of September 2012, the latest challenge results were released, and one of those results was far better than all the others. It came from a model that most people had thought was simply impractical: AlexNet. AlexNet was the first model to score a sub-25% error rate, and that year the nearest competitor was 9.8 percentage points behind it. The authors had done this with a deep, layered convolutional neural network, which at the time people were not taking seriously.
Now, to understand AlexNet it's best we quickly cover what a convolutional neural network actually is. A convolutional neural network, or CNN, is a neural network that contains a special type of layer called a convolutional layer. Today these models are known for computer vision; for a long time they were the undisputed champions of computer vision. That has changed a little in literally the past couple of years, but right now they're still pretty dominant. Unlike a lot of the models from 2012 and earlier, CNNs did not need much manual feature extraction or image pre-processing before data was fed into the model; they could deal with that themselves. CNNs use several convolutional layers stacked on top of each other, and what you find is that the deeper the network, the better it can identify complex concepts or objects in images. With the first few layers, you're probably just identifying edges, maybe a circle, simple shapes, and some textures. As the network gets deeper and you add more layers, it starts to abstract those features and identify higher-level ideas. A deeper network will be able to identify that something is a living thing; deeper still, it can identify mammals, then dogs, then Siberian huskies. So as the model gets deeper, its ability to identify more nuanced things improves.
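As a minimal sketch of that idea (this is not AlexNet, just an arbitrary stack showing the shape of a CNN):

```python
# Sketch: stacked convolutional layers building increasingly abstract features.
import torch
import torch.nn as nn

stack = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # low level: edges, colors
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # mid level: shapes, textures
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1),  # higher level: object parts
    nn.ReLU(),
)

x = torch.randn(1, 3, 224, 224)  # a dummy batch of one RGB image
print(stack(x).shape)            # torch.Size([1, 64, 56, 56])
```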
At the time, these models were overlooked because, essentially, to get good performance from one it needs to be really deep. That means a lot of parameters, and the more parameters you have, the longer your model takes to train, if you can train it at all and it even fits in your machine's memory. The more parameters it has, the more data it also needs to see before it can produce any good performance. As a result, they were simply overlooked. Yet the authors of AlexNet won the ImageNet challenge in 2012, and it turns out they were the right people in the right place at the right time. Several pieces came together from different places to make this happen. ImageNet provided the massive amount of data needed to train one of these deep, layered convolutional neural networks. A few years earlier, in 2007, NVIDIA had released CUDA, which you may recognize: an API that gave software access to the low-level, highly parallel processing abilities of CUDA-enabled NVIDIA GPUs. And GPU power itself was reaching a point where training these big models was becoming possible, although it wasn't quite there yet for a single GPU. AlexNet was by no means small, and because of that the authors had to solve a lot of problems to get everything working.
AlexNet consists of five convolutional layers followed by three fully connected linear layers. The final layer, which produces the 1,000 classifications required by ImageNet, is a 1,000-node layer that uses a softmax activation function to create a probability distribution over all of those classes.
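In code, that final step looks roughly like this sketch; 4096 is the width of AlexNet's penultimate fully connected layer:

```python
# Sketch: a 1,000-way classification head like AlexNet's final layer.
import torch
import torch.nn as nn

head = nn.Linear(4096, 1000)           # final fully connected layer
logits = head(torch.randn(1, 4096))    # raw scores for the 1,000 classes
probs = torch.softmax(logits, dim=-1)  # probability distribution over classes
print(probs.sum())                     # tensor(1.0000, ...)
```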
Now, a key conclusion from AlexNet was that the depth of the network was essential to the performance they achieved. And that depth, as mentioned before, creates a lot of parameters that need to be trained, making training the model either impractically slow or simply impossible, at least if you train it on CPU. So they had to turn to GPUs, but at the time high-end GPUs only had about three gigabytes of memory, which was not enough for AlexNet. To make it work, they distributed AlexNet across two GPUs, essentially splitting the layers in two, with half of the network on one GPU and half on the other, plus a couple of connection points where information could pass between the two halves before they came together in the final classification layer.
Another important factor is that they swapped the more typical saturating activation functions of the time, tanh and sigmoid, for the rectified linear unit, or ReLU, activation function, which further improved the efficiency of the model. It also meant they didn't require the normalization you would usually need with tanh or sigmoid. With both of those activation functions, over many layers you can get what's called saturation in your activations, where the activations in your neurons get pushed towards the two limits of the function; with sigmoid, for example, all your activations end up pushed towards one or zero. They did still use another type of normalization, called local response normalization, which isn't really used anymore, but for AlexNet it was a critical component.
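A quick toy demonstration of the saturation problem (not from the video): gradients through a sigmoid vanish at the extremes, while ReLU's don't for positive inputs.

```python
# Toy demo: gradients through a saturating activation vs. ReLU.
import torch

x = torch.linspace(-10, 10, 5, requires_grad=True)
torch.sigmoid(x).sum().backward()
print(x.grad)  # ~0 at the extremes: the sigmoid has saturated

y = torch.linspace(-10, 10, 5, requires_grad=True)
torch.relu(y).sum().backward()
print(y.grad)  # 0 for negatives, 1 for positives: no saturation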
Another super important technique that AlexNet introduced, and that is still used today, is overlapping pooling. Pooling was already used in convolutional networks: it summarizes a window of information from one layer into a single activation value in the next layer. Overlapping pooling does the same thing, but there's an overlap between consecutive windows, so each window always sees a little bit of the previous one. They found that this reduces overfitting and improves performance. So that is how they got AlexNet to work, and a few of the details behind how it worked and why it worked so well at the time.
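In PyTorch terms, AlexNet's overlapping pooling is a 3x3 window moved with a stride of 2, versus traditional pooling where the window size and stride match:

```python
# Sketch: overlapping vs. non-overlapping max pooling.
import torch
import torch.nn as nn

overlapping = nn.MaxPool2d(kernel_size=3, stride=2)      # windows overlap by 1
non_overlapping = nn.MaxPool2d(kernel_size=2, stride=2)  # traditional: no overlap

x = torch.randn(1, 1, 8, 8)
print(overlapping(x).shape)      # torch.Size([1, 1, 3, 3])
print(non_overlapping(x).shape)  # torch.Size([1, 1, 4, 4])
```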
Now, it's great to talk about all this, but as I said at the start, I think it's better to back everything up with a little bit of code. So we'll go through a notebook; you can find a link to the Colab version in the description below, or if you're reading this on the Pinecone article page, it will be in the resources section at the bottom.
Okay, so we're going to start by downloading and pre-processing our ImageNet dataset. We're not using the actual ImageNet itself; we're using a much smaller hosted version of ImageNet that we can find on Hugging Face. To use it, we'll need to pip install a few things: datasets, which is how we're going to load the ImageNet data, and, since we'll use them later on, torch and torchvision as well. The dataset we're going to use is the Maysee Tiny ImageNet dataset, and we'll take the validation split, which contains 10,000 labeled images (the training split contains 100,000). Looking at a single record, every image is stored as a PIL image object with a label. This one has label 0; we don't necessarily know what that means right now, but later on we'll see how to figure it out. We can check the type of the object, and when we're in a notebook like this, we can just call the image to display it. You can see it's a goldfish, so we can probably guess that label 0 actually means goldfish.
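A sketch of those first steps; the dataset id is what I believe it is on Hugging Face, so treat it as an assumption:

```python
# Load the Tiny ImageNet validation split from Hugging Face.
# Assumes: pip install datasets torch torchvision
from datasets import load_dataset

imagenet = load_dataset('Maysee/tiny-imagenet', split='valid')
print(imagenet.num_rows)  # 10000
print(imagenet[0])        # {'image': <PIL.Image ...>, 'label': 0}
```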
There are a few pre-processing steps we need to go through. We need to convert all images into RGB format, so each has three color channels; resize them to fit the dimensions AlexNet expects; convert them into tensors for PyTorch; normalize the values; and finally, when we have multiple images, stack them all into a single tensor to create our batch. We start with RGB. AlexNet, as mentioned, assumes all images have three color channels: red, green, and blue. But PIL objects support many other formats, and we'll see that image 201 in this dataset is grayscale, indicated by the mode 'L', so we need to be aware of those other formats. It's, I think, an alligator in grayscale. We convert it into RGB, and it will still display as grayscale; that's fine. Before conversion it had just one channel, essentially a brightness channel, whereas now it has three channels of equal values, so it still looks grayscale even though it's in RGB format.
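Continuing from the snippet above, the conversion itself is one call on the PIL object:

```python
# Image 201 in the dataset is grayscale ('L' mode in the notebook).
img = imagenet[201]['image']
print(img.mode)           # 'L' - a single brightness channel
img = img.convert('RGB')
print(img.mode)           # 'RGB' - three (equal-valued) channels
```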
That's how we handle the RGB part, but we also need to resize the images to fit the dimensionality AlexNet expects. For AlexNet, and for a lot of other computer vision models, the height and width of the input images must be at least 224 pixels. These are very small images, and they're not necessarily the 224-by-224 square we need, so we first resize them to be bigger and then use a center crop to cut away the edges and produce a square image of the right size. We do that using the transforms module from torchvision, which is a very good way of pre-processing image data; it has a lot of functions, and we'll use a couple more of them very soon. If we look at our first image, the goldfish, it's now a bit bigger, and the crop has removed some of it, but we still get the idea of what's in the image. Comparing it to the original, we could previously see its eye at the front and more of its head, whereas now that's almost cropped off.
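Something like the following; 256 is the intermediate size used on the PyTorch AlexNet page, and the notebook's exact value may differ:

```python
# Resize, then center-crop to the 224x224 square AlexNet expects.
from torchvision import transforms

resize = transforms.Compose([
    transforms.Resize(256),      # upscale the short side first
    transforms.CenterCrop(224),  # then crop the centre to a square
])
print(resize(imagenet[0]['image'].convert('RGB')).size)  # (224, 224)
```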
Another thing we need to do is normalize the values in these images. RGB arrays tend to be in the range 0 to 255, and we need them in the range 0 to 1, normalized using specific values: a per-channel mean of roughly 0.4 and standard deviation of roughly 0.2. These are specific to the AlexNet implementation from PyTorch; if you go to the AlexNet PyTorch page, it says the images have to be loaded into a range of 0 to 1 and normalized using exactly those values, which is why we're using them. So we create a process function, push our image through it, and check the size: the final result is the normalized tensor we want, with the correct dimensionality. That's perfect.
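Here's the full single-image pipeline as a sketch, with the normalization constants from the PyTorch AlexNet page:

```python
# Full single-image pre-processing pipeline for AlexNet.
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),  # PIL image -> float tensor in [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
tensor = preprocess(imagenet[0]['image'].convert('RGB'))
print(tensor.shape)  # torch.Size([3, 224, 224])
```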
Now we want to put all this together, and we don't want to do it one image at a time; we'll run it over a mini-batch of images. We'll go with the first 50 images, because they're all goldfish, which lets us easily check AlexNet's performance on a single class. We redefine the pre-processing pipeline using everything we've just done: resize, center crop, convert to tensor (we have to do this because PyTorch expects a tensor object, and the image needs to be in tensor format before we normalize it, otherwise we get an error), and then normalize. We go through each of the first 50 images, first converting any that are grayscale to RGB, pre-process them, and append them to a list. Then we stack that list into a single tensor, giving us our final mini-batch of 50 images with the dimensionality you can see here.
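Continuing from the `preprocess` pipeline above, the batching step looks roughly like this:

```python
# Build the mini-batch: pre-process the first 50 images and stack them.
import torch

inputs = []
for i in range(50):
    img = imagenet[i]['image'].convert('RGB')  # handles grayscale images too
    inputs.append(preprocess(img))

batch = torch.stack(inputs)  # list of tensors -> one tensor
print(batch.shape)           # torch.Size([50, 3, 224, 224])
```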
With all that done, we're now ready to move on to the inference step: predicting the class labels for our images with AlexNet. The first thing we want to do is download the model, which is hosted by PyTorch. We import torch and use torch.hub.load, pointing at the PyTorch vision repository and pinning the version we're using, to get AlexNet. We're not going to train AlexNet, which would take a while; we're going to use the pre-trained model weights, so this version of AlexNet has already been trained. Then we set it to evaluation mode for our inference, because by default the model is in train mode, and we don't want to train it.
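That step, following the usage shown on the PyTorch AlexNet page (the version tag is the one used there):

```python
# Download the pretrained AlexNet from PyTorch Hub and switch to eval mode.
import torch

model = torch.hub.load('pytorch/vision:v0.10.0', 'alexnet', pretrained=True)
model.eval()  # inference mode: disables dropout, etc.
```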
We can then see the model structure. First is the features section, where the image features are created: several convolutional layers, each followed by a ReLU activation function, with max pooling layers in between. With each of these, the model creates a more abstract tensor representing the information from the image. You can imagine the abstraction building up: early on, the extracted features are things like straight edges and curved edges; a little further in, it's something like "this is an animal" or "this is a fish"; and by the time we get to the end, it's, hopefully, "this is a goldfish". Then we move on to the classifier part, the final three linear layers. There's also dropout, which was added to reduce the chance of overfitting and improve the model's ability to generalize. The fully connected linear layers produce the final 1,000 activations, and the highest of these activations represents the class the model predicts for the image it saw.
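You can inspect those two halves directly; the comments below summarize torchvision's AlexNet layout:

```python
# Inspect the two halves of the network described above.
print(model.features)    # 5x Conv2d, each followed by ReLU, with MaxPool2d
                         # after the 1st, 2nd, and 5th convolutions
print(model.classifier)  # Dropout -> Linear(9216, 4096) -> ReLU -> Dropout
                         # -> Linear(4096, 4096) -> ReLU -> Linear(4096, 1000)
```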
So that's the model. We initialize it, and if we can, it's better to move the model over to a CUDA GPU, if available, or, more recently, Apple silicon: if you're on a Mac with Apple silicon, you want to use MPS. That's the case for me, so I'll run all of this on MPS. We move the inputs to the device and the model to the device; note that moving the model happens in place, so unlike the inputs we don't need to reassign it. Then we run the model inside torch.no_grad, since we don't need to calculate gradients; those are for training, and we're just performing inference. We get our outputs, detach them, and check the shape: 50 vectors of 1,000 values, i.e. 50 sets of activations across all of our 1,000 classes.
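Continuing from the `batch` and `model` snippets above, a sketch of that inference step:

```python
# Move everything to the best available device and run inference.
device = (
    'cuda' if torch.cuda.is_available()
    else 'mps' if torch.backends.mps.is_available()
    else 'cpu'
)
inputs = batch.to(device)
model.to(device)  # in place, no reassignment needed

with torch.no_grad():  # no gradients: we're not training
    output = model(inputs).detach().cpu()
print(output.shape)  # torch.Size([50, 1000])
```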
Now, these values are not normalized. If you want to calculate probabilities from them, you use the softmax function, which maps everything to a probability distribution; from there you could get, say, the probabilities of the top five classes. But we don't necessarily need that for what we're doing here, so we could skip the probability step and just use the raw output, and we'd get the same result, because all we're taking is the index position of the maximum value out of those 1,000 classes. Here we get 1.
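In code, using the `output` tensor from above:

```python
# Optional softmax, then take the highest-scoring class per image.
probs = torch.softmax(output, dim=-1)  # probability distribution per image
preds = output.argmax(dim=-1)          # same indices with or without softmax
print(preds[:5])                       # mostly 1s if these are all goldfish
```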
Now, if you remember, the label we had for the goldfish in this dataset was 0. The reason these differ is that the dataset uses a different set of labels from the model; it's not actually the same. But if we look at the label file, a text file where each class is separated by a newline character, number 0 is a tench and number 1 is a goldfish. And there are a lot of 1s, i.e. goldfish predictions, in our output, which is a good sign. We can import those classes and create prediction labels by splitting the file's contents on newline characters; printing prediction label 1 gives us "goldfish", the text label for that prediction. Our first 50 images are all goldfish, and printing out the last three predictions, they're all goldfish too. If the model is performing well, we'd expect all of these predictions to be goldfish, and for the most part that is the case.
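A sketch of that mapping; the URL is the commonly used class list from the pytorch/hub repo, and the notebook's exact source may differ:

```python
# Map class indices to human-readable ImageNet labels.
import requests

url = 'https://raw.githubusercontent.com/pytorch/hub/master/imagenet_classes.txt'
prediction_labels = requests.get(url).text.split('\n')

print(prediction_labels[0])  # tench
print(prediction_labels[1])  # goldfish
print([prediction_labels[i] for i in preds[-3:].tolist()])  # last three predictions
```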
Now, if we calculate the accuracy here, we get 72%, which represents a top-1 error rate of 28%. That beats the top-1 error rate reported for the AlexNet model on the ImageNet 2012 challenge, which was 37.5%.
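The check itself is one line, using the `preds` tensor from above (goldfish is class 1 in AlexNet's label set):

```python
# Accuracy over the 50 goldfish images.
accuracy = (preds == 1).float().mean().item()
print(accuracy)  # ~0.72 in the notebook, i.e. a 28% top-1 error rate
```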
However, I will point out that this is just a single class, a single label: goldfish, and the model performs better on goldfish than on some other things. If we tested across the whole dataset, we'd first have to map all the labels between the AlexNet model and this dataset, because the label sets don't line up, so it takes a bit of extra work; and if we did that, the performance would not be as good. Nonetheless, I think that's a pretty good result.
So that's it for this video: our overview of one of the most significant events in computer vision and deep learning. The ImageNet challenge was hosted annually until 2017, and by then 29 of 38 contestants had an error rate of less than five percent; over those years, progress in computer vision just went crazy. AlexNet ended up being superseded by even more powerful convolutional neural networks; in 2015, Microsoft Research Asia's entry surpassed even human-level performance on the challenge, and since then many state-of-the-art convolutional networks have come and gone. More recently, there's the possibility of other architectures, such as transformer models, disrupting the dominance of convolutional neural networks in computer vision. I'll leave you with the final paragraph of the AlexNet paper, because it almost seems like the authors saw the future of deep learning. They noted that they did not use any unsupervised pre-training, even though they expected it would help, and that their results had improved as they made their network larger, but that they still had many orders of magnitude to go in order to match the inferotemporal pathway of the human visual system, that is, to match human-level performance. Now we know that unsupervised pre-training and ever larger, ever deeper models were really the key to the improvements we've seen in deep learning over the past decade. I hope this has been useful. Thank you very much for watching, and I will see you again in the next one. Bye.