Today we're going to talk about one of the most important events in the history of deep learning: what happened at ImageNet 2012, and how it launched the deep learning rocket ship that we've been strapped to for the past decade. In short, we're going to talk about ImageNet, where it came from, and why it was so important. Then we're going to have a brief look at convolutional neural networks and AlexNet, the model that triggered the massive growth of deep learning. I like to back everything up with code, so towards the end of the video we'll go through the PyTorch implementation of AlexNet and actually test it on a small ImageNet-like dataset. That will be quite useful because we can see the image pre-processing steps and also how to perform inference with a convolutional neural network like AlexNet.
So let's jump straight into it. Today's deep learning revolution traces its roots back to the 30th of September 2012. On this day a deep, layered convolutional neural network won the ImageNet 2012 challenge, and it didn't just win, it completely destroyed the rest of the competition. This model, as you might have guessed, is called AlexNet, and the simple fact that it even used a convolutional neural network was very new.
Convolutional neural networks had been around for a while, but using them had largely been deemed impractical. Yet when AlexNet's results came in, it showed unparalleled performance on what was seen as one of the hardest computer vision challenges of the time. This event made AlexNet the first widely acknowledged successful implementation of deep learning, and the sheer performance improvement it showed caught people's attention.
Until this point deep learning was unproven; it was simply a nice idea that most people had dismissed as impractical: we don't have enough data, we don't have enough compute to do anything like this. But AlexNet showed that this was not the case, and that deep learning was now practical.
Yet this surge of interest in deep learning was not solely thanks to AlexNet. ImageNet also played a big part. The foundation of applied deep learning was set by ImageNet and built upon by AlexNet. So let's begin with ImageNet. Back in 2006 the world of computer vision was a lot different to how we know it now.
It was pretty underfunded and didn't get that much attention, yet there were a lot of researchers around the world focused on building better models, and year after year they saw progress, but it was slow. In that same year Fei-Fei Li had just finished her computer vision PhD at Caltech and started working as a professor in computer science. She had noticed the field's focus on models, and the corresponding lack of focus on data, and an idea came to Li that maybe a dataset that was more representative of the world could improve the performance of the models being trained on it.
Around the same time there was another professor called Christiane Fellbaum, a co-developer of a dataset from the 1980s called WordNet. WordNet consists of a pretty large number of English language terms organized into an ontological structure. So for example, the term Siberian husky sits within a tree structure: above Siberian husky you have working dog, above working dog you have dog, above dog you have canine, then carnivore, and so on.
So there's that tree structure of different terms and how they relate to each other. In 2007 Li and Fellbaum met, and Fellbaum discussed her idea at the time of adding just a reference image to each of the terms within WordNet. The intention was not to create an image dataset; it was simply to add a reference image so people could more easily understand what a particular term was about. And this inspired an idea in Li that would kick-start the world of computer vision and deep learning.
Soon after, Li put together a team to build what would become the largest labeled dataset of images in the world: ImageNet. The idea behind ImageNet was that a large, ontologically structured dataset like WordNet, but for images, could be the key to building more advanced content-based image retrieval, object recognition, scene recognition, and better visual understanding in computer vision models.
And just two years later the first version of ImageNet was released with 12 million labeled images. These were all structured and labeled in line with the WordNet ontology. Yet if we consider the sheer size of that, the 12 million images, if one person had spent literally every single day labeling one image per minute and did literally nothing else in that time, they didn't eat, they didn't sleep, just labeled images, it would have taken them 22 years and 10 months.
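To check that arithmetic: 12,000,000 images at one image per minute is 12,000,000 minutes, which is 200,000 hours, or roughly 8,333 days, which works out to about 22.8 years, so roughly 22 years and 10 months.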
Which obviously is a very long time. There's just an insane number of images to be labeled here. So how did they do it? Because the team was not huge, they didn't have an infinite amount of money to pay other researchers and students to do this. So what they eventually settled on was a platform called Amazon's Mechanical Turk.
Mechanical Turk is a crowdsourcing platform where people from around the globe perform tasks, such as labeling images, for a set amount of money. The sheer scale of people around the world doing this work at competitive prices is what made ImageNet possible, along with a few adjustments to the labeling process.
Because in reality if you just have random people around the world labeling your dataset, some of those people are going to try and take advantage of that, maybe game the system. So you have to have some checks and balances in place against that. So there's a little bit of system design in there.
But that is how they built the dataset. Now on its release, ImageNet was the largest publicly available labeled dataset of images in the world. Yet there was very little interest in the dataset. Which seems pretty crazy when we look back on that in hindsight. Because now we know, okay we want more data for our models to make them better.
But at the time things were different. ImageNet came with these 12 million images distributed across 22,000 categories. And at the time there were a few other image datasets that used a similar sort of structure and idea. For example, the ESP dataset was built using something called the ESP game.
People would play the ESP game and label images for the dataset. Reportedly they had far more images, but the full dataset wasn't publicly released; only 60,000 of those images were. And a couple of years later there were a few papers that looked at the ESP game and the ESP dataset and concluded it probably wasn't actually that useful.
Because you can kind of guess the right answer most of the time without even looking at the image. So there were some questions around the usability of that dataset. So all this to say ImageNet was by far the biggest, at least publicly available, and most sort of accurate dataset for computer vision at the time.
So the reason that there was very little interest in ImageNet despite its huge size is that people just assumed that it could not work for their models. You have to think back then they were training models on much smaller datasets which had like 12 categories of images for example.
And the models would struggle with that. So when ImageNet comes along and it's like hey I have 22,000 categories here people are just like well I can't deal with 12 so I'm not going to even try 22,000. That's crazy. So there was a lack of interest in ImageNet at the time.
It just wasn't really received that warmly. So the ImageNet team decided to try and push it a bit more, and by the next year, 2010, they had managed to organize a challenge around the dataset. It grew to include different tasks over the years, but initially it was just a classification challenge.
So the ImageNet large-scale visual recognition challenge was first hosted in 2010. And competitors had to correctly classify images from 1,000 categories. So not a full set of terms in the ontology of ImageNet but they had 1,000 categories instead. And whoever produced a model with the lowest error rate won.
And there were a few entrants. There was not a huge number of entrants. I think it's something like 4, 5, 6 entrants in 2010, 2011, 2012. Now eventually this challenge would become the primary benchmark in computer vision progress. But it took some time. And that really started in 2012.
So 2012 was not like the previous years for ImageNet. On the 30th of September 2012 the latest challenge results were released. And one of those results was a lot better than any of the other results. And it came from a model that most people thought was just not practical.
And that was AlexNet. AlexNet was the first model to score a sub 25% error rate. And that same year the nearest competitor was 9.8 percentage points behind AlexNet. And they had done this with a deep layered convolutional neural network which at the time people were not really taking seriously.
Now to understand AlexNet it's probably best we very quickly cover what a convolutional neural network actually is. A convolutional neural network, or CNN, is a neural network that uses a special type of layer called a convolutional layer. And today these models are known for computer vision.
They have for a long time been the undisputed champions of computer vision. That has changed a little bit in the past couple of years, but right now they're still pretty dominant. And unlike a lot of the models from 2012 and earlier, they did not need much manual feature extraction or image pre-processing before data was fed into the model.
They could just kind of deal with that themselves. CNNs use several of these convolutional layers stacked on top of each other, and what you find is that the deeper the network is, the better it can identify more complex concepts or objects in images. So for example, with the first few layers it's probably just going to identify things like edges, circles, simple shapes, and maybe some textures.
As the network gets deeper and you add more layers to it, it starts to abstract those features and identify more abstract ideas. So a deeper network will be able to identify that something is a living thing; build a deeper network and it can identify mammals, then dogs, then Siberian huskies.
So as the model gets deeper, its ability to identify more nuanced things gets better. At the time these models were overlooked because, to get good performance from one of them, it needs to be really deep. That means it has a lot of parameters, and the more parameters you have, the longer your model takes to train, if you can train it at all and it even fits in the memory on your computer.
And also, the more parameters it has, the more data it has to see before it can produce any good performance. As a result they were simply overlooked. Yet the authors of AlexNet won the ImageNet challenge in 2012, and it turns out that they were the right people in the right place at the right time.
Several pieces came together from different places to create this. ImageNet provided the massive amount of data needed to train one of these deep, layered convolutional neural networks. A few years earlier, in 2007, NVIDIA had released CUDA, which you may recognize the name of: an API that gave software access to the lower-level, highly parallel processing abilities of CUDA-enabled GPUs from NVIDIA.
And GPU power in itself was reaching a point where training these big models was becoming possible, although it wasn't quite there yet at the time for a single GPU. So AlexNet was by no means small, and because of that the authors had to solve a lot of problems to get all this working.
So AlexNet consisted of five convolutional layers followed by three fully connected linear layers. The final layer to produce the 1000 classifications required by ImageNet was a 1000 node layer that used a softmax activation function to create this probability distribution over all of those classes. Now a key conclusion from AlexNet was that the depth of the network was key to getting the performance that they got.
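To make that structure concrete, here's a minimal PyTorch sketch of the layer layout: five convolutional layers feeding three fully connected layers with a 1000-way output. Channel counts and kernel sizes follow the original paper and the torchvision implementation, but details such as the original two-GPU split and local response normalization are left out, so treat it as an illustration rather than the exact model.

```python
import torch
import torch.nn as nn

class AlexNetSketch(nn.Module):
    """Rough sketch of the AlexNet layer layout (not the exact torchvision model)."""

    def __init__(self, num_classes: int = 1000):
        super().__init__()
        # Five convolutional layers, each followed by ReLU, with max pooling in places.
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),   # overlapping pooling
            nn.Conv2d(64, 192, kernel_size=5, padding=2),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(192, 384, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        # Three fully connected layers, ending in the 1000-way classification layer.
        self.classifier = nn.Sequential(
            nn.Dropout(),
            nn.Linear(256 * 6 * 6, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)        # (N, 256, 6, 6) for 224x224 inputs
        x = torch.flatten(x, 1)
        return self.classifier(x)   # raw logits; softmax turns these into probabilities
```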
And that depth, as I mentioned before, creates a lot of parameters that need to be trained, making training the model either impractically slow or simply impossible, at least if you were going to train it on CPU. So they had to turn to GPUs, but at the time the high-end GPUs only had about three gigabytes of memory, which was not enough for AlexNet.
So to make it work they had to distribute AlexNet across two GPUs, and they did this by essentially splitting the layers in two, putting half of the network on one GPU and half on the other, with a couple of connections between the two halves.
So they had a couple of points where information could be passed between those two halves, and at the end they came together in the final classification layer. Another important factor is that they swapped the more typical sigmoid and tanh activation functions of the time for the rectified linear unit, or ReLU, activation function, which further improved the efficiency of the model and also meant they didn't require the normalization you would usually need with tanh or sigmoid.
That's because with both of those activation functions, over many layers you can get what's called saturation in your activations, meaning the activations in your neurons get pushed towards the two limits of the activation function. So for example with the sigmoid, you'd end up with all your activations pushed towards one or zero.
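To make that saturation idea concrete, here's a small PyTorch illustration (not from the original notebook) comparing the gradients of tanh and ReLU at large input values: tanh's gradient collapses towards zero, while ReLU's gradient stays at one for positive inputs.

```python
import torch

x = torch.tensor([-10.0, -1.0, 0.0, 1.0, 10.0], requires_grad=True)

# tanh saturates: for large |x| the output is pinned near -1/+1
# and the gradient is effectively zero.
y = torch.tanh(x)
y.sum().backward()
print(y.detach())   # tensor([-1.0000, -0.7616,  0.0000,  0.7616,  1.0000])
print(x.grad)       # gradients are essentially 0 at x = -10 and x = 10

x.grad = None

# ReLU does not saturate for positive inputs: the gradient there is a constant 1.
y = torch.relu(x)
y.sum().backward()
print(y.detach())   # tensor([ 0.,  0.,  0.,  1., 10.])
print(x.grad)       # tensor([0., 0., 0., 1., 1.])
```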
Nonetheless they did use another type of normalization called local response normalization; that's not really used anymore, but for AlexNet it was still a critical component. Another super important thing that AlexNet introduced, and that is still used today, was overlapping in the pooling layers. Pooling was already standard in convolutional networks: it summarizes a window of information from one layer into a single activation value in the next layer.
Overlapping pooling does the same thing, but there's an overlap between the windows that slide across the preceding layer, so each window always sees a little bit of the previous window. They found that this reduces overfitting of the model and improves performance.
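Here's a tiny illustration of the difference (an assumption-free sketch, not code from the original notebook): non-overlapping pooling uses a window the same size as the stride, while AlexNet's overlapping pooling slides a 3x3 window with a stride of 2, so neighbouring windows share inputs.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 8, 8)  # a dummy single-channel 8x8 feature map

# Non-overlapping pooling: window size equals the stride,
# so each input value is seen by exactly one window.
non_overlapping = nn.MaxPool2d(kernel_size=2, stride=2)

# Overlapping pooling (as in AlexNet): a 3x3 window slid with stride 2,
# so neighbouring windows share a row/column of inputs.
overlapping = nn.MaxPool2d(kernel_size=3, stride=2)

print(non_overlapping(x).shape)  # torch.Size([1, 1, 4, 4])
print(overlapping(x).shape)      # torch.Size([1, 1, 3, 3])
```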
So that is how they got AlexNet to work, and a few of the details behind how it worked and why it worked so well at the time. Now I think it's great to talk about all this, but as I said at the start, I think it's better to back everything up with a little bit of code.
So we'll go through a notebook; you can find a link to the Colab version of this notebook in the description below, or if you're reading this on the Pinecone article page it will be in the resources section at the bottom. Okay, so we're going to start by downloading and pre-processing our ImageNet dataset.
We're not using the actual ImageNet itself; we're using another hosted version of ImageNet, which is much smaller, that we can find on Hugging Face. To use this we will need to pip install a few things: datasets, which is how we're going to load the ImageNet dataset, and later on we're also going to be using torch and torchvision, so install those as well.
So this is the dataset we're going to use: the Maysee Tiny ImageNet dataset. We're loading the validation split, which contains 10,000 labeled images, as we can see here. Then we can look at a single record: every image is stored as a PIL image object and has a label. This one has label zero; we don't necessarily know what that means right now, but later on we'll see how we can figure it out.
For comparison, the training split of this dataset contains 100,000 of those labeled images. Now we can check the type of the image object, and when we're in a notebook like this we can simply call the image to display it. You can see it's a goldfish, so we can probably guess that label zero actually means goldfish.
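Here's a minimal sketch of how that loading step might look with the Hugging Face datasets library; the dataset id Maysee/tiny-imagenet and the split name valid are assumptions based on the hosted Tiny ImageNet copy described above.

```python
from datasets import load_dataset

# Load the validation split of the hosted Tiny ImageNet copy from Hugging Face.
imagenet = load_dataset("Maysee/tiny-imagenet", split="valid")

print(imagenet)        # ~10,000 rows, each with an "image" and an integer "label"
print(imagenet[0])     # a single record: a PIL image object plus label 0
imagenet[0]["image"]   # in a notebook, this displays the image (a goldfish)
```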
There are a few pre-processing steps we need to go through. We need to convert all images into an RGB format so they have three color channels. We need to resize the images to fit the expected input dimensions of AlexNet. We need to convert them into tensors for PyTorch. We need to normalize the values. And finally, when we have multiple images, we're going to stack them all into a single tensor to create our batch.
We start with RGB. AlexNet, as I mentioned, assumes all images have three color channels: red, green and blue. But there are many other formats supported by PIL objects. We'll see here, at index 201, a grayscale image, indicated by the mode L; there are other formats too, so you need to be aware of those. We can see it's an alligator in grayscale. So we convert it into red, green and blue, and it will still look grayscale; that's fine. The original really only has one channel, essentially a brightness channel, whereas the converted image now has three color channels, red, green and blue, but they're all of equal value across those three channels, so it still shows as grayscale even though it is now in an RGB format.
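As a small illustration of that conversion, here's roughly how it looks with PIL, reusing the imagenet object from the snippet above; the index 201 follows the grayscale example mentioned in the walkthrough and may differ in your copy of the data.

```python
# Some images in the dataset are grayscale (PIL mode "L"), but AlexNet expects
# three colour channels, so anything that isn't already RGB gets converted.
img = imagenet[201]["image"]   # the grayscale alligator example from the walkthrough
print(img.mode)                # "L" for grayscale

rgb_img = img.convert("RGB")   # duplicates the single channel across R, G and B
print(rgb_img.mode)            # "RGB" (still looks grayscale, but now has 3 channels)
```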
That handles the RGB part, but we also need to resize the image to fit the expected dimensionality for AlexNet. For AlexNet, and for a lot of other computer vision models, the height and width of the input images is expected to be at least 224 pixels. These are very small images and they're not necessarily the 224 by 224 square we need, so we first resize them to be bigger and then use a center crop to cut off any edges and make sure we end up with a square image of that dimensionality. We do this using the transforms module from torchvision, which is a very good way of pre-processing your image data; it has a lot of functions, and we'll use a couple more of them very soon.
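Here's a sketch of the resize-and-crop step with torchvision transforms; resizing to 256 before the 224-pixel center crop follows the standard recipe from the PyTorch AlexNet page, though the exact resize value used in the original notebook may differ.

```python
from torchvision import transforms

# Resize the short side to 256 pixels, then crop the central 224x224 square,
# which is the minimum input size AlexNet expects.
resize_crop = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
])

resized = resize_crop(imagenet[0]["image"].convert("RGB"))
print(resized.size)  # (224, 224)
```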
If we have a look at our first image, the goldfish image, you can see it's now a bit bigger, and also that some of it has been cropped, but we still get the idea of what is in the image. If we compare it to the original, we can see its eye and more of its head at the front there, whereas in the cropped version that's almost chopped off.
Now another thing we need to do is normalize all the values in these images. RGB arrays tend to be in the range of 0 up to 255, and we need them to be in the range of 0 to 1, and then normalized using the values you see here: a mean of roughly 0.4 and a standard deviation of roughly 0.2 per channel. These values are specific to the AlexNet implementation from PyTorch; if you go to the AlexNet PyTorch page you'll see that the images have to be loaded into a range of 0 to 1 and normalized using exactly those values, which is why we're using them. So we create this process function, process our image through it, and check the size. The final result is the normalized tensor that we want, and it has the correct dimensionality as well.

Now we want to put all of this together, and we don't want to do it for every single image one at a time, so we'll put it all together for a mini batch of images. We're going to go with the first 50 images, because they're all goldfish, which makes it easy to check AlexNet's performance on that single class. I'm going to redefine the pre-processing pipeline using everything we've just done: resize, center crop, convert to tensor, and normalize. We have to convert to a tensor because PyTorch expects a tensor object, and the image needs to be in that format before we normalize it, otherwise we'll get an error. Then we go through each of the first 50 images, first converting any that are grayscale to RGB, then pre-processing them and appending them to a list. We stack everything in that list into a single tensor, and we get our final mini batch of 50 images with the dimensionality you can see here.

With all that done, we're ready to move on to the inference step: the prediction of the class labels for our images with AlexNet. The first thing we want to do is download the model, which is hosted by PyTorch. We import torch and load AlexNet through torch.hub, specifying the torchvision version we're using. We're not going to train AlexNet, that would take a while, so we use the pre-trained model weights; this version of AlexNet has already been trained. Then we set it to evaluation mode for our inference, because by default the model is in train mode, and for predictions we want it in evaluation mode.

We can also see the model structure here. First is the features section, where the model creates the image features: several convolutional layers, each followed by the ReLU activation function and, in places, a max pooling layer. With each of these, the model creates a more abstract tensor representing the information in the image. You can imagine the abstraction building up: the first features extracted are things like straight edges and curved edges, a little further in it's more like an animal or a fish, and by the time we reach the final convolutional layers it's hopefully a goldfish.
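Putting the pieces just described together, here's a sketch of the full pre-processing pipeline, the 50-image mini batch, and the model download via torch.hub; the normalization values are the ones listed on the PyTorch AlexNet page, and the torchvision tag v0.10.0 is an assumption rather than the exact version used in the original notebook.

```python
import torch
from torchvision import transforms

# Full pre-processing pipeline: resize, centre-crop, convert to a tensor, then
# normalise using the mean/std values listed on the PyTorch AlexNet page.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),                 # PIL image -> float tensor in [0, 1]
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225],
    ),
])

# Build a mini batch from the first 50 images (all goldfish in this split).
tensors = []
for i in range(50):
    img = imagenet[i]["image"]
    if img.mode != "RGB":
        img = img.convert("RGB")           # handle any grayscale images
    tensors.append(preprocess(img))

batch = torch.stack(tensors)
print(batch.shape)                         # torch.Size([50, 3, 224, 224])

# Download the pre-trained AlexNet weights via torch.hub and switch to eval mode.
alexnet = torch.hub.load("pytorch/vision:v0.10.0", "alexnet", pretrained=True)
alexnet.eval()
print(alexnet)                             # shows the features + classifier layers
```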
From there we move on to the classifier part. The classifier is these final three layers plus dropout. The dropout was added to reduce the chance of overfitting and improve the ability of the model to generalize, and then we have the three fully connected linear layers that produce the final 1000 activations. The highest of these activations represents the class that the model is predicting for the image it saw.

So that's the model. Having initialized it, it's better, if we can, to move it over to either a CUDA GPU if one is available or, more recently, the MPS backend if you're on a Mac with Apple silicon. That's the case for me, so I'm going to run all of this on MPS. We move the inputs to the device and we move the model to the device; note that moving the model is done in place, so we don't need to reassign it like we did with the inputs.

Then we run the model inside torch.no_grad, because we don't need to calculate gradients; we only do that when training the model, and here we're just performing inference. We get our outputs, detach them, and check the shape: we have 50 vectors of 1000 items, that is 50 sets of activations across all of our 1000 classes.

Now these are not normalized, so if you want to calculate probabilities from them you use the softmax function. That would map everything to a probability distribution, and you'd be able to read off the probability of, say, the top five classes. But we don't actually need that for what we're doing here, so we can skip the probability part and just use the raw output; we get the same result, because all we're doing is taking the index position of the maximum value out of those 1000 classes. Here that index is 1.

Now if you remember, earlier on the label we had in the dataset for the goldfish was 0. The reason these are different is that the dataset uses a different set of labels from the model. If we look over here, we have a text file where each class is separated by a newline character: number 0 is a tench and number 1 is a goldfish. We can see there are a lot of goldfish predictions here, which is a good sign. So we import those classes and create our prediction labels by splitting the response by newline characters, and if you print out prediction labels at index 1 you get goldfish, the text label for that prediction.

The first 50 images are all goldfish, so we would expect every one of these predictions to be goldfish if the model is performing well; I'm just printing out the last three here, and you can see they're all goldfish. For the most part that is the case: if we calculate the accuracy here we get 72%, which represents a top-1 error rate of 28%, and that actually beats the reported top-1 error rate of AlexNet on the 2012 ImageNet challenge, which was 37.5%.
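Here's a sketch of those inference and label-mapping steps; the URL for the ImageNet class list is the one used in the PyTorch hub examples and should be treated as an assumption, as should the exact accuracy number you'll see.

```python
import requests
import torch

# Move the model and inputs to a GPU if one is available (CUDA, or MPS on Apple silicon).
device = (
    "cuda" if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available()
    else "cpu"
)
alexnet.to(device)           # .to() on a module moves it in place (and returns it)
inputs = batch.to(device)    # tensors, by contrast, must be reassigned

# Inference only, so no gradients are needed.
with torch.no_grad():
    logits = alexnet(inputs)             # shape (50, 1000): one activation per class

preds = logits.argmax(dim=-1).cpu()      # index of the highest-scoring class per image

# Map predicted indices to human-readable ImageNet class names.
# This text file has one class name per line: "tench" at index 0, "goldfish" at index 1.
labels_url = "https://raw.githubusercontent.com/pytorch/hub/master/imagenet_classes.txt"
prediction_labels = requests.get(labels_url).text.strip().split("\n")

print(prediction_labels[1])                            # "goldfish"
print([prediction_labels[int(p)] for p in preds[:3]])  # last few predictions

# All 50 images are goldfish (ImageNet class index 1), so accuracy is just the
# fraction of predictions equal to 1.
accuracy = (preds == 1).float().mean().item()
print(f"top-1 accuracy: {accuracy:.0%}")
```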
I will point out, however, that this is just a single class, a single label: goldfish, and the model performs better on goldfish than on many other things. If we tested this across the whole dataset, we would first have to map all of the different labels between the AlexNet model and the dataset we have here, because the two label sets don't line up, so it takes a bit of extra work; and if we did that, the performance would not be as good. Nonetheless, I think this is a pretty good result.

So that's it for this video; that's our overview of one of the most significant events in computer vision and deep learning. The ImageNet challenge was hosted annually until 2017, and by then 29 of the 38 contestants had an error rate of less than five percent. Over those years, progress in computer vision just went crazy. AlexNet ended up being superseded by even more powerful convolutional neural networks; Microsoft Research Asia was the first other team to beat AlexNet, and they did that in 2015. Since then many other state-of-the-art convolutional networks have come and gone, and more recently there is the possibility of other architectures, such as transformer models, coming in and disrupting the dominance of convolutional neural networks in computer vision.

I'll leave you with the final paragraph of the AlexNet paper, because it almost seems like the authors saw the future of deep learning. They noted that they did not use any unsupervised pre-training even though they expected it to help, and that their results had improved as they made their network larger, but that they still had many orders of magnitude to go in order to match the inferotemporal pathway of the human visual system, in other words to match human-level performance. Now we know that unsupervised pre-training and ever larger, ever deeper models really were the key to all the improvements we've seen in deep learning in the past decade.

So I hope this has been useful. Thank you very much for watching, and I will see you again in the next one. Bye.