MIT Introduction to Deep Learning | 6.S191


Hi everyone, let’s get started. Good
afternoon and welcome to MIT 6.S191! It is really incredible to see the
turnout this year. This is the fourth year now we’re teaching this course and
every single year it just seems to be getting bigger and bigger. 6.S191 is a
one-week intensive boot camp on everything deep learning. In the past, at
this point I usually try to give you a synopsis about the course and tell you
all of the amazing things that you’re going to be learning. You’ll be gaining
fundamentals into deep learning and learning some practical knowledge about
how you can implement some of the algorithms of deep learning in your own
research and on some cool lab related software projects. But this year I
figured we could do something a little bit different and instead of me telling
you how great this class is I figured we could invite someone else from outside
the class to do that instead. So let’s check this out first. “Hi everybody, and
welcome to MIT 6.S191, the official introductory course on deep
learning taught here at MIT. Deep learning is revolutionising so many
fields, from robotics to medicine and everything in between. You’ll learn the
fundamentals of this field and how you can build some of these
incredible algorithms. In fact, this entire speech and video are not real and
were created using deep learning and artificial intelligence. And in this
class you’ll learn how. It has been an honor to speak with you today, and I
hope you enjoy the course!” Alright, so as you can tell, deep learning
is an incredibly powerful tool. This was just an example of how we use deep
learning to perform voice synthesis and actually emulate someone else’s voice, in
this case Barack Obama, and also using video dialogue replacement to
actually create that video with the help of Canny AI. And of course, as
you’re watching this video, you might raise some ethical concerns, which we’re
also very concerned about, and we’ll actually talk about some of those later
on in the class as well. But let’s start by taking a step back and actually
introducing some of these terms that we’ve talked about so far. Let’s start
with the word intelligence. I like to define intelligence as the
ability to process information to inform future decisions. Now the field of
artificial intelligence is simply the field which focuses on building
algorithms, in this case artificial algorithms, that can do this as well:
process information to inform future
decisions. Now machine learning is just a subset of artificial intelligence
specifically that focuses on actually teaching an algorithm how to do this
without being explicitly programmed to do the task at hand.
Now deep learning is just a subset of machine learning which takes this idea
even a step further and asks: how can we automatically extract the useful pieces
of information needed to inform those future predictions or make a decision?
And that’s what this class is all about: teaching algorithms how to learn a task
directly from raw data. We want to provide you with a solid foundation of
how to understand these algorithms under the
hood, but also provide you with the practical knowledge and practical skills
to implement state-of-the-art deep learning algorithms in TensorFlow, which
is a very popular deep learning toolbox. Now we have an amazing set of lectures
lined up for you this year, including today, which will cover neural networks
and deep sequential modeling. Tomorrow we’ll talk about computer vision and
also a little bit about generative modeling which is how we can generate
new data and finally I will talk about deep reinforcement learning and touch on
some of the limitations and new frontiers of where this field might be
going and how research might be heading in the next couple of years. We’ll spend
the final two days hearing about some of the guest lectures from top industry
researchers on some really cool and exciting projects. Every year these
happen to be really really exciting talks so we really encourage you to come
especially for those talks. The class will conclude with some final project
presentations, which we’ll talk about in a little bit, and also some
awards and a quick award ceremony to celebrate all of your hard work. Also, I
should mention that after each day of lectures (so today we have two
lectures) we’ll have a software lab, which tries to
focus and build upon all of the things that you’ve learned in that day. So
you’ll get the foundations during the lectures and you’ll get the practical
knowledge during the software lab, so the two are kind of jointly coupled in that
sense. For those of you taking this class for credit, you have a couple of different
options to fulfill your credit requirement. First, you can propose a project,
optionally in groups of two, three, or four people, and in these groups you’ll
work to develop a cool new deep learning idea. We realize that one week, which
is the span of this course, is an extremely short amount of time to not only
think of an idea but move that idea past the planning stage and try to
implement something, so we’re not going to be judging you on your results
towards this idea, but rather just on the novelty of the idea itself. On Friday,
each of these teams will give a three-minute presentation on that idea,
and awards will be announced for the top winners, judged by a panel of judges.
The second option, in my opinion, is a bit more boring, but we like to give
this option for people that don’t like to give presentations. In this option,
if you don’t want to work in a group or you don’t want to give a presentation,
you can write a one-page review of a recent deep learning paper, or any paper
of your choice, and this will be due on the last day of class as well. Also, I
should mention that
for the project presentations we give out all of these cool prizes,
especially these three NVIDIA GPUs, which are really crucial for doing any sort of
deep learning on your own. So we definitely encourage everyone to enter
this competition and have a chance to win these GPUs and these other cool
prizes, like a Google Home and SSD cards, as well. Also, each of the three
labs will have corresponding prizes. Instructions to actually enter those
respective competitions will be within the labs themselves, and you can
enter to win these different prizes depending on the lab. Please
post on Piazza if you have questions, and check out the course website for slides.
There was a bug in the website, but we fixed that, so
today’s slides are up now. Digital recordings of each of these lectures
will be up a few days after each class. This course has an incredible team of
TAs that you can reach out to if you have any questions, especially during the
software labs; they can help you answer any questions that you might have. And
finally, we really want to give a huge thanks to all of our sponsors; without
their help and support this class would not have been possible. OK, so now with
all of that administrative stuff out of the way, let’s start with the fun
stuff that we’re all here for. Let’s start actually by asking ourselves a
question: why do we care about deep learning? Why did all of you come to
this classroom today, and why specifically do we care about deep learning
now? Well, to answer that question we
actually have to go back and understand traditional machine learning at its core
first. Traditional machine learning algorithms typically try to define a
set of rules or features in the data, and these are usually hand-engineered, and
because they’re hand-engineered they often tend to be brittle in practice. So let’s
take a concrete example. If you want to perform facial detection, how might you
go about doing that? Well, first you might say: to classify a face, the first thing
I’m gonna do is try to recognize whether I see a mouth
in the image, the eyes, ears, and nose. If I see all of those things, then maybe I can
say that there’s a face in that image. But then the question is: OK, but how do
I recognize each of those sub-things? How do I recognize an eye? How do I
recognize a mouth? And then you have to decompose that into: OK, to recognize a
mouth, I maybe have to recognize these pairs of lines, oriented in a
certain direction. And then it keeps getting more
complicated, and at each of these steps you have to define a set of features
that you’re looking for in the image. Now the key idea of deep learning is that
you learn these features directly from raw data. So what you’re going
to do is just take a bunch of images of faces, and then the
deep learning algorithm is going to develop some hierarchical representation:
first detecting lines and edges in the image, using these lines and edges to
detect corners and mid-level features like eyes, noses, mouths, and ears,
then composing these together to detect higher-level features like maybe jaw
lines, the side of the face, etc., which can then be used to detect the final face
structure. Actually, the fundamental building blocks of deep learning have
existed for decades, and the underlying algorithms for training these
models have also existed for many years. So why are we studying this now? Well, for
one, data has become much more pervasive. We’re living in the age of big data,
and these algorithms are hungry for huge amounts of data to succeed.
Secondly, these algorithms are massively parallelizable, which means that they
can benefit tremendously from modern GPU architectures and hardware acceleration
that simply did not exist when these algorithms were developed. And finally,
due to open-source toolboxes like TensorFlow, which you’ll get
experience with in this class, building and deploying these models has
become extremely streamlined, so much so that we can condense all this material
down into one week. So let’s start with the fundamental building block of a
neural network, which is a single neuron, or what’s also called a perceptron. The
idea of a perceptron, or a single neuron, is very basic, and I’ll try to keep it
as simple as possible and then we’ll work our way up from there. Let’s
start by talking about the forward propagation of information through a
neuron. We define a set of inputs to that neuron as x1 through xm, and each of
these inputs has a corresponding weight, w1
through wm. Now what we can do is, with each of these inputs and each of these
weights, multiply them correspondingly together and take a sum
of all of them. Then we take this single number, that summation, and we pass it
through what’s called a nonlinear activation function, and that produces
our final output y. Now this is actually not entirely correct: we also have what’s
called a bias term in this neuron, which you can see here in green. The
purpose of the bias term is really to allow you to shift your
activation function to the left and to the right regardless of your inputs.
You can notice that the bias term is not affected by the x’s;
it’s just a bias associated with that input. Now on the right side you can see this
diagram illustrated mathematically as a single equation, and we can actually
rewrite this using linear algebra, in terms of vectors and dot
products. So instead of having a summation over all of the x’s, I’m going
to collapse my x into a vector, capital X, which is now just a list or a vector of
numbers, a vector of inputs I should say, and you also have a vector of weights,
capital W. To compute the output of a single perceptron, all you have to do is
take the dot product of X and W, which represents that element-wise
multiplication and summation, and then apply that non-linearity, which here is
denoted as g. So now you might be wondering: what is
this nonlinear activation function? I’ve mentioned it a couple of times, but I
haven’t really told you precisely what it is. One common example of this
activation function is what’s called a sigmoid function, and you can see an
example of a sigmoid function here on the bottom right. One thing to note is
that this function takes any real number as input on the x-axis and transforms
that real number into a scalar output between 0 and 1.
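A minimal sketch of that squashing behavior, written in plain Python rather than TensorFlow. The function name `sigmoid` and the sample inputs are illustrative choices, not taken from the slides.

```python
import math

def sigmoid(z):
    """Logistic sigmoid: mathematically maps any real z into (0, 1)."""
    # Two branches so that math.exp is only ever called on a non-positive
    # number, which avoids overflow for very negative inputs.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Large negative inputs land near 0, large positive inputs near 1,
# and an input of exactly 0 lands on 0.5.
for z in (-6.0, -1.0, 0.0, 1.0, 6.0):
    print(f"sigmoid({z:+.1f}) = {sigmoid(z):.4f}")
```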
It’s a bounded output between 0 and 1. So one very common use case of the sigmoid
function is when you’re dealing with probabilities, because probabilities
also have to be bounded between 0 and 1. So sigmoids are really useful when you want
to output a single number and represent that number as a probability
distribution. In fact, there are many common types of nonlinear activation
functions, not just the sigmoid but many others that you can use in neural
networks, and here are some common ones. Throughout this presentation you’ll
find these TensorFlow icons, like you can see all
across the bottom here, and these are just to illustrate how one could use
each of these topics in a practical setting. You’ll see these
scattered throughout the slides. There’s no need to take furious notes on
these code blocks; like I said, all of the slides are published online, so
especially during your labs, if you want to refer back to any of the slides,
you can always do that from the online lecture notes. Now, why do we care
about activation functions? The point of an activation function is to introduce
nonlinearities into the network, and this is actually really important in real life,
because in real life almost all of our data is nonlinear. Here’s a concrete
example: if I told you to separate the green points from the red points using a
linear function, could you do that? I don’t think so, right? You’d get
something like this. Well, you could do it, but you wouldn’t do a very good job at it. And
no matter how deep or how large your network is, if you’re using a linear
activation function you’re just composing lines on top of lines, and
you’re going to get another line. So this is the best you’ll be able to do
with a linear activation function. On the other hand, nonlinearities allow you
to approximate arbitrarily complex
functions by introducing these nonlinearities into your decision
boundary, and this is what makes neural networks extremely powerful. Let’s
understand this with a simple example, and let’s go back to this picture that
we had before. Imagine I give you a trained network with weights W on the top right,
so W here is 3 and minus 2, and the network only has two inputs, x1 and x2. If
we want to get the output, it’s simply the same story as we had before: we
multiply our inputs by those weights, we take the sum, and pass it through a
non-linearity. But let’s take a look at what’s inside of that non-linearity
before we apply it. What we get, when we take this dot product of x1 times 3 and x2
times minus 2 plus a bias of 1, is simply a 2D line. So we can plot that: if we set
that equal to 0, for example, that’s a 2D line and it looks like this. On the x
axis is x1, on the y axis is x2, and we’re just
illustrating where this line equals 0, so anywhere on this line is where x1 and
x2 correspond to a value of 0. Now if I feed in a new input, either a test
example, a training example, or whatever, and that input
has these coordinates, minus 1 and 2, so it has an x1 value of minus 1 and
an x2 value of 2, I can see visually where this lies with respect to that
line. And in fact this idea can be generalized a little bit more. If we
compute that line, we get minus 6, so before we apply the
non-linearity we get minus 6. When we apply a sigmoid non-linearity, because
sigmoid collapses everything between 0 and 1, anything greater than 0 is going
to be above 0.5 and anything below zero is going to be less than 0.5. So
because minus 6 is less than zero, we’re going to have a very low output,
about 0.002. We can actually generalize this idea for
the entire feature space, let’s call it. For any point on this plot, I can tell
you: if it lies on the left side of the line, that means that before we apply the
non-linearity, z, the state of that neuron, will be negative, less than zero, and
after applying that non-linearity the sigmoid will give it a probability of
less than 0.5. If it falls on the right side of the line,
it’s the opposite story. If it falls right on the line, it means that z equals
zero exactly and the probability equals 0.5. Now, before I move on, this
is a great example of actually visualizing and understanding what’s
going on inside of a neural network. The reason why it’s hard to do this with
deep neural networks is because you usually don’t have only two inputs and
you usually don’t have only two weights either. This is a simple
two-dimensional problem, but as you scale up the size of your
network you could be dealing with hundreds or thousands or millions of
parameters and million-dimensional spaces, and then visualizing these types
of plots becomes extremely difficult and isn’t practical.
So this is one of the challenges that we face when we’re training neural
networks and really trying to understand their internals, but we’ll talk about how we
can actually tackle some of those challenges in later lectures as well.
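The worked example above can be reproduced in a few lines of plain Python. This is only a sketch: the helper name `perceptron` is hypothetical, and the bias of 1 is an assumption inferred from the quoted result of minus 6 for the input (-1, 2) with weights 3 and -2.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def perceptron(x, w, b):
    """Dot product with the weights, add the bias, apply the non-linearity."""
    z = sum(xi * wi for xi, wi in zip(x, w)) + b
    return z, sigmoid(z)

w = [3.0, -2.0]  # weights from the example
b = 1.0          # assumed bias, chosen so that (-1, 2) gives z = -6

z, y = perceptron([-1.0, 2.0], w, b)
print(z, round(y, 3))  # state before the non-linearity, then the output

# A point exactly on the line z = 0 gets an output of exactly 0.5.
print(perceptron([1.0, 2.0], w, b))
```

Points with negative z fall on one side of the line and get an output below 0.5; points with positive z fall on the other side and get an output above 0.5.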
OK, so now that we have that idea of a perceptron, a single neuron, let’s start
building up neural networks: how we can use that perceptron to create
full neural networks, and see how all of this story comes together. Let’s
revisit this previous diagram of the perceptron. If there are only a few
things you remember from this class, try to take away this: how a perceptron
works. Just keep remembering this; I’m going to keep drilling it in. You take
your inputs, you apply a dot product with your weights, and you apply a
non-linearity. It’s that simple. Oh sorry, I missed a step: you take the dot
product with your weights, add a bias, and apply your non-linearity. So, three steps.
Now let’s simplify this type of diagram a little bit. I’m gonna remove the bias
just for simplicity, and I’m gonna remove all of the weight labels, so now you can
assume that every line has a weight associated with it. And
I’m going to note that z is the output of that dot product, so that’s the
element-wise multiplication of our inputs with our weights, and that’s what
gets fed into our activation function. So our final output y is just our
activation function applied to z. If we want to define a multi-output neural
network, we can simply add another one of these perceptrons to this picture.
Now we have two outputs: one is a normal perceptron, which is y1, and y2 is just
another normal perceptron, the same idea as before. They all connect to the previous
layer with a different set of weights, and because all inputs are densely
connected to all of the outputs, these types of layers are often called dense
layers. Let’s take an example of how one might actually go from this nice
illustration, which is very conceptual and nice and simple, to how you could
actually implement one of these dense layers from scratch by yourselves using
TensorFlow. What we can do is start off by first defining our two sets of parameters:
we have our actual weight vector, which is W, and we also have our bias vector.
Both of these parameters are governed by the output space, so
depending on how many neurons you have in that output layer, that will govern
the size of each of those weight and bias vectors. What we can do then is
simply define that forward propagation of information, so here I’m showing you
this through the call function in TensorFlow. Don’t get too caught up on the details
of the code; again, you’ll get a real walkthrough of this code inside of the
labs today, but I want to just show you some high-level understanding of
how you could actually take what you’re learning and apply the TensorFlow
implementations to it. Inside the call function it’s the same idea again: you
can compute z, which is the state; it’s that multiplication of your inputs with
the weights, then you add the bias, so that’s right there,
and once you have z you just pass it through your sigmoid and that’s your
output for that layer. Now TensorFlow is great because it has
already implemented a lot of these layers for us, so we don’t have to do
what I just showed you from scratch. In fact, to implement a layer like this, a
multi-output perceptron layer
with two outputs, we can simply call tf.keras.layers.Dense with units equal
to two, to indicate that we have two outputs on this layer. And there is a
whole bunch of other parameters that you could input here, such as the activation
function, as well as many other things to customize how this layer behaves in
practice. So now let’s take a look at a single-layered neural network. This is
taking it one step beyond what we’ve just seen. This is where we now have a
single hidden layer that feeds into a single output layer, and I’m calling this
a hidden layer because, unlike our inputs and our outputs, the states of the
hidden layer are not directly enforced or directly observable. We
can probe inside the network and see them, but we don’t actually enforce what
they are; these are learned, as opposed to the inputs, which are provided by us. Now,
since we have a transformation between the inputs and the hidden layer, and between the
hidden layer and the output layer, each of those two transformations will have
its own weight matrix, which here I call W1 and W2, corresponding to
the first layer and the second layer. If we look at a single unit inside of that
hidden layer, take for example z2, which I’m showing here,
that’s just a single perceptron like we talked about before. It’s taking a
weighted sum of all of those inputs that feed into it, and it applies the
non-linearity and feeds it on to the next layer. Same story as before. This
picture actually looks a little bit messy, so what I want to do is actually
clean things up a little bit for you, and I’m gonna replace all of those lines
with just this symbolic representation, and we’ll just use this from now on
to denote dense layers, or fully connected layers,
between an input and an output or between an input and a hidden layer. And
again, if we wanted to implement this
in TensorFlow, the idea is pretty simple. We can just define two of these dense
layers, the first one our hidden layer with n outputs and the second one our
output layer with two outputs, and we can join them together,
aggregate them together, into this wrapper which is called a tf.keras.Sequential
model. Sequential models are just this idea of composing neural networks
using a sequence of layers, so whenever you have a sequential message-passing
system, or are sequentially processing information throughout the network, you
can use sequential models and just define your layers as a sequence, and
it’s a very nice way to allow information to propagate through that model. Now if we
want to create a deep neural network, the idea is basically the same thing, except
you just keep stacking on more of these layers to create more of
a hierarchical model, one where the final output is computed by going deeper
and deeper into this representation. And the code looks pretty similar again:
we have this tf.keras.Sequential model, and inside that model we just have a
list of all of the layers that we want to use, and they’re just stacked on top
of each other. OK, so this is awesome. Hopefully now you have an understanding
of not only what a single neuron is, but how you can compose neurons together and
actually build complex hierarchical models with deep neural networks.
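To make the idea of dense layers stacked in sequence concrete without depending on TensorFlow, here is a rough pure-Python sketch. The `Dense` class and `sequential` helper are hypothetical stand-ins written for illustration, not the TensorFlow implementations shown on the slides, and the layer sizes and uniform random initialization are arbitrary assumptions.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class Dense:
    """Fully connected layer: every input feeds every output unit."""

    def __init__(self, n_in, n_out, rng):
        # One weight per (input, output) pair and one bias per output unit.
        self.W = [[rng.uniform(-1.0, 1.0) for _ in range(n_out)]
                  for _ in range(n_in)]
        self.b = [0.0] * n_out

    def __call__(self, x):
        # z_j = sum_i x_i * W[i][j] + b_j, then the non-linearity.
        z = [sum(x[i] * self.W[i][j] for i in range(len(x))) + self.b[j]
             for j in range(len(self.b))]
        return [sigmoid(zj) for zj in z]

def sequential(layers, x):
    """Feed x through each layer in order, like a stack of layers."""
    for layer in layers:
        x = layer(x)
    return x

rng = random.Random(0)                         # fixed seed for repeatability
model = [Dense(2, 3, rng), Dense(3, 2, rng)]   # 2 inputs, 3 hidden units, 2 outputs
y = sequential(model, [4.0, 5.0])
print(y)
```

Making the network deeper only means appending more `Dense` layers to the list, as long as each layer's input size matches the previous layer's output size.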
Now let’s take a look at how you can apply these neural networks in a very
real and applied setting to solve some problem, and actually train them to
accomplish some task. Here’s a problem that I believe any AI system should be
able to solve for all of you, and probably one that you care a lot about:
will I pass this class? To do this, let’s start with a very simple two-input model.
One feature, or one input, we’re gonna define is how many
lectures you attend during this class, and the second one is the number of
hours that you spend on your final project. I should say that the minimum
number of hours you can spend on your final project is 50 hours. No, I’m just joking.
OK, so let’s take all of the data from previous years and plot it on this
feature space like we looked at before. Green points are students that have
passed the class in the past and red points are people that have failed. We
can plot all of this data onto this two-dimensional grid like this, and we
can also plot you. So here you are: you have attended four lectures and you’ve
only spent five hours on your final project,
and the question is, are you going to pass the class? Given everyone around you
and how they’ve done in the past, how are you going to do? So let’s do it. We have
two inputs, we have a single hidden layer neural network with
three hidden units in that hidden layer, and we’ll see that the final
output probability when we feed in those two inputs of four and five is predicted
to be 0.1, or 10%. The probability of you passing this class is 10%. That’s not
great news. The actual answer was 1, so you did pass the class. Now, does
anyone have an idea of why the network was so wrong in this case? Exactly: we
never told this network anything. The weights are wrong; we’ve just initialized
the weights. In fact, it has no idea what it means to pass a class. It has no idea
what each of these inputs means, how many lectures you’ve attended and the
hours you’ve spent on your final project; it’s just seeing some random numbers. It
has no concept of how other people in the class have done so far. So what we
have to do to this network first is train it, and we have to teach it how to
perform this task. Until we teach it, it’s just like a baby that doesn’t know
anything: it just entered the world and it has no concept or idea of how to
solve this task, and we have to teach it that. Now how do we do that? The idea here
is that first we have to tell the network when it’s wrong, so we have to
quantify what’s called its loss, or its error. To do that, we actually just
take our prediction, what the network predicts, and we compare it to what the
true answer was. If there’s a big discrepancy between the
prediction and the true answer, we can tell the network: hey, you made a big
mistake, so this is a big error, it’s a big loss, and you should try to
fix your answer to move closer towards the true answer. OK,
now you can imagine that you don’t have just one student but many
students. The total loss, let’s call it here the empirical risk or the objective
function (it has many different names), is just the average of all of
those individual losses. So the individual loss is a loss that takes as
input your prediction and your actual answer, telling you how wrong that single
example is, and then the total loss is just the average of all of those
individual student losses. If we look at the problem of binary classification,
which is the case that we actually care about in this example, we’re
asking a question, will I pass the class, yes or no: binary classification. We can
use what is called the softmax cross-entropy loss. For those of you
who aren’t familiar with cross-entropy, this was actually a formulation
introduced by Claude Shannon here at MIT during his master’s thesis, and
this was about 50 years ago. It’s still being used very prevalently today, and
the idea is that it just compares how different these two distributions are. So
you have a distribution of how likely you think the student is going to
pass, and you have the true distribution of whether the student passed or not. You can
compare the difference between those two distributions, and that tells you the
loss that the network incurs on that example. Now let’s assume that instead of
a classification problem we have a regression problem, where instead of
predicting whether you’re going to pass or fail the class you want to predict the
final grade that you’re going to get. So now it’s not a yes/no answer problem
anymore, but instead it’s: what’s the grade I’m
going to get, what’s the number? So it’s a full range of numbers that
are possible now, and we might want to use a different
type of loss for this different type of problem. In this case we can use
what’s called a mean squared error loss. We take the
prediction of the network, we take the
actual true final grade that the student got, we subtract them, we take their
squared error, and we say that that’s the mean squared error. That’s the loss that
the network should try to optimize and try to minimize.
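Both kinds of loss can be sketched in plain Python. This is an illustration under assumptions: it uses binary cross-entropy on a single predicted pass probability as a stand-in for the softmax cross-entropy named in the lecture, the function names are hypothetical, and the grades are made-up numbers.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy between a 0/1 label and a predicted probability."""
    # Clip the prediction so that log(0) is never evaluated.
    p = min(max(y_pred, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1.0 - y_true) * math.log(1.0 - p))

def squared_error(y_true, y_pred):
    """Per-example squared error, for regression targets such as a grade."""
    return (y_true - y_pred) ** 2

def mean_loss(loss_fn, targets, predictions):
    """Total loss (empirical risk): the average of the per-example losses."""
    return sum(loss_fn(t, p) for t, p in zip(targets, predictions)) / len(targets)

# The student who actually passed (label 1) but was given a 10% chance:
print(binary_cross_entropy(1.0, 0.1))   # large loss for a confident miss

# Made-up final grades versus made-up predictions for three students:
print(mean_loss(squared_error, [90.0, 20.0, 95.0], [30.0, 80.0, 85.0]))
```

A prediction closer to the true label gives a smaller cross-entropy, which is the signal the network uses to know how wrong it was.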
Now that we have all this information about the loss function and how to
actually quantify the error of the neural network, let’s take this and
understand how to train our model to actually find those weights that it
needs to use for its prediction. W is what we want to find out: W is the set
of weights, and we want to find the optimal set of weights that tries to
minimize this total loss over our entire test set. Our test set is this example
data set that we want to evaluate our model on, so in the class example the
test set is you: you want to understand how likely you are to pass
this class, so you’re the test set. What this means is that we want to find the
W’s that minimize that total loss function, which we call the objective
function, J of W. Now remember that W is just an aggregation, or a collection, of
all of the individual w’s from all of your weights. This is just a way
for me to express this in clean notation, but W is a whole set of numbers,
not just a single number, and you
want to find the value of each of those weights such that you can minimize this
entire loss function. It’s a very complicated problem. Remember that
our loss function is just a simple function in terms of those weights, so
we can plot it, again in the case of a two-dimensional weight problem: one of
the weights is on the x-axis, one of the weights is on this axis, and on the z
axis we have the loss. So for any value of W we can see what the loss
would be at that point. Now what do we want to do? We want to find the place on
this landscape, the values of W, where we get the minimum loss. OK, so
what we can do is we can just pick a random W pick a random place on this
this landscape to start with and from this random place let’s try to
understand how the landscape is changing what’s the slope of the landscape we can
take the gradient of the loss with respect to each of these weights to
understand the direction of maximum ascent okay that’s what the gradient
tells us now that we know which way is up we can take a step in the direction
that’s down so we know which way is up we reverse the sign so now we start
heading downhill and we can move towards that lowest point now we just keep
repeating this process over and over again until we’ve converged to a local
minimum now we can summarize this algorithm which is known as gradient
descent because you’re taking a gradient and you’re descending down down that
landscape by starting to initialize our rates wait randomly we compute the
gradient DJ with respect to all of our weights then we update our weights in
the opposite direction of that gradient and take a small step which we call here
ADA of that gradient and this is referred to as the learning rate and
we’ll talk a little bit more about that later but ADA is just a scalar number
that determines how much of a step you want to take at each iteration how
strongly or aggressively do you want to step towards that gradient in code the
picture looks very similar so to implement gradient descent is just a few
lines of code just like the pseudocode you can initialize your weights randomly
in the first line you can compute your loss with respect to those gradients and
with respect to those predictions and your data given that gradient you just
update your weights in the opposite direction of that event of that vector
Now, the magic line here is actually: how do you compute that gradient? That’s
something I haven’t told you, and it’s not easy at all. So the question is,
given a loss and given all of the weights in our network, how do we know which
way is a good way to move? That’s a process called backpropagation, and let’s
talk about a very simple example of how we can actually derive backpropagation
using elementary calculus. We’ll start with a very simple network with only one
hidden neuron and one output. This is probably the simplest neural network that
you can create; you can’t really get smaller than this. Computing the gradient
of our loss with respect to w2 here, which is that second weight between the
hidden state and our output, tells us how much a small change in w2 will impact
our loss. That’s what the gradient tells us: if we change w2 by a very small,
differential amount, how does our loss change? Does it go up or down, and by
how much? So that’s the gradient that we care about, the gradient of our loss
with respect to w2. To evaluate this we can just apply the chain rule from
calculus. We can split this up into the gradient of our loss with respect to
our output y, multiplied by the gradient of our output y with respect to w2.
Now if we want to repeat this process for a different weight in the neural
network, let’s say w1 instead of w2, we replace w2 with w1 on both sides and
apply the chain rule again. But now you’re going to notice that the gradient of
y with respect to w1 is also not directly computable. We have to apply the
chain rule again to evaluate it. So let’s apply the chain rule again: we can
break that second term up through the hidden state z, into the gradient of y
with respect to z multiplied by the gradient of z with respect to w1. Using
that, we can back-propagate all of these gradients from the output all the way
back to the input. That allows our error signal to propagate from output to
input and allows these gradients to be computed in practice. Now, it’s not as
crucial that you understand the nitty-gritty math here, because in a lot of
popular deep learning frameworks we have what’s called automatic
differentiation, which does all of this backpropagation for you under the hood,
and you never even see it, which is incredible. It made training neural
networks so much easier. You don’t have to implement backpropagation anymore,
but it’s still important to understand how these work at the foundation, which
is why we’re going through it now.
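
The chain-rule derivation above can be sketched by hand for the two-weight
network. This is only an illustration under stated assumptions: the lecture
does not fix the activation or the loss at this point, so the sigmoid
activation, the squared-error loss, the absence of biases, and every function
name below are hypothetical choices. A finite-difference check is included to
confirm the hand-derived gradients.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w1, w2):
    z = w1 * x       # pre-activation of the single hidden neuron
    h = sigmoid(z)   # hidden state (assumed sigmoid activation)
    y = w2 * h       # network output
    return z, h, y


def loss(y, target):
    # Assumed squared-error loss.
    return 0.5 * (y - target) ** 2


def backprop(x, target, w1, w2):
    z, h, y = forward(x, w1, w2)
    dJ_dy = y - target                 # dJ/dy
    dJ_dw2 = dJ_dy * h                 # dJ/dw2 = dJ/dy * dy/dw2
    dy_dh = w2                         # output with respect to hidden state
    dh_dz = h * (1.0 - h)              # derivative of the sigmoid
    dz_dw1 = x
    dJ_dw1 = dJ_dy * dy_dh * dh_dz * dz_dw1   # chain rule applied twice
    return dJ_dw1, dJ_dw2


def numerical_grad(x, target, w1, w2, eps=1e-6):
    # Central finite differences, used only to check the analytic result.
    def J(a, b):
        return loss(forward(x, a, b)[2], target)
    g1 = (J(w1 + eps, w2) - J(w1 - eps, w2)) / (2 * eps)
    g2 = (J(w1, w2 + eps) - J(w1, w2 - eps)) / (2 * eps)
    return g1, g2


analytic = backprop(x=1.5, target=0.2, w1=0.4, w2=-0.7)
numeric = numerical_grad(x=1.5, target=0.2, w1=0.4, w2=-0.7)
print(analytic, numeric)
```

The two printed pairs should agree to several decimal places. Automatic
differentiation in a framework performs the same bookkeeping for every weight
without the hand derivation.
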
Obviously, you then repeat this for every single weight in the network. Here we
showed it for just w1 and w2, which is every single weight in this network, but
if you have more you can just repeat it again: keep applying the chain rule
from output to input to compute this. And that’s the backprop algorithm. In
theory it’s very simple; it’s just an application of the chain rule in essence.
But now let’s touch on some of the insights from training and how you can use
the backprop algorithm to train these networks in practice. Optimization of neural
networks is incredibly tough in practice, so it’s not as simple as the colorful
picture I showed you on the previous slide. Here’s an illustration from a paper
that came out about two or three years ago now, where the authors tried to
visualize the landscape of a neural network with millions of parameters, but
they collapsed that down onto just a two-dimensional space so that we can
visualize it. You can see that the landscape is incredibly complex. It’s not
easy. There are many local minima where the gradient descent algorithm could
get stuck, and applying gradient descent in practice in these types of
environments, which are very standard in neural networks, can be a huge
challenge. Now, recall the update equation
that we defined previously with gradient descent. This is that same equation:
we’re going to update our weights in the opposite direction of our gradient. I
didn’t talk too much about this parameter eta. I pointed it out; this is the
learning rate. It determines how much of a step we should take in the direction
of that gradient, and in practice setting this learning rate can have a huge
impact on performance. If you set that learning rate too small, that means that
you’re not really trusting your gradient on each step. If eta is super tiny,
that means on each step you’re only going to move a little bit in the opposite
direction of your gradient, just in small increments, and what can happen then
is you can get stuck in these local minima because you’re not being as
aggressive as you should be to escape them. Now if you set the learning rate
too large, you can actually overshoot completely and diverge, which is even
more undesirable. So setting the learning rate can be very challenging in
practice. You want to pick a learning rate that’s large enough that you avoid
the local minima but small enough that you still converge in practice. Now the
question that you’re all probably asking is: how do we set the
learning rate then? Well, one option is that you can just try a bunch of
learning rates and see what works best. Another option is to do something a
little bit more clever and see if we can have an adaptive learning rate that
changes with respect to our loss landscape. Maybe it changes with respect to
how fast the learning is happening, or a range of other ideas within the
network optimization scheme itself. This means that the learning rate is no
longer fixed; it can now increase or decrease throughout training. As training
progresses your learning rate may speed up and you may take more aggressive
steps, or you may take smaller steps as you get closer to the local minimum so
that you really converge on that point. There are many options for how you
might want to design this adaptive algorithm. This has been a widely studied
field in optimization theory for machine learning and deep learning, and there
have been many published papers and implementations within TensorFlow of these
different types of adaptive learning rate algorithms. SGD is just the vanilla
gradient descent that I showed you before; that’s the first one. All of the
others are adaptive learning rate methods, which means that they change their
learning rate during training itself, so it can increase or decrease depending
on how the optimization is going. During your labs we really encourage you
again to try out some of these different optimization schemes and see what
works and what doesn’t. A lot of it is problem dependent. There are some
heuristics that you can pick up, but we want you to really gain those
heuristics yourselves through the course of the labs. It’s part of building
character. Okay, so let’s put
this all together from the beginning. We can define our model, which is defined
as this sequential wrapper. Inside of this sequential wrapper we have all of
our layers, and all of these layers are composed of perceptrons, or single
neurons, which we saw earlier. The second line defines our optimizer, which we
saw on the previous slide. This can be SGD; it can also be any of those
adaptive learning rate methods that we saw before. Now, during our training
loop, it’s the same story as before; nothing’s changing here. We forward pass
all of our inputs through that model and we get our predictions. Using those
predictions we can evaluate them and compute our loss. Our loss tells us how
wrong our network was on that iteration. It also tells us how we can compute
the gradients and how we can change all of the weights in the network to
improve in the future. Then the final line takes those gradients and actually
allows our optimizer to update the weights, the trainable variables, such that
on the next iteration they do a little bit better. Over time, if you keep
looping, this will converge and hopefully you should fit your data. Now I want
to continue to talk about
some tips for training these networks in practice and focus on a very powerful
idea: batching your data into mini-batches. To do this, let’s revisit the
gradient descent algorithm. This gradient is actually very computationally
expensive to compute in practice, so running the backprop algorithm over the
whole data set is very expensive in practice. So what we want to do is not
compute this over all of the data points, but compute it over just a single
data point in the data set. In most real-life applications it’s not actually
feasible to compute on your entire data set at every iteration; it’s just too
much data. So instead we pick a single point randomly, we compute our gradient
with respect to that point, and then on the next iteration we pick a different
point, and we get a rough estimate of our gradient at each step. So instead of
using all of our data, now we just pick a single point i and compute our
gradient with respect to that single point i. Now, what’s a middle ground here?
The downside of using a single point is that it’s going to be very noisy. The
downside of using all of the points is that it’s too computationally expensive.
Is there some middle ground that we can have in between? That middle ground is
actually very simple: instead of taking one point, and instead of taking all of
the points, let’s take a mini-batch of points, maybe something on the order of
10, 20, 30, or 100, depending on how rough or accurate you want that
approximation of your gradient to be and how much you want to trade off speed
and computational efficiency. Now our estimate of the true gradient is obtained
by averaging the gradient from each of those B points, where B is the size of
your batch. Since B is normally not that large, like I said maybe on the order
of tens to a hundred, this is much faster to compute than full gradient descent
and much more accurate than single-point stochastic gradient descent, because
it’s using more than one point, more than one estimate. This increase in
gradient estimation accuracy allows us to converge to our target much more
quickly, because it means that our gradients are more accurate in practice.
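
The mini-batch idea above can be sketched on a toy problem. Everything here is
an assumption made for illustration: the one-weight linear model, the synthetic
data generated around a slope of 2, the squared-error loss, and the names
make_data, batch_gradient, and minibatch_sgd. The point is only that each
update averages the per-example gradients over B randomly chosen points.

```python
import random


def make_data(n=1000, seed=0):
    # Synthetic data: y is roughly 2 * x plus a little noise (made up).
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0.0, 0.1) for x in xs]
    return xs, ys


def batch_gradient(w, xs, ys, indices):
    # Average of the per-example gradients of 0.5 * (w * x - y) ** 2
    # over the chosen points. Passing every index gives the full gradient.
    return sum((w * xs[i] - ys[i]) * xs[i] for i in indices) / len(indices)


def minibatch_sgd(xs, ys, batch_size=32, eta=0.1, steps=500, seed=1):
    rng = random.Random(seed)
    w = rng.uniform(-1.0, 1.0)  # random initial weight
    for _ in range(steps):
        # Sample B distinct points and step against their averaged gradient.
        batch = rng.sample(range(len(xs)), batch_size)
        w -= eta * batch_gradient(w, xs, ys, batch)
    return w


xs, ys = make_data()
w_fit = minibatch_sgd(xs, ys)
print(w_fit)
```

Setting batch_size to 1 gives the noisy single-point variant, and setting it to
len(xs) gives full gradient descent, so the same function covers all three
cases discussed above.
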
It also means that we can increase our learning rate and trust each update
more. If we’re very noisy in our gradient estimation, we probably want to lower
our learning rate a little so we don’t fully step in the wrong direction when
we’re not totally confident in that gradient. If we have a larger batch of data
to estimate our gradients with, we can trust that learning rate a little more
and increase it so that it steps more aggressively in that direction. What this
also means is that we can now massively parallelize this computation, because
we can split up batches across multiple GPUs or even multiple computers to
achieve even more significant speedups in this training process. Now, the last
topic I want to address is that of overfitting
and this is also known as the problem of generalization in machine learning.
It’s actually not unique to deep learning; it’s a fundamental problem of all of
machine learning. Ideally in machine learning we want a model that accurately
describes our data. Said differently, we want to build models that can learn
representations from our training data that still generalize to unseen test
data. Now assume that you want to build a line that best describes these points
you can see on the screen. Underfitting describes the case where our model does
not have the capacity to capture the true complexity of this problem, while
overfitting, on the right, starts to memorize certain aspects of our training
data, and this is also not desirable. We want the middle ground: ideally we end
up with a model in the middle that is not so complex that it memorizes all of
our training data, but one that will continue to generalize when it sees new
data. So to address this problem of overfitting in neural networks
specifically, let’s talk about the technique of regularization, which is a way
that we can deal with this. What it’s doing is trying to discourage complex
information from being learned. We want to keep the model from actually
learning to memorize the training data. We don’t want it to learn very specific
pinpoints of the training data that don’t generalize well to test data. As
we’ve seen before, this is crucial for our models to be able to generalize to
our test data, so this is very important. The most popular
regularization technique in deep learning is this very basic idea of dropout.
Let’s start by revisiting this picture of a neural network that we introduced
previously. In dropout, during training we randomly set some of the activations
of the hidden neurons to zero with some probability. Let’s say our probability
is 0.5: we’re randomly going to set the activations of our hidden neurons to 0
with probability 0.5. The idea is extremely powerful because it allows the
network to lower its capacity. It also makes it such that the network can’t
build these memorization channels through the network where it tries to just
remember the data, because on every iteration 50% of that memory is going to be
wiped out. So it’s going to be forced not only to generalize better but also to
have multiple channels through the network and build a more robust
representation for its prediction. We just repeat this on every iteration. On
the first iteration we drop out one 50% of the nodes; on the next iteration we
can drop out a different randomly sampled 50%, which may include some of the
previously sampled nodes as well. This will allow the network to generalize
better to new test data. The second regularization technique that
we’ll talk about is the notion of early stopping. What I want to do here is
just talk about two lines. During training, which is the x-axis here, we have
two lines; the y-axis is our loss. The first line is our training loss, so
that’s the green line. The green line tells us how well our model is fitting
our training data. We expect this to be lower than the second line, which is
our testing data.
So usually we expect to be doing better on our training data than our testing
data. As we train, and as this line moves forward into the future, both of
these lines should decrease, because we’re optimizing the network and improving
its performance. Eventually, though, there comes a point where the training
loss starts to diverge from the testing loss. What happens is that the model
should always continue to fit the training data, because it’s still seeing all
of the training data. It’s not being penalized for that, except maybe through
dropout or other means. But the testing data it’s not seeing, so at some point
the network is going to start to do better on its training data than its
testing data, and what this means is basically that the network is starting to
memorize some of the training data. That’s what you don’t want. So what we can
do is perform early stopping: we can identify this point, this inflection point
where the test loss starts to increase and diverge from the training loss, so
we can stop the network early and make sure that our test loss is as low as
possible. And of course, if we look at the left side of this point, that’s
where the model is underfit. We haven’t reached the true capacity of our model
yet, so we’d want to keep training if we hadn’t stopped yet. On the right side
is where we’ve overfit, where we’ve passed that early stopping point and we’ve
started to memorize some of our training data, and that’s when we’ve gone too
far. I’ll conclude this lecture by just
summarizing three main points that we’ve covered so far. First, we’ve learned
about the fundamentals of neural networks, starting with a single neuron, or
perceptron. We’ve learned about stacking and composing these perceptrons
together to form complex hierarchical representations, and how we can
mathematically optimize these networks using a technique called
backpropagation using their loss. And finally, we addressed the practical side
of training these models, such as mini-batching, regularization, and adaptive
learning rates as well. With that I’ll finish up. I can take a couple of
questions, and then we’ll move on to Ava’s lecture on deep sequential modeling.
I’ll take maybe a couple of questions if there are any now. Thank you.
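
As a supplement to the early stopping discussion in the lecture, here is a
minimal sketch of one common way to pick the stopping point from a recorded
loss curve. The lecture only describes stopping where the held-out loss turns
upward; the patience rule, the function name early_stopping_epoch, and the
loss values below are assumptions made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (best_epoch, stop_epoch) for a sequence of held-out losses."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, val_loss in enumerate(val_losses):
        if val_loss < best_loss:
            # Held-out loss is still improving: remember this epoch.
            best_loss, best_epoch = val_loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs in a row: stop training.
            return best_epoch, epoch
    return best_epoch, len(val_losses) - 1


# Hypothetical held-out loss per epoch: it falls, then starts rising.
held_out = [1.00, 0.70, 0.55, 0.50, 0.52, 0.58, 0.66]
best, stopped = early_stopping_epoch(held_out)
print(best, stopped)  # prints: 3 5
```

In practice you would keep the weights saved at the best epoch rather than the
ones from the epoch where training was halted.
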

20 thoughts on “MIT Introduction to Deep Learning | 6.S191”

  • February 8, 2020 at 3:17 pm

    Great!!! Excited to watch your lecture….

  • February 8, 2020 at 3:56 pm

    Great lecture. When will be lecture 2 uploaded?

  • February 8, 2020 at 4:05 pm

    Thank you for the great lesson!

  • February 8, 2020 at 4:43 pm

    this video saved my life

  • February 8, 2020 at 4:53 pm

    Great Explanation.

  • February 8, 2020 at 6:29 pm

    Is there any way to access labs
    ..

  • February 8, 2020 at 6:32 pm

    8:36 I think you mean traditional artificial intelligence methods…

  • February 8, 2020 at 6:55 pm

    Thanx for sharing knowledge it really helps to understand DL lots of love from INDIA

  • February 8, 2020 at 7:58 pm

    If I may ask…are there any prerequisites for getting the most out of this course…for example, do I need to understand machine learning first?

  • February 8, 2020 at 9:56 pm

    (50:50 Slide – Regularization 2: Early Stopping) Shouldn't the blue curve be labeled validation instead of testing? Considering that we have training, validation and testing steps…

  • February 8, 2020 at 11:18 pm

    Hi! Thank you for uploading this on youtube!
    Do you plan to put it on edX?

  • February 8, 2020 at 11:42 pm

    Thanks a billion ..please upload the entire video during the 6 week period🙌

  • February 9, 2020 at 8:26 am

    Wow, so much information on the first lecture. Great work.

  • February 9, 2020 at 9:22 am

    i am not expert in core tensor flow, i mostly do work in keras. Will the labs would be tough for me?

  • February 9, 2020 at 7:46 pm

    Perfect…

  • February 10, 2020 at 1:59 am

    Awesome. Thank you for sharing!

  • February 10, 2020 at 3:15 am

    Thanks a ton for imparting the knowledge.

  • February 10, 2020 at 4:28 am

    in 12:55, there appears mixing up of dot product and element-wise multiplication. Aren't they two different things?

  • February 10, 2020 at 11:42 am

    Excellent lecture..please update the site with other videos as well. Looking forward to seeing the rest as well

  • February 10, 2020 at 12:35 pm

    is this the full course or only the intro. I'm so excited to watch this
