Machine Learning, Deep Learning / Andrew Ng Machine Learning Coursera Lecture Notes

Week 1 Lecture ML : Linear Regression ~ parameter learning

mcdn 2020. 8. 6. 14:37

More formally, in supervised learning, we have a data set and this data set is called a training set. So for the housing prices example, we have a training set of different housing prices and our job is to learn from this data how to predict prices of the houses. So I'm gonna use lowercase m throughout this course to denote the number of training examples. So in this data set, if I have, you know, let's say 47 rows in this table, then I have 47 training examples and m equals 47. Let me use lowercase x to denote the input variables, often also called the features. Those would be the x's here; those would be the input features. And I'm gonna use y to denote my output variables, or the target variable, which I'm going to predict, and so that's the second column here.
I'm going to use lowercase h, and h stands for hypothesis. The job of the hypothesis is to be a function that takes as input the size of a house, like maybe the size of the new house your friend's trying to sell, so it takes in the value of x and it tries to output the estimated value of y for the corresponding house. So h is a function that maps from x's to y's. And why a linear function? Well, sometimes we'll want to fit more complicated, perhaps non-linear functions as well. But since this linear case is the simple building block, we will start with this example first of fitting linear functions, and we will build on this to eventually have more complex models and more complex learning algorithms. This model is called linear regression, or, in this example, linear regression with one variable, with the variable being x. And another name for this model is univariate linear regression. And univariate is just a fancy way of saying one variable.

ML:Linear Regression with One Variable

Model Representation

Recall that in regression problems, we are taking input variables and trying to fit the output onto a continuous expected result function.

Linear regression with one variable is also known as "univariate linear regression."

Univariate linear regression is used when you want to predict a single output value y from a single input value x. We're doing supervised learning here, so that means we already have an idea about what the input/output cause and effect should be.

The Hypothesis Function

Our hypothesis function has the general form:

\hat{y} = h_\theta(x) = \theta_0 + \theta_1 x

Note that this is like the equation of a straight line. We give to h_\theta(x) values for \theta_0 and \theta_1 to get our estimated output \hat{y}. In other words, we are trying to create a function called h_\theta that is trying to map our input data (the x's) to our output data (the y's).

Example:

Suppose we have the following set of training data:

input x | output y
--------+---------
   0    |    4
   1    |    7
   2    |    7
   3    |    8

Now we can make a random guess about our h_\theta function: \theta_0=2 and \theta_1=2. The hypothesis function becomes h_\theta(x)=2+2x.

So for an input of 1 to our hypothesis, the predicted y will be 4, while the actual y is 7; this is off by 3. Note that we will be trying out various values of \theta_0 and \theta_1 to try to find values which provide the best possible "fit", or the most representative "straight line", through the data points mapped on the x-y plane.
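To make this concrete, here is a minimal Python sketch (my own illustration; the course itself uses Octave) that evaluates the guessed hypothesis h_\theta(x) = 2 + 2x on the toy training set above and prints the error for each example:

```python
# Toy training set from the table above
X = [0, 1, 2, 3]
Y = [4, 7, 7, 8]

# Guessed parameters: theta_0 = 2, theta_1 = 2
theta0, theta1 = 2.0, 2.0

def h(x):
    """Hypothesis h_theta(x) = theta_0 + theta_1 * x."""
    return theta0 + theta1 * x

for x, y in zip(X, Y):
    print(f"x={x}: predicted {h(x)}, actual {y}, error {h(x) - y}")
# For x=1 the prediction is 4 while the actual value is 7, i.e. off by 3.
```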

This is sometimes called the squared error cost function. So why do we take the squares of the errors? It turns out that this squared error cost function is a reasonable choice and works well for most regression problems. There are other cost functions that would work pretty well, but the squared error cost function is probably the most commonly used one for regression problems. Later in this class we'll talk about alternative cost functions as well, but this choice that we just had should be a pretty reasonable thing to try for most linear regression problems.

Cost Function

We can measure the accuracy of our hypothesis function by using a cost function. This takes an average (actually a fancier version of an average) of all the results of the hypothesis with inputs from x's compared to the actual output y's.

J(\theta_0, \theta_1) = \dfrac{1}{2m} \displaystyle\sum_{i=1}^m \left( \hat{y}_{i} - y_{i} \right)^2 = \dfrac{1}{2m} \displaystyle\sum_{i=1}^m \left( h_\theta(x_{i}) - y_{i} \right)^2

To break it apart, it is \frac{1}{2} \bar{x}, where \bar{x} is the mean of the squares of h_\theta(x_{i}) - y_{i}, or the difference between the predicted value and the actual value. This function is otherwise called the "Squared error function", or "Mean squared error". The mean is halved \left(\frac{1}{2m}\right) as a convenience for the computation of the gradient descent, as the derivative term of the square function will cancel out the \frac{1}{2} term.
https://wikidocs.net/4213
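As a sanity check, this cost is easy to compute directly. The following Python sketch (my own illustration, not course code) evaluates J(\theta_0, \theta_1) on the toy training set from the previous section:

```python
def compute_cost(X, Y, theta0, theta1):
    """Squared error cost J(theta0, theta1) = (1/2m) * sum((h(x_i) - y_i)^2)."""
    m = len(X)
    squared_errors = sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(X, Y))
    return squared_errors / (2 * m)

X = [0, 1, 2, 3]
Y = [4, 7, 7, 8]
print(compute_cost(X, Y, 2.0, 2.0))  # 1.75 for the guess theta_0=2, theta_1=2
```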
Because for the specific training set I have, my 3 training examples are (1, 1), (2, 2), (3, 3). If theta one is equal to one, then h of x(i) is equal to y(i) exactly. And so h of x minus y, each of these terms, is equal to zero, which is why I find that J of one is equal to zero. So we now know that J of one is equal to zero. Let's plot that. What I'm gonna do on the right is plot my cost function J. And notice, because my cost function is a function of my parameter theta one, when I plot my cost function, the horizontal axis is now labeled with theta one. So I have J of one equal to zero, so let's go ahead and plot that: we end up with an X over there. Now let's look at some other examples. Theta one can take on a range of different values, right? So theta one can take on negative values, zero, and positive values.
So, to wrap up: in this video, we looked at some plots to understand the cost function. To do so, we simplified the algorithm so that it only had one parameter, theta one, and we set the parameter theta zero to zero. In the next video we'll go back to the original problem formulation and look at some visualizations involving both theta zero and theta one, that is, without setting theta zero to zero. And hopefully that will give you an even better sense of what the cost function J is doing in the original linear regression formulation.
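The simplified setting in this transcript (theta zero fixed at 0, training set (1, 1), (2, 2), (3, 3)) is easy to reproduce. Here is a small sketch of my own that tabulates J(\theta_1) for a few values of \theta_1 and shows the minimum at \theta_1 = 1:

```python
X = [1, 2, 3]
Y = [1, 2, 3]
m = len(X)

def J(theta1):
    """Cost with theta_0 fixed to 0: J(theta1) = (1/2m) * sum((theta1*x_i - y_i)^2)."""
    return sum((theta1 * x - y) ** 2 for x, y in zip(X, Y)) / (2 * m)

for t in [-0.5, 0.0, 0.5, 1.0, 1.5, 2.0]:
    print(f"theta1 = {t:4.1f}  ->  J = {J(t):.3f}")
# J(1.0) is exactly 0, as in the transcript; the cost grows as theta1 moves away from 1.
```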
Right, that's the vertical axis. The height of the surface at each point indicates the value of J(theta zero, theta one). And you can see it sort of has this bowl-like shape. Let me show you the same plot in 3D. So here's the same figure in 3D, with horizontal axes theta zero and theta one, and vertical axis J(theta zero, theta one), and if I rotate this plot around, you kind of get a sense, I hope, of this bowl-shaped surface, as that's what the cost function J looks like. Now, for the purpose of illustration in the rest of this video, I'm not actually going to use these sort of 3D surfaces to show you the cost function J; instead I'm going to use contour plots, or what I also call contour figures.
And so the contour figures are maybe a more convenient way to visualize my function J.

Now we are able to concretely measure the accuracy of our predictor function against the correct results we have so that we can predict new results we don't have.

If we try to think of it in visual terms, our training data set is scattered on the x-y plane. We are trying to make a straight line (defined by h_\theta(x)) which passes through this scattered set of data. Our objective is to get the best possible line. The best possible line will be such that the average squared vertical distances of the scattered points from the line are the least. In the best case, the line would pass through all the points of our training data set; in such a case the value of J(\theta_0, \theta_1) would be 0.

And then you keep going. From this new point you look around, decide what direction would take you downhill most quickly, take another step, another step, and so on, until you converge to this local minimum down here. ... It turns out that if you're standing at that point on the hill, you look all around and you find that the best direction to take a little step downhill is roughly that direction. Okay, and now you're at this new point on your hill. You're gonna, again, look all around and say, what direction should I step in to take a little baby step downhill? And if you do that and take another step, you take a step in that direction.
Okay? So if I write a equals b, then I'm asserting that the value of a equals the value of b, right? So the left hand side, that's the computer operation, where we set the value of a to a new value. The right hand side, this is asserting, I'm just making a claim, that the values of a and b are the same. And so, whereas you can write a := a + 1, which means increment a by 1, hopefully I won't ever write a = a + 1, because that's just wrong: a and a + 1 can never be equal to the same value. Okay? So this is the first part of the definition. This alpha here is a number that is called the learning rate.
And the difference between the right-hand side and the left-hand side implementations is that, if you look down here, you look at this step: if by this time you've already updated theta 0, then you would be using the new value of theta 0 to compute this derivative term. And so this gives you a different value of temp1 than the left-hand side, right? Because you've now plugged in the new value of theta 0 into this equation. And so this, on the right-hand side, is not a correct implementation of gradient descent, okay?

ML:Gradient Descent

So we have our hypothesis function and we have a way of measuring how well it fits the data. Now we need to estimate the parameters in the hypothesis function. That's where gradient descent comes in.

Imagine that we graph our hypothesis function based on its fields \theta_0 and \theta_1 (actually we are graphing the cost function as a function of the parameter estimates). This can be kind of confusing; we are moving up to a higher level of abstraction. We are not graphing x and y itself, but the parameter range of our hypothesis function and the cost resulting from selecting a particular set of parameters.

We put \theta_0 on the x axis and \theta_1 on the y axis, with the cost function on the vertical z axis. The points on our graph will be the result of the cost function using our hypothesis with those specific theta parameters.

We will know that we have succeeded when our cost function is at the very bottom of the pits in our graph, i.e. when its value is the minimum.

 

The way we do this is by taking the derivative (the tangential line to a function) of our cost function. The slope of the tangent is the derivative at that point and it will give us a direction to move towards. We make steps down the cost function in the direction with the steepest descent, and the size of each step is determined by the parameter α, which is called the learning rate.

The gradient descent algorithm is:

repeat until convergence:

\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)

where j = 0, 1 represents the feature index number.

Intuitively, this could be thought of as:

repeat until convergence:

\theta_j := \theta_j - \alpha [\text{slope of tangent, i.e. the derivative of } J \text{ in the } \theta_j \text{ dimension}]
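The lecture stresses (see the transcript above about temp1) that both parameters must be updated simultaneously from the old values. Here is a minimal Python sketch of that update structure; dJ_dtheta0 and dJ_dtheta1 are hypothetical placeholder callables for the two partial derivatives, and the alpha value is an arbitrary choice:

```python
alpha = 0.01  # learning rate (arbitrary value for illustration)

def gradient_descent_step(theta0, theta1, dJ_dtheta0, dJ_dtheta1):
    """One simultaneous update of both parameters.

    dJ_dtheta0 / dJ_dtheta1 are callables returning the partial derivatives of J
    evaluated at (theta0, theta1); their definitions depend on the cost function.
    """
    # Compute both updates from the *old* parameter values first...
    temp0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)
    temp1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)
    # ...and only then assign, so the new theta0 never leaks into temp1.
    return temp0, temp1
```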

Gradient Descent Intuition

In this video we explored the scenario where we used one parameter \theta_1 and plotted its cost function to implement gradient descent. Our formula for a single parameter was:

Repeat until convergence:

\theta_1 := \theta_1 - \alpha \frac{d}{d\theta_1} J(\theta_1)

 

 

Just to remind you, this parameter, or this term alpha, is called the learning rate, and it controls how big a step we take when updating my parameter theta j. And this second term here is the derivative term. And what I wanna do in this video is give you the intuition about what each of these two terms is doing and why, when put together, this entire update makes sense. In order to convey these intuitions, what I want to do is use a slightly simpler example, where we want to minimize a function of just one parameter. So say we have a cost function J of just one parameter, theta one, like we did a few videos back, where theta one is a real number, so we can have 1D plots, which are a little bit simpler to look at. Let's try to understand what gradient descent would do on this function.
Alpha, the learning rate, is always a positive number. And so we're going to take theta one and update it as theta one minus something. So I'm gonna end up moving theta one to the left. I'm gonna decrease theta one, and we can see this is the right thing to do, cuz I actually wanna head in this direction, you know, to get me closer to the minimum over there.
And so I have theta 1 minus a negative number, which means I'm actually going to increase theta, because it's minus a negative number, which means I'm adding something to theta. And what that means is that I'm going to end up increasing theta, which again seems like the thing I wanted to do to try to get me closer to the minimum.
But if my learning rate is too big, I may take a huge step, going from here all the way out to there. So we end up being over there, right? And if my learning rate is too big, we can take another huge step on the next iteration and kind of overshoot and overshoot and so on, until you notice I'm actually getting further and further away from the minimum. So if alpha is too large, gradient descent can fail to converge or even diverge.
You have theta one, cuz I updated it as theta one minus alpha times zero. And so what this means is that if you're already at the local optimum, it leaves theta 1 unchanged, cause it updates as theta 1 equals theta 1. So if your parameters are already at a local minimum, one step of gradient descent does absolutely nothing; it doesn't change your parameters, which is what you want, because it keeps your solution at the local optimum.
So just to recap: in gradient descent, as we approach a local minimum, gradient descent will automatically take smaller steps. And that's because, as we approach the local minimum, by definition the local minimum is where the derivative is equal to zero. As we approach the local minimum, this derivative term will automatically get smaller, and so gradient descent will automatically take smaller steps. So there's no need to decrease alpha over time.

Regardless of the slope's sign for \frac{d}{d\theta_1} J(\theta_1), \theta_1 eventually converges to its minimum value. The following graph shows that when the slope is negative, the value of \theta_1 increases and when it is positive, the value of \theta_1 decreases.

On a side note, we should adjust our parameter \alpha to ensure that the gradient descent algorithm converges in a reasonable time. Failure to converge or too much time to obtain the minimum value imply that our step size is wrong.

 

How does gradient descent converge with a fixed step size \alpha?

The intuition behind the convergence is that \frac{d}{d\theta_1} J(\theta_1) approaches 0 as we approach the bottom of our convex function. At the minimum, the derivative will always be 0 and thus we get:

\theta_1:=\theta_1-\alpha * 0
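To see this numerically, here is a tiny demonstration of my own, using an assumed toy cost J(\theta_1) = \theta_1^2 (so the derivative is 2\theta_1) and an arbitrary fixed \alpha: the step size shrinks on its own as the derivative shrinks, and at the minimum the update would leave \theta_1 unchanged.

```python
theta1 = 3.0   # arbitrary starting point
alpha = 0.1    # fixed learning rate (never decreased)

def dJ(theta1):
    """Derivative of the toy cost J(theta1) = theta1**2."""
    return 2 * theta1

for step in range(5):
    update = alpha * dJ(theta1)
    theta1 = theta1 - update
    print(f"step {step}: theta1 = {theta1:.4f}, step size = {update:.4f}")
# The step size shrinks every iteration even though alpha stays fixed;
# at theta1 = 0 the derivative is 0, so the update changes nothing.
```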

 

So we want to figure out what this partial derivative is for both the theta 0 case and the theta 1 case, and I'm just going to write out the answers. It turns out this first term simplifies to 1/m times the sum, over my training set, of just that, of h(x(i)) minus y(i), and for this term, the partial derivative with respect to theta 1, it turns out I get (h(x(i)) minus y(i)) times x(i). Okay, and computing these partial derivatives, so we're going from this equation, right, going from this equation to either of the equations down there, computing those partial derivative terms requires some multivariate calculus.
Partial derivative example: https://j1w2k3.tistory.com/988
https://mangkyu.tistory.com/34 — repeat until convergence
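If you would rather not do the multivariate calculus by hand, a quick symbolic check with sympy (assuming it is installed; this is my own verification, not course code) confirms the two derivative terms for a single training example:

```python
import sympy as sp

theta0, theta1, x, y = sp.symbols('theta0 theta1 x y')

# Contribution of one training example to the cost. The 1/m factor is a constant
# with respect to the thetas, so it is left out of this single-example check.
J_i = (theta0 + theta1 * x - y) ** 2 / 2

print(sp.diff(J_i, theta0))  # theta0 + theta1*x - y      -> the (h(x) - y) term
print(sp.diff(J_i, theta1))  # x*(theta0 + theta1*x - y)  -> the (h(x) - y) * x term
```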
We saw how, depending on where you initialize it, you can end up at different local optima; you will either wind up here or here. But it turns out that the cost function for linear regression is always going to be a bowl-shaped function like this. The technical term for this is that this is called a convex function, and I'm not gonna give the formal definition of what a convex function is, C-O-N-V-E-X, but informally a convex function means a bowl-shaped function, and so this function doesn't have any local optima except for the one global optimum. And gradient descent on this type of cost function, which you get whenever you're using linear regression, will always converge to the global optimum, because there are no other local optima, only the global optimum. So now let's see this algorithm in action.

 

...until eventually I've now wound up at the global minimum, and this global minimum corresponds to this hypothesis, which gets me a good fit to the data. Finally, just to give this another name, it turns out that the algorithm that we just went over is sometimes called batch gradient descent.
So every step of gradient descent we end up computing something like this, that sums over our m training examples, and so the term batch gradient descent refers to the fact that we're looking at the entire batch of training examples. And it turns out that there are sometimes other versions of gradient descent that are not batch versions, but that instead do not look at the entire training set but look at small subsets of the training set at a time. And we'll talk about those versions later in this course as well. But for now, using the algorithm we just learned about, or using batch gradient descent, you now know how to implement gradient descent for linear regression.
Later in this course we'll talk about that method as well, a method that just solves for the minimum of the cost function J without needing these multiple steps of gradient descent. That other method is called the normal equations method. But in case you've heard of that method, it turns out that gradient descent will scale better to larger data sets than the normal equations method. And now that we know about gradient descent, we'll be able to use it in lots of different contexts, and we'll use it in lots of different machine learning problems as well.

Gradient Descent for Linear Regression

When specifically applied to the case of linear regression, a new form of the gradient descent equation can be derived. We can substitute our actual cost function and our actual hypothesis function and modify the equation to (the derivation of the formulas is out of the scope of this course, but a really great one can be found here):

repeat until convergence: {

\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} \left( h_\theta(x_{i}) - y_{i} \right)

\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} \left( (h_\theta(x_{i}) - y_{i}) x_{i} \right)

}

where m is the size of the training set, \theta_0 is a constant that will be changing simultaneously with \theta_1, and x_{i}, y_{i} are values of the given training set (data).

Note that we have separated out the two cases for \theta_j into separate equations for \theta_0 and \theta_1; and that for \theta_1 we are multiplying x_{i} at the end due to the derivative.

The point of all this is that if we start with a guess for our hypothesis and then repeatedly apply these gradient descent equations, our hypothesis will become more and more accurate.
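Putting the two update rules together, here is a short, self-contained Python sketch of batch gradient descent for univariate linear regression (my own illustration; the course itself uses Octave, and the learning rate and iteration count are arbitrary choices):

```python
def batch_gradient_descent(X, Y, alpha=0.1, iterations=5000):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent."""
    m = len(X)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(X, Y)]
        grad0 = sum(errors) / m                            # (1/m) * sum(h(x_i) - y_i)
        grad1 = sum(e * x for e, x in zip(errors, X)) / m  # (1/m) * sum((h(x_i) - y_i) * x_i)
        # Simultaneous update of both parameters
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

# Example on the toy training set used earlier in these notes
X = [0, 1, 2, 3]
Y = [4, 7, 7, 8]
print(batch_gradient_descent(X, Y))  # converges to roughly theta0 ≈ 4.7, theta1 ≈ 1.2
```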

So, this is simply gradient descent on the original cost function J. This method looks at every example in the entire training set on every step, and is called batch gradient descent. Note that, while gradient descent can be susceptible to local minima in general, the optimization problem we have posed here for linear regression has only one global, and no other local, optima; thus gradient descent always converges (assuming the learning rate α is not too large) to the global minimum. Indeed, J is a convex quadratic function. Here is an example of gradient descent as it is run to minimize a quadratic function.

The ellipses shown above are the contours of a quadratic function. Also shown is the trajectory taken by gradient descent, which was initialized at (48,30). The x’s in the figure (joined by straight lines) mark the successive values of θ that gradient descent went through as it converged to its minimum.

 

Answer: 4
Answer: 0.5
11
Answer: 3, 4
Answer: 1 only

Gradient Descent for Linear Regression: visual worked example

Some may find the following video (https://www.youtube.com/watch?v=WnqQrPNYz5Q) useful as it visualizes the improvement of the hypothesis as the error function reduces.

 
