Machine Learning, Deep Learning / Andrew Ng Machine Learning Coursera Lecture Notes

Week 4 Lecture ML : Neural Network

mcdn 2020. 8. 10. 17:11
Non-linear hypothesis

So why do we need yet another learning algorithm? Consider a supervised learning classification problem where you have a training set like this. If you want to apply logistic regression to this problem, one thing you could do is apply logistic regression with a lot of nonlinear features, so that you get a hypothesis that separates the positive and negative examples. This particular method works well when you have only, say, two features, x1 and x2, because you can then include all those polynomial terms of x1 and x2. But many interesting machine learning problems have a lot more features than just two.
And as we saw, we can come up with quite a lot of features, maybe a hundred different features of different houses. For a problem like this, if you were to include all the quadratic terms, that is, all the second-order polynomial terms, there would be a lot of them. There would be terms like x1 squared, and so on. So including all the quadratic features doesn't seem like a good idea, because that is a lot of features and you might end up overfitting the training set, and it can also be computationally expensive.
Many people wonder why computer vision could be difficult. I mean, when you and I look at this picture it is so obvious what this is. You wonder how it is that a learning algorithm could possibly fail to know what this picture is.
Concretely, when we use machine learning to build a car detector, what we do is we come up with a labeled training set, with, let's say, a few labeled examples of cars and a few labeled examples of things that are not cars. Then we give our training set to the learning algorithm to train a classifier, and then, you know, we may test it by showing it a new image and asking, "What is this new thing?" And hopefully it will recognize that that is a car.
Let's pick a couple of pixel locations in our images, so that's the pixel one location and the pixel two location, and let's plot this car at a certain point, depending on the intensities of pixel one and pixel two. And let's do this with a few other images. So let's take a different example of the car and, you know, look at the same two pixel locations.

 

 

With 50 × 50 pixel images we have 2500 pixels, so the dimension of our feature vector will be n equals 2500, where our feature vector x is a list of all the pixel intensities, you know, the brightness of pixel one, the brightness of pixel two, and so on down to the brightness of the last pixel. In a typical computer representation, each of these may be a value between, say, 0 and 255 if it gives us the grayscale value. So we have n equals 2500, and that's if we were using grayscale images. If we were using RGB images with separate red, green and blue values, we would have n equals 7500.
So, if we were to try to learn a nonlinear hypothesis by including all the quadratic features, that is, all the terms of the form Xi times Xj, then with the 2500 pixels we would end up with a total of about three million features. And that's just too large to be reasonable; the computation would be very expensive to find and to represent all of these three million features per training example.
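The "three million" figure can be checked with a quick count. The sketch below assumes that "all the quadratic features" means every product Xi times Xj with i ≤ j (squares included), which gives n(n+1)/2 terms; the function name is made up for this illustration.

```python
def count_quadratic_features(n: int) -> int:
    """Count the distinct products x_i * x_j with i <= j (squares included)."""
    return n * (n + 1) // 2


# 100 housing features, 2500 grayscale pixels, 7500 RGB values
for n in (100, 2500, 7500):
    print(n, count_quadratic_features(n))
# 100 5050
# 2500 3126250
# 7500 28128750
```

So 2500 grayscale pixels already give a bit over three million quadratic terms, and the RGB case gives roughly nine times as many.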

 

This week is short :D

Neural Network

But more recently, Neural Networks have had a major resurgence. One of the reasons for this resurgence is that Neural Networks are a computationally somewhat more expensive algorithm, and so it was only, you know, maybe somewhat more recently that computers became fast enough to really run large scale Neural Networks. Because of that, as well as a few other technical reasons which we'll talk about later, modern Neural Networks today are the state of the art technique for many applications.
This is just a hypothesis, but let me share with you some of the evidence for this. This part of the brain, that little red part of the brain, is your auditory cortex, and the way you're understanding my voice now is that your ear is taking the sound signal and routing the sound signal to your auditory cortex, and that's what's allowing you to understand my words.
Neuroscientists have done the following fascinating experiments, where you cut the wire from the ears to the auditory cortex and you re-wire, in this case an animal's brain, so that the signal from the eyes to the optic nerve eventually gets routed to the auditory cortex. If you do this, it turns out the auditory cortex will learn to see. And this is in every single sense of the word "see" as we know it. So, if you do this to the animals, the animals can perform visual discrimination tasks: they can look at images and make appropriate decisions based on the images, and they're doing it with that piece of brain tissue. This and other similar experiments are called neuro-rewiring experiments.

 

 

On the upper left is an example of learning to see with your tongue. This is actually a system called BrainPort, undergoing FDA trials now to help blind people see. The way it works is, you strap a grayscale camera to your forehead, facing forward, that takes a low resolution grayscale image of what's in front of you, and you then run a wire to an array of electrodes that you place on your tongue, so that each pixel gets mapped to a location on your tongue, where maybe a high voltage corresponds to a dark pixel and a low voltage corresponds to a bright pixel. Even as it stands today, with this sort of system you and I would be able to learn to see, you know, in tens of minutes with our tongues. Here's a second example, of human echolocation or human sonar. And, as a somewhat bizarre example, if you plug a third eye into a frog, the frog will learn to use that eye as well.

Neural Network Model Representation

 

The neuron has a number of input wires, and these are called the dendrites. You can think of them as input wires, and these receive inputs from other locations. And a neuron also has an output wire called an axon, and this output wire is what it uses to send signals, so to send messages, to other neurons. So, at a simplistic level, a neuron is a computational unit that gets a number of inputs through its input wires, does some computation, and then sends outputs via its axon to other nodes or to other neurons in the brain.

 

So here is one neuron, and if it wants to send a message, what it does is send a little pulse of electricity via its axon to some different neuron. Here, this axon, that is this output wire, connects to the dendrites of this second neuron over here, which then accepts this incoming message and does some computation. And it may, in turn, decide to send out its own message on its axon to other neurons, and this is the process by which all human thought happens.
In an artificial neural network that we implement on the computer, we're going to use a very simple model of what a neuron does: we're going to model a neuron as just a logistic unit. So, when I draw a yellow circle like that, you should think of that as playing a role analogous to, maybe, the body of a neuron, and we then feed the neuron a few inputs through its dendrites or input wires. And whenever I draw a diagram like this, what this means is that it represents a computation of h of x equals one over one plus e to the negative theta transpose x, where, as usual, x and theta are our feature and parameter vectors, like so.
This x0 is sometimes called the bias unit or the bias neuron, but because x0 is always equal to 1, sometimes I draw this and sometimes I won't, just depending on whatever is more notationally convenient for that example.
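A single logistic unit like the one described above can be sketched in a few lines of Python. This is only an illustration: the weights in `theta` and the inputs in `x` are made-up values, and the function names are not from the course.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def logistic_unit(theta, x):
    """Compute h_theta(x) = g(theta^T x).

    `x` holds x1..xn only; the bias unit x0 = 1 is prepended here.
    """
    inputs = [1.0] + list(x)
    z = sum(t * v for t, v in zip(theta, inputs))
    return sigmoid(z)


theta = [-1.0, 2.0, 0.5, -0.5]  # hypothetical weights: theta0 (bias), theta1..theta3
x = [1.0, 2.0, 3.0]             # hypothetical inputs x1, x2, x3
print(logistic_unit(theta, x))  # g(0.5), a value between 0 and 1
```

Prepending x0 = 1 inside the function mirrors the way the bias unit is sometimes drawn in the diagram and sometimes left implicit.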

 

What a neural network is, is just a group of these different neurons strung together. Concretely, here we have input units x1, x2, x3, and once again, sometimes you can draw this extra node x0 and sometimes not; I just drew that in here. And here we have three neurons, which I have written a(2)1, a(2)2, a(2)3. I'll talk about those indices later.
And then layer 2 in between, this is called the hidden layer. The term hidden layer isn't great terminology, but the intuition is that, you know, in supervised learning you get to see the inputs and you get to see the correct outputs, whereas the hidden layer holds values you don't get to observe in the training set. It's not x, and it's not y, and so we call those hidden. Later we'll see neural nets with more than one hidden layer, but in this example we have one input layer, Layer 1, one hidden layer, Layer 2, and one output layer, Layer 3. But basically, anything that isn't an input layer and isn't an output layer is called a hidden layer.
I'm going to use a superscript (j), subscript i to denote the activation of neuron i, or of unit i, in layer j. So concretely, a superscript (2) subscript 1, that's the activation of the first unit in layer two, in our hidden layer. And by activation I just mean the value that's computed by and output by a specific unit. In addition, our neural network is parameterized by these matrices, theta superscript (j), where theta (j) is going to be a matrix of weights controlling the function mapping from one layer, maybe the first layer, to the second layer, or from the second layer to the third layer.
So here are the computations that are represented by this diagram. This first hidden unit here has its value computed as follows: a(2)1 is equal to the sigmoid function, or the sigmoid activation function, also called the logistic activation function, applied to this sort of linear combination of these inputs. And then this second hidden unit has its activation value computed as the sigmoid of this. And similarly, this third hidden unit is computed by that formula. So here we have three input units and three hidden units, and Theta(1), which is the matrix of parameters governing our mapping from our three input units to our three hidden units, is going to be a 3 × 4 matrix.
To summarize, what we've done is shown how a picture like this over here defines an artificial neural network, which defines a function h that maps from x's input values to, hopefully, some space of predictions y. And these hypotheses are parameterized by parameters that I'm denoting with a capital Theta, so that as we vary Theta we get different hypotheses, that is, different functions mapping, say, from x to y.

Model Representation I

Let's examine how we will represent a hypothesis function using neural networks. At a very simple level, neurons are basically computational units that take inputs (dendrites) as electrical inputs (called "spikes") that are channeled to outputs (axons). In our model, our dendrites are like the input features x_1\cdots x_n, and the output is the result of our hypothesis function. In this model our x_0 input node is sometimes called the "bias unit." It is always equal to 1. In neural networks, we use the same logistic function as in classification, \frac{1}{1 + e^{-\theta^Tx}}, yet we sometimes call it a sigmoid (logistic) activation function. In this situation, our "theta" parameters are sometimes called "weights".

Visually, a simplistic representation looks like:

\begin{bmatrix}x_0 \\ x_1 \\ x_2\end{bmatrix} \rightarrow \begin{bmatrix}\ \ \ \end{bmatrix} \rightarrow h_\theta(x)

Our input nodes (layer 1), also known as the "input layer", go into another node (layer 2), which finally outputs the hypothesis function, known as the "output layer".

We can have intermediate layers of nodes between the input and output layers called the "hidden layers."

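In this example, we label these intermediate or "hidden" layer nodes a_0^{(2)} \cdots a_n^{(2)} and call them "activation units."

a_i^{(j)} = \text{"activation" of unit } i \text{ in layer } j
\Theta^{(j)} = \text{matrix of weights controlling function mapping from layer } j \text{ to layer } j+1

If we had one hidden layer, it would look like:

\begin{bmatrix}x_0 \\ x_1 \\ x_2 \\ x_3\end{bmatrix} \rightarrow \begin{bmatrix}a_1^{(2)} \\ a_2^{(2)} \\ a_3^{(2)}\end{bmatrix} \rightarrow h_\theta(x)

The values for each of the "activation" nodes are obtained as follows:

a_1^{(2)} = g(\Theta_{10}^{(1)}x_0 + \Theta_{11}^{(1)}x_1 + \Theta_{12}^{(1)}x_2 + \Theta_{13}^{(1)}x_3)
a_2^{(2)} = g(\Theta_{20}^{(1)}x_0 + \Theta_{21}^{(1)}x_1 + \Theta_{22}^{(1)}x_2 + \Theta_{23}^{(1)}x_3)
a_3^{(2)} = g(\Theta_{30}^{(1)}x_0 + \Theta_{31}^{(1)}x_1 + \Theta_{32}^{(1)}x_2 + \Theta_{33}^{(1)}x_3)
h_\Theta(x) = a_1^{(3)} = g(\Theta_{10}^{(2)}a_0^{(2)} + \Theta_{11}^{(2)}a_1^{(2)} + \Theta_{12}^{(2)}a_2^{(2)} + \Theta_{13}^{(2)}a_3^{(2)})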

This is saying that we compute our activation nodes by using a 3×4 matrix of parameters. We apply each row of the parameters to our inputs to obtain the value for one activation node. Our hypothesis output is the logistic function applied to the sum of the values of our activation nodes, which have been multiplied by yet another parameter matrix \Theta^{(2)} containing the weights for our second layer of nodes.
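The row-by-row computation described above can be written out directly. The following is a minimal sketch for the 3-input, 3-hidden-unit, 1-output network; the numbers in `theta1` and `theta2` are placeholder weights invented for the illustration, and `hypothesis` is a hypothetical name.

```python
import math


def g(z: float) -> float:
    """Sigmoid (logistic) activation function."""
    return 1.0 / (1.0 + math.exp(-z))


def hypothesis(theta1, theta2, x):
    """Compute h_Theta(x) for one hidden layer, one unit at a time.

    theta1: 3 rows of 4 weights (3 x 4), maps layer 1 to layer 2.
    theta2: 1 row of 4 weights (1 x 4), maps layer 2 to layer 3.
    """
    inputs = [1.0] + list(x)  # x0 = 1 is the bias unit
    # Each row of theta1 produces one hidden activation a_i^(2).
    hidden = [g(sum(w * v for w, v in zip(row, inputs))) for row in theta1]
    hidden = [1.0] + hidden   # a_0^(2) = 1 is the hidden-layer bias unit
    # The single row of theta2 produces the output a_1^(3).
    return g(sum(w * v for w, v in zip(theta2[0], hidden)))


# Placeholder weights, not taken from the course.
theta1 = [
    [0.1, 0.2, -0.3, 0.4],
    [-0.5, 0.1, 0.2, 0.0],
    [0.3, -0.2, 0.1, 0.1],
]
theta2 = [[0.2, -0.4, 0.6, 0.1]]
print(hypothesis(theta1, theta2, [1.0, 0.0, 1.0]))
```

With all weights set to zero every unit outputs g(0) = 0.5, which is a handy sanity check for an implementation like this.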

Each layer gets its own matrix of weights, \Theta^{(j)}.

The dimensions of these matrices of weights are determined as follows:

If the network has s_j units in layer j and s_{j+1} units in layer j+1, then \Theta^{(j)} will be of dimension s_{j+1} \times (s_j + 1).

The +1 comes from the addition in \Theta^{(j)} of the "bias nodes," x_0 and \Theta_0^{(j)}. In other words the output nodes will not include the bias nodes while the inputs will. The following image summarizes our model representation:

Example: If layer 1 has 2 input nodes and layer 2 has 4 activation nodes, the dimension of \Theta^{(1)} is going to be 4×3, where s_j = 2 and s_{j+1} = 4, so s_{j+1} \times (s_j + 1) = 4 \times 3.
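The dimension rule is easy to turn into a small helper. In this sketch `layer_sizes` lists the number of units per layer without the bias units; the function name is an assumption made for the illustration.

```python
def theta_shapes(layer_sizes):
    """Return the (rows, columns) of each Theta^(j) for the given layer sizes.

    Theta^(j) has s_{j+1} rows and s_j + 1 columns; the +1 is the bias unit.
    """
    return [
        (layer_sizes[j + 1], layer_sizes[j] + 1)
        for j in range(len(layer_sizes) - 1)
    ]


print(theta_shapes([2, 4]))     # [(4, 3)]
print(theta_shapes([3, 3, 1]))  # [(3, 4), (1, 4)]
```

The second call corresponds to the network with three inputs, three hidden units and one output, whose first weight matrix is 3×4.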

 

 

 

Neural Network Model Representation II

You may notice that this block of numbers corresponds suspiciously to a matrix-vector operation, the matrix-vector multiplication of Theta(1) times the vector x. Using this observation, we're going to be able to vectorize this computation of the neural network. Concretely, let's define the feature vector x as usual to be the vector of x0, x1, x2, x3, where x0 as usual is always equal to 1, and let's define z(2) to be the vector of these z-values, you know, of z(2)1, z(2)2, z(2)3.

 

What we're going to do is add an extra a0 superscript (2) that's equal to one, and after taking this step we now have that a(2) is going to be a four dimensional feature vector, because we just added this extra, you know, a0, which is equal to 1, corresponding to the bias unit in the hidden layer. And finally, to compute the actual output value of our hypothesis, we then simply need to compute z(3). So z(3) is equal to this term here that I'm just underlining. This inner term there is z(3). And z(3) is Theta(2) times a(2), and finally my hypothesis output h of x, which is a(3), that is the activation of my one and only unit in the output layer. So, that's just a real number. You can write it as a(3) or as a(3)1, and that's g of z(3). This process of computing h of x is also called forward propagation, and it is called that because we start off with the activations of the input units, and then we sort of forward-propagate that to the hidden layer and compute the activations of the hidden layer, and then we sort of forward-propagate that and compute the activations of the output layer.

 

Let's say I cover up the left part of this picture for now. If you look at what's left in this picture, this looks a lot like logistic regression, where what we're doing is we're using that node, that's just the logistic regression unit, and we're using that to make a prediction h of x.
Just to say that again, what this neural network is doing is just like logistic regression, except that rather than using the original features x1, x2, x3, it is using these new features a1, a2, a3. Again, we'll put the superscripts there, you know, to be consistent with the notation. And the cool thing about this is that the features a1, a2, a3 are themselves learned as functions of the input.
This is an example of a different neural network architecture, and once again you may be able to get this intuition of how the second layer, here we have three hidden units, computes some complex function of the input layer, and then the third layer can take the second layer's features and compute even more complex features in layer three, so that by the time you get to the output layer, layer four, you can have even more complex features than what you are able to compute in layer three, and so get very interesting nonlinear hypotheses.

Model Representation II

To re-iterate, the following is an example of a neural network:

 

In this section we'll do a vectorized implementation of the above functions. We're going to define a new variable z_k^{(j)} that encompasses the parameters inside our g function. In our previous example, if we replaced all the parameters by the variable z, we would get:

a_1^{(2)} = g(z_1^{(2)})
a_2^{(2)} = g(z_2^{(2)})
a_3^{(2)} = g(z_3^{(2)})

In other words, for layer j=2 and node k, the variable z will be:

z_k^{(2)} = \Theta_{k,0}^{(1)}x_0 + \Theta_{k,1}^{(1)}x_1 + \cdots + \Theta_{k,n}^{(1)}x_n

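The vector representation of x and z^{(j)} is:

x = \begin{bmatrix}x_0 \\ x_1 \\ \vdots \\ x_n\end{bmatrix} \qquad z^{(j)} = \begin{bmatrix}z_1^{(j)} \\ z_2^{(j)} \\ \vdots \\ z_n^{(j)}\end{bmatrix}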

Setting x = a^{(1)}, we can rewrite the equation as:

z^{(j)} = \Theta^{(j-1)}a^{(j-1)}

We are multiplying our matrix \Theta^{(j-1)} with dimensions s_j\times (n+1) (where s_j is the number of our activation nodes) by our vector a^{(j-1)} with height (n+1). This gives us our vector z^{(j)} with height s_j. Now we can get a vector of our activation nodes for layer j as follows:

a^{(j)} = g(z^{(j)})

Where our function g can be applied element-wise to our vector z^{(j)}.

We can then add a bias unit (equal to 1) to layer j after we have computed a^{(j)}. This will be element a_0^{(j)} and will be equal to 1. To compute our final hypothesis, let's first compute another z vector:

z^{(j+1)} = \Theta^{(j)}a^{(j)}

We get this final z vector by multiplying the next theta matrix after \Theta^{(j-1)} with the values of all the activation nodes we just got. This last theta matrix \Theta^{(j)} will have only one row which is multiplied by one column a^{(j)} so that our result is a single number. We then get our final result with:

h_\Theta(x) = a^{(j+1)} = g(z^{(j+1)})

Notice that in this last step, between layer j and layer j+1, we are doing exactly the same thing as we did in logistic regression. Adding all these intermediate layers in neural networks allows us to more elegantly produce interesting and more complex non-linear hypotheses.
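The vectorized steps above (add the bias unit, compute z = Θa, apply g element-wise, repeat) can be sketched as a loop over the weight matrices. This version uses plain Python lists so that it runs without extra libraries; `matvec` and `forward_propagate` are names chosen for the illustration, and the weights are placeholders.

```python
import math


def g(z):
    """Apply the sigmoid function element-wise to a vector z."""
    return [1.0 / (1.0 + math.exp(-v)) for v in z]


def matvec(theta, a):
    """Matrix-vector product Theta * a, with Theta given as a list of rows."""
    return [sum(w * v for w, v in zip(row, a)) for row in theta]


def forward_propagate(thetas, x):
    """Return the output-layer activations for the input x.

    thetas: list of weight matrices [Theta^(1), Theta^(2), ...].
    x: the features x1..xn, without the bias unit.
    """
    a = list(x)                # a^(1) = x
    for theta in thetas:
        a = [1.0] + a          # add the bias unit a_0 = 1
        z = matvec(theta, a)   # z^(j+1) = Theta^(j) a^(j)
        a = g(z)               # a^(j+1) = g(z^(j+1))
    return a


# Placeholder weights for a 3-3-1 network (shapes 3 x 4 and 1 x 4).
thetas = [
    [[0.1, 0.2, -0.3, 0.4], [-0.5, 0.1, 0.2, 0.0], [0.3, -0.2, 0.1, 0.1]],
    [[0.2, -0.4, 0.6, 0.1]],
]
print(forward_propagate(thetas, [1.0, 0.0, 1.0]))  # a list with one value
```

Because the loop does not care how many matrices are in `thetas`, the same function handles networks with more than one hidden layer.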

 

 

Neural Network examples and Intuitions

This means NOT (x1 XOR x2), and so we're going to have positive examples, where y equals 1, if either both are true or both are false. And we're going to have y equals 0 if only one of them is true, and we're going to figure out if we can get a neural network to fit this sort of training set.
And if you look in this column, this is exactly the logical AND function. So, this is computing h of x approximately equal to x1 AND x2. In other words, it outputs one if and only if x1 and x2 are both equal to 1. So, by writing out our little truth table like this, we managed to figure out what logical function our neural network computes.
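The truth table for AND can be reproduced with a single logistic unit. The weights are not written out in these notes; the sketch below assumes the values used in the course lecture (-30 for the bias, 20 and 20 for x1 and x2), and the function names are made up.

```python
import math


def g(z: float) -> float:
    """Sigmoid activation function."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(theta, x1, x2):
    """One logistic unit with a bias weight theta[0] and two input weights."""
    return g(theta[0] + theta[1] * x1 + theta[2] * x2)


AND_WEIGHTS = [-30.0, 20.0, 20.0]  # assumed weights, as in the course lecture

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, round(neuron(AND_WEIGHTS, x1, x2)))
# 0 0 0
# 0 1 0
# 1 0 0
# 1 1 1
```

Only when both inputs are 1 does z become positive (-30 + 20 + 20 = 10), so g(z) is close to 1 in that row and close to 0 in the others.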

So we have constructed one of the fundamental operations in computers by using a small neural network rather than using an actual AND gate. Neural networks can also be used to simulate all the other logical gates. The following is an example of the logical operator 'OR', meaning either x_1 is true or x_2 is true, or both:

You find that's g of minus 10, which is approximately 0, g of 10, which is approximately 1, and so on, and these are approximately 1 and approximately 1, and these numbers are essentially the logical OR function. So, hopefully with this you now understand how single neurons in a neural network can be used to compute logical functions like AND and OR and so on.
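The g(-10) and g(10) values mentioned above can be listed explicitly. This sketch assumes the OR weights from the course lecture (-10 for the bias, 20 and 20 for the inputs), which are not written out in these notes; `or_unit` is a hypothetical helper that returns both z and g(z).

```python
import math


def g(z: float) -> float:
    """Sigmoid activation function."""
    return 1.0 / (1.0 + math.exp(-z))


OR_WEIGHTS = [-10.0, 20.0, 20.0]  # assumed weights, as in the course lecture


def or_unit(x1, x2):
    """Return (z, g(z)) for a single logistic unit with the OR weights."""
    z = OR_WEIGHTS[0] + OR_WEIGHTS[1] * x1 + OR_WEIGHTS[2] * x2
    return z, g(z)


for x1 in (0, 1):
    for x2 in (0, 1):
        z, h = or_unit(x1, x2)
        print(x1, x2, z, round(h))
# 0 0 -10.0 0
# 0 1 10.0 1
# 1 0 10.0 1
# 1 1 30.0 1
```

Compared with AND, only the bias weight changes: a bias of -10 lets a single active input push z above zero.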


Neural Network examples and Intuitions II

In the last video we saw how a Neural Network can be used to compute the function x1 AND x2 and the function x1 OR x2 when x1 and x2 are binary, that is, when they take on the values 0 and 1. We can also have a network to compute negation, that is, to compute the function NOT x1. Let me just write down the weights associated with this network.
Now consider the function (NOT x1) AND (NOT x2). It is equal to 1 if and only if x1 equals x2 equals 0. Since this is a logical function, NOT x1 means x1 must be 0, and NOT x2 means x2 must be equal to 0 as well. So this logical function is equal to 1 if and only if both x1 and x2 are equal to 0, and hopefully you should be able to figure out how to make a small neural network to compute this logical function as well.
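Putting the pieces together, the units for AND, (NOT x1) AND (NOT x2), and OR can be combined into a network with one hidden layer that computes x1 XNOR x2. The weights below are assumed from the course lecture and are not written out in these notes; `unit` and `xnor` are names made up for this sketch.

```python
import math


def g(z: float) -> float:
    """Sigmoid activation function."""
    return 1.0 / (1.0 + math.exp(-z))


def unit(theta, inputs):
    """One logistic unit: theta[0] is the bias weight, theta[1:] the input weights."""
    return g(theta[0] + sum(w * v for w, v in zip(theta[1:], inputs)))


# Assumed weights, as in the course lecture.
NOT = [10.0, -20.0]
AND = [-30.0, 20.0, 20.0]
NOR = [10.0, -20.0, -20.0]  # (NOT x1) AND (NOT x2)
OR = [-10.0, 20.0, 20.0]


def xnor(x1, x2):
    """Hidden layer computes AND and NOR; the output unit ORs them together."""
    a1 = unit(AND, [x1, x2])
    a2 = unit(NOR, [x1, x2])
    return unit(OR, [a1, a2])


print([round(unit(NOT, [x1])) for x1 in (0, 1)])               # [1, 0]
print([round(xnor(a, b)) for a in (0, 1) for b in (0, 1)])     # [1, 0, 0, 1]
```

The output is 1 exactly when both inputs are 1 or both are 0, which no single logistic unit on x1 and x2 can do; the hidden layer is what makes it possible.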
In the video that I'll show you, this area here is the input area that shows the handwritten character shown to the network. This column here shows a visualization of the features computed by the first hidden layer of the network, so this visualization shows the different features, different edges and lines and so on, that are detected. This is a visualization of the next hidden layer. And shown over here is the final answer, the final predicted value for what handwritten digit the neural network thinks it is being shown. So let's take a look at the video.
So I hope you enjoyed the video, and that this hopefully gave you some intuition about the sorts of pretty complicated functions neural networks can learn. The network takes this image as its input, just the raw pixels, and the first hidden layer computes some set of features. The next hidden layer computes even more complex features, and then even more complex features. And these features can then be used by essentially the final layer of logistic classifiers to make accurate predictions about what numbers the network sees.

 

 

Neural Network examples and Intuitions

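Need to correct my answer: 3, 4!! The XOR function can't be done with two
1
Answer: 1. z = theta1 * x, a22 = sigmoid(z). Wow.. When multiplying by a vector, as in theta1 * z, the middle numbers must match, like (i × j) and (j × 1). So the order must be theta1 * x, and the x * theta1 option below it does not work!

Important!!!!! Number 4 is important!!!!!

Answer: stay the same, 1
Hahaha, got 1, 4 and 5 wrong, haha
Heh heh..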

 

 
