Machine Learning, Deep Learning / Andrew Ng Machine Learning Coursera Lecture Notes

Week 3 Lecture ML : Classification and Representation

mcdn 2020. 8. 7. 13:38

we'll talk about multi-class problems as well, where y may take on four values: zero, one, two, and three. This is called a multiclass classification problem. But for the next few videos, let's start with the two-class, or binary, classification problem, and we'll worry about the multiclass setting later.

 

But once we've added that extra example over here, if you now run linear regression, you instead get a straight line fit to the data. That might maybe look like this. And if you now threshold the hypothesis at 0.5, you end up with a threshold that's around here, so that everything to the right of this point you predict as positive and everything to the left of that point you predict as negative. So, applying linear regression to a classification problem often isn't a great idea. In the first example, before I added this extra training example, linear regression was just getting lucky and it got us a hypothesis that worked well for that particular example, but usually when applying linear regression to a data set you might get lucky, but often it isn't a good idea. So I wouldn't use linear regression for classification problems.
Here's one other funny thing about what would happen if we were to use linear regression for a classification problem. For classification we know that y is either zero or one. But if you are using linear regression, the hypothesis can output values that are much larger than one or less than zero, even if all of your training examples have labels y equals zero or one. And it seems kind of strange that even though we know the labels should be zero or one, the algorithm can output values much larger than one or much smaller than zero. So what we'll do in the next few videos is develop an algorithm called logistic regression, which has the property that the output, the predictions of logistic regression, are always between zero and one, and doesn't become bigger than one or less than zero.

ML:Logistic Regression

Now we are switching from regression problems to classification problems. Don't be confused by the name "Logistic Regression"; it is named that way for historical reasons and is actually an approach to classification problems, not regression problems.

 

To attempt classification, one method is to use linear regression and map all predictions greater than 0.5 as a 1 and all less than 0.5 as a 0. However, this method doesn't work well because classification is not actually a linear function.

The classification problem is just like the regression problem, except that the values we now want to predict take on only a small number of discrete values. For now, we will focus on the binary classification problem in which y can take on only two values, 0 and 1. (Most of what we say here will also generalize to the multiple-class case.) For instance, if we are trying to build a spam classifier for email, then x^{(i)} may be some features of a piece of email, and y may be 1 if it is a piece of spam mail, and 0 otherwise. Hence, y∈{0,1}. 0 is also called the negative class, and 1 the positive class, and they are sometimes also denoted by the symbols “-” and “+.” Given x^{(i)}, the corresponding y^{(i)} is also called the label for the training example.

When we were using linear regression, this was the form of a hypothesis, where h(x) is theta transpose x. For logistic regression, I'm going to modify this a little bit and make the hypothesis g of theta transpose x, where I'm going to define the function g as follows. g(z), where z is a real number, is equal to one over one plus e to the negative z.
By the way, the terms sigmoid function and logistic function are basically synonyms and mean the same thing. So the two terms are basically interchangeable, and either term can be used to refer to this function g. And if we take these two equations and put them together, then here's just an alternative way of writing out the form of my hypothesis. I'm saying that h(x) is 1 over 1 plus e to the negative theta transpose x. And all I've done is take this variable z, z here is a real number, and plug in theta transpose x. So I end up with theta transpose x in place of z there.
And because g(z) outputs values between zero and one, we also have that h(x) must be between zero and one. Finally, given this hypothesis representation, what we need to do, as before, is fit the parameters theta to our data. So given a training set we need to pick a value for the parameters theta, and this hypothesis will then let us make predictions. We'll talk about a learning algorithm later for fitting the parameters theta, but first let's talk a bit about the interpretation of this model.

Supplementary note:

If we use linear regression on a classification problem the way we did on the earlier regression problems, then on data that splits only into 0 and 1 we not only get values outside 0 and 1, but, as in the example above, a single completely far-off example pulls the fit (the midpoint) too far to one side. So for classification problems we express h(x) using the sigmoid function, a function built from an exponential, and the result is the logistic regression model. It's a classification problem, but let's just accept the name logistic 'regression'.

Here's how I read this expression: this is the probability that y is equal to one, given x; given that my patient has features x, so given that my patient has a particular tumor size represented by my features x. And this probability is parameterized by theta. So I'm basically going to count on my hypothesis to give me estimates of the probability that y is equal to 1.
If this equation looks a little bit complicated, feel free to mentally imagine it without that x and theta. And this is just saying that the probability that y equals zero plus the probability that y equals one must be equal to one. And we know this to be true because y has to be either zero or one, and so the chance that y equals zero plus the chance that y equals one must add up to one.
So, you now know what the hypothesis representation is for logistic regression, and we've seen the mathematical formula defining the hypothesis for logistic regression. In the next video, I'd like to try to give you better intuition about what the hypothesis function looks like. And I want to tell you about something called the decision boundary. And we'll look at some visualizations together to try to get a better sense of what this hypothesis function of logistic regression really looks like.

Binary Classification

Instead of our output vector y being a continuous range of values, it will only be 0 or 1.

y∈{0,1}

Conventionally, 0 is taken as the "negative class" and 1 as the "positive class", but you are free to assign any representation to them.

We're only doing two classes for now, called a "Binary Classification Problem."


 

 

Hypothesis Representation

Our hypothesis should satisfy:

0 \leq h_\theta (x) \leq 1

Our new form uses the "Sigmoid Function," also called the "Logistic Function":

h_\theta(x) = g(\theta^T x)
z = \theta^T x
g(z) = \frac{1}{1 + e^{-z}}

The function g(z), shown here, maps any real number to the (0, 1) interval, making it useful for transforming an arbitrary-valued function into a function better suited for classification. Try playing with an interactive plot of the sigmoid function: https://www.desmos.com/calculator/bgontvxotm.

We start with our old hypothesis (linear regression), except that we want to restrict the range to 0 and 1. This is accomplished by plugging \theta^Tx into the Logistic Function.

h_\theta will give us the probability that our output is 1. For example, h_\theta(x)=0.7 gives us a probability of 70% that our output is 1.

h_\theta(x) = P(y=1 \mid x ; \theta) = 1 - P(y=0 \mid x ; \theta)
P(y=0 \mid x ; \theta) + P(y=1 \mid x ; \theta) = 1

Our probability that our prediction is 0 is just the complement of our probability that it is 1 (e.g. if probability that it is 1 is 70%, then the probability that it is 0 is 30%).
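As a quick illustration, here is a minimal Octave sketch of this hypothesis (the parameter and feature values below are made up purely for illustration):

% Sigmoid (logistic) function: maps any real number into (0, 1)
g = @(z) 1 ./ (1 + exp(-z));

theta = [-3; 1; 1];    % assumed parameters, for illustration only
x = [1; 2; 2.5];       % one example: [x0; x1; x2] with x0 = 1

h = g(theta' * x)      % h_theta(x) = P(y = 1 | x; theta), about 0.82
p_y0 = 1 - h           % complement: P(y = 0 | x; theta), about 0.18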

 

What I want to do is understand better when it is exactly that h of x will be greater than or equal to 0.5, so that we'll end up predicting y is equal to 1. If we look at this plot of the sigmoid function, we'll notice that the sigmoid function, g of z, is greater than or equal to 0.5 whenever z is greater than or equal to zero. So it's in this half of the figure that g takes on values that are 0.5 and higher. This notch here, that's 0.5, and so when z is positive, g of z, the sigmoid function, is greater than or equal to 0.5.
To summarize what we just worked out, we saw that if we decide to predict whether  y=1 or y=0 depending on whether the estimated probability is greater than or  equal to 0.5, or whether less than 0.5, then that's the same as saying that  when we predict y=1 whenever theta transpose x is greater than or equal to 0.  And we'll predict y is equal to 0 whenever theta transpose x is less than 0.  Let's use this to better understand  how the hypothesis of logistic regression makes those predictions. 
We haven't talked yet about how to fit the parameters of this model. We'll talk about that in the next video. But suppose that, via a procedure to be specified, we end up choosing the following values for the parameters. Let's say we choose theta 0 equals minus 3, theta 1 equals 1, theta 2 equals 1. So this means that my parameter vector is going to be theta equals minus 3, 1, 1.
so, the region where our hypothesis will predict y = 1, is this region,  just really this huge region, this half space over to the upper right.  And let me just write that down, I'm gonna call this the y = 1 region.  And, in contrast, the region where x1 + x2 is less than 3,  that's when we will predict that y is equal to 0.  And that corresponds to this region.  And there's really a half plane, but  that region on the left is the region where our hypothesis will predict y = 0.  I wanna give this line, this magenta line that I drew a name.  This line, there, is called the decision boundary.
Earlier when we were talking about polynomial regression, or when we were talking about linear regression, we talked about how we could add extra higher order polynomial terms to the features. And we can do the same for logistic regression. Concretely, let's say my hypothesis looks like this, where I've added two extra features, x1 squared and x2 squared, to my features. So that I now have five parameters, theta zero through theta four. What this means is that with this particular choice of parameters, my parameter vector theta looks like minus one, zero, zero, one, one.

 

Following our earlier discussion, this means that my hypothesis will predict that y=1 whenever -1 + x1 squared + x2 squared is greater than or equal to 0. This is whenever theta transpose x, my parameters times my features, is greater than or equal to zero. And if I take the minus 1 and just bring it to the right, I'm saying that my hypothesis will predict that y is equal to 1 whenever x1 squared plus x2 squared is greater than or equal to 1.
So what does this decision boundary look like? Well, if you were to plot the curve for x1 squared plus x2 squared equals 1, some of you will recognize that that is the equation for a circle of radius one, centered around the origin. So that is my decision boundary. And everything outside the circle, I'm going to predict as y=1. So out here is my y equals 1 region; we'll predict y equals 1 out here, and inside the circle is where I'll predict y is equal to 0. So by adding these more complex, or these polynomial, terms to my features as well, I can get more complex decision boundaries that don't just try to separate the positive and negative examples with a straight line. I can get, in this example, a decision boundary that's a circle.

 

 

So can we come up with even more complex decision boundaries than this? If I have even higher order polynomial terms, so things like x1 squared, x1 squared x2, x1 squared x2 squared, and so on, and have much higher order polynomials, then it's possible to show that you can get even more complex decision boundaries, and logistic regression can be used to find decision boundaries that may, for example, be an ellipse like that, or with a little bit different setting of the parameters, maybe you can get instead a different decision boundary which may even look like some funny shape like that.

Decision Boundary

In order to get our discrete 0 or 1 classification, we can translate the output of the hypothesis function as follows:

h_\theta(x) \geq 0.5 \rightarrow y = 1
h_\theta(x) < 0.5 \rightarrow y = 0

The way our logistic function g behaves is that when its input is greater than or equal to zero, its output is greater than or equal to 0.5:

g(z) \geq 0.5 \; \text{when} \; z \geq 0

Remember:

z = 0, \; e^{0} = 1 \Rightarrow g(z) = 1/2
z \to \infty, \; e^{-\infty} \to 0 \Rightarrow g(z) = 1
z \to -\infty, \; e^{\infty} \to \infty \Rightarrow g(z) = 0

So if our input to g is \theta^T X, then that means:

h_\theta(x) = g(\theta^T x) \geq 0.5 \; \text{when} \; \theta^T x \geq 0

From these statements we can now say:

\theta^T x \geq 0 \Rightarrow y = 1
\theta^T x < 0 \Rightarrow y = 0

The decision boundary is the line that separates the area where y = 0 and where y = 1. It is created by our hypothesis function.

Example:

\theta = \begin{bmatrix} 5 \\ -1 \\ 0 \end{bmatrix}
y = 1 \; \text{if} \; 5 + (-1) x_1 + 0 x_2 \geq 0
5 - x_1 \geq 0
-x_1 \geq -5
x_1 \leq 5

In this case, our decision boundary is a straight vertical line placed on the graph where x_1 = 5, and everything to the left of that denotes y = 1, while everything to the right denotes y = 0.

Again, the input to the sigmoid function g(z) (e.g. \theta^T X) doesn't need to be linear, and could be a function that describes a circle (e.g. z = \theta_0 + \theta_1 x_1^2 +\theta_2 x_2^2) or any shape to fit our data.
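To make this decision rule concrete, here is a small Octave sketch of the x_1 = 5 example above (the two test points are made up):

g = @(z) 1 ./ (1 + exp(-z));
theta = [5; -1; 0];              % parameters from the example above

% Predict y = 1 exactly when theta' * x >= 0, i.e. when x1 <= 5
predict = @(x) g(theta' * x) >= 0.5;

predict([1; 3; 7])   % x1 = 3 <= 5, so this returns 1
predict([1; 8; 2])   % x1 = 8 >  5, so this returns 0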

Here's the supervised learning problem of fitting a logistic regression model. We have a training set of m training examples and, as usual, each of our examples is represented by a feature vector that's n plus one dimensional, and, as usual, we have x0 equals one. The first feature, or the zeroth feature, is always equal to one. And because this is a classification problem, our training set has the property that every label y is either 0 or 1. This is the hypothesis, and the parameters of the hypothesis are this theta over here. And the question that I want to talk about is: given this training set, how do we choose, or how do we fit, the parameters theta?
And to simplify this equation a little bit more, it's going to be convenient to get rid of those superscripts. So just define the cost of h of x comma y to be equal to one half of this squared error. The interpretation of this cost function is that this is the cost I want my learning algorithm to have to pay if it outputs that value, if its prediction is h of x, and the actual label was y. So just cross off the superscripts; and no surprise, for linear regression the cost we've defined is one-half times the squared difference between what I predicted and the actual value y. Now, this cost function worked fine for linear regression. But here, we're interested in logistic regression. If we could minimize this cost function plugged into J here, that would work okay. But it turns out that if we use this particular cost function, this would be a non-convex function of the parameters theta.
And so we're just left with, you know, this part of the curve, and  that's what this curve on the left looks like. 
But if h(x) = 1, then the cost is down here, is equal to 0. And that's where we'd like it to be, because if we correctly predict the output y, then the cost is 0. But now notice also that as h(x) approaches 0, so as the output of the hypothesis approaches 0, the cost blows up and it goes to infinity. And what this does is capture the intuition that if the hypothesis outputs 0, that's like saying the chance of y equals 1 is equal to 0. It's kind of like going to our medical patient and saying the probability that you have a malignant tumor, the probability that y=1, is zero. So, it's like saying it's absolutely impossible that your tumor is malignant.
So if you plot the cost function for the case of y equals 0, you find that it looks like this. And what this curve does is it now goes up, and it goes to plus infinity as h of x goes to 1, because, as I was saying, if y turns out to be equal to 0, but we predicted that y is equal to 1 with almost certainty, probability 1, then we end up paying a very large cost. And conversely, if h of x is equal to 0 and y equals 0, then the hypothesis nailed it. The predicted y is equal to 0, and it turns out y is equal to 0, so at this point the cost function is going to be 0.

Cost Function

We cannot use the same cost function that we use for linear regression because the Logistic Function will cause the output to be wavy, causing many local optima. In other words, it will not be a convex function.

Instead, our cost function for logistic regression looks like:

J(\theta) = \frac{1}{m} \sum_{i=1}^m \mathrm{Cost}(h_\theta(x^{(i)}), y^{(i)})
\mathrm{Cost}(h_\theta(x), y) = -\log(h_\theta(x)) \quad \text{if } y = 1
\mathrm{Cost}(h_\theta(x), y) = -\log(1 - h_\theta(x)) \quad \text{if } y = 0

When y = 1, we get the following plot for J(\theta) vs h_\theta (x):

Similarly, when y = 0, we get the following plot for J(\theta) vs h_\theta (x):

The more our hypothesis is off from y, the larger the cost function output. If our hypothesis is equal to y, then our cost is 0:

\mathrm{Cost}(h_\theta(x), y) = 0 \; \text{if} \; h_\theta(x) = y
\mathrm{Cost}(h_\theta(x), y) \to \infty \; \text{if} \; y = 0 \; \text{and} \; h_\theta(x) \to 1
\mathrm{Cost}(h_\theta(x), y) \to \infty \; \text{if} \; y = 1 \; \text{and} \; h_\theta(x) \to 0

If our correct answer 'y' is 0, then the cost function will be 0 if our hypothesis function also outputs 0. If our hypothesis approaches 1, then the cost function will approach infinity.

If our correct answer 'y' is 1, then the cost function will be 0 if our hypothesis function outputs 1. If our hypothesis approaches 0, then the cost function will approach infinity.
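A minimal numeric check of these two cases in Octave (the hypothesis value is assumed for illustration):

h = 0.99;                 % hypothesis very confident that y = 1
cost_if_y1 = -log(h)      % about 0.01: tiny cost, prediction matches y = 1
cost_if_y0 = -log(1 - h)  % about 4.6: large cost for a confident wrong answer
% As h -> 1 with y = 0 (or h -> 0 with y = 1), the cost grows without bound.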

Note that writing the cost function in this way guarantees that J(θ) is convex for logistic regression.

Our overall cost function is 1 over m times the sum over the training set of the cost of making different predictions on the different examples with labels y(i). And this is the cost of a single example that we worked out earlier. And I just want to remind you that for classification problems in our training sets, and in fact even for examples not in the training set, y is always equal to zero or one, right? That's sort of part of the mathematical definition of y.
  We say that cost of H(x), y.  I'm gonna write this as -y  times log h(x)- (1-y)  times log (1-h(x)).  And I'll show you in a second that this expression, no, this equation,  is an equivalent way, or more compact way,  of writing out this definition of the cost function that we have up here.  Let's see why that's the case.
If y is equal to 1, then this equation is saying that the cost is equal to... well, if y is equal to 1, then this thing here is equal to 1. And 1 minus y is going to be equal to 0, right? So if y is equal to 1, then 1 minus y is 1 minus 1, which is therefore 0. So the second term gets multiplied by 0 and goes away. And we're left with only this first term, which is minus y times log h(x).
what has changed is that the definition for this hypothesis has changed.  So as whereas for linear regression, we had h(x) equals theta transpose X,  now this definition of h(x) has changed.  And is instead now one over one plus e to the negative transpose x.  So even though the update rule looks cosmetically identical,  because the definition of the hypothesis has changed,  this is actually not the same thing as gradient descent for linear regression. 

Simplified Cost Function and Gradient Descent

We can compress our cost function's two conditional cases into one case:

\mathrm{Cost}(h_\theta(x),y) = - y \; \log(h_\theta(x)) - (1 - y) \log(1 - h_\theta(x))

Notice that when y is equal to 1, then the second term (1-y)\log(1-h_\theta(x)) will be zero and will not affect the result. If y is equal to 0, then the first term -y \log(h_\theta(x)) will be zero and will not affect the result.

We can fully write out our entire cost function as follows:

J(\theta) = - \frac{1}{m} \displaystyle \sum_{i=1}^m [y^{(i)}\log (h_\theta (x^{(i)})) + (1 - y^{(i)})\log (1 - h_\theta(x^{(i)}))]

A vectorized implementation is:

h = g(X\theta)
J(\theta) = \frac{1}{m} \left( -y^T \log(h) - (1 - y)^T \log(1 - h) \right)
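As a sketch, this vectorized cost translates almost directly into Octave (the tiny dataset below is assumed purely for illustration; X includes the x0 = 1 column):

X = [1 1; 1 2; 1 3];        % assumed toy design matrix, m = 3
y = [0; 0; 1];              % assumed labels
theta = [-4; 1.5];          % assumed parameters
m = length(y);

g = @(z) 1 ./ (1 + exp(-z));
h = g(X * theta);                                     % predictions, m x 1
J = (1 / m) * (-y' * log(h) - (1 - y)' * log(1 - h))  % scalar cost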

Gradient Descent

Remember that the general form of gradient descent is:

\text{Repeat} \; \{ \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \}

We can work out the derivative part using calculus to get:

\text{Repeat} \; \{ \theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} \}

Notice that this algorithm is identical to the one we used in linear regression. We still have to simultaneously update all values in theta.

A vectorized implementation is:

\theta := \theta - \frac{\alpha}{m} X^{T} (g(X \theta ) - \vec{y})
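A minimal sketch of the full loop under the same toy-data assumptions as before (alpha and the iteration count are arbitrary choices):

g = @(z) 1 ./ (1 + exp(-z));
X = [1 1; 1 2; 1 3];            % assumed toy design matrix with x0 = 1
y = [0; 0; 1];                  % assumed labels
m = length(y);

alpha = 0.1;                    % learning rate, arbitrary
theta = zeros(size(X, 2), 1);   % start at zero

for iter = 1:1000
  theta = theta - (alpha / m) * X' * (g(X * theta) - y);  % vectorized update
end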

Partial derivative of J(θ)

First, calculate the derivative of the sigmoid function (it will be useful when finding the partial derivative of J(θ)):

\sigma(x)' = \left( \frac{1}{1 + e^{-x}} \right)'
= \frac{-(1 + e^{-x})'}{(1 + e^{-x})^2}
= \frac{-1' - (e^{-x})'}{(1 + e^{-x})^2}
= \frac{0 - (-x)'(e^{-x})}{(1 + e^{-x})^2}
= \frac{-(-1)(e^{-x})}{(1 + e^{-x})^2}
= \frac{e^{-x}}{(1 + e^{-x})^2}
= \left( \frac{1}{1 + e^{-x}} \right) \left( \frac{e^{-x}}{1 + e^{-x}} \right)
= \sigma(x) \left( \frac{-1 + 1 + e^{-x}}{1 + e^{-x}} \right)
= \sigma(x) \left( \frac{1 + e^{-x}}{1 + e^{-x}} - \frac{1}{1 + e^{-x}} \right)
= \sigma(x)(1 - \sigma(x))
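A quick sanity check of this identity via finite differences in Octave (the test point is arbitrary):

sigma = @(x) 1 ./ (1 + exp(-x));
x = 0.7;                                   % arbitrary test point
analytic = sigma(x) * (1 - sigma(x));
d = 1e-6;
numeric = (sigma(x + d) - sigma(x - d)) / (2 * d);
abs(analytic - numeric)                    % on the order of 1e-12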

Now we are ready to find the resulting partial derivative:

\frac{\partial}{\partial \theta_j} J(\theta) = \frac{\partial}{\partial \theta_j} \frac{-1}{m} \sum_{i=1}^m \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]
= -\frac{1}{m} \sum_{i=1}^m \left[ y^{(i)} \frac{\partial}{\partial \theta_j} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \frac{\partial}{\partial \theta_j} \log(1 - h_\theta(x^{(i)})) \right]
= -\frac{1}{m} \sum_{i=1}^m \left[ \frac{y^{(i)} \frac{\partial}{\partial \theta_j} h_\theta(x^{(i)})}{h_\theta(x^{(i)})} + \frac{(1 - y^{(i)}) \frac{\partial}{\partial \theta_j} (1 - h_\theta(x^{(i)}))}{1 - h_\theta(x^{(i)})} \right]
= -\frac{1}{m} \sum_{i=1}^m \left[ \frac{y^{(i)} \frac{\partial}{\partial \theta_j} \sigma(\theta^T x^{(i)})}{h_\theta(x^{(i)})} + \frac{(1 - y^{(i)}) \frac{\partial}{\partial \theta_j} (1 - \sigma(\theta^T x^{(i)}))}{1 - h_\theta(x^{(i)})} \right]
= -\frac{1}{m} \sum_{i=1}^m \left[ \frac{y^{(i)} \sigma(\theta^T x^{(i)}) (1 - \sigma(\theta^T x^{(i)})) \frac{\partial}{\partial \theta_j} \theta^T x^{(i)}}{h_\theta(x^{(i)})} - \frac{(1 - y^{(i)}) \sigma(\theta^T x^{(i)}) (1 - \sigma(\theta^T x^{(i)})) \frac{\partial}{\partial \theta_j} \theta^T x^{(i)}}{1 - h_\theta(x^{(i)})} \right]
= -\frac{1}{m} \sum_{i=1}^m \left[ \frac{y^{(i)} h_\theta(x^{(i)}) (1 - h_\theta(x^{(i)})) x_j^{(i)}}{h_\theta(x^{(i)})} - \frac{(1 - y^{(i)}) h_\theta(x^{(i)}) (1 - h_\theta(x^{(i)})) x_j^{(i)}}{1 - h_\theta(x^{(i)})} \right]
= -\frac{1}{m} \sum_{i=1}^m \left[ y^{(i)} (1 - h_\theta(x^{(i)})) x_j^{(i)} - (1 - y^{(i)}) h_\theta(x^{(i)}) x_j^{(i)} \right]
= -\frac{1}{m} \sum_{i=1}^m \left[ y^{(i)} (1 - h_\theta(x^{(i)})) - (1 - y^{(i)}) h_\theta(x^{(i)}) \right] x_j^{(i)}
= -\frac{1}{m} \sum_{i=1}^m \left[ y^{(i)} - y^{(i)} h_\theta(x^{(i)}) - h_\theta(x^{(i)}) + y^{(i)} h_\theta(x^{(i)}) \right] x_j^{(i)}
= -\frac{1}{m} \sum_{i=1}^m \left[ y^{(i)} - h_\theta(x^{(i)}) \right] x_j^{(i)}
= \frac{1}{m} \sum_{i=1}^m \left[ h_\theta(x^{(i)}) - y^{(i)} \right] x_j^{(i)}

The vectorized version:

\nabla J(\theta) = \frac{1}{m} \cdot X^T \cdot \left(g\left(X\cdot\theta\right) - \vec{y}\right)

 Given code that  can do these two things, what  gradient descent does is it  repeatedly performs the following update.  Right?  So given the code that  we wrote to compute these partial  derivatives, gradient descent plugs  in here and uses that to update our parameters theta.
But let me just tell you about some of their properties. These three algorithms have a number of advantages. One is that, with any of these algorithms, you usually do not need to manually pick the learning rate alpha. So one way to think of these algorithms is that, given a way to compute the cost function and the derivatives, you can think of these algorithms as having a clever inner loop. And, in fact, they have a clever inner loop called a line search algorithm that automatically tries out different values for the learning rate alpha and automatically picks a good learning rate alpha, so that it can even pick a different learning rate for every iteration. And so then you don't need to choose it yourself.
The value that minimizes it is going to be theta 1 equals 5, theta 2 equals 5. Now, again, I know some of you know more calculus than others, but the derivatives of the cost function J turn out to be these two expressions. I've done the calculus. So if you want to apply one of the advanced optimization algorithms to minimize cost function J: say we didn't know the minimum was at 5, 5, but you wanted to find the minimum numerically using something like gradient descent, but preferably more advanced than gradient descent, what you would do is implement an Octave function like this. So we implement a cost function,
and the first return value, jVal, is how we would compute the cost function J. And so this says jVal equals, you know, theta one minus five squared plus theta two minus five squared. So it's just computing this cost function over here. And the second argument that this function returns is the gradient. So the gradient is going to be a two by one vector, and the two elements of the gradient vector correspond to the two partial derivative terms over here. Having implemented this cost function, you can then call the advanced optimization function called fminunc - it stands for function minimization unconstrained in Octave - and the way you call this is as follows.
You set a few options. This is an options data structure that stores the options you want. So 'GradObj', 'on' sets the gradient objective parameter to on. It just means you are indeed going to provide a gradient to this algorithm. I'm going to set the maximum number of iterations to, let's say, one hundred. We're going to give it an initial guess for theta; this is a 2 by 1 vector. And then this command calls fminunc. This @ symbol represents a pointer to the cost function.
The function value at the optimum is essentially 10 to the minus 30. So that's essentially zero, which is also what we're hoping for. And the exit flag is 1, and this shows the convergence status. If you want, you can do help fminunc to read the documentation for how to interpret the exit flag. But the exit flag lets you verify whether or not this algorithm has converged.

Advanced Optimization

"Conjugate gradient", "BFGS", and "L-BFGS" are more sophisticated, faster ways to optimize θ that can be used instead of gradient descent. A. Ng suggests not to write these more sophisticated algorithms yourself (unless you are an expert in numerical computing) but use the libraries instead, as they're already tested and highly optimized. Octave provides them.

We first need to provide a function that evaluates the following two functions for a given input value θ:

J(\theta)
\frac{\partial}{\partial \theta_j} J(\theta)

We can write a single function that returns both of these:


function [jVal, gradient] = costFunction(theta)
  jVal = [...code to compute J(theta)...];
  gradient = [...code to compute derivative of J(theta)...];
end

 

 

Then we can use octave's "fminunc()" optimization algorithm along with the "optimset()" function that creates an object containing the options we want to send to "fminunc()". (Note: the value for MaxIter should be an integer, not a character string - errata in the video at 7:30)


options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);

 

We give to the function "fminunc()" our cost function, our initial vector of theta values, and the "options" object that we created beforehand.
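For logistic regression specifically, the generic costFunction template above might be filled in like this. This is a sketch only: the toy X and y are assumed for illustration and would in practice come from your training set (save the function as costFunction.m):

function [jVal, gradient] = costFunction(theta)
  X = [1 1; 1 2; 1 3];    % assumed toy design matrix with x0 = 1
  y = [0; 0; 1];          % assumed labels
  m = length(y);

  h = 1 ./ (1 + exp(-X * theta));                         % sigmoid hypothesis
  jVal = (1/m) * (-y' * log(h) - (1 - y)' * log(1 - h));  % logistic cost
  gradient = (1/m) * X' * (h - y);                        % partial derivatives
end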

 

Multiclass Classification: One-vs-all

Now we will approach the classification of data into more than two categories. Instead of y = {0,1} we will expand our definition so that y = {0,1...n}.

In this case we divide our problem into n+1 (+1 because the index starts at 0) binary classification problems; in each one, we predict the probability that 'y' is a member of one of our classes.

We are basically choosing one class and then lumping all the others into a single second class. We do this repeatedly, applying binary logistic regression to each case, and then use the hypothesis that returned the highest value as our prediction.
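A minimal one-vs-all sketch in Octave, reusing the gradient descent update from earlier (the dataset, class count, alpha, and iteration count are all assumed for illustration; classes are labeled 1..3 and X already contains the x0 = 1 column):

X = [1 1 1; 1 2 1; 1 1 2; 1 3 3];   % assumed toy design matrix
y = [1; 2; 1; 3];                   % assumed class labels
[m, n] = size(X);
K = 3;                              % number of classes
alpha = 0.1;

g = @(z) 1 ./ (1 + exp(-z));
all_theta = zeros(K, n);            % row c holds theta for "class c vs rest"
for c = 1:K
  yc = double(y == c);              % relabel: 1 for class c, 0 otherwise
  theta = zeros(n, 1);
  for iter = 1:2000
    theta = theta - (alpha / m) * X' * (g(X * theta) - yc);
  end
  all_theta(c, :) = theta';
end

% Predict by picking the class whose hypothesis returns the highest value
[~, pred] = max(g(X * all_theta'), [], 2);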

