Classification

Classification is used to predict a discrete label: the output falls within a finite set of possible outcomes. Many problems have only two possible outcomes; this is called binary classification (True/False, 0 or 1).

There are also two other common types of classification: multi-class classification and multi-label classification.

Multi-class classification follows the same idea as binary classification, except that instead of two possible outcomes there are three or more; multi-label classification instead allows several labels to be assigned to the same example. To perform these classifications, we use models such as Naïve Bayes, K-Nearest Neighbors, SVMs, and various deep learning models.
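As a small illustration (assuming scikit-learn is available; the dataset and settings are arbitrary), the same estimator can handle a binary or a multi-class target, since it simply predicts one label out of a finite set:

```python
# Minimal sketch, assuming scikit-learn is installed; dataset choice is illustrative.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)                 # 3 classes -> a multi-class problem
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(clf.predict(X_te[:5]))                      # discrete labels from {0, 1, 2}
print(clf.score(X_te, y_te))                      # accuracy on the held-out split
```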

Logistic regression

We shouldn’t address the classification problem while ignoring the fact that $y$ is discrete-valued (i.e. $y \in \{0, 1\}$). That is why, instead of the old linear regression algorithm, to predict $y$ given $x$ we will use the logistic function, or sigmoid function:

$$g(z) = \frac{1}{1 + e^{-z}} \quad \text{(general formula for the sigmoid function)}$$

thus,

$$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$

[Figure: plot of the sigmoid function $g(z)$]

Notice that $g(z)$ tends towards 1 as $z \to \infty$, and $g(z)$ tends towards 0 as $z \to -\infty$. Moreover, $g(z)$, and hence also $h_\theta(x)$, is always bounded between 0 and 1. As before, we keep the convention of letting $x_0 = 1$, so that

$$\theta^T x = \theta_0 + \sum_{j=1}^{d} \theta_j x_j$$
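A minimal NumPy sketch of these two definitions (the names `sigmoid` and `h_theta` are my own, and `X` is assumed to already include the $x_0 = 1$ intercept column):

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + e^{-z}), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def h_theta(theta, X):
    """h_theta(x) = g(theta^T x) for each row of X (X includes the x_0 = 1 column)."""
    return sigmoid(X @ theta)

# Tiny usage example with made-up numbers
theta = np.array([0.5, -1.0, 2.0])
X = np.array([[1.0, 0.3, 0.7],        # first column is x_0 = 1
              [1.0, 1.5, -0.2]])
print(h_theta(theta, X))              # values strictly between 0 and 1
```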

Let us model our prediction $h_\theta(x)$ as the probability that $y = 1$ given $x$, parameterized by $\theta$:

$$P(y=1 \mid x; \theta) = h_\theta(x)$$
$$P(y=0 \mid x; \theta) = 1 - h_\theta(x)$$

This can be written more compactly as

$$p(y \mid x; \theta) = (h_\theta(x))^{y} \, (1 - h_\theta(x))^{1-y}$$

If $y = 1$, the term becomes:

$$p(y=1 \mid x; \theta) = (h_\theta(x))^{1} (1 - h_\theta(x))^{0} = h_\theta(x)$$

which matches the desired probability for $y = 1$.

If $y = 0$, the term becomes:

$$p(y=0 \mid x; \theta) = (h_\theta(x))^{0} (1 - h_\theta(x))^{1} = 1 - h_\theta(x)$$

which matches the desired probability for $y = 0$.
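As a quick sanity check of the compact form (a toy snippet; `bernoulli_p` and the value 0.8 are purely illustrative):

```python
# Assuming h = h_theta(theta, x) has already been computed for one example
def bernoulli_p(y, h):
    """p(y | x; theta) = h^y * (1 - h)^(1 - y), valid for y in {0, 1}."""
    return (h ** y) * ((1.0 - h) ** (1 - y))

h = 0.8
print(bernoulli_p(1, h))   # 0.8 -> matches h_theta(x)
print(bernoulli_p(0, h))   # 0.2 -> matches 1 - h_theta(x)
```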

Application: Likelihood Function

In logistic regression, we are typically interested in maximizing the likelihood of observing the data. If we have multiple observations $(x_i, y_i)$, the [[likelihood function]] would be the product of individual probabilities for all data points (remember [[Independent and identically distributed|IID]]):

$$L(\theta) = \prod_{i=1}^{n} p(y_i \mid x_i; \theta)$$

And to make optimization easier, we usually work with the log-likelihood:

$$\log L(\theta) = \sum_{i=1}^{n} \Big[ y_i \log h_\theta(x_i) + (1 - y_i) \log\big(1 - h_\theta(x_i)\big) \Big]$$

This is what logistic regression attempts to maximize to find the best parameters θ.
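A sketch of this objective and a plain gradient-ascent maximizer in NumPy; the learning rate, step count, and synthetic data are illustrative assumptions, not a prescribed training recipe:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    """sum_i [ y_i * log h(x_i) + (1 - y_i) * log(1 - h(x_i)) ]."""
    h = np.clip(sigmoid(X @ theta), 1e-12, 1 - 1e-12)    # clip to avoid log(0)
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

def fit_gradient_ascent(X, y, lr=1.0, steps=2000):
    """Maximize log L(theta) by plain gradient ascent (illustrative settings)."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (y - sigmoid(X @ theta)) / len(y)   # gradient of the average log-likelihood
        theta += lr * grad
    return theta

# Toy usage: an intercept column of ones plus two random features
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
y = (X @ np.array([0.5, 2.0, -1.0]) + rng.normal(scale=0.5, size=200) > 0).astype(float)

theta_hat = fit_gradient_ascent(X, y)
print(theta_hat, log_likelihood(theta_hat, X, y))
```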

Newton’s method

So far we’ve seen only one method for the [[optimization]] (minimization) of a function, [[Gradient Descent]].

Newton’s method is an optimization algorithm that uses second-order information about the function (i.e., the Hessian matrix) to update the solution in each iteration. Here’s how it works:

  1. Taylor Expansion: Newton’s method approximates the function $f(x)$ around the current point $x_t$ using a second-order Taylor expansion:

$$f(x) \approx f(x_t) + \nabla f(x_t)^T (x - x_t) + \frac{1}{2} (x - x_t)^T H(x_t) (x - x_t)$$

where $\nabla f(x_t)$ is the gradient at $x_t$ and $H(x_t)$ is the Hessian (the matrix of second-order partial derivatives).

  2. Optimal Step Size: The new point $x_{t+1}$ is found by minimizing the second-order approximation of the function; setting its gradient with respect to $x$ to zero yields the Newton update rule:

    $$x_{t+1} = x_t - H(x_t)^{-1} \nabla f(x_t)$$

    In other words, the next point is obtained by moving in the direction of the negative gradient, scaled by the inverse of the Hessian matrix (a sketch implementing this update for logistic regression appears after this list).

  3. Quadratic Convergence: Newton’s method typically converges much faster than gradient descent because it incorporates curvature information (via the Hessian). When the function is well-approximated by a quadratic function near the minimum, the method can converge in very few iterations.

  4. Hessian Matrix: The key distinction between Newton’s method and gradient descent is the use of the Hessian matrix. This gives more precise direction and step size, making the method second-order, in contrast to the first-order gradient descent that only relies on gradient information.

  5. Drawbacks:

    • Computational Cost: Computing the Hessian and its inverse can be computationally expensive, especially for high-dimensional problems.
    • Ill-Conditioned Hessians: If the Hessian is not positive definite (or is close to singular), Newton’s method can fail or give undesirable results. Regularization techniques may be required in such cases.
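As mentioned in item 2 above, here is a minimal NumPy sketch of Newton's method applied to the logistic regression log-likelihood; the helper names, the synthetic data, and the small `ridge` term added to keep the Hessian invertible are illustrative assumptions rather than part of the basic method:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logistic(X, y, steps=10, ridge=1e-8):
    """Newton's method on the logistic regression objective.

    Update: theta <- theta - H^{-1} grad, where grad and H are the gradient and
    Hessian of the negative log-likelihood; `ridge` adds a tiny multiple of the
    identity as a safeguard against a near-singular Hessian.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        h = sigmoid(X @ theta)
        grad = X.T @ (h - y)                      # gradient of the negative log-likelihood
        S = h * (1.0 - h)                         # diagonal of the weight matrix
        H = (X * S[:, None]).T @ X + ridge * np.eye(X.shape[1])
        theta -= np.linalg.solve(H, grad)         # theta_{t+1} = theta_t - H^{-1} grad
    return theta

# Toy usage on the same kind of synthetic data as before
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
y = (X @ np.array([0.5, 2.0, -1.0]) + rng.normal(scale=0.5, size=200) > 0).astype(float)
print(newton_logistic(X, y))                      # typically converges in a handful of steps
```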
