Classification
Classification is used to predict a discrete label: the outputs fall within a finite set of possible outcomes. Many situations have only two possible outcomes; this is called binary classification (True/False, 0 or 1).
There are also two other common types of classification: multi-class classification and multi-label classification.
Multi-class classification follows the same idea as binary classification, except that instead of two possible outcomes there are three or more. To perform these classifications, we use models such as Naïve Bayes, K-Nearest Neighbors, and SVMs, as well as various deep learning models.
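As a quick, hedged illustration (the dataset and library are my own choices, not prescribed above), here is a minimal multi-class example using scikit-learn’s K-Nearest Neighbors classifier on the three-class iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Iris has three classes, so this is a multi-class problem.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# K-Nearest Neighbors, one of the models mentioned above.
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)

print("Test accuracy:", clf.score(X_test, y_test))
```

Swapping in Naïve Bayes or an SVM only changes the classifier line; the rest of the workflow stays the same.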
Logistic regression
We shouldn’t address the classification problem while ignoring the fact that $y$ takes values in a small discrete set, $y \in \{0, 1\}$; thus, we want a hypothesis whose output lies in $[0, 1]$. Logistic regression uses

$$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}},$$

where $g(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid (logistic) function. Notice that $g(z) \to 1$ as $z \to \infty$ and $g(z) \to 0$ as $z \to -\infty$, so $h_\theta(x)$ is always bounded between 0 and 1.

Let us assume that the probability of $y = 1$ given $x$, parameterized by $\theta$, is

$$P(y = 1 \mid x; \theta) = h_\theta(x), \qquad P(y = 0 \mid x; \theta) = 1 - h_\theta(x).$$

This can be written more compactly as

$$p(y \mid x; \theta) = \left(h_\theta(x)\right)^{y} \left(1 - h_\theta(x)\right)^{1 - y}.$$

If $y = 1$, the expression reduces to $h_\theta(x)$, which matches the desired probability for $y = 1$. If $y = 0$, it reduces to $1 - h_\theta(x)$, which matches the desired probability for $y = 0$.
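As a minimal sketch in NumPy (the function names and toy numbers are assumptions made only to illustrate the formulas above):

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    """Hypothesis h_theta(x) = g(theta^T x), interpreted as P(y = 1 | x; theta)."""
    return sigmoid(theta @ x)

theta = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 0.3, 0.8])   # first entry plays the role of an intercept term

p1 = h(theta, x)      # P(y = 1 | x; theta)
p0 = 1.0 - p1         # P(y = 0 | x; theta)
print(p1, p0)         # two numbers in (0, 1) that sum to 1
```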
Application: Likelihood Function
In logistic regression, we are typically interested in maximizing the likelihood of observing the data. If we have multiple observations $(x^{(i)}, y^{(i)})$, $i = 1, \dots, m$, assumed to be independent, the likelihood of the parameters is

$$L(\theta) = \prod_{i=1}^{m} p\big(y^{(i)} \mid x^{(i)}; \theta\big) = \prod_{i=1}^{m} \left(h_\theta(x^{(i)})\right)^{y^{(i)}} \left(1 - h_\theta(x^{(i)})\right)^{1 - y^{(i)}}.$$

And to make optimization easier, we usually work with the log-likelihood:

$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{m} y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right).$$

This is what logistic regression attempts to maximize to find the best parameters $\theta$.
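A short sketch of the log-likelihood above, with made-up data purely for illustration (it reuses the sigmoid from the previous snippet):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    """ell(theta) = sum_i [ y_i * log h(x_i) + (1 - y_i) * log(1 - h(x_i)) ]."""
    p = sigmoid(X @ theta)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Toy data: 4 examples, an intercept column of ones plus 2 features.
X = np.array([[1.0, 0.2, 1.5],
              [1.0, 1.1, 0.3],
              [1.0, 2.0, 2.2],
              [1.0, 0.4, 0.1]])
y = np.array([0, 1, 1, 0])
theta = np.zeros(3)

# With theta = 0 every p is 0.5, so this prints 4 * log(0.5) ≈ -2.77.
print(log_likelihood(theta, X, y))
```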
Newton’s method
So far we’ve seen only one method for the [[optimization]] (minimization) of a function, [[Gradient Descent]].
Newton’s method is an optimization algorithm that uses second-order information about the function (i.e., the Hessian matrix) to update the solution in each iteration. Here’s how it works:
- Taylor Expansion: Newton’s method approximates the function $f$ around the current point $x_k$ using a second-order Taylor expansion:

$$f(x) \approx f(x_k) + \nabla f(x_k)^T (x - x_k) + \frac{1}{2} (x - x_k)^T H(x_k) (x - x_k),$$

where $\nabla f(x_k)$ is the gradient and $H(x_k)$ is the Hessian matrix of $f$ evaluated at $x_k$.
- Optimal Step Size: The optimal step size (or the new point $x_{k+1}$) can be found by solving for the point that minimizes the second-order approximation of the function. The update rule for Newton’s method is:

$$x_{k+1} = x_k - H(x_k)^{-1} \nabla f(x_k).$$

In other words, the next point is obtained by moving in the direction of the negative gradient, scaled by the inverse of the Hessian matrix (a small numeric sketch of this update appears after this list).
- Quadratic Convergence: Newton’s method typically converges much faster than gradient descent because it incorporates curvature information (via the Hessian). When the function is well-approximated by a quadratic function near the minimum, the method can converge in very few iterations.
- Hessian Matrix: The key distinction between Newton’s method and gradient descent is the use of the Hessian matrix. This gives more precise direction and step size, making the method second-order, in contrast to the first-order gradient descent that only relies on gradient information.
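Here is a minimal numeric sketch of the update rule above; the test function and starting point are arbitrary choices for illustration:

```python
import numpy as np

def newton_minimize(grad, hess, x0, n_iter=10):
    """Repeatedly apply x <- x - H(x)^{-1} * grad(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        # Solve H * step = grad instead of explicitly inverting the Hessian.
        step = np.linalg.solve(hess(x), grad(x))
        x = x - step
    return x

# Example: f(x, y) = (x - 3)^2 + 2 * (y + 1)^2, minimized at (3, -1).
grad = lambda v: np.array([2.0 * (v[0] - 3.0), 4.0 * (v[1] + 1.0)])
hess = lambda v: np.array([[2.0, 0.0], [0.0, 4.0]])

# Because f is exactly quadratic, Newton's method lands on (3, -1) in one step.
print(newton_minimize(grad, hess, x0=[0.0, 0.0], n_iter=1))
```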
Drawbacks:
- Computational Cost: Computing the Hessian and its inverse can be computationally expensive, especially for high-dimensional problems.
- Ill-Conditioned Hessians: If the Hessian is not positive definite (or is close to singular), Newton’s method can fail or give undesirable results. Regularization techniques may be required in such cases.
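Putting the two sections together, here is a hedged sketch of maximizing the logistic regression log-likelihood with Newton’s method; the toy data and function names are assumptions for illustration, not a reference implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logistic(X, y, n_iter=10):
    """Maximize ell(theta) via the Newton update theta <- theta - H^{-1} * gradient."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ theta)
        grad = X.T @ (y - p)                    # gradient of the log-likelihood
        H = -(X.T * (p * (1.0 - p))) @ X        # Hessian of the log-likelihood
        theta = theta - np.linalg.solve(H, grad)
    return theta

# Toy, non-separable 1-D data with an intercept column.
X = np.array([[1.0, 0.5], [1.0, 1.0], [1.0, 1.5],
              [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0, 0, 1, 0, 1, 1])

print(newton_logistic(X, y, n_iter=8))   # fitted parameters theta
```

Note that with perfectly separable data the maximizer does not exist and the Hessian drifts toward singularity as the parameters grow, which is exactly the ill-conditioned situation the regularization remark above refers to.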