Classification: Logistic Regression
Welcome back! In the previous post, we saw how we can use linear regression to predict continuous outputs. What if we want to predict a discrete value, or even simply a True/False outcome?
Supervised learning problems are categorized into “Regression” and “Classification” problems.
In a regression problem, we are trying to predict results within a continuous output, meaning that we are trying to map input variables to some continuous function. In a classification problem, we are instead trying to predict results in a discrete output. In other words, we are trying to map input variables into discrete categories.
For example, a spam filter that marks an incoming email as “spam” or “not spam” is a classification problem. If the discrete values are either a 0 or a 1, we call this binary classification.
In linear regression, we modeled the relationship using the equation of a straight line. For binary classification, a function whose range is (0, 1) is preferred, and the standard convention in Machine Learning is to use the “logistic” or “sigmoid” function.
Logistic Function
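It is defined as

$$g(z) = \frac{1}{1 + e^{-z}}$$

and its output always lies strictly between 0 and 1, which is exactly what we need for a 0/1 decision.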
The decision between ‘0’ and ‘1’ is made as follows:
g(x) ≥ 0.5 implies ‘1’, and
0 < g(x) < 0.5 implies ‘0’.
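As a quick sketch of this rule in Python (assuming NumPy; the helper names `sigmoid` and `predict_label` are illustrative, not from the original post):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: squashes any real z into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_label(g_x, threshold=0.5):
    """Decision rule: 1 if g(x) >= 0.5, else 0."""
    return (np.asarray(g_x) >= threshold).astype(int)

print(sigmoid(0.0))                  # 0.5, exactly at the decision threshold
print(predict_label(sigmoid(2.0)))   # 1
print(predict_label(sigmoid(-2.0)))  # 0
```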
Logistic Regression Cost Function
The cost function for logistic regression looks like this:
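$$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$$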
Or, in general form over the complete dataset of m examples:
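$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\!\left(h_\theta(x^{(i)})\right) + \left(1 - y^{(i)}\right) \log\!\left(1 - h_\theta(x^{(i)})\right) \right]$$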
Decision Boundary
In classification we want to identify the boundary that separates the categories as in the figure below.
Thus we are looking for a line (or curve) that separates the two categories. As seen in the linear regression post, such a function can be expressed as
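$$\theta^T x = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$$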
So, when using the sigmoid function, we are looking for
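$$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$

The decision boundary itself is the set of points where $\theta^T x = 0$, since that is exactly where $h_\theta(x) = 0.5$.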
Gradient
In order to fit the best curve, our goal, just as in linear regression, is to minimize the cost (or error) function.
We start by differentiating the cost function with respect to each parameter θⱼ:
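$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$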
and, introducing the learning rate α, the update step becomes
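$$\theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$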
Gradient Descent Algorithm
Using the same approach as in the earlier post, we will do the following:
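repeat until convergence:

$$\theta_j := \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j} \quad \text{(simultaneously for all } j = 0, \dots, n\text{)}$$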
The corresponding matrix (vectorized) form, repeated until convergence, is
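$$\theta := \theta - \frac{\alpha}{m} X^T \left( g(X\theta) - y \right)$$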
Thus we can find the coefficients θ that best fit the curve, i.e., the ones with the least error.
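Putting the pieces together, here is a minimal NumPy sketch of batch gradient descent for logistic regression. It is an illustration under my own assumptions (the function names, toy data, fixed learning rate, and iteration count are all invented for this example), not code from the original post:

```python
import numpy as np

def sigmoid(z):
    # Logistic function g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    # J(theta) = -(1/m) * sum(y*log(h) + (1-y)*log(1-h))
    m = len(y)
    h = sigmoid(X @ theta)
    return -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m

def gradient_descent(X, y, alpha=0.1, num_iters=5000):
    # Vectorized update: theta := theta - (alpha/m) * X^T (g(X theta) - y)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        theta -= (alpha / m) * (X.T @ (sigmoid(X @ theta) - y))
    return theta

# Toy usage: the leading column of ones models the intercept term theta_0.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.5]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = gradient_descent(X, y)
print("theta:", theta, "cost:", cost(theta, X, y))
print("labels:", (sigmoid(X @ theta) >= 0.5).astype(int))  # -> [0 0 1 1]
```

For simplicity the sketch runs a fixed number of iterations; in practice you would monitor J(θ) and stop once it no longer decreases meaningfully.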
When we try to classify a new observation, we apply the function with the fitted coefficients: if the resulting value is < 0.5, it implies a binary output of ‘0’, and if it is ≥ 0.5, it implies a binary output of ‘1’.
In this summary we used Logistic Regression to classify binary outputs. The same approach can be extended to multi-class problems as well, for example with the one-vs-all strategy.
Hope you found this and my previous post interesting. I’m even more excited to learn and absorb Machine Learning fundamentals, and I’m looking forward to the next step in this journey. Stay tuned!
**Enjoy Machine Learning!**