Logistic Regression in Machine Learning


Imagine you want to know whether someone will click on an ad. You have some basic info like age, income, or search history. Logistic regression takes that info and gives you a number like 0.73, which means there is a 73 percent chance they will click. If the number is more than 0.5 we say yes; if not, we say no. This is how machines make decisions in a smart way without being too complex.

In this blog we will explain logistic regression in a simple way. We will talk about how it works, the types of logistic regression in machine learning, some of the math behind it, examples from real life, and how you can use it in Python. No heavy math, just clear and easy-to-understand logic.


What is Logistic Regression?

Logistic regression is a basic machine learning algorithm that is used when we want to predict something that has only two outcomes, like yes or no, true or false, buy or not buy. It looks at the data we already have and tries to find a pattern so it can guess what might happen next. For example, if you give it details like a person’s age, income, and past purchases, it can tell you whether that person is likely to buy your product.

But instead of just saying yes or no right away, it gives you a number between 0 and 1 that tells how likely the answer is to be yes. If the number is more than 0.5 we take it as a yes, and if it is less we take it as a no. This number is called a probability, and it helps us make better decisions.

Even though it has the word regression in its name, it is mostly used for classification problems, which means sorting things into groups like spam or not spam, or pass or fail. It is easy to use, works well for simple problems, and is often the first algorithm people learn in machine learning.
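
To make this concrete, here is a minimal sketch of the idea using scikit-learn. The feature values, labels, and the new person below are made up purely for illustration:

from sklearn.linear_model import LogisticRegression
import numpy as np

# Made-up training data: [age, income in thousands], label 1 = bought, 0 = did not buy
X = np.array([[22, 25], [25, 30], [30, 45], [35, 60], [40, 80], [50, 120]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression(max_iter=1000)
model.fit(X, y)

new_person = np.array([[33, 55]])                  # age 33, income 55k
prob_yes = model.predict_proba(new_person)[0, 1]   # probability of class 1 ("buy")
label = int(prob_yes > 0.5)                        # apply the 0.5 threshold ourselves

print(f"Probability of buying: {prob_yes:.2f}, prediction: {label}")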

Maths Behind Logistic Regression

In logistic regression we do not use a straight line like in linear regression, because we are not predicting numbers, we are predicting chances. So instead of a plain line we use something called the sigmoid function. This function takes any number and turns it into a value between 0 and 1, and that value shows the chance of something happening.

Let us say you give some input like age or salary to the model. First it multiplies those inputs by some weights and adds them up. This is just basic math, like 5 times age plus 3 times salary. The result is a single number, which can be big or small. Then this number is passed through the sigmoid function, which changes it into a value between 0 and 1. For example it might turn 2.4 into about 0.92, which means there is roughly a 92 percent chance of getting a yes.
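
As a quick illustration, here is what the weighted sum plus sigmoid looks like in code. The weights, bias, and inputs are invented just to show the mechanics:

import numpy as np

def sigmoid(z):
    # Squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Invented weights and inputs, only for demonstration
weights = np.array([0.05, 0.03])   # weight for age, weight for salary (in thousands)
bias = -2.5
x = np.array([30, 50])             # age 30, salary 50k

z = np.dot(weights, x) + bias      # weighted sum: 0.05*30 + 0.03*50 - 2.5 = 0.5
p = sigmoid(z)                     # about 0.62 -> 62 percent chance of "yes"

print(f"z = {z:.2f}, probability = {p:.2f}")
print(f"sigmoid(2.4) = {sigmoid(2.4):.2f}")  # about 0.92, as mentioned above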

The math also uses something called a cost function, which checks how far each prediction is from the actual answer. If the model is wrong, the cost is high. So it keeps adjusting the weights again and again, using a method called gradient descent, until the cost becomes as low as possible. This is how the model learns from the data and gets better at making predictions.
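
The cost used for logistic regression is the log loss (also called binary cross-entropy). Here is a tiny sketch of how it is computed, with made-up labels and predicted probabilities:

import numpy as np

def log_loss(y_true, y_pred):
    # Binary cross-entropy: punishes confident but wrong predictions heavily
    eps = 1e-15                              # avoid taking log(0)
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 1])           # actual answers
good = np.array([0.9, 0.1, 0.8, 0.7])     # confident and mostly right -> low cost
bad = np.array([0.2, 0.9, 0.3, 0.4])      # mostly wrong -> high cost

print(f"cost for good predictions: {log_loss(y_true, good):.3f}")
print(f"cost for bad predictions:  {log_loss(y_true, bad):.3f}")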

Types of Logistic Regression

| Type | Number of Classes | Output Type | Real-life Example | When to Use |
| --- | --- | --- | --- | --- |
| Binary Logistic Regression | 2 | Yes/No or 0/1 | Spam or not spam | When your outcome has only two categories |
| Multinomial Logistic Regression | 3 or more (no order) | One class from many | Predicting whether a person chooses apple, banana, or orange | When you have more than two options without any ranking |
| Ordinal Logistic Regression | 3 or more (ordered) | One ranked category | Customer satisfaction: poor, fair, good, excellent | When categories have a natural order or level |
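
For binary problems scikit-learn's LogisticRegression works out of the box, and when the target has three or more unordered classes it handles the multinomial case automatically. A small sketch on the built-in iris dataset (three flower species, so a multinomial example):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)          # 3 classes -> multinomial logistic regression

model = LogisticRegression(max_iter=500)   # the solver handles multiple classes itself
model.fit(X, y)

print(model.predict(X[:5]))                  # predicted class labels for the first 5 flowers
print(model.predict_proba(X[:1]).round(3))   # one probability per class, summing to 1

Ordinal logistic regression is not part of scikit-learn itself; ordered categories are usually handled with dedicated tools such as statsmodels' OrderedModel or the mord package.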

Assumptions of Logistic Regression

Logistic regression makes a few assumptions about the data it learns from. The observations should be independent of each other, the input features should not be strongly correlated with one another (no multicollinearity), and the relationship between the features and the log-odds of the outcome should be roughly linear. For plain binary logistic regression the target should also have exactly two categories. The code below checks the multicollinearity assumption using the variance inflation factor (VIF): a high VIF for a feature means it is largely explained by the other features.

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm

# Sample data: note that X2 is exactly 2 * X1, so these two features are perfectly correlated
data = {
    'X1': [1, 2, 3, 4, 5, 6],
    'X2': [2, 4, 6, 8, 10, 12],
    'X3': [5, 3, 6, 9, 12, 15],
    'y':  [0, 0, 0, 1, 1, 1]
}

df = pd.DataFrame(data)
X = df[['X1', 'X2', 'X3']]
X = sm.add_constant(X)   # add the intercept column, as statsmodels expects

# Compute the VIF for each column: how strongly a feature is explained by the others
vif = pd.DataFrame()
vif['Feature'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

print(vif)
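
A common rule of thumb is that a VIF above roughly 5 to 10 signals problematic multicollinearity. In the sample data above, X2 is exactly twice X1, so their VIF values come out extremely large (effectively infinite); that is exactly the kind of overlapping feature that logistic regression assumes you have removed before training.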

How Does Logistic Regression Work?

| Step | What Happens | Explanation in Simple Words |
| --- | --- | --- |
| 1 | Take input features | Use data like age, income, marks, etc. |
| 2 | Multiply inputs by weights and add bias | Do basic math like 5 times age plus 2 times income plus a small extra number |
| 3 | Apply the sigmoid function | Turn the result into a value between 0 and 1 |
| 4 | Get the probability | This number shows the chance of saying yes or no |
| 5 | Use a threshold (like 0.5) | If the chance is more than 0.5 say yes, else say no |
| 6 | Compare prediction with actual result (calculate error) | Check if the model was right or wrong |
| 7 | Update weights using gradient descent | Adjust the math so the next guess is better |
| 8 | Repeat until the error becomes small | Keep improving until the model becomes good enough |
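
All of these steps fit in a few lines of NumPy. The following is a minimal from-scratch sketch, with toy data and an arbitrarily chosen learning rate and iteration count, not production code:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Step 1: toy input features (age, income) and labels
X = np.array([[22, 25], [25, 30], [30, 45], [35, 60], [40, 80], [50, 120]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1], dtype=float)

# Scale features so gradient descent behaves well
X = (X - X.mean(axis=0)) / X.std(axis=0)

weights = np.zeros(X.shape[1])    # start with zero weights
bias = 0.0
lr = 0.1                          # learning rate (chosen arbitrarily)

for _ in range(1000):                         # step 8: repeat until the error is small
    z = X @ weights + bias                    # step 2: weighted sum plus bias
    p = sigmoid(z)                            # steps 3-4: probabilities between 0 and 1
    error = p - y                             # step 6: how far off the predictions are
    weights -= lr * (X.T @ error) / len(y)    # step 7: gradient descent update
    bias -= lr * error.mean()

predictions = (sigmoid(X @ weights + bias) > 0.5).astype(int)   # step 5: threshold at 0.5
print("predictions:", predictions, "actual:", y.astype(int))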

Evaluation Metrics For Logistic Regression

These are the main evaluation metrics used to judge a logistic regression model:

  1. Accuracy
    Percentage of correctly predicted samples out of total samples.
  2. Precision
    Out of all predicted positives, how many are actually positive.
    Formula: Precision = TP / (TP + FP)
  3. Recall (Sensitivity)
    Out of all actual positives, how many were correctly predicted.
    Formula: Recall = TP / (TP + FN)
  4. F1 Score
    Harmonic mean of precision and recall, balances both.
    Formula: F1 = 2 * (Precision * Recall) / (Precision + Recall)
  5. ROC-AUC (Receiver Operating Characteristic – Area Under Curve)
    Measures the model’s ability to distinguish classes; higher is better.
  6. Log Loss (Cross-Entropy Loss)
    Measures how well the predicted probabilities match the actual labels; lower is better.
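
All of these metrics are available in scikit-learn. A short sketch, with made-up true labels and predicted probabilities, of how they are computed:

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, log_loss)

y_true = [0, 0, 1, 1, 1, 0, 1, 0]                   # made-up actual labels
y_prob = [0.2, 0.4, 0.8, 0.6, 0.9, 0.3, 0.4, 0.1]   # model's predicted probabilities
y_pred = [1 if p > 0.5 else 0 for p in y_prob]      # labels after the 0.5 threshold

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
print("ROC-AUC  :", roc_auc_score(y_true, y_prob))
print("Log loss :", round(log_loss(y_true, y_prob), 3))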

Advantages of Logistic Regression

  1. Easy to implement and understand.
  2. Works well for binary classification problems.
  3. Provides probabilities for class predictions.
  4. Can handle both continuous and categorical variables.
  5. Requires less computation compared to complex models.

Disadvantages of Logistic Regression

  1. Can only handle linear relationships between features and log-odds.
  2. Not suitable for complex or non-linear problems.
  3. Sensitive to outliers and noisy data.
  4. Assumes no multicollinearity among input features.
  5. Can struggle with a large number of features without proper regularization (see the sketch below).
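
On that last point, scikit-learn's LogisticRegression applies L2 regularization by default through its C parameter (smaller C means stronger regularization). A brief sketch on synthetic data, just to show the effect on the learned weights:

from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Synthetic data with many features, generated purely for illustration
X, y = make_classification(n_samples=200, n_features=50, n_informative=5, random_state=0)

weak_reg = LogisticRegression(C=10.0, max_iter=1000).fit(X, y)    # weaker regularization
strong_reg = LogisticRegression(C=0.1, max_iter=1000).fit(X, y)   # stronger regularization

# Stronger regularization shrinks the learned weights toward zero
print(f"mean |weight|, C=10 : {abs(weak_reg.coef_).mean():.3f}")
print(f"mean |weight|, C=0.1: {abs(strong_reg.coef_).mean():.3f}")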

Applications of Logistic Regression

  1. Predicting whether a patient has a specific disease based on medical data.
  2. Assessing the likelihood that a loan applicant will default on payment.
  3. Forecasting if a customer will buy a product or respond to a campaign.
  4. Classifying emails accurately as spam or legitimate messages.
  5. Predicting if customers are likely to stop using a service or product.

Conclusion

Logistic regression is a powerful and easy-to-use classification method that helps predict binary outcomes. It works well when the relationship between the features and the target is roughly linear on the log-odds scale. Despite some limitations, such as handling only linear relationships and being sensitive to outliers, it remains widely used in fields like healthcare, finance, and marketing because of its interpretability and efficiency. Understanding its assumptions and evaluation metrics helps you build better, more reliable models.

Frequently Asked Questions (FAQs)

What is logistic regression used for?

It is used to predict binary outcomes like yes/no or true/false decisions based on input features.

How is logistic regression different from linear regression?

Logistic regression predicts probabilities and class labels for classification problems, while linear regression predicts continuous numeric values.

What are the key assumptions of logistic regression?

The main assumptions are no multicollinearity among features, linearity between features and the log-odds of the outcome, and independent observations.