Imagine you want to know whether someone will click on an ad. You have some basic information, like their age, income, or search history. Logistic regression takes that information and gives you a number like 0.73, which means there is a 73 percent chance they will click. If the number is more than 0.5 we say yes; if not, we say no. This is how machines make decisions in a smart way without being too complex.
In this blog we will explain logistic regression in a simple way. We will cover how it works, the types of logistic regression in machine learning, some of the math behind it, real-life examples, and how you can use it in Python. No heavy math, just clear and easy-to-understand logic.
What is Logistic Regression?
Logistic regression is a basic machine learning algorithm used when we want to predict something that has only two outcomes, like yes or no, true or false, buy or not buy. It looks at the data we already have and tries to find a pattern so it can guess what might happen next. For example, if you give it details like a person's age, income, and past purchases, it can tell you whether that person is likely to buy your product.
But instead of just saying yes or no right away, it gives you a number between 0 and 1 that tells you how likely a yes is. If the number is more than 0.5 we take it as a yes, and if it is less we take it as a no. This number is called a probability, and it helps us make better decisions.
Even though it has the word regression in its name, it is mostly used for classification problems, which means sorting things into groups, like spam or not spam, or pass or fail. It is easy to use, works well for simple problems, and is often the first algorithm people learn in machine learning.
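To make this concrete, here is a minimal sketch of binary logistic regression in Python using scikit-learn. All the age and income numbers below are made up purely for illustration.
from sklearn.linear_model import LogisticRegression

# Features: [age, income in thousands]; labels: 1 = bought, 0 = did not buy
# (all of these numbers are made up for illustration)
X = [[22, 25], [35, 60], [28, 40], [50, 90], [41, 75], [23, 30]]
y = [0, 1, 0, 1, 1, 0]

model = LogisticRegression()
model.fit(X, y)

# Probability that a new 30-year-old earning 55k is a yes
new_customer = [[30, 55]]
prob_yes = model.predict_proba(new_customer)[0][1]  # probability of class 1
print(prob_yes)                     # a number between 0 and 1
print(model.predict(new_customer))  # 1 if prob_yes > 0.5, else 0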
Maths Behind Logistic Regression
In logistic regression we do not use a straight line like in linear regression, because we are not predicting numbers, we are predicting chances. Instead, we use something called the sigmoid function: sigmoid(z) = 1 / (1 + e^(-z)). This function takes any number and turns it into a value between 0 and 1, and that value shows the chance of something happening.
Say you give the model some inputs, like age or salary. First it multiplies those inputs by weights and adds them up; this is just basic math, like 5 times age plus 3 times salary. The result is a single number, which can be big or small. This number is then passed through the sigmoid function, which turns it into a value between 0 and 1. For example, it might turn 2.4 into about 0.92, which means there is a 92 percent chance of getting a yes.
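Here is that idea in a few lines of Python (the number 2.4 is just the example value from above):
import numpy as np

def sigmoid(z):
    # Squashes any real number into a value between 0 and 1
    return 1 / (1 + np.exp(-z))

z = 2.4            # the raw weighted sum (weights times inputs, added up)
print(sigmoid(z))  # about 0.92, read as "a 92 percent chance of yes"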
The math also uses something called a cost function, which checks how far the prediction is from the actual answer. If the model is wrong, the cost is high. So the model keeps adjusting the weights again and again, using a method called gradient descent, until the cost becomes as low as possible. This is how the model learns from the data and gets better at making predictions.
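As a rough sketch, here is the cost function (log loss) and a single gradient descent update written out in NumPy; the data and the learning rate are made up.
import numpy as np

# Made-up training data: 3 samples, 2 features each
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]])
y = np.array([1, 0, 1])

w = np.zeros(2)  # start with zero weights
lr = 0.1         # learning rate

preds = 1 / (1 + np.exp(-(X @ w)))  # current predictions (all 0.5 at the start)

# Cost function: log loss is high when predictions are far from the labels
cost = -np.mean(y * np.log(preds) + (1 - y) * np.log(1 - preds))
print(cost)

# One gradient descent step: move the weights against the gradient of the cost
grad = X.T @ (preds - y) / len(y)
w -= lr * grad
print(w)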
Types of Logistic Regression
| Type | Number of Classes | Output Type | Real-life Example | When to Use |
| --- | --- | --- | --- | --- |
| Binary Logistic Regression | 2 | Yes/No or 0/1 | Spam or Not Spam | When your outcome has only two categories |
| Multinomial Logistic Regression | 3 or more (no order) | One class from many | Predicting whether a person chooses Apple, Banana, or Orange | When you have more than two options without any ranking |
| Ordinal Logistic Regression | 3 or more (ordered) | One ranked category | Customer satisfaction: Poor, Fair, Good, Excellent | When categories have a natural order or level |
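For the multinomial case, here is a quick sketch with scikit-learn on a made-up fruit-choice dataset. scikit-learn's LogisticRegression handles more than two classes automatically; for the ordinal case you would need something like statsmodels' OrderedModel instead.
from sklearn.linear_model import LogisticRegression

# Made-up features and three unordered fruit choices
X = [[1, 0], [2, 1], [0, 3], [1, 4], [5, 1], [4, 0]]
y = ['Apple', 'Banana', 'Orange', 'Orange', 'Apple', 'Banana']

clf = LogisticRegression()
clf.fit(X, y)
print(clf.predict([[2, 2]]))        # picks one of the three classes
print(clf.predict_proba([[2, 2]]))  # one probability per class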
Assumptions of Logistic Regression
Logistic regression works best when a few assumptions hold. The observations should be independent of each other, the input features should not be highly correlated with one another (no multicollinearity), and the relationship between the features and the log-odds of the outcome should be roughly linear. A common way to check the multicollinearity assumption is the variance inflation factor (VIF), as in the Python example below.
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm

# Sample data: note that X2 is exactly 2 * X1, so these two features
# are perfectly correlated (a deliberate multicollinearity problem)
data = {
    'X1': [1, 2, 3, 4, 5, 6],
    'X2': [2, 4, 6, 8, 10, 12],
    'X3': [5, 3, 6, 9, 12, 15],
    'y': [0, 0, 0, 1, 1, 1]
}
df = pd.DataFrame(data)

# VIF expects the feature matrix with an intercept column added
X = df[['X1', 'X2', 'X3']]
X = sm.add_constant(X)

# Compute the VIF for each column (the 'const' row can be ignored)
vif = pd.DataFrame()
vif['Feature'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif)
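Because X2 is exactly twice X1 in this toy data, their VIF values blow up (statsmodels reports them as extremely large or infinite), which is exactly what multicollinearity looks like. As a common rule of thumb, a VIF above roughly 5 to 10 suggests a feature is too correlated with the others, and you should consider dropping or combining features before fitting the model.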
How Does Logistic Regression Work?
| Step | What Happens | Explanation in Simple Words |
| --- | --- | --- |
| 1 | Take input features | Use data like age, income, marks, etc. |
| 2 | Multiply inputs by weights and add bias | Do basic math like 5 times age plus 2 times income plus a small extra number |
| 3 | Apply the sigmoid function | Turn the result into a value between 0 and 1 |
| 4 | Get the probability | This number shows the chance of a yes or a no |
| 5 | Use a threshold (like 0.5) | If the chance is more than 0.5 say yes, else say no |
| 6 | Compare prediction with actual result (calculate error) | Check if the model was right or wrong |
| 7 | Update weights using gradient descent | Adjust the math so the next guess is better |
| 8 | Repeat the process till the error becomes small | Keep improving till the model becomes good enough |
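Tying the steps together, here is a from-scratch sketch of this whole loop in NumPy. The data is made up, and a real project would normally use a library like scikit-learn instead of hand-rolled gradient descent.
import numpy as np

# Made-up data: [age, income]; 1 = yes, 0 = no
X = np.array([[25, 30], [40, 80], [30, 45], [50, 95], [22, 28], [45, 85]], dtype=float)
X = (X - X.mean(axis=0)) / X.std(axis=0)  # scale features so gradient descent behaves
y = np.array([0, 1, 0, 1, 0, 1])

w = np.zeros(X.shape[1])  # step 2: weights
b = 0.0                   # step 2: bias
lr = 0.5                  # learning rate for step 7

for _ in range(1000):               # step 8: repeat until the error is small
    z = X @ w + b                   # step 2: weighted sum plus bias
    p = 1 / (1 + np.exp(-z))        # steps 3-4: sigmoid gives probabilities
    error = p - y                   # step 6: compare prediction with actual
    w -= lr * X.T @ error / len(y)  # step 7: gradient descent update
    b -= lr * error.mean()

print((p > 0.5).astype(int))        # step 5: threshold at 0.5 -> final predictions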
Evaluation Metrics for Logistic Regression
Here are the main evaluation metrics to watch when judging a logistic regression model:
- Accuracy: Percentage of correctly predicted samples out of all samples.
- Precision: Out of all predicted positives, how many are actually positive. Formula: Precision = TP / (TP + FP)
- Recall (Sensitivity): Out of all actual positives, how many were correctly predicted. Formula: Recall = TP / (TP + FN)
- F1 Score: Harmonic mean of precision and recall; balances the two. Formula: F1 = 2 * (Precision * Recall) / (Precision + Recall)
- ROC-AUC (Receiver Operating Characteristic – Area Under Curve): Measures the model's ability to distinguish between the classes; higher is better.
- Log Loss (Cross-Entropy Loss): Measures how well the predicted probabilities match the actual labels; lower is better.
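For reference, here is a short sketch computing all of these with scikit-learn, on made-up labels and predicted probabilities:
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, log_loss)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                  # actual labels (made up)
y_prob = [0.9, 0.3, 0.6, 0.8, 0.4, 0.2, 0.4, 0.1]  # predicted P(class = 1)
y_pred = [1 if p > 0.5 else 0 for p in y_prob]     # apply the 0.5 threshold

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("ROC-AUC  :", roc_auc_score(y_true, y_prob))  # needs probabilities, not labels
print("Log loss :", log_loss(y_true, y_prob))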
Advantages of Logistic Regression
- Easy to implement and understand.
- Works well for binary classification problems.
- Provides probabilities for class predictions.
- Can handle both continuous and categorical variables.
- Requires less computation compared to complex models.
Disadvantages of Logistic Regression
- Can only handle linear relationships between features and log-odds.
- Not suitable for complex or non-linear problems.
- Sensitive to outliers and noisy data.
- Assumes no multicollinearity among input features.
- Can struggle with a large number of features without proper regularization (see the sketch after this list).
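As a minimal sketch of that last point, scikit-learn's LogisticRegression applies L2 regularization by default, and the C parameter controls its strength (C = 0.1 below is just an illustrative choice):
from sklearn.linear_model import LogisticRegression

# Smaller C = stronger regularization, which helps when there are many features
model = LogisticRegression(penalty='l2', C=0.1)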
Applications of Logistic Regression
- Predicting whether a patient has a specific disease based on medical data.
- Assessing the likelihood that a loan applicant will default on payment.
- Forecasting if a customer will buy a product or respond to a campaign.
- Classifying emails as spam or legitimate messages.
- Predicting if customers are likely to stop using a service or product.
Conclusion
Logistic regression is a powerful and easy-to-use classification method for predicting binary outcomes. It works well when the relationship between the features and the target is roughly linear on the log-odds scale. Despite limitations such as handling only linear relationships and sensitivity to outliers, it remains widely used in fields like healthcare, finance, and marketing because of its interpretability and efficiency. Understanding its assumptions and evaluation metrics helps you build better, more reliable models.
Frequently Asked Questions (FAQs)
What is logistic regression used for?
It is used to predict binary outcomes like yes/no or true/false decisions based on input features.
How is logistic regression different from linear regression?
Logistic regression predicts probabilities and class labels for classification problems, while linear regression predicts continuous numeric values.
What are the key assumptions of logistic regression?
The main assumptions are no multicollinearity among features, linearity between features and the log-odds of the outcome, and independent observations.