Logistic Regression
Introduction
Logistic regression uses the logistic sigmoid function to return a probability value from feature variables.
How does logistic regression work?
Examples
- Is a person obese or not?
- Does Mr A have cancer?
- Will this team win the match today?
- Is an email spam or not?
Why not linear regression?
- Linear regression predicts output as a continuous range from $-\infty$ to $+\infty$, but in logistic regression we predict discrete values such as 0 and 1.
- Moreover, we can’t map all the output values onto a straight line as we do with a linear function; there is a huge chance that we miss predictions, as shown in the figure below.
In logistic regression, the output of linear regression is passed to a sigmoid function to convert the predicted continuous values into discrete categorical values.
(Figures: Linear Regression fit vs. Logistic Regression fit)
Let’s see the differences between linear and logistic regression:
$\begin{array}{l|l|l} \hline & \text { Linear } & \text { Logistic } \\ \hline \text { Target Variables } & \text { Continuous } & \text { Categorical } \\ \hline \text { Problem Type } & \text { Regression } & \text { Classification } \\ \hline \text { Hypothesis } & \theta^{T} x & \operatorname{sigmoid}\left(\theta^{T} x\right) \\ \hline \text { Loss } & \text { Mean Squared } & \text { Logistic } \\ \hline \end{array}$
Types
- Binary: the output dependent variable is mapped to 2 categorical values
- Multinomial: three or more categorical values with no intrinsic ordering
- Ordinal: three or more categorical values with an ordering
Math intro
Odds and Log Odds
Since the goal of the logistic function is to map a linear combination of the input variables to a probability, we need a link between the two, and that link is the logit function. Before looking at the logit function, let’s see what odds, log odds and the odds ratio mean.
Odds
$\begin{aligned} \operatorname{odds}(Y=1) &=\frac{P(Y=1)}{P(Y=0)}=\frac{P(Y=1)}{1-P(Y=1)} \\ &=\frac{p}{1-p} = \frac{\text{Probability of event happening}}{\text{Probability of event not happening}} \end{aligned}$
Let’s check the odds for some sample data
import pandas as pd
data = [['CS', 'Dropout'], ['EE', 'Graduated'], ['CS', 'Dropout'], ['CS', 'Graduated'], ['EE', 'Dropout'], ['CS', 'Dropout'], ['CS', 'Dropout'],['EE','Graduated']]
df = pd.DataFrame(data, columns = ['Branch', 'Status'])
pd.crosstab(index=df['Branch'], columns= df['Status'], margins=True)
Status   Dropout  Graduated  All
Branch
CS             4          1    5
EE             1          2    3
All            5          3    8
# odds of cs graduated
odds_cs = (1/5)/(4/5) # p/(1-p)
print("odds of cs graduated {}".format(odds_cs))
odds of cs graduated 0.25
# odds of EE graduated
odds_ee = (2/3)/(1/3) # p/(1-p)
print("odds of ee graduated {}".format(odds_ee))
odds of ee graduated 2.0
# Odds ratio
odds_ratio = odds_ee/odds_cs
print("odds ratio of ee to cs is {}".format(odds_ratio))
print("A EE student is {} times likely to graduate than CS".format(odds_ratio))
odds ratio of ee to cs is 8.0
An EE student has 8.0 times the odds of graduating compared to a CS student
Let’s plot the odds and log odds functions
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
def odds(p):
return p / (1 - p)
def log_odds(p):
return np.log(p / (1 - p))
x = np.arange(0.01, 1, 0.05)
odds_x = odds(x)
log_odds_x = log_odds(x)
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))
axes[0].plot(x, odds_x)
axes[0].set_title("odds function")
axes[0].set(xlabel="x", ylabel="odds")
axes[1].plot(x, log_odds_x)
axes[1].set_title("log odds function")
axes[1].set(xlabel="x", ylabel="log_odds")
axes[1].axhline(0)  # log odds cross zero at p = 0.5
axes[1].axvline(0)
fig.tight_layout()
Logit function
$\operatorname{logit}(p)=\log \left(\frac{p}{1-p}\right), \text { for } 0 \leq p \leq 1$
This logit function is what we equate to our linear combination of the input variables:
$\log\left(\frac{p}{1-p}\right) = \theta_1 x_i + \theta_0$
Exponentiating both sides and solving for $p$ gives
$\frac{p}{1-p} = e^{\theta_1 x_i + \theta_0} \quad\Rightarrow\quad p = \frac{1}{1+e^{-(\theta_1 x_i + \theta_0)}}$
This is exactly the sigmoid function, which we will study below.
$p$ = probability of success
$-\infty \leq x_i \leq \infty$;
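As a quick sanity check (a minimal sketch of my own using plain NumPy; the helper names are not from the original), the sigmoid undoes the logit, so mapping a probability to log odds and back recovers the original value:
import numpy as np
def logit(p):
    # logit / log odds: maps a probability in (0, 1) to the whole real line
    return np.log(p / (1 - p))
def inverse_logit(z):
    # sigmoid: maps any real number back to a probability in (0, 1)
    return 1 / (1 + np.exp(-z))
p = np.array([0.1, 0.25, 0.5, 0.9])
z = logit(p)
print(np.round(inverse_logit(z), 2))  # recovers the original probabilities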
Sigmoid function
The sigmoid function is used in logistic regression to squash values from the infinite range $(-\infty, \infty)$ into the finite range $(0, 1)$, which can then be interpreted as a probability and mapped to discrete target values.
Equation of sigmoid function is $g(z)=\frac{1}{1+e^{-z}}$
The function is plotted below
$\begin{aligned} &\lim _{z \rightarrow \infty} g(z)=1 \\ &\lim _{z \rightarrow-\infty} g(z)=0 \end{aligned}$
An interesting property of the sigmoid function is that its derivative can be expressed in terms of the function itself. The first-order derivative of the sigmoid function is $\frac{d g(z)}{d z}=g(z)[1-g(z)]$
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
z = np.linspace(-10, 10, 100)
# sigmoid function
def sigmoid(z):
return 1 / (1 + np.exp(-z))
plt.figure(figsize=(10,6))
plt.plot(z,sigmoid(z))
plt.xlim([-10,10])
plt.ylim([-0.1,1.1])
plt.axvline(0)
plt.axhline(0)
plt.xlabel('z');
plt.ylabel('g(z)')
plt.title('Sigmoid function');
plt.show()
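To see the derivative identity from above in action, here is a small numeric check of my own (not part of the original notebook), comparing $g(z)[1-g(z)]$ against a central finite-difference approximation:
# numeric check of dg/dz = g(z) * (1 - g(z)) at a few sample points
z_check = np.array([-2.0, 0.0, 1.5])
eps = 1e-6
numeric = (sigmoid(z_check + eps) - sigmoid(z_check - eps)) / (2 * eps)
analytic = sigmoid(z_check) * (1 - sigmoid(z_check))
print(np.allclose(numeric, analytic))  # True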
Bernoulli Distribution
We need some basics of the Bernoulli distribution here. The Bernoulli distribution says
$f_{\text {Bernoulli}}(n)=\left\{\begin{array}{ll} 1-P & \text { for } n=0 \\ P & \text { for } n=1 \end{array}\right.$
where $n = 0$ is the failure event and $n = 1$ is the success event.
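The two cases can be written as a single expression, which is what links the Bernoulli distribution to the cost function derived later (a standard identity, stated here for completeness):
$f_{\text {Bernoulli}}(n)=P^{n}(1-P)^{1-n}, \quad n \in \{0, 1\}$
Taking the negative log for one example gives $-n \log P-(1-n) \log (1-P)$, which is exactly the per-example logistic cost below once we set $P = h_{\theta}(x)$ and $n = y$.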
Hypothesis
This equation takes the features ($x$) and parameters ($\theta$) as input and predicts the output dependent variable.
The weighted combination of input variables is … $\theta_{1} \cdot x_{1}+\theta_{2} \cdot x_{2}+\ldots+\theta_{n} \cdot x_{n}$
Writing the above function in linear algebra form …
$\sum_{i=1}^{n} \theta_{i} x_{i}=\theta^{T} x$
Let’s write this in matrix form
$\left[\begin{array}{c} \theta_{1} \\ \theta_{2} \\ \vdots \\ \theta_{n} \end{array}\right]^{T} \cdot\left[\begin{array}{c} x_{1} \\ x_{2} \\ \vdots \\ x_{n} \end{array}\right]=\left[\begin{array}{cccc} \theta_{1} & \theta_{2} & \ldots & \theta_{n} \end{array}\right] \cdot\left[\begin{array}{c} x_{1} \\ x_{2} \\ \vdots \\ x_{n} \end{array}\right]=\theta_{1} x_{1}+\theta_{2} x_{2}+\ldots+\theta_{n} x_{n}$
If we pass this expression to the sigmoid function ….
$P\left(\theta^{T} x\right)=g\left(\theta^{T} x\right) = \frac{1}{1+e^{-\theta^{T} x}}$
where $P\left(\theta^{T} x\right) = h_{\theta}(x)$ and $g()$ is called the sigmoid function.
Now the hypothesis can be written as $h_{\theta}(x)=\frac{1}{1+e^{-\theta^{T} x}}$
where $h_{\theta}(x)=P(Y=1 \mid X ; \theta)$
In words: the probability that $Y=1$ for features $X$ with coefficients $\theta$.
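A minimal numeric sketch of the hypothesis (the numbers are arbitrary examples of my own; the first entry of $x$ is fixed at 1 so that $\theta_0$ acts as the intercept):
import numpy as np
theta = np.array([-1.0, 0.8, 0.4])  # [theta_0, theta_1, theta_2]
x = np.array([1.0, 2.0, 3.0])       # leading 1 pairs with the intercept theta_0
z = np.dot(theta, x)                # theta^T x, the weighted combination
h = 1 / (1 + np.exp(-z))            # h_theta(x) = sigmoid(theta^T x)
print(h)                            # estimated P(Y=1 | x; theta)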
Cost Function
We can’t use the sum of squared errors (SSE) cost function in logistic regression: with the sigmoid hypothesis it gives a non-convex graph with lots of local minima, which makes it very difficult to reach the global minimum.
In linear regression, we used the sum of squared errors (SSE) to calculate the cost. In logistic regression we take a slightly different approach: if the model predicts a 90% probability of success and the example turns out to be a failure, we penalize it more heavily than a 30% probability prediction.
So for logistic regression, we go for a logarithmic cost function, shown below. The log cost function penalizes confident but wrong predictions heavily.
$\operatorname{cost}\left(h_{\theta}(x), y\right)=\left\{\begin{array}{ll} -\log \left(h_{\theta}(x)\right) & \text { if } y=1 \\ -\log \left(1-h_{\theta}(x)\right) & \text { if } y=0 \end{array}\right.$
Combining the above into a single expression …
$\operatorname{cost}\left(h_{\theta}(x), y\right)=-y \log \left(h_{\theta}(x)\right)-(1-y) \log \left(1-h_{\theta}(x)\right)$
Finally, the cost function over all training examples is
$\begin{aligned} J(\theta) &=\frac{1}{m} \sum_{i=1}^{m} \operatorname{cost}\left(h_{\theta}\left(x^{(i)}\right), y^{(i)}\right) \\ &=-\frac{1}{m}\left[\sum_{i=1}^{m} y^{(i)} \log \left(h_{\theta}\left(x^{(i)}\right)\right)+\left(1-y^{(i)}\right) \log \left(1-h_{\theta}\left(x^{(i)}\right)\right)\right] \end{aligned}$
Minimize cost function
We use gradient descent to minimize the cost function. The gradient of the cost with respect to each parameter is
$\frac{\partial J(\theta)}{\partial \theta_{j}}=\frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) x_{j}^{(i)}$
def costFunction(theta, X, y):
    """
    Compute cost and gradient for logistic regression.

    Parameters
    -------------------------------------
    theta : weight vector of shape (n+1, )
    X : input of shape (m, n+1); m: no of training examples, n: no of features
    y : target labels of shape (m, )

    Returns
    -------------------------------------
    cost : value of the cost function
    grad : vector of shape (n+1, ) -> gradient of the cost function wrt the weights
    """
    m = X.shape[0]  # number of training examples

    # Prediction: h_theta(x) = sigmoid(theta^T x) for every training example
    h = sigmoid(X.dot(theta))

    # Logistic (cross-entropy) cost
    cost = (-1 / m) * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

    # Gradient of the cost with respect to the weights
    grad = (1 / m) * X.T.dot(h - y)

    return cost, grad
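To tie the pieces together, here is a sketch of plain batch gradient descent using costFunction on a small synthetic dataset; the data, learning rate and iteration count are arbitrary choices of mine, not from the original.
# synthetic data: one feature, class 1 tends to have larger feature values
np.random.seed(0)
m = 100
feature = np.random.randn(m)
y = (feature + 0.5 * np.random.randn(m) > 0).astype(float)
X = np.column_stack([np.ones(m), feature])  # prepend a column of ones for the intercept
theta = np.zeros(X.shape[1])
alpha = 0.1  # learning rate
for _ in range(1000):
    cost, grad = costFunction(theta, X, y)
    theta = theta - alpha * grad  # gradient descent update
print("final cost:", cost)
print("learned theta:", theta)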