Logistic Regression
Introduction
Logistic regression uses the logistic sigmoid function to return a probability value from feature variables.
How does logistic regression work?
Examples
- Is a person obese or not?
- Does Mr A have cancer?
- Will this team win the match today?
- Is an email spam or not?
Why not linear regression?
- Linear regression predicts output as a continuous range from $-\infty$ to $+\infty$, but in logistic regression we predict discrete values such as 0 and 1.
- Moreover, we can’t map all the output values onto a straight line as we do with a linear function; there is a huge chance that we miss predictions, as shown in the figure below.
In logistic regression, the output of linear regression is passed to a sigmoid function to convert the predicted continuous values into discrete categorical values.
(Figures: Linear Regression fit vs. Logistic Regression fit)
Let’s see the differences between linear and logistic regression:
$\begin{array}{l|l|l} \hline & \text { Linear } & \text { Logistic } \\ \hline \text { Target Variables } & \text { Continuous } & \text { Categorical } \\ \hline \text { Problem Type } & \text { Regression } & \text { Classification } \\ \hline \text { Hypothesis } & \theta^{T} x & \operatorname{sigmoid}\left(\theta^{T} x\right) \\ \hline \text { Loss } & \text { Mean Squared } & \text { Logistic } \\ \hline \end{array}$
Types
- Binary: the output dependent variable is mapped to 2 categorical values
- Multinomial: three or more categorical values with no intrinsic ordering
- Ordinal: three or more categorical values with an ordering
Math intro
Odds and Log Odds
Since the goal of the logistic function is to map a linear combination of the input variables to a probability, we need a link between the two, and that link is the logit function. Before looking at the logit function, let’s see what odds, log odds and the odds ratio mean.
Odds
$\begin{aligned} \operatorname{odds}(Y=1) &=\frac{P(Y=1)}{P(Y=0)}=\frac{P(Y=1)}{1-P(Y=1)} \\ &=\frac{p}{1-p} = \frac{\text{Probability of event happening}}{\text{Probability of event not happening}} \end{aligned}$
Let’s check the odds for some sample data
import pandas as pd
data = [['CS', 'Dropout'], ['EE', 'Graduated'], ['CS', 'Dropout'], ['CS', 'Graduated'], ['EE', 'Dropout'], ['CS', 'Dropout'], ['CS', 'Dropout'],['EE','Graduated']]
df = pd.DataFrame(data, columns = ['Branch', 'Status'])
pd.crosstab(index=df['Branch'], columns= df['Status'], margins=True)
Status   Dropout  Graduated  All
Branch
CS             4          1    5
EE             1          2    3
All            5          3    8
# odds of cs graduated
odds_cs = (1/5)/(4/5) # p/(1-p)
print("odds of cs graduated {}".format(odds_cs))
odds of cs graduated 0.25
# odds of EE graduated
odds_ee = (2/3)/(1/3) # p/(1-p)
print("odds of ee graduated {}".format(odds_ee))
odds of ee graduated 2.0
# Odds ratio
odds_ratio = odds_ee/odds_cs
print("odds ratio of ee to cs is {}".format(odds_ratio))
print("A EE student is {} times likely to graduate than CS".format(odds_ratio))
odds ratio of ee to cs is 8.0
An EE student has 8.0 times the odds of graduating compared to a CS student
Let’s plot the odds and log odds functions
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
def odds(p):
return p / (1 - p)
def log_odds(p):
return np.log(p / (1 - p))
x = np.arange(0.01, 1, 0.05)
odds_x = odds(x)
log_odds_x = log_odds(x)
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))
axes[0].plot(x, odds_x)
axes[0].set_title("odds function")
axes[0].set(xlabel="x", ylabel="odds")
axes[1].plot(x, log_odds_x)
axes[1].set_title("log odds function")
axes[1].set(xlabel="x", ylabel="log_odds")
axes[1].axhline(0)  # log odds cross zero at p = 0.5
axes[1].axvline(0)
fig.tight_layout()
Logit function
$\operatorname{logit}(p)=\log \left(\frac{p}{1-p}\right), \text { for } 0 \leq p \leq 1$
This logit function is what we equate to our linear combination of the input variables:
$\log\left(\frac{p}{1-p}\right) = \theta_1 x_i + \theta_0$
Exponentiating both sides and solving for $p$ gives
$\frac{p}{1-p} = e^{\theta_1 x_i + \theta_0} \quad\Rightarrow\quad p = \frac{1}{1+e^{-(\theta_1 x_i + \theta_0)}}$
This is exactly the sigmoid function, which we will study below.
$p$ = probability of success
$-\infty \leq x_i \leq \infty$;
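As a quick sanity check (a minimal sketch of my own using plain NumPy; the helper names are not from the original), the sigmoid undoes the logit, so mapping a probability to log odds and back recovers the original value:
import numpy as np
def logit(p):
    # logit / log odds: maps a probability in (0, 1) to the whole real line
    return np.log(p / (1 - p))
def inverse_logit(z):
    # sigmoid: maps any real number back to a probability in (0, 1)
    return 1 / (1 + np.exp(-z))
p = np.array([0.1, 0.25, 0.5, 0.9])
z = logit(p)
print(np.round(inverse_logit(z), 2))  # recovers the original probabilities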
Sigmoid function
The sigmoid function is used in logistic regression to squash values from the infinite range $(-\infty, \infty)$ into the finite range $(0, 1)$, which can then be interpreted as a probability and mapped to discrete target values.
Equation of sigmoid function is $g(z)=\frac{1}{1+e^{-z}}$
The function is plotted below
$\begin{aligned} &\lim _{z \rightarrow \infty} g(z)=1 \\ &\lim _{z \rightarrow-\infty} g(z)=0 \end{aligned}$
An interesting property of the sigmoid function is that its derivative can be expressed in terms of the function itself. The first-order derivative of the sigmoid function is $\frac{d g(z)}{d z}=g(z)[1-g(z)]$
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
z = np.linspace(-10, 10, 100)
# sigmoid function
def sigmoid(z):
return 1 / (1 + np.exp(-z))
plt.figure(figsize=(10,6))
plt.plot(z,sigmoid(z))
plt.xlim([-10,10])
plt.ylim([-0.1,1.1])
plt.axvline(0)
plt.axhline(0)
plt.xlabel('z');
plt.ylabel('g(z)')
plt.title('Sigmoid function');
plt.show()
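To see the derivative identity from above in action, here is a small numeric check of my own (not part of the original notebook), comparing $g(z)[1-g(z)]$ against a central finite-difference approximation:
# numeric check of dg/dz = g(z) * (1 - g(z)) at a few sample points
z_check = np.array([-2.0, 0.0, 1.5])
eps = 1e-6
numeric = (sigmoid(z_check + eps) - sigmoid(z_check - eps)) / (2 * eps)
analytic = sigmoid(z_check) * (1 - sigmoid(z_check))
print(np.allclose(numeric, analytic))  # True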
Bernoulli Distribution
We need some basics of the Bernoulli distribution here. The Bernoulli distribution says
$f_{\text {Bernoulli}}(n)=\left\{\begin{array}{ll} 1-P & \text { for } n=0 \\ P & \text { for } n=1 \end{array}\right.$
where $n = 0$ is the failure event and $n = 1$ is the success event.
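The two cases can be written as a single expression, which is what links the Bernoulli distribution to the cost function derived later (a standard identity, stated here for completeness):
$f_{\text {Bernoulli}}(n)=P^{n}(1-P)^{1-n}, \quad n \in \{0, 1\}$
Taking the negative log for one example gives $-n \log P-(1-n) \log (1-P)$, which is exactly the per-example logistic cost below once we set $P = h_{\theta}(x)$ and $n = y$.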
Hypothesis
This equation takes the features ($x$) and parameters ($\theta$) as input and predicts the output dependent variable.
The weighted combination of input variables is … $\theta_{1} \cdot x_{1}+\theta_{2} \cdot x_{2}+\ldots+\theta_{n} \cdot x_{n}$
Writing the above function in linear algebra form …
$\sum_{i=1}^{n} \theta_{i} x_{i}=\theta^{T} x$
Let’s write this in matrix form
$\left[\begin{array}{c} \theta_{1} \\ \theta_{2} \\ \vdots \\ \theta_{n} \end{array}\right]^{T} \cdot\left[\begin{array}{c} x_{1} \\ x_{2} \\ \vdots \\ x_{n} \end{array}\right]=\left[\begin{array}{cccc} \theta_{1} & \theta_{2} & \ldots & \theta_{n} \end{array}\right] \cdot\left[\begin{array}{c} x_{1} \\ x_{2} \\ \vdots \\ x_{n} \end{array}\right]=\theta_{1} x_{1}+\theta_{2} x_{2}+\ldots+\theta_{n} x_{n}$
If we pass this expression to the sigmoid function ….
$P\left(\theta^{T} x\right)=g\left(\theta^{T} x\right) = \frac{1}{1+e^{-\theta^{T} x}}$
where $P\left(\theta^{T} x\right) = h_{\theta}(x)$ and $g()$ is called the sigmoid function.
Now the hypothesis can be written as $h_{\theta}(x)=\frac{1}{1+e^{-\theta^{T} x}}$
where $h_{\theta}(x)=P(Y=1 \mid X ; \theta)$
In words: the probability that $Y=1$ for features $X$ with coefficients $\theta$.
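A minimal numeric sketch of the hypothesis (the numbers are arbitrary examples of my own; the first entry of $x$ is fixed at 1 so that $\theta_0$ acts as the intercept):
import numpy as np
theta = np.array([-1.0, 0.8, 0.4])  # [theta_0, theta_1, theta_2]
x = np.array([1.0, 2.0, 3.0])       # leading 1 pairs with the intercept theta_0
z = np.dot(theta, x)                # theta^T x, the weighted combination
h = 1 / (1 + np.exp(-z))            # h_theta(x) = sigmoid(theta^T x)
print(h)                            # estimated P(Y=1 | x; theta)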
Cost Function
We can’t use the sum of squared errors (SSE) cost function in logistic regression: with the sigmoid hypothesis it gives a non-convex graph with lots of local minima, which makes it very difficult to reach the global minimum.
In linear regression, we used the sum of squared errors (SSE) to calculate the cost. In logistic regression we take a slightly different approach: if the model predicts a 90% probability of success and the example turns out to be a failure, we penalize it more heavily than a 30% probability prediction.
So for logistic regression, we go for a logarithmic cost function, shown below. The log cost function penalizes confident but wrong predictions heavily.
$\operatorname{cost}\left(h_{\theta}(x), y\right)=\left\{\begin{array}{ll} -\log \left(h_{\theta}(x)\right) & \text { if } y=1 \\ -\log \left(1-h_{\theta}(x)\right) & \text { if } y=0 \end{array}\right.$
Combining the above into a single expression …
$\operatorname{cost}\left(h_{\theta}(x), y\right)=-y \log \left(h_{\theta}(x)\right)-(1-y) \log \left(1-h_{\theta}(x)\right)$
Finally, the cost function over all training examples is
$\begin{aligned} J(\theta) &=\frac{1}{m} \sum_{i=1}^{m} \operatorname{cost}\left(h_{\theta}\left(x^{(i)}\right), y^{(i)}\right) \\ &=-\frac{1}{m}\left[\sum_{i=1}^{m} y^{(i)} \log \left(h_{\theta}\left(x^{(i)}\right)\right)+\left(1-y^{(i)}\right) \log \left(1-h_{\theta}\left(x^{(i)}\right)\right)\right] \end{aligned}$
Minimize cost function
We use gradient descent to minimize the cost function. The gradient of the cost with respect to each parameter is
$\frac{\partial J(\theta)}{\partial \theta_{j}}=\frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) x_{j}^{(i)}$
def costFunction(theta, X, y):
    """
    Compute cost and gradient for logistic regression.

    Parameters
    -------------------------------------
    theta : weight vector of shape (n+1, )
    X : input of shape (m, n+1); m: no of training examples, n: no of features
    y : target labels of shape (m, )

    Returns
    -------------------------------------
    cost : value of the cost function
    grad : vector of shape (n+1, ) -> gradient of the cost function wrt the weights
    """
    m = X.shape[0]  # number of training examples

    # Prediction: h_theta(x) = sigmoid(theta^T x) for every training example
    h = sigmoid(X.dot(theta))

    # Logistic (cross-entropy) cost
    cost = (-1 / m) * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

    # Gradient of the cost with respect to the weights
    grad = (1 / m) * X.T.dot(h - y)

    return cost, grad
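To tie the pieces together, here is a sketch of plain batch gradient descent using costFunction on a small synthetic dataset; the data, learning rate and iteration count are arbitrary choices of mine, not from the original.
# synthetic data: one feature, class 1 tends to have larger feature values
np.random.seed(0)
m = 100
feature = np.random.randn(m)
y = (feature + 0.5 * np.random.randn(m) > 0).astype(float)
X = np.column_stack([np.ones(m), feature])  # prepend a column of ones for the intercept
theta = np.zeros(X.shape[1])
alpha = 0.1  # learning rate
for _ in range(1000):
    cost, grad = costFunction(theta, X, y)
    theta = theta - alpha * grad  # gradient descent update
print("final cost:", cost)
print("learned theta:", theta)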