Statistics for Machine Learning
Data types in statistics
Examples of Numerical Data
# continuous data: sample 1000 values from a normal distribution
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

mu = 20
sigma = 2
data_continuous = np.random.normal(mu, sigma, 1000)  # 1000 draws with mean 20 and std 2
sns.distplot(data_continuous, color="blue")
plt.show()
# discrete data: simulate 10 rolls of a six-sided die
import numpy as np
dice_rolls = [np.random.randint(1, 7) for _ in range(10)]
plt.hist(dice_rolls)
plt.show()
Nominal Data
Data you can’t order, for example:
- Gender
- Religion
- Hair color
import pandas as pd

data = {'Name': ['Jim', 'Jake', 'Jessy'],
        'Gender': ['Male', 'Male', 'Female']}
data = pd.DataFrame(data, columns=['Name', 'Gender'])
data
    Name  Gender
0    Jim    Male
1   Jake    Male
2  Jessy  Female
Ordinal Data
Data you can order, but can’t do arithmetic on, for example:
- customer ratings
- economic status
data = {'Movie': ['Superman','Heman','Spiderman'],
'Rating': [4.0,4.7,4.9]
}
data = pd.DataFrame(data, columns=['Movie', 'Rating'])
data
       Movie  Rating
0   Superman     4.0
1      Heman     4.7
2  Spiderman     4.9
Central Tendency
Generate data
import numpy as np
import pandas as pd
from numpy.random import seed
from numpy.random import randint
from numpy import mean
import seaborn as sns
import matplotlib.pyplot as plt
# let's generate some weights for a population
# seed the random number generator
seed(1)
# generate a sample of weights of population
weights = randint(low=120, high=200, size=10000)
Let's plot a histogram of the population weights and analyse it.
import matplotlib.pyplot as plt
sns.distplot(weights, color="blue")
plt.xlabel("weights")
plt.ylabel("frequency")
plt.show()
Mean
$\mu=\frac{\sum_{i=1}^{N} x_{i}}{N}$
- one of several values used to describe the central tendency of the data
import numpy
numpy.mean(weights)
159.6552
Median
If n is odd:
$\text{Median} = \left(\frac{n+1}{2}\right)^{th} \text{ term}$
If n is even:
$\text{Median} = \frac{\left(\frac{n}{2}\right)^{th} \text{ term} + \left(\frac{n}{2}+1\right)^{th} \text{ term}}{2}$
numpy.median(weights)
159.0
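To connect these formulas to the code, here's a minimal sketch (not part of the original notebook; the median_by_hand helper is illustrative) that computes the median by hand for an odd-length and an even-length list and checks the result against np.median.
def median_by_hand(values):
    # sort the values and pick the middle term(s), as in the formulas above
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        return ordered[n // 2]                        # odd n: the ((n+1)/2)-th term
    mid = n // 2
    return (ordered[mid - 1] + ordered[mid]) / 2      # even n: average of the two middle terms
print(median_by_hand([3, 1, 7]), median_by_hand([3, 1, 7, 5]))   # 3 4.0
print(np.median([3, 1, 7]), np.median([3, 1, 7, 5]))             # should agree: 3.0 4.0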
Mode
The mode is the most frequently occurring value in our distribution.
from scipy import stats
stats.mode(weights)
ModeResult(mode=array([152]), count=array([152]))
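The same result can be cross-checked without SciPy; a quick sketch using collections.Counter on the weights array generated above:
from collections import Counter
# count how often each weight occurs and take the most common one
value, count = Counter(weights).most_common(1)[0]
print(value, count)   # should match the ModeResult above: 152 occurs 152 times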
How the mean, median, and mode change
Measures of spread
Range
The range is the difference between the maximum and minimum values. It gives a rough sense of how spread out the data is.
np.max(weights) - np.min(weights)
79
Quartiles
from numpy import percentile
# calculate quartiles
quartiles = percentile(weights, [25, 50, 75])
print('Q1: %.3f' % quartiles[0])
print('Q2 or Median: %.3f' % quartiles[1])
print('Q3: %.3f' % quartiles[2])
print('Q4 or Max: %.3f' % np.max(weights))
Q1: 140.000
Q2 or Median: 159.000
Q3: 180.000
Q4 or Max: 199.000
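A useful spread measure derived from the quartiles is the interquartile range, IQR = Q3 - Q1, which describes the middle 50% of the data and ignores the extreme tails. A quick sketch using the quartiles computed above:
# interquartile range: spread of the middle 50% of the data
iqr = quartiles[2] - quartiles[0]
print('IQR: %.3f' % iqr)   # with the quartiles above this is 40.000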
Variance
Variance measures how much the data is spread out around the mean. When estimating it from a sample we use n-1 in the denominator (Bessel's correction) so that the estimate is unbiased.
$\text { Variance }=s^{2}=\frac{\sum(x_i-\bar{x})^{2}}{n-1}$
$(x_i - \bar{x})$ is the deviation from the mean for each sample value, so the variance is the mean squared deviation.
np.var(weights)
537.16871296
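To tie the formula to the code: np.var divides by n by default (population variance); pass ddof=1 to get the sample variance with the n-1 denominator shown above. A minimal sketch computing both by hand:
deviations = weights - np.mean(weights)                      # (x_i - x_bar) for every value
population_var = np.sum(deviations ** 2) / len(weights)      # divide by n
sample_var = np.sum(deviations ** 2) / (len(weights) - 1)    # divide by n - 1
print(population_var)   # should match np.var(weights)
print(sample_var)       # should match np.var(weights, ddof=1)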
Std Deviation
Standard deviation measures the spread around the mean; you can think of it as roughly the average distance of the data from the mean. Taking the square root undoes the squaring used for the variance, so the result is in the same units as the data.
$s=\sqrt{\frac{1}{n} \sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}}$
For a normal distribution, roughly 68% of values fall within one standard deviation of the mean, about 95% within two, and about 99.7% within three. These percentages hold only for (approximately) normally distributed data.
np.std(weights)
23.1769004174415
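Note that the uniformly generated weights above will not follow those percentages exactly. Here is a quick sketch checking the rule on a synthetic normal sample (the loc and scale values are arbitrary):
normal_sample = np.random.normal(loc=160, scale=20, size=100000)
mu, sigma = np.mean(normal_sample), np.std(normal_sample)
for k in (1, 2, 3):
    within = np.mean(np.abs(normal_sample - mu) <= k * sigma)
    print('within {} std: {:.3f}'.format(k, within))   # expect roughly 0.683, 0.954, 0.997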
Covariance & Correlation
Covariance
Covariance measures how two variables vary together. A positive covariance means the second variable tends to increase when the first increases; a negative covariance means one tends to decrease while the other increases.
$\operatorname{cov}(X, Y)=\frac{\sum_{i=1}^{N}\left(x_{i}-\mu_{x}\right)\left(y_{i}-\mu_{y}\right)}{N}$
Correlation
The correlation coefficient always lies between -1 and 1.
$\rho_{X, Y}=\frac{\operatorname{cov}(X, Y)}{\sigma_{X} \sigma_{Y}}$
# Import pandas library
import pandas as pd
# initialize list of lists
data = [[180, 160], [160, 175], [155, 125], [158, 148]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Height', 'Weight'])
# print dataframe.
df
   Height  Weight
0     180     160
1     160     175
2     155     125
3     158     148
numpy.corrcoef(df['Height'], df['Weight'])
array([[1. , 0.42121072],
[0.42121072, 1. ]])
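The covariance itself can be computed with np.cov, and dividing it by the product of the standard deviations should reproduce the correlation coefficient above. A quick sketch (np.cov and np.std with ddof=1 use the same n-1 normalisation):
cov_matrix = np.cov(df['Height'], df['Weight'])   # 2x2 covariance matrix, n-1 normalisation by default
cov_hw = cov_matrix[0, 1]                         # cov(Height, Weight)
corr = cov_hw / (np.std(df['Height'], ddof=1) * np.std(df['Weight'], ddof=1))
print(cov_hw, corr)                               # corr should match 0.42121072 above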
Random Variable
A random variable assigns a numerical value to the outcome of a random experiment.
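For example, the number of heads in three coin tosses is a random variable: it maps each outcome of the experiment to a number. A tiny sketch (illustrative, not from the original notebook):
import random
# random experiment: toss a fair coin three times
outcome = [random.choice(['H', 'T']) for _ in range(3)]
# random variable X: the number of heads in the outcome
X = outcome.count('H')
print(outcome, '->', X)   # e.g. ['H', 'T', 'H'] -> 2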
Distributions
Histogram
A histogram plots the frequency of values against the values themselves. It can reveal the shape, centre, spread, and outliers of the data.
Probability Density Functions
- Continuous variables: described by a probability density function (PDF)
- Discrete variables: described by a probability mass function (PMF)
Cumulative Distribution Function
The maximum value on the y-axis of a CDF is 1, since all the probabilities add up to 1.
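As an illustration, here's a minimal sketch of an empirical CDF for the weights sample generated earlier; the curve rises from 0 and tops out at 1.
sorted_weights = np.sort(weights)
cumulative_prob = np.arange(1, len(sorted_weights) + 1) / len(sorted_weights)
plt.plot(sorted_weights, cumulative_prob)
plt.xlabel("weights")
plt.ylabel("cumulative probability")   # reaches 1 at the maximum weight
plt.show()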
Conditional Probability
Conditional probability is the probability of an event occurring given that another event has occurred.
For Dependent Events
$P(\mathrm{A} | \mathrm{B})=\frac{\mathrm{P}(\mathrm{A} \cap \mathrm{B})}{P(\mathrm{B})}$
$\mathrm{P}(\mathrm{A} | \mathrm{B})=$ Probability of $\mathrm{A}$, given $\mathrm{B}$ occurs
$\mathrm{P}(\mathrm{A} \cap \mathrm{B})=$ Probability of $\mathrm{A}$ and $\mathrm{B}$ occurring
$\mathrm{P}(\mathrm{B})=$ Probability of $\mathrm{B}$
For Independent Events
$P(A | B)=P(A) \quad$ (if $A$ and $B$ independent)
Let's look at an example with coin tosses.
Here is the tree diagram of the possible combinations.
If you sum the probabilities of all the combinations of two tosses, they add up to 1:
(1/4) + (1/4) + (1/4) + (1/4) = 1
# Let's enumerate all possible outcomes of 3 coin tosses
from itertools import product
tossings = set(product(['H', 'T'], repeat=3))
print("All possible outcomes of 3 coin tosses")
tossings
All possible outcomes of 3 coin tosses
{('H', 'H', 'H'),
('H', 'H', 'T'),
('H', 'T', 'H'),
('H', 'T', 'T'),
('T', 'H', 'H'),
('T', 'H', 'T'),
('T', 'T', 'H'),
('T', 'T', 'T')}
# filter to outcomes where the first toss is heads
first_head = {item for item in tossings if item[0] == 'H'}
first_head
{('H', 'H', 'H'), ('H', 'H', 'T'), ('H', 'T', 'H'), ('H', 'T', 'T')}
# filter to outcomes with exactly two heads
two_head = {item for item in tossings if item.count('H') == 2}
two_head
{('H', 'H', 'T'), ('H', 'T', 'H'), ('T', 'H', 'H')}
# P(first_head | two_head): probability that the first toss is heads, given there are exactly two heads
def probability(items):
    return len(items) / len(tossings)

def conditional_probability(A, B):
    return len(A & B) / len(B)
probability(first_head)
0.5
probability(two_head)
0.375
conditional_probability(first_head, two_head)
0.6666666666666666
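The same answer falls out of the formula P(A|B) = P(A ∩ B) / P(B); a quick check using the probability helper defined above:
p_a_and_b = probability(first_head & two_head)   # P(A ∩ B) = 2/8
p_b = probability(two_head)                      # P(B) = 3/8
print(p_a_and_b / p_b)                           # 0.666..., matching conditional_probability above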
Central Limit Theorem
The distribution of the means of many samples drawn from a population approaches a normal distribution as the sample size grows, even if the population itself is not normally distributed.
import numpy as np
import pandas as pd
from numpy.random import seed
from numpy.random import randint
from numpy import mean
import seaborn as sns
import matplotlib.pyplot as plt
# seed the random number generator
seed(1)
# generate a sample of weights of population
weights = randint(low=120, high=200, size=10000)
print('The average weight is {} pounds'.format(mean(weights)))
weight_df = pd.DataFrame(data={'weight_in_pounds': weights})
weight_df.head()
The average weight is 159.6552 pounds
# Lets visualize the population weight frequency graph
sns.distplot(weight_df['weight_in_pounds'], color="blue")
plt.xlabel("random variable of weights")
plt.ylabel("probability of occurence")
plt.title("Distribution of weight of people");
Let's collect the means of many samples taken from the population.
no_of_samples_list = [20, 100, 1000]  # number of samples to draw, n
total_mean_list = []                  # to store the mean of each sample calculated
mean_of_mean_list = []

for n in no_of_samples_list:
    mean_list_given_sample_num = []
    for sample_no in range(n):
        # each sample has size k = 100, drawn with replacement from the population
        curr_sample = np.random.choice(weight_df['weight_in_pounds'], size=100)
        sample_mean = np.mean(curr_sample)
        mean_list_given_sample_num.append(sample_mean)
    total_mean_list.append(mean_list_given_sample_num)
    mean_of_mean_list.append(np.mean(mean_list_given_sample_num))
# Let's view the distribution and frequency of the means of these samples
# (all three distributions are drawn on the same axes)
# fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(15,5), sharex=True)

# plot numbering starts at 1
plot_number = 1
for mean_list in total_mean_list:
    ax = sns.distplot(mean_list, color="blue")
    print("plt number {} and mean of mean {}".format(plot_number, np.mean(mean_list)))
    ax.set_title("no of samples {}".format(len(mean_list)))
    # go to the next plot number for the next loop
    plot_number = plot_number + 1
plt.show()
plt number 1 and mean of mean 158.977
plt number 2 and mean of mean 159.8567
plt number 3 and mean of mean 159.63302000000002
As the number of samples increases, the distribution of the sample means tends toward a normal distribution.
Let's look at the mean of the sample means for the different numbers of samples.
mean_of_mean_list
[160.49999999999997, 159.65560000000002, 159.65964000000002]
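The central limit theorem also predicts how tight these sample means are: their standard deviation should be close to the population standard deviation divided by the square root of the sample size (each sample above has 100 values). A quick sketch comparing the two:
predicted_se = np.std(weights) / np.sqrt(100)   # sigma / sqrt(k), with sample size k = 100
observed_se = np.std(total_mean_list[2])        # spread of the 1000 sample means
print(predicted_se, observed_se)                # the two should be close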
Let's visualize all three distributions in a single plot to compare them.
sns.distplot(total_mean_list[0], label="mean of samples for $n={}$".format(no_of_samples_list[0]))
sns.distplot(total_mean_list[1], label="mean of samples for $n={}$".format(no_of_samples_list[1]))
sns.distplot(total_mean_list[2], label="mean of samples for $n={}$".format(no_of_samples_list[2]))
plt.title("Distribution of Sample Means of People's Mass in Pounds", y=1.015, fontsize=20)
plt.xlabel("sample mean mass [pounds]")
plt.ylabel("frequency of occurence")
plt.legend();