Statistics for Machine Learning
Data types in statistics
Examples of Numerical Data
# continuous data: sample 1000 values from a normal distribution
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

mu = 20
sigma = 2
data_continuous = np.random.normal(mu, sigma, 1000)  # 1000 draws with mean 20 and std 2
sns.distplot(data_continuous, color="blue")
plt.show()
# discrete data: simulate 10 rolls of a six-sided die
import numpy as np
dice_rolls = [np.random.randint(1, 7) for _ in range(10)]
plt.hist(dice_rolls)
plt.show()
Nominal Data
Data you can’t order, for example:
- Gender
- Religion
- Hair color
import pandas as pd

data = {'Name': ['Jim', 'Jake', 'Jessy'],
        'Gender': ['Male', 'Male', 'Female']}
data = pd.DataFrame(data, columns=['Name', 'Gender'])
data
    Name  Gender
0    Jim    Male
1   Jake    Male
2  Jessy  Female
Ordinal Data
Data you can order, but can’t do arithmetic on, for example:
- customer ratings
- economic status
data = {'Movie': ['Superman','Heman','Spiderman'],
'Rating': [4.0,4.7,4.9]
}
data = pd.DataFrame(data, columns=['Movie', 'Rating'])
data
       Movie  Rating
0   Superman     4.0
1      Heman     4.7
2  Spiderman     4.9
Central Tendency
Generate data
import numpy as np
import pandas as pd
from numpy.random import seed
from numpy.random import randint
from numpy import mean
import seaborn as sns
import matplotlib.pyplot as plt
# let's generate some weights for a population
# seed the random number generator
seed(1)
# generate a sample of weights of population
weights = randint(low=120, high=200, size=10000)
Let's plot a histogram of the population weights and analyse it.
import matplotlib.pyplot as plt
sns.distplot(weights, color="blue")
plt.xlabel("weights")
plt.ylabel("frequency")
plt.show()
Mean
$\mu=\frac{\sum_{i=1}^{N} x_{i}}{N}$
- one of several values used to describe the central tendency of the data
import numpy
numpy.mean(weights)
159.6552
Median
If n is odd:
$\text{Median} = \left(\frac{n+1}{2}\right)^{th} \text{ term}$
If n is even:
$\text{Median} = \frac{\left(\frac{n}{2}\right)^{th} \text{ term} + \left(\frac{n}{2}+1\right)^{th} \text{ term}}{2}$
numpy.median(weights)
159.0
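To connect these formulas to the code, here's a minimal sketch (not part of the original notebook; the median_by_hand helper is illustrative) that computes the median by hand for an odd-length and an even-length list and checks the result against np.median.
def median_by_hand(values):
    # sort the values and pick the middle term(s), as in the formulas above
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        return ordered[n // 2]                        # odd n: the ((n+1)/2)-th term
    mid = n // 2
    return (ordered[mid - 1] + ordered[mid]) / 2      # even n: average of the two middle terms
print(median_by_hand([3, 1, 7]), median_by_hand([3, 1, 7, 5]))   # 3 4.0
print(np.median([3, 1, 7]), np.median([3, 1, 7, 5]))             # should agree: 3.0 4.0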
Mode
The mode is the most frequently occurring value in our distribution.
from scipy import stats
stats.mode(weights)
ModeResult(mode=array([152]), count=array([152]))
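The same result can be cross-checked without SciPy; a quick sketch using collections.Counter on the weights array generated above:
from collections import Counter
# count how often each weight occurs and take the most common one
value, count = Counter(weights).most_common(1)[0]
print(value, count)   # should match the ModeResult above: 152 occurs 152 times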
How the mean, median, and mode change
Measures of spread
Range
The range is the difference between the maximum and minimum values. It gives a rough sense of how spread out the data is.
np.max(weights) - np.min(weights)
79
Quartiles
from numpy import percentile
# calculate quartiles
quartiles = percentile(weights, [25, 50, 75])
print('Q1: %.3f' % quartiles[0])
print('Q2 or Median: %.3f' % quartiles[1])
print('Q3: %.3f' % quartiles[2])
print('Q4 or Max: %.3f' % np.max(weights))
Q1: 140.000
Q2 or Median: 159.000
Q3: 180.000
Q4 or Max: 199.000
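A useful spread measure derived from the quartiles is the interquartile range, IQR = Q3 - Q1, which describes the middle 50% of the data and ignores the extreme tails. A quick sketch using the quartiles computed above:
# interquartile range: spread of the middle 50% of the data
iqr = quartiles[2] - quartiles[0]
print('IQR: %.3f' % iqr)   # with the quartiles above this is 40.000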
Variance
Variance measures how much the data is spread out around the mean. When estimating it from a sample we use n-1 in the denominator (Bessel's correction) so that the estimate is unbiased.
$\text { Variance }=s^{2}=\frac{\sum(x_i-\bar{x})^{2}}{n-1}$
$(x_i - \bar{x})$ is the deviation from the mean for each sample value, so the variance is the mean squared deviation.
np.var(weights)
537.16871296
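To tie the formula to the code: np.var divides by n by default (population variance); pass ddof=1 to get the sample variance with the n-1 denominator shown above. A minimal sketch computing both by hand:
deviations = weights - np.mean(weights)                      # (x_i - x_bar) for every value
population_var = np.sum(deviations ** 2) / len(weights)      # divide by n
sample_var = np.sum(deviations ** 2) / (len(weights) - 1)    # divide by n - 1
print(population_var)   # should match np.var(weights)
print(sample_var)       # should match np.var(weights, ddof=1)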
Std Deviation
Standard deviation measures the spread around the mean; you can think of it as roughly the average distance of the data from the mean. Taking the square root undoes the squaring used for the variance, so the result is in the same units as the data.
$s=\sqrt{\frac{1}{n} \sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}}$
For a normal distribution, roughly 68% of values fall within one standard deviation of the mean, about 95% within two, and about 99.7% within three. These percentages hold only for (approximately) normally distributed data.
np.std(weights)
23.1769004174415
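Note that the uniformly generated weights above will not follow those percentages exactly. Here is a quick sketch checking the rule on a synthetic normal sample (the loc and scale values are arbitrary):
normal_sample = np.random.normal(loc=160, scale=20, size=100000)
mu, sigma = np.mean(normal_sample), np.std(normal_sample)
for k in (1, 2, 3):
    within = np.mean(np.abs(normal_sample - mu) <= k * sigma)
    print('within {} std: {:.3f}'.format(k, within))   # expect roughly 0.683, 0.954, 0.997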
Covariance & Correlation
Covariance
Covariance measures how two variables vary together. A positive covariance means the second variable tends to increase when the first increases; a negative covariance means one tends to decrease while the other increases.
$\operatorname{cov}(X, Y)=\frac{\sum_{i=1}^{N}\left(x_{i}-\mu_{x}\right)\left(y_{i}-\mu_{y}\right)}{N}$
Correlation
The correlation coefficient always lies between -1 and 1.
$\rho_{X, Y}=\frac{\operatorname{cov}(X, Y)}{\sigma_{X} \sigma_{Y}}$
# Import pandas library
import pandas as pd
# initialize list of lists
data = [[180, 160], [160, 175], [155, 125], [158, 148]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Height', 'Weight'])
# print dataframe.
df
   Height  Weight
0     180     160
1     160     175
2     155     125
3     158     148
numpy.corrcoef(df['Height'], df['Weight'])
array([[1. , 0.42121072],
[0.42121072, 1. ]])
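The covariance itself can be computed with np.cov, and dividing it by the product of the standard deviations should reproduce the correlation coefficient above. A quick sketch (np.cov and np.std with ddof=1 use the same n-1 normalisation):
cov_matrix = np.cov(df['Height'], df['Weight'])   # 2x2 covariance matrix, n-1 normalisation by default
cov_hw = cov_matrix[0, 1]                         # cov(Height, Weight)
corr = cov_hw / (np.std(df['Height'], ddof=1) * np.std(df['Weight'], ddof=1))
print(cov_hw, corr)                               # corr should match 0.42121072 above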
Random Variable
A random variable assigns a numerical value to the outcome of a random experiment.
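For example, the number of heads in three coin tosses is a random variable: it maps each outcome of the experiment to a number. A tiny sketch (illustrative, not from the original notebook):
import random
# random experiment: toss a fair coin three times
outcome = [random.choice(['H', 'T']) for _ in range(3)]
# random variable X: the number of heads in the outcome
X = outcome.count('H')
print(outcome, '->', X)   # e.g. ['H', 'T', 'H'] -> 2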
Distributions
Histogram
A histogram plots the frequency of values against the values themselves. It can reveal the shape, centre, spread, and outliers of the data.
Probability Density Functions
- Continuous variables: described by a probability density function (PDF)
- Discrete variables: described by a probability mass function (PMF)
Cumulative Distribution Function
The maximum value on the y-axis of a CDF is 1, since all the probabilities add up to 1.
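As an illustration, here's a minimal sketch of an empirical CDF for the weights sample generated earlier; the curve rises from 0 and tops out at 1.
sorted_weights = np.sort(weights)
cumulative_prob = np.arange(1, len(sorted_weights) + 1) / len(sorted_weights)
plt.plot(sorted_weights, cumulative_prob)
plt.xlabel("weights")
plt.ylabel("cumulative probability")   # reaches 1 at the maximum weight
plt.show()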
Conditional Probability
Conditional probability is the probability of an event occurring given that another event has occurred.
For Dependent Events
$P(\mathrm{A} | \mathrm{B})=\frac{\mathrm{P}(\mathrm{A} \cap \mathrm{B})}{P(\mathrm{B})}$
$\mathrm{P}(\mathrm{A} | \mathrm{B})=$ Probability of $\mathrm{A}$, given $\mathrm{B}$ occurs
$\mathrm{P}(\mathrm{A} \cap \mathrm{B})=$ Probability of $\mathrm{A}$ and $\mathrm{B}$ occurring
$\mathrm{P}(\mathrm{B})=$ Probability of $\mathrm{B}$
For Independent Events
$P(A | B)=P(A) \quad$ (if $A$ and $B$ independent)
Let's look at an example with coin tosses.
Here is the tree diagram of the possible combinations.
If you sum the probabilities of all the combinations of two tosses, they add up to 1:
(1/4) + (1/4) + (1/4) + (1/4) = 1
# Let's enumerate all possible outcomes of 3 coin tosses
from itertools import product
tossings = set(product(['H', 'T'], repeat=3))
print("All possible outcomes of 3 coin tosses")
tossings
All possible outcomes of 3 coin tosses
{('H', 'H', 'H'),
('H', 'H', 'T'),
('H', 'T', 'H'),
('H', 'T', 'T'),
('T', 'H', 'H'),
('T', 'H', 'T'),
('T', 'T', 'H'),
('T', 'T', 'T')}
# filter to outcomes where the first toss is heads
first_head = {item for item in tossings if item[0] == 'H'}
first_head
{('H', 'H', 'H'), ('H', 'H', 'T'), ('H', 'T', 'H'), ('H', 'T', 'T')}
# filter to outcomes with exactly two heads
two_head = {item for item in tossings if item.count('H') == 2}
two_head
{('H', 'H', 'T'), ('H', 'T', 'H'), ('T', 'H', 'H')}
# P(first_head | two_head): probability that the first toss is heads, given there are exactly two heads
def probability(items):
    return len(items) / len(tossings)

def conditional_probability(A, B):
    return len(A & B) / len(B)
probability(first_head)
0.5
probability(two_head)
0.375
conditional_probability(first_head, two_head)
0.6666666666666666
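The same answer falls out of the formula P(A|B) = P(A ∩ B) / P(B); a quick check using the probability helper defined above:
p_a_and_b = probability(first_head & two_head)   # P(A ∩ B) = 2/8
p_b = probability(two_head)                      # P(B) = 3/8
print(p_a_and_b / p_b)                           # 0.666..., matching conditional_probability above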
Central Limit Theorem
The distribution of the means of many samples drawn from a population approaches a normal distribution as the sample size grows, even if the population itself is not normally distributed.
import numpy as np
import pandas as pd
from numpy.random import seed
from numpy.random import randint
from numpy import mean
import seaborn as sns
import matplotlib.pyplot as plt
# seed the random number generator
seed(1)
# generate a sample of weights of population
weights = randint(low=120, high=200, size=10000)
print('The average weight is {} pounds'.format(mean(weights)))
weight_df = pd.DataFrame(data={'weight_in_pounds': weights})
weight_df.head()
The average weight is 159.6552 pounds
# Lets visualize the population weight frequency graph
sns.distplot(weight_df['weight_in_pounds'], color="blue")
plt.xlabel("random variable of weights")
plt.ylabel("probability of occurence")
plt.title("Distribution of weight of people");
Let's collect the means of many samples taken from the population.
no_of_samples_list = [20, 100, 1000]  # number of samples to draw, n
total_mean_list = []                  # to store the mean of each sample calculated
mean_of_mean_list = []

for n in no_of_samples_list:
    mean_list_given_sample_num = []
    for sample_no in range(n):
        # each sample has size k = 100, drawn with replacement from the population
        curr_sample = np.random.choice(weight_df['weight_in_pounds'], size=100)
        sample_mean = np.mean(curr_sample)
        mean_list_given_sample_num.append(sample_mean)
    total_mean_list.append(mean_list_given_sample_num)
    mean_of_mean_list.append(np.mean(mean_list_given_sample_num))
# Let's view the distribution and frequency of the means of these samples
# (all three distributions are drawn on the same axes)
# fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(15,5), sharex=True)

# plot numbering starts at 1
plot_number = 1
for mean_list in total_mean_list:
    ax = sns.distplot(mean_list, color="blue")
    print("plt number {} and mean of mean {}".format(plot_number, np.mean(mean_list)))
    ax.set_title("no of samples {}".format(len(mean_list)))
    # go to the next plot number for the next loop
    plot_number = plot_number + 1
plt.show()
plt number 1 and mean of mean 158.977
plt number 2 and mean of mean 159.8567
plt number 3 and mean of mean 159.63302000000002
As the number of samples increases, the distribution of the sample means tends toward a normal distribution.
Let's look at the mean of the sample means for the different numbers of samples.
mean_of_mean_list
[160.49999999999997, 159.65560000000002, 159.65964000000002]
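The central limit theorem also predicts how tight these sample means are: their standard deviation should be close to the population standard deviation divided by the square root of the sample size (each sample above has 100 values). A quick sketch comparing the two:
predicted_se = np.std(weights) / np.sqrt(100)   # sigma / sqrt(k), with sample size k = 100
observed_se = np.std(total_mean_list[2])        # spread of the 1000 sample means
print(predicted_se, observed_se)                # the two should be close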
Let's visualize all three distributions in a single plot to compare them.
sns.distplot(total_mean_list[0], label="mean of samples for $n={}$".format(no_of_samples_list[0]))
sns.distplot(total_mean_list[1], label="mean of samples for $n={}$".format(no_of_samples_list[1]))
sns.distplot(total_mean_list[2], label="mean of samples for $n={}$".format(no_of_samples_list[2]))
plt.title("Distribution of Sample Means of People's Mass in Pounds", y=1.015, fontsize=20)
plt.xlabel("sample mean mass [pounds]")
plt.ylabel("frequency of occurence")
plt.legend();