Feature selection is one of the most important steps in making a model work well. In data mining, feature selection is the first step, and it affects the whole process.
Feature selection helps the model in several ways:
- The model trains faster
- Overfitting is reduced
- The model is simpler
- The dimensionality of the data is reduced
Hence, feature selection is the kick-off step, and it affects everything downstream, especially the model.
There are 3 types of feature selection: filter methods, wrapper methods, and embedded methods.
Filter methods: these methods "filter" features based on a correlation score. Normally, our data has many features and a label. We calculate the correlation between each feature and the label. After that, we only retain the features with a good (relevant) correlation score and remove the others. Within this type of method there are several ways to calculate the correlation; a small code sketch follows the list below.
- Pearson correlation: this one is based on covariance between 2 continuous variables.
$$ \rho_{X,Y} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X \sigma_Y} $$
- LDA: Linear discriminant analysis is used to find a linear combination of features that characterizes or separates two or more classes (or levels) of a categorical variable.
- ANOVA: ANOVA stands for Analysis of Variance. It is similar to LDA except that it operates on one or more categorical independent features and one continuous dependent feature. It provides a statistical test of whether the means of several groups are equal or not.
- Chi-Square: a statistical test applied to groups of categorical features to evaluate the likelihood of correlation or association between them using their frequency distributions.
- Mutual information: a powerful way to calculate the correlation between 2 variables. It can be applied to both categorical and numerical data.
$$ I(X;Y) = \sum_{x \in X} \sum_{y \in Y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)} $$
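As a concrete illustration of the filter idea, here is a minimal MATLAB sketch (the function name filterByPearson and the inputs X, y, k are illustrative, not from any particular library): it ranks features by the absolute Pearson correlation with the label and keeps the top k.
function idx = filterByPearson(X, y, k)
% X: m-by-n feature matrix, y: m-by-1 label vector, k: number of features to keep.
n = size(X, 2);
scores = zeros(n, 1);
for j = 1:n
    C = corrcoef(X(:, j), y);   % 2-by-2 correlation matrix of feature j and the label
    scores(j) = abs(C(1, 2));
end
[~, order] = sort(scores, 'descend');
idx = order(1:k);               % indices of the k most correlated features
end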
Wrapper methods: the fundamental idea behind wrapper methods is to choose the subset of features that maximizes the accuracy of the model. There are 2 main ways to do this: forward and backward selection.
- forward:
step 1: initialize $F = \emptyset$
step 2: repeat {
    (a) for each feature $i = 1, \dots, n$ not yet in $F$:
        $F_i = F \cup \{i\}$
        validate the model on $F_i$
    (b) select the best subset $F_i$ and assign it to $F$
}
step 3: output $F$
- backward:
step 1: initialize $F = \{1, \dots, n\}$
step 2: repeat {
    (a) for each feature $i$ in $F$:
        $F_i = F \setminus \{i\}$
        validate the model on $F_i$
    (b) select the best subset $F_i$ and assign it to $F$
}
step 3: output $F$
This approach is easy to understand, but it is computationally expensive; a rough sketch of forward selection is shown below.
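To make the forward procedure concrete, here is a minimal MATLAB sketch; validateModel is a hypothetical user-supplied function that trains a model on the given feature columns and returns a validation accuracy, and maxFeatures is an illustrative stopping criterion.
function F = forwardSelection(X, y, maxFeatures)
% Greedy forward selection: at each step, add the single feature that gives the
% best validation score.
F = [];
for step = 1:maxFeatures
    bestScore = -Inf;
    bestFeature = 0;
    for i = setdiff(1:size(X, 2), F)             % features not yet selected
        score = validateModel(X(:, [F i]), y);   % hypothetical train/validate helper
        if score > bestScore
            bestScore = score;
            bestFeature = i;
        end
    end
    F = [F bestFeature];                         % keep the best subset from this round
end
end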
Embedded methods: these methods are based on optimization. There are 2 popular approaches: LASSO and RIDGE. LASSO uses L1 regularization: it adds a penalty equivalent to the absolute value of the parameters. Meanwhile, RIDGE uses L2 regularization, adding the square of the parameters. Consider the least-squares problem $\|Ax - b\|_2^2$:
LASSO: minimize ($\|Ax-b\|_2^2 + \lambda\|x\|_1$)
RIDGE: minimize ($\|Ax-b\|_2^2 + \lambda\|x\|_2^2$)
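The ridge problem has a closed-form solution, $x = (A^TA + \lambda I)^{-1}A^Tb$, which the small sketch below implements (LASSO has no closed form and is usually solved iteratively, e.g. with coordinate descent or the lasso function in the Statistics and Machine Learning Toolbox).
function x = ridgeSolve(A, b, lambda)
% Closed-form ridge regression: minimize ||Ax - b||_2^2 + lambda*||x||_2^2.
n = size(A, 2);
x = (A' * A + lambda * eye(n)) \ (A' * b);
end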
Mutual information
As we know, mutual information (MI) is one way to calculate the correlation between 2 variables of the same type.
$I(X;Y)$ is the mutual information; when both variables are discrete, it can be calculated with the simple code here:
function score = calculateMI(x, y)
%
% Calculate the mutual information (in bits) between two discrete vectors x and y.
%
m = length(x);
% marginal probability of each distinct value in x and y
set_x = unique(x(:));
set_y = unique(y(:));
N1 = size(set_x, 1);
N2 = size(set_y, 1);
Hx = zeros(N1, 1);
Hy = zeros(N2, 1);
for i = 1:N1
    Hx(i) = sum(x == set_x(i))/m + 0.000001;   % small constant avoids log(0) later
end
for i = 1:N2
    Hy(i) = sum(y == set_y(i))/m + 0.000001;
end
% joint counts over the values of set_x and set_y
Hxy = zeros(N1, N2);
for i = 1:N1
    for j = 1:N2
        for k = 1:m
            if (x(k) == set_x(i) && y(k) == set_y(j))
                Hxy(i, j) = Hxy(i, j) + 1;
            end
        end
    end
end
% normalize counts to joint probabilities (again with a small smoothing constant)
Hxy = Hxy./sum(Hxy(:)) + 0.000001;
% product of the marginal probabilities as a matrix
SP = Hx*Hy';
% compute MI
score = sum(sum(Hxy.*log2(Hxy./SP)));
end
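For example (a tiny sanity check with made-up vectors), two identical binary vectors should give roughly 1 bit of mutual information:
x = [1; 1; 2; 2];
y = [1; 1; 2; 2];
score = calculateMI(x, y);   % close to 1 bit, up to the small smoothing constant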
However, in the case of a continuous variable and a discrete variable, how do we calculate MI?
One method is to bin the continuous variable into discrete intervals and then reuse the calculation above, like this:
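Here is a rough sketch of that idea (the helper name calculateMI_mixed and the choice of equal-width bins are illustrative assumptions): the continuous vector x is discretized into n_bins bins and then passed to calculateMI above.
function score = calculateMI_mixed(x, y, n_bins)
% MI between a continuous vector x and a discrete vector y, estimated by binning
% x into n_bins equal-width bins and reusing calculateMI.
edges = linspace(min(x), max(x), n_bins + 1);
x_binned = zeros(length(x), 1);
for k = 1:length(x)
    % bin index in 1..n_bins; the maximum value is clamped into the last bin
    x_binned(k) = min(sum(x(k) >= edges(1:end-1)), n_bins);
end
score = calculateMI(x_binned, y);
end
The estimate depends on the number of bins, so it is worth trying a few values of n_bins for the data at hand.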
Mutual information and feature selection are popular and important concepts in machine learning. We should spend time finding the best solution for each case.