Chi-squared test for Feature Selection

October 8, 2018

Determining Feature Importance

Are there certain features that play a larger role in the final prediction? Chi-squared test is one approach for better understanding the importance of each feature by giving you an idea of how much impact each feature will have on the final prediction

What Is It?

Chi-squared is used in a "Goodness of fit" test that tells you if your sample data fits within the distribution of a given population. Does your data represent what you would expect to find from this given population?
Chi-squared is also used to test for independence -- comparing two variables in a table to see if they are related to each other. Do the distributions of these variables differ from each other? A small chi square test statistic means there is a relationship (the observed data fits the latter very well). A large chi square statistic means there was a poor fit, and therefore not much of a relationship

How to calculate:

import pandas as pd
import numpy as np
def chi_squared_test(dataframe, observed, expected):
    '''
    Function recieves in a dataframe that contains a
    column that represents your oberserved value, and
    a column that represents your expected value. It
    then calculates and returns the chi-squared value.

    args:
        dataframe <pandas> : Pandas Dataframe
        observed <string> : column name of your observed value
        expected <string> : column name of your expected value

    returns:
        chi-square <float> : chi squared value

    Notes:
        If the chi-squared value is about zero, then there is a
        near perfect relationship between the two. Take this chi
        squared value and compare it to a critical value form a 
        chi-squared table.
    '''

    diff = dataframe[observed] - dataframe[expected]
    dfff = np.power(diff, 2)
    diff = np.divide(diff, dataframe[expected])
    
    chi_squared = np.sum(diff, axis=1).squeeze()

    return chi_squared

Applications

Chi-squared helsp show the relationship between two categorical variables.