Feature selection examples for logistic regression

Dariga Kokenova
4 min read · Feb 24, 2021


Let’s start with the “Pump it Up: Data Mining the Water Table” dataset downloaded from drivendata.org. The dataset includes training set values and a separate target variable called ‘status_group’. The training set contains 59,400 observations and 39 columns. The data was provided by the Tanzanian Ministry of Water and the nonprofit organization Taarifa to predict which waterpoints are functional and which are not. An accurate understanding of which waterpoints are non-operational can improve maintenance and enhance access to clean water across communities.

Below is a list of changes applied to the dataset:

  • The ‘status_group’ target variable has 3 possible outcomes: functional, functional needs repair, and non functional. Map functional to 0, and map both functional needs repair and non functional to 1.
  • Fill NaN in ‘public_meeting’ and ‘permit’ with False.
  • The variable ‘construction_year’ is equal to 0 for about 20,000 observations. Replace these zeros first with the mean ‘construction_year’ of the corresponding region, and then fill any remaining gaps with the overall ‘construction_year’ mean.
  • For the variables ‘installer’, ‘funder’, ‘wpt_name’, ‘subvillage’, ‘lga’, ‘ward’, and ‘scheme_name’: if a value’s count is less than 200 or the value is NaN, assign it to the category ‘other’.
  • Drop column ‘date_recorded’.
  • Create dummy variables for the non-continuous (categorical) variables.

After performing the steps above, we will have 59,400 observations and 382 columns. That is the dataset we will apply logistic regression to.
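Here is a minimal sketch of that preprocessing. The file names (‘train_values.csv’, ‘train_labels.csv’) are placeholders for whatever you downloaded from drivendata.org, and the exact column count after one-hot encoding can vary slightly depending on encoding choices.

```python
import numpy as np
import pandas as pd

# Hypothetical file names; adjust to the files downloaded from drivendata.org
X = pd.read_csv('train_values.csv')
y = pd.read_csv('train_labels.csv')['status_group']

# Binarize the target: functional -> 0, everything else -> 1
y = (y != 'functional').astype(int)

# Fill missing booleans with False
X[['public_meeting', 'permit']] = X[['public_meeting', 'permit']].fillna(False)

# Replace construction_year == 0 with the regional mean, then the overall mean
X['construction_year'] = X['construction_year'].replace(0, np.nan)
X['construction_year'] = X.groupby('region')['construction_year'].transform(
    lambda s: s.fillna(s.mean()))
X['construction_year'] = X['construction_year'].fillna(X['construction_year'].mean())

# Group rare categories (count < 200) and NaN into 'other'
for col in ['installer', 'funder', 'wpt_name', 'subvillage',
            'lga', 'ward', 'scheme_name']:
    counts = X[col].value_counts()
    rare = counts[counts < 200].index
    X[col] = X[col].where(~X[col].isin(rare), 'other').fillna('other')

# Drop the date column and one-hot encode the remaining categoricals
X = X.drop(columns=['date_recorded'])
X = pd.get_dummies(X)
```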

1. Logistic Regression

Let’s run a logistic regression on the dataset with 382 columns (features). The output is binary: 0 means the waterpoint is functional, and 1 means it needs repair or is non functional. The metrics we will track, computed in the sketch after this list, are:

  • Accuracy: how often are we correct at predicting functional and non functional waterpoints? Accuracy = (True Positive + True Negative) / (all predictions)
  • Precision: when we predict the waterpoint to be non functional, how often is that prediction correct? Precision = True Positive / (True Positive + False Positive)
  • Recall (Sensitivity): what proportion of truly non functional waterpoints was identified correctly? Recall (Sensitivity) = True Positive / (True Positive + False Negative)
  • F1 score: harmonic average of precision and recall metrics. F1 score = (2 * Precision * Recall) / (Precision + Recall)
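A minimal sketch of the baseline model and these four metrics. The article does not specify the train/test split, solver, or iteration settings, so ‘random_state=42’ and ‘max_iter=1000’ below are assumptions.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hold-out split; random_state is an arbitrary choice for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# max_iter raised so the solver converges on a wide, sparse dummy matrix
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

print('Accuracy :', accuracy_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred))
print('Recall   :', recall_score(y_test, y_pred))
print('F1 score :', f1_score(y_test, y_pred))
```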

2. Applying Recursive Feature Elimination (RFE)

A general rule of thumb is to use roughly as many features as the square root of the number of observations. For this dataset, the square root of 59,400 is approximately 243.7, yet we have 382 features (columns). Let’s narrow them down to 250 features using sklearn.feature_selection.RFE.
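A sketch of that RFE step, reusing the train/test split from the baseline sketch above. The choice of 250 features comes from the discussion above; the estimator settings are assumptions.

```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Recursively drop the weakest features until 250 remain,
# using a logistic regression to rank them at each step
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=250)
rfe.fit(X_train, y_train)

X_train_rfe = rfe.transform(X_train)
X_test_rfe = rfe.transform(X_test)

# Refit logistic regression on the reduced feature set
logreg_rfe = LogisticRegression(max_iter=1000).fit(X_train_rfe, y_train)
```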

Feature selection methods such as RFE can reduce overfitting and improve the accuracy of the model. Below are the metrics for logistic regression after applying RFE, and you can see that all metrics have increased.

3. Applying SelectFromModel (SFM)

Another way to determine feature importance is sklearn.feature_selection.SelectFromModel. We need to specify max_features=250 and threshold=-np.inf. Alternatively, we can specify a threshold and let SFM determine how many features meet that requirement. The resulting metrics are higher than for logistic regression without feature selection, but slightly lower than for logistic regression with RFE.
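A sketch of the SelectFromModel variant, again reusing the earlier split; only max_features=250 and threshold=-np.inf come from the text, the rest are assumptions.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# threshold=-np.inf disables the importance cutoff,
# so exactly max_features features are kept
sfm = SelectFromModel(LogisticRegression(max_iter=1000),
                      max_features=250, threshold=-np.inf)
sfm.fit(X_train, y_train)

X_train_sfm = sfm.transform(X_train)
X_test_sfm = sfm.transform(X_test)

logreg_sfm = LogisticRegression(max_iter=1000).fit(X_train_sfm, y_train)
```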

4. Applying SequentialFeatureSelector (SFS)

We can also apply sklearn.feature_selection.SequentialFeatureSelector to the dataset. SFS adds (forward selection) or removes (backward selection) features one at a time to build the feature subset. However, SFS is significantly slower than SFM because of its iterative process and cross-validated scoring. Narrowing the number of features from 382 down to 250 was taking too long, so that run had to be cancelled. Below is the code for SFS with 5 features to select, along with the resulting metrics. Please note that this is just an example to demonstrate how SFS is used; it is not recommended for large datasets.
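A sketch of what that SFS call can look like. The article only specifies 5 features to select; direction='forward' and cv=5 are assumptions.

```python
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Forward selection of 5 features; each candidate feature is scored
# with cross-validation, which is what makes SFS slow on wide datasets
sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=5,
                                direction='forward', cv=5)
sfs.fit(X_train, y_train)

X_train_sfs = sfs.transform(X_train)
X_test_sfs = sfs.transform(X_test)

logreg_sfs = LogisticRegression(max_iter=1000).fit(X_train_sfs, y_train)
```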

Note: the lower left corner in the confusion matrix below is 1,925.

So far, logistic regression with RFE feature selection produces the best metrics. However, if we try a decision tree with max_depth=20, the resulting metrics and confusion matrix are much better.
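A sketch of that comparison. Only max_depth=20 comes from the article; the random_state and the metrics printed are assumptions for illustration.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, f1_score

tree = DecisionTreeClassifier(max_depth=20, random_state=42)
tree.fit(X_train, y_train)
y_pred_tree = tree.predict(X_test)

print(confusion_matrix(y_test, y_pred_tree))
print('F1 score:', f1_score(y_test, y_pred_tree))
```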

To conclude, applying feature selection methods to logistic regression will improve the accuracy of the model, but other models, such as a decision tree, may improve it even further.
