Applying Darwinian Evolution to feature selection with Kydavra GeneticAlgorithmSelector

Vasile Păpăluță
softplus-publication
3 min read · Jan 15, 2021


Mathematics almost always has a good answer to questions about feature selection. Sometimes, however, a good old brute-force algorithm can bring a better and more practical answer into the game.

Genetic algorithms are a family of algorithms inspired by biological evolution. They repeatedly cross, mutate, and evaluate candidate solutions, evolving the best combination according to a scoring metric. A minimal sketch of this loop is shown below; after that, let's get to the code.
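To make the cross–mutate–try cycle concrete, here is a minimal, self-contained sketch of a genetic algorithm for feature selection. It only illustrates the general idea (every name in it is made up for this example); it is not Kydavra's actual implementation.

import random

def genetic_feature_search(evaluate, n_features, population_size=8,
                           n_generations=50, mutation_rate=0.1):
    """evaluate maps a 0/1 feature mask to a score; higher is better."""
    # Start from a random population of feature masks.
    population = [[random.randint(0, 1) for _ in range(n_features)]
                  for _ in range(population_size)]
    for _ in range(n_generations):
        # Try: score every candidate and keep the best half as parents.
        population.sort(key=evaluate, reverse=True)
        parents = population[:population_size // 2]
        children = []
        while len(children) < population_size - len(parents):
            mom, dad = random.sample(parents, 2)
            # Cross: take each gene from one of the two parents.
            child = [random.choice(genes) for genes in zip(mom, dad)]
            # Mutate: occasionally flip a gene (include/drop a feature).
            child = [1 - g if random.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        population = parents + children
    return max(population, key=evaluate)

Kydavra's GeneticAlgorithmSelector wraps this kind of loop around an sklearn model and a scoring metric for you.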

Using GeneticAlgorithmSelector from the Kydavra library

To install Kydavra, just write the following command in the terminal:

pip install kydavra

Now you can import the selector and apply it to your data set as follows:

from kydavra import GeneticAlgorithmSelector

selector = GeneticAlgorithmSelector()
new_columns = selector.select(model, df, 'target')

As with every Kydavra selector, that's all. Now let's try it on the Heart Disease dataset.

import pandas as pd

df = pd.read_csv('cleaned.csv')

I highly recommend shuffling your dataset before applying the selector, because it scores feature combinations on the data as given (cross_val_score isn't implemented in this selector yet):

df = df.sample(frac=1).reset_index(drop=True)

Now we can apply our selector. Note that it has the following parameters (a configuration sketch follows the list):

  • nb_children (int, default = 4): the number of best children the algorithm will keep for the next generation.
  • nb_generation (int, default = 200): the number of generations that will be created; technically speaking, the number of iterations.
  • scoring_metric (sklearn scoring metric, default = accuracy_score): the metric used to select the best feature combination.
  • max (boolean, default = True): if set to True, the algorithm will select the combinations with the highest score; if False, the lowest scores will be chosen.
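For illustration, here is how the selector might be configured with all of these parameters set explicitly. Only the parameter names come from the list above; the values simply restate the defaults, apart from the metric:

from kydavra import GeneticAlgorithmSelector
from sklearn.metrics import precision_score

# All parameters spelled out; the values restate the defaults, except
# scoring_metric, which is swapped to precision for illustration.
selector = GeneticAlgorithmSelector(
    nb_children=4,                   # best children kept per generation
    nb_generation=200,               # number of generations (iterations)
    scoring_metric=precision_score,  # metric used to rank feature sets
    max=True,                        # True: prefer the highest scores
)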

For now we will use the default settings, except for scoring_metric: since this is a disease-diagnosis problem, it is better to use precision instead of accuracy.

from kydavra import GeneticAlgorithmSelector
from sklearn.metrics import precision_score
from sklearn.ensemble import RandomForestClassifier

# Rank feature combinations by precision instead of the default accuracy.
selector = GeneticAlgorithmSelector(scoring_metric=precision_score)
model = RandomForestClassifier()

So now let's find the best features. GAS (short for GeneticAlgorithmSelector) needs an sklearn model to train during the feature-selection process, the data frame itself, and of course the name of the target column:

selected_cols = selector.select(model, df, 'target')

Now let's evaluate the result. Before feature selection, the precision score of the Random Forest was 0.805. GAS chose the following features:

['age', 'sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'thal']

These features gave a precision score of 0.823, which is a good result, given that in most cases it is very hard to improve the scoring metric at all.
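For reference, here is a minimal sketch of how such a before/after comparison could be reproduced. The train/test split and model setup are my assumptions, not the author's exact evaluation code; the 0.805 and 0.823 scores above come from the author's run and depend on the data.

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score

# Hold out a test set to compare the two feature sets fairly.
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns='target'), df['target'], test_size=0.2, random_state=0)

# Precision with all features.
model = RandomForestClassifier().fit(X_train, y_train)
print(precision_score(y_test, model.predict(X_test)))

# Precision with only the features chosen by GAS.
model = RandomForestClassifier().fit(X_train[selected_cols], y_train)
print(precision_score(y_test, model.predict(X_test[selected_cols])))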

If you want to find out more about genetic algorithms, there are some useful links at the bottom of the article. If you have tried Kydavra and have issues or feedback, please contact me on Medium or fill out this form.

Made with ❤ by Sigmoid

Useful links:
