Nowadays the majority of data sets in the industry are unbalanced, meaning that a class has a higher frequency than others. Very often classifiers in such cases due to the unbalance of the data predict all samples as the most frequent class. To solve this problem we decided at Sigmoid to create a package that will have implemented all oversampling methods. We named it Crucio, and in this article, I will tell you about ICOTE (Immune centroids over-sampling method for multi-class classification).

How does ICOTE work?

ICOTE has a very simple logic behind that. It can be split into 2 parts:

  1. Clone generations: In…

50 books to become an expert in any subject.

Photo by Dorina Pantaz

I don’t really remember where I heard or read this quote, but I decided that it sounds like a good goal, for me, a person that aims to become an Artificial Intelligence Engineer. So from August to November, I read 50 books on Artificial Intelligence, Machine Learning, Higher Maths, python, Project Management, and more fields related to AI. Know I decided to share my own top of books with a short recommendation.

# 1 Artificial Intelligence: A Modern Approach by Peter Norvig and Stuart J. Russell.

NaN values are one of the biggest problems in Machine Learning. However, problems are coming not from its presence, but from not knowing what they are meaning. Sometimes it is a full join that generates NaN values, in other cases, it means an imperfection of a sensor, in other cases, only the god knows what this NaN value means.

That’s why we at Sigmoid decided to add to kydavra a method that will decide which columns with NaN values are informative and what not.

Using ShannonSelector from Kydavra library.

If you still haven’t installed Kydavra just type the following in the command line.

pip install…

In the last article, we discussed the MUSESelector. This kydavra selector performs feature selection based on a data frame. The biggest drawback of this method is that it is good only for binary classification problems. There comes into play an extension of this method — M3U — (Minimum Mean Minimum Uncertainty), implemented in kydavra as M3USelector, for multiclass classification.

Using MUSESelector from Kydavra library.

If you still haven’t installed Kydavra just type the following in the following in the command line.

pip install kydavra

If you already have installed the first version of kydavra, please upgrade it by running the following command.

pip install --upgrade…

One of the most intuitive ways to select features would be to find how much the distribution of the classes is different from each other. However on some intervals, the distribution of the feature by the classes can be different, but on other intervals, it can be practically the same. So, we can deduce that the features that have the most intervals where the distribution of classes differ are the best features. This logic is implemented in Minimum Uncertainty and Sample Elimination (or shortly MUSE) implemented in kydavra as MUSESelector.

Using MUSESelector from Kydavra library.

If you still haven’t installed Kydavra just type the following…

PCA — more than just dimensional redution.

Principal Component Analysis is known as one of the most popular dimension reduction techniques. However few know that it has a very interesting property — the reduced data can be brought back to the original dimension. Even more, the data brought back to its original size is more cleaned. So, at Sigmoid we decided to create a module, to easily apply this property on pandas data frames.

Using PCAFilter from Kydavra library.

Principal Component Analysis is a dimensional reduction technique that reduces your data frame into n predefined columns, however, unlike LDA it doesn’t take into account the…

Many times, we have some features that are strongly correlated with the target column. However, sometimes they are correlated with each other, generating in such a way the problem of multicollinearity. One way is to reduce one of these columns. But, we at sigmoid want to propose to you a new solution to this problem implemented in kydavra.

Using LDAReducer from Kydavra library.

Linear Discriminant Analysis is a dimensional reduction technique that reduces your data frame into n predefined columns, however, unlike PCA it takes into account the target vector.

At sigmoid, we thought what would be if instead of reducing the whole data frame…

We all know the Occam’s Razor:

From a set of solutions took the one that is the simplest.

This principle is applied in the regularization of the linear models in Machine Learning. L1-regularisation (also known as LASSO) tend to shrink wights of the linear model to 0 while L2-regularisation (known as Ridge) tend to keep overall complexity as simple as possible, by minimizing the norm of vector weights of the model. One of Kydavra’s selectors uses Lasso for selecting the best features. So let’s see how to apply it.

Using Kydavra LassoSelector.

If you still haven’t installed Kydavra just type the following in…

So how we said in previous articles about Kydavra library, Feature selection is a very important part of Machine Learning model development. Unfortunately, there is not only one unique way to get the ideal model, mostly because of the fact that data almost every time has different forms, but this also implies different approaches. In this article, I would like to share a way to select the categorical features using Kydavra ChiSquaredSelector created by Sigmoid.

Using ChiSquaredSelector from Kydavra library.

As always, for those that are there mostly just for the solution to their problem their are the commands and the code:

To install kydavra…

Almost every person in data science or Machine Learning knows that one of the easiest ways to find relevant features for predicted value y is to find the features that are most correlated with y. However few (if not a mathematician) know that there are many types of correlation. In this article, I will shortly tell you about the 3 most popular types of Correlation and how you can easily apply them with Kydavra for feature selection.

Pearson correlation.

Pearson’s correlation coefficient in the covariance of two variables divided by the product of their standard deviations.

Vasile Păpăluță

A young and passionate student about Data Science and Machine Learning, dreaming of becoming one day an AI Engineer.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store