Nowadays most data sets in industry are imbalanced, meaning that one class occurs much more often than the others. In such cases, classifiers very often predict every sample as the most frequent class. To solve this problem, we at Sigmoid decided to create a package that would implement all oversampling methods. We named it Crucio, and in this article I will tell you about MTDF (Mega-Trend Diffusion Function).

How does MTDF work?

MTDF uses a common diffusion function to diffuse a set of data. Everything starts with computing the hset parameter of the minority class —…


Nowadays most data sets in industry are imbalanced, meaning that one class occurs much more often than the others. In such cases, classifiers very often predict every sample as the most frequent class. To solve this problem, we at Sigmoid decided to create a package that would implement all oversampling methods. We named it Crucio, and in this article I will tell you about MWMOTE (Majority Weighted Minority Oversampling Technique).

How does MWMOTE work?

MWMOTE starts by finding all minority samples whose k1 nearest neighbors all belong to the majority class. This helps to…


Nowadays most data sets in industry are imbalanced, meaning that one class occurs much more often than the others. In such cases, classifiers very often predict every sample as the most frequent class. To solve this problem, we at Sigmoid decided to create a package that would implement all oversampling methods. We named it Crucio, and in this article I will tell you about TKRKNN (Top-K Reversed KNN).

How does TKRKNN work?

We start by finding the K nearest neighbors of every sample of the minority class. Then, TKRKNN finds out the number of samples…


Nowadays most data sets in industry are imbalanced, meaning that one class occurs much more often than the others. In such cases, classifiers very often predict every sample as the most frequent class. To solve this problem, we at Sigmoid decided to create a package that would implement all oversampling methods. We named it Crucio, and in this article I will tell you about ADASYN (Adaptive Synthetic).

How does ADASYN work?

First, we must calculate the number of samples to generate.
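As a rough illustration of this first step, the ADASYN paper defines the number of synthetic samples as G = (m_l - m_s) × β, where m_l and m_s are the majority and minority class sizes and β in [0, 1] sets the desired balance level. The sketch below is a minimal standalone version for the binary case; the function name `adasyn_samples_to_generate` is hypothetical and is not taken from Crucio.

```python
from collections import Counter


def adasyn_samples_to_generate(labels, beta=1.0):
    """Return G = (n_majority - n_minority) * beta for binary labels.

    Hypothetical helper for illustration only; not Crucio's API.
    beta = 1.0 asks for a fully balanced data set after oversampling.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must be in [0, 1]")
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("this sketch expects exactly two classes")
    n_majority = max(counts.values())
    n_minority = min(counts.values())
    return int(round((n_majority - n_minority) * beta))


# 90 majority samples and 10 minority samples
labels = [0] * 90 + [1] * 10
print(adasyn_samples_to_generate(labels, beta=1.0))  # 80
print(adasyn_samples_to_generate(labels, beta=0.5))  # 40
```

With β = 1.0 the minority class is brought up to the size of the majority class; smaller values of β generate proportionally fewer samples.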


Nowadays most data sets in industry are imbalanced, meaning that one class occurs much more often than the others. In such cases, classifiers very often predict every sample as the most frequent class. To solve this problem, we at Sigmoid decided to create a package that would implement all oversampling methods. We named it Crucio, and in this article I will tell you about ICOTE (Immune centroids over-sampling method for multi-class classification).

How does ICOTE work?

The logic behind ICOTE is very simple. It can be split into 2 parts:

  1. Clone generations: In…


50 books to become an expert in any subject.

Photo by Dorina Pantaz

I don’t really remember where I heard or read this quote, but I decided that it sounded like a good goal for me, a person who aims to become an Artificial Intelligence Engineer. So from August to November, I read 50 books on Artificial Intelligence, Machine Learning, Higher Maths, Python, Project Management, and other fields related to AI. Now I have decided to share my own top list of books, with a short recommendation for each.

# 1 Artificial Intelligence: A Modern Approach by Peter Norvig and Stuart J. Russell.


NaN values are one of the biggest problems in Machine Learning. However, the problems come not from their presence but from not knowing what they mean. Sometimes a full join generates the NaN values, in other cases they indicate a faulty sensor, and in other cases only God knows what the NaN value means.

That’s why we at Sigmoid decided to add to kydavra a method that decides which columns with NaN values are informative and which are not.

Using ShannonSelector from the Kydavra library.

If you still haven’t installed Kydavra, just type the following in the command line.

pip install…


In the last article, we discussed the MUSESelector. This kydavra selector performs feature selection based on a data frame. The biggest drawback of this method is that it works only for binary classification problems. This is where an extension of the method comes into play: M3U (Minimum Mean Minimum Uncertainty), implemented in kydavra as M3USelector, for multiclass classification.

Using M3USelector from the Kydavra library.

If you still haven’t installed Kydavra, just type the following in the command line.

pip install kydavra

If you have already installed the first version of kydavra, please upgrade it by running the following command.

pip install --upgrade…


One of the most intuitive ways to select features is to measure how much the class distributions differ from each other. However, on some intervals the per-class distributions of a feature can differ, while on other intervals they can be practically the same. So we can deduce that the best features are those with the most intervals where the class distributions differ. This logic is implemented in Minimum Uncertainty and Sample Elimination (MUSE for short), available in kydavra as MUSESelector.

Using MUSESelector from the Kydavra library.

If you still haven’t installed Kydavra just type the following…


PCA: more than just dimensionality reduction.

Principal Component Analysis is known as one of the most popular dimensionality reduction techniques. However, few know that it has a very interesting property: the reduced data can be brought back to the original dimension. Moreover, the data brought back to its original size is cleaner, because the variation along the discarded components is removed. So, at Sigmoid we decided to create a module that makes it easy to apply this property to pandas data frames.
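To make the reduce-and-restore property concrete, here is a minimal NumPy sketch that projects data onto its leading principal components and maps it back to the original dimension. It is an illustration only: the function `pca_reconstruct` and the synthetic data are assumptions made for this example and do not reflect how PCAFilter is implemented in kydavra.

```python
import numpy as np


def pca_reconstruct(X, n_components):
    """Project X onto its top principal components and map it back.

    Hypothetical helper for illustration; not part of kydavra.
    """
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    centered = X - mean
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    components = Vt[:n_components]
    reduced = centered @ components.T   # shape: (n_samples, n_components)
    return reduced @ components + mean  # back to (n_samples, n_features)


# Synthetic data: three columns driven by one hidden factor, plus noise.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
clean = np.column_stack([t, 2 * t, -t])
noisy = clean + rng.normal(scale=0.1, size=clean.shape)

restored = pca_reconstruct(noisy, n_components=1)
print("error before:", np.mean((noisy - clean) ** 2))
print("error after: ", np.mean((restored - clean) ** 2))
```

On this synthetic example the restored data has the same shape as the input but sits closer to the noise-free signal, because the noise along the two discarded directions has been dropped.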

Using PCAFilter from the Kydavra library.

Principal Component Analysis is a dimensionality reduction technique that reduces your data frame to a predefined number n of columns. However, unlike LDA, it doesn’t take into account the…

Vasile Păpăluță

A young student passionate about Data Science and Machine Learning, dreaming of one day becoming an AI Engineer.
