One of the biggest problems with data in Data Science is its distribution: it is almost never normal. This happens because we cannot gather every sample in the world into one data set. However, there are a number of methods that can change that. In one of the previous articles, we looked at the Box-Cox transformation. Today we will take a look at the Yeo-Johnson transformation.

The Yeo-Johnson transformation converts a non-normal variable into an approximately normal one, following the same aim as Box-Cox. However, Yeo-Johnson goes even further, allowing y to take zero and even negative…
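imperio presumably exposes this transform through its own transformer class; as a minimal sketch of the idea, scikit-learn's PowerTransformer also implements Yeo-Johnson and, unlike Box-Cox, accepts zeros and negative values:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Skewed data containing negative values, which Box-Cox cannot handle.
rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=(200, 1)) - 1.0

pt = PowerTransformer(method="yeo-johnson", standardize=True)
x_transformed = pt.fit_transform(x)

print(pt.lambdas_)           # fitted lambda for the single column
print(x_transformed.mean())  # close to 0 after standardization
```

With `standardize=True`, the output is also centered and scaled, which is usually what a downstream linear model expects.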

As we said in the article about FrequencyImputationTransformer,

usually, categorical values are replaced by integer numbers. However, this approach is very dangerous for linear models, because of the false correlations that may appear. A step forward from this technique is One Hot Encoding (or Dummy variables). But even Dummy variables have their drawbacks: the matrix becomes very large and, even worse, sparse. We also said that Frequency Imputation has a big drawback: if two or more categories have the same frequency, they will collide in this representation. …

Outliers are a big problem for almost all types of Machine Learning algorithms. It would be great if some data preprocessing technique could eliminate, or at least reduce, their influence on the data. You can consider yourself lucky, because one exists: Spatial Sign is its name.

By applying an l2 normalization to the data, this technique brings all points onto a circle if the data is standardized, no matter whether a point is an outlier or not, like the plot below. Even if the data isn’t standardized, it will bring all samples closer to each…
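A minimal sketch of the idea, using scikit-learn's StandardScaler and Normalizer rather than imperio's own transformer:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, Normalizer

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
X[0] = [50.0, 60.0]  # inject an extreme outlier

# Spatial sign: center/scale the data, then project each row onto the unit circle.
X_scaled = StandardScaler().fit_transform(X)
X_sign = Normalizer(norm="l2").fit_transform(X_scaled)

# Every sample, the outlier included, now has unit length.
print(np.linalg.norm(X_sign, axis=1)[:3])
```

After the projection, the outlier can no longer dominate distance-based computations: only its direction survives, not its magnitude.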

One of the biggest problems with data in Data Science is its distribution: it is almost never normal. This happens because we cannot gather every sample in the world into one data set. However, there are a number of methods that can change that. Today we will take a look at the Log-Transform.

Log transformation is a type of power transform where we replace x with log(x). The base of the logarithm is usually Euler’s number e; however, it can be changed. The effects of applying the log transform to a variable are the following:

- It reduces…
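A minimal sketch of this behaviour on synthetic right-skewed data (the variable names here are illustrative, not imperio's API):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # right-skewed data

# Natural log by default; np.log1p(x) = log(1 + x) is a common variant
# that stays defined at x = 0.
x_log = np.log(x)

print(x.std(), x_log.std())  # the spread shrinks markedly after the transform
```

Note that log(x) is only defined for x > 0, which is exactly the limitation that transforms like Yeo-Johnson were designed to remove.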

Usually, categorical values are replaced by integer numbers. However, this approach is very dangerous for linear models, because of the false correlations that may appear. A step forward from this technique is One Hot Encoding (or Dummy variables). But even Dummy variables have their drawbacks: the matrix becomes very large and, even worse, sparse. That’s why we decided to add a method for handling cases when you have a lot of categories in a column: FrequencyImputationTransformer.

The idea behind Frequency Imputation is very simple — you just replace the category with the frequency of the…
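imperio's FrequencyImputationTransformer API is not shown here; a minimal pandas sketch of the underlying idea:

```python
import pandas as pd

df = pd.DataFrame({"city": ["NY", "LA", "NY", "SF", "NY", "LA"]})

# Map each category to its relative frequency in the column.
freq = df["city"].value_counts(normalize=True)
df["city_freq"] = df["city"].map(freq)

print(df)  # NY -> 0.5, LA -> 0.333..., SF -> 0.166...
```

A single numeric column replaces the category, avoiding the sparse matrix of One Hot Encoding, at the cost of the collision problem mentioned above when two categories share the same frequency.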

Very often we would like to combine different transformations on different columns of our feature matrix in a pipeline. That need motivated us to create a special module that gives you this opportunity in imperio: CombinatorTransformer.

CombinatorTransformer allows you to apply a specific feature transformation to a specific set of columns. Mostly it was created to apply one transformation to numerical columns and another to categorical ones, as shown below:

Feature engineering is the process of transforming your input data in such a way that it becomes more representative for Machine Learning algorithms. However, it is very often neglected because of the lack of an easy-to-use package. That’s why we decided to create one: imperio, the third of our unforgivable curses.

One of the biggest problems with data in Data Science is its distribution: it is almost never normal. This happens because we cannot gather every sample in the world into one data set. However, there are a number of methods that can change that. Today…

Nowadays the majority of data sets in the industry are imbalanced, meaning that one class has a higher frequency than the others. Very often, due to the imbalance of the data, classifiers in such cases predict all samples as the most frequent class. To solve this problem, we at Sigmoid decided to create a package implementing all the oversampling methods. We named it Crucio, and in this article I will tell you about MTDF (Mega-Trend Diffusion Function).

MTDF uses a common diffusion function to diffuse a set of data. Everything starts with computing the hset parameter of the minority class —…

Nowadays the majority of data sets in the industry are imbalanced, meaning that one class has a higher frequency than the others. Very often, due to the imbalance of the data, classifiers in such cases predict all samples as the most frequent class. To solve this problem, we at Sigmoid decided to create a package implementing all the oversampling methods. We named it Crucio, and in this article I will tell you about MWMOTE (Majority Weighted Minority Oversampling Technique).

Everything in MWMOTE starts with finding all minority samples whose k1-nearest neighbors are exclusively majority class samples. This helps to…
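A minimal sketch of this first filtering step on toy data (k1 = 5 is an assumed value here; this is not Crucio's actual implementation):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X_maj = rng.normal(0.0, 1.0, size=(95, 2))
X_min = np.vstack([rng.normal(3.0, 0.3, size=(4, 2)),  # a small minority cluster
                   [[0.0, 0.0]]])  # one minority point deep in majority territory
X = np.vstack([X_maj, X_min])
y = np.array([0] * 95 + [1] * 5)

k1 = 5
nn = NearestNeighbors(n_neighbors=k1 + 1).fit(X)  # +1: a point is its own nearest neighbor
_, idx = nn.kneighbors(X[y == 1])
neighbor_labels = y[idx[:, 1:]]                   # drop the self-neighbor

# Minority samples surrounded exclusively by majority neighbors are treated as noise.
noisy = (neighbor_labels == 0).all(axis=1)
print(noisy)
```

In this toy set-up, only the isolated minority point is flagged; MWMOTE discards such points before weighting and generating synthetic samples.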

Nowadays the majority of data sets in the industry are imbalanced, meaning that one class has a higher frequency than the others. Very often, due to the imbalance of the data, classifiers in such cases predict all samples as the most frequent class. To solve this problem, we at Sigmoid decided to create a package implementing all the oversampling methods. We named it Crucio, and in this article I will tell you about TKRKNN (Top-K Reversed KNN).

First, we find the K-nearest neighbors for all samples of the minority class. Then, TKRKNN finds out the number of samples…