## How To Choose The Best Machine Learning Algorithm For A Particular Problem? – Analytics India Magazine

Posted: October 19, 2020 at 3:56 am

How do you know which machine learning algorithm to choose for your problem? Why not simply try every algorithm, or at least the ones we expect to give good accuracy? Because applying each and every algorithm takes a lot of time. It is better to use a technique that narrows down which algorithm is likely to suit the problem.

Choosing the right algorithm depends on the problem statement, and getting it right can save both money and time. So it is important to know what type of problem we are dealing with.

In this article, we discuss the key techniques for choosing the right machine learning algorithm for a particular task. We will see how plotting the properties of a dataset can guide the choice of model, and how the size of the dataset can be an important factor in choosing a machine learning algorithm.

The dataset is taken from Kaggle; you can find it here. It contains information about diabetic patients and whether or not each patient will have an onset of diabetes. It has 9 columns and 767 rows. Each row represents a patient and each column a detail recorded for that patient.

Practical Implementation:

First of all, we will import the required libraries.

After that, we read the CSV file.
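The original code for these two steps is not shown here, so the following is only a minimal sketch of what importing the library and loading the data might look like. The helper name `load_dataset` and the file name `diabetes.csv` are assumptions for illustration, not taken from the article.

```python
import pandas as pd


def load_dataset(csv_path):
    """Read a CSV file into a DataFrame and print a quick summary."""
    df = pd.read_csv(csv_path)
    print(f"{df.shape[0]} rows, {df.shape[1]} columns")
    print(df.head())
    return df


# Hypothetical file name: point this at wherever the Kaggle CSV was saved.
# df = load_dataset("diabetes.csv")
```

Printing the shape and the first few rows is a quick way to confirm that the row and column counts match what is expected before going any further.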

A pair plot of the features helps us understand which algorithm to choose.

From the plot, we can see that there is a lot of overlap between the data points of the two classes. KNN should be preferred here, as it classifies each point by its Euclidean distance to neighboring points. If KNN does not perform as expected, we can use the Decision Tree or Random Forest algorithm.

Decision Tree and Random Forest are non-linear classifiers: they do not need the classes to be separable by a straight line. We can use them if some of the data points overlap with each other.

Many algorithms assume that classes can be separated by a straight line. In such cases, Logistic Regression or a Support Vector Machine should be preferred, since they separate the data points by drawing a line that divides the target classes. Linear regression algorithms similarly assume that data trends follow a straight line. These algorithms perform well when that assumption holds.
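One way to check the reasoning above rather than rely on the plot alone is to compare the candidates with cross-validation. The sketch below is illustrative only: it uses synthetic data from `make_classification` as a stand-in for the diabetes dataset, and the helper name `compare_classifiers` and all parameter values are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


def compare_classifiers(X, y, cv=5):
    """Return the mean cross-validated accuracy of each candidate model."""
    candidates = {
        # KNN is distance-based, so features are standardized first.
        "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
        "decision_tree": DecisionTreeClassifier(random_state=0),
        "logistic_regression": make_pipeline(
            StandardScaler(), LogisticRegression(max_iter=1000)
        ),
    }
    return {
        name: cross_val_score(model, X, y, cv=cv).mean()
        for name, model in candidates.items()
    }


# Synthetic stand-in data with overlapping classes (not the diabetes dataset).
X, y = make_classification(n_samples=500, n_features=8, class_sep=0.8, random_state=0)
for name, score in compare_classifiers(X, y).items():
    print(f"{name}: {score:.3f}")
```

Which model scores best depends on the data, so the same comparison would need to be run on the real features and labels before drawing any conclusion.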

Next, we import several classifiers to compare their training time on a small and a large dataset.

Split the data into train and test sets. We can then apply the Decision Tree, Logistic Regression, Random Forest and Support Vector Machine algorithms to check the training time for a classification problem.
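The split itself might look like the sketch below, using scikit-learn's `train_test_split`. The helper name `split_data`, the target column name `Outcome`, the 80/20 ratio and the seed are all assumptions chosen for illustration.

```python
from sklearn.model_selection import train_test_split


def split_data(df, target="Outcome", test_size=0.2, seed=42):
    """Separate the features from the target, then split into train and test sets."""
    X = df.drop(columns=[target])
    y = df[target]
    # stratify=y keeps the class ratio the same in both splits.
    return train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y
    )


# Assumed usage, given a DataFrame `df` that holds the dataset:
# X_train, X_test, y_train, y_test = split_data(df)
```

Stratifying on the target is a design choice rather than a requirement; it matters most when one class is much rarer than the other.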

Now, we will fit several machine learning models on this dataset and check the training time taken by these models.
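A simple way to measure training time is to wrap each `fit` call in a timer. The following is a minimal sketch of that idea; the helper name `time_training` and the model settings are assumptions, and the timings it produces will vary from machine to machine.

```python
import time

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def time_training(X_train, y_train):
    """Fit each classifier once and return its training time in seconds."""
    models = {
        "decision_tree": DecisionTreeClassifier(random_state=0),
        "logistic_regression": LogisticRegression(max_iter=1000),
        "random_forest": RandomForestClassifier(random_state=0),
        "svm": SVC(),
    }
    timings = {}
    for name, model in models.items():
        start = time.perf_counter()
        model.fit(X_train, y_train)
        timings[name] = time.perf_counter() - start
    return timings


# Assumed usage, given training data from the split step:
# for name, seconds in time_training(X_train, y_train).items():
#     print(f"{name}: {seconds:.4f} s")
```

A single run is noisy, so repeating the measurement a few times and averaging gives a fairer comparison.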

From the above results, we can conclude that the Decision Tree takes much less time to train than the other algorithms on a small dataset. Hence, for a small dataset it is recommended to use a low bias/high variance classifier like a decision tree.

The dataset is taken from Kaggle; you can find it here. It has information about credit card fraud that occurred over two days. The feature Class is the target variable; it takes the value 1 in case of fraud and 0 otherwise. It has 284807 rows and 31 columns.

As before, we first split the data into train and test sets.

Now we fit the same machine learning models on this second dataset and check the training time taken by each.

With a huge dataset, the depth of the Decision Tree grows: it evaluates many nested if-else conditions, which increases complexity and training time. Both Random Forest and XGBoost are built on decision trees, so they take more time as well. The results show that Logistic Regression outperforms the others in training time.
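The effect of dataset size on tree depth and training time can be explored with a small experiment like the one sketched below. It uses synthetic data rather than the credit card dataset, the helper names `fit_seconds` and `timing_by_size` are hypothetical, and XGBoost is left out to keep the sketch dependent on scikit-learn only.

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier


def fit_seconds(model, X, y):
    """Fit the model once and return the elapsed time in seconds."""
    start = time.perf_counter()
    model.fit(X, y)
    return time.perf_counter() - start


def timing_by_size(sizes=(1_000, 5_000, 20_000), n_features=30, seed=0):
    """For each dataset size, record tree depth and training times."""
    rows = []
    for n_samples in sizes:
        # Synthetic stand-in data; flip_y adds label noise so classes overlap.
        X, y = make_classification(
            n_samples=n_samples, n_features=n_features, flip_y=0.05, random_state=seed
        )
        tree = DecisionTreeClassifier(random_state=seed)
        tree_time = fit_seconds(tree, X, y)
        logreg_time = fit_seconds(LogisticRegression(max_iter=1000), X, y)
        rows.append({
            "n_samples": n_samples,
            "tree_depth": tree.get_depth(),
            "decision_tree_s": tree_time,
            "logistic_regression_s": logreg_time,
        })
    return rows


# Assumed usage:
# for row in timing_by_size():
#     print(row)
```

Reporting the fitted depth next to the timings makes it possible to see whether a deeper tree goes hand in hand with longer training on a given machine.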

This concludes my analysis of how to select the right machine learning algorithm. Furthermore, it is always advisable to apply at least two algorithms to the problem statement, since this provides a good reference point for the audience.