Even though deep learning is the hottest topic in the media, decision trees dominate real-world challenges. Recently, I announced a decision tree based framework – Chefboost. It supports regular decision tree algorithms such as ID3, C4.5, CART and Regression Trees, as well as advanced methods such as Adaboost, Random Forest and Gradient Boosting Trees.
This post aims to show how to use these algorithms in Python with a few lines of code. The mathematical background of the algorithms is out of scope here; you can follow the links if you wonder how they work. They show step-by-step examples.
🙋♂️ You may consider enrolling in my top-rated machine learning course on Udemy
You can also find the YouTube playlist here.
It is lightweight
There are already more powerful gradient boosting frameworks on the market, such as XGBoost and LightGBM. However, they expect you to transform the data set into numerical features and target values. With Chefboost, you just feed the raw data set; the data type does not matter. Features can be nominal or numerical.
Besides, the built decision trees are stored as dedicated Python files consisting of plain if statements. That is why you can easily read, understand and manipulate them.
Framework Installation
The easiest way to install the framework is from PyPI. Running the following command handles the installation.
pip install chefboost
Alternatively, if you do not prefer to install it with pip, the code repository of the framework is pushed to GitHub. All you need to do is run the following command in your command prompt. If you do not feel comfortable with the command line, you can alternatively use the Clone or download and then Download ZIP steps in the repository.
git clone https://github.com/serengil/chefboost.git
Hello, World!
We will create a Dispatcher.py file in the chefboost directory. Both Chefboost.py and Dispatcher.py files must be stored in the same directory.
ID3 Algorithm
ID3 is the oldest and one of the most common decision tree algorithms. It expects nominal features and nominal target values. We can use the dataset/golf.txt file as a data set. It stores previous golf playing decisions based on features such as outlook, temperature, humidity and wind.
import pandas as pd
df = pd.read_csv("dataset/golf.txt")
Building decision trees is handled by the fit command. We just feed the data set and pass the algorithm as a configuration.
from chefboost import Chefboost as chef
config = {'algorithm': 'ID3'}
model = chef.fit(df, config)
Calling the fit command builds the decision rules under the outputs/rules folder. The built tree checks the outlook feature first. For example, we will play golf if the outlook is overcast. We will check humidity when the outlook is sunny: we will play golf when humidity is normal, but not when humidity is high. As seen, it is very easy to read and understand the outcome of a decision tree.
def findDecision(obj): #obj[0]: Outlook, obj[1]: Temp., obj[2]: Humidity, obj[3]: Wind
   if obj[0] == 'Sunny':
      if obj[2] == 'High':
         return 'No'
      elif obj[2] == 'Normal':
         return 'Yes'
   elif obj[0] == 'Rain':
      if obj[3] == 'Weak':
         return 'Yes'
      elif obj[3] == 'Strong':
         return 'No'
   elif obj[0] == 'Overcast':
      return 'Yes'
Making Predictions
We can make predictions once the decision rules are built. We will call the predict function and pass the model built in the previous step along with feature values to make a custom prediction.
feature = ['Sunny','Hot','High','Weak']
prediction = chef.predict(model, feature)
Moreover, we can make a prediction for any item in our data set. The following example retrieves the features of the first item in the training set and makes a prediction.
prediction = chef.predict(model, df.iloc[0])
Obviously, we can build a for loop to make predictions for all items in a data frame.
for index, instance in df.iterrows():
   prediction = chef.predict(model, instance)
The ID3 algorithm uses information gain to find the most dominant feature.
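To make the idea concrete, here is a minimal sketch of entropy and information gain computed with pandas on the golf data set. This is an illustration of the metric, not Chefboost's internal code, and it assumes the target column is named Decision.

import numpy as np
import pandas as pd

def entropy(labels):
    # Shannon entropy of a nominal label column
    probs = labels.value_counts(normalize=True)
    return -(probs * np.log2(probs)).sum()

def information_gain(df, feature, target="Decision"):
    # entropy of the whole set minus the weighted entropy of each branch
    weighted = sum(
        (len(subset) / len(df)) * entropy(subset[target])
        for _, subset in df.groupby(feature)
    )
    return entropy(df[target]) - weighted

df = pd.read_csv("dataset/golf.txt")
# Outlook should come out with the highest gain, which is why the built tree checks it first
print({col: round(information_gain(df, col), 3) for col in df.columns if col != "Decision"})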
C4.5 Algorithm
The C4.5 algorithm can handle numerical feature values, but it still expects nominal target values. The dataset/golf2.txt file is in this form.
df = pd.read_csv("dataset/golf2.txt")
Similar to ID3, we just specify the algorithm to build a decision tree.
config = {'algorithm': 'C4.5'}
model = chef.fit(df.copy(), config)
The C4.5 algorithm transforms numerical features into binary split points. For example, temperature is a numerical value in the raw data set. The decision rule first checks whether the temperature is less than or equal to 83 or greater than 83. In other words, temperature values are transformed into boolean true/false values based on this rule.
def findDecision(obj): #obj[0]: Outlook, obj[1]: Temp., obj[2]: Humidity, obj[3]: Wind
   if obj[1]<=83:
      if obj[0] == 'Rain':
         if obj[3] == 'Weak':
            return 'Yes'
         elif obj[3] == 'Strong':
            return 'No'
      elif obj[0] == 'Sunny':
         if obj[2]>65:
            if obj[3] == 'Weak':
               return 'Yes'
            elif obj[3] == 'Strong':
               return 'Yes'
      elif obj[0] == 'Overcast':
         return 'Yes'
   elif obj[1]>83:
      return 'No'
The C4.5 algorithm uses the gain ratio to find the most dominant feature.
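As a rough sketch of the difference, the gain ratio divides the information gain by the split information of the feature itself, which penalizes features with many distinct values. This reuses the illustrative helpers above and is not the framework's code.

def split_info(df, feature):
    # intrinsic information of the split itself
    probs = df[feature].value_counts(normalize=True)
    return -(probs * np.log2(probs)).sum()

def gain_ratio(df, feature, target="Decision"):
    si = split_info(df, feature)
    return information_gain(df, feature, target) / si if si > 0 else 0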
CART Algorithm
Similar to C4.5, the CART algorithm can handle both numerical and nominal features, but the target value must be nominal. You just specify the algorithm and apply the same steps as in the previous examples.
config = {'algorithm': 'CART'}
The CART algorithm uses the Gini index to find the most dominant feature at every step.
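For illustration, a minimal sketch of the Gini index for a nominal split, in the same style as the entropy helpers above (again assuming the target column is named Decision; not the framework's internal code):

def gini(labels):
    # Gini impurity: 1 minus the sum of squared class probabilities
    probs = labels.value_counts(normalize=True)
    return 1 - (probs ** 2).sum()

def gini_index(df, feature, target="Decision"):
    # weighted impurity of the branches; the lower, the better the split
    return sum(
        (len(subset) / len(df)) * gini(subset[target])
        for _, subset in df.groupby(feature)
    )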
Regression Trees
The previous algorithms can only handle nominal target values. In contrast, target values must be numerical in Regression Trees. The data sets under dataset/golf3.txt and dataset/golf4.txt are in this form. The previous data sets store the golf playing decision as yes or no, but here the number of golf players is stored based on the same features.
df = pd.read_csv("dataset/golf4.txt")
The framework switches to the Regression Tree algorithm even if you specify a different algorithm for a data set with a numerical target value.
config = {'algorithm': 'Regression'}
Decision rules will return numerical values in regression trees.
def findDecision(obj): #obj[0]: Outlook, obj[1]: Temp., obj[2]: Humidity, obj[3]: Wind
   if obj[0] == 'Sunny':
      if obj[1]<=83:
         return 37.75
      elif obj[1]>83:
         return 25
   elif obj[0] == 'Rain':
      if obj[3] == 'Weak':
         return 47.666666666666664
      elif obj[3] == 'Strong':
         return 26.5
   elif obj[0] == 'Overcast':
      return 46.25
Regression trees use standard deviation reduction to find the most dominant feature.
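In other words, the split that reduces the standard deviation of the target the most is chosen. A minimal sketch, again assuming the numerical target column is named Decision and not claiming to be the framework's internal code:

def std_reduction(df, feature, target="Decision"):
    # standard deviation of the target before the split
    # minus the weighted standard deviation of the branches after the split
    weighted = sum(
        (len(subset) / len(df)) * subset[target].std(ddof=0)
        for _, subset in df.groupby(feature)
    )
    return df[target].std(ddof=0) - weighted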
Advanced Methods
We’ve covered the regular decision tree algorithms. The advanced methods use these regular algorithms from a different perspective.
Random Forest
Decision tree algorithms tend to overfit on large-scale data sets.
Imagine a wise person in your company. He knows all the business processes and everyone asks him. On the other hand, in a horizontal organization every employee knows a single business process. Employees can come together and answer every question based on their combined knowledge. You might think of a regular decision tree algorithm as the wise person in your company. The horizontal organization would then be the random forest.
The data set will be separated into sub data sets and we will build a decision tree on each of them. In this case, every decision tree gives its own answer to a question, e.g. yes or no. The class with the most answers becomes the final answer. To avoid ties, the data set is separated into a prime number of sub data sets.
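The voting step itself is simple. Here is a minimal sketch of majority voting over the per-tree answers (illustrative only, not how the framework aggregates internally):

from collections import Counter

def majority_vote(predictions):
    # predictions holds one answer per sub decision tree, e.g. ['Yes', 'No', 'Yes']
    # the class with the most votes becomes the final answer
    return Counter(predictions).most_common(1)[0][0]

print(majority_vote(['Yes', 'No', 'Yes', 'Yes', 'No']))  # Yes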
We can use the data set under dataset/car.data. It has thousands of instances.
df = pd.read_csv("dataset/car.data")
We will set the configuration to enable the random forest improvement. It will build several ID3 decision trees in this case.
config = {'algorithm': 'ID3', 'enableRandomForest': True, 'num_of_trees': 5}
model = chef.fit(df.copy(), config)
In this case, 5 different decision rule files will be created (rules_0.py to rules_4.py). Every decision rule file consists of hundreds of lines. Calling the predict function asks every sub decision tree for its prediction and returns the dominant answer.
prediction = chef.predict(model, ['vhigh','vhigh','2','2','small','low'])
Gradient Boosting
Here, gradient boosted trees are built sequentially. We feed the whole data set and build a regression tree. Based on its errors, we build another regression tree. Based on the errors of the second tree, we build a third one. This goes on for several iterations. The final prediction is the sum of each tree's result.
The error decreases over the iterations and the predictions become more accurate.
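To make the loop concrete, here is a rough sketch of boosting over residuals. It uses scikit-learn's DecisionTreeRegressor and toy numeric data purely for illustration; Chefboost builds its own regression trees and handles nominal features itself.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbm_fit(X, y, epochs=7, learning_rate=1.0):
    # residual: what is still left to explain after the trees built so far
    residual = y.astype(float).copy()
    trees = []
    for _ in range(epochs):
        tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
        trees.append(tree)
        # subtract this tree's (shrunken) contribution from the residuals
        residual = residual - learning_rate * tree.predict(X)
    return trees

def gbm_predict(trees, X, learning_rate=1.0):
    # the final prediction is the sum of every tree's contribution
    return sum(learning_rate * tree.predict(X) for tree in trees)

X = np.array([[85], [80], [83], [70], [68]])
y = np.array([25, 30, 46, 45, 52])
trees = gbm_fit(X, y)
print(gbm_predict(trees, X))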
Notice that gradient boosting can only run on regression problems. If you apply gradient boosting to a classification problem, the framework will first transform the data set into a regression problem and then boost the errors.
config = {'enableGBM': True, 'epochs': 7, 'learning_rate': 1}
The predict function will call the built decision trees and give you the final prediction.
prediction = chef.predict(model, ['Sunny',85,85,'Weak'])
Adaboost
In contrast to Random Forest and Gradient Boosting, Adaboost does not use regular decision tree algorithms. It runs decision stumps, which are one-level decision trees. Think of predicting gender based on height: if one's height is greater than 1.70 meters, predict male, otherwise female. This decision stump will often fail, but it might still reach, say, 51% accuracy. Adaboost creates several decision stumps and assigns a weight to each of them.
config = {'enableAdaboost': True, 'num_of_weak_classifier': 3}
model = chef.fit(pd.read_csv("dataset/adaboost.txt"), config)
prediction = chef.predict(model, [4, 3.5])
The final prediction will be the sum of each decision stump's weight times that stump's prediction.
The basic idea is that weak classifiers come together to become a strong classifier. You might think of it as poor employees coming together to move a heavy rock.
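A minimal sketch of that weighted sum, assuming each stump's prediction is encoded as +1 or -1 (illustrative only, not the framework's internal code):

def adaboost_predict(alphas, stump_predictions):
    # alphas: the weight of each weak classifier
    # stump_predictions: each stump's vote for one instance, encoded as +1 or -1
    weighted_sum = sum(a * p for a, p in zip(alphas, stump_predictions))
    return 1 if weighted_sum > 0 else -1

print(adaboost_predict([0.9, 0.3, 0.5], [1, -1, 1]))  # 1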
Conclusion
So, we’ve covered a lightweight decision tree framework in Python supporting both regular decision tree algorithms, including ID3, C4.5, CART and Regression Trees, and advanced bagging and boosting methods, including Random Forest, Gradient Boosting and Adaboost. You just need to write a few lines of code.
There are many ways to support a project – starring the GitHub repos is one.
Support this blog if you like it!
Dear, thanks for your library. I'm testing it with CART and a data set of 200k rows, and it takes more than 10 minutes to give me a result. Is there a way to make it faster? Does it depend on the number of rows or the number of features?
Thanks!
Currently, the framework creates leaves and branches serially. I plan to make it parallel soon; that should solve the time problem.
I think the issue is in the findDecision() function; it takes time to get data from big data frames. I don't know if a dictionary could resolve it.
Thank you for your work. How can we calculate the errors and the root mean square error?
These error metrics are calculated when you call the fit command.
How can I plot the tree? There is no function for plotting the tree.
The framework does not support plotting the tree. You can just see the if statements.
How can I do post-rule pruning after applying the C4.5 decision tree classifier?
This is not handled in chefboost unfortunately.
How can I do cross validation when using chefboost C4.5?
It is not a boosting algorithm; you cannot do it.