A Gentle Introduction to Chefboost for Applied Machine Learning

Even though deep learning is the hottest topic in the media, decision trees still dominate real-world challenges. Recently, I announced a decision tree based framework called Chefboost. It supports regular decision tree algorithms such as ID3, C4.5, CART and Regression Trees, as well as some advanced methods such as AdaBoost, Random Forest and Gradient Boosting.

This post aims to show how to use these algorithms in Python with just a few lines of code. The mathematical background of the algorithms is out of scope here; if you are curious about it, you can follow the links, which walk through step by step examples.


🙋‍♂️ You may consider enrolling in my top-rated machine learning course on Udemy

Decision Trees for Machine Learning

You can also find the YouTube playlist here.

It is lightweight

There are already very powerful gradient boosting frameworks on the market, such as XGBoost and LightGBM. However, they expect you to transform the data set into numerical features and target values. With Chefboost, you just feed the data set as is; the data type does not matter, as features can be either nominal or numerical.

Besides, the built decision trees are stored as dedicated Python files consisting only of Python if statements. That’s why you can easily read, understand and manipulate the built decision trees.

Framework Installation

The easiest way to install the framework is from PyPI. Just running the following command handles the installation.

pip install chefboost

Alternatively, if you prefer not to install it with pip, the code repository of the framework is hosted on GitHub. All you need to do is run the following command in your command prompt. If you do not feel comfortable on the command line, you can instead follow the Clone or download and then Download ZIP steps in the repository.

git clone https://github.com/serengil/chefboost.git

Hello, World!

We will create a Dispatcher.py file in the chefboost directory. The Chefboost.py and Dispatcher.py files must be stored in the same directory.

ID3 Algorithm

ID3 is the oldest and one of the most common decision tree algorithms. It expects nominal features and nominal target values. We can use the dataset/golf.txt file as a data set. It stores previous golf playing decisions based on features such as outlook, temperature, humidity and wind.

import pandas as pd
df = pd.read_csv("dataset/golf.txt")

Building decision trees is handled by the fit command. We just feed the data set and pass the algorithm as a configuration.

from chefboost import Chefboost as chef
config = {'algorithm': 'ID3'}
model = chef.fit(df, config)

Calling the fit command builds the decision rules under the outputs/rules folder. The built tree checks the outlook feature first. For example, we will play golf if the outlook is overcast. We will check humidity when the outlook is sunny: we will play golf when the outlook is sunny and humidity is normal, whereas we will not play golf when the outlook is sunny and humidity is high. As seen, it is very easy to read and understand the outcomes of decision trees.

def findDecision(obj): #obj[0]: Outlook, obj[1]: Temp., obj[2]: Humidity, obj[3]: Wind
   if obj[0] == 'Sunny':
      if obj[2] == 'High':
         return 'No'
      elif obj[2] == 'Normal':
         return 'Yes'
   elif obj[0] == 'Rain':
      if obj[3] == 'Weak':
         return 'Yes'
      elif obj[3] == 'Strong':
         return 'No'
   elif obj[0] == 'Overcast':
      return 'Yes'
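
Because the generated rules are plain Python, you could even load the file and call findDecision directly. The snippet below is just a hypothetical illustration; the exact file name and path under the outputs/rules folder may differ by Chefboost version, so adjust them to what fit actually created.

import importlib.util

# hypothetical path; check what fit actually created under outputs/rules
spec = importlib.util.spec_from_file_location("rules", "outputs/rules/rules.py")
rules = importlib.util.module_from_spec(spec)
spec.loader.exec_module(rules)

print(rules.findDecision(['Sunny', 'Hot', 'High', 'Weak']))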

Making Predictions

We can make predictions once the decision rules are built. We will call the predict function, passing the model built in the previous step along with feature values, to make a custom prediction.

feature = ['Sunny','Hot','High','Weak']
prediction = chef.predict(model, feature)

Moreover, we can make a prediction for any item in our data set. The following example retrieves the features of the 1st item in the training set and makes a prediction.

prediction = chef.predict(model, df.iloc[0])

Obviously, we can build a for loop to make predictions for all items in a data frame.

for index, instance in df.iterrows():
	prediction = chef.predict(model, instance)
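
If you keep the actual labels around, the same loop can be extended into a quick accuracy check. This is only a sketch; it assumes the target column is named Decision, as in golf.txt.

correct = 0
for index, instance in df.iterrows():
    prediction = chef.predict(model, instance)
    actual = instance['Decision']  # assumes the target column is named Decision
    if prediction == actual:
        correct += 1
print("accuracy:", correct / len(df))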

The ID3 algorithm uses information gain to find the most dominant feature at each step.
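
To illustrate the idea (a simplified sketch, not Chefboost’s internal implementation), information gain for a nominal feature could be computed as follows. It assumes a pandas data frame with a target column named Decision, as in golf.txt.

import numpy as np
import pandas as pd

def entropy(labels):
    # Shannon entropy of a series of class labels
    probs = labels.value_counts(normalize=True)
    return -(probs * np.log2(probs)).sum()

def information_gain(df, feature, target="Decision"):
    # entropy of the whole data set minus the weighted entropy of each branch
    weighted = sum(
        (len(subset) / len(df)) * entropy(subset[target])
        for _, subset in df.groupby(feature)
    )
    return entropy(df[target]) - weighted

df = pd.read_csv("dataset/golf.txt")
gains = {col: information_gain(df, col) for col in df.columns if col != "Decision"}
print(gains)  # the feature with the highest gain becomes the root node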

C4.5 Algorithm

The C4.5 algorithm can handle numerical feature values, but it still expects nominal target values. The dataset/golf2.txt file is in this form.

df = pd.read_csv("dataset/golf2.txt")

Similar to ID3, we just specify the algorithm to build a decision tree.

config = {'algorithm': 'C4.5'}
model = chef.fit(df.copy(), config)

The C4.5 algorithm transforms numerical features into binary split points. For example, temperature is a numerical value in the raw data set. In the first step, the decision rule checks whether the temperature is less than or equal to 83 or greater than 83. Temperature values are effectively transformed into boolean true/false values based on this rule.

def findDecision(obj): #obj[0]: Outlook, obj[1]: Temp., obj[2]: Humidity, obj[3]: Wind
   if obj[1]<=83:
      if obj[0] == 'Rain':
         if obj[3] == 'Weak':
            return 'Yes'
         elif obj[3] == 'Strong':
            return 'No'
      elif obj[0] == 'Sunny':
         if obj[2]>65:
            if obj[3] == 'Weak':
               return 'Yes'
            elif obj[3] == 'Strong':
               return 'Yes'
      elif obj[0] == 'Overcast':
         return 'Yes'
   elif obj[1]>83:
      return 'No'

The C4.5 algorithm uses gain ratio to find the most dominant feature.
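
Roughly speaking, gain ratio divides the information gain by the split information of the feature. The sketch below reuses the numpy import and the entropy and information_gain helpers from the ID3 sketch above; again, it is only an illustration, not Chefboost’s own code, and the binary split point search for numerical features is omitted.

def split_info(df, feature):
    # intrinsic information of the split: entropy of the branch sizes themselves
    ratios = df[feature].value_counts(normalize=True)
    return -(ratios * np.log2(ratios)).sum()

def gain_ratio(df, feature, target="Decision"):
    # gain ratio penalizes features that split the data into many small branches
    return information_gain(df, feature, target) / split_info(df, feature)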

CART Algorithm

Similar to C4.5, the CART algorithm can handle both numerical and nominal features, but the target value must be nominal. You just specify the algorithm and apply the same steps as in the previous examples.

config = {'algorithm': 'CART'}

The CART algorithm uses the Gini index to find the most dominant feature at every step.
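
For illustration (a simplified sketch, not Chefboost’s implementation), the weighted Gini impurity of a nominal feature could be computed as follows; the target column is assumed to be named Decision.

import pandas as pd

def gini(labels):
    # Gini impurity: 1 minus the sum of squared class probabilities
    probs = labels.value_counts(normalize=True)
    return 1 - (probs ** 2).sum()

def weighted_gini(df, feature, target="Decision"):
    # CART prefers the feature (or split) with the lowest weighted Gini impurity
    return sum(
        (len(subset) / len(df)) * gini(subset[target])
        for _, subset in df.groupby(feature)
    )

# e.g. weighted_gini(pd.read_csv("dataset/golf.txt"), "Outlook")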

Regression Trees

The previous algorithms can handle only nominal target values. In regression trees, target values must be numerical. The data sets under dataset/golf3.txt and dataset/golf4.txt are in this form. The previous data sets store the playing golf decision as yes or no, whereas here the number of golf players is stored based on the same features.

df = pd.read_csv("dataset/golf4.txt")

The framework will switch to the regression tree algorithm even if you specify a different algorithm for a data set with a numerical target value.

config = {'algorithm': 'Regression'}

Decision rules will return numerical values in regression trees.

def findDecision(obj): #obj[0]: Outlook, obj[1]: Temp., obj[2]: Humidity, obj[3]: Wind
   if obj[0] == 'Sunny':
      if obj[1]<=83:
         return 37.75
      elif obj[1]>83:
         return 25
   elif obj[0] == 'Rain':
      if obj[3] == 'Weak':
         return 47.666666666666664
      elif obj[3] == 'Strong':
         return 26.5
   elif obj[0] == 'Overcast':
      return 46.25

Regression trees use the reduction in standard deviation to find the most dominant feature.
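
As a rough illustration (not Chefboost’s own code), the standard deviation reduction of a nominal feature could be computed like this, assuming a numerical target column named Decision as in golf4.txt.

import pandas as pd

def std_reduction(df, feature, target="Decision"):
    # drop in the target's standard deviation after splitting on the feature;
    # the feature with the largest reduction is chosen at each node
    weighted = sum(
        (len(subset) / len(df)) * subset[target].std(ddof=0)
        for _, subset in df.groupby(feature)
    )
    return df[target].std(ddof=0) - weighted

# e.g. std_reduction(pd.read_csv("dataset/golf4.txt"), "Outlook")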

Advanced Methods

We’ve covered the regular decision tree algorithms. Advanced methods use these regular algorithms from a different perspective.

Random Forest

Decision tree algorithms tend to overfit on large scale data sets.

Imagine a wise person in your company. He knows all the business processes and everyone asks him. On the other hand, in a horizontal organization every employee knows a single business process, and employees can come together to answer any question based on their combined knowledge. You might think of a regular decision tree algorithm as the wise person in your company; the horizontal organization would then be the random forest.

The data set will be separated into sub data sets and we will build a decision tree on each of them. In this case, every decision tree gives its own answer to a question, e.g. yes or no. The class with the most votes becomes the final answer, as sketched below. To avoid ties, the data set can be separated into a prime number of sub data sets.
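
The voting step itself is simple. Here is a tiny illustrative sketch of the idea (not Chefboost’s internal code); the hypothetical predictions list stands for the answers of the individual trees.

from collections import Counter

def majority_vote(predictions):
    # the class predicted by the most sub trees wins
    return Counter(predictions).most_common(1)[0][0]

print(majority_vote(['Yes', 'No', 'Yes', 'Yes', 'No']))  # prints Yes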

We can use the data set under dataset/car.data. It has thousands of instances.

df = pd.read_csv("dataset/car.data")

We specify that the random forest feature should be enabled. Several ID3 decision trees will be built in this case.

config = {'algorithm': 'ID3', 'enableRandomForest': True, 'num_of_trees': 5}
model = chef.fit(df.copy(), config)

Five different decision rule files will be created (rules_0.py to rules_4.py) in this case, and every file consists of hundreds of lines. Calling the predict function will ask each sub decision tree for its prediction and return the dominant answer.

prediction = chef.predict(model, ['vhigh','vhigh','2','2','small','low'])

Gradient Boosting

Here, gradient boosted trees will create the sage. We feed the whole data set and build a regression tree. Based on its errors, we build another regression tree. Based on the errors of the second regression tree, we build a third one. This repeats several times. The final prediction will be the sum of the outputs of all trees.
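
To make the idea concrete, here is a conceptual sketch of boosting residuals on a toy regression problem. It is illustrative only, not Chefboost’s internal code; the feature values, targets, fit_stump helper and learning rate are all made up for the example.

import numpy as np

features = np.array(['Sunny', 'Sunny', 'Rain', 'Rain', 'Overcast'])
target = np.array([30.0, 36.0, 45.0, 48.0, 46.0])

def fit_stump(x, residuals):
    # stand-in for a regression tree: predict the mean residual per feature value
    means = {value: residuals[x == value].mean() for value in np.unique(x)}
    return lambda x_new: np.array([means[v] for v in x_new])

prediction = np.zeros_like(target)
learning_rate = 0.5
trees = []
for epoch in range(7):
    residuals = target - prediction                   # errors of the current ensemble
    tree = fit_stump(features, residuals)
    trees.append(tree)
    prediction += learning_rate * tree(features)      # each new tree corrects the remaining error
    print(epoch, round(np.abs(residuals).mean(), 2))  # the error shrinks every epoch

# the final prediction is the sum of the (scaled) outputs of all trees
final = sum(learning_rate * tree(features) for tree in trees)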

The error will decrease over the iterations and the predictions will become much more accurate.

Notice that gradient boosting runs only on regression problems. If you apply gradient boosting to a classification problem, the framework will first transform the data set into a regression problem and then boost the errors.

config = {'enableGBM': True, 'epochs': 7, 'learning_rate': 1}

The predict function will call the built decision trees and give you the final prediction.

prediction = chef.predict(model, ['Sunny',85,85,'Weak'])

Adaboost

In contrast to Random Forest and Gradient Boosting, AdaBoost does not use regular decision tree algorithms. It runs decision stumps, which are one-level decision trees. Think of predicting gender based on height: if one’s height is greater than 1.70 meters, predict male, otherwise female. This decision stump will often fail, but it would still have, say, 51% accuracy. AdaBoost creates several decision stumps and assigns a weight to each of them.

config = {'enableAdaboost': True, 'num_of_weak_classifier': 3}
model = chef.fit(pd.read_csv("dataset/adaboost.txt"), config)
prediction = chef.predict(model, [4, 3.5])

The final prediction will be the sum of each decision stump’s weight times that stump’s prediction; the sign of this weighted sum gives the class.
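
As a tiny illustration of that weighted vote (the stump outputs and weights below are made up, not values produced by Chefboost):

import numpy as np

stump_predictions = np.array([1, -1, 1])  # each weak classifier's vote (+1/-1) for one instance
alphas = np.array([0.42, 0.35, 0.23])     # weight assigned to each decision stump

weighted_sum = np.sum(alphas * stump_predictions)
print(np.sign(weighted_sum))  # the sign of the weighted sum is the strong classifier's decision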

The basic idea is that weak classifiers come together to form a strong classifier. You might picture weak workers coming together to move a heavy rock.

Conclusion

So, we’ve covered a lightweight decision tree framework in Python supporting both regular decision tree algorithms, including ID3, C4.5, CART and Regression Trees, and some advanced bagging and boosting methods, including Random Forest, Gradient Boosting and AdaBoost. You just need to write a few lines of code.

There are many ways to support a project; starring the GitHub repo is one.

11 Comments

  1. Dear, thanks for your library. I’m testing it with CART and a data set of 200k rows, and it takes more than 10 minutes to give me a result. Is there a way to make it faster? Does it depend on the number of rows or the number of features?
    Thanks!

    1. Currently, the framework creates leaves and branches serially. I plan to make it parallel soon; that should solve the time problem.

      1. I think the issue is in the findDecision() function; it takes time to fetch data from big data frames. I don’t know if a dictionary could resolve it.

  2. Thank you for your work. How can we calculate the errors and the root mean square error?

    1. The framework does not support plotting the tree. You can just see the if statements.
