Even though deep learning is the hottest topic in the media, decision trees dominate real-world challenges. Recently, I announced a decision tree based framework – Chefboost. It supports regular decision tree algorithms such as ID3, C4.5, CART and Regression Trees, as well as advanced methods such as Adaboost, Random Forest and Gradient Boosting Trees.
This post aims to show how to use these algorithms in Python with a few lines of code. The mathematical background of the algorithms is out of scope here; you can follow the links if you wonder how they work. They show step-by-step examples.
🙋♂️ You may consider enrolling in my top-rated machine learning course on Udemy
You can also find the YouTube playlist here.
It is lightweight
There are already more powerful gradient boosting frameworks on the market, such as XGBoost and LightGBM. However, they expect you to transform the data set into numerical features and target values. With Chefboost, you just feed the raw data set; the data type does not matter. Features can be nominal or numerical.
Besides, the built decision trees are stored as dedicated Python files consisting of plain if statements. That is why you can easily read, understand and manipulate them.
Framework Installation
The easiest way to install the framework is from PyPI. Running the following command handles the installation.
pip install chefboost
Alternatively, if you do not prefer to install it with pip, the code repository of the framework is pushed to GitHub. All you need to do is run the following command in your command prompt. If you do not feel comfortable with the command line, you can alternatively use the Clone or download and then Download ZIP steps in the repository.
git clone https://github.com/serengil/chefboost.git
Hello, World!
We will create a Dispatcher.py file in the chefboost directory. Both Chefboost.py and Dispatcher.py files must be stored in the same directory.
ID3 Algorithm
ID3 is the oldest and one of the most common decision tree algorithms. It expects nominal features and nominal target values. We can use the dataset/golf.txt file as a data set. It stores previous golf playing decisions based on features such as outlook, temperature, humidity and wind.
import pandas as pd
df = pd.read_csv("dataset/golf.txt")
Building decision trees is handled by the fit command. We just feed the data set and pass the algorithm as a configuration.
from chefboost import Chefboost as chef
config = {'algorithm': 'ID3'}
model = chef.fit(df, config)
Calling the fit command builds the decision rules under the outputs/rules folder. The built tree checks the outlook feature first. For example, we will play golf if the outlook is overcast. We will check humidity when the outlook is sunny: we will play golf when humidity is normal, but not when humidity is high. As seen, it is very easy to read and understand the outcome of a decision tree.
def findDecision(obj): #obj[0]: Outlook, obj[1]: Temp., obj[2]: Humidity, obj[3]: Wind
   if obj[0] == 'Sunny':
      if obj[2] == 'High':
         return 'No'
      elif obj[2] == 'Normal':
         return 'Yes'
   elif obj[0] == 'Rain':
      if obj[3] == 'Weak':
         return 'Yes'
      elif obj[3] == 'Strong':
         return 'No'
   elif obj[0] == 'Overcast':
      return 'Yes'
Making Predictions
We can make predictions once the decision rules are built. We will call the predict function and pass the model built in the previous step along with feature values to make a custom prediction.
feature = ['Sunny','Hot','High','Weak']
prediction = chef.predict(model, feature)
Moreover, we can make a prediction for any item in our data set. The following example retrieves the features of the first item in the training set and makes a prediction.
prediction = chef.predict(model, df.iloc[0])
Obviously, we can build a for loop to make predictions for all items in a data frame.
for index, instance in df.iterrows():
   prediction = chef.predict(model, instance)
The ID3 algorithm uses information gain to find the most dominant feature.
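To make the idea concrete, here is a minimal sketch of entropy and information gain computed with pandas on the golf data set. This is an illustration of the metric, not Chefboost's internal code, and it assumes the target column is named Decision.

import numpy as np
import pandas as pd

def entropy(labels):
    # Shannon entropy of a nominal label column
    probs = labels.value_counts(normalize=True)
    return -(probs * np.log2(probs)).sum()

def information_gain(df, feature, target="Decision"):
    # entropy of the whole set minus the weighted entropy of each branch
    weighted = sum(
        (len(subset) / len(df)) * entropy(subset[target])
        for _, subset in df.groupby(feature)
    )
    return entropy(df[target]) - weighted

df = pd.read_csv("dataset/golf.txt")
# Outlook should come out with the highest gain, which is why the built tree checks it first
print({col: round(information_gain(df, col), 3) for col in df.columns if col != "Decision"})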
C4.5 Algorithm
The C4.5 algorithm can handle numerical feature values, but it still expects nominal target values. The dataset/golf2.txt file is in this form.
df = pd.read_csv("dataset/golf2.txt")
Similar to ID3, we just specify the algorithm to build a decision tree.
config = {'algorithm': 'C4.5'}
model = chef.fit(df.copy(), config)
The C4.5 algorithm transforms numerical features into binary split points. For example, temperature is a numerical value in the raw data set. The decision rule first checks whether the temperature is less than or equal to 83 or greater than 83. In other words, temperature values are transformed into boolean true/false values based on this rule.
def findDecision(obj): #obj[0]: Outlook, obj[1]: Temp., obj[2]: Humidity, obj[3]: Wind
   if obj[1]<=83:
      if obj[0] == 'Rain':
         if obj[3] == 'Weak':
            return 'Yes'
         elif obj[3] == 'Strong':
            return 'No'
      elif obj[0] == 'Sunny':
         if obj[2]>65:
            if obj[3] == 'Weak':
               return 'Yes'
            elif obj[3] == 'Strong':
               return 'Yes'
      elif obj[0] == 'Overcast':
         return 'Yes'
   elif obj[1]>83:
      return 'No'
The C4.5 algorithm uses the gain ratio to find the most dominant feature.
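As a rough sketch of the difference, the gain ratio divides the information gain by the split information of the feature itself, which penalizes features with many distinct values. This reuses the illustrative helpers above and is not the framework's code.

def split_info(df, feature):
    # intrinsic information of the split itself
    probs = df[feature].value_counts(normalize=True)
    return -(probs * np.log2(probs)).sum()

def gain_ratio(df, feature, target="Decision"):
    si = split_info(df, feature)
    return information_gain(df, feature, target) / si if si > 0 else 0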
CART Algorithm
Similar to C4.5, the CART algorithm can handle both numerical and nominal features, but the target value must be nominal. You just specify the algorithm and apply the same steps as in the previous examples.
config = {'algorithm': 'CART'}
The CART algorithm uses the Gini index to find the most dominant feature at every step.
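For illustration, a minimal sketch of the Gini index for a nominal split, in the same style as the entropy helpers above (again assuming the target column is named Decision; not the framework's internal code):

def gini(labels):
    # Gini impurity: 1 minus the sum of squared class probabilities
    probs = labels.value_counts(normalize=True)
    return 1 - (probs ** 2).sum()

def gini_index(df, feature, target="Decision"):
    # weighted impurity of the branches; the lower, the better the split
    return sum(
        (len(subset) / len(df)) * gini(subset[target])
        for _, subset in df.groupby(feature)
    )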
Regression Trees
The previous algorithms can only handle nominal target values. In contrast, target values must be numerical in Regression Trees. The data sets under dataset/golf3.txt and dataset/golf4.txt are in this form. The previous data sets store the golf playing decision as yes or no, but here the number of golf players is stored based on the same features.
df = pd.read_csv("dataset/golf4.txt")
The framework switches to the Regression Tree algorithm even if you specify a different algorithm for a data set with a numerical target value.
config = {'algorithm': 'Regression'}
Decision rules will return numerical values in regression trees.
def findDecision(obj): #obj[0]: Outlook, obj[1]: Temp., obj[2]: Humidity, obj[3]: Wind
   if obj[0] == 'Sunny':
      if obj[1]<=83:
         return 37.75
      elif obj[1]>83:
         return 25
   elif obj[0] == 'Rain':
      if obj[3] == 'Weak':
         return 47.666666666666664
      elif obj[3] == 'Strong':
         return 26.5
   elif obj[0] == 'Overcast':
      return 46.25
Regression trees use standard deviation reduction to find the most dominant feature.
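In other words, the split that reduces the standard deviation of the target the most is chosen. A minimal sketch, again assuming the numerical target column is named Decision and not claiming to be the framework's internal code:

def std_reduction(df, feature, target="Decision"):
    # standard deviation of the target before the split
    # minus the weighted standard deviation of the branches after the split
    weighted = sum(
        (len(subset) / len(df)) * subset[target].std(ddof=0)
        for _, subset in df.groupby(feature)
    )
    return df[target].std(ddof=0) - weighted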
Advanced Methods
We’ve covered the regular decision tree algorithms. The advanced methods use these regular algorithms from a different perspective.
Random Forest
Decision tree algorithms tend to overfit on large-scale data sets.
Imagine a wise person in your company. He knows all the business processes and everyone asks him. On the other hand, in a horizontal organization every employee knows a single business process. Employees can come together and answer every question based on their combined knowledge. You might think of a regular decision tree algorithm as the wise person in your company. The horizontal organization would then be the random forest.
The data set will be separated into sub data sets and we will build a decision tree on each of them. In this case, every decision tree gives its own answer to a question, e.g. yes or no. The class with the most answers becomes the final answer. To avoid ties, the data set is separated into a prime number of sub data sets.
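The voting step itself is simple. Here is a minimal sketch of majority voting over the per-tree answers (illustrative only, not how the framework aggregates internally):

from collections import Counter

def majority_vote(predictions):
    # predictions holds one answer per sub decision tree, e.g. ['Yes', 'No', 'Yes']
    # the class with the most votes becomes the final answer
    return Counter(predictions).most_common(1)[0][0]

print(majority_vote(['Yes', 'No', 'Yes', 'Yes', 'No']))  # Yes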
We can use the data set under dataset/car.data. It has thousands of instances.
df = pd.read_csv("dataset/car.data")
We will set the configuration to enable the random forest improvement. It will build several ID3 decision trees in this case.
config = {'algorithm': 'ID3', 'enableRandomForest': True, 'num_of_trees': 5}
model = chef.fit(df.copy(), config)
In this case, 5 different decision rule files will be created (rules_0.py to rules_4.py). Every decision rule file consists of hundreds of lines. Calling the predict function asks every sub decision tree for its prediction and returns the dominant answer.
prediction = chef.predict(model, ['vhigh','vhigh','2','2','small','low'])
Gradient Boosting
Here, gradient boosted trees are built sequentially. We feed the whole data set and build a regression tree. Based on its errors, we build another regression tree. Based on the errors of the second tree, we build a third one. This goes on for several iterations. The final prediction is the sum of each tree's result.
The error decreases over the iterations and the predictions become more accurate.
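To make the loop concrete, here is a rough sketch of boosting over residuals. It uses scikit-learn's DecisionTreeRegressor and toy numeric data purely for illustration; Chefboost builds its own regression trees and handles nominal features itself.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbm_fit(X, y, epochs=7, learning_rate=1.0):
    # residual: what is still left to explain after the trees built so far
    residual = y.astype(float).copy()
    trees = []
    for _ in range(epochs):
        tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
        trees.append(tree)
        # subtract this tree's (shrunken) contribution from the residuals
        residual = residual - learning_rate * tree.predict(X)
    return trees

def gbm_predict(trees, X, learning_rate=1.0):
    # the final prediction is the sum of every tree's contribution
    return sum(learning_rate * tree.predict(X) for tree in trees)

X = np.array([[85], [80], [83], [70], [68]])
y = np.array([25, 30, 46, 45, 52])
trees = gbm_fit(X, y)
print(gbm_predict(trees, X))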
Notice that gradient boosting can only run on regression problems. If you apply gradient boosting to a classification problem, the framework will first transform the data set into a regression problem and then boost the errors.
config = {'enableGBM': True, 'epochs': 7, 'learning_rate': 1}
The predict function will call the built decision trees and give you the final prediction.
prediction = chef.predict(model, ['Sunny',85,85,'Weak'])
Adaboost
In contrast to Random Forest and Gradient Boosting, Adaboost does not use regular decision tree algorithms. It runs decision stumps, which are one-level decision trees. Think of predicting gender based on height: if one's height is greater than 1.70 meters, predict male, otherwise female. This decision stump will often fail, but it might still reach, say, 51% accuracy. Adaboost creates several decision stumps and assigns a weight to each of them.
config = {'enableAdaboost': True, 'num_of_weak_classifier': 3}
model = chef.fit(pd.read_csv("dataset/adaboost.txt"), config)
prediction = chef.predict(model, [4, 3.5])
The final prediction will be the sum of each decision stump's weight times that stump's prediction.
The basic idea is that weak classifiers come together to become a strong classifier. You might think of it as poor employees coming together to move a heavy rock.
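A minimal sketch of that weighted sum, assuming each stump's prediction is encoded as +1 or -1 (illustrative only, not the framework's internal code):

def adaboost_predict(alphas, stump_predictions):
    # alphas: the weight of each weak classifier
    # stump_predictions: each stump's vote for one instance, encoded as +1 or -1
    weighted_sum = sum(a * p for a, p in zip(alphas, stump_predictions))
    return 1 if weighted_sum > 0 else -1

print(adaboost_predict([0.9, 0.3, 0.5], [1, -1, 1]))  # 1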
Conclusion
So, we’ve covered a lightweight decision tree framework in Python supporting both regular decision tree algorithms, including ID3, C4.5, CART and Regression Trees, and advanced bagging and boosting methods, including Random Forest, Gradient Boosting and Adaboost. You just need to write a few lines of code.
There are many ways to support a project – starring the GitHub repos is one.
Support this blog if you like it!
Dear, thanks for your library. I'm testing it with CART and a data set of 200k rows, and it takes more than 10 minutes to give me a result. Is there a way to make it faster? Does it depend on the number of rows or the number of features?
Thanks!
Currently, the framework creates leaves and branches serially. I plan to make it parallel soon; that should solve the time problem.
I think the issue is in the findDecision() function; it takes time to get data from big data frames. I don't know if a dictionary could resolve it.
Thank you for your work. How can we calculate the errors and the root mean square error?
These error metrics are calculated when you call the fit command.
How can I plot the tree? There is no function for plotting the tree.
The framework does not support plotting the tree. You can just see the if statements.
How can I do post-rule pruning after applying the C4.5 decision tree classifier?
This is not handled in chefboost unfortunately.
How can I do cross validation when using chefboost C4.5?
It is not a boosting algorithm; you cannot do it.