A Step by Step CHAID Decision Tree Example

CHAID is one of the oldest decision tree algorithms. It was proposed in 1980 by Gordon V. Kass, whereas CART followed in 1984, ID3 was proposed in 1986 and C4.5 was announced in 1993. CHAID is an acronym for chi-square automatic interaction detection. Here, chi-square is a metric used to measure the significance of a feature: the higher the value, the higher the statistical significance. Similar to the others, CHAID builds decision trees for classification problems. This means that it expects data sets with a categorical target variable.

[Figure: Living trees in the Lord of the Rings (2001)]

Vlog

Here, you should watch the following video to understand how decision tree algorithms work. No matter which decision tree algorithm you are running (ID3, C4.5, CART, CHAID or regression trees), they all look for the most dominant feature at each step. Then, they add a decision rule for that feature and recursively build another decision tree for the resulting sub data set until they reach a decision.



Besides, regular decision tree algorithms are designed to create branches for categorical features. Still, we are able to build trees with continuous and numerical features. The trick here is that we convert continuous features into categorical ones: we split the numerical feature at the point that offers the highest information gain.
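To make that idea concrete, here is a small self-contained sketch (not taken from any library; the numeric values and labels are made up for illustration) that tries every candidate threshold on a numeric column and keeps the split with the highest information gain:

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    # try the midpoint between every pair of consecutive unique values and
    # keep the binary split that yields the highest information gain
    base = entropy(labels)
    best = (None, -1.0)
    uniq = sorted(set(values))
    for lo, hi in zip(uniq, uniq[1:]):
        t = (lo + hi) / 2
        left = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        gain = base - (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if gain > best[1]:
            best = (t, gain)
    return best

# hypothetical numeric temperatures with yes/no decisions, purely for illustration
print(best_threshold([64, 65, 68, 70, 72, 80, 83, 85],
                     ["Yes", "No", "Yes", "Yes", "No", "No", "Yes", "No"]))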

CHAID in Python

This blog post explains the CHAID algorithm in depth and solves a problem step by step. On the other hand, you might just want to run the CHAID algorithm, and its mathematical background might not attract your attention.

Herein, you can find a Python implementation of the CHAID algorithm in the chefboost package. It supports the most common decision tree algorithms such as ID3, C4.5, CART and regression trees, as well as some bagging methods such as random forest and some boosting methods such as gradient boosting and adaboost.

[Figure: CHAID in chefboost for Python]

Here, you can find a hands-on video as well.
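If you just want to run it, the following is a minimal sketch of training a CHAID tree with chefboost. The file name golf.csv is a hypothetical placeholder for a CSV holding this post's data set; chefboost expects the target column to be named Decision.

import pandas as pd
from chefboost import Chefboost as cb

# golf.csv is a hypothetical file with Outlook, Temp., Humidity, Wind and Decision columns
df = pd.read_csv("golf.csv")
config = {"algorithm": "CHAID"}  # other supported values include "ID3", "C4.5", "CART" and "Regression"
model = cb.fit(df, config)       # builds the CHAID tree on the data frame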

Objective

Decision rules will be found based on chi-square values of features.

Formula

CHAID uses chi-square tests to find the most dominant feature, whereas ID3 uses information gain, C4.5 uses gain ratio and CART uses the GINI index. Chi-square testing was introduced by Karl Pearson, who also developed the Pearson correlation coefficient. Today, most programming libraries (e.g. Pandas for Python) use the Pearson metric for correlation by default.

The formula of chi-square testing is easy.

√((y – y’)² / y’)

where y is the actual value and y’ is the expected value.
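To make the notation concrete, the per-cell value can be computed with a tiny helper like this (a sketch, not part of any library):

import math

def chi_value(actual, expected):
    # per-cell term used throughout this post: sqrt((y - y')^2 / y')
    return math.sqrt((actual - expected) ** 2 / expected)

print(round(chi_value(2, 2.5), 3))  # 0.316, the sunny/yes cell computed in the next section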

Data set

We are going to build decision rules for the following data set. The decision column is the target we would like to predict based on the other features.

BTW, we will ignore the day column because it just states the row number.

Day Outlook Temp. Humidity Wind Decision
1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
3 Overcast Hot High Weak Yes
4 Rain Mild High Weak Yes
5 Rain Cool Normal Weak Yes
6 Rain Cool Normal Strong No
7 Overcast Cool Normal Strong Yes
8 Sunny Mild High Weak No
9 Sunny Cool Normal Weak Yes
10 Rain Mild Normal Weak Yes
11 Sunny Mild Normal Strong Yes
12 Overcast Mild High Strong Yes
13 Overcast Hot Normal Weak Yes
14 Rain Mild High Strong No

We need to find the most dominant feature in this data set.
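If you would like to follow the calculations below in Python, here is the same data set as a pandas data frame (just a convenience for the reader; the day column is dropped since it only states the row number):

import pandas as pd

# the play-tennis data set from the table above
data = [
    ("Sunny", "Hot", "High", "Weak", "No"),          ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),      ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),       ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),      ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),    ("Rain", "Mild", "High", "Strong", "No"),
]
df = pd.DataFrame(data, columns=["Outlook", "Temp.", "Humidity", "Wind", "Decision"])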

Outlook feature

Outlook feature has 3 classes: sunny, rain and overcast. There are 2 decisions: yes and no. We firstly find the number of yes and no decisions for each class.

Yes No Total Expected Chi-square Yes Chi-square No
Sunny 2 3 5 2.5 0.316 0.316
Overcast 4 0 4 2 1.414 1.414
Rain 3 2 5 2.5 0.316 0.316

The total column is the sum of yes and no decisions for each row. The expected values are half of the total column because there are 2 classes in the decision. It is easy to calculate the chi-square values based on this table.

For example, chi-square yes for sunny outlook is √((2 – 2.5)² / 2.5) = 0.316, where the actual value is 2 and the expected value is 2.5.

Chi-square value of outlook is the sum of chi-square yes and no columns.

0.316 + 0.316 + 1.414 + 1.414 + 0.316 + 0.316 = 4.092

Now, we will find chi-square values for other features. The feature having the maximum chi-square value will be the decision point.

Temperature feature

This feature has 3 classes: hot, mild and cool. The following table summarizes the chi-square values for these classes.

Yes No Total Expected Chi-square Yes Chi-square No
Hot 2 2 4 2 0 0
Mild 4 2 6 3 0.577 0.577
Cool 3 1 4 2 0.707 0.707

Chi-square value of temperature feature will be

0 + 0 + 0.577 + 0.577 + 0.707 + 0.707 = 2.569

This is a value less than the chi-square value of outlook. This means that the feature outlook is more important than the feature temperature based on chi-square testing.

Humidity feature

Humidity has 2 classes: high and normal. Let’s summarize the chi-square values.

Yes No Total Expected Chi-square Yes Chi-square No
High 3 4 7 3.5 0.267 0.267
Normal 6 1 7 3.5 1.336 1.336

So, the chi-square value of humidity feature is

0.267 + 0.267 + 1.336 + 1.336 = 3.207

This is less than the chi-square value of outlook as well. What about wind feature?

Wind feature

Wind feature has 2 classes: weak and strong. The following table is the pivot table.

Yes No Total Expected Chi-square Yes Chi-square No
Weak 6 2 8 4 1 1
Strong 3 3 6 3 0.000 0.000

Herein, the chi-square value of the wind feature is

1 + 1 + 0 + 0 = 2

We’ve found the chi-square values of all features. Let’s see them all in a table.

Feature Chi-square value
Outlook 4.092
Temperature 2.569
Humidity 3.207
Wind 2

As seen, the outlook feature has the highest chi-square value. This means that it is the most significant feature, so we will put it at the root node.
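For readers following along in Python, the small helper below (my own sketch, not chefboost internals) reproduces this summary table from the data frame built earlier:

import math

def feature_chi_square(frame, feature, target="Decision"):
    # sum of the per-cell values sqrt((actual - expected)^2 / expected) over every class of the feature
    total = 0.0
    classes = frame[target].unique()
    for _, group in frame.groupby(feature):
        expected = len(group) / len(classes)  # e.g. half of the row total when there are 2 decision classes
        for cls in classes:
            actual = (group[target] == cls).sum()
            total += math.sqrt((actual - expected) ** 2 / expected)
    return total

for feature in ["Outlook", "Temp.", "Humidity", "Wind"]:
    print(feature, round(feature_chi_square(df, feature), 3))
# prints roughly: Outlook 4.093, Temp. 2.569, Humidity 3.207, Wind 2.0
# (Outlook differs from 4.092 above only because the post rounds 0.316 to three digits before summing)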

[Figure: Initial form of the CHAID tree]

We’ve filtered the raw data set based on the outlook classes in the illustration above. For example, the overcast branch has only yes decisions in its sub data set. This means that the CHAID tree returns YES if outlook is overcast.

Both the sunny and rain branches have yes and no decisions. We will apply chi-square tests to these sub data sets.

Outlook = Sunny branch

This branch has 5 instances. Now, we look for the most dominant feature. BTW, we will ignore the outlook column now because its values are all the same. In other words, we will find the most dominant feature among temperature, humidity and wind.

Day Outlook Temp. Humidity Wind Decision
1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
8 Sunny Mild High Weak No
9 Sunny Cool Normal Weak Yes
11 Sunny Mild Normal Strong Yes

Temperature feature for sunny outlook

Yes No Total Expected Chi-square Yes Chi-square No
Hot 0 2 2 1 1 1
Mild 1 1 2 1 0 0
Cool 1 0 1 0.5 0.707 0.707

So, chi-square value of temperature feature for sunny outlook is

1 + 1 + 0 + 0 + 0.707 + 0.707 = 3.414

Humidity feature for sunny outlook

Yes No Total Expected Chi-square Yes Chi-square No
High 0 3 3 1.5 1.225 1.225
Normal 2 0 2 1 1 1

Chi-square value of humidity feature for sunny outlook is

1.225 + 1.225 + 1 + 1 = 4.449

Wind feature for sunny outlook

Yes No Total Expected Chi-square Yes Chi-square No
Weak 1 2 3 1.5 0.408 0.408
Strong 1 1 2 1 0 0

Chi-square value of wind feature for sunny outlook is

0.408 + 0.408 + 0 + 0 = 0.816

We’ve found chi-square values for sunny outlook. Let’s see them all in a table.

Feature Chi-square
Temperature 3.414
Humidity 4.449
Wind 0.816

Now, humidity is the most dominant feature for the sunny outlook branch. We will put this feature as a decision rule.
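Continuing the Python sketch from earlier, filtering the data frame to the sunny instances and reusing the same helper reproduces these values:

# df and feature_chi_square come from the earlier snippets
sunny = df[df["Outlook"] == "Sunny"]
for feature in ["Temp.", "Humidity", "Wind"]:
    print(feature, round(feature_chi_square(sunny, feature), 3))
# Temp. 3.414, Humidity 4.449, Wind 0.816 -> humidity wins on this branch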

[Figure: The second phase of the CHAID tree]

Now, both humidity branches for sunny outlook have just one decision each, as illustrated above. The CHAID tree will return NO for sunny outlook and high humidity, and it will return YES for sunny outlook and normal humidity.

Rain outlook branch

This branch still has both yes and no decisions. We need to apply the chi-square test to this branch to reach exact decisions. It has 5 instances as shown in the following sub data set. Let’s find the most dominant feature among temperature, humidity and wind.

Day Outlook Temp. Humidity Wind Decision
4 Rain Mild High Weak Yes
5 Rain Cool Normal Weak Yes
6 Rain Cool Normal Strong No
10 Rain Mild Normal Weak Yes
14 Rain Mild High Strong No

Temperature feature for rain outlook

This feature has 2 classes: mild and cool. Notice that even though hot temperature appears in the raw data set, this branch has no hot instance.

Yes No Total Expected Chi-square Yes Chi-square No
Mild 2 1 3 1.5 0.408 0.408
Cool 1 1 2 1 0 0

Chi-square value of temperature feature for rain outlook is

0.408 + 0.408 + 0 + 0 = 0.816

Humidity feature for rain outlook

This feature in this branch has 2 classes: high and normal.

Yes No Total Expected Chi-square Yes Chi-square No
High 1 1 2 1 0 0
Normal 2 1 3 1.5 0.408 0.408

Chi-square value of humidity feature for rain outlook is

0 + 0 + 0.408 + 0.408 = 0.816

Wind feature for rain outlook

This feature in this branch has 2 classes: weak and strong.

Yes No Total Expected Chi-square Yes Chi-square No
Weak 3 0 3 1.5 1.225 1.225
Strong 0 2 2 1 1 1

So, the chi-square value of the wind feature for rain outlook is

1.225 + 1.225 + 1 + 1 = 4.449

We’ve found all chi-square values for the rain outlook branch. Let’s see them all in a single table.

Feature Chi-squared
Temperature 0.816
Humidity 0.816
Wind 4.449

So, wind feature is the winner for rain outlook branch. Put this feature in the related branch and see the corresponding sub data sets.
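The same quick check, reusing the data frame and helper from the earlier snippets, confirms these values for the rain branch:

# df and feature_chi_square come from the earlier snippets
rain = df[df["Outlook"] == "Rain"]
for feature in ["Temp.", "Humidity", "Wind"]:
    print(feature, round(feature_chi_square(rain, feature), 3))
# Temp. 0.816, Humidity 0.816, Wind 4.449 -> wind wins on this branch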

[Figure: The third phase of the CHAID tree]

As seen, all branches now have sub data sets with a single decision. So, we can build the CHAID tree as illustrated below.

[Figure: The final form of the CHAID tree]
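For reference, here is the final tree transcribed into plain Python rules (my own transcription of the illustration above; chefboost, for instance, exports its trained trees as similar nested if statements):

def predict(outlook, humidity, wind):
    # transcription of the final CHAID tree above
    if outlook == "Overcast":
        return "Yes"
    if outlook == "Sunny":
        return "Yes" if humidity == "Normal" else "No"
    if outlook == "Rain":
        return "Yes" if wind == "Weak" else "No"

print(predict("Sunny", "High", "Weak"))  # No, matching day 1 of the data set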

Feature importance

Decision trees are naturally explainable and interpretable algorithms. Besides, they let us compute feature importance values to understand the built model better.

Gradient Boosting Decision Trees

Nowadays, gradient boosting decision trees are very popular in the machine learning community. They are actually not different from the core decision tree algorithms mentioned in this blog post. They mainly build sequential decision trees based on the errors made in the previous round.

Random Forest vs Gradient Boosting

Both random forest and gradient boosting are ensemble approaches rather than core decision tree algorithms themselves. They require a core decision tree algorithm to run, and they build many decision trees in the background. So, we will discuss how they are similar and how they differ in the following video.

Adaboost vs Gradient Boosting

Both gradient boosting and adaboost are boosting techniques for decision tree based machine learning models. We will discuss how they are similar and how they differ from each other.

Conclusion

So, we’ve built a CHAID decision tree step by step in this post. CHAID uses the chi-square metric to find the most dominant feature and applies this procedure recursively until the sub data sets contain a single decision. Even though it is a legacy decision tree algorithm, it is still a common approach for classification problems.




11 Comments

  1. there is an error :
    Wind feature
    Wind feature has 2 classes: weak and strong. The following table is the pivot table.
    Yes No Total Expected Chi-square Yes Chi-square No
    Weak 5 2 7 3.5 0.802 0.802
    Strong 3 3 6 3 0.000 0.000
    the correct figures for weak : 6 yes 2 No and 8 for total, consequently chi is higher

  2. Really cleared everything up and helped me in my studies! Thank you very much!
    P.S.: From my understanding, CHAID isn’t actually the oldest decision tree learning algorithm, although it’s among the oldest. AID (Automatic Interaction Detector) and THAID (THeta Automatic Interaction Detector) were published in the ’70s.

  3. import pandas as pd
    from chefboost import Chefboost as cb

    df = pd.read_csv('exECLAT.csv')
    config = {'algorithm': 'CHAID'}
    CHAIDtree = cb.fit(df, config)

    what is Error why ??
    assert group is None, ‘group argument must be None for now’
    AssertionError: group argument must be None for now

    A,B,C,D,E,F,Decision
    2,2,0,0,0,1,1
    1,3,0,1,0,3,0
    3,3,3,3,3,3,0

  4. Greeting,
    How can we change the code in the case that the name of the target variable is not “Decision”?
    I get this error
    Please confirm that name of the target column is “Decision” and it is put to the right in pandas data frame

  5. Grt work. In case we have a huge data set ~5L cases to study and run the CHAID it is taking a lot of time (more than 5-6 hrs). Is this common? Could you please help on this.

  6. Nota bene – This is not the complete, original CHAID algorithm. ID3, for example, will pick a value Outlook because Overcast was really good but Rainy and Sunny may not be particularly good predictors. However, in ID3, we are forced to split the data on all three feature values, which may diffuse the “signal” in the data.

    What CHAID had, however, was a test to see if all the values of a feature (“categories” is what Kass called them) were good in their own right. If they weren’t, they were merged. This allows the introduction of OR. So you might have Overcast as one split and Rainy OR Sunny as a second split. (p. 121 of his paper, Steps 2 and 3 are missing here). You can see the results of CHAID on the tree of page 126. At the first layer, note the merger of categories 3 and 4.

    Otherwise, you simply have ID3/C4.5 with Chi-square instead of Entropy/Information Gain as your metric but it’s not the original CHAID algorithm.
