CHAID is one of the oldest decision tree algorithms. It was proposed in 1980 by Gordon V. Kass. CART was introduced in 1984, ID3 was proposed in 1986 and C4.5 was announced in 1993. CHAID is an acronym for chi-square automatic interaction detection. Here, chi-square is a metric used to measure the significance of a feature: the higher the value, the higher the statistical significance. Like the others, CHAID builds decision trees for classification problems, which means it expects data sets with a categorical target variable.
Vlog
Here, you should watch the following video to understand how decision tree algorithms work. No matter which decision tree algorithm you are running, whether ID3, C4.5, CART, CHAID or regression trees, they all look for the feature offering the highest gain under their own metric. Then, they add a decision rule for that feature and recursively build another decision tree for the sub data set until they reach a decision.
You may consider enrolling in my top-rated machine learning course on Udemy.
Besides, regular decision tree algorithms are designed to create branches for categorical features. Still, we are able to build trees with continuous and numerical features. The trick here is to convert continuous features into categorical ones: we split the numerical feature at the point where it offers the highest gain, as sketched below.
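As a minimal sketch of that idea, the helper below assumes a score_split(split, labels) callback that rates a candidate binary split with whatever metric the tree uses (chi-square here, information gain for ID3); it tries the midpoints between consecutive sorted values and keeps the best-scoring threshold.

```python
# Hypothetical sketch: pick the threshold that scores best under the given metric.
def best_threshold(values, labels, score_split):
    points = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(points, points[1:])]
    # each candidate threshold induces a binary (True/False) split of the rows
    scored = [(score_split([v > t for v in values], labels), t) for t in candidates]
    return max(scored)[1]
```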
CHAID in Python
This blog post explains the CHAID algorithm in depth, and we will solve a problem step by step. On the other hand, you might just want to run the CHAID algorithm without diving into its mathematical background.
Herein, you can find a Python implementation of the CHAID algorithm. This package supports the most common decision tree algorithms such as ID3, C4.5, CART and regression trees, as well as some bagging methods such as random forest and some boosting methods such as gradient boosting and adaboost.
Here, you can find a hands-on video as well.
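For reference, a minimal usage sketch might look like the following. It assumes the chefboost package linked above is installed, and golf.csv is just a placeholder file name for a data set whose target column is named Decision.

```python
import pandas as pd
from chefboost import Chefboost as cb

df = pd.read_csv("golf.csv")        # placeholder: a data set like the one in this post
config = {"algorithm": "CHAID"}     # ID3, C4.5, CART and regression trees are also supported
model = cb.fit(df, config)
```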
Objective
Decision rules will be found based on chi-square values of features.
Formula
CHAID uses chi-square tests to find the most dominant feature, whereas ID3 uses information gain, C4.5 uses gain ratio and CART uses the GINI index. Chi-square testing was introduced by Karl Pearson, who also developed the Pearson correlation coefficient. Today, most programming libraries (e.g. Pandas for Python) use the Pearson metric for correlation by default.
The formula of chi-square testing is easy.
√((y − y')² / y')
where y is the actual value and y' is the expected value.
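As a tiny Python helper, that cell value looks like this; the chi-square value of a feature in this post is the sum of these cell values over its classes and decisions (a sketch of the post's own convention, not a library function).

```python
import math

# sqrt((y - y')^2 / y'), where y is the actual count and y' is the expected count
def chi_cell(actual, expected):
    return math.sqrt((actual - expected) ** 2 / expected)

print(round(chi_cell(2, 2.5), 3))  # 0.316, the sunny/yes cell worked below
```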
Data set
We are going to build decision rules for the following data set. The decision column is the target we want to predict based on the other features.
BTW, we will ignore the day column because it just states the row number.
Day | Outlook | Temp. | Humidity | Wind | Decision |
---|---|---|---|---|---|
1 | Sunny | Hot | High | Weak | No |
2 | Sunny | Hot | High | Strong | No |
3 | Overcast | Hot | High | Weak | Yes |
4 | Rain | Mild | High | Weak | Yes |
5 | Rain | Cool | Normal | Weak | Yes |
6 | Rain | Cool | Normal | Strong | No |
7 | Overcast | Cool | Normal | Strong | Yes |
8 | Sunny | Mild | High | Weak | No |
9 | Sunny | Cool | Normal | Weak | Yes |
10 | Rain | Mild | Normal | Weak | Yes |
11 | Sunny | Mild | Normal | Strong | Yes |
12 | Overcast | Mild | High | Strong | Yes |
13 | Overcast | Hot | Normal | Weak | Yes |
14 | Rain | Mild | High | Strong | No |
We need to find the most dominant feature in this data set.
Outlook feature
The outlook feature has 3 classes: sunny, rain and overcast. There are 2 decisions: yes and no. We first find the number of yes and no decisions for each class.
Outlook | Yes | No | Total | Expected | Chi-square Yes | Chi-square No |
---|---|---|---|---|---|---|
Sunny | 2 | 3 | 5 | 2.5 | 0.316 | 0.316 |
Overcast | 4 | 0 | 4 | 2 | 1.414 | 1.414 |
Rain | 3 | 2 | 5 | 2.5 | 0.316 | 0.316 |
The total column is the sum of the yes and no decisions for each row. The expected values are half of the total because there are 2 classes in the decision. It is easy to calculate the chi-square values based on this table.
For example, the chi-square yes value for the sunny outlook is √((2 − 2.5)² / 2.5) = 0.316, where the actual count is 2 and the expected count is 2.5.
Chi-square value of outlook is the sum of chi-square yes and no columns.
0.316 + 0.316 + 1.414 + 1.414 + 0.316 + 0.316 = 4.092
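The same sum can be checked with a couple of lines of plain Python. The exact result is 4.093; the line above reads 4.092 only because it sums cell values that were already rounded to three digits.

```python
# Each cell is sqrt((actual - expected)^2 / expected); the feature value is their sum.
cells = [
    (2, 2.5), (3, 2.5),  # sunny: yes, no
    (4, 2.0), (0, 2.0),  # overcast: yes, no
    (3, 2.5), (2, 2.5),  # rain: yes, no
]
print(round(sum(((y - e) ** 2 / e) ** 0.5 for y, e in cells), 3))  # 4.093
```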
Now, we will find chi-square values for other features. The feature having the maximum chi-square value will be the decision point.
Temperature feature
This feature has 3 classes: hot, mild and cool. The following table summarizes the chi-square values for these classes.
Temperature | Yes | No | Total | Expected | Chi-square Yes | Chi-square No |
---|---|---|---|---|---|---|
Hot | 2 | 2 | 4 | 2 | 0 | 0 |
Mild | 4 | 2 | 6 | 3 | 0.577 | 0.577 |
Cool | 3 | 1 | 4 | 2 | 0.707 | 0.707 |
The chi-square value of the temperature feature is
0 + 0 + 0.577 + 0.577 + 0.707 + 0.707 = 2.569
This is less than the chi-square value of outlook, which means that the outlook feature is more important than the temperature feature based on chi-square testing.
Humidity feature
Humidity has 2 classes: high and normal. Let’s summarize the chi-square values.
Humidity | Yes | No | Total | Expected | Chi-square Yes | Chi-square No |
---|---|---|---|---|---|---|
High | 3 | 4 | 7 | 3.5 | 0.267 | 0.267 |
Normal | 6 | 1 | 7 | 3.5 | 1.336 | 1.336 |
So, the chi-square value of the humidity feature is
0.267 + 0.267 + 1.336 + 1.336 = 3.207
This is less than the chi-square value of outlook as well. What about the wind feature?
Wind feature
Wind feature has 2 classes: weak and strong. The following table is the pivot table.
Wind | Yes | No | Total | Expected | Chi-square Yes | Chi-square No |
---|---|---|---|---|---|---|
Weak | 6 | 2 | 8 | 4 | 1.000 | 1.000 |
Strong | 3 | 3 | 6 | 3 | 0.000 | 0.000 |
Herein, the chi-square value of the wind feature is
1 + 1 + 0 + 0 = 2
We've found the chi-square values of all features. Let's see them all in a table.
Feature | Chi-square value |
---|---|
Outlook | 4.092 |
Temperature | 2.569 |
Humidity | 3.207 |
Wind | 2 |
As seen, the outlook feature has the highest chi-square value. This means that it is the most significant feature. So, we will put this feature at the root node.
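To double check these numbers, here is a short pandas sketch (a hand-rolled reproduction, not the library's code) that recomputes each feature's chi-square value from the raw data set with the cell formula above. The outlook score comes out as 4.093 rather than 4.092 only because the hand-worked table sums rounded cell values.

```python
import pandas as pd

# The play-golf data set from the table above.
data = {
    "Outlook":  ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
                 "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"],
    "Temp.":    ["Hot", "Hot", "Hot", "Mild", "Cool", "Cool", "Cool",
                 "Mild", "Cool", "Mild", "Mild", "Mild", "Hot", "Mild"],
    "Humidity": ["High", "High", "High", "High", "Normal", "Normal", "Normal",
                 "High", "Normal", "Normal", "Normal", "High", "Normal", "High"],
    "Wind":     ["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong",
                 "Weak", "Weak", "Weak", "Strong", "Strong", "Weak", "Strong"],
    "Decision": ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
                 "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"],
}
df = pd.DataFrame(data)

def chi_square_of_feature(frame, feature, target="Decision"):
    # Sum sqrt((actual - expected)^2 / expected) over every class/decision cell.
    labels = frame[target].unique()
    score = 0.0
    for _, group in frame.groupby(feature):
        expected = len(group) / len(labels)  # e.g. half of the class total for 2 decisions
        for label in labels:
            actual = (group[target] == label).sum()
            score += ((actual - expected) ** 2 / expected) ** 0.5
    return score

for feature in ["Outlook", "Temp.", "Humidity", "Wind"]:
    print(feature, round(chi_square_of_feature(df, feature), 3))
# Outlook 4.093, Temp. 2.569, Humidity 3.207, Wind 2.0
```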
We've filtered the raw data set based on the outlook classes in the illustration above. For example, the overcast branch has only yes decisions in its sub data set. This means that the CHAID tree returns YES if the outlook is overcast.
Both the sunny and rain branches have yes and no decisions. We will apply chi-square tests to these sub data sets.
Outlook = Sunny branch
This branch has 5 instances. Now, we look for the most dominant feature. BTW, we will ignore the outlook column now because its values are all the same. In other words, we will find the most dominant feature among temperature, humidity and wind.
Day | Outlook | Temp. | Humidity | Wind | Decision |
---|---|---|---|---|---|
1 | Sunny | Hot | High | Weak | No |
2 | Sunny | Hot | High | Strong | No |
8 | Sunny | Mild | High | Weak | No |
9 | Sunny | Cool | Normal | Weak | Yes |
11 | Sunny | Mild | Normal | Strong | Yes |
Temperature feature for sunny outlook
Temperature | Yes | No | Total | Expected | Chi-square Yes | Chi-square No |
---|---|---|---|---|---|---|
Hot | 0 | 2 | 2 | 1 | 1 | 1 |
Mild | 1 | 1 | 2 | 1 | 0 | 0 |
Cool | 1 | 0 | 1 | 0.5 | 0.707 | 0.707 |
So, the chi-square value of the temperature feature for the sunny outlook is
1 + 1 + 0 + 0 + 0.707 + 0.707 = 3.414
Humidity feature for sunny outlook
Humidity | Yes | No | Total | Expected | Chi-square Yes | Chi-square No |
---|---|---|---|---|---|---|
High | 0 | 3 | 3 | 1.5 | 1.225 | 1.225 |
Normal | 2 | 0 | 2 | 1 | 1 | 1 |
The chi-square value of the humidity feature for the sunny outlook is
1.225 + 1.225 + 1 + 1 = 4.449
Wind feature for sunny outlook
Wind | Yes | No | Total | Expected | Chi-square Yes | Chi-square No |
---|---|---|---|---|---|---|
Weak | 1 | 2 | 3 | 1.5 | 0.408 | 0.408 |
Strong | 1 | 1 | 2 | 1 | 0 | 0 |
The chi-square value of the wind feature for the sunny outlook is
0.408 + 0.408 + 0 + 0 = 0.816
We've found the chi-square values for the sunny outlook branch. Let's see them all in a table.
Feature | Chi-square value |
---|---|
Temperature | 3.414 |
Humidity | 4.449 |
Wind | 0.816 |
Now, humidity is the most dominant feature for the sunny outlook branch. We will use this feature as the decision rule for this branch.
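Assuming the df and chi_square_of_feature helper from the earlier sketch are still defined, the same loop reproduces this branch; swapping "Sunny" for "Rain" reproduces the rain outlook branch further below in exactly the same way.

```python
# Score the remaining features on the sunny sub data set only.
sunny = df[df["Outlook"] == "Sunny"]
for feature in ["Temp.", "Humidity", "Wind"]:
    print(feature, round(chi_square_of_feature(sunny, feature), 3))
# Temp. 3.414, Humidity 4.449, Wind 0.816
```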
Now, both humidity branches for the sunny outlook have just one decision each, as illustrated above. The CHAID tree will return NO for sunny outlook and high humidity, and it will return YES for sunny outlook and normal humidity.
Rain outlook branch
This branch still has both yes and no decisions. We need to apply the chi-square test to this branch to find exact decisions. It has 5 instances, as shown in the following sub data set. Let's find the most dominant feature among temperature, humidity and wind.
Day | Outlook | Temp. | Humidity | Wind | Decision |
---|---|---|---|---|---|
4 | Rain | Mild | High | Weak | Yes |
5 | Rain | Cool | Normal | Weak | Yes |
6 | Rain | Cool | Normal | Strong | No |
10 | Rain | Mild | Normal | Weak | Yes |
14 | Rain | Mild | High | Strong | No |
Temperature feature for rain outlook
This feature has 2 classes: mild and cool. Notice that even though hot temperature appears in the raw data set, this branch has no hot instance.
Temperature | Yes | No | Total | Expected | Chi-square Yes | Chi-square No |
---|---|---|---|---|---|---|
Mild | 2 | 1 | 3 | 1.5 | 0.408 | 0.408 |
Cool | 1 | 1 | 2 | 1 | 0 | 0 |
The chi-square value of the temperature feature for the rain outlook is
0.408 + 0.408 + 0 + 0 = 0.816
Humidity feature for rain outlook
This feature in this branch has 2 classes: high and normal.
Humidity | Yes | No | Total | Expected | Chi-square Yes | Chi-square No |
---|---|---|---|---|---|---|
High | 1 | 1 | 2 | 1 | 0 | 0 |
Normal | 2 | 1 | 3 | 1.5 | 0.408 | 0.408 |
The chi-square value of the humidity feature for the rain outlook is
0 + 0 + 0.408 + 0.408 = 0.816
Wind feature for rain outlook
This feature in this branch has 2 classes: weak and strong.
Wind | Yes | No | Total | Expected | Chi-square Yes | Chi-square No |
---|---|---|---|---|---|---|
Weak | 3 | 0 | 3 | 1.5 | 1.225 | 1.225 |
Strong | 0 | 2 | 2 | 1 | 1 | 1 |
So, the chi-square value of the wind feature for the rain outlook is
1.225 + 1.225 + 1 + 1 = 4.449
We've found all chi-square values for the rain outlook branch. Let's see them all in a single table.
Feature | Chi-square value |
---|---|
Temperature | 0.816 |
Humidity | 0.816 |
Wind | 4.449 |
So, the wind feature is the winner for the rain outlook branch. We put this feature in the related branch and look at the corresponding sub data sets.
As seen, all branches now have sub data sets with a single decision. So, we can build the CHAID tree as illustrated below.
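Written out as plain decision rules, the tree we just derived looks like this (predict is a hypothetical helper name, added only to make the final logic concrete):

```python
# The finished CHAID tree from this post, expressed as plain rules.
def predict(outlook, humidity, wind):
    if outlook == "Overcast":
        return "Yes"
    if outlook == "Sunny":
        return "Yes" if humidity == "Normal" else "No"
    # rain branch: the decision depends on wind
    return "Yes" if wind == "Weak" else "No"

print(predict("Sunny", "High", "Weak"))   # No, matching day 1
print(predict("Rain", "High", "Strong"))  # No, matching day 14
```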
Feature importance
Decision trees are naturally explainable and interpretable algorithms. Besides, they let us compute feature importance values to understand the built model better.
Gradient Boosting Decision Trees
Nowadays, gradient boosting decision trees are very popular in the machine learning community. They are actually not that different from the decision tree algorithms mentioned in this blog post. They mainly build sequential decision trees based on the errors of the previous round.
Random Forest vs Gradient Boosting
Both random forest and gradient boosting are ensemble approaches rather than core decision tree algorithms themselves. They require a core decision tree algorithm to run, and they build many decision trees in the background. So, we will discuss how they are similar and how they differ in the following video.
Adaboost vs Gradient Boosting
Both gradient boosting and adaboost are boosting techniques for decision tree based machine learning models. We will discuss how they are similar and how they differ from each other.
Conclusion
So, we've built a CHAID decision tree step by step in this post. CHAID uses the chi-square metric to find the most dominant feature and applies this recursively until the sub data sets have a single decision. Even though it is a legacy decision tree algorithm, it is still a common approach for classification problems.
Support this blog if you like it!
There is an error:
Wind feature
Wind feature has 2 classes: weak and strong. The following table is the pivot table.
Yes No Total Expected Chi-square Yes Chi-square No
Weak 5 2 7 3.5 0.802 0.802
Strong 3 3 6 3 0.000 0.000
The correct figures for weak are 6 yes, 2 no and 8 for the total; consequently, the chi-square value is higher.
Really cleared everything up and helped me in my studies! Thank you very much!
P.S.: From my understanding, CHAID isn’t actually the oldest decision tree learning algorithm, although it’s among the oldest. AID (Automatic Interaction Detector) and THAID (THeta Automatic Interaction Detector) were published in the ’70s.
import pandas as pd
from chefboost import Chefboost as cb
df = pd.read_csv('exECLAT.csv')
config = {'algorithm': 'CHAID'}
CHAIDtree = cb.fit(df, config)
What is this error and why does it occur?
assert group is None, ‘group argument must be None for now’
AssertionError: group argument must be None for now
A,B,C,D,E,F,Decision
2,2,0,0,0,1,1
1,3,0,1,0,3,0
3,3,3,3,3,3,0
…
Could you send me the data set?
Greetings,
How can we change the code if the name of the target variable is not "Decision"?
I get this error
Please confirm that name of the target column is “Decision” and it is put to the right in pandas data frame
You can do it as: df = df.rename(columns = {'Target': 'Decision'})
Great work. In case we have a huge data set (~5L cases) to study, running CHAID takes a lot of time (more than 5-6 hrs). Is this common? Could you please help with this?
Try enabling random forest with CHAID to speed it up.
How do I enable random forest with CHAID to speed it up? Thanks
Nota bene: this is not the complete, original CHAID algorithm. ID3, for example, will pick a feature like Outlook because Overcast was really good, but Rainy and Sunny may not be particularly good predictors. However, in ID3, we are forced to split the data on all three feature values, which may diffuse the "signal" in the data.
What CHAID had, however, was a test to see if all the values of a feature (“categories” is what Kass called them) were good in their own right. If they weren’t, they were merged. This allows the introduction of OR. So you might have Overcast as one split and Rainy OR Sunny as a second split. (p. 121 of his paper, Steps 2 and 3 are missing here). You can see the results of CHAID on the tree of page 126. At the first layer, note the merger of categories 3 and 4.
Otherwise, you simply have ID3/C4.5 with Chi-square instead of Entropy/Information Gain as your metric but it’s not the original CHAID algorithm.
Thanks for this! A nice explanation.