The post An Overview to Vanishing Gradient Problem appeared first on Sefik Ilkin Serengil.

In 1986, it was discovered that multi-layer perceptrons can handle non-linear problems. This discovery brought the AI winter to an end. Unfortunately, that was only the 1st AI winter!

This discovery requires activation units to be differentiable functions. In this way, we can back-propagate errors and apply learning. Herein, sigmoid and tanh are among the most common activation functions. However, these functions come with a huge defect.

The sigmoid function is meaningful for inputs between (-5, +5). In this range, its derivative is different from 0. This means that we can back-propagate errors and apply learning.

Ian Goodfellow illustrates this meaningfulness with the mobility of Bart Simpson on his skateboard. Gravity helps Bart move if he is in the range [-5, +5].

On the other hand, gravity will not help Bart move if he is at a point greater than 5 or less than -5. This representation describes the vanishing gradient problem very well. If the derivative of the activation function always produced 0, then we could not update the weights. But this is just the result. The question is what causes this result.

Wide and deep networks tend to produce large outputs in every layer. Constructing a wide and deep network with sigmoid activation units reveals the vanishing or exploding gradient problem. This ends us up in the **AI winter** again.
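A tiny numeric check illustrates the problem (a sketch of mine, not from the original post; the function names are hypothetical):

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

# inside the meaningful range the gradient is usable
print(sigmoid_derivative(0))  # 0.25

# far outside [-5, +5] the gradient almost vanishes
print(sigmoid_derivative(10) < 0.0001)  # True
```

Stack several layers of such near-zero derivatives and the back-propagated signal shrinks towards nothing.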

The 2nd AI winter ended in 2006. The rise of a simple activation function named ReLU showed us sunny days again. This function is the identity function for positive inputs, whereas it produces zero for negative inputs.

Let’s imagine Bart’s mobility on this new function. Gravity causes Bart to move for any positive input.

Wide network structures tend to produce mostly large positive inputs among layers. That’s why most vanishing gradient problems would be solved, even though gravity would not help Bart move for negative inputs.

You might consider using Leaky ReLU as the activation unit to handle this issue for negative inputs. Bart can move at any point on this new function! Leaky ReLU is a non-linear function, it is differentiable, and its derivative is different from 0 at every point except 0.
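As a quick illustration, ReLU and Leaky ReLU can be written with numpy as below (a sketch; the function names and the alpha slope of 0.01 are my choices, not from the post):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

# ReLU's derivative is 0 for negative inputs,
# whereas leaky ReLU keeps a small non-zero slope there
def leaky_relu_derivative(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)

print(relu(-3))  # 0
```

The non-zero slope for negative inputs is exactly what keeps Bart moving everywhere.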

Let’s construct a wide and deep neural network model. Basically, I’ll create a model for handwritten digit classification. There are 4 hidden layers consisting of 128, 64, 32 and 16 units respectively. Actually, it is not that deep.

classifier = tf.contrib.learn.DNNClassifier(
    feature_columns=feature_columns,
    n_classes=10,  #0 to 9 - 10 classes
    hidden_units=[128, 64, 32, 16],
    optimizer=tf.train.GradientDescentOptimizer(learning_rate=0.1),
    activation_fn=tf.nn.sigmoid
)

As seen, the model is disappointing. Its accuracy is very low.

All we need is to switch the activation function to ReLU.

classifier = tf.contrib.learn.DNNClassifier(
    feature_columns=feature_columns,
    n_classes=10,  #0 to 9 - 10 classes
    hidden_units=[128, 64, 32, 16],
    optimizer=tf.train.GradientDescentOptimizer(learning_rate=0.1),
    activation_fn=tf.nn.relu
)

As seen, accuracy increases dramatically when the activation unit is ReLU.

BTW, I’ve pushed the code to GitHub.

So, AI studies had an unproductive period of almost 20 years, between 1986 and 2006, because of activation units. Funnily, this challenging problem could be solved with a simple function. ReLU is the reason why we are much stronger in AI studies these days.


The post Official Guide To Fermat’s Little Theorem appeared first on Sefik Ilkin Serengil.

Remember Fermat’s Little Theorem:

a^{p} – a = 0 (mod p)

We already know that the statement is true when a = 0. The statement is also valid when a = 1.

Now, we’ll set a = n and suppose that the statement is true for this case. If we can prove that the statement is true for a = n + 1 based on this assumption, then we prove the correctness of the statement. This approach is called **proof by induction**.

(n+1)^{p} – (n+1)

Here, we can use binomial theorem to expand the term.

(x+y)^{n} = C(n, 0)x^{n}y^{0} + C(n, 1)x^{n-1}y^{1} + C(n, 2)x^{n-2}y^{2} + … + C(n, n-1)x^{1}y^{n-1} + C(n, n)x^{0}y^{n}

Let’s apply binomial theorem to expand n+1 to the power of p.

(n+1)^{p} = C(p, 0)n^{p} + C(p, 1)n^{p-1} + C(p, 2)n^{p-2} + … + C(p, p-1)n^{1} + C(p, p)n^{0}

Now, replace the power term in main statement.

(n+1)^{p} – (n+1) = C(p, 0)n^{p} + C(p, 1)n^{p-1} + C(p, 2)n^{p-2} + … + C(p, p-1)n^{1} + C(p, p)n^{0} – (n+1)

The C coefficients refer to combinations and can be calculated as

C(i, j) = i! / (j! (i-j)!)

That’s why both the C(p, 0) and C(p, p) terms are equal to 1.

(n+1)^{p} – (n+1) = n^{p} + C(p, 1)n^{p-1} + C(p, 2)n^{p-2} + … + C(p, p-1)n^{1} + n^{0} – n – 1

Notice that (n^{p} – n) exists in the equation above. Let’s group those terms. Also, n^{0} is equal to 1, and there is a -1 term in the equation; they cancel each other out.

(n+1)^{p} – (n+1) = (n^{p} – n) + C(p, 1)n^{p-1} + C(p, 2)n^{p-2} + … + C(p, p-1)n^{1}

Notice that our assumption is that (n^{p} – n) can be divided by p. That’s why we can remove it and focus on the rest of the equation. The question is whether the following term can be divided by p or not.

C(p, 1)n^{p-1} + C(p, 2)n^{p-2} + … + C(p, p-1)n^{1}

We should focus on the coefficient terms. Herein, imagine the binomial expansion and Pascal’s triangle.

Pow | Expansion |
---|---|
0 | 1 |
1 | 1 1 |
2 | 1 2 1 |
3 | 1 3 3 1 |
4 | 1 4 6 4 1 |
5 | 1 5 10 10 5 1 |
6 | 1 6 15 20 15 6 1 |
7 | 1 7 21 35 35 21 7 1 |

We already know that p is a prime. That’s why we should focus only on the prime pow lines.

Pow | Expansion |
---|---|
2 | 1 2 1 |
3 | 1 3 3 1 |
5 | 1 5 10 10 5 1 |
7 | 1 7 21 35 35 21 7 1 |

Remember that the C(p, 0) and C(p, p) coefficients are equal to 1, and we already separated their terms into the (n^{p} – n) group, which is divisible by p by assumption. That’s why I’ll remove the 1 terms in the expansion column.

Pow | Expansion |
---|---|
2 | 2 |
3 | 3 3 |
5 | 5 10 10 5 |
7 | 7 21 35 35 21 7 |

Notice that every term in these expansions can be divided by the pow value. However, we have to prove this to be convinced.

Let’s focus on a concrete example. I pick p as 7, and my alphabet would be {1, 2, 3, 4, 5, 6, 7}. I would like to produce strings of length 4. In other words, I wonder about C(7, 4).

Remember that order doesn’t matter in combinations. I mean that both (1, 3, 5, 7) and (3, 5, 7, 1) are the same set in my combination space. How can I manipulate this set? The answer is easy: I can increase all item values modulo 7.

(1, 3, 5, 7); (2, 4, 6, 1); (3, 5, 7, 2); (4, 6, 1, 3); (5, 7, 2, 4); (6, 1, 3, 5); (7, 2, 4, 6)

If the final set (7, 2, 4, 6) were increased one more time, then it would be equal to the first one (1, 3, 5, 7). As seen, any set can be manipulated this way 7 times before it repeats. This means the sets can be grouped into classes of 7, so their total count can be divided by 7.

To sum up, C(p, x) can be divided by p if p is a prime and 0 < x < p. Let’s turn back to the main statement.
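This claim is easy to sanity-check numerically (a quick sketch of mine, not part of the formal proof):

```python
from math import comb

# C(p, x) is divisible by p for every 0 < x < p when p is prime
for p in [2, 3, 5, 7, 11, 13]:
    for x in range(1, p):
        assert comb(p, x) % p == 0

# the claim fails for composite rows, e.g. C(4, 2) = 6 is not divisible by 4
assert comb(4, 2) % 4 != 0
```

The counter-example with p = 4 shows why the primality of p is essential.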

(n+1)^{p} – (n+1) = (n^{p} – n) + C(p, 1)n^{p-1} + C(p, 2)n^{p-2} + … + C(p, p-1)n^{1}

We have supposed that (n^{p} – n) can be divided by p, and we try to prove that (n+1)^{p} – (n+1) can be divided by p based on this assumption. We have also proved that C(p, x) can be divided by p if p is a prime and 0 < x < p. That’s why all terms in the equation illustrated above can be divided by p.

So, we have proven the correctness of Fermat’s Little Theorem by induction. We’ve benefited from the binomial theorem and Pascal’s triangle (the binomial expansion). This approach requires a more powerful math background than the necklace method.


The post The Math Behind RSA Algorithm appeared first on Sefik Ilkin Serengil.

a^{n} = a (mod n)

Notice that Fermat’s little theorem states that if n is a prime, then a^{n} – a is divisible by n for any integer a.

Remember that n is a prime number in that expression. Herein, we can divide both sides of the term by the number a when a and n are coprime.

a^{n-1} = 1 (mod n)

Let’s express n-1 with the totient function. The expression is modified into the following form, which is called the Fermat–Euler generalization.

a^{ϕ(n)} = 1 (mod n)

Actually, the totient function ϕ(n) is the number of integers less than or equal to n that are relatively prime to n. Notice that n is a prime number in Fermat’s Little Theorem. In this case, the totient must be n-1.

In Euler’s statement, n does not have to be prime. It holds for any modulus n and any integer a coprime to n. The totient function still means the number of integers between 1 and n that are coprime to n. However, we will define n as the multiplication of two primes.

We’ll turn back to **Fermat-Euler generalization**.

We should pick the global module n as multiplication of two primes and named them as p and q.

n = p . q

So, we can express Fermat’s Little Theorem for both prime numbers.

a^{ϕ(p)} = 1 (mod p)

a^{ϕ(q)} = 1 (mod q)

Notice that ϕ(p) and ϕ(q) are (p-1) and (q-1) respectively because both are prime numbers. The question is what ϕ(n), or ϕ(pq), is, because n is not prime.

Herein, the totient function is a multiplicative function. This property can be shown with the Chinese Remainder Theorem.

ϕ(n) = ϕ(p.q) = ϕ(p).ϕ(q) = (p-1).(q-1)

That is why we can produce a generalized demonstration.

a^{ϕ(n)} = 1 (mod n)

a^{ϕ(p)ϕ(q)} = 1 (mod pq)

a^{(p-1)(q-1)} = 1 (mod pq)
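We can sanity-check this generalization numerically with hypothetical toy primes (p = 5 and q = 7 are my choice for illustration only):

```python
from math import gcd

p, q = 5, 7              # toy primes, for illustration only
n = p * q                # 35
phi = (p - 1) * (q - 1)  # 24

# a^{phi(n)} = 1 (mod n) holds for every a coprime to n
for a in range(1, n):
    if gcd(a, n) == 1:
        assert pow(a, phi, n) == 1
```

Note that the assertion is skipped when a shares a factor with n; the generalization only applies to coprime a.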

Then, the algorithm instructs us to pick a random number **e** that is coprime to ϕ(n). After that, we’ll find its multiplicative inverse for modulus ϕ(n) and name it **d**.

e.d = 1 mod ϕ(n)

We can express this term as e.d = k . ϕ(n) + 1 for some integer k.
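In Python 3.8+, the modular inverse can be computed directly with the three-argument pow. The values of ϕ(n) and e below are hypothetical examples (p = 53 and q = 61 would give this totient):

```python
# hypothetical values: p = 53, q = 61 give phi = 52 * 60 = 3120
phi = 3120
e = 17               # chosen coprime to phi
d = pow(e, -1, phi)  # multiplicative inverse, Python 3.8+

print(d, (e * d) % phi)  # 2753 1
```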

Suppose that m is the message. Let’s raise m to the power e, and then to the power d.

(m^{e})^{d} mod n = m^{e.d} mod n = m^{k . ϕ(n) + 1} mod n = (m^{k . ϕ(n)} . m) mod n = ((m^{ϕ(n)})^{k} . m) mod n

Now, please focus on the term in the parentheses. Is it familiar?

m^{ϕ(n)} mod n

Yes, it is Fermat-Euler generalization. It must be equal to 1.

m^{ϕ(n)} mod n = 1 mod n

Let’s replace it.

(m^{ϕ(n)})^{k} . m mod n = (1)^{k} . m mod n = 1 . m mod n = m mod n

As seen, we can restore the message again. Herein, e and d are paired numbers. We can use one for encryption and the other one for decryption. In other words, let’s say c is the ciphertext. Then, the message can be encrypted as illustrated below.

c = m^{e} mod n

Message can be restored as demonstrated below.

m = c^{d} mod n

This is the proof of the RSA algorithm.
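The whole scheme can be sketched end to end with small, illustrative primes (real keys are never this small; the values below are my toy example, not part of the post):

```python
p, q = 61, 53            # toy primes, never this small in practice
n = p * q                # 3233
phi = (p - 1) * (q - 1)  # 3120

e = 17                   # public exponent, coprime to phi
d = pow(e, -1, phi)      # private exponent

m = 65                   # the message
c = pow(m, e, n)         # encryption: c = m^e mod n
restored = pow(c, d, n)  # decryption: m = c^d mod n

assert restored == m
```

The roundtrip succeeding is exactly the (m^{ϕ(n)})^{k} . m ≡ m (mod n) identity derived above.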

So, the RSA algorithm is based on the Fermat–Euler generalization. Today, most digital signatures and certificates are built on this algorithm. Interestingly, Fermat lived in the 17th century and Euler lived in the 18th century, yet we have only put these genius mathematicians’ discoveries into the backbone of the security world in the 21st century. This is the satisfying part of math: it does not change. On the other hand, all other sciences change. Even physics changes; consider Galileo’s physics versus Einstein’s physics.


The post Unofficial Guide to Fermat’s Little Theorem appeared first on Sefik Ilkin Serengil.

Or we can express Fermat’s little theorem as illustrated below.

a^{p} – a = 0 (mod p)

Suppose that our alphabet consists of **a** symbols, and we are going to produce all possible strings of length **p**. To make it concrete, let’s put real numbers in place of the a and p pair.

a = 8, p = 5

Here, I would like to define the symbols in my alphabet as 1 to 8.

alphabet = {1, 2, 3, 4, 5, 6, 7, 8}

I wonder how many different strings of length 5 can be written with this alphabet. The first character of the string can be 1 to 8. Similarly, the second character can still be 1 to 8. In the same way, the 3rd, 4th and 5th characters can be 1 to 8. This means that **a^{p}** strings can be written in the defined space.

Now, consider the strings consisting of the same symbol.

11111, 22222, 33333, 44444, 55555, 66666, 77777, 88888

As seen, there are **a** strings consisting of the same symbol.

Then, consider strings consisting of at least two different symbols (e.g. 11112). Shifting the symbols in such a string produces p different strings.

11112, 11121, 11211, 12111, 21111

11113, 11131, 11311, 13111, 31111

11114, 11141, 11411, 14111, 41111

11115, 11151, 11511, 15111, 51111

…

It goes like this.

Subtracting the number of strings consisting of the same symbol from the number of all possible strings gives the number of strings consisting of at least two different symbols. In other words, we can say that there are **a^{p} – a** such strings. Since shifting groups them into classes of p strings each, a^{p} – a must be divisible by p.
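The counting argument can be verified by brute force for a = 8 and p = 5 (an illustrative sketch of mine; the helper names are not from the post):

```python
from itertools import product

a, p = 8, 5
alphabet = range(1, a + 1)

def rotations(s):
    return {s[i:] + s[:i] for i in range(len(s))}

# strings with at least two different symbols
mixed = [s for s in product(alphabet, repeat=p) if len(set(s)) > 1]
assert len(mixed) == a**p - a

# each rotation class (necklace) contains exactly p strings when p is prime
classes = {min(rotations(s)) for s in mixed}
assert len(mixed) == p * len(classes)
assert (a**p - a) % p == 0
```

Primality of p matters here: for composite p, a string such as 1212 has fewer than p distinct rotations, and the class sizes no longer all equal p.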

In the literature, such circular strings are called necklaces. Connecting the beginning and end characters of a string reveals the necklace. One can rotate the necklace and get a different appearance of the same necklace.

So, we’ve demonstrated the easiest proof of Fermat’s little theorem. Actually, there are lots of different ways to prove this theorem. Fermat’s little theorem is a very important topic in the cryptography world. The RSA algorithm is based on its modified version, named the Euler–Fermat generalization.


The post A Step By Step C4.5 Decision Tree Example appeared first on Sefik Ilkin Serengil.

We are going to create a decision tree for the following dataset. It records the decision-making factors for playing tennis outside over the previous 14 days. The dataset might be familiar from the ID3 post. The difference is that the temperature and humidity columns have continuous values instead of nominal ones.

Day | Outlook | Temp. | Humidity | Wind | Decision |
---|---|---|---|---|---|
1 | Sunny | 85 | 85 | Weak | No |
2 | Sunny | 80 | 90 | Strong | No |
3 | Overcast | 83 | 78 | Weak | Yes |
4 | Rain | 70 | 96 | Weak | Yes |
5 | Rain | 68 | 80 | Weak | Yes |
6 | Rain | 65 | 70 | Strong | No |
7 | Overcast | 64 | 65 | Strong | Yes |
8 | Sunny | 72 | 95 | Weak | No |
9 | Sunny | 69 | 70 | Weak | Yes |
10 | Rain | 75 | 80 | Weak | Yes |
11 | Sunny | 75 | 70 | Strong | Yes |
12 | Overcast | 72 | 90 | Strong | Yes |
13 | Overcast | 81 | 75 | Weak | Yes |
14 | Rain | 71 | 80 | Strong | No |

We will do what we did in the ID3 example. Firstly, we need to calculate the global entropy. There are 14 examples; 9 instances refer to the yes decision, and 5 instances refer to the no decision.

Entropy(Decision) = ∑ – p(I) . log_{2}p(I) = – p(Yes) . log_{2}p(Yes) – p(No) . log_{2}p(No) = – (9/14) . log_{2}(9/14) – (5/14) . log_{2}(5/14) = 0.940
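As a sketch, the entropy calculation can be coded as follows (the helper name is mine, not from the post):

```python
import math

def entropy(labels):
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

decisions = ["Yes"] * 9 + ["No"] * 5
print(round(entropy(decisions), 3))  # 0.94
```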

In the ID3 algorithm, we calculated gains for each attribute. Here, we need to calculate gain ratios instead of gains.

GainRatio(A) = Gain(A) / SplitInfo(A)

SplitInfo(A) = -∑ |Dj|/|D| x log_{2}|Dj|/|D|

Wind is a nominal attribute. Its possible values are weak and strong.

Gain(Decision, Wind) = Entropy(Decision) – ∑ ( p(Decision|Wind) . Entropy(Decision|Wind) )

Gain(Decision, Wind) = Entropy(Decision) – [ p(Decision|Wind=Weak) . Entropy(Decision|Wind=Weak) ] – [ p(Decision|Wind=Strong) . Entropy(Decision|Wind=Strong) ]

There are 8 weak wind instances. 2 of them are concluded as no, 6 of them are concluded as yes.

Entropy(Decision|Wind=Weak) = – p(No) . log_{2}p(No) – p(Yes) . log_{2}p(Yes) = – (2/8) . log_{2}(2/8) – (6/8) . log_{2}(6/8) = 0.811

Entropy(Decision|Wind=Strong) = – (3/6) . log_{2}(3/6) – (3/6) . log_{2}(3/6) = 1

Gain(Decision, Wind) = 0.940 – (8/14).(0.811) – (6/14).(1) = 0.940 – 0.463 – 0.428 = 0.049

There are 8 decisions for weak wind, and 6 decisions for strong wind.

SplitInfo(Decision, Wind) = -(8/14).log_{2}(8/14) – (6/14).log_{2}(6/14) = 0.461 + 0.524 = 0.985

GainRatio(Decision, Wind) = Gain(Decision, Wind) / SplitInfo(Decision, Wind) = 0.049 / 0.985 = 0.049

Outlook is a nominal attribute, too. Its possible values are sunny, overcast and rain.

Gain(Decision, Outlook) = Entropy(Decision) – ∑ ( p(Decision|Outlook) . Entropy(Decision|Outlook) )

Gain(Decision, Outlook) = Entropy(Decision) – p(Decision|Outlook=Sunny) . Entropy(Decision|Outlook=Sunny) – p(Decision|Outlook=Overcast) . Entropy(Decision|Outlook=Overcast) – p(Decision|Outlook=Rain) . Entropy(Decision|Outlook=Rain)

There are 5 sunny instances. 3 of them are concluded as no, 2 of them are concluded as yes.

Entropy(Decision|Outlook=Sunny) = – p(No) . log_{2}p(No) – p(Yes) . log_{2}p(Yes) = -(3/5).log_{2}(3/5) – (2/5).log_{2}(2/5) = 0.441 + 0.528 = 0.970

Entropy(Decision|Outlook=Overcast) = – p(No) . log_{2}p(No) – p(Yes) . log_{2}p(Yes) = -(0/4).log_{2}(0/4) – (4/4).log_{2}(4/4) = 0

Notice that log_{2}(0) is actually equal to -∞, but we assume that the 0 . log_{2}(0) term is equal to 0.

Entropy(Decision|Outlook=Rain) = – p(No) . log_{2}p(No) – p(Yes) . log_{2}p(Yes) = -(2/5).log_{2}(2/5) – (3/5).log_{2}(3/5) = 0.528 + 0.441 = 0.970

Gain(Decision, Outlook) = 0.940 – (5/14).(0.970) – (4/14).(0) – (5/14).(0.970) = 0.246

There are 5 instances for sunny, 4 instances for overcast and 5 instances for rain.

SplitInfo(Decision, Outlook) = -(5/14).log_{2}(5/14) -(4/14).log_{2}(4/14) -(5/14).log_{2}(5/14) = 1.577

GainRatio(Decision, Outlook) = Gain(Decision, Outlook)/SplitInfo(Decision, Outlook) = 0.246/1.577 = 0.155

Unlike the others, humidity is a continuous attribute. We need to convert its continuous values to nominal ones. C4.5 proposes to perform a binary split based on a threshold value. The threshold should be the value which offers maximum gain for that attribute. Let’s focus on the humidity attribute. Firstly, we need to sort the humidity values from smallest to largest.

Day | Humidity | Decision |
---|---|---|
7 | 65 | Yes |
6 | 70 | No |
9 | 70 | Yes |
11 | 70 | Yes |
13 | 75 | Yes |
3 | 78 | Yes |
5 | 80 | Yes |
10 | 80 | Yes |
14 | 80 | No |
1 | 85 | No |
2 | 90 | No |
12 | 90 | Yes |
8 | 95 | No |
4 | 96 | Yes |

Now, we need to iterate over all humidity values and separate the dataset into two parts: instances less than or equal to the current value, and instances greater than the current value. We calculate the gain or gain ratio at every step. The value which maximizes the gain is chosen as the threshold.
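This threshold search can be sketched as follows (the humidity and decision lists come from the sorted table above; the helper names are my own):

```python
import math

def entropy(labels):
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(pr * math.log2(pr) for pr in probs)

# sorted humidity values and their decisions from the table above
humidity = [65, 70, 70, 70, 75, 78, 80, 80, 80, 85, 90, 90, 95, 96]
decision = ["Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes",
            "Yes", "No", "No", "No", "Yes", "No", "Yes"]

def gain_for_threshold(t):
    left = [d for h, d in zip(humidity, decision) if h <= t]
    right = [d for h, d in zip(humidity, decision) if h > t]
    n = len(decision)
    gain = entropy(decision)
    if left:
        gain -= (len(left) / n) * entropy(left)
    if right:
        gain -= (len(right) / n) * entropy(right)
    return gain

best = max(sorted(set(humidity)), key=gain_for_threshold)
print(best)  # 80
```

Running this confirms the result derived step by step below: the gain is maximized at the threshold 80.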

Check 65 as a threshold for humidity

Entropy(Decision|Humidity<=65) = – p(No) . log_{2}p(No) – p(Yes) . log_{2}p(Yes) = -(0/1).log_{2}(0/1) – (1/1).log_{2}(1/1) = 0

Entropy(Decision|Humidity>65) = -(5/13).log_{2}(5/13) – (8/13).log_{2}(8/13) =0.530 + 0.431 = 0.961

Gain(Decision, Humidity<> 65) = 0.940 – (1/14).0 – (13/14).(0.961) = 0.048

SplitInfo(Decision, Humidity<> 65) = -(1/14).log_{2}(1/14) -(13/14).log_{2}(13/14) = 0.371

GainRatio(Decision, Humidity<> 65) = 0.126

Check 70 as a threshold for humidity

Entropy(Decision|Humidity<=70) = – (1/4).log_{2}(1/4) – (3/4).log_{2}(3/4) = 0.811

Entropy(Decision|Humidity>70) = – (4/10).log_{2}(4/10) – (6/10).log_{2}(6/10) = 0.970

Gain(Decision, Humidity<> 70) = 0.940 – (4/14).(0.811) – (10/14).(0.970) = 0.940 – 0.231 – 0.692 = 0.014

SplitInfo(Decision, Humidity<> 70) = -(4/14).log_{2}(4/14) -(10/14).log_{2}(10/14) = 0.863

GainRatio(Decision, Humidity<> 70) = 0.016

Check 75 as a threshold for humidity

Entropy(Decision|Humidity<=75) = – (1/5).log_{2}(1/5) – (4/5).log_{2}(4/5) = 0.721

Entropy(Decision|Humidity>75) = – (4/9).log_{2}(4/9) – (5/9).log_{2}(5/9) = 0.991

Gain(Decision, Humidity<> 75) = 0.940 – (5/14).(0.721) – (9/14).(0.991) = 0.940 – 0.2575 – 0.637 = 0.045

SplitInfo(Decision, Humidity<> 75) = -(5/14).log_{2}(5/14) -(9/14).log_{2}(9/14) = 0.940

GainRatio(Decision, Humidity<> 75) = 0.047

I think these calculation demonstrations are enough. Now, I’ll skip the calculations and write only the results.

Gain(Decision, Humidity <> 78) = 0.090, GainRatio(Decision, Humidity <> 78) = 0.090

Gain(Decision, Humidity <> 80) = 0.101, GainRatio(Decision, Humidity <> 80) = 0.107

Gain(Decision, Humidity <> 85) = 0.024, GainRatio(Decision, Humidity <> 85) = 0.027

Gain(Decision, Humidity <> 90) = 0.010, GainRatio(Decision, Humidity <> 90) = 0.016

As seen, the gain is maximized when the threshold is equal to 80 for humidity. This means that the binary comparison of humidity against 80 will compete with the other nominal attributes when we create a branch in our tree.

Let’s summarize the calculated gains and gain ratios. The outlook attribute comes with both the maximum gain and the maximum gain ratio. This means that we need to put outlook at the root of the decision tree.

Attribute | Gain | GainRatio |
---|---|---|
Wind | 0.049 | 0.049 |
Outlook | 0.246 | 0.155 |
Humidity <> 80 | 0.101 | 0.107 |

After that, we apply similar steps just as in ID3 and create the following decision tree. Outlook is put into the root node. Now, we should look at the decisions for the different outlook types.

We’ve split humidity into greater than 80, and less than or equal to 80. Interestingly, the decision is always no if humidity is greater than 80 when the outlook is sunny. Similarly, the decision is yes if humidity is less than or equal to 80 for a sunny outlook.

Day | Outlook | Temp. | Hum. > 80 | Wind | Decision |
---|---|---|---|---|---|
1 | Sunny | 85 | Yes | Weak | No |
2 | Sunny | 80 | Yes | Strong | No |
8 | Sunny | 72 | Yes | Weak | No |
9 | Sunny | 69 | No | Weak | Yes |
11 | Sunny | 75 | No | Strong | Yes |

If the outlook is overcast, then no matter what temperature, humidity or wind are, the decision will always be yes.

Day | Outlook | Temp. | Hum. > 80 | Wind | Decision |
---|---|---|---|---|---|
3 | Overcast | 83 | No | Weak | Yes |
7 | Overcast | 64 | No | Strong | Yes |
12 | Overcast | 72 | Yes | Strong | Yes |
13 | Overcast | 81 | No | Weak | Yes |

Now we filter the rain outlook instances. As seen, the decision is yes when the wind is weak, and no when the wind is strong.

Day | Outlook | Temp. | Hum. > 80 | Wind | Decision |
---|---|---|---|---|---|
4 | Rain | 70 | Yes | Weak | Yes |
5 | Rain | 68 | No | Weak | Yes |
6 | Rain | 65 | No | Strong | No |
10 | Rain | 75 | No | Weak | Yes |
14 | Rain | 71 | No | Strong | No |

The final form of the decision tree is demonstrated below.
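The resulting tree can be expressed as a few nested conditions (a sketch of the tree derived above, not the author’s code):

```python
def predict(outlook, humidity, wind):
    if outlook == "Overcast":
        return "Yes"
    if outlook == "Sunny":
        return "No" if humidity > 80 else "Yes"
    return "No" if wind == "Strong" else "Yes"  # rain branch

# replay the 14 instances as (outlook, humidity, wind, decision)
dataset = [
    ("Sunny", 85, "Weak", "No"), ("Sunny", 90, "Strong", "No"),
    ("Overcast", 78, "Weak", "Yes"), ("Rain", 96, "Weak", "Yes"),
    ("Rain", 80, "Weak", "Yes"), ("Rain", 70, "Strong", "No"),
    ("Overcast", 65, "Strong", "Yes"), ("Sunny", 95, "Weak", "No"),
    ("Sunny", 70, "Weak", "Yes"), ("Rain", 80, "Weak", "Yes"),
    ("Sunny", 70, "Strong", "Yes"), ("Overcast", 90, "Strong", "Yes"),
    ("Overcast", 75, "Weak", "Yes"), ("Rain", 80, "Strong", "No"),
]
assert all(predict(o, h, w) == d for o, h, w, d in dataset)
```

The tree classifies all 14 training instances correctly.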

So, the C4.5 algorithm solves most of the problems in ID3. The algorithm uses gain ratios instead of gains. In this way, it creates more generalized trees and does not fall into overfitting. Moreover, the algorithm transforms continuous attributes into nominal ones based on gain maximization, and in this way it can handle continuous data. Additionally, it can ignore instances with missing data and thus handle incomplete datasets. On the other hand, both ID3 and C4.5 require a lot of CPU and memory. Besides, most authorities place decision tree algorithms in the data mining field rather than machine learning.


The post Random Initialization in Neural Networks appeared first on Sefik Ilkin Serengil.

Random initialization did not exist in the legacy version of the perceptron. Adding hidden layers was not enough to generalize non-linear problems. Let’s monitor how initializing all weight values as zero fails for a multi-layer perceptron. It cannot generalize even an xor gate problem, even though it has a hidden layer including 4 nodes.

def initialize_weights(layer_index, rows, columns):
    weights = np.zeros((rows+1, columns))  #+1 refers to bias unit
    return weights

As seen, the final weight values are the same within a layer. This symmetry is the reason for the failure.

On the other hand, initializing the weights randomly enables back propagation to work. You can populate the weights with random samples from a uniform distribution over [0, 1).

def initialize_weights(rows, columns):
    weights = np.random.random((rows+1, columns))  #+1 refers to bias unit
    return weights

You can improve convergence performance by applying some additional techniques. Scaling the initial weights based on the size of the layer they are connected from is called Xavier initialization. This initialization is good for the tanh activation.

def initialize_weights(rows, columns):
    weights = np.random.randn(rows+1, columns)  #normal distribution, +1 refers to bias unit
    weights = weights * np.sqrt(1/rows)
    return weights

Modifying the numerator works better for ReLU.

weights = weights * np.sqrt(2/(rows+1)) #+1 refers to bias unit

The same research proposes another initialization technique, called normalized initialization, based on the sizes of the previous layer and the following layer.

weights = weights * np.sqrt(6/((rows+1) + columns)) #+1 refers to bias unit

You can create the weights’ initial values in Python as coded below:

num_of_layers = len(hidden_layers) + 2  #plus input layer and output layer
w = [0 for i in range(num_of_layers-1)]

#weights from input layer to first hidden layer
w[0] = initialize_weights(num_of_features, hidden_layers[0])

#weights connecting a hidden layer to another hidden layer
if len(hidden_layers)-1 != 0:
    for i in range(len(hidden_layers) - 1):
        w[i+1] = initialize_weights(hidden_layers[i], hidden_layers[i+1])

#weights from final hidden layer to output layer
w[num_of_layers-2] = initialize_weights(hidden_layers[len(hidden_layers) - 1], num_of_classes)

So, we have focused on why random initialization is important for neural networks. Also, we’ve mentioned some initialization techniques. However, applying one of these initialization approaches is not a must. Neural networks can handle a problem as long as the weights are just initialized randomly. I’ve finally pushed the weight initialization logic to GitHub.


The post How Vectorization Saves Life in Neural Networks appeared first on Sefik Ilkin Serengil.

Suppose that you will construct a neural network. Using for loops requires storing the relations between nodes and weights to apply feed forward propagation. I have applied this approach once. That might be good for beginners, but you have to pay particular attention to follow the algorithm’s instructions. Even a basic feed forward propagation is coded as illustrated below; I could barely handle it with almost 50 lines of code.

def applyForwardPropagation(nodes, weights, instance, activation_function):
    #transfer bias unit values as +1
    for j in range(len(nodes)):
        if nodes[j].get_is_bias_unit() == True:
            nodes[j].set_net_value(1)
    #------------------------------
    #transfer instance features to input layer. activation function would not be applied for input layer.
    for j in range(len(instance) - 1):  #final item is output of an instance, that's why len(instance) - 1 used to iterate on features
        var = instance[j]
        for k in range(len(nodes)):
            if j+1 == nodes[k].get_index():
                nodes[k].set_net_value(var)
                break
    #------------------------------
    for j in range(len(nodes)):
        if nodes[j].get_level() > 0 and nodes[j].get_is_bias_unit() == False:
            net_input = 0
            net_output = 0
            target_index = nodes[j].get_index()
            for k in range(len(weights)):
                if target_index == weights[k].get_to_index():
                    wi = weights[k].get_value()
                    source_index = weights[k].get_from_index()
                    for m in range(len(nodes)):
                        if source_index == nodes[m].get_index():
                            xi = nodes[m].get_net_value()
                            net_input = net_input + (xi * wi)
                            break
            #iterate on weights end
            net_output = Activation.activate(activation_function, net_input)
            nodes[j].set_net_input_value(net_input)
            nodes[j].set_net_value(net_output)
    #------------------------------
    return nodes

So, does it really have to be that complex? Of course not. We will use linear algebra to transform the neural network concept into a vectorized version.

You might realize that the demonstration of the weights is a little different.

E.g. w^{(2)}_{11} refers to a weight connecting the 2nd layer to the 3rd layer because of the (2) superscript. It is not a power expression. Moreover, this weight connects the 1st node in the previous layer to the 1st node in the following layer because of the 11 subscript. The first item in the subscript refers to the connected-from node and the second item refers to the connected-to node. Similarly, w^{(1)}_{12} refers to the weight connecting the 1st layer’s 1st node to the 2nd layer’s 2nd node.

Let’s express the inputs and weights as vectors and matrices. The input features are expressed as a column vector of size nx1, where n is the total number of inputs.

Let’s imagine: what would happen if the transposed weights and the input features were multiplied?

Yes, you are right! This matrix multiplication stores the net input of the hidden layer.

We additionally need to transfer these net inputs to an activation function (e.g. sigmoid) to calculate the net outputs.

So, what does vectorization contribute when compared to the loop approach?

We will consume only the following libraries in our Python program. Numpy is a very strong Python library that makes matrix operations easier.

import math
import numpy as np

Here, let’s initialize the input features and the weights.

x = np.array(  #xor dataset
    [  #bias, x1, x2
        [[1], [0], [0]],  #instance 1
        [[1], [0], [1]],  #instance 2
        [[1], [1], [0]],  #instance 3
        [[1], [1], [1]]   #instance 4
    ]
)

w = np.array(
    [
        [  #weights for input layer to 1st hidden layer
            [0.8215133714710082, -4.781957888088778, 4.521206980948031],
            [-1.7254199547588138, -9.530462129807947, -8.932730568307496],
            [2.3874630239703, 9.221735768691351, 9.27410475328787]
        ],
        [  #weights for hidden layer to output layer
            [3.233334754817538],
            [-0.3269698166346504],
            [6.817229313048568],
            [-6.381026998906089]
        ]
    ], dtype=object  #layers have different shapes
)

Now, it is time to code. We can adapt the feed forward logic in 2 meaningful steps (matmul, which serves matrix multiplication, and sigmoid, which serves the activation function) as illustrated below. The other lines refer to initialization. As seen, there is neither a loop nor a condition statement over nodes and weights.

num_of_layers = w.shape[0] + 1

def applyFeedForward(x, w):
    netoutput = [i for i in range(num_of_layers)]
    netinput = [i for i in range(num_of_layers)]
    netoutput[0] = x
    for i in range(num_of_layers - 1):
        netinput[i+1] = np.matmul(np.transpose(w[i]), netoutput[i])
        netoutput[i+1] = sigmoid(netinput[i+1])
    return netoutput

Additionally, we need to apply the following function to transform the net input into the net output in each layer.

def sigmoid(netinput):
    netoutput = np.ones((netinput.shape[0] + 1, 1))
    #ones because init values are same as bias unit
    #also size of output is 1 plus input because of bias
    for i in range(netinput.shape[0]):
        netoutput[i+1] = 1/(1 + math.exp(-netinput[i][0]))
    return netoutput

A similar approach can be applied to the learning process in neural networks. Element-wise multiplication and scalar multiplication ease the construction.

for epoch in range(10000):
    for i in range(num_of_instances):
        instance = x[i]
        nodes = applyFeedForward(instance, w)
        predict = nodes[num_of_layers - 1][1]
        actual = y[i]
        error = actual - predict

        sigmas = [i for i in range(num_of_layers)]  #error should not be reflected to input layer
        sigmas[num_of_layers - 1] = error
        for j in range(num_of_layers - 2, -1, -1):
            if sigmas[j + 1].shape[0] == 1:
                sigmas[j] = w[j] * sigmas[j + 1]
            else:
                if j == num_of_layers - 2:  #output layer has no bias unit
                    sigmas[j] = np.matmul(w[j], sigmas[j + 1])
                else:  #otherwise remove bias unit from the following node because it is not connected from previous layer
                    sigmas[j] = np.matmul(w[j], sigmas[j + 1][1:])
        #sigma calculation end

        derivative_of_sigmoid = nodes * (np.array([1]) - nodes)  #element wise multiplication and scalar multiplication
        sigmas = derivative_of_sigmoid * sigmas

        for j in range(num_of_layers - 1):
            delta = nodes[j] * np.transpose(sigmas[j+1][1:])
            w[j] = w[j] + np.array([0.1]) * delta

It is clear that vectorization makes the code more readable and cleaner. What about the performance? I tested both the loop approach and vectorization on the xor dataset with the same configuration (10000 epochs, 2 hidden layers, a varying number of nodes on the x axis). It seems vectorization defeats the loop approach even on a basic dataset. That is engineering! You can test it on your own from this GitHub repo; NN.py refers to the loop approach, whereas Vectorization.py refers to the vectorized version.

So, we have replaced the loop approach with vectorization in the neural network feed forward step. This approach speeds performance up and increases code readability radically. I’ve also pushed both the vectorization and the loop approach code to GitHub. It is not surprising that Prof. Andrew Ng mentioned that you should not use loops. BTW, Barbara Fusinska defines neural networks and deep learning as **matrix multiplication, a lot of matrix multiplication**. I like this definition as well.

The post How Vectorization Saves Life in Neural Networks appeared first on Sefik Ilkin Serengil.


Suppose that you’ve chosen the following private key.

privateKey = 11253563012059685825953619222107823549092147699031672238385790369351542642469

The base point is a point on the elliptic curve that the bitcoin protocol consumes. It is publicly known. Additionally, the modulus and the order of the group are publicly known parameters of the bitcoin protocol, too, but these are not the focus of this post.

x0 = 55066263022277343669578718895168534326250603453777594175500187360389116729240
y0 = 32670510020758816978083085130507043184471273380659243275938904335757337482424

The public key will be the following coordinates. We have used both the point addition and the double-and-add method rules to find it. Public key calculation is a fast operation.

public key = 36422191471907241029883925342251831624200921388586025344128047678873736520530, 20277110887056303803699431755396003735040374760118964734768299847012543114150
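The double-and-add method itself fits in a few lines of plain Python. The following is an illustrative sketch (function names are mine, and this is not production-grade crypto code): point addition and doubling over the curve, with the modular inverse computed via Fermat's little theorem.

```python
def inverse_mod(k, p):
    # modular inverse via Fermat's little theorem (p is prime)
    return pow(k, p - 2, p)

def point_add(P, Q, p):
    # elliptic-curve point addition / doubling on y^2 = x^3 + 7; None is the point at infinity
    if P is None: return Q
    if Q is None: return P
    if P[0] == Q[0] and (P[1] + Q[1]) % p == 0:
        return None
    if P == Q:
        m = (3 * P[0] * P[0]) * inverse_mod(2 * P[1], p) % p  # tangent slope
    else:
        m = (Q[1] - P[1]) * inverse_mod(Q[0] - P[0], p) % p   # chord slope
    x = (m * m - P[0] - Q[0]) % p
    y = (m * (P[0] - x) - P[1]) % p
    return (x, y)

def scalar_mult(k, P, p):
    # double-and-add: walk the bits of k from least to most significant
    result, addend = None, P
    while k:
        if k & 1:
            result = point_add(result, addend, p)
        addend = point_add(addend, addend, p)
        k >>= 1
    return result

p = 2**256 - 2**32 - 977
G = (55066263022277343669578718895168534326250603453777594175500187360389116729240,
     32670510020758816978083085130507043184471273380659243275938904335757337482424)
privateKey = 11253563012059685825953619222107823549092147699031672238385790369351542642469
publicKey = scalar_mult(privateKey, G, p)
```

The resulting point can be checked against the curve equation; that is a cheap way to confirm the arithmetic is consistent.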

Here, we need to convert the coordinates of the public key to hex. Python provides the hex command for this transformation, but it prepends a 0x prefix. We can slice from index 2 to the end to remove that prefix. Additionally, we need to add the 04 prefix to the concatenated coordinates.

publicKeyHex = "04"+hex(publicKey[0])[2:]+hex(publicKey[1])[2:]

This produces the following public key representation.

public key (hex): 0450863ad64a87ae8a2fe83c1af1a8403cb53f53e486d8511dad8a04887e5b23522cd470243453a299fa9e77237716103abc11a1df38855ed6f2ee187e9c582ba6
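One caveat: Python's hex() drops leading zeros, so a coordinate whose top nibble is zero would yield a malformed, too-short key. A safer sketch (using format() with fixed width, which is my suggestion rather than the code used in the post) pads each coordinate to exactly 64 hex characters:

```python
# public key coordinates from the previous step
x = 36422191471907241029883925342251831624200921388586025344128047678873736520530
y = 20277110887056303803699431755396003735040374760118964734768299847012543114150

# format(..., '064x') left-pads with zeros to 32 bytes per coordinate, unlike hex()
publicKeyHex = "04" + format(x, '064x') + format(y, '064x')
```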

Now, we need to apply a series of hash functions to the hex version of the public key. I've written the following generalized function for hashing.

import codecs
import hashlib

def hexStringToByte(content):
    return codecs.decode(content.encode("utf-8"), 'hex')

def hashHex(algorithm, content):
    my_sha = hashlib.new(algorithm)
    my_sha.update(hexStringToByte(content))
    return my_sha.hexdigest()

Firstly, we'll digest the public key with SHA-256 and RIPEMD-160, respectively. Finally, we need to add the 00 network prefix to the double-hashed value.

output = hashHex('sha256', publicKeyHex)
print("apply sha-256 to public key: ", output)

output = hashHex('ripemd160', output)
print("apply ripemd160 to sha-256 applied public key: ", output)

output = "00" + output
print("add network bytes to ripemd160 applied hash - extended ripemd160: ", output, "\n")

This produces the following hashes.

apply sha-256 to public key hex: 600ffe422b4e00731a59557a5cca46cc183944191006324a447bdb2d98d4b408

apply ripemd160 to sha-256 applied public key: 010966776006953d5567439e5e39f86a0d273bee

add network bytes to ripemd160 applied hash – extended ripemd160: 00010966776006953d5567439e5e39f86a0d273bee

We've calculated the hash of the public key in the previous section. Now we'll apply SHA-256 twice to the hash of the public key, and only the first 8 hex digits of this new hash concern us.

checksum = hashHex('sha256', hashHex('sha256', output))
checksum = checksum[0:8]

That would be the checksum:

extract first 8 characters as checksum: d61967f6
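The same checksum can be reproduced with hashlib alone, without the helper functions above; a minimal sketch:

```python
import hashlib

# extended ripemd160 hash from the previous step
extended = "00010966776006953d5567439e5e39f86a0d273bee"
payload = bytes.fromhex(extended)

# checksum = first 4 bytes (8 hex digits) of SHA-256(SHA-256(payload))
checksum = hashlib.sha256(hashlib.sha256(payload).digest()).hexdigest()[0:8]
```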

We will append this checksum to hash of public key.

address = output+checksum

In this way, we can create the raw address.

checksum appended public key hash: 00010966776006953d5567439e5e39f86a0d273beed61967f6

Finally, we need to apply base-58 encoding to the raw address. I've found an excellent implementation of this encoding and adapted it directly.

address = base58.b58encode(hexStringToByte(address))
print("this is your bitcoin address:", address.decode()) #b58encode returns bytes in recent versions of the base58 package
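If you prefer not to depend on the third-party base58 package, the encoding itself fits in a few lines. Here is a minimal sketch of my own (the function name b58encode is mine): interpret the bytes as one big integer, repeatedly divide by 58, and map each leading 0x00 byte to a leading '1'.

```python
B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def b58encode(raw):
    # interpret the payload as a big-endian integer and divide by 58 repeatedly
    num = int.from_bytes(raw, "big")
    encoded = ""
    while num > 0:
        num, rem = divmod(num, 58)
        encoded = B58_ALPHABET[rem] + encoded
    # each leading 0x00 byte of the payload maps to a leading '1' character
    pad = len(raw) - len(raw.lstrip(b"\x00"))
    return "1" * pad + encoded

# checksum-appended public key hash from the previous step
raw_address = bytes.fromhex("00010966776006953d5567439e5e39f86a0d273beed61967f6")
address = b58encode(raw_address)
```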

Bitcoin address calculation is finally over. You can send and receive bitcoins if you have this kind of address.

this is your bitcoin address: 16UwLL9Risc3QfPqBUvKofHmBQ7wM

So, we've picked a really random private key, then calculated the public key from the known private key. After that, we applied several hash algorithms to the public key and retrieved our bitcoin address. Additionally, we'll sign every transaction we're involved in with our private key, whereas bitcoin network users verify these transactions with our public key.

I've pushed the source code of this post to GitHub. Please consider starring the repository if you like this post.

The post A Step By Step Bitcoin Address Example appeared first on Sefik Ilkin Serengil.


Remember the autoencoder post. The network design is symmetric about the centroid: the number of nodes decreases from left to the centroid and increases from the centroid to the right. The centroid layer holds the compressed representation. We will apply the same procedure for a CNN, too. We will additionally use convolution, activation and pooling layers for the convolutional autoencoder.

We can call the left-to-centroid side convolution, and the centroid-to-right side deconvolution. The deconvolution side is also known as upsampling or transposed convolution. We've mentioned how the pooling operation works: it is a basic reduction operation. How can we apply its reverse? That might be a little confusing. I've found an excellent animation for upsampling. An input matrix of size 2×2 (the blue one) is deconvolved to a matrix of size 4×4 (the cyan one). To do this, we add imaginary elements (e.g. 0 values) to the base matrix so that it is transformed into a 6×6 matrix.
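Note that the UpSampling2D layers used in the model below follow an even simpler scheme than the zero-insertion animation: by default they just repeat rows and columns (nearest neighbor). A minimal numpy sketch (the function name upsample2d is mine, not a Keras API) shows the effect:

```python
import numpy as np

def upsample2d(x, size=(2, 2)):
    # repeat each row and each column: the nearest-neighbor scheme
    # that Keras UpSampling2D applies by default
    return np.repeat(np.repeat(x, size[0], axis=0), size[1], axis=1)

pooled = np.array([[1, 2],
                   [3, 4]])

# every element of the 2x2 input becomes a 2x2 block in the 4x4 output
restored = upsample2d(pooled)
```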

We will work on handwritten digit database again. We’ll design the structure of convolutional autoencoder as illustrated above.

model = Sequential()

#1st convolution layer
#16 is the number of filters and (3, 3) is the size of the filter
model.add(Conv2D(16, (3, 3), padding='same', input_shape=(28,28,1)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2,2), padding='same'))

#2nd convolution layer
model.add(Conv2D(2, (3, 3), padding='same')) #apply 2 filters sized of (3x3)
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2,2), padding='same'))

#here is the compressed representation

#3rd convolution layer
model.add(Conv2D(2, (3, 3), padding='same')) #apply 2 filters sized of (3x3)
model.add(Activation('relu'))
model.add(UpSampling2D((2, 2)))

#4th convolution layer
model.add(Conv2D(16, (3, 3), padding='same'))
model.add(Activation('relu'))
model.add(UpSampling2D((2, 2)))

model.add(Conv2D(1, (3, 3), padding='same'))
model.add(Activation('sigmoid'))

You can summarize the constructed network structure.

model.summary()

This command dumps the following output. The base input is of size 28×28 at the beginning; the first two convolution blocks are responsible for reduction, the following two are in charge of restoration, and the final layer restores the same size as the input.

Layer (type)                 Output Shape              Param #
==============================================================
conv2d_1 (Conv2D)            (None, 28, 28, 16)        160
activation_1 (Activation)    (None, 28, 28, 16)        0
max_pooling2d_1 (MaxPooling2 (None, 14, 14, 16)        0
conv2d_2 (Conv2D)            (None, 14, 14, 2)         290
activation_2 (Activation)    (None, 14, 14, 2)         0
max_pooling2d_2 (MaxPooling2 (None, 7, 7, 2)           0
conv2d_3 (Conv2D)            (None, 7, 7, 2)           38
activation_3 (Activation)    (None, 7, 7, 2)           0
up_sampling2d_1 (UpSampling2 (None, 14, 14, 2)         0
conv2d_4 (Conv2D)            (None, 14, 14, 16)        304
activation_4 (Activation)    (None, 14, 14, 16)        0
up_sampling2d_2 (UpSampling2 (None, 28, 28, 16)        0
conv2d_5 (Conv2D)            (None, 28, 28, 1)         145
activation_5 (Activation)    (None, 28, 28, 1)         0
==============================================================

Here, we can start training.

model.compile(optimizer='adadelta', loss='binary_crossentropy')
model.fit(x_train, x_train, epochs=3, validation_data=(x_test, x_test))

Loss values for both the training set and the test set are satisfactory.

loss: 0.0968 – val_loss: 0.0926

Let’s visualize some restorations.

restored_imgs = model.predict(x_test)

for i in range(5):
    plt.imshow(x_test[i].reshape(28, 28))
    plt.gray()
    plt.show()

    plt.imshow(restored_imgs[i].reshape(28, 28))
    plt.gray()
    plt.show()

Restorations seem really satisfactory. Images on the left are the originals, whereas images on the right are restored from the compressed representation.

Notice that the 5th layer, named max_pooling2d_2, produces the compressed representation, and it is of size (None, 7, 7, 2). This work reveals that we can restore a 28×28 pixel image from a 7×7×2 matrix with a little loss. In other words, the compressed representation takes 8 times less space than the original image (784 / 98 = 8).

You might wonder how to extract compressed representations.

compressed_layer = 5
get_3rd_layer_output = K.function([model.layers[0].input], [model.layers[compressed_layer].output])
compressed = get_3rd_layer_output([x_test])[0]

#flatten compressed representation to 1 dimensional array
compressed = compressed.reshape(10000, 7*7*2)

Now, we can apply clustering to the compressed representation. I would like to apply k-means clustering.

import tensorflow as tf
from tensorflow.contrib.factorization.python.ops import clustering_ops

def train_input_fn():
    data = tf.constant(compressed, tf.float32)
    return (data, None)

unsupervised_model = tf.contrib.learn.KMeansClustering(
    10 #num of clusters
    , distance_metric = clustering_ops.SQUARED_EUCLIDEAN_DISTANCE
    , initial_clusters=tf.contrib.learn.KMeansClustering.RANDOM_INIT
)

unsupervised_model.fit(input_fn=train_input_fn, steps=1000)
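Note that tf.contrib was removed in TensorFlow 2.x, so the snippet above only runs on 1.x. As a dependency-free alternative, here is a minimal numpy sketch of Lloyd's k-means algorithm (the kmeans function and the toy blob data are my own illustration, not the code used in the post):

```python
import numpy as np

def kmeans(data, k, steps=100, seed=0):
    # plain Lloyd's algorithm: assign points to the nearest centroid, then re-average
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(steps):
        # squared euclidean distance of every point to every centroid
        dists = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = data[labels == j].mean(axis=0)
    return labels, centroids

# toy usage on two well-separated blobs
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
labels, centroids = kmeans(data, k=2)
```

On the real task you would pass the flattened compressed representation (shape 10000×98) as data instead of the toy blobs.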

Training is over. Now, we can check the clusters for the whole test set.

clusters = unsupervised_model.predict(input_fn=train_input_fn)

index = 0
for i in clusters:
    current_cluster = i['cluster_idx']
    features = x_test[index]
    index = index + 1

For example, the 6th cluster consists of 46 items. The distribution for this cluster is as follows: 22 items are 4, 14 items are 9, 7 items are 7, and 1 item is 5. It seems mostly the digits 4 and 9 are put in this cluster.

So, we've integrated both the convolutional neural network and autoencoder ideas for information reduction in image-based data. That would be a pre-processing step for clustering. In this way, we can apply k-means clustering with 98 features instead of 784. This could speed up the labeling process for unlabeled data. Of course, **with autoencoding comes great speed**. The source code of this post has already been pushed to GitHub.

The post Convolutional Autoencoder: Clustering Images with Neural Networks appeared first on Sefik Ilkin Serengil.


They are actually traditional neural networks, but their design makes them special. Firstly, they must have the same number of nodes in both the input and output layers. Secondly, the hidden layers must be symmetric about the center. Thirdly, the number of nodes in the hidden layers must decrease from left to the centroid and increase from the centroid to the right.

The key point is that input features are reduced and restored, respectively. If the output is similar to the input, we can say the input can be compressed to the values of the centroid layer's output. I said similar because this compression is not lossless.

The left side of this network is called the encoder, and it is responsible for reduction. On the other hand, the right side of the network is called the decoder, and it is in charge of restoration.

Let's apply this approach to the handwritten digit dataset. We've already applied several approaches to this problem before. Even though both the training and testing sets are already labeled from 0 to 9, we will discard their labels and pretend not to know what they are.

Let's construct the autoencoder structure first. As you might remember, the dataset consists of 28×28 pixel images. This means that the input features are of size 784 (28×28).

model = Sequential()
model.add(Dense(128, activation='relu', input_shape=(784,)))
model.add(Dense(32, activation='relu'))
model.add(Dense(128, activation='relu'))
model.add(Dense(784, activation='sigmoid'))

The autoencoder model has 784 nodes in both the input and output layers. What's more, there are 3 hidden layers of size 128, 32 and 128, respectively. Based on the autoencoder construction rule, the model is symmetric about the centroid, and the centroid layer consists of 32 nodes.

We'll feed the input features of the training set to both the input layer and the output layer.

model.compile(loss='binary_crossentropy', optimizer='adam')
model.fit(x_train, x_train, epochs=3, validation_data=(x_test, x_test))

Both the training error and the validation error satisfy me (loss: 0.0881 – val_loss: 0.0867). But it becomes concrete when applied to a real example.

def test_restoration(model):
    decoded_imgs = model.predict(x_test)
    get_3rd_layer_output = K.function([model.layers[0].input], [model.layers[1].output])

    for i in range(2):
        print("original: ")
        plt.imshow(x_test[i].reshape(28, 28))
        plt.show()
        #-------------------
        print("reconstructed: ")
        plt.imshow(decoded_imgs[i].reshape(28, 28))
        plt.show()
        #-------------------
        print("compressed: ")
        current_compressed = get_3rd_layer_output([x_test[i:i+1]])[0][0]
        plt.imshow(current_compressed.reshape(8, 4))
        plt.show()

Even though the restored image is a little blurred, it is clearly readable. This means that the compressed representation is meaningful.

We do not need to display restorations anymore. We can use the following code block to store the compressed versions instead of displaying them.

def autoencode(model):
    decoded_imgs = model.predict(x_test)
    get_3rd_layer_output = K.function([model.layers[0].input], [model.layers[1].output])
    compressed = get_3rd_layer_output([x_test])
    return compressed

com = autoencode(model)

Notice that the input features are of size 784, whereas the compressed representation is of size 32. This means it is roughly 24 times smaller than the original image. Herein, complex input features strain traditional unsupervised learning algorithms such as k-means or k-NN; including all features would confuse these algorithms. The idea is to apply the autoencoder first, reduce the input features, and extract meaningful data, then apply an unsupervised learning algorithm to the compressed representation. In this way, clustering algorithms perform better and produce more meaningful results.

unsupervised_model = tf.contrib.learn.KMeansClustering(
    10 #num of clusters
    , distance_metric = clustering_ops.SQUARED_EUCLIDEAN_DISTANCE
    , initial_clusters=tf.contrib.learn.KMeansClustering.RANDOM_INIT
)

def train_input_fn():
    data = tf.constant(com[0], tf.float32)
    return (data, None)

unsupervised_model.fit(input_fn=train_input_fn, steps=5000)

clusters = unsupervised_model.predict(input_fn=train_input_fn)

index = 0
for i in clusters:
    current_cluster = i['cluster_idx']
    features = x_test[index]
    index = index + 1

Surprisingly, this approach puts the following images in the same cluster. It seems that clustering is based on the general shapes of the digits instead of their identities.

So, we've shown how to adapt neural networks to the unsupervised learning process. Autoencoders have been a trending topic in recent years. They are not an alternative to supervised learning algorithms. Today, most of the data we have is pixel-based and unlabeled. Some mechanisms such as Mechanical Turk provide services to label this unlabeled data. This approach might help and speed up the labeling process. Finally, the source code of this post is pushed to GitHub.

The post Autoencoder: Neural Networks For Unsupervised Learning appeared first on Sefik Ilkin Serengil.
