The post Random Initialization in Neural Networks appeared first on Sefik Ilkin Serengil.

Random initialization did not exist in the legacy version of the perceptron. Adding hidden layers was not enough to generalize non-linear problems. Let's monitor how initializing all weight values as zero fails for a multi-layer perceptron. It cannot generalize even an XOR gate problem, even though it has a hidden layer including 4 nodes.

def initialize_weights(layer_index, rows, columns):
    weights = np.zeros((rows+1, columns)) #+1 refers to bias unit
    return weights

As seen, the final weight values are the same within each layer. This is why training fails.
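The failure can be reproduced in a few lines of NumPy. The sketch below is a hypothetical minimal 2-4-1 network (not the post's own code) trained on XOR with all-zero weights: no matter how many gradient steps are applied, every hidden unit keeps identical incoming and outgoing weights, so the network behaves like a single-unit one.

```python
import numpy as np

# minimal 2-4-1 network on XOR with all-zero initial weights (illustrative sketch)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = np.zeros((2, 4))  # input -> hidden
W2 = np.zeros((4, 1))  # hidden -> output
sigmoid = lambda z: 1 / (1 + np.exp(-z))

for _ in range(100):  # plain gradient descent steps
    h = sigmoid(X @ W1)
    o = sigmoid(h @ W2)
    d_o = (o - y) * o * (1 - o)        # output delta
    d_h = (d_o @ W2.T) * h * (1 - h)   # hidden delta
    W2 -= 0.5 * h.T @ d_o
    W1 -= 0.5 * X.T @ d_h

# both True: hidden units never differentiate from each other
print(np.allclose(W1, W1[:, [0]]), np.allclose(W2, W2[[0], :]))
```

Because every hidden unit receives the same gradient at every step, the symmetry can never be broken; random initialization exists exactly to break it.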

On the other hand, initializing weights randomly enables back propagation to work. You can populate the weights with random samples from a uniform distribution over [0, 1).

def initialize_weights(rows, columns):
    weights = np.random.random((rows+1, columns)) #+1 refers to bias unit
    return weights

You can improve convergence performance by applying some additional techniques. One approach scales the initial weights based on the size of the layer they are connected from. This is called Xavier initialization, and it works well with the tanh activation function.

def initialize_weights(rows, columns):
    weights = np.random.randn(rows+1, columns) #normal distribution, +1 refers to bias unit
    weights = weights * np.sqrt(1/rows)
    return weights

Modifying the numerator works better for ReLU; this variant is known as He initialization.

weights = weights * np.sqrt(2/(rows+1)) #+1 refers to bias unit

The same research proposes another technique called normalized initialization, based on the sizes of both the previous layer and the following layer.

weights = weights * np.sqrt(6/((rows+1) + columns)) #+1 refers to bias unit

You can create the weights' initial values in Python as coded below:

num_of_layers = len(hidden_layers) + 2 #plus input layer and output layer
w = [0 for i in range(num_of_layers-1)]

#weights from input layer to first hidden layer
w[0] = initialize_weights(num_of_features, hidden_layers[0])

#weights connecting a hidden layer to another hidden layer
if len(hidden_layers)-1 != 0:
    for i in range(len(hidden_layers) - 1):
        w[i+1] = initialize_weights(hidden_layers[i], hidden_layers[i+1])

#weights from final hidden layer to output layer
w[num_of_layers-2] = initialize_weights(hidden_layers[len(hidden_layers) - 1], num_of_classes)

So, we have focused on why random initialization is important for neural networks, and we've mentioned some initialization techniques. However, applying one of these specialized approaches is not a must: neural networks can handle a problem as long as the weights are just initialized randomly. I've pushed the weight initialization logic to GitHub.


The post How Vectorization Saves Life in Neural Networks appeared first on Sefik Ilkin Serengil.

Suppose that you will construct a neural network. Using for loops requires storing the relations between nodes and weights to apply feed forward propagation. I have applied this approach once. That might be good for beginners, but you have to pay particular attention to follow the algorithm's instructions. Even a basic feed forward propagation is coded as illustrated below; I can handle it with almost 50 lines of code.

def applyForwardPropagation(nodes, weights, instance, activation_function):
    #transfer bias unit values as +1
    for j in range(len(nodes)):
        if nodes[j].get_is_bias_unit() == True:
            nodes[j].set_net_value(1)
    #------------------------------
    #transfer instance features to input layer. activation function would not be applied for input layer.
    for j in range(len(instance) - 1): #final item is output of an instance, that's why len(instance) - 1 used to iterate on features
        var = instance[j]
        for k in range(len(nodes)):
            if j+1 == nodes[k].get_index():
                nodes[k].set_net_value(var)
                break
    #------------------------------
    for j in range(len(nodes)):
        if nodes[j].get_level() > 0 and nodes[j].get_is_bias_unit() == False:
            net_input = 0
            net_output = 0
            target_index = nodes[j].get_index()
            for k in range(len(weights)):
                if target_index == weights[k].get_to_index():
                    wi = weights[k].get_value()
                    source_index = weights[k].get_from_index()
                    for m in range(len(nodes)):
                        if source_index == nodes[m].get_index():
                            xi = nodes[m].get_net_value()
                            net_input = net_input + (xi * wi)
                            break
            #iterate on weights end
            net_output = Activation.activate(activation_function, net_input)
            nodes[j].set_net_input_value(net_input)
            nodes[j].set_net_value(net_output)
    #------------------------------
    return nodes

So, is this really that complex? Of course not. We will use linear algebra to transform the neural network into a vectorized version.

You might realize that the notation for weights is a little different.

E.g. w^{(2)}_{11} refers to a weight connecting the 2nd layer to the 3rd layer because of the (2) superscript; it is not a power expression. Moreover, this weight connects the 1st node in the previous layer to the 1st node in the following layer because of the 11 subscript. The first item in the subscript states which node the weight is connected from, and the second item states which node it is connected to. Similarly, w^{(1)}_{12} refers to the weight connecting the 1st layer's 1st node to the 2nd layer's 2nd node.

Let's express the inputs and weights as vectors and matrices. Input features are expressed as a column vector of size n×1, where n is the total number of inputs.

Let's imagine: what would happen if the transposed weight matrix and the input features were multiplied?

Yes, you are right! This matrix multiplication stores the net input for the hidden layer.

We additionally need to pass these net inputs through the activation function (e.g. sigmoid) to calculate the net outputs.

So, what would vectorization contribute when compared to the loop approach?

We will use only the following libraries in our Python program. NumPy is a very strong Python library that makes matrix operations easier.

import math
import numpy as np

Here, let's initialize the input features and weights.

x = np.array( #xor dataset
    [
        #bias, #x1, #x2
        [[1],[0],[0]], #instance 1
        [[1],[0],[1]], #instance 2
        [[1],[1],[0]], #instance 3
        [[1],[1],[1]]  #instance 4
    ]
)

w = np.array(
    [
        [ #weights for input layer to 1st hidden layer
            [0.8215133714710082, -4.781957888088778, 4.521206980948031],
            [-1.7254199547588138, -9.530462129807947, -8.932730568307496],
            [2.3874630239703, 9.221735768691351, 9.27410475328787]
        ],
        [ #weights for hidden layer to output layer
            [3.233334754817538],
            [-0.3269698166346504],
            [6.817229313048568],
            [-6.381026998906089]
        ]
    ]
)

Now, it is time to code. We can express the feed forward logic in 2 meaningful steps (matmul, which serves matrix multiplication, and sigmoid, which serves as the activation function) as illustrated below. The other lines refer to initialization. As seen, there is neither a loop nor a conditional statement over nodes and weights.

num_of_layers = w.shape[0] + 1

def applyFeedForward(x, w):
    netoutput = [i for i in range(num_of_layers)]
    netinput = [i for i in range(num_of_layers)]
    netoutput[0] = x

    for i in range(num_of_layers - 1):
        netinput[i+1] = np.matmul(np.transpose(w[i]), netoutput[i])
        netoutput[i+1] = sigmoid(netinput[i+1])

    return netoutput

Additionally, we need the following function to transform net input to net output in each layer.

def sigmoid(netinput):
    netoutput = np.ones((netinput.shape[0] + 1, 1))
    #ones because init values are same as bias unit
    #also size of output is 1 plus input because of bias
    for i in range(netinput.shape[0]):
        netoutput[i+1] = 1/(1 + math.exp(-netinput[i][0]))
    return netoutput

A similar approach can be applied to the learning process in neural networks. Element-wise multiplication and scalar multiplication ease the construction.

for epoch in range(10000):
    for i in range(num_of_instances):
        instance = x[i]
        nodes = applyFeedForward(instance, w)

        predict = nodes[num_of_layers - 1][1]
        actual = y[i]
        error = actual - predict

        sigmas = [i for i in range(num_of_layers)] #error should not be reflected to input layer
        sigmas[num_of_layers - 1] = error
        for j in range(num_of_layers - 2, -1, -1):
            if sigmas[j + 1].shape[0] == 1:
                sigmas[j] = w[j] * sigmas[j + 1]
            else:
                if j == num_of_layers - 2: #output layer has no bias unit
                    sigmas[j] = np.matmul(w[j], sigmas[j + 1])
                else: #otherwise remove bias unit from the following node because it is not connected from previous layer
                    sigmas[j] = np.matmul(w[j], sigmas[j + 1][1:])
        #sigma calculation end

        derivative_of_sigmoid = nodes * (np.array([1]) - nodes) #element wise multiplication and scalar multiplication
        sigmas = derivative_of_sigmoid * sigmas

        for j in range(num_of_layers - 1):
            delta = nodes[j] * np.transpose(sigmas[j+1][1:])
            w[j] = w[j] + np.array([0.1]) * delta

It is clear that vectorization makes the code more readable. What about the performance? I tested both the loop approach and vectorization on the XOR dataset with the same configuration (10000 epochs, 2 hidden layers with different numbers of nodes on the x-axis). It seems vectorization defeats the loop approach even for a basic dataset. That is engineering! You can test it on your own from this GitHub repo: NN.py refers to the loop approach whereas Vectorization.py refers to the vectorized version.
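The gap is easy to reproduce. The hypothetical micro-benchmark below (not part of the repo) compares a pure-Python loop against np.matmul for a single dense-layer forward pass; both paths produce the same numbers, but the vectorized one is typically orders of magnitude faster.

```python
import time
import numpy as np

rows, cols = 256, 256
W = np.random.randn(rows, cols)   # weight matrix
x = np.random.randn(rows, 1)      # input column vector

t0 = time.time()
out_loop = [sum(W[i, j] * x[i, 0] for i in range(rows)) for j in range(cols)]
loop_time = time.time() - t0

t0 = time.time()
out_vec = np.matmul(W.T, x)       # same computation, vectorized
vec_time = time.time() - t0

# identical results, dramatically different run times
print(np.allclose(np.array(out_loop).reshape(-1, 1), out_vec))
```

NumPy delegates the multiplication to optimized BLAS routines, which is where the speed-up comes from.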

So, we have replaced the loop approach with vectorization in the neural network's feed forward step. This speeds performance up and increases code readability radically. I've also pushed both the vectorized and loop-based code to GitHub. It is not surprising that Prof. Andrew Ng mentioned that you should not use loops. BTW, Barbara Fusinska defines neural networks and deep learning as **matrix multiplication, a lot of matrix multiplication**. I like this definition as well.


The post A Step By Step Bitcoin Address Example appeared first on Sefik Ilkin Serengil.


Suppose that you’ve chosen the following private key.

privateKey = 11253563012059685825953619222107823549092147699031672238385790369351542642469

The base point is a coordinate on the elliptic curve that the bitcoin protocol consumes. It is publicly known. Additionally, the modulo and the order of the group are publicly known information for the bitcoin protocol, too. But these are not the focus of this post.

x0 = 55066263022277343669578718895168534326250603453777594175500187360389116729240
y0 = 32670510020758816978083085130507043184471273380659243275938904335757337482424

The public key will be the following coordinates. We have used both point addition and the double-and-add method to find the public key. Public key calculation is a fast operation.

public key = 36422191471907241029883925342251831624200921388586025344128047678873736520530, 20277110887056303803699431755396003735040374760118964734768299847012543114150
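The post's helper functions for the curve arithmetic are not shown here, but a minimal sketch of point addition and the double-and-add method on secp256k1 might look like the following (an illustrative implementation, not the author's code; modular inverses use the three-argument pow from Python 3.8+):

```python
# secp256k1 parameters (publicly known)
p = 2**256 - 2**32 - 977   # field modulo
a, b = 0, 7                # curve: y^2 = x^3 + 7
G = (55066263022277343669578718895168534326250603453777594175500187360389116729240,
     32670510020758816978083085130507043184471273380659243275938904335757337482424)

def point_add(P, Q):
    #handles the identity element (None), doubling and chord addition
    if P is None: return Q
    if Q is None: return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return None  # P + (-P) = identity
    if P == Q:
        s = (3 * x1 * x1 + a) * pow(2 * y1, -1, p) % p  # tangent slope
    else:
        s = (y2 - y1) * pow(x2 - x1, -1, p) % p         # chord slope
    x3 = (s * s - x1 - x2) % p
    return (x3, (s * (x1 - x3) - y1) % p)

def double_and_add(k, P):
    #scan the bits of k; double every step, add when the bit is set
    R = None
    while k:
        if k & 1:
            R = point_add(R, P)
        P = point_add(P, P)
        k >>= 1
    return R

privateKey = 11253563012059685825953619222107823549092147699031672238385790369351542642469
publicKey = double_and_add(privateKey, G)
```

Double-and-add needs only about 256 doublings plus a few additions, which is why public key calculation is fast even for 256-bit scalars.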

Here, we need to convert the coordinates of the public key to hex. Python provides the hex command for this transformation, but it prepends a 0x prefix. We can slice from index 2 to the end to remove that prefix. Additionally, we need to add the 04 prefix to the concatenated coordinates.

publicKeyHex = "04"+hex(publicKey[0])[2:]+hex(publicKey[1])[2:]

This will produce the following public key representation.

public key (hex): 0450863ad64a87ae8a2fe83c1af1a8403cb53f53e486d8511dad8a04887e5b23522cd470243453a299fa9e77237716103abc11a1df38855ed6f2ee187e9c582ba6

Now, we need to apply a series of hash functions to the hex version of the public key. I've written the following generalized function for hashing.

import codecs
import hashlib

def hexStringToByte(content):
    return codecs.decode(content.encode("utf-8"), 'hex')

def hashHex(algorithm, content):
    my_sha = hashlib.new(algorithm)
    my_sha.update(hexStringToByte(content))
    return my_sha.hexdigest()

Firstly, we'll digest the public key with SHA-256 and RIPEMD-160, respectively. Then, we need to add the 00 network prefix to the double hashed value.

output = hashHex('sha256', publicKeyHex)
print("apply sha-256 to public key: ", output)

output = hashHex('ripemd160', output)
print("apply ripemd160 to sha-256 applied public key: ", output)

output = "00" + output
print("add network bytes to ripemd160 applied hash - extended ripemd160: ", output, "\n")

This produces the following hashes.

apply sha-256 to public key hex: 600ffe422b4e00731a59557a5cca46cc183944191006324a447bdb2d98d4b408

apply ripemd160 to sha-256 applied public key: 010966776006953d5567439e5e39f86a0d273bee

add network bytes to ripemd160 applied hash – extended ripemd160: 00010966776006953d5567439e5e39f86a0d273bee

We've calculated the hash of the public key in the previous section. Now we'll apply SHA-256 twice to this hash. Only the first 8 digits of the new hash concern us.

checksum = hashHex('sha256', hashHex('sha256', output))
checksum = checksum[0:8]

That would be the checksum.

extract first 8 characters as checksum: d61967f6

We will append this checksum to the hash of the public key.

address = output+checksum

In this way, we can create the raw address.

checksum appended public key hash: 00010966776006953d5567439e5e39f86a0d273beed61967f6

Finally, we need to apply base58 encoding to the raw address. I've found an excellent implementation of this encoding and adapted it directly.
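For reference, base58 encoding itself is compact enough to sketch by hand. The function below is an illustrative implementation (not the one adapted in the post): the payload is read as one big integer, rewritten in base 58, and each leading zero byte becomes a leading '1'.

```python
ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def b58encode(payload: bytes) -> str:
    #treat the payload as a big integer and rewrite it in base 58
    n = int.from_bytes(payload, "big")
    encoded = ""
    while n > 0:
        n, remainder = divmod(n, 58)
        encoded = ALPHABET[remainder] + encoded
    #each leading zero byte is encoded as '1' (the first alphabet character)
    pad = len(payload) - len(payload.lstrip(b"\x00"))
    return "1" * pad + encoded
```

The leading-zero rule is why mainnet addresses (network byte 00) always start with the character 1.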

import base58

address = base58.b58encode(hexStringToByte(address))
print("this is your bitcoin address:", address.decode())

The bitcoin address calculation is finally over. You can send and receive bitcoins with this kind of address.

this is your bitcoin address: 16UwLL9Risc3QfPqBUvKofHmBQ7wMtjvM

So, we've picked a really random private key, then calculated the public key from the known private key. After that, we've applied several hash algorithms to the public key and retrieved our bitcoin address. Additionally, we'll sign every transaction we're involved in with our private key, whereas bitcoin network users verify these transactions with our public key.

I've pushed the source code of this post to GitHub. Please consider starring the repository if you like this post.


The post Convolutional Autoencoder: Clustering Images with Neural Networks appeared first on Sefik Ilkin Serengil.

Remember the autoencoder post. The network design is symmetric about the centroid: the number of nodes decreases from left to the centroid, and increases from the centroid to the right. The centroid layer holds the compressed representation. We will apply the same procedure for a CNN, too; additionally, we will use convolution, activation and pooling layers for the convolutional autoencoder.

We can call the left-to-centroid side convolution, and the centroid-to-right side deconvolution. The deconvolution side is also known as upsampling or transpose convolution. We've already mentioned how the pooling operation works: it is a basic reduction operation. How can we apply its reverse? That might be a little confusing. I've found an excellent animation for upsampling. An input matrix of size 2×2 (the blue one) will be deconvolved to a matrix of size 4×4 (the cyan one). To do this, we can add imaginary elements (e.g. 0 values) to the base matrix so that it is transformed into a 6×6 matrix.
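To make the reverse-of-pooling idea concrete: the Keras UpSampling2D layer used in the model below performs plain nearest-neighbour upsampling, which can be sketched in NumPy by repeating each element along both axes.

```python
import numpy as np

#nearest-neighbour upsampling: repeat each element along both axes
x = np.array([[1, 2],
              [3, 4]])
up = x.repeat(2, axis=0).repeat(2, axis=1)  # 2x2 -> 4x4
print(up)
```

Transpose convolution is the learnable alternative shown in the animation; UpSampling2D simply grows the spatial dimensions and leaves the learning to the convolution layers that follow it.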

We will work on the handwritten digit database again. We'll design the structure of the convolutional autoencoder as illustrated above.

model = Sequential()

#1st convolution layer
model.add(Conv2D(16, (3, 3) #16 is number of filters and (3, 3) is the size of the filter
    , padding='same', input_shape=(28,28,1)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2,2), padding='same'))

#2nd convolution layer
model.add(Conv2D(2, (3, 3), padding='same')) #apply 2 filters sized of (3x3)
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2,2), padding='same'))

#here compressed version

#3rd convolution layer
model.add(Conv2D(2, (3, 3), padding='same')) #apply 2 filters sized of (3x3)
model.add(Activation('relu'))
model.add(UpSampling2D((2, 2)))

#4th convolution layer
model.add(Conv2D(16, (3, 3), padding='same'))
model.add(Activation('relu'))
model.add(UpSampling2D((2, 2)))

model.add(Conv2D(1, (3, 3), padding='same'))
model.add(Activation('sigmoid'))

You can summarize the constructed network structure.

model.summary()

This command dumps the following output. The base input is of size 28×28 at the beginning; the first two convolution blocks are responsible for reduction and the following two blocks are in charge of restoration. As seen, the final layer restores the same size as the input.

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv2d_1 (Conv2D)            (None, 28, 28, 16)        160
_________________________________________________________________
activation_1 (Activation)    (None, 28, 28, 16)        0
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 14, 14, 16)        0
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 14, 14, 2)         290
_________________________________________________________________
activation_2 (Activation)    (None, 14, 14, 2)         0
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 7, 7, 2)           0
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 7, 7, 2)           38
_________________________________________________________________
activation_3 (Activation)    (None, 7, 7, 2)           0
_________________________________________________________________
up_sampling2d_1 (UpSampling2 (None, 14, 14, 2)         0
_________________________________________________________________
conv2d_4 (Conv2D)            (None, 14, 14, 16)        304
_________________________________________________________________
activation_4 (Activation)    (None, 14, 14, 16)        0
_________________________________________________________________
up_sampling2d_2 (UpSampling2 (None, 28, 28, 16)        0
_________________________________________________________________
conv2d_5 (Conv2D)            (None, 28, 28, 1)         145
_________________________________________________________________
activation_5 (Activation)    (None, 28, 28, 1)         0
=================================================================

Here, we can start training.

model.compile(optimizer='adadelta', loss='binary_crossentropy')
model.fit(x_train, x_train, epochs=3, validation_data=(x_test, x_test))

Loss values for both training set and test set are satisfactory.

loss: 0.0968 – val_loss: 0.0926

Let’s visualize some restorations.

restored_imgs = model.predict(x_test)

for i in range(5):
    plt.imshow(x_test[i].reshape(28, 28))
    plt.gray()
    plt.show()

    plt.imshow(restored_imgs[i].reshape(28, 28))
    plt.gray()
    plt.show()

The restorations seem really satisfactory. Images on the left side are the originals whereas images on the right side are restored from the compressed representation.

Notice that the 5th layer, named max_pooling2d_2, states the compressed representation and its size is (None, 7, 7, 2). This work reveals that we can restore a 28×28 pixel image from a 7×7×2 sized matrix with a little loss. In other words, the compressed representation takes 8 times less space than the original image.

You might wonder how to extract compressed representations.

compressed_layer = 5
get_3rd_layer_output = K.function([model.layers[0].input], [model.layers[compressed_layer].output])
compressed = get_3rd_layer_output([x_test])[0]

#flatten compressed representation to 1 dimensional array
compressed = compressed.reshape(10000, 7*7*2)

Now, we can apply clustering to the compressed representation. I would like to apply k-means clustering.

from tensorflow.contrib.factorization.python.ops import clustering_ops
import tensorflow as tf

def train_input_fn():
    data = tf.constant(compressed, tf.float32)
    return (data, None)

unsupervised_model = tf.contrib.learn.KMeansClustering(
    10 #num of clusters
    , distance_metric = clustering_ops.SQUARED_EUCLIDEAN_DISTANCE
    , initial_clusters=tf.contrib.learn.KMeansClustering.RANDOM_INIT
)

unsupervised_model.fit(input_fn=train_input_fn, steps=1000)

Training is over. Now, we can check the clusters for the whole test set.

clusters = unsupervised_model.predict(input_fn=train_input_fn)

index = 0
for i in clusters:
    current_cluster = i['cluster_idx']
    features = x_test[index]
    index = index + 1

For example, the 6th cluster consists of 46 items. The distribution for this cluster is as follows: 22 items are 4, 14 items are 9, 7 items are 7, and 1 item is 5. It seems that mostly 4 and 9 digits are put in this cluster.

So, we've combined both convolutional neural networks and the autoencoder idea for information reduction on image data. That could serve as a pre-processing step for clustering. In this way, we can apply k-means clustering with 98 features instead of 784 features. This could speed up the labeling process for unlabeled data. Of course, **with autoencoding comes great speed**. The source code of this post has already been pushed to GitHub.


The post Autoencoder: Neural Networks For Unsupervised Learning appeared first on Sefik Ilkin Serengil.

Autoencoders are actually traditional neural networks; their design is what makes them special. Firstly, they must have the same number of nodes in the input and output layers. Secondly, the hidden layers must be symmetric about the center. Thirdly, the number of nodes in the hidden layers must decrease from left to the centroid, and must increase from the centroid to the right.

The key point is that the input features are reduced and then restored. If the input is similar to the output, we can say that the input is compressed into the centroid layer's output. I said similar because this compression is not lossless.

The left side of this network is called the encoder and it is responsible for reduction. On the other hand, the right side of the network is called the decoder and it is in charge of enlargement.

Let's apply this approach to the handwritten digit dataset. We've already applied several approaches to this problem before. Even though both the training and testing sets are labeled from 0 to 9, we will discard the labels and pretend not to know what they are.

Let's construct the autoencoder structure first. As you might remember, the dataset consists of 28×28 pixel images. This means that the input features are of size 784 (28×28).

model = Sequential()
model.add(Dense(128, activation='relu', input_shape=(784,)))
model.add(Dense(32, activation='relu'))
model.add(Dense(128, activation='relu'))
model.add(Dense(784, activation='sigmoid'))

The autoencoder model has 784 nodes in both the input and output layers. What's more, there are 3 hidden layers of size 128, 32 and 128 respectively. In line with the autoencoder construction rules, the network is symmetric about the centroid and the centroid layer consists of 32 nodes.

We'll feed the input features of the training set to both the input layer and the output layer.

model.compile(loss='binary_crossentropy', optimizer='adam')
model.fit(x_train, x_train, epochs=3, validation_data=(x_test, x_test))

Both the training error and validation error satisfy me (loss: 0.0881 – val_loss: 0.0867). But it becomes concrete when applied to a real example.

def test_restoration(model):
    decoded_imgs = model.predict(x_test)
    get_3rd_layer_output = K.function([model.layers[0].input], [model.layers[1].output])

    for i in range(2):
        print("original: ")
        plt.imshow(x_test[i].reshape(28, 28))
        plt.show()
        #-------------------
        print("reconstructed: ")
        plt.imshow(decoded_imgs[i].reshape(28, 28))
        plt.show()
        #-------------------
        print("compressed: ")
        current_compressed = get_3rd_layer_output([x_test[i:i+1]])[0][0]
        plt.imshow(current_compressed.reshape(8, 4))
        plt.show()

Even though the restored image is a little blurred, it is clearly readable. This means that the compressed representation is meaningful.

We do not need to display the restorations anymore. We can use the following code block to store the compressed versions instead of displaying them.

def autoencode(model):
    decoded_imgs = model.predict(x_test)
    get_3rd_layer_output = K.function([model.layers[0].input], [model.layers[1].output])
    compressed = get_3rd_layer_output([x_test])
    return compressed

com = autoencode(model)

Notice that the input features are of size 784 whereas the compressed representation is of size 32. This means the compressed form is about 24 times smaller than the original image. Herein, complex input features strain traditional unsupervised learning algorithms such as k-means or k-NN, and including all features would confuse these algorithms. The idea is to apply the autoencoder first to reduce the input features and extract meaningful data, and then apply an unsupervised learning algorithm to the compressed representation. In this way, clustering algorithms perform faster and produce more meaningful results.

unsupervised_model = tf.contrib.learn.KMeansClustering(
    10
    , distance_metric = clustering_ops.SQUARED_EUCLIDEAN_DISTANCE
    , initial_clusters=tf.contrib.learn.KMeansClustering.RANDOM_INIT)

def train_input_fn():
    data = tf.constant(com[0], tf.float32)
    return (data, None)

unsupervised_model.fit(input_fn=train_input_fn, steps=5000)
clusters = unsupervised_model.predict(input_fn=train_input_fn)

index = 0
for i in clusters:
    current_cluster = i['cluster_idx']
    features = x_test[index]
    index = index + 1

Surprisingly, this approach puts the following images in the same cluster. It seems that clustering is based on the general shapes of digits instead of their identities.

So, we've mentioned how to adapt neural networks to the unsupervised learning process. Autoencoders have been a trending topic in recent years. They are not an alternative to supervised learning algorithms. Today, most of the data we have is pixel based and unlabeled. Some mechanisms such as Mechanical Turk provide services to label this unlabeled data. The autoencoder approach might help speed up that labeling process. Finally, the source code of this post is pushed to GitHub.


The post Handling Overfitting with Dropout in Neural Networks appeared first on Sefik Ilkin Serengil.

Neural networks, particularly **Deep Neural Networks** or **Deep Learning**, have wide and deep structures. Even though these units help solve many non-linear problems, they might fall into overfitting. Overfitting means learning too much from the training data; memorizing the training data may make you lose your way on unknown examples. Instead of re-designing the network structure, dropout can gain the victory.

Applying dropout is a very easy task. You need to ignore some units randomly while training the network; you should ignore them in both back propagation and feed forward. In this way, you can prevent overfitting. The dropout operation includes dropping both units and their connections. Dropped units can be located in hidden layers or in input / output layers. Additionally, training time reduces dramatically.
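Under the hood, the mechanism can be sketched in a few lines of NumPy. The sketch below is an illustrative inverted-dropout implementation (the variant Keras applies at training time): surviving activations are scaled up so their expected sum is unchanged, which is why nothing needs to be rescaled at test time.

```python
import numpy as np

def dropout(activations, rate, training=True):
    #at test time the layer is a no-op
    if not training:
        return activations
    keep_prob = 1.0 - rate
    #binary mask: each unit survives with probability keep_prob,
    #and survivors are scaled by 1/keep_prob (inverted dropout)
    mask = (np.random.rand(*activations.shape) < keep_prob) / keep_prob
    return activations * mask

layer_output = np.ones(1000)
dropped = dropout(layer_output, rate=0.2)
#roughly 20% of the units are zeroed; the rest are scaled by 1/0.8 = 1.25
```

A fresh mask is drawn for every training batch, so each step effectively trains a different thinned sub-network.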

Dropout can be applied in Keras easily. You might have already applied it several times if you follow this blog. It is enough to specify a dropout ratio for a layer as demonstrated below. The following says that the layer's units would be dropped at a ratio of 20%.

I will apply dropout to the basic XOR example.

model = Sequential()
model.add(Dense(3 #num of hidden units
    , input_shape=(len(attributes[0]),))) #num of features in input layer
model.add(Activation('sigmoid')) #activation function from input layer to 1st hidden layer
model.add(Dropout(0.2))
model.add(Dense(len(labels[0]))) #num of classes in output layer
model.add(Activation('softmax')) #activation function from 1st hidden layer to output layer

We can compare the loss change for both the raw model and the dropout applied version.

plt.plot(rawmodel_score.history['loss'])
plt.plot(dropout_score.history['loss'])

Even though the dropout applied model shows instability, it can reduce the loss significantly for some epoch values. On the other hand, the raw model cannot get close to the dropout applied version.

So, dropout shows that small is beautiful even for the most complex systems. You should not overrate it, but you should use it. It is basically a regularization technique. Today, the most complicated neural network models such as Inception V3 include dropout units.

Bonus: the idea of dropout was first proposed by Hinton. You might be familiar with this name if you are interested in AI. He is one of the researchers who popularized back-propagation in neural networks and pioneered deep neural networks. He is also called the Godfather of AI. This technique drew my attention because of its inventor.


The post IBM Data Summit 2018 Istanbul Notes appeared first on Sefik Ilkin Serengil.

- Transforming information technologies to business aims to optimize, whereas digital transformation aims to create new revenue flows.
- Big data technologies have already helped to improve customer service and support and to gain customers.
- Currently, we are using these technologies to manage business operations and risk.
- We plan to use big data to create new businesses and improve our marketing skills.
- We still have difficulty with data accuracy and with extracting correlations between data.
- Data ethnographers and data scientists would play a more active role in 25% of enterprises by 2021.
- 50% of enterprises would create revenue from data as a service by 2018.
- Data retrieved from public resources would be stored in blockchain by 2021. In this way, public data can be verified.

- Intelligence and neuron count are correlated. Neural network schemes are not different from the ones we had in the 90's. But now we have the power to calculate across synapses millions of times wider.
- As a principle, we prefer to use technology in real life rather than just in sentences.
- We also position AI as a supporter of employees instead of a replacement for them.
- Framework choice is one of the most important subjects because gifted employees are limited; the gifted ones should adopt the framework.
- Failing fast is better than failing.
- The dialog banking term covers both text based bots and voice bots.
- Put your potential projects on a 2D graph. We've put ease of implementation on the x-axis and business value on the y-axis; you can choose different dimensions. You should look for common data between use cases. Suppose that use case 1 and use case 3 have the same business value, but the implementation of use case 3 is harder than use case 1. If both use cases rely on common data, then use case 3 would be easier than you think, so you should prioritise use case 3, because almost 80% of the time cost in AI projects is preparing data.

- IBM proposes a single infrastructure for AI transformation.
- Different roles such as Data Scientist, Data Engineer and Data Analyst can work in the same environment with Watson Data Platform.
- IBM Data Science Experience adopts both open source projects such as TensorFlow and its own intellectual properties such as SPSS.
- We also offer hybrid data management. You can work either on a remote cloud or on-premise. This is important for finance institutions; they must not work on a remote cloud because of legal regulations.
- Data scientists are people who dance with data.

- 80% of the world's data has not yet been accessed and analyzed.
- We are going to a new world. There would be neither software nor hardware; there would be (data driven) solutions only.
- It does not matter where data is stored: cloud or on-premise.
- It does not matter what its storage type is: SQL or NoSQL.


The post Oracle Analytics Summit 2018 Istanbul appeared first on Sefik Ilkin Serengil.

- Data gives us feedback, and feedback gives us progress.
- Why now? Because everything is connected: IoT, Cloud and Big Data.

- In 2016, our expectation was that 100% of customers would move to the cloud. In 2018, expectations are changing: now, we think that cloud and new technologies (Blockchain, AI and IoT) will accelerate the next disruption.
- Cloud provides **democratization** for AI and algorithms!
- Highlighting Geoffrey Moore's quote: **AI is from Venus, ML is from Mars**. This metaphor is really interesting. The question is: is Mars or Venus closer to the Earth? Actually, the answer depends on time. Even though these two planets are the closest neighbors of Earth, their orbits around the sun move at different speeds. Still, she intends to define ML as the more scalable one: AI seeks to understand the world whereas ML just seeks to simulate it.

- We also define adaptive intelligence where ML meets AI.

- The MIT AI lab was founded in 1959; we have only just put AI on our agenda.
- Robots are alive. They even have a citizenship.
- New data sources, computing paradigms and AI reveal data lakes.
- The structure of the data source does not matter anymore.
- Runnable on any kind of workflow (batch or real time).

- In 1950, the argument was "can machines think", raised by Turing. In 1980, the argument transformed into "can machines understand"; this is the Chinese Room Argument raised by John Searle. Now, we argue whether machines can think and understand like humans.

- Data feeds AI, and AI produces data. Produced data feeds AI again. That is an infinite loop.
- Short term economic value will come from Deep Learning.
- We have been talking about fraud detection for almost 20 years. But now, we can do it better. This is like the Einstein anecdote: a student asks Einstein whether these aren’t the same questions as last year’s physics exam, and Einstein responds that yes, but this year the answers are different. The question might still be fraud detection, but the answer is different. Yesterday the answer was random forest; today the answer is absolutely deep learning.

- Previously, an AI-enabled process began with human and ended with ML. Now, it should begin with ML and end with human. In this way, biased behaviors would be minimized.

- Traditional approaches are rule based, whereas innovative approaches are bot driven.
- No UI is the best UI. We would handle this with chatbots only.

At the end of the summit, my colleagues took a souvenir photo in front of the Oracle board. As seen, everybody left the event satisfied.

The post Oracle Analytics Summit 2018 Istanbul appeared first on Sefik Ilkin Serengil.

]]>The post Solving Elliptic Curve Discrete Logarithm Problem appeared first on Sefik Ilkin Serengil.

]]>Basically, we would like to find k from the equation Q = k x P

This approach, known as baby-step giant-step, reduces the complexity to O(√n), where n is the order of the group

Suppose that we are working on a toy version of the bitcoin curve which satisfies y^{2} = x^{3} + 7 and whose base point is (2, 24). Additionally, the modulo would be 199 and the order of the group would be 211. Suppose that the point (14, 39) on that curve is the public key.

Now, we would like to find k such that (14, 39) = k x (2, 24)

m = int(sqrt(order)) + 1
terminate = False

for i in range(1, m):
    iP = applyDoubleAndAddMethod(x0, y0, i, a, b, mod)
    for j in range(1, m):
        checkpoint = applyDoubleAndAddMethod(x0, y0, j*m, a, b, mod)
        checkpoint = pointAddition(publicKey[0], publicKey[1], checkpoint[0], -checkpoint[1], a, b, mod)
        if iP == checkpoint:
            print("private key is ", i+j*m, " mod ", order)
            print("ECDLP solved in", i+m, "th step")
            terminate = True
            break
    if terminate == True:
        break

So, the private key can be found at the 27th operation with the baby step giant step approach. On the other hand, brute force finds it at the 177th operation. This approach runs 6 times faster than the brute force method even on very small numbers. It would reduce the complexity dramatically for very large private keys.
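As a sketch, the same search can be written self-contained, with the blog’s helper functions replaced by inline point arithmetic. The curve parameters (a = 0, modulo 199, base point (2, 24), order 211, public key (14, 39)) come from the example above; everything else (function names, the dictionary of baby steps) is illustrative:

```python
# Self-contained baby-step giant-step sketch for the toy curve above.
from math import isqrt

P_MOD, A = 199, 0

def inv(x):
    # modular inverse via Fermat's little theorem (199 is prime)
    return pow(x, P_MOD - 2, P_MOD)

def add(p, q):
    # chord-and-tangent point addition; None represents the point at infinity
    if p is None: return q
    if q is None: return p
    if p[0] == q[0] and (p[1] + q[1]) % P_MOD == 0:
        return None
    if p == q:
        s = (3 * p[0] * p[0] + A) * inv(2 * p[1]) % P_MOD
    else:
        s = (q[1] - p[1]) * inv(q[0] - p[0]) % P_MOD
    x = (s * s - p[0] - q[0]) % P_MOD
    return (x, (s * (p[0] - x) - p[1]) % P_MOD)

def mul(k, p):
    # double-and-add scalar multiplication
    r = None
    while k:
        if k & 1:
            r = add(r, p)
        p = add(p, p)
        k >>= 1
    return r

def bsgs(P, Q, order):
    # solve Q = kP by writing k = i + j*m and matching iP against Q - j(mP)
    m = isqrt(order) + 1
    baby = {mul(i, P): i for i in range(m)}      # baby steps: iP -> i
    minus_mP = mul(m, (P[0], (-P[1]) % P_MOD))   # -mP
    gamma = Q
    for j in range(m):                           # giant steps: Q - j(mP)
        if gamma in baby:
            return (baby[gamma] + j * m) % order
        gamma = add(gamma, minus_mP)
    return None

k = bsgs((2, 24), (14, 39), 211)
print("private key:", k)
```

The baby steps are stored in a dictionary so each giant step is a constant-time lookup, which is where the O(√n) cost comes from.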

Even though this approach reduces the complexity dramatically, elliptic curve cryptography is still too powerful and the elliptic curve discrete logarithm problem is still hard. For instance, the following values are the order of the group for the bitcoin protocol and its square root.

n = 115792089237316195423570985008687907852837564279074904382605163141518161494337 (256 bit)

√n = 340282366920938463463374607431768211455

So, the square root of the order is greater than 10^{38}. Suppose that you can check one possibility per microsecond (10^{-6} seconds). Then, finding the private key takes more than 10^{24} years based on the following calculation.

(10^{38} . 10^{-6}) / (60 . 60 . 24 . 365) > 10^{24}
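This back-of-the-envelope estimate can be reproduced in a few lines of python (the variable names are illustrative):

```python
# Rough check of the estimate above: one candidate key per microsecond,
# converted to years.
sqrt_n = 340282366920938463463374607431768211455  # square root of the group order
seconds = sqrt_n * 1e-6                           # 10^-6 seconds per check
years = seconds / (60 * 60 * 24 * 365)
print(f"~{years:.1e} years")                      # comfortably above 10^24
```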

Funnily, the age of the universe is on the order of 10^{10} years. You can imagine why elliptic curve crypto systems are powerful: there is no sub-exponential solution for ECDLP yet! Besides, ECDLP seems much more difficult than the traditional DLP which empowers RSA and Diffie-Hellman. Select a 256-bit random private key and leave the rest to ECC!

The code of this tutorial is pushed to my GitHub profile. You can test it yourself if you clone the repository.

The post Solving Elliptic Curve Discrete Logarithm Problem appeared first on Sefik Ilkin Serengil.

]]>The post Counting Points on Elliptic Curves over Finite Field: Order of Elliptic Curve Group appeared first on Sefik Ilkin Serengil.

]]>ECDSA produces signatures faster. Besides, both its signatures and keys are much smaller than those of widely adopted alternatives offering similar security levels. However, this algorithm introduces a new concept called **order of group**. Point addition operations are handled on a public modulo, whereas signing and verification are handled on the order of the elliptic curve group. This is the total number of points on the curve over that finite field.

The curve equation, base point and modulo are publicly known information. The easiest way to calculate the order of the group is to add the base point to itself cumulatively until it throws an exception.

Suppose that the curve we are working on satisfies y^{2} = x^{3} + 7 mod 199 and the base point on the curve is (2, 24). The following python code checks all alternatives until an exception is raised. BTW, all subsidiary functions can be found on GitHub.

print("P: (", x0, ", ", y0, ")")

new_x, new_y = pointAddition(x0, y0, x0, y0, a, b, mod)
print("2 P: (", new_x, ", ", new_y, ")")

for i in range(3, 1000):
    try:
        new_x, new_y = pointAddition(new_x, new_y, x0, y0, a, b, mod)
        print(i, "P: (", new_x, ", ", new_y, ")")
    except:
        print("order of group: ", i)
        break

This code will produce the following results and raise an exception while calculating 211P. This means that the order of this elliptic curve group is 211, because 211P is the point at infinity.

P: ( 2 , 24 )

2 P: ( 108 , 49 )

3 P: ( 72 , 166 )

4 P: ( 18 , 80 )

5 P: ( 42 , 35 )

…

206 P: ( 42 , 164 )

207 P: ( 18 , 119 )

208 P: ( 72 , 33 )

209 P: ( 108 , 150 )

210 P: ( 2 , 175 )

Notice that the x coordinates are equal for the pairs P and 210P, 2P and 209P, 3P and 208P, …

This way is easy to understand, but it is really hard to run, because the complexity of this operation is *O(n)* in big O notation, where n is the public modulo, and it should be a very large integer. For example, the bitcoin protocol works on a 256-bit integer as calculated below.

#modulo for bitcoin
mod = pow(2, 256) - pow(2, 32) - pow(2, 9) - pow(2, 8) - pow(2, 7) - pow(2, 6) - pow(2, 4) - pow(2, 0)
print("modulo: ", mod)

The modulo of the bitcoin protocol is equal to the following value. You can imagine how hard it would be to check every point until reaching infinity for a modulo like this one.

modulo: 115792089237316195423570985008687907853269984665640564039457584007908834671663

Suppose that the elliptic curve satisfies the equation y^{2} = x^{3} + ax + b mod p. Then, the order of the elliptic curve group over GF(p) must be bounded by the following inequality, known as Hasse’s theorem. BTW, the √p term comes from probability theory.

p + 1 – 2 * √p ≤ order ≤ p + 1 + 2 * √p
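For the toy curve used earlier (p = 199, order 211), the bound is easy to verify numerically; the variable names below are illustrative:

```python
from math import sqrt

# Hasse bound check for the toy curve: p = 199, group order = 211.
p, order = 199, 211
lo = p + 1 - 2 * sqrt(p)   # about 171.8
hi = p + 1 + 2 * sqrt(p)   # about 228.2
print(lo <= order <= hi)   # prints True: 211 lies inside the Hasse interval
```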

Let’s subtract the boundaries. This reveals the complexity.

p + 1 + 2 * √p – (p + 1 – 2 * √p) = p + 1 + 2 * √p – p – 1 + 2 * √p = 4 * √p = 4 * p^{1/2}

Big O notation says that the complexity of (4 * √p) is still *O(√n)*. That is still too many operations to run.

This is not the fastest way, but it is my favorite method to find the order of an elliptic curve group. The complexity of this approach is *O(^{4}√n)*, in other words O(n^{1/4}).

from math import sqrt

Q = applyDoubleAndAddMethod(x0, y0, mod + 1, a, b, mod)
print("(mod + 1)P = ", mod + 1, "P = ", Q)

m = int(sqrt(sqrt(mod))) + 1
print("1 + (mod^1/4) = 1 + (", mod, ")^1/4 = ", m)
print()

terminate = False
for j in range(1, m + 1):
    jP = applyDoubleAndAddMethod(x0, y0, j, a, b, mod)
    print(j, "P = ", jP, " -> ", end="")
    for k in range(-m, m + 1):
        checkpoint = applyDoubleAndAddMethod(x0, y0, m*2*k, a, b, mod)
        checkpoint = pointAddition(checkpoint[0], checkpoint[1], Q[0], Q[1], a, b, mod)
        print(checkpoint, " ", end="")
        if checkpoint[0] == jP[0]: #check x-coordinates of checkpoint and jP
            orderOfGroup = mod + 1 + m*2*k
            print("\norder of group should be ", orderOfGroup, " ± ", j)
            try:
                applyDoubleAndAddMethod(x0, y0, orderOfGroup + j, a, b, mod)
            except:
                orderOfGroup = orderOfGroup + j
                terminate = True
                break
            try:
                applyDoubleAndAddMethod(x0, y0, orderOfGroup - j, a, b, mod)
            except:
                orderOfGroup = orderOfGroup - j
                terminate = True
                break
    print()
    if terminate == True:
        break

print("order of group: ", orderOfGroup)

Notice that 3P (72, 166) and 208P (72, 33) are negatives of each other because their x-coordinates are the same. Now we need to calculate both (208+3)P and (208-3)P. Whichever throws an exception gives the order of the group!

In this way, 12 calculations are enough to find the order of an elliptic curve group over GF(199), as shown below. In contrast, the brute force method requires 211 calculations to do the same job. This approach is 17 times faster than brute force on GF(199).

Of course, there is always a better way to do it! The order of group calculation can be handled in a less complex way with Schoof’s method. Its complexity is **O(log^{8} p)**.

So, we’ve mentioned how to calculate the order of an elliptic curve group. The order of the group should be calculated once and announced publicly. There are a lot of common elliptic curves, and both their modulo and group order are publicly available, so mostly you do not have to calculate the order of the group yourself. However, we researchers are suspicious ones and feel safe when we know the background!

If the topic draws your attention, you can enroll in the Elliptic Curve Cryptography Masterclass online course to dig deeper.

The post Counting Points on Elliptic Curves over Finite Field: Order of Elliptic Curve Group appeared first on Sefik Ilkin Serengil.

]]>