Large Scale Machine Learning with Pandas

We often store the entire training set in memory and pass it to the learning algorithm, with numpy handling the data manipulation. But this is the happy path. Sometimes we need to work with massive data sets that cannot fit into memory, and managing them becomes troublesome. Here, python pandas can load a massive file as small chunks and enables us to work with massive data sources in machine learning studies.
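
As a quick illustration of the idea, pd.read_csv returns an iterator of data frames when you pass a chunksize, so you never hold the whole file in memory. Here is a minimal sketch (the file name is just a placeholder, not part of this post's data set):

import pandas as pd

for chunk in pd.read_csv("massive.csv", chunksize=10000): #placeholder file name
 print(chunk.shape) #each chunk is an ordinary pandas data frame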


You might remember the Iris flower data set. It contains 150 instances, each with sepal and petal length and width measurements and a corresponding class. The class can be one of 3 different iris flower types: setosa, versicolor and virginica. So, there are 4 input features and 3 output labels. Let's create a hidden layer consisting of 4 nodes in the neural network. I usually set this number to 2/3 of the sum of features and labels. Multi-class classification requires cross-entropy as the loss function. Also, I want to apply the Adam optimization algorithm to converge faster.
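
To make that rule of thumb concrete, the arithmetic behind my choice of 4 hidden units is a one-liner:

int(2/3 * (4 + 3)) #2/3 of (features + labels) = 4.67, truncated to 4 hidden units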


import keras
from keras.models import Sequential
from keras.layers import Dense, Activation

num_classes = 3 #number of iris flower types in the output layer

def createNetwork():
 model = Sequential()

 model.add(Dense(4 #num of hidden units
 	, input_shape=(4,))) #num of features in input layer
 model.add(Activation('sigmoid')) #activation function from input layer to 1st hidden layer

 model.add(Dense(num_classes)) #num of classes in output layer
 model.add(Activation('sigmoid')) #activation function from 1st hidden layer to output layer

 return model

model = createNetwork()
model.compile(loss='categorical_crossentropy'
 , optimizer=keras.optimizers.Adam(lr=0.007)
 , metrics=['accuracy']
)

Even though this data set is small enough to fit in memory, we will load it in sub sets instead of loading it all at once. In this way, we save memory. On the other hand, this increases I/O usage, but that is a reasonable trade-off when a data set is too massive to store in memory.

The chunk size parameter is set to 30. Thus, we will read 30 lines of the data set in each iteration. Moreover, the data set has no header row, so we need to define the column names ourselves. Otherwise, pandas treats the first row as column names and we lose that line's data.

import pandas as pd
import numpy as np
chunk_size = 30

def processDataset():
 for chunk in pd.read_csv("iris.data", chunksize=chunk_size
  , names = ["sepal_length","sepal_width","petal_length","petal_width","class"]):

  current_set = chunk.values #convert df to numpy array

Each chunk is a pandas data frame. We can convert it to a numpy array by getting its values. This is important because the fit operation expects features and labels as numpy arrays.

A line of the data set consists of the 4 measurements of a flower followed by its class. I can separate features and labels by specifying index values.

features = current_set[:,0:4]
labels = current_set[:,4]

Labels are stored in a single column as strings. I will apply one-hot encoding before feeding them to the network.

for i in range(0,labels.shape[0]):
 if labels[i] == 'Iris-setosa':
  labels[i] = 0
 elif labels[i] == 'Iris-versicolor':
  labels[i] = 1
 elif labels[i] == 'Iris-virginica':
  labels[i] = 2

labels = keras.utils.to_categorical(labels, num_classes) #one-hot encoded labels

Features and labels are ready, so we can feed them to the neural network. The number of epochs must be set to 1 here. This is important: I will handle epochs in an outer for loop instead.

model.fit(features, labels, epochs=1, verbose=0) #epochs are handled in the outer for loop

We will have processed the whole training set once processDataset() finishes. Remember how back-propagation and gradient descent work: we need to apply this processing over and over.

epochs = 1000
for epoch in range(0, epochs): #epoch should be handled here, not in fit command!
 processDataset()

If you set verbose to 1, then you will see loss values for the current sub data set. You should ignore the loss during training because it does not represent the global loss over the whole training set.
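
If you do want a global figure, one option is to stream the file once more after training and aggregate the chunk-level metrics. The sketch below is my own addition rather than part of the original code; it reuses the column names and label mapping from above and relies on model.evaluate returning the loss and accuracy configured in compile.

def evaluateDataset():
 total_loss, total_acc, total_rows = 0.0, 0.0, 0

 for chunk in pd.read_csv("iris.data", chunksize=chunk_size
  , names = ["sepal_length","sepal_width","petal_length","petal_width","class"]):

  current_set = chunk.values
  features = current_set[:,0:4].astype('float32')
  labels = current_set[:,4]

  for i in range(0,labels.shape[0]):
   if labels[i] == 'Iris-setosa':
    labels[i] = 0
   elif labels[i] == 'Iris-versicolor':
    labels[i] = 1
   elif labels[i] == 'Iris-virginica':
    labels[i] = 2

  labels = keras.utils.to_categorical(labels, num_classes)

  loss, acc = model.evaluate(features, labels, verbose=0) #chunk-level metrics
  total_loss += loss * len(chunk)
  total_acc += acc * len(chunk)
  total_rows += len(chunk)

 print("global loss:", total_loss/total_rows, "accuracy:", total_acc/total_rows)

evaluateDataset()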

So, we've adapted pandas to read a massive data set as small chunks and feed a neural network's learning. This comes with pros and cons. The main advantage is that we can handle massive data sets and save memory. The disadvantage is increased I/O usage. Note that the focus of this post is working on massive data sets; it covers neither big data nor streaming data. I've pushed the source code of this post to GitHub.

