Pandas has the power of a tiger, and its performance can still surprise me. However, it comes with a major shortcoming: no matter how many CPU cores you have, pandas performs its operations on a single core. It’s like being stuck in second gear even though you drive a Porsche! Previously, I mentioned its multi-core variation, Modin, but it really disappointed me: it could not even get close to regular pandas. This is where h2o comes in. It offers the h2o frame, a data structure similar to the pandas data frame, and it calls forth the power of ten tigers.
CPU Cores
When you perform a pandas operation and run the top command in the terminal, you will see a single python3 process consuming 100% CPU in the best case. It cannot exceed 100% because pandas is limited to a single core.
However, if you perform an h2o frame operation, h2o starts a Java process. I saw it reach 8000% CPU consumption in the top command.
You can also run the htop command to see core consumption graphically. A pandas operation keeps one CPU core busy, whereas h2o spreads the work across multiple cores.
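To read those percentages, here is a minimal sketch, assuming top is in its default mode on Linux where 100% stands for one fully busy core. The helper name busy_cores is made up for illustration.

```python
import os


def busy_cores(top_cpu_percent):
    """Convert a per-process %CPU reading from top into fully busy cores.

    Assumes top's default mode, where 100% means one saturated core.
    """
    return top_cpu_percent / 100.0


# Number of logical cores on this machine (may be None on some platforms).
print("logical cores:", os.cpu_count())

print(busy_cores(100))   # the pandas case: 1.0 core
print(busy_cores(8000))  # the h2o case: 80.0 cores
```

So a reading of 8000% only makes sense on a machine with at least 80 logical cores.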
Functions
The h2o frame’s interface covers mostly the same functions as pandas, but it is not identical. For example, std() computes the standard deviation in pandas, whereas h2o calls it sd(). Even though replacing pandas functions with their h2o counterparts has a cost, it is definitely worth it.
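As a quick reference, the sketch below lists only the pandas and h2o pairs that are used in this post. It is not a complete mapping, and the dictionary name PANDAS_TO_H2O is just an illustrative choice.

```python
# Pairs of pandas calls and the h2o calls used for the same job in this post.
# This is a partial cheat sheet, not an exhaustive or official mapping.
PANDAS_TO_H2O = {
    "pd.read_csv": "h2o.import_file",
    "pd.concat": "H2OFrame.rbind",
    "Series.mean": "H2OFrame.mean",
    "Series.std": "H2OFrame.sd",
    "DataFrame.head": "H2OFrame.head",
}

for pandas_name, h2o_name in PANDAS_TO_H2O.items():
    print(f"{pandas_name:16} -> {h2o_name}")
```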
Initialization
We will import h2o and initialize it. In most cases a plain init() call handles initialization, but sometimes you need to limit the memory and the number of threads.
import h2o
#h2o.init()
h2o.init(ip="127.0.0.1", max_mem_size_GB=200, nthreads=20)
Tasks
We will handle the following tasks.
1- Read the Santander customer transaction prediction data set. The original data set has 200k rows and 202 columns.
2- Increase the data set size. Append the data set to itself 3 times. Each append doubles it, so the data set will have 1.6M rows.
3- The var_1 column has both positive and negative values. Update the negative ones to zero.
4- Create a new column which stores the var_1 column’s mean value plus its standard deviation. Notice that h2o returns these kinds of values as lists, so they are accessed by index.
5- Print the first 10 rows.
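Before running this on the real data, the arithmetic behind tasks 2 to 4 can be sketched with the standard library only. The four values below are made up and stand in for the var_1 column. statistics.stdev is the sample standard deviation, which is also what pandas std() computes by default.

```python
import statistics

# Toy stand-in for the var_1 column (made-up values, not the real data).
var_1 = [-2.0, 1.0, 3.0, -0.5]

# Task 2: appending the data to itself doubles it on every pass.
for _ in range(3):
    var_1 = var_1 + var_1
print(len(var_1))  # 4 rows * 2**3 = 32 rows

# Task 3: replace negative values with zero.
var_1 = [max(v, 0.0) for v in var_1]

# Task 4: a single scalar (mean + standard deviation) shared by every row.
var_1_sigma = statistics.mean(var_1) + statistics.stdev(var_1)
print(var_1_sigma)

# The same doubling applied to the row count of the real data set.
rows = 200_000 * 2 ** 3
print(rows)  # 1600000
```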
h2o implementation
import time
tic = time.time()

hf = h2o.import_file('dataset/train.csv')
print("raw data set size: ", hf.shape)

for i in range(0, 3):
    hf = hf.rbind(hf)

print("current data set size: ", hf.shape)

hf[hf["var_1"] < 0, "var_1"] = 0
hf['var_1_sigma'] = hf['var_1'].mean()[0] + hf['var_1'].sd()[0]

print(hf.head())

toc = time.time()
print(toc-tic, " seconds")
pandas implementation
import time
import pandas as pd

tic = time.time()

df = pd.read_csv("dataset/train.csv")
print("raw data set size: ", df.shape)

for i in range(0, 3):
    df = pd.concat([df, df])

print("current data set size: ", df.shape)

df.loc[df[df['var_1'] < 0].index, 'var_1'] = 0
df['var_1_sigma'] = df['var_1'].mean() + df['var_1'].std()

print(df.head(10))

toc = time.time()
print(toc-tic, " seconds")
h2o completed in 14.14 seconds, whereas pandas completed in 25.21 seconds. In this experiment, h2o is roughly 1.8 times faster than pandas.
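If you want to repeat the comparison, the tic and toc pattern can be wrapped in a small helper. This is only a sketch: the timed helper and the placeholder workload are made up for illustration, and the speedup is computed from the two timings reported above.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the elapsed wall-clock seconds of the with-block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


results = {}
with timed("demo", results):
    sum(range(1_000_000))  # placeholder workload, swap in the real pipeline
print(results)

# Speedup from the timings reported above (pandas seconds / h2o seconds).
speedup = 25.21 / 14.14
print(round(speedup, 2))  # 1.78
```

Timings like these depend heavily on the machine, so treat the ratio as a single data point rather than a general rule.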
As seen, once you load the data set as an h2o frame, you can do all your work with h2o functions. Even though you can convert an h2o frame to pandas and vice versa, you mostly won’t need to.
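For the cases where you do need it, a minimal sketch of the conversion in both directions follows. The helper names are made up for illustration, and the calls assume that h2o.init() has already been run. Keep in mind that converting to pandas pulls the whole frame into local memory.

```python
def h2o_to_pandas(hf):
    """Pull an H2OFrame from the h2o cluster into a local pandas DataFrame."""
    return hf.as_data_frame()


def pandas_to_h2o(df):
    """Upload a pandas DataFrame to the running h2o cluster as an H2OFrame."""
    import h2o  # imported lazily so the helpers can be defined without a cluster
    return h2o.H2OFrame(df)


# Usage, assuming h2o.init() has been called and hf is an existing H2OFrame:
# df = h2o_to_pandas(hf)
# hf_again = pandas_to_h2o(df)
```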
So, h2o surprises me again, just as its AutoML solution did. Until today, pandas was the de facto standard among data scientists. Even though it runs on a single CPU core, nothing could surpass its performance until h2o appeared. I strongly recommend h2o to data people who are going to work on large-scale machine learning problems. My experiments with h2o will continue…