How Modin Can Keep Data Scientists From Pandas

Data scientists tend to use Pandas for data transformation because it is simple and familiar. I have witnessed many times that feature engineering steps on terabytes of data were handled in Pandas. That kind of task can be completed in minutes with Spark, because Spark can consume thousands of CPU cores, whereas Pandas runs on a single CPU core. The same task might take hours, or even whole nights, in Pandas. However, developers are conservative and hesitate to switch away from the technology they are used to. Herein, UC Berkeley researchers announced Modin. It offers the same functionality as Pandas while running on multiple cores. Does this actually speed things up? I will share my experiments on Modin and its comparison to Pandas.

Kung Fu Panda 3 (2016)
Kung Fu Panda (2008)

You must store all of the data in your processing unit's memory. Modin runs on CPU cores only. This might seem like a drawback, but it actually is not. A GPU might have 32 GB of memory, whereas you can have terabytes of memory on a regular system. You could write your own kernel to move data from CPU memory to the GPU, but that slows the system down radically.



How to use Modin

You just need to replace the Pandas import with Modin. This is a one-line change. Thereafter, the same Pandas functions will work.

!python -m pip install --user modin
#import pandas as pd
import modin.pandas as pd

I ran Modin in a Docker image with 56 CPU cores. This caused some trouble: importing it as illustrated above raised the following exception.

Could not connect to socket /tmp/ray/session_2019-04-23_18-41-58_28079/sockets/plasma_store

I could solve this issue by passing the plasma directory when initializing Ray.

import ray
ray.init(plasma_directory="/workspaces/sefik/temp")

import modin.pandas as modin

This prints the following warnings, but it allows me to import the library this way.

WARNING: Not updating worker name since `setproctitle` is not installed. Install this with `pip install setproctitle` (or ray[debug]) to enable monitoring of worker processes.
Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-04-23_18-44-30_31971/logs.
Waiting for redis server at 127.0.0.1:23077 to respond…
Waiting for redis server at 127.0.0.1:12847 to respond…
Starting Redis shard with 10.0 GB max memory.
Warning: Capping object memory store to 20.0GB. To increase this further, specify `object_store_memory` when calling ray.init() or ray start.
WARNING: object_store_memory is not verified when plasma_directory is set.
Starting the Plasma object store with 20.0 GB memory using /workspaces/sefik/temp.

======================================================================
View the web UI at http://localhost:8888/notebooks/ray_ui.ipynb?token=4be6611c5f86475ae78feafc347f256b20a881aa885364ee
======================================================================

WARNING: Not updating worker name since `setproctitle` is not installed. Install this with `pip install setproctitle` (or ray[debug]) to enable monitoring of worker processes.
Calling ray.init() again after it has already been called.
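
As the warning above suggests, the 20 GB cap on the object store can also be raised explicitly by passing object_store_memory (in bytes) to ray.init. A minimal sketch, where the 40 GB value is just an illustrative choice for this 56-core machine:

import ray

# Pass the object store size (in bytes) together with the plasma directory.
# 40 GB below is only an example value; tune it to the memory you have.
ray.init(
    plasma_directory="/workspaces/sefik/temp",
    object_store_memory=40 * 1024 ** 3
)

import modin.pandas as modin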





Data set

I ran my experiments on the Santander Customer Transaction Prediction data set from Kaggle. My choice is mainly based on the data set's size: it consists of 200K rows and 202 columns, and it is about 302 MB on disk.
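
As a quick sanity check on these figures, the file size and shape can be verified directly (the path below is the one used in the rest of this post):

import os
import pandas as pd

train_path = "/workspaces/96273/train.csv"
# File size on disk, in megabytes
print(round(os.path.getsize(train_path) / 1024 ** 2), "MB")
# Row and column counts, expecting roughly (200000, 202)
print(pd.read_csv(train_path).shape)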

Reading data set

I imported both Modin and Pandas at the same time to compare their performance.

import modin.pandas as modin
import pandas as pd

Thereafter, call the read_csv function of each library and time it.

import time
#----------------------------
tic = time.time()
modin_df = modin.read_csv("/workspaces/96273/train.csv")
toc = time.time()
modin_time = toc-tic
print("Lasts ",modin_time," seconds in Modin")
#----------------------------
tic = time.time()
pandas_df = pd.read_csv("/workspaces/96273/train.csv")
toc = time.time()
pandas_time = toc - tic
print("Lasts ",pandas_time," seconds in Pandas")
#----------------------------
if pandas_time > modin_time:
    print("Modin is ",pandas_time / modin_time," times faster than Pandas")
else:
    print("Pandas is ",modin_time / pandas_time," times faster than Modin")

In this case, Modin outperforms Pandas as claimed. It reads the data set almost 8 times faster.

Lasts 0.9373431205749512 seconds in Modin
Lasts 7.191021203994751 seconds in Pandas
Modin is 7.67170638600718 times faster than Pandas

Functions

The top 20 Pandas functions used in Kaggle challenges are listed here. I applied these functions with both Pandas and Modin. The following table shows execution times in seconds.

Function   Modin (s)   Pandas (s)   Faster library   Speed-up (x)
read_csv   0.937343    7.191021     Modin            7.671706
to_csv     49.20241    47.45609     Pandas           1.036799
std        0.259741    0.002418     Pandas           107.4284
max        0.257777    0.001678     Pandas           153.6005
min        0.257878    0.001641     Pandas           157.1206
merge      1.198666    0.295746     Pandas           4.053021
mean       0.264024    0.001823     Pandas           144.7959
sum        0.258359    0.001425     Pandas           181.3313
head       0.262655    0.000222     Pandas           1180.768
groupby    0.003455    0.000634     Pandas           5.450169
loc        7.693181    0.145900     Pandas           52.72913
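
Each row of the table comes from timing the same call in both libraries, in the same style as the read_csv snippet above. A simplified helper along those lines (a sketch, not the exact notebook code) could look like this:

import time

def benchmark(modin_call, pandas_call):
    # Time the Modin version of an operation
    tic = time.time()
    modin_call()
    modin_time = time.time() - tic
    # Time the Pandas version of the same operation
    tic = time.time()
    pandas_call()
    pandas_time = time.time() - tic
    return modin_time, pandas_time

# Example usage for the mean row of the table
modin_time, pandas_time = benchmark(lambda: modin_df.mean(), lambda: pandas_df.mean())
print(modin_time, pandas_time)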

Modin disappoints me here. It underperforms Pandas for every function I applied except read_csv. Some functions such as std require column-wise calculation, and parallel processing frameworks do not handle this kind of computation well. However, functions such as min, max and mean should perform well under parallel processing, because they can all be expressed as map-reduce operations.
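
To see why sum, min and max parallelize naturally, here is a toy map-reduce sketch that splits an array into chunks, computes partial sums in separate processes, and then combines them. This is only a conceptual illustration with Python's multiprocessing; it is not how Modin partitions data internally.

import numpy as np
from multiprocessing import Pool

def partial_sum(chunk):
    # Map step: each worker sums its own chunk
    return chunk.sum()

if __name__ == "__main__":
    values = np.random.rand(1_000_000)
    chunks = np.array_split(values, 8)  # one chunk per worker
    with Pool(8) as pool:
        partials = pool.map(partial_sum, chunks)
    total = sum(partials)  # Reduce step: combine the partial sums
    print(total, values.sum())  # the two results should match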

Similarly, the loc function can be expressed as a map-reduce operation. Here, I find the negative values in the var_1 column and replace them with 0. I expected loc to be faster in Modin, but it was not.

tic = time.time()
ixs = modin_df[modin_df['var_1'] < 0].index
modin_df.loc[ixs, 'var_1'] = 0
toc = time.time()
modin_time = toc-tic
print("Lasts ",modin_time," seconds in Modin")
#----------------------------
tic = time.time()
ixs = pandas_df[pandas_df['var_1'] < 0].index
pandas_df.loc[ixs, 'var_1'] = 0
toc = time.time()
pandas_time = toc - tic
print("Lasts ",pandas_time," seconds in Pandas")
#----------------------------
if pandas_time > modin_time:
    print("Modin is ",pandas_time / modin_time," times faster than Pandas")
elif modin_time > pandas_time:
    print("Pandas is ",modin_time / pandas_time," times faster than Modin")

The library should also be tested on a larger data set.





Testing on a larger data set

Let’s test Modin on a larger data set. Here, I’ll create a random data set.

import numpy as np
data = np.random.randint(0,100,size = (2**22, 2**8))
print(data.shape)

This generates a data set consisting of 4,194,304 rows and 256 columns. It has roughly 20 times more rows than the Santander data set.

tic = time.time()
modin_df = modin.DataFrame(data)
modin_df = modin_df.add_prefix("var_")
toc = time.time()
modin_time = toc-tic
print("Lasts ",modin_time," seconds in Modin")
#----------------------------
tic = time.time()
pandas_df = pd.DataFrame(data)
pandas_df = pandas_df.add_prefix("var_")
toc = time.time()
pandas_time = toc - tic
print("Lasts ",pandas_time," seconds in Pandas")
#----------------------------
if pandas_time > modin_time:
    print("Modin is ",pandas_time / modin_time," times faster than Pandas")
else:
    print("Pandas is ",modin_time / pandas_time," times faster than Modin")

The results didn't surprise me this time. Pandas is still faster at these calculations, but the ratios are smaller. We can expect Modin to catch up with Pandas on even larger data sets.

Function    Modin (s)   Pandas (s)   Faster library   Speed-up (x)
DataFrame   33.26098    30.84624     Pandas           1.078283
std         0.631248    0.117352     Pandas           5.379107
max         0.432755    0.053161     Pandas           8.140433
min         0.435226    0.053207     Pandas           8.179875
mean        0.446028    0.056934     Pandas           7.834107
sum         0.434081    0.053153     Pandas           8.166591
groupby     0.002592    0.000391     Pandas           6.633923

On the other hand, some functions such as merge and head fail in Modin for this data set.

To Sum Up

It seems too early to adopt Modin instead of Pandas, because it underperforms regular Pandas for every function except read_csv. Besides, it is not supported on Windows. However, the idea behind it is very promising. I pushed the notebook containing my experiments to GitHub.

In addition, some existing libraries such as cuDF bring a pandas-like API to the GPU. I plan to run experiments on this subject soon.
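
I have not benchmarked cuDF yet, but the idea is the same one-line swap. A minimal sketch, assuming a machine with an NVIDIA GPU and cuDF installed (the file path is the same one used above):

import cudf

# cuDF mirrors much of the pandas API but keeps the data on the GPU
gdf = cudf.read_csv("/workspaces/96273/train.csv")
print(gdf["var_1"].mean())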




1 Comment

  1. This is an awesome post! I just tried Modin and found a similar problem.
    I checked the repo, and I think in the cases where Modin appears faster, it is because Modin is lazily executed, which means the action has not actually been completed yet. For example, if you run:
    modin_df = modin.read_csv("train.csv")
    modin_df.to_csv("modin_out.csv")
    pandas_df = pd.read_csv("train.csv")
    pandas_df.to_csv("pandas_out.csv")
    you will find that the first (Modin) sequence is the slower one.
