Data scientists tend to use pandas for data transformation because its API is clean and familiar. I have witnessed many times that feature engineering steps on terabytes of data are handled in Pandas. This kind of task can be completed in minutes with Spark because it can consume thousands of CPU cores, whereas Pandas runs on a single CPU core; the same task might take hours or whole nights in Pandas. However, developers are conservative and hesitate to switch from the technology they are used to. Herein, UC Berkeley researchers announced Modin. It offers the same functionality as Pandas while running on multiple cores. Does this actually speed things up? I will share my experiments on Modin and its comparison to Pandas.
You must store all the data in your processing unit's memory. Herein, Modin runs only on CPU cores. This might be seen as a negative, but it actually is not: a GPU may have 32 GB of memory, whereas you can have terabytes of memory in your system. You can write your own kernel to move data from the CPU's memory to the GPU, but this slows the system down radically.
How to use Modin
You just need to replace the Pandas import with Modin. This is a one-line change. Thereafter, the same pandas functions will work.
!python -m pip install --user modin

#import pandas as pd
import modin.pandas as pd
I ran Modin in a Docker container with 56 CPU cores. This caused some trouble: importing it as illustrated above raised the following exception.
Could not connect to socket /tmp/ray/session_2019-04-23_18-41-58_28079/sockets/plasma_store
I could solve this issue by passing the plasma directory during initialization.
import ray
ray.init(plasma_directory="/workspaces/sefik/temp")
import modin.pandas as modin
This dumps the following warnings, but it allows me to import the library.
WARNING: Not updating worker name since `setproctitle` is not installed. Install this with `pip install setproctitle` (or ray[debug]) to enable monitoring of worker processes.
Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-04-23_18-44-30_31971/logs.
Waiting for redis server at 127.0.0.1:23077 to respond…
Waiting for redis server at 127.0.0.1:12847 to respond…
Starting Redis shard with 10.0 GB max memory.
Warning: Capping object memory store to 20.0GB. To increase this further, specify `object_store_memory` when calling ray.init() or ray start.
WARNING: object_store_memory is not verified when plasma_directory is set.
Starting the Plasma object store with 20.0 GB memory using /workspaces/sefik/temp.
======================================================================
View the web UI at http://localhost:8888/notebooks/ray_ui.ipynb?token=4be6611c5f86475ae78feafc347f256b20a881aa885364ee
======================================================================
WARNING: Not updating worker name since `setproctitle` is not installed. Install this with `pip install setproctitle` (or ray[debug]) to enable monitoring of worker processes.
Calling ray.init() again after it has already been called.
Data set
I ran my experiments on the Santander Customer Transaction Prediction data set from Kaggle. My choice is mainly based on the data set size: it consists of 200K rows and 202 columns, and it is 302 MB in size.
Reading data set
I imported both Modin and Pandas side by side to compare their performance.
import modin.pandas as modin
import pandas as pd
Thereafter, call the read_csv function.
import time
#----------------------------
tic = time.time()
modin_df = modin.read_csv("/workspaces/96273/train.csv")
toc = time.time()
modin_time = toc - tic
print("Lasts ", modin_time, " seconds in Modin")
#----------------------------
tic = time.time()
pandas_df = pd.read_csv("/workspaces/96273/train.csv")
toc = time.time()
pandas_time = toc - tic
print("Lasts ", pandas_time, " seconds in Pandas")
#----------------------------
if pandas_time > modin_time:
    print("Modin is ", pandas_time / modin_time, " times faster than Pandas")
else:
    print("Pandas is ", modin_time / pandas_time, " times faster than Modin")
In this case, Modin outperforms as claimed. It reads the data set almost 8 times faster than Pandas.
Lasts 0.9373431205749512 seconds in Modin
Lasts 7.191021203994751 seconds in Pandas
Modin is 7.67170638600718 times faster than Pandas
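The timing pattern above repeats for every function I test, so it can be wrapped in a small helper. This is just my own sketch; the names benchmark and compare are hypothetical, not part of either library.

```python
import time

def benchmark(label, fn, *args, **kwargs):
    """Time a single call; print and return (result, elapsed seconds)."""
    tic = time.time()
    result = fn(*args, **kwargs)
    elapsed = time.time() - tic
    print("Lasts ", elapsed, " seconds in ", label)
    return result, elapsed

def compare(modin_time, pandas_time):
    """Return a message saying which library won and by what factor."""
    if pandas_time > modin_time:
        return "Modin is " + str(pandas_time / modin_time) + " times faster than Pandas"
    return "Pandas is " + str(modin_time / pandas_time) + " times faster than Modin"

# usage with any pair of callables, e.g.:
# modin_df, modin_time = benchmark("Modin", modin.read_csv, "train.csv")
# pandas_df, pandas_time = benchmark("Pandas", pd.read_csv, "train.csv")
# print(compare(modin_time, pandas_time))
```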
Functions
The top 20 Pandas functions in Kaggle challenges are demonstrated here. I applied these functions in both Pandas and Modin. The following table shows run times in seconds.
| Function | Modin (s) | Pandas (s) | Faster One | Speed-up |
|----------|-----------|------------|------------|----------|
| read_csv | 0.937343 | 7.191021 | Modin | 7.671706 |
| to_csv | 49.20241 | 47.45609 | Pandas | 1.036799 |
| std | 0.259741 | 0.002418 | Pandas | 107.4284 |
| max | 0.257777 | 0.001678 | Pandas | 153.6005 |
| min | 0.257878 | 0.001641 | Pandas | 157.1206 |
| merge | 1.198666 | 0.295746 | Pandas | 4.053021 |
| mean | 0.264024 | 0.001823 | Pandas | 144.7959 |
| sum | 0.258359 | 0.001425 | Pandas | 181.3313 |
| head | 0.262655 | 0.000222 | Pandas | 1180.768 |
| groupby | 0.003455 | 0.000634 | Pandas | 5.450169 |
| loc | 7.693181 | 0.145900 | Pandas | 52.72913 |
Modin disappoints me a lot. It underperformed in every function I applied except read_csv. Some functions such as std require column-wise calculation, and parallel processing technologies don't handle this kind of calculation well. However, other functions such as min, max and mean should perform well under parallel processing, because they can all be expressed as map-reduce.
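Why should min, max, mean and sum parallelize well? Each partition of rows can compute a partial result independently (the map step), and the partials combine cheaply (the reduce step). A rough NumPy sketch of the idea, using a small random matrix as a stand-in; this is my own illustration, not Modin's actual implementation:

```python
import numpy as np

# a small stand-in for the real data set
data = np.random.randint(0, 100, size=(1000, 8))

# map: split the rows into partitions and compute a partial sum per partition
partitions = np.array_split(data, 4)
partial_sums = [part.sum(axis=0) for part in partitions]

# reduce: combine the partial results into the final column-wise sums
total = np.sum(partial_sums, axis=0)

# the map-reduce result matches the direct computation
assert (total == data.sum(axis=0)).all()
```

In Modin the partitions live in separate Ray workers, so the map step runs on many cores at once; the reduce step is a cheap combination of small arrays.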
Similarly, the loc function can be expressed with map-reduce. I find negative values in the var_1 column and replace them with 0. I expected loc to outperform in Modin, but it didn't.
tic = time.time()
ixs = modin_df[modin_df['var_1'] < 0].index
modin_df.loc[ixs, 'var_1'] = 0
toc = time.time()
modin_time = toc - tic
print("Lasts ", modin_time, " seconds in Modin")
#----------------------------
tic = time.time()
ixs = pandas_df[pandas_df['var_1'] < 0].index
pandas_df.loc[ixs, 'var_1'] = 0
toc = time.time()
pandas_time = toc - tic
print("Lasts ", pandas_time, " seconds in Pandas")
#----------------------------
if pandas_time > modin_time:
    print("Modin is ", pandas_time / modin_time, " times faster than Pandas")
elif modin_time > pandas_time:
    print("Pandas is ", modin_time / pandas_time, " times faster than Modin")
The library should be tested on a larger data set.
Testing on a larger data set
Let’s test Modin on a larger data set. Here, I’ll create a random data set.
import numpy as np

data = np.random.randint(0, 100, size=(2**22, 2**8))
print(data.shape)
This generates a data set consisting of 4,194,304 rows and 256 columns. It is almost 20 times bigger than the Santander data set.
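As a back-of-the-envelope check on that size claim (my own arithmetic, not from the benchmark itself):

```python
# dimensions from np.random.randint(0, 100, size=(2**22, 2**8))
rows, cols = 2**22, 2**8

cells = rows * cols            # 2**30 = 1,073,741,824 cells
row_ratio = rows / 200_000     # vs the Santander set's 200K rows

print(rows, cols, cells, round(row_ratio, 1))
```

The row count alone is about 21 times the Santander set's, which matches the "almost 20 times bigger" comparison.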
tic = time.time()
modin_df = modin.DataFrame(data)
modin_df = modin_df.add_prefix("var_")
toc = time.time()
modin_time = toc - tic
print("Lasts ", modin_time, " seconds in Modin")
#----------------------------
tic = time.time()
pandas_df = pd.DataFrame(data)
pandas_df = pandas_df.add_prefix("var_")
toc = time.time()
pandas_time = toc - tic
print("Lasts ", pandas_time, " seconds in Pandas")
#----------------------------
if pandas_time > modin_time:
    print("Modin is ", pandas_time / modin_time, " times faster than Pandas")
else:
    print("Pandas is ", modin_time / pandas_time, " times faster than Modin")
The results didn't surprise me this time. Pandas is still better at math calculations. The ratios are smaller this time, though, so we can expect Modin to catch up with Pandas on even larger data sets.
| Function | Modin (s) | Pandas (s) | Faster One | Speed-up |
|----------|-----------|------------|------------|----------|
| DataFrame | 33.26098 | 30.84624 | Pandas | 1.078283 |
| std | 0.631248 | 0.117352 | Pandas | 5.379107 |
| max | 0.432755 | 0.053161 | Pandas | 8.140433 |
| min | 0.435226 | 0.053207 | Pandas | 8.179875 |
| mean | 0.446028 | 0.056934 | Pandas | 7.834107 |
| sum | 0.434081 | 0.053153 | Pandas | 8.166591 |
| groupby | 0.002592 | 0.000391 | Pandas | 6.633923 |
On the other hand, some functions such as merge and head fail in Modin for this data set.
To Sum Up
It seems too early to adopt Modin instead of Pandas, because it underperforms regular Pandas in every function except read_csv. Besides, it is not supported on Windows. However, the idea is very promising. I pushed the notebook including my experiments to GitHub.
Besides, some existing libraries such as cuDF support pandas-like operations on GPU. I plan to experiment with this subject soon.
Support this blog if you like it!
This is an awesome post! I just tried Modin and found a similar problem.
I checked the repo; I think in some cases Modin seems faster because it is lazily executed, which means the action is not actually executed at that point. For example, if you run:
modin = some_modin_df.read_csv()
modin.to_csv()
pandas = pandas_df.read_csv()
pandas.to_csv()
You will find the first to_csv is slower, because it forces the deferred work to execute.