Regression with model loading on Weka file system and DeepSpeed ZeRO-3 configuration in Transformers

Randy White randall.white at contextual.ai
Wed Jun 18 20:41:01 UTC 2025


Hi everyone,

We are experiencing a severe model-loading performance regression on our Weka
parallel file system after updating to Ubuntu 22.04 with kernel 6.8.

When we load models with DeepSpeed ZeRO-3 optimization through the
Transformers Python library, loading from Weka takes roughly 30 minutes to
1 hour, even when the data is tiered into SSD storage.
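
For reference, the load path looks roughly like the sketch below (the model
path, config values, and variable names are illustrative placeholders, not our
actual settings); the relevant detail is that constructing HfDeepSpeedConfig
with a stage-3 config before from_pretrained makes Transformers read the
checkpoint shards under ZeRO-3 during that call:

    from transformers import AutoModelForCausalLM
    from transformers.integrations import HfDeepSpeedConfig

    # Placeholder ZeRO-3 config; our real config differs.
    ds_config = {
        "zero_optimization": {"stage": 3},
        "train_micro_batch_size_per_gpu": 1,
        "bf16": {"enabled": True},
    }

    # Keeping this object alive before from_pretrained enables the
    # DeepSpeed ZeRO-3 (zero.Init) loading path in Transformers, so the
    # checkpoint shards are read from the Weka mount during this call.
    dschf = HfDeepSpeedConfig(ds_config)  # must stay referenced
    model = AutoModelForCausalLM.from_pretrained("/weka/models/example-model")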

We have tested this thoroughly, and if we downgrade the kernel to 5.15 the
issue goes away entirely.

We are on kernel 6.8 because our infrastructure runs on GCP with the
a3-highgpu-8g instance types, which require TCPX-patched kernels.

I'm not sure how to proceed. I have collected some perf data, which I have sent to Weka.

I was also wondering whether it would be possible to apply the TCPX patches
that were made for the 5.15 kernel in 20.04 to the 5.15 kernel in 22.04?

If we could downgrade from kernel 6.8 to 5.15 with TCPX still working, that
would improve things considerably for us.

-RC
