Ray 2.10.0 is broken, ModuleNotFoundError: No module named 'modin' #44264

Closed
YarShev opened this issue Mar 25, 2024 · 14 comments
Labels
bug Something that is supposed to be working; but isn't core Issues that should be addressed in Ray Core needs-repro-script Issue needs a runnable script to be reproduced P1.5 Issues that will be fixed in a couple releases. It will be bumped once all P1s are cleared usability

Comments

@YarShev

YarShev commented Mar 25, 2024

What happened + What you expected to happen

Hi guys, we started seeing failures in Modin tests with Ray 2.10.0. An example is here. The error is ModuleNotFoundError: No module named 'modin'. The tests also fail locally. Notably, they fail with pytest -n 2 or pytest -n 1 but pass with a plain pytest run. Is anyone aware of what might have broken in 2.10.0?

Versions / Dependencies

cat /etc/os-release
Ubuntu 22.04.1 LTS

python --version
Python 3.9.19

conda list | grep ray
ray-core                  2.10.0           py39h53bc9df_0    conda-forge
ray-default               2.10.0           py39h7a4ae58_0    conda-forge

Reproduction script

git clone https://github.com/modin-project/modin
cd modin
mamba env create -f environment-dev.yml
conda activate modin
python -m pytest -n 2 modin/test/test_partition_api.py

Issue Severity

High: It blocks me from completing my task.

@YarShev YarShev added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Mar 25, 2024
@dchigarev

Reproduction script
git clone https://github.com/modin-project/modin
cd modin
mamba env create -f environment-dev.yml
conda activate modin
python -m pytest -n 2 modin/test/test_partition_api.py

Running pip install . from the cloned modin directory fixes the problem.
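A rough way to check whether this workaround applies to your setup is to see if the package resolves only because the current directory is on sys.path. The sketch below is an assumption about the failure mode, not something confirmed in this thread: it uses only the standard library, and the helper names locate and importable_from_other_dir are made up for illustration. It expects a top-level module name such as modin; dotted names may raise instead of returning a result.

```python
import importlib.util
import subprocess
import sys
import tempfile


def locate(module_name):
    """Return the path a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(module_name)
    if spec is None:
        return None
    # Packages expose their directory via submodule_search_locations;
    # plain modules and built-ins only have an origin.
    if spec.submodule_search_locations:
        return list(spec.submodule_search_locations)[0]
    return spec.origin


def importable_from_other_dir(module_name):
    """Check the import in a fresh interpreter started in an unrelated directory.

    This loosely mimics a separate process that does not share the
    caller's working directory (an assumption, not Ray's actual behavior).
    """
    code = (
        "import importlib.util, sys; "
        f"sys.exit(0 if importlib.util.find_spec({module_name!r}) else 1)"
    )
    with tempfile.TemporaryDirectory() as other_dir:
        result = subprocess.run([sys.executable, "-c", code], cwd=other_dir)
    return result.returncode == 0


if __name__ == "__main__":
    name = sys.argv[1] if len(sys.argv) > 1 else "modin"
    print("resolved from:", locate(name))
    print("importable from another directory:", importable_from_other_dir(name))
```

If the first line prints a path inside the cloned repository and the second prints False, the package is only importable through the current directory, which would be consistent with pip install . making the error go away.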

@YarShev
Author

YarShev commented Mar 26, 2024

It would be great to know why it worked before and what exactly changed in 2.10.0.

@hahamark1

I am running into similar issues with Ray 2.10.0.

I have upgraded from 2.9.3 and am now getting the following error (which only happens on the worker):

Exception raised in creation task: The actor died because of an error raised in its creation task, ray::SERVE_REPLICA::llm_app#xformers-hf-internal-testing-tiny-random-gpt2#dmsm0oos:ServeReplica:llm_app:xformers-hf-internal-testing-tiny-random-gpt2.__init__() (pid=457, ip=192.128.15.91, actor_id=7e50a57d8b03c22157ca5ff901000000, repr=<ray.serve._private.replica.ServeReplica:llm_app:xformers-hf-internal-testing-tiny-random-gpt2 object at 0x7f3d93b532b0>)
  File "/home/ray/anaconda3/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/home/ray/anaconda3/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 258, in __init__
    deployment_def = cloudpickle.loads(serialized_deployment_def)
ModuleNotFoundError: No module named 'kaiko.llm_serve'
[2024-03-26 15:43:01,478 E 457 457] logging.cc:104: Stack trace: 
 /home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0xfe543a) [0x7f3ee7f0043a] ray::operator<<()
/home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0xfe7b78) [0x7f3ee7f02b78] ray::TerminateHandler()
/home/ray/anaconda3/bin/../lib/libstdc++.so.6(+0xb135a) [0x7f3ee6da835a] __cxxabiv1::__terminate()
/home/ray/anaconda3/bin/../lib/libstdc++.so.6(+0xb13c5) [0x7f3ee6da83c5]
/home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x7c9670) [0x7f3ee76e4670] std::thread::_State_impl<>::~_State_impl()
/home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x6285ba) [0x7f3ee75435ba] std::_Sp_counted_base<>::_M_release()
/home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x7b2772) [0x7f3ee76cd772] std::_Sp_counted_ptr_inplace<>::_M_dispose()
/home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x6285ba) [0x7f3ee75435ba] std::_Sp_counted_base<>::_M_release()
/home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x6e6d92) [0x7f3ee7601d92] std::default_delete<>::operator()()
/home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(_ZN3ray4core10CoreWorkerD1Ev+0xf7) [0x7f3ee76734e7] ray::core::CoreWorker::~CoreWorker()
/home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x6285ba) [0x7f3ee75435ba] std::_Sp_counted_base<>::_M_release()
/home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(_ZN3ray4core21CoreWorkerProcessImpl26RunWorkerTaskExecutionLoopEv+0x134) [0x7f3ee76b2484] ray::core::CoreWorkerProcessImpl::RunWorkerTaskExecutionLoop()
/home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(_ZN3ray4core17CoreWorkerProcess20RunTaskExecutionLoopEv+0x1d) [0x7f3ee76b258d] ray::core::CoreWorkerProcess::RunTaskExecutionLoop()
/home/ray/anaconda3/lib/python3.10/site-packages/ray/_raylet.so(+0x5a38e7) [0x7f3ee74be8e7] __pyx_pw_3ray_7_raylet_10CoreWorker_7run_task_loop()
ray::ServeReplica:llm_app:xformers-hf-internal-testing-tiny-random-gpt2() [0x4ff2f4] method_vectorcall_NOARGS
ray::ServeReplica:llm_app:xformers-hf-internal-testing-tiny-random-gpt2(_PyEval_EvalFrameDefault+0x731) [0x4ed6d1] _PyEval_EvalFrameDefault
ray::ServeReplica:llm_app:xformers-hf-internal-testing-tiny-random-gpt2(_PyFunction_Vectorcall+0x6f) [0x4fcadf] _PyFunction_Vectorcall
ray::ServeReplica:llm_app:xformers-hf-internal-testing-tiny-random-gpt2(_PyEval_EvalFrameDefault+0x731) [0x4ed6d1] _PyEval_EvalFrameDefault
ray::ServeReplica:llm_app:xformers-hf-internal-testing-tiny-random-gpt2() [0x591d92] _PyEval_Vector
ray::ServeReplica:llm_app:xformers-hf-internal-testing-tiny-random-gpt2(PyEval_EvalCode+0x87) [0x591cd7] PyEval_EvalCode
ray::ServeReplica:llm_app:xformers-hf-internal-testing-tiny-random-gpt2() [0x5c2967] run_eval_code_obj
ray::ServeReplica:llm_app:xformers-hf-internal-testing-tiny-random-gpt2() [0x5bdad0] run_mod
ray::ServeReplica:llm_app:xformers-hf-internal-testing-tiny-random-gpt2() [0x45956b] pyrun_file.cold
ray::ServeReplica:llm_app:xformers-hf-internal-testing-tiny-random-gpt2(_PyRun_SimpleFileObject+0x19f) [0x5b805f] _PyRun_SimpleFileObject
ray::ServeReplica:llm_app:xformers-hf-internal-testing-tiny-random-gpt2(_PyRun_AnyFileObject+0x43) [0x5b7dc3] _PyRun_AnyFileObject
ray::ServeReplica:llm_app:xformers-hf-internal-testing-tiny-random-gpt2(Py_RunMain+0x38d) [0x5b4b7d] Py_RunMain
ray::ServeReplica:llm_app:xformers-hf-internal-testing-tiny-random-gpt2(Py_BytesMain+0x39) [0x584e49] Py_BytesMain
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7f3ee8bb5083] __libc_start_main
ray::ServeReplica:llm_app:xformers-hf-internal-testing-tiny-random-gpt2() [0x584cfe]

*** SIGABRT received at time=1711464181 on cpu 3 ***
PC: @     0x7f3ee8bd400b  (unknown)  raise
    @     0x7f3ee8ef1420  (unknown)  (unknown)
    @     0x7f3ee6da835a         80  __cxxabiv1::__terminate()
    @     0x7f3ee75435ba         32  std::_Sp_counted_base<>::_M_release()
    @     0x7f3ee76cd772         96  std::_Sp_counted_ptr_inplace<>::_M_dispose()
    @     0x7f3ee75435ba         32  std::_Sp_counted_base<>::_M_release()
    @     0x7f3ee7601d92        144  std::default_delete<>::operator()()
    @     0x7f3ee76734e7        128  ray::core::CoreWorker::~CoreWorker()
    @     0x7f3ee75435ba         32  std::_Sp_counted_base<>::_M_release()
    @     0x7f3ee76b2484        112  ray::core::CoreWorkerProcessImpl::RunWorkerTaskExecutionLoop()
    @     0x7f3ee76b258d         32  ray::core::CoreWorkerProcess::RunTaskExecutionLoop()
    @     0x7f3ee74be8e7         32  __pyx_pw_3ray_7_raylet_10CoreWorker_7run_task_loop()
    @           0x4ff2f4  (unknown)  method_vectorcall_NOARGS
    @ ... and at least 1 more frames
[2024-03-26 15:43:01,555 E 457 457] logging.cc:361: *** SIGABRT received at time=1711464181 on cpu 3 ***
[2024-03-26 15:43:01,555 E 457 457] logging.cc:361: PC: @     0x7f3ee8bd400b  (unknown)  raise
[2024-03-26 15:43:01,555 E 457 457] logging.cc:361:     @     0x7f3ee8ef1420  (unknown)  (unknown)
[2024-03-26 15:43:01,555 E 457 457] logging.cc:361:     @     0x7f3ee6da835a         80  __cxxabiv1::__terminate()
[2024-03-26 15:43:01,555 E 457 457] logging.cc:361:     @     0x7f3ee75435ba         32  std::_Sp_counted_base<>::_M_release()
[2024-03-26 15:43:01,555 E 457 457] logging.cc:361:     @     0x7f3ee76cd772         96  std::_Sp_counted_ptr_inplace<>::_M_dispose()
[2024-03-26 15:43:01,555 E 457 457] logging.cc:361:     @     0x7f3ee75435ba         32  std::_Sp_counted_base<>::_M_release()
[2024-03-26 15:43:01,555 E 457 457] logging.cc:361:     @     0x7f3ee7601d92        144  std::default_delete<>::operator()()
[2024-03-26 15:43:01,555 E 457 457] logging.cc:361:     @     0x7f3ee76734e7        128  ray::core::CoreWorker::~CoreWorker()
[2024-03-26 15:43:01,555 E 457 457] logging.cc:361:     @     0x7f3ee75435ba         32  std::_Sp_counted_base<>::_M_release()
[2024-03-26 15:43:01,555 E 457 457] logging.cc:361:     @     0x7f3ee76b2484        112  ray::core::CoreWorkerProcessImpl::RunWorkerTaskExecutionLoop()
[2024-03-26 15:43:01,555 E 457 457] logging.cc:361:     @     0x7f3ee76b258d         32  ray::core::CoreWorkerProcess::RunTaskExecutionLoop()
[2024-03-26 15:43:01,555 E 457 457] logging.cc:361:     @     0x7f3ee74be8e7         32  __pyx_pw_3ray_7_raylet_10CoreWorker_7run_task_loop()
[2024-03-26 15:43:01,555 E 457 457] logging.cc:361:     @           0x4ff2f4  (unknown)  method_vectorcall_NOARGS
[2024-03-26 15:43:01,555 E 457 457] logging.cc:361:     @ ... and at least 1 more frames
Fatal Python error: Aborted

Stack (most recent call first):
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/worker.py", line 879 in main_loop
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/workers/default_worker.py", line 282 in <module>

Extension modules: msgpack._cmsgpack, google.protobuf.pyext._message, psutil._psutil_linux, psutil._psutil_posix, setproctitle, yaml._yaml, _brotli, charset_normalizer.md, uvloop.loop, ray._raylet, pvectorc, pydantic.typing, pydantic.errors, pydantic.version, pydantic.utils, pydantic.class_validators, pydantic.config, pydantic.color, pydantic.datetime_parse, pydantic.validators, pydantic.networks, pydantic.types, pydantic.json, pydantic.error_wrappers, pydantic.fields, pydantic.parse, pydantic.schema, pydantic.main, pydantic.dataclasses, pydantic.annotated_types, pydantic.decorator, pydantic.env_settings, pydantic.tools, pydantic, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.tslib, pandas._libs.lib, pandas._libs.hashing, pyarrow.lib, pyarrow._hdfsio, pandas._libs.ops, pyarrow._compute, pandas._libs.arrays, pandas._libs.index, pandas._libs.join, pandas._libs.sparse, pandas._libs.reduction, pandas._libs.indexing, pandas._libs.internals, pandas._libs.writers, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.tslibs.strptime, 
pandas._libs.groupby, pandas._libs.testing, pandas._libs.parsers, pandas._libs.json, grpc._cython.cygrpc, pyarrow._json (total: 93)

This library (kaiko.llm_serve) is also used on the head node, and both nodes share the same custom image.

@anyscalesam anyscalesam added the core Issues that should be addressed in Ray Core label Mar 26, 2024
@hongchaodeng
Member

hongchaodeng commented Mar 29, 2024

@YarShev @hahamark1 Can you provide a simpler reproduction script using only Ray?

@jjyao jjyao added the needs-repro-script Issue needs a runnable script to be reproduced label Apr 1, 2024
@jjyao
Collaborator

jjyao commented Apr 1, 2024

@hahamark1 are you using working_dir?

@jjyao jjyao added P1.5 Issues that will be fixed in a couple releases. It will be bumped once all P1s are cleared and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Apr 1, 2024
@hahamark1

@hahamark1 are you using working_dir?

No, we are not. In the current setup, the custom Dockerfile has the WORKING_DIR specified.

@hahamark1

The problem I am facing is better described in #44329.

@anyscalesam
Collaborator

@jjyao @rynewang per #44329, I think the fix has been merged to master and will be released with Ray 2.11. Is this good to close?

@jjyao
Collaborator

jjyao commented Apr 8, 2024

Yes, it's fixed and will be included in 2.11.

@hahamark1

Yes, it's fixed and will be included in 2.11

Is there any clarity on when 2.11 will be released? I saw #44276, but I think this hasn't been started?

@jjyao
Collaborator

jjyao commented Apr 8, 2024

It will be some time this week.

@dizhouwu

@jjyao Hi, thanks for the update. Is there any change to the release plan, given that it's Friday? Thank you again.

@jjyao
Collaborator

jjyao commented Apr 12, 2024

Hi @dizhouwu, the release is delayed to next week due to some last-minute issues. We will get it out ASAP.

@dizhouwu

Hi @dizhouwu, the release is delayed to next week due to some last-minute issues. We will get it out ASAP.

Got it, thanks for the quick reply!
