Running multiple jobs, one per TPU core? #16629

rog77 · 2023-07-05T13:57:08Z

I read that TPUs have multiple cores. For a project I am looking at, it would be very handy to dispatch jobs with different compiled code to separate TPU cores, let them run for different lengths of time, and have each return its solution to Python when it is ready. So far I have been vmapping and pmapping the same algorithm, and now I need to switch it up.

Is there any documentation on how to do this please? I would be using colab/kaggle initially, then moving onto Google Cloud single TPU(s), with a view to eventually needing a pod slice or two. Fine-grained control of how I can assign jobs to chips/cores is something I would need to address.

Any advice will be gratefully received, thank you!

Answered by ayaka14732

rog77 (Author) · 2023-07-10T22:47:33Z

@jakevdp

I'd really appreciate your thoughts on this.

Is it possible to run separate jobs on different TPU cores? Could you please point me at any relevant resources? As I say, presently I am using colab/kaggle for testing.

Right now I have a pmap'd vmap of the same problem, and it works very well. I would very much like to be able to run two separate, unrelated jobs on the same chip.

Thanks in advance for any response :-)

ayaka14732 (Collaborator) · 2023-07-12T17:38:40Z

Solution: https://twitter.com/ayaka14732/status/1589274652354162690

I am also writing a library for this: https://github.com/ayaka14732/llama-jax/blob/main/lib/proc_init_utils/initialisation.py

Usage:

1.py

from lib.proc_init_utils import initialise_tpu
initialise_tpu('v4-16', n_devices=1, rank=0)

2.py

from lib.proc_init_utils import initialise_tpu
initialise_tpu('v4-16', n_devices=1, rank=1)

3.py

from lib.proc_init_utils import initialise_tpu
initialise_tpu('v4-16', n_devices=1, rank=2)

4.py

from lib.proc_init_utils import initialise_tpu
initialise_tpu('v4-16', n_devices=1, rank=3)
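
Since each script above claims its own device, one way to get results back "when they are ready" is to start the scripts as separate processes from a parent Python program and collect their output in completion order. The following is only a minimal sketch using the standard library. The helper names `run_script` and `run_all` are hypothetical, and it assumes each script prints its result to stdout, which the thread does not state.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_script(args):
    """Run one command in its own OS process and return (args, stdout)."""
    done = subprocess.run(args, capture_output=True, text=True, check=True)
    return args, done.stdout.strip()


def run_all(commands):
    """Yield (command, output) pairs in the order the processes finish."""
    # Threads here only wait on child processes; the actual work (and any
    # TPU initialisation) happens inside each separate process.
    with ThreadPoolExecutor(max_workers=len(commands)) as pool:
        futures = [pool.submit(run_script, cmd) for cmd in commands]
        for fut in as_completed(futures):
            yield fut.result()
```

With the four files from the answer, the call would look like `run_all([[sys.executable, name] for name in ["1.py", "2.py", "3.py", "4.py"]])`. Jobs that finish early are yielded first, so the parent does not have to wait for the slowest one.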

2 replies

rog77 Jul 12, 2023
Author

Thanks for this. Do you suppose it would work with Python multithreading, or does it need a separate process per chip/core? Either way, this is very helpful and much appreciated!

ayaka14732 Jul 13, 2023
Collaborator

I use a separate process per chip/core. I don't know if it works with Python multithreading.
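
For anyone who wants to try the multithreading route, here is a minimal sketch of what it could look like inside a single process. Each thread places its input on a chosen device with `jax.device_put` and runs a different jitted function there. The two job functions and the `run_on` helper are hypothetical placeholders, and nothing in this thread confirms that this approach behaves well for long-running, unrelated jobs on TPU, so treat it as an experiment rather than a recommendation.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

import jax
import jax.numpy as jnp


@jax.jit
def sum_of_squares(x):
    return jnp.sum(x ** 2)


@jax.jit
def running_total(x):
    return jnp.cumsum(x)[-1]


def run_on(device, fn, data):
    """Copy data to the given device, run fn there, and wait for the result."""
    x = jax.device_put(data, device)
    return fn(x).block_until_ready()


devices = jax.local_devices()
# On a host with a single device, both jobs simply share that device.
jobs = {
    "sum_of_squares": (devices[0], sum_of_squares),
    "running_total": (devices[-1], running_total),
}

data = jnp.arange(4.0)
results = {}
with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
    futures = {
        pool.submit(run_on, dev, fn, data): name
        for name, (dev, fn) in jobs.items()
    }
    # Results are stored as each job finishes, not in submission order.
    for fut in as_completed(futures):
        results[futures[fut]] = float(fut.result())
```

The design choice is that the computation follows the data: because each input is explicitly placed on a device, the jitted function runs on that device, and the thread only blocks while waiting for its own result.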

greg-rezo · 2024-09-16T20:30:13Z

Sorry to revive an old thread. Is there a way to do something like this to run a process per core instead of a process per chip? I'm thinking of the v5p architecture, which has 2 cores per chip, and wondering how to make full use of it in JAX.

1 reply

hawkinsp Sep 16, 2024
Maintainer

Unfortunately, this cannot be done. A single TPU chip cannot currently be shared between multiple processes.
