Add fake HPU mode to Habana components #180

Closed
kzawora-intel wants to merge 10 commits

Conversation

kzawora-intel

No description provided.


@mswiniarsk left a comment


Overall the idea is great, but it introduces lots of conditionals into our code (if not is_fake_hpu()).
I think it would be great if we could apply monkey patching here, similar to the GPU Migration toolkit: https://docs.habana.ai/en/latest/PyTorch/PyTorch_Model_Porting/GPU_Migration_Toolkit/GPU_Migration_Toolkit.html

In this case we could override all "hpu" modules with "pass" (do nothing) or "cpu", limit the changes to our main HPU-specific modules, and ease future development, since there would be no need to add an is_fake_hpu() check every time.
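
A minimal sketch of what such monkey patching could look like in fake-HPU mode. The helper name patch_for_fake_hpu and the exact set of patched entry points are illustrative assumptions, not part of this PR or of the GPU Migration toolkit:

```python
import importlib.util
import sys
import types

import torch


def patch_for_fake_hpu() -> None:
    """Install CPU/no-op stand-ins for HPU-only entry points so that callers
    don't need is_fake_hpu() checks scattered through the code."""
    # Only stub the Habana packages when they are genuinely unavailable.
    if importlib.util.find_spec("habana_frameworks") is None:
        core = types.SimpleNamespace(mark_step=lambda: None)  # no-op mark_step
        ht = types.SimpleNamespace(core=core)
        hf = types.SimpleNamespace(torch=ht)
        sys.modules["habana_frameworks"] = hf
        sys.modules["habana_frameworks.torch"] = ht
        sys.modules["habana_frameworks.torch.core"] = core

    # Route "hpu" device requests to CPU, so modules that hardcode
    # torch.device("hpu") keep working on a machine without Gaudi cards.
    # Caveat: code that does isinstance(x, torch.device) would still need
    # the original class; this is only a sketch of the idea.
    _orig_device = torch.device

    def _device(*args, **kwargs):
        if args and isinstance(args[0], str) and args[0].startswith("hpu"):
            return _orig_device("cpu")
        return _orig_device(*args, **kwargs)

    torch.device = _device
```

Calling patch_for_fake_hpu() once at startup would then let the rest of the code use the HPU entry points unconditionally.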


jobs:
  cputest:
    runs-on: ubuntu-latest


wouldn't it be safer to use a hardcoded ubuntu version?

      - habana_main
  pull_request:
    branches:
      - habana_main


What do you think about also adding habana_next? Just temporarily, for as long as we maintain two branches.

          VLLM_TARGET_DEVICE=hpu python setup.py develop
      - name: cpu-test
        run: |
          VLLM_SKIP_WARMUP=true VLLM_PROMPT_SEQ_BUCKET_MAX=128 python examples/offline_inference_fakehpu.py


Running with warmup would be an additional bonus validation, don't you think? It would probably be better to limit the number of buckets so that it doesn't take that much time, instead of disabling warmup.
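
For reference, a minimal sketch of a bucket-limited run with warmup left enabled. VLLM_SKIP_WARMUP and VLLM_PROMPT_SEQ_BUCKET_MAX appear in the CI step above; the *_MIN and decode-side variable names are assumed to follow the same pattern and would need to be checked against the fork:

```python
import os
import subprocess

# Keep warmup enabled (no VLLM_SKIP_WARMUP) but collapse the bucket space to a
# single prompt/decode shape so only a handful of graphs get warmed up.
env = dict(os.environ)
env.update({
    "VLLM_PROMPT_SEQ_BUCKET_MIN": "128",  # assumed knob, mirrors *_MAX below
    "VLLM_PROMPT_SEQ_BUCKET_MAX": "128",  # same value as the existing CI step
    "VLLM_DECODE_SEQ_BUCKET_MIN": "128",  # assumed knob
    "VLLM_DECODE_SEQ_BUCKET_MAX": "128",  # assumed knob
})
subprocess.run(
    ["python", "examples/offline_inference_fakehpu.py"],
    env=env,
    check=True,
)
```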

@@ -100,6 +100,7 @@ def forward(
        kv_cache: torch.Tensor,
        attn_metadata: AttentionMetadata,
    ) -> torch.Tensor:
        # import pdb; pdb.set_trace()


I guess this comment is not needed

@@ -126,6 +131,11 @@ def determine_num_available_blocks(self) -> Tuple[int, int]:

        # Execute a forward pass with dummy inputs to profile the memory usage
        # of the model.
        if is_fake_hpu():
            # self.model_runner.profile_run()


please remove commented code
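
A minimal sketch of how the fake-HPU branch could drop the commented-out call entirely; the class name, the fixed block counts, and the always-true is_fake_hpu stub are illustrative placeholders, not taken from this PR:

```python
from typing import Tuple


def is_fake_hpu() -> bool:
    # Stand-in for the PR's helper; always "fake" in this sketch.
    return True


class WorkerSketch:
    """Illustrative worker fragment showing an early return instead of
    keeping profile_run() around as a comment."""

    def determine_num_available_blocks(self) -> Tuple[int, int]:
        if is_fake_hpu():
            # Nothing to profile on a fake device: return a fixed, small
            # number of device/CPU cache blocks (placeholder values).
            return 128, 0
        # Real-device path: execute a forward pass with dummy inputs to
        # profile the memory usage of the model.
        raise NotImplementedError("real HPU profiling is out of scope here")
```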

@kzawora-intel (Author)

irrelevant

Labels: habana (Issues or PRs submitted by Habana Labs)