
examples: added trigger-phrase agent example #800

Open
wants to merge 12 commits into main

Conversation

s-hamdananwar
Contributor

No description provided.

- switched to ElevenLabs for TTS
- switched TTS audio publishing into a streamed method
- added boost trigger for Deepgram STT
- added references to the returns of asyncio.create_task
- added README
- removed unused variable
Copy link

changeset-bot commented Sep 26, 2024

⚠️ No Changeset found

Latest commit: e09049f

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


@nbsp (Member) left a comment

a few comments, plus run ruff check . && ruff format . for linting

    tokenize.basic.WordTokenizer(ignore_punctuation=True)
)

trigger_phrase = "Hi Bob!"
Member

nit: TRIGGER_PHRASE instead to show that this is a changeable constant
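
i.e. something like this (sketch):

TRIGGER_PHRASE = "Hi Bob!"  # upper-case marks a module-level constant that users are meant to edit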

Contributor Author

oh okay, didn't know that - thanks!

initial_ctx = llm.ChatContext().append(
    role="system",
    text=(
        f"You are {trigger_phrase}, a voice assistant created by LiveKit. Your interface with users will be voice. "
Member

weird misleading use of trigger_phrase here. this implies that it can only be used as a name, i think it's best to drop it
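
i.e. the system prompt would just read (a sketch of the suggestion):

initial_ctx = llm.ChatContext().append(
    role="system",
    text=(
        "You are a voice assistant created by LiveKit. "
        "Your interface with users will be voice. "
    ),
)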

Contributor Author

yeah, I agree. I was debating this too, but my thought was that it might help to give the LLM a bit more context

examples/trigger-phrase/agent.py (resolved)
@nbsp (Member) left a comment

looks good! i noticed a few things:

  • there are no STT transcriptions in chat, can you add those?
  • it's a bit slow. worth looking into
  • semantically this should probably be inside the voice_assistant examples directory

examples/trigger-phrase/README.md (outdated, resolved)
Comment on lines 101 to 104
vad = silero.VAD.load(
    min_speech_duration=0.01,
    min_silence_duration=0.5,
)
Member

you should have this be in the prewarm function so it doesn't block the job from starting
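
something along these lines (a minimal sketch; assumes the standard prewarm_fnc hook and JobProcess.userdata from livekit-agents):

from livekit.agents import JobContext, JobProcess, WorkerOptions, cli
from livekit.plugins import silero

def prewarm(proc: JobProcess):
    # load the model once per worker process, before any job is assigned
    proc.userdata["vad"] = silero.VAD.load(
        min_speech_duration=0.01,
        min_silence_duration=0.5,
    )

async def entrypoint(ctx: JobContext):
    vad = ctx.proc.userdata["vad"]  # already loaded, so the job starts immediately
    ...

if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint, prewarm_fnc=prewarm))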

Contributor Author

oops, I see it in the docs now, should have read it better 🤦‍♂️

@s-hamdananwar
Contributor Author

  • it's a bit slow. worth looking into

I think it is mainly due to the 0.5 sec timeout set for the VAD, and maybe partly due to the computation that needs to happen on every END_OF_SPEECH event (roughly the trigger check sketched at the end of this comment). I'm not sure of the best way to address these, though. Since the primary goal of this example is to show users a way to use transcribed words to trigger the LLM, I didn't go down the path of ensuring the minimum possible latency the way VoiceAssistant does.

  • semantically this should probably be inside the voice_assistant examples directory

Even though this is technically a voice assistant, since we are not using the VoiceAssistant class, I feel it would be confusing and counterintuitive to users if we placed it in that directory, hence the standalone example directory. What do you think?
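
for reference, the per-utterance computation is roughly this (a simplified sketch, not the exact code in the PR; matches_trigger is an illustrative name):

def matches_trigger(transcript: str, trigger_phrase: str) -> bool:
    # normalize both sides: lowercase, drop punctuation, and split into words,
    # mirroring WordTokenizer(ignore_punctuation=True)
    def norm(s: str) -> list[str]:
        return "".join(c for c in s.lower() if c.isalnum() or c.isspace()).split()

    words, trigger = norm(transcript), norm(trigger_phrase)
    # fire only when the utterance starts with the trigger phrase
    return len(trigger) > 0 and words[: len(trigger)] == trigger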

@nbsp
Member

nbsp commented Oct 4, 2024

I think it is mainly due to the 0.5 sec timeout set for the VAD, and maybe partly due to the computation that needs to happen on every END_OF_SPEECH event.

in my testing i encountered closer to three or sometimes four seconds of silence before the response started playing. this doesn't need to be fully optimized, since it's an example, but at this point it is hurting the effectiveness of the demo.

re: directory, disregard; did not notice this doesn't actually use VoicePipelineAgent.

- removed VAD
- add STT transcription
- removed first participant constraint
@s-hamdananwar
Contributor Author

  • STT transcriptions are now added ✅ (roughly via the forwarding pattern sketched at the end of this comment)
  • VAD is removed, both because of issues adding StreamAdapter to Deepgram and, hopefully, to reduce latency
  • first_participant constraint removed
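
the forwarding is along these lines (a simplified sketch, not the exact code in the PR; it assumes the STTSegmentsForwarder helper from livekit.agents.transcription):

from livekit import rtc
from livekit.agents import stt, transcription

async def forward_transcripts(
    room: rtc.Room,
    participant: rtc.RemoteParticipant,
    track: rtc.Track,
    stt_stream: stt.SpeechStream,
):
    # mirror STT events into the room so clients can render live captions
    forwarder = transcription.STTSegmentsForwarder(
        room=room, participant=participant, track=track
    )
    async for ev in stt_stream:
        forwarder.update(ev)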

@dsgolman

@s-hamdananwar this is how I was able to manage "multiple" participants in a single raise-hand queue; check out the PR and let me know if it can help resolve the issue of only listening to the first participant that joins the room.

PR
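
conceptually, a raise-hand queue serializes attention across participants; an illustrative sketch (all names hypothetical, not the linked PR's code):

import asyncio
from collections import deque

class RaiseHandQueue:
    # illustrative only: queue up participants who said the trigger phrase
    def __init__(self) -> None:
        self._queue: deque[str] = deque()
        self._hand_raised = asyncio.Event()

    def raise_hand(self, participant_identity: str) -> None:
        if participant_identity not in self._queue:
            self._queue.append(participant_identity)
            self._hand_raised.set()

    async def next_speaker(self) -> str:
        # wait until someone raises a hand, then serve in FIFO order
        while not self._queue:
            self._hand_raised.clear()
            await self._hand_raised.wait()
        return self._queue.popleft()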
