
Bug: Significant Cold Start Initialization Delay in Orchestration Framework #126

Open
karanakatle opened this issue Dec 3, 2024 · 8 comments
Labels
bug Something isn't working triage

Comments

@karanakatle

Expected Behaviour

The orchestration framework should initialize with minimal delay during cold starts, ensuring consistent performance across all Lambda invocations, including the first execution.

Current Behaviour

When a new Lambda instance is created, the orchestration framework incurs a significant initialization delay of 5-7 seconds. This delay occurs before any classification or other processing steps, resulting in a total latency of 7-9 seconds for the first invocation.

However, subsequent invocations on the same Lambda instance execute within 1-2 seconds, indicating the issue is specific to cold start initialization. This behavior impacts the overall performance and user experience during the first execution.

Note: Testing with provisioned concurrency (set to 5) does keep 5 instances of the Lambda function warm; however, the issue persists whenever execution shifts to a new Lambda instance outside of the provisioned pool, where the orchestration framework initialization again takes 5-7 seconds.

Code snippet

Logs for the initial invocation:

INIT_START Runtime Version: python:3.12.v38	Runtime Version ARN: arn:aws:lambda:us-east-1::runtime:7515e00d6763496e7a147ffa395ef5b0f0c1ffd6064130abb5ecde5a6d630e86
START RequestId: 29a54971-892b-4748-affc-2e4e9a6139db Version: 96
[INFO]	2024-12-02T16:55:56.989Z	29a54971-892b-4748-affc-2e4e9a6139db	Received event in Lambda
[INFO]	2024-12-02T16:55:56.990Z	29a54971-892b-4748-affc-2e4e9a6139db	LLM Classification
[INFO]	2024-12-02T16:55:57.513Z	29a54971-892b-4748-affc-2e4e9a6139db	Found credentials in environment variables.
[INFO]	2024-12-02T16:56:04.180Z	29a54971-892b-4748-affc-2e4e9a6139db	
** CLASSIFIED INTENT **


Logs for subsequent invocations on the same Lambda instance:

[INFO]	2024-12-02T16:57:00.674Z	b6b59439-3a0c-48b5-b763-1e0a2c30ca51	Received event: 
[INFO]	2024-12-02T16:57:00.674Z	b6b59439-3a0c-48b5-b763-1e0a2c30ca51	LLM Classification
[INFO]	2024-12-02T16:57:01.311Z	b6b59439-3a0c-48b5-b763-1e0a2c30ca51	
** CLASSIFIED INTENT **

Possible Solution

No response

Steps to Reproduce

  1. Deploy the orchestration framework on AWS Lambda.
  2. Trigger a request that leads to the creation of a new Lambda instance.
  3. Measure the total latency for the first invocation (observe 7-9 seconds).
  4. Measure latency for subsequent invocations on the same instance (observe 1-2 seconds).
  5. Test with provisioned concurrency and observe behavior when execution moves to new instances.
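If it helps to reproduce the pattern locally, here is a minimal stdlib-only sketch (hypothetical handler; the real classification and orchestration work is elided) that separates the one-time init cost from the per-invocation cost the way steps 3-4 measure it:

```python
import time

# Module scope runs once per cold start (the Lambda INIT phase); heavy
# imports such as the orchestration framework would be paid here, not on
# every invocation.
_init_start = time.perf_counter()
# from multi_agent_orchestrator.orchestrator import MultiAgentOrchestrator  # illustrative
INIT_SECONDS = time.perf_counter() - _init_start

_cold = True  # flips to False after the first invocation on this instance


def handler(event, context=None):
    """Hypothetical handler: reports whether this call hit a cold start."""
    global _cold
    started = time.perf_counter()
    # ... classification / orchestration would run here ...
    result = {
        "cold_start": _cold,
        "init_seconds": INIT_SECONDS,
        "handler_seconds": time.perf_counter() - started,
    }
    _cold = False
    return result
```

Invoking the handler twice mimics a cold start followed by a warm invocation on the same instance.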
@karanakatle karanakatle added the bug Something isn't working label Dec 3, 2024
@github-actions github-actions bot added the triage label Dec 3, 2024
@brnaba-aws
Contributor

brnaba-aws commented Dec 3, 2024

Hi @karanakatle ,
thanks for submitting this issue.
There are a couple of things we need to understand first:

  • Can you share the code you used for this Lambda?
  • Have you seen that you can now install multi-agent-orchestrator with only the minimum required dependencies? For instance, if you do not use Anthropic or OpenAI, you can simply install it with pip install multi-agent-orchestrator (available from version 0.1.1). This will save you a bit of init time.
  • You can also check out SnapStart for Python, which was released last week.

Let us know if you need further assistance.
regards,
Anthony

@brnaba-aws
Contributor

@karanakatle any updates on this?

@brnaba-aws
Copy link
Contributor

@karanakatle , I'm about to close this since I didn't hear anything from you. Let me know if you need further assistance.

@karanashokraokatle

karanashokraokatle commented Dec 13, 2024

Hello @brnaba-aws
Sorry for the delayed response. I tried enabling SnapStart and installing only the required dependencies, but the issue persists.
On detailed analysis, this is what I found:

  1. The library import takes 2 seconds every time.
  2. We use a custom classifier, where our fine-tuned model provides a response; it responds in 0.6 to 1 second.
  3. If no agent is selected, it goes to the fallback agent, a Bedrock agent, which takes another 2-3 seconds.

Is there any way to reduce the time consumed by the 1st and 3rd points?

Code is attached for reference.
multi_agent_orchestrator.zip

Cloudwatch logs also attached for analysis
Bot Orchestration Logs.txt

@brnaba-aws
Contributor

For:

  1. Can you provide logs for more than a single invocation, so we can see whether this time is always there or only on the very first invocation?
  2. You can't really improve that unless you drop the default agent by setting USE_DEFAULT_AGENT_IF_NONE_IDENTIFIED=False,
    or use a faster model for the default agent, such as Claude 3 Haiku. I see that your bedrock_llm_agent is using Claude 3.5 Sonnet, which is slow.
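To make the trade-off in point 2 concrete, here is a tiny hypothetical sketch of the routing decision (illustrative names only, not the library's actual API): disabling the fallback trades the extra Bedrock round trip for a fast "no agent" result.

```python
# Illustrative routing logic: with the fallback enabled, an unclassified
# request pays an extra slow Bedrock call; with it disabled, it fails fast.
def route(classified_agent, use_default_if_none_identified: bool):
    if classified_agent is not None:
        return classified_agent            # classifier found a match (0.6-1 s)
    if use_default_if_none_identified:
        return "bedrock_fallback_agent"    # extra 2-3 s Bedrock round trip
    return None                            # fast "no agent" outcome
```

Whether failing fast is acceptable depends on how often the classifier misses; if misses are common, a faster fallback model is the better lever.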

@brnaba-aws
Contributor

@karanakatle , one more thing: I don't think you are using the latest Python version, since I don't see this try/except with Anthropic: new import method

One thing I just noticed is that each Lex bot instantiates its own boto3.client('lexv2-runtime', region_name=self.region). We haven't provided a way to reuse a client passed in as a parameter.
I believe this would help. I'll create an issue and provide you with a file to test, OK?

@karanashokraokatle

Sure @brnaba-aws , thanks for the help.

@brnaba-aws
Contributor

@karanashokraokatle,

could you please try this LexBotAgent?
It accepts a client as an option, so you can create a single client and reuse it across all your Lex bots.

Example:

lex_client = boto3.client('lexv2-runtime', region_name=os.getenv('AWS_REGION','us-east-1'))

my_agent = LexBotAgent(LexBotAgentOptions(client=lex_client, bot_id = '', bot_alias_id='', locale_id=''))
my_agent_2 = LexBotAgent(LexBotAgentOptions(client=lex_client, bot_id = '', bot_alias_id='', locale_id=''))
from typing import Any, Dict, List, Optional
from dataclasses import dataclass
import os

import boto3
from botocore.exceptions import BotoCoreError, ClientError
from multi_agent_orchestrator.agents import Agent, AgentOptions
from multi_agent_orchestrator.types import ConversationMessage, ParticipantRole
from multi_agent_orchestrator.utils import Logger

@dataclass
class LexBotAgentOptions(AgentOptions):
    bot_id: Optional[str] = None
    bot_alias_id: Optional[str] = None
    locale_id: Optional[str] = None
    client: Optional[Any] = None

class LexBotAgent(Agent):
    def __init__(self, options: LexBotAgentOptions):
        super().__init__(options)
        if options.region is None:
            self.region = os.environ.get("AWS_REGION", 'us-east-1')
        else:
            self.region = options.region

        # Reuse a caller-provided client so one boto3 client can be shared
        # across all Lex bot agents; otherwise create a new one.
        if options.client:
            self.lex_client = options.client
        else:
            self.lex_client = boto3.client('lexv2-runtime', region_name=self.region)

        self.bot_id = options.bot_id
        self.bot_alias_id = options.bot_alias_id
        self.locale_id = options.locale_id

        if not all([self.bot_id, self.bot_alias_id, self.locale_id]):
            raise ValueError("bot_id, bot_alias_id, and locale_id are required for LexBotAgent")

    async def process_request(self, input_text: str, user_id: str, session_id: str,
                        chat_history: List[ConversationMessage],
                        additional_params: Optional[Dict[str, str]] = None) -> ConversationMessage:
        try:
            params = {
                'botId': self.bot_id,
                'botAliasId': self.bot_alias_id,
                'localeId': self.locale_id,
                'sessionId': session_id,
                'text': input_text,
                'sessionState': {}  # You might want to maintain session state if needed
            }

            response = self.lex_client.recognize_text(**params)

            concatenated_content = ' '.join(
                message.get('content', '') for message in response.get('messages', [])
                if message.get('content')
            )

            return ConversationMessage(
                role=ParticipantRole.ASSISTANT.value,
                content=[{"text": concatenated_content or "No response from Lex bot."}]
            )

        except (BotoCoreError, ClientError) as error:
            Logger.error(f"Error processing request: {str(error)}")
            raise error
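As a quick local check of the shared-client design (a hand-rolled stub standing in for boto3's 'lexv2-runtime' client; everything here is illustrative), the key point is that several agents hold the same object, so the boto3 client is built once:

```python
# Stub exposing the same recognize_text surface the agent calls, so the
# client-sharing behavior can be exercised without AWS access.
class StubLexClient:
    def __init__(self):
        self.calls = 0

    def recognize_text(self, **params):
        self.calls += 1
        return {"messages": [{"content": f"echo: {params.get('text', '')}"}]}

shared_client = StubLexClient()

# Two agents configured with client=shared_client would both route their
# recognize_text calls through this single instance.
shared_client.recognize_text(text="hello")
shared_client.recognize_text(text="world")
```

Passing a stub like this via LexBotAgentOptions(client=...) is also a convenient way to unit-test process_request without AWS credentials.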
