Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x-pack/filebeat/input/http_endpoint: make input GA #39410

Merged
merged 13 commits into from
May 17, 2024

Conversation

efd6
Copy link
Contributor

@efd6 efd6 commented May 6, 2024

Proposed commit message

See title.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Disruptive User Impact

Author's Checklist

  • None of the linter suggestions have any merit.

The timeout ACK behaviour is tested manually using elastic-package using the the version here injected into the stack.

$ curl -X POST http://localhost:8080/ -d '{"data":1}' -H 'Content-Type: application/json'
{"message":"ok!"}
$ curl -X POST http://localhost:8080/?wait_for_completion_timeout=1m -d '{"data":1}' -H 'Content-Type: application/json'
{"message":"ok!"}
$ curl -X POST http://localhost:8080/?wait_for_completion_timeout=1s -d '{"data":1}' -H 'Content-Type: application/json'
{"message":"could not publish event within timeout"}

The first instance without a timeout returned immediately, the second took ~5s and the last ~2s. These are consistent with expected behaviour.

How to test this PR locally

Related issues

Use cases

Screenshots

Logs

@efd6 efd6 self-assigned this May 6, 2024
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label May 6, 2024
Copy link
Contributor

mergify bot commented May 6, 2024

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @efd6? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit

@efd6 efd6 added Filebeat Filebeat backport-skip Skip notification from the automated backport with mergify Team:Security-Service Integrations Security Service Integrations Team enhancement and removed needs_team Indicates that the issue/PR needs a Team:* label labels May 6, 2024
@elasticmachine
Copy link
Collaborator

elasticmachine commented May 6, 2024

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Duration: 137 min 30 sec

❕ Flaky test report

No test was executed to be analysed.

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@efd6 efd6 marked this pull request as ready for review May 8, 2024 00:38
@efd6 efd6 requested review from a team as code owners May 8, 2024 00:38
@efd6 efd6 requested review from AndersonQ and rdner May 8, 2024 00:38
@elasticmachine
Copy link
Collaborator

Pinging @elastic/security-service-integrations (Team:Security-Service Integrations)

@efd6 efd6 marked this pull request as draft May 8, 2024 00:38
@efd6 efd6 marked this pull request as ready for review May 8, 2024 04:10
}

// Ready signals that the batch has been fully consumed. Only
// after the batch is marked as "ready" can the lumberjack batch
Copy link
Member

@andrewkroh andrewkroh May 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lumberjack batch

@@ -124,7 +128,8 @@ func (h *handler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
}
}

if err = h.publishEvent(obj, headers); err != nil {
acker.Add()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With end-to-end acknowledgements the goal is to be able to provide a delivery guarantee that the event(s) was written to the destination output. So I think this needs to relay the feedback to the caller that the event was written to the output.

I envision this meaning that a request blocks until the input receives all of the ACKs that it expects. This leads to a few follow-up behavioral questions:

  1. Introducing blocking behavior could be a viewed as a breaking change. Perhaps the waiting behavior should be opt-in and have a timeout like via ?wait_for_completion_timeout=30.
  2. Should there be any "circuit breakers" and how should the API respond when it overloaded (e.g. receiving more requests or more events than it can process in a timely manner)? We want to avoid a DoS situation if we have some high volume clients and an output that is slow or down. (This one is relevant even without E2E ack'ing.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered that, but it means that we are keeping connections open while some (unrelated) internal process may block (I think that this is partially your point in 2.).

I think that this is a dangerous thing to do in the general case, but I also think that it would be reasonable to do as an opt-in that the user can choose in the configuration. This has the happy consequence that this would non-breaking. Note that I had not read 1. before I wrote this, so these are independent discoveries (honest). Looking at 1., I think that having a timeout would be sensible. So, suggest configuring based on the presence of a timeout; if present, then the request will block until the ACK or the timeout (and add a timeout failing metric), if absent do not block. This intentionally does not add a block without timeout (wait forever) since it does not make any sense in the real world.

WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds good to me.

@@ -391,17 +402,21 @@ func newInputMetrics(id string) *inputMetrics {
apiErrors: monitoring.NewUint(reg, "api_errors_total"),
batchesReceived: monitoring.NewUint(reg, "batches_received_total"),
batchesPublished: monitoring.NewUint(reg, "batches_published_total"),
batchesACKedTotal: monitoring.NewUint(reg, "batches_acked_total"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reminder to add the new metrics to the table in the docs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just noticed that the lumberjack input does not have any docs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comment under Author's Notes at #32175 (comment). Whether or not to formally advertise it is something I still debate. Feedback welcomed. It is only used and exposed as part of the barracuda integration at the moment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that makes perfect sense.

Copy link
Contributor

mergify bot commented May 8, 2024

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b 38979-http_endpoint upstream/38979-http_endpoint
git merge upstream/main
git push upstream 38979-http_endpoint

Copy link
Contributor

mergify bot commented May 9, 2024

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b 38979-http_endpoint upstream/38979-http_endpoint
git merge upstream/main
git push upstream 38979-http_endpoint

@efd6 efd6 force-pushed the 38979-http_endpoint branch 3 times, most recently from ef8bd51 to 6ec5edd Compare May 9, 2024 22:47
@pierrehilbert pierrehilbert added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label May 10, 2024
@elasticmachine
Copy link
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

Comment on lines +199 to +204
var k string
for k = range q {
break
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That confused me a little bit, the for loop is there to easily extract the key from the map q, is that correct?

Could you add a little comment to make it clear why using a for loop on a map of length 1?

Copy link
Contributor Author

@efd6 efd6 May 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have thought this was idiomatic, but I will add a comment.

Rationale for inclusion of break over iteration over all one keys is https://godbolt.org/z/z95q8K77f.

@@ -64,18 +69,40 @@ type handler struct {
}

func (h *handler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
txID := h.nextTxID()
h.log.Debugw("request", "url", r.URL, "tx_id", txID)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm worried about debug logs in hot paths like that. If there is a high throughput of incoming requests and a user turns on debug logging, this can flood the logs.

It's not a blocker, I just wanted to raise my concern and let you decide what is best for this input.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm OK with this.

Copy link
Contributor

mergify bot commented May 15, 2024

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b 38979-http_endpoint upstream/38979-http_endpoint
git merge upstream/main
git push upstream 38979-http_endpoint

The endpoint will enforce end-to-end ACK when a URL query parameter
`wait_for_completion_timeout` with a duration is provided. For example
`http://localhost:8080/?wait_for_completion_timeout=1m` will wait up
to 1min for the event to be published to the cluster and then return
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
to 1min for the event to be published to the cluster and then return
to 1 minute for the event to be published to the cluster and then return

`http://localhost:8080/?wait_for_completion_timeout=1m` will wait up
to 1min for the event to be published to the cluster and then return
the user-defined response message. In the case that the publication
does not happen within the timeout duration, the HTTP response will
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
does not happen within the timeout duration, the HTTP response will
does not complete within the timeout duration, the HTTP response will

Copy link
Contributor

mergify bot commented May 17, 2024

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b 38979-http_endpoint upstream/38979-http_endpoint
git merge upstream/main
git push upstream 38979-http_endpoint

@efd6 efd6 merged commit 6979849 into elastic:main May 17, 2024
17 of 19 checks passed
efd6 added a commit to efd6/beats that referenced this pull request May 22, 2024
…ames

This is a partial backport of elastic#39410 to fix the missing replacement of
"*" with the input ID.
efd6 added a commit to efd6/beats that referenced this pull request May 22, 2024
…ames

This is a partial backport of elastic#39410 to fix the missing replacement of
"*" with the input ID.
efd6 added a commit that referenced this pull request May 22, 2024
…ames (#39656)

This is a partial backport of #39410 to fix the missing replacement of
"*" with the input ID.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-skip Skip notification from the automated backport with mergify enhancement Filebeat Filebeat Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team Team:Security-Service Integrations Security Service Integrations Team
Projects
None yet
5 participants