DNS/HTTP Node Attestor #4788

Closed
kfox1111 opened this issue Jan 10, 2024 · 17 comments · Fixed by #4909
Labels
priority/backlog (Issue is approved and in the backlog), unscoped (The issue needs more design or understanding in order for the work to progress)

Comments

@kfox1111
Contributor

On bare metal nodes without TPMs, it would be very nice if an HTTP/DNS challenge, like the one ACME uses for initial attestation, could be used for bootstrapping, rather than needing to ssh in (and accept an untrusted key) and use a join token. It wouldn't need to be ACME itself, but something that functions similarly.

@evan2645 evan2645 added the triage/in-progress Issue triage is in progress label Jan 11, 2024
@kfox1111
Contributor Author

kfox1111 commented Feb 4, 2024

I'm currently thinking, I start with the x509pop plugin, copy it to a plugin named 'http', then modify it as follows:

For the server plugin, change its Attest function, removing the x509 cert validation bits. Then change the challenge to generate a 'token' as per https://datatracker.ietf.org/doc/html/rfc8555#section-8.3. It is returned to the agent.

The agent would then start a webserver on port 80 (default) or any configured port. (If the port is not 80, something else on the host needs to proxy from 80 to the chosen port.) The agent would serve just "/.well-known/acme-challenge/$token" as per the ACME RFC. The content would be the token.
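A minimal sketch of the agent-side webserver described above, assuming Go's standard net/http package. The names challengeHandler and serveOnce are hypothetical and not taken from the SPIRE code base; the handler serves only the one challenge path and returns 404 for everything else.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// challengePrefix is the path prefix used by the ACME http-01 challenge.
const challengePrefix = "/.well-known/acme-challenge/"

// challengeHandler (hypothetical name) serves the token at exactly one path
// and answers 404 for any other path or method.
func challengeHandler(token string) http.Handler {
	want := challengePrefix + token
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodGet || r.URL.Path != want {
			http.NotFound(w, r)
			return
		}
		w.Header().Set("Content-Type", "text/plain")
		io.WriteString(w, token)
	})
}

// serveOnce runs a single in-memory GET against the handler, for demo use.
func serveOnce(token, path string) (int, string) {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, path, nil)
	challengeHandler(token).ServeHTTP(rec, req)
	return rec.Code, rec.Body.String()
}

func main() {
	code, body := serveOnce("example-token", challengePrefix+"example-token")
	fmt.Println(code, body)
	code, _ = serveOnce("example-token", "/anything-else")
	fmt.Println(code)
}
```

In a real agent the handler would be passed to something like http.ListenAndServe on the configured port and shut down once attestation finishes.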

Once the webserver is started, the agent would tell the server that it's ready, along with its proposed DNS name.
The server plugin would first validate the DNS name against the list of regexes it's configured with, which describes the DNS names it is willing to test. If the name matches, it fetches the document from the agent's webserver and validates that the token matches. If it all passes, it generates a node identity with a selector matching the attested DNS name.
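The server-side check could look roughly like the following sketch. It assumes the agent serves the bare token as the response body; verifyChallenge, handlerTransport, and demo are hypothetical names, and the in-process transport exists only so the example runs without a real listener.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/httptest"
	"strconv"
	"strings"
	"time"
)

// verifyChallenge (hypothetical name) fetches the challenge document from the
// agent and reports an error unless the body equals the expected token.
func verifyChallenge(client *http.Client, host string, port int, token string) error {
	target := "http://" + net.JoinHostPort(host, strconv.Itoa(port)) +
		"/.well-known/acme-challenge/" + token
	resp, err := client.Get(target)
	if err != nil {
		return fmt.Errorf("fetching challenge: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	// Cap the read so a misbehaving agent cannot send an unbounded body.
	body, err := io.ReadAll(io.LimitReader(resp.Body, 1024))
	if err != nil {
		return fmt.Errorf("reading challenge: %w", err)
	}
	got := strings.TrimSpace(string(body))
	if subtle.ConstantTimeCompare([]byte(got), []byte(token)) != 1 {
		return fmt.Errorf("token mismatch")
	}
	return nil
}

// handlerTransport routes requests to an in-process handler (demo only).
type handlerTransport struct{ h http.Handler }

func (t handlerTransport) RoundTrip(r *http.Request) (*http.Response, error) {
	rec := httptest.NewRecorder()
	t.h.ServeHTTP(rec, r)
	return rec.Result(), nil
}

// demo pretends an agent serves `served` and checks it against `expected`.
func demo(served, expected string) bool {
	h := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		io.WriteString(w, served)
	})
	client := &http.Client{Transport: handlerTransport{h}, Timeout: 5 * time.Second}
	return verifyChallenge(client, "foo.example.org", 80, expected) == nil
}

func main() {
	fmt.Println(demo("example-token", "example-token"))
	fmt.Println(demo("something-else", "example-token"))
}
```

The DNS name check against the configured regex list would run before this fetch, so the server never contacts a host it is not willing to test.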

@aaomidi
Contributor

aaomidi commented Feb 7, 2024

So the concern I have with this flow is:

Some assumptions:

  • Client: Agent
  • Server: The aforementioned server plugin
  • Finalize: Similar to ACME's Finalize method, but with a public key and dns name
  • The server is only TLS protected for this attestation, and not mTLS (yet).

The ideal scenario is:

  • Agent reaches out to the server, and says give me a challenge.
  • Agent gets the token from the challenge, and puts up a web resource.
  • Agent reaches out to the server, acknowledging that it's ready to respond to the challenge.
  • Server confirms that the token matches.
  • Agent reaches out to the server with a finalize message.

Now imagine this scenario:

  • Agent reaches out to the server, and says give me a challenge.
  • Agent gets the token from the challenge, and puts up a web resource.
  • Agent reaches out to the server, acknowledging that it's ready to respond to the challenge.
  • Server confirms that the token matches.
  • Bad actor has entered the communication between the two
  • Bad actor reaches out with something like "finalize" with a public key

At this point the bad actor has successfully hijacked the issued identity.

Note: I may be making some wrong assumptions here, that may make this not really a possible attack.

Let me think a bit more about how this would work securely.

@aaomidi
Contributor

aaomidi commented Feb 7, 2024

So, I think the way I can see this being made a bit safer without making a ton of changes to this flow is to use a self-signed mTLS identity for the Client side.

The server would need to be configured to trust any client certificate when establishing an mTLS session for the attestation endpoint.

Once we have that assumption, this DNS/HTTP auth plugin can be designed so that, for the entire lifetime of the challenge, the challenge is scoped to that specific client certificate. E.g., a connection presenting a different certificate cannot interject itself into another challenge-response flow.
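One way to picture the certificate scoping is the sketch below. It is only an illustration of the idea, not something the thread settled on: boundChallenge and its methods are hypothetical, and the certificate bytes stand in for the DER of the peer certificate (in Go's crypto/tls that would be the Raw field of the first entry in ConnectionState.PeerCertificates).

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"fmt"
)

// boundChallenge (hypothetical type) remembers which client certificate
// requested a challenge by storing the SHA-256 digest of its DER encoding.
type boundChallenge struct {
	token  string
	certFP [sha256.Size]byte
}

// newBoundChallenge records the token together with the requester's
// certificate fingerprint.
func newBoundChallenge(token string, certDER []byte) boundChallenge {
	return boundChallenge{token: token, certFP: sha256.Sum256(certDER)}
}

// presentedBy reports whether certDER is the certificate that started the
// challenge. Any later step (for example a "finalize") would call this first.
func (c boundChallenge) presentedBy(certDER []byte) bool {
	fp := sha256.Sum256(certDER)
	return subtle.ConstantTimeCompare(c.certFP[:], fp[:]) == 1
}

func main() {
	// Placeholder byte strings instead of real DER certificates.
	agentCert := []byte("self-signed agent certificate")
	otherCert := []byte("some other certificate")

	ch := newBoundChallenge("example-token", agentCert)
	fmt.Println(ch.presentedBy(agentCert))
	fmt.Println(ch.presentedBy(otherCert))
}
```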

@kfox1111
Contributor Author

kfox1111 commented Feb 7, 2024

Ah. I see... I'll think some more on this too. Thanks. :)

@kfox1111
Contributor Author

kfox1111 commented Feb 7, 2024

Looking at the plugin code, the plugin will only respond to the request from the same client tcp stream (?) so not sure a bad actor can man in the middle that process.

If they can, it looks like there is a piece in the acme protocol meant to handle that:
https://datatracker.ietf.org/doc/html/rfc8555#section-8.3

The initial client request is done with a jwk pair, and the client is expected to put its public fingerprint at the http token url as well. If we did the same, it would also close the loop I think?
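For reference, RFC 8555 section 8.1 defines the value placed at the token URL as the token, a dot, and the base64url-encoded JWK thumbprint (RFC 7638) of the account key. The sketch below computes that for an ECDSA P-256 key; the choice of P-256 and the function names are assumptions for illustration, and nothing here is taken from the plugin.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/sha256"
	"encoding/base64"
	"fmt"
)

// thumbprint computes an RFC 7638 JWK thumbprint (SHA-256) for a P-256 key.
func thumbprint(pub *ecdsa.PublicKey) string {
	enc := base64.RawURLEncoding
	// Coordinates are fixed-width, big-endian, 32 bytes each for P-256.
	x := enc.EncodeToString(pub.X.FillBytes(make([]byte, 32)))
	y := enc.EncodeToString(pub.Y.FillBytes(make([]byte, 32)))
	// Required members in lexicographic order, with no whitespace.
	jwk := fmt.Sprintf(`{"crv":"P-256","kty":"EC","x":"%s","y":"%s"}`, x, y)
	sum := sha256.Sum256([]byte(jwk))
	return enc.EncodeToString(sum[:])
}

// keyAuthorization builds token || "." || base64url(thumbprint), the value an
// ACME client serves for http-01.
func keyAuthorization(token string, pub *ecdsa.PublicKey) string {
	return token + "." + thumbprint(pub)
}

// newKey generates a throwaway P-256 key for the demo.
func newKey() *ecdsa.PrivateKey {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		panic(err)
	}
	return key
}

func main() {
	key := newKey()
	fmt.Println(keyAuthorization("example-token", &key.PublicKey))
}
```

Because the served value depends on the key that made the initial request, a party holding a different key cannot reuse someone else's challenge response.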

@aaomidi
Contributor

aaomidi commented Feb 7, 2024

The initial client request is done with a jwk pair, and the client is expected to put its public fingerprint at the http token url as well. If we did the same, it would also close the loop I think?

Yes ACME gets around this by using the ACME account. I didn't know if you wanted to build an ACME account model here.

Looking at the plugin code, the plugin will only respond to the request from the same client tcp stream (?) so not sure a bad actor can man in the middle that process.

I think as long as the bad actor isn't a layer 7 proxy that you connected to to talk to the server it should be fine.

Note that the layer 7 proxy would have to be the one terminating TLS too, so it'd come down to whether you trust the TLS certificate of the server at that point.

@kfox1111
Contributor Author

kfox1111 commented Feb 7, 2024

Ah, gotcha.

Not sure we need to adopt all of ACME, but there may be some advantage to reusing the bits of their protocol that work, to solve all the same problems? I could go either way though.

@aaomidi
Contributor

aaomidi commented Feb 7, 2024

Honestly, I think if this is scoped to a single TCP connection, and new TCP connections would have to fully restart the flow, you'd solve the majority of my concerns with this.

The only other stipulation being that the server MUST be protected by TLS for this to work properly.

@kfox1111
Contributor Author

kfox1111 commented Feb 7, 2024

Honestly, I think if this is scoped to a single TCP connection, and new TCP connections would have to fully restart the flow, you'd solve the majority of my concerns with this.

I think that is currently true with SPIRE's current plugin model? Anyone we can have double check that assumption?

The only other stipulation being that the server MUST be protected by TLS for this to work properly.

Just to double check, you're referring here to the server plugin hosted out of SPIRE, which is TLS protected?

The temp webserver for the handshaking can be http only?

@aaomidi
Contributor

aaomidi commented Feb 7, 2024

I think that is currently true with SPIRE's current plugin model? Anyone we can have double check that assumption?

If so then I think the initial proposal wouldn't create any concerns.

Just to double check, you're referring here to the server plugin hosted out of SPIRE, which is TLS protected?
The temp webserver for the handshaking can be http only?

Yes & Yes

@amartinezfayo amartinezfayo added priority/backlog Issue is approved and in the backlog unscoped The issue needs more design or understanding in order for the work to progress and removed triage/in-progress Issue triage is in progress labels Feb 8, 2024
@amartinezfayo
Member

Thank you @kfox1111 for bringing this up and thank you @aaomidi for your feedback.
We discussed this in the last maintainers' call, and we think that the absence of an attestor for bare metal nodes without TPMs is a real problem that we want to address in the project.
The solution to this problem will always involve trusting a third party; in the proposed solution it would be DNS. We haven't explored whether there are better options, so we are open to other solutions as well.

@kfox1111 If you think that a DNS/HTTP node attestor is the best option, and in the absence of other proposals, it would be great to make progress on scoping the work that needs to be done for the proposed solution, including some more details about the implementation, configuration and the mechanics of the attestation.
Some of the important aspects that we need to figure out in order to have a clear scope are:

  • Configuration of the plugin in the server and in the agent.
  • Challenge/Response flow diagram.
  • How multiple hosts talking behind the same DNS record are handled.
  • DNSSEC support.
  • Shape of the SPIFFE ID of agents attested by the attestor.
  • Selectors produced by the server plugin.

I'm sure there are other things to figure out also, but finding answers to those items will help a lot to have this scoped.

Thanks again @kfox1111 for bringing this to our attention!

@evan2645
Member

evan2645 commented Feb 8, 2024

I think that is currently true with SPIRE's current plugin model? Anyone we can have double check that assumption?

Yes it is correct. SPIRE server/agent node attestation is a bi-directional gRPC stream. It remains open until node attestation is complete. In the case we're discussing, the server will initiate the challenge check all while the agent is blocked on it, and the server will unblock after success. So the whole process is covered by a single stream lifetime.

Thanks @amartinezfayo for the guidance, I agree answers to those points will help to move the issue out of unscoped. Considering my above comment, as far as the flows go, a starting point can be:
Agent -> configured DNS name -> Server
Agent <- nonce <- Server
(agent binds random port and serves nonce)
Agent -> port number -> Server
(server checks the nonce)
Agent <- success/SVID <- Server
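The nonce in the second step could be generated along these lines. This is a sketch assuming 32 random bytes encoded as unpadded base64url, which comfortably exceeds the 128 bits of entropy RFC 8555 section 8.3 asks of an http-01 token; newNonce is a hypothetical name.

```go
package main

import (
	"crypto/rand"
	"encoding/base64"
	"fmt"
)

// newNonce returns a URL-safe random string suitable for use as a path
// segment. It draws 32 bytes from the operating system's CSPRNG.
func newNonce() (string, error) {
	buf := make([]byte, 32)
	if _, err := rand.Read(buf); err != nil {
		return "", fmt.Errorf("generating nonce: %w", err)
	}
	// RawURLEncoding omits padding, so the result only contains
	// characters from [A-Za-z0-9_-].
	return base64.RawURLEncoding.EncodeToString(buf), nil
}

func main() {
	nonce, err := newNonce()
	if err != nil {
		panic(err)
	}
	fmt.Println(len(nonce), nonce)
}
```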

I'm sure it will change as we get answers to e.g. multiple hosts, configuration (dns server config?) etc.

One nice thing about this attestation type is it's repeatable.

@kfox1111
Contributor Author

kfox1111 commented Feb 9, 2024

Will work on these things, but initial thoughts inline:

Configuration of the plugin in the server and in the agent.

server:
  dns_patterns:   # Optional list of regexes the DNS hostname needs to match. If empty, all DNS entries are allowed. If none match, the request is rejected.
  - <regex>
  - <regex>
agent:
  hostname: # Optional. If unset, use the hostname as detected on the node. If running in a container, this may need to be set explicitly.
  port: 80 # Optional port to listen on. Default is 80, and if not 80, some other webserver on the host needs to port forward to whatever port is chosen here.
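The dns_patterns semantics in the comment above (empty list allows everything, otherwise at least one pattern must match) can be sketched as follows. Anchoring each pattern so it must match the whole hostname is an assumption made here, not something the proposal states, and the function names are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// compilePatterns compiles the configured patterns. ASSUMPTION: each pattern
// is wrapped in ^(?:...)$ so that it has to match the entire hostname.
func compilePatterns(patterns []string) ([]*regexp.Regexp, error) {
	compiled := make([]*regexp.Regexp, 0, len(patterns))
	for _, p := range patterns {
		re, err := regexp.Compile(`^(?:` + p + `)$`)
		if err != nil {
			return nil, fmt.Errorf("invalid dns pattern %q: %w", p, err)
		}
		compiled = append(compiled, re)
	}
	return compiled, nil
}

// hostnameAllowed implements the rule from the config comment: an empty list
// allows every hostname; otherwise at least one pattern must match.
func hostnameAllowed(compiled []*regexp.Regexp, hostname string) bool {
	if len(compiled) == 0 {
		return true
	}
	for _, re := range compiled {
		if re.MatchString(hostname) {
			return true
		}
	}
	return false
}

// allowed is a convenience wrapper for the demo; a bad pattern rejects.
func allowed(patterns []string, hostname string) bool {
	compiled, err := compilePatterns(patterns)
	if err != nil {
		return false
	}
	return hostnameAllowed(compiled, hostname)
}

func main() {
	patterns := []string{`.*\.example\.org`}
	fmt.Println(allowed(patterns, "foo.example.org"))
	fmt.Println(allowed(patterns, "foo.example.com"))
	fmt.Println(allowed(nil, "anything.example.net"))
}
```

Without the anchoring, a pattern such as `.*\.example\.org` would also match a name like foo.example.org.attacker.test, which is why the sketch adds it.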

Challenge/Response flow diagram.

Will work on this. Some potential details discussed above.

I'm thinking of sticking to port 80 from server -> agent for the reasons described in the ACME http-01 documentation. (Short answer: it is one of the most firewall-friendly protocols/ports. Random ports can cause problems for some orgs, and low ports can have extra security too.)

How multiple hosts talking behind the same DNS record are handled.

I think this would not be allowed. Each node that wants to attest needs to have its own DNS entry, and the selector returned is that DNS name, so it uniquely identifies the node. ACME http-01 assumes this as well, I believe.

DNSSEC support.

I think this is transparent for http. The dns entry the server looks up is just a bit more trustworthy.

If there were a pure DNS attestor like the ACME dns-01 challenge, then DNSSEC would help with that, I think. But for the scope of this plugin, I'm thinking of limiting it to HTTP attestation, using DNS just for hostname lookups. So akin to ACME http-01 only.

Shape of the SPIFFE ID of agents attested by the attestor.

spiffe://$trustDomain/spire/agent/http/hostname/foo.example.org

Selectors produced by the server plugin.

http:hostname:foo.example.org
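The two shapes above can be produced by a couple of small helpers, sketched here. Lowercasing the hostname and stripping a trailing dot is an assumption added for illustration (DNS names are case-insensitive), and the helper names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeHostname lowercases the name and drops a trailing root dot.
// ASSUMPTION: the proposal does not say whether names are normalized.
func normalizeHostname(h string) string {
	return strings.ToLower(strings.TrimSuffix(h, "."))
}

// agentID builds the agent SPIFFE ID in the shape proposed in the thread.
func agentID(trustDomain, hostname string) string {
	return "spiffe://" + trustDomain + "/spire/agent/http/hostname/" + normalizeHostname(hostname)
}

// selector builds the selector string in the shape proposed in the thread.
func selector(hostname string) string {
	return "http:hostname:" + normalizeHostname(hostname)
}

func main() {
	fmt.Println(agentID("example.org", "foo.example.org"))
	fmt.Println(selector("foo.example.org"))
}
```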

@kfox1111
Contributor Author

kfox1111 commented Feb 22, 2024

I think the main thing left is finalizing the details of the communication flows?

I was thinking about @evan2645's suggestion of random ports again, and could see an advantage to that when having multiple spire-agents on the same node (used to attest to different spire servers). I also can see some benefits on restricting the port back to port 80 for easier internet traversal.

So maybe it should be configurable on both sides? The agent allows specifying the port to use and passes it to the server, and the server can force the port to always be 80 if the server is intended to traverse the internet? Maybe even default to 80 unless the user overrides?

In that case, the config might be:

server:
  dns_patterns:   # Optional list of regexes the DNS hostname needs to match. If empty, all DNS entries are allowed. If none match, the request is rejected.
  - <regex>
  - <regex>
  allow_alternate_ports: false # Optional flag. Defaults to false. Allow the agent to specify what port to use. Otherwise, it must be port 80.
agent:
  hostname: # Optional. If unset, use the hostname as detected on the node. If running in a container, this may need to be set explicitly.
  port: 80 # Optional port to listen on. Default is 80, and if not 80, some other webserver on the host needs to port forward to whatever port is chosen here.
  advertised_port: 80 # Optional port to tell the spire-server to use for contact. Defaults to port 80. Used along with the spire-server setting allow_alternate_ports=true
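How the server might pick the port to contact under this config is sketched below. It takes the "force override" reading: when allow_alternate_ports is false, the server ignores whatever the agent advertises and uses 80. Rejecting a non-80 advertised port instead would be an equally plausible reading. contactPort is a hypothetical name.

```go
package main

import "fmt"

// defaultPort is the port used when alternate ports are not allowed or the
// agent did not advertise one.
const defaultPort = 80

// contactPort returns the port the server should use to reach the agent.
// ASSUMPTION: with allowAlternatePorts=false the advertised port is ignored
// rather than treated as an error.
func contactPort(allowAlternatePorts bool, advertisedPort int) (int, error) {
	if !allowAlternatePorts || advertisedPort == 0 {
		return defaultPort, nil
	}
	if advertisedPort < 1 || advertisedPort > 65535 {
		return 0, fmt.Errorf("advertised port %d out of range", advertisedPort)
	}
	return advertisedPort, nil
}

func main() {
	fmt.Println(contactPort(false, 8080))
	fmt.Println(contactPort(true, 8080))
	fmt.Println(contactPort(true, 0))
}
```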

@kfox1111 kfox1111 mentioned this issue Feb 23, 2024
3 tasks
@kfox1111
Contributor Author

kfox1111 commented Feb 23, 2024

Started to work up the documentation around this. #4909

And scaffolded a bit based on the x509pop plugins.

@kfox1111
Contributor Author

hmm.... should the plugin be named 'http' or 'httppop'?

@kfox1111
Contributor Author

The PR has reached the level of a workable prototype. It seems to attest, and when I set agent_ttl to something very small, it seems to reattest OK too.

It has very little error checking and no testing at the moment. Once we work through all the details, then those things can be added.
