<?xml version="1.0" encoding="utf-8" ?>
<!--
# \\ SPIKE: Secure your secrets with SPIFFE.
# \\\\\ Copyright 2024-present SPIKE contributors.
# \\\\\\\ SPDX-License-Identifier: Apache-2.0
-->
<stuff>
<purpose>
<target>Our goal is to have a minimally delightful product.</target>
<target>Strive not to add features just for the sake of adding features.
</target>
<target>Half-assed features shall be completed before adding more
features.
</target>
</purpose>
<next>
<issue>
Deploy Keycloak and make sure you can initialize it.
</issue>
<issue>
<task>The system should work with three keepers by default.</task>
<task>
NEW INIT WORKFLOW
Each keeper starts in NOT_READY state and:
- Has a unique ID
- Knows about all other keepers (for mTLS)
- Has configuration for total keepers and threshold
Key generation protocol:
- Each keeper generates its own random contribution
- They share their contribution with all other keepers
- Each keeper collects all contributions
Once all contributions are received, each keeper:
- XORs all contributions to create the final root key
- Uses Shamir's Secret Sharing to split it
- Keeps its own shard
- Transitions to READY state
- Securely erases all contributions and the computed root key
KEEPER CRASH RECOVERY
- Keeper requests root key from Nexus via mTLS
- Recomputes all shards using same configuration
- Keeps its shard, wipes everything else
- Returns to READY state
The state transitions are:
NOT_READY -> CONTRIBUTING -> READY
Key security properties:
No single keeper ever has the complete root key
The final key is at least as random as the most random individual contribution
Each keeper independently computes the same shards
After initialization, only the shards remain in memory
FOR SIMPLICITY WE'LL HARD-CODE 3 keepers and a threshold of 2 shards;
we will make it configurable later.
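A minimal sketch of the contribution/split step above, assuming a Shamir
library with a Split(secret, parts, threshold) call such as
github.com/hashicorp/vault/shamir; the package name, constants, and helper
names are placeholders, not the final implementation:

package keeper

import (
  "crypto/rand"

  "github.com/hashicorp/vault/shamir"
)

const (
  keeperCount = 3  // hard-coded for now
  threshold   = 2  // minimum shards needed to reconstruct
  keySize     = 32 // bytes
)

// newContribution returns this keeper's random contribution to the root key.
func newContribution() ([]byte, error) {
  c := make([]byte, keySize)
  _, err := rand.Read(c)
  return c, err
}

// computeShard XORs every keeper's contribution (each keySize bytes) into
// the final root key, splits it with Shamir's Secret Sharing, and returns
// only this keeper's shard. The caller must securely erase the
// contributions and the root key afterwards.
func computeShard(contributions [][]byte, selfIndex int) ([]byte, error) {
  rootKey := make([]byte, keySize)
  for _, c := range contributions {
    for i := range rootKey {
      rootKey[i] ^= c[i]
    }
  }

  shards, err := shamir.Split(rootKey, keeperCount, threshold)
  if err != nil {
    return nil, err
  }
  return shards[selfIndex], nil
}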
--
The above flow means we won't need `spike init` command anymore.
Nexus' states will automatically transition:
Initial State: PENDING
- Nexus starts up
- Begins collecting shards from keepers
- Cannot serve requests yet
Shard Collection:
- Keepers provide their shards via mTLS
- Nexus stores shards temporarily
- After each shard addition, checks if threshold is met
- If a keeper is unreachable, we'll try the next one.
- If, after one full loop, there still aren't enough shards, we back off and retry.
- If reconstruction from the shards fails, we back off and retry.
Transition to READY:
Once the threshold number of shards is collected
- Reconstructs root key using Shamir's combine
- Clears individual shards from memory
- Can now serve API requests
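A rough sketch of this collection loop, under the same Shamir-library
assumption as above; the fetch callback and the backoff bounds are
placeholders:

package nexus

import (
  "time"

  "github.com/hashicorp/vault/shamir"
)

// collectRootKey loops over the keepers, gathering shards until the
// threshold is met, then reconstructs the root key. fetch is whatever
// function asks a single keeper for its shard over mTLS.
func collectRootKey(keepers []string, threshold int,
  fetch func(keeper string) ([]byte, error)) []byte {

  backoff := time.Second
  for {
    var shards [][]byte
    for _, k := range keepers {
      shard, err := fetch(k)
      if err != nil || len(shard) == 0 {
        continue // unreachable or empty shard: try the next keeper
      }
      shards = append(shards, shard)
      if len(shards) >= threshold {
        break
      }
    }

    // Enough shards: try to reconstruct; on failure, back off and retry.
    if len(shards) >= threshold {
      if rootKey, err := shamir.Combine(shards); err == nil {
        return rootKey // Nexus can now transition to READY
      }
    }

    time.Sleep(backoff)
    backoff *= 2
    if backoff > time.Minute {
      backoff = time.Minute
    }
  }
}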
SUPERADMIN BREAK-THE-GLASS EMERGENCY BACKUP
Security features:
- Requires super admin SVID for authorization
- Encrypts backup with AES-GCM using the provided passphrase
- Creates an audit log entry of backup creation
- Generates unique backup IDs for tracking
Emergency backup flow:
- Creates two shards (requires both for reconstruction)
- Encrypts them in a JSON structure with metadata
- Super admin gets an encrypted blob
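A minimal sketch of one way the backup encryption above could look, using
PBKDF2 to derive a key from the passphrase and AES-GCM to seal the shards;
the blob layout and KDF parameters are illustrative assumptions, not a
decided format (recovery would reverse this with gcm.Open):

package backup

import (
  "crypto/aes"
  "crypto/cipher"
  "crypto/rand"
  "crypto/sha256"
  "encoding/json"

  "golang.org/x/crypto/pbkdf2"
)

// Blob is an illustrative JSON structure for the encrypted emergency backup.
type Blob struct {
  BackupID   string `json:"backupId"`
  Salt       []byte `json:"salt"`
  Nonce      []byte `json:"nonce"`
  Ciphertext []byte `json:"ciphertext"` // the two encrypted shards + metadata
}

// encryptBackup seals the serialized shards with AES-GCM, deriving the key
// from the super admin's passphrase via PBKDF2 (one possible KDF choice).
func encryptBackup(backupID, passphrase string, shards [][]byte) (*Blob, error) {
  plaintext, err := json.Marshal(shards)
  if err != nil {
    return nil, err
  }

  salt := make([]byte, 16)
  if _, err := rand.Read(salt); err != nil {
    return nil, err
  }
  key := pbkdf2.Key([]byte(passphrase), salt, 600_000, 32, sha256.New)

  block, err := aes.NewCipher(key)
  if err != nil {
    return nil, err
  }
  gcm, err := cipher.NewGCM(block)
  if err != nil {
    return nil, err
  }
  nonce := make([]byte, gcm.NonceSize())
  if _, err := rand.Read(nonce); err != nil {
    return nil, err
  }

  return &Blob{
    BackupID:   backupID,
    Salt:       salt,
    Nonce:      nonce,
    Ciphertext: gcm.Seal(nil, nonce, plaintext, nil),
  }, nil
}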
SUPERADMIN EMERGENCY RECOVERY
- In Pilot CLI
- Super admin provides the passphrase
- Super admin provides encrypted shard #1
- Super admin provides encrypted shard #2
</task>
<task>Nexus will be NOT_READY until it collects enough shards to create the root key</task>
<task>
Nexus will have a background process to reconstruct the root key in memory.
Until the root key is constructed, any call other than `spike init` will
result in a "please initialize SPIKE first" warning.
After the init workflow is done, Nexus will set a special
tombstone secret to indicate that it is initialized.
The former initialization logic will be discarded.
Nexus Status:
pending: If it cannot connect to keepers.
message: "make sure keepers are up and Nexus is configured to talk to them"
uninitialized: all keepers connected so far return an empty shard
message: "please `spike init` to initialize spike"
initializing: at least one keeper returns a shard, but not enough to create the root key
message: "system initializing, please wait"
initialized: root key fully constructed; no need to use keepers for now
message: "OK"
---
Case: Nexus crashes
uninitialized/null | initializing | ready |
</task>
</issue>
</next>
<low-hanging-fruits>
<issue>
Multiple Keeper instances will be required for fan-in/fan-out of the
shards.
Configure the current system to work with multiple keepers.
The demo setup should initialize 3 keepers by default.
The demo setup should use SQLite as the backing store by default.
</issue>
<issue>
Install Keycloak locally and experiment with it.
This is required for the "named admin" feature.
</issue>
<issue>
One-way token flow:
Keeper provides the root key to Nexus;
Nexus init pushes the root key to Keeper.
That's it.
</issue>
<issue>
If SPIKE is not initialized, `spike` or `spike --help` should display
a reminder to initialize SPIKE first and exit.
</issue>
<issue>
The paths that we set in `get`, `put`, etc. should look like Unix paths;
they will require sanitization!
Check how other secret stores manage those paths.
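A possible sanitization sketch using only the Go standard library; the
exact rules (allowed characters, depth limits) are assumptions to refine:

package validation

import (
  "errors"
  "path"
  "strings"
)

// sanitizeSecretPath normalizes a user-supplied secret path so it looks
// like a clean, absolute Unix path and cannot escape the store's namespace.
func sanitizeSecretPath(p string) (string, error) {
  if strings.TrimSpace(p) == "" {
    return "", errors.New("empty path")
  }
  if !strings.HasPrefix(p, "/") {
    p = "/" + p
  }
  // path.Clean collapses "//", ".", and ".." segments; on a rooted path no
  // ".." can survive, so the check below is just defense in depth.
  cleaned := path.Clean(p)
  if strings.Contains(cleaned, "..") {
    return "", errors.New("path traversal is not allowed")
  }
  return cleaned, nil
}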
</issue>
<issue>
Make sure that everything sanitizable is properly sanitized.
</issue>
<issue>
Read policies from a YAML or JSON file and create them.
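A minimal sketch, assuming gopkg.in/yaml.v3 and an illustrative policy
shape (the real schema is not defined in this note):

package policy

import (
  "os"

  "gopkg.in/yaml.v3"
)

// Policy is an illustrative shape; the real schema is still to be decided.
type Policy struct {
  Name            string   `yaml:"name"`
  SpiffeIDPattern string   `yaml:"spiffeIdPattern"`
  PathPattern     string   `yaml:"pathPattern"`
  Permissions     []string `yaml:"permissions"`
}

// loadPolicies reads a YAML file that contains a list of policies.
func loadPolicies(file string) ([]Policy, error) {
  raw, err := os.ReadFile(file)
  if err != nil {
    return nil, err
  }
  var policies []Policy
  if err := yaml.Unmarshal(raw, &policies); err != nil {
    return nil, err
  }
  return policies, nil
}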
</issue>
<issue>
Have SQLite as the default backing store
(until we implement the S3 backing store).
</issue>
<issue>
Fix commented out tests.
</issue>
</low-hanging-fruits>
<later>
<issue>
Set SQLite on by default and make sure everything works.
</issue>
<issue>
volkan@spike:~/Desktop/WORKSPACE/spike$ spike secret get /db
Error reading secret: post: Problem connecting to peer
^ I get an error instead of a "secret not found" message.
</issue>
<issue>
This is from SecretReadResponse, so maybe its entity should be somewhere
common too:
return &data.Secret{Data: res.Data}, nil
</issue>
<issue>
These may come from the environment:
DataDir: ".data",
DatabaseFile: "spike.db",
JournalMode: "WAL",
BusyTimeoutMs: 5000,
MaxOpenConns: 10,
MaxIdleConns: 5,
ConnMaxLifetime: time.Hour,
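A sketch of reading these from the environment while keeping the values
above as defaults; the SPIKE_* variable names are placeholders, not an
agreed naming scheme:

package config

import (
  "os"
  "strconv"
  "time"
)

type SQLiteConfig struct {
  DataDir         string
  DatabaseFile    string
  JournalMode     string
  BusyTimeoutMs   int
  MaxOpenConns    int
  MaxIdleConns    int
  ConnMaxLifetime time.Duration
}

func envStr(key, def string) string {
  if v := os.Getenv(key); v != "" {
    return v
  }
  return def
}

func envInt(key string, def int) int {
  if v := os.Getenv(key); v != "" {
    if n, err := strconv.Atoi(v); err == nil {
      return n
    }
  }
  return def
}

// fromEnv falls back to the current hard-coded values when a variable is
// not set.
func fromEnv() SQLiteConfig {
  return SQLiteConfig{
    DataDir:         envStr("SPIKE_NEXUS_DATA_DIR", ".data"),
    DatabaseFile:    envStr("SPIKE_NEXUS_DB_FILE", "spike.db"),
    JournalMode:     envStr("SPIKE_NEXUS_DB_JOURNAL_MODE", "WAL"),
    BusyTimeoutMs:   envInt("SPIKE_NEXUS_DB_BUSY_TIMEOUT_MS", 5000),
    MaxOpenConns:    envInt("SPIKE_NEXUS_DB_MAX_OPEN_CONNS", 10),
    MaxIdleConns:    envInt("SPIKE_NEXUS_DB_MAX_IDLE_CONNS", 5),
    ConnMaxLifetime: time.Hour,
  }
}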
</issue>
</later>
<reserved>
<issue waitingFor="shamir-to-be-implemented-first">
Use case: Shamir
1. `spike init` verifies that there are 3 healthy keeper instances.
It creates 3 Shamir shares (2 of which will be enough to
reassemble the root key) and sends one share to each keeper.
2. SPIKE Nexus regularly polls all keepers; if it can assemble the secret,
all good.
3. `spike init` will also save 2 of the 3 shares in
`~/.spike/recovery/*`.
The admin will be "highly encouraged" to delete those from the machine and
securely back up the keys, distribute them to separate people, etc.
[2 and 3 are configurable]
</issue>
<issue waitingFor="shamir-to-be-implemented">
<workflow>
1. `spike init` initializes keeper(s). From that point on, SPIKE Nexus
pulls the root key whenever it needs it.
2. Nexus and Keeper can use e2e encryption with one-time key pairs
to have forward secrecy and defend the transport in the VERY unlikely
case of a SPIFFE mTLS breach (see the sketch after this list).
3. Ability for Nexus to talk to multiple keepers.
4. Ability for a Keeper to talk to Nexus to recover its root key if it
loses it.
5. Ability for Nexus to talk to and initialize multiple keepers.
(Phase 1: all keepers share the same key.)
6. `spike init` saves its shards (2 out of 3, or similar) to
`~/.spike/recovery/*`.
The admin will be "highly encouraged" to delete those from the machine
and to securely back up the keys and distribute them to separate people,
etc.
`spike init` will also save the primary key used in Shamir's secret
sharing to `~/.spike/recovery/*` (this is not as sensitive as the root
key, but it should still be kept safe).
- It is important to note that, without the recovery material, your only
option to restore the root key relies on at least the threshold number
of keepers remaining operational at all times -- which is a good enough
bet anyway (say 5 keepers across 3 AZs, and you need only 2 to recover
the root key; then it is extremely unlikely for all of them to go down
at the same time). So in an ideal scenario you save your recovery
material in a secure, encrypted enclave and never ever use it.
7. `spike recover` will reset a keeper cluster by using the recovery
material. `spike recover` will also recover the root key.
To use `spike recover` you will need a special SVID (even a super admin
could not use it without prior authorization).
The SVID that can execute `spike recover` will not be able to execute
anything else.
8. At phase zero, `spike recover` will just save the root key to disk,
also mentioning that this is not secure and that the key should be
stored safely and then wiped from the disk.
9. Maybe double-encrypt Keeper-Nexus communication with one-time key
pairs, because the root key is very sensitive and we want to make sure
it stays secure even if the SPIFFE mTLS is compromised.
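A rough sketch of items 2 and 9 using ephemeral X25519 key agreement from
Go's standard library (crypto/ecdh, Go 1.20+); how the peers exchange
public keys and which AEAD wraps the payload are left open:

package transport

import (
  "crypto/ecdh"
  "crypto/rand"
)

// ephemeralKeyPair generates a one-time X25519 key pair for a single
// Nexus/Keeper exchange; it is discarded afterwards for forward secrecy.
func ephemeralKeyPair() (*ecdh.PrivateKey, error) {
  return ecdh.X25519().GenerateKey(rand.Reader)
}

// sharedSecret derives the symmetric key material for this exchange from
// our ephemeral private key and the peer's ephemeral public key. The
// result would normally go through a KDF before being used with, say,
// AES-GCM on top of the mTLS channel.
func sharedSecret(priv *ecdh.PrivateKey, peerPub *ecdh.PublicKey) ([]byte, error) {
  return priv.ECDH(peerPub)
}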
</workflow>
<details>
Say the user sets up 5 keeper instances.
In Nexus, we have a config:
keepers:
- nodes: [n1, n2, n3, n4, n5]
Nexus can reach out with its own SPIFFE ID to each node in the list. It
can call the assembly lib with whatever secrets it gets back, as it gets
them back, and so long as it gets enough, "it just works".
Recovery could even be: users keep a copy of some of the keepers'
secrets. They rebuild a secret server and load that piece back in. Nexus
can then recover.
That API could also allow for backup configurations.
</details>
<docs>
WAITINGFOR: shamir to be implemented
To documentation (Disaster Recovery)
Is it like:
Keepers have 3 shares.
I get one share;
you get one share.
We keep our shares secure.
None of us alone can assemble a keeper cluster.
But the two of us can join forces and do an awesome DR at 3 AM if needed?
Or, if you're not that paranoid, you can keep both shares on one
thumbdrive, or two shares on two different thumbdrives in two different
safes, and rebuild.
It gives a lot of options on just how secure you want to try to make
things versus how painful it is to recover.
</docs>
</issue>
<issue waitingFor="shamir-to-be-implemented">
func RouteInit(
  w http.ResponseWriter, r *http.Request, audit *log.AuditEntry,
) error {
  // This flow will change after implementing Shamir Secret Sharing:
  // `init` will ensure there are enough keepers connected, and then
  // initialize the keeper instances.
  //
  // We will NOT need the encrypted root key; instead, an admin user will
  // fetch enough shards to back up. The admin will need to provide some
  // sort of key or password to get the data in encrypted form.
  return nil // placeholder; the current init logic will be replaced
}
</issue>
</reserved>
<immediate-backlog>
</immediate-backlog>
<runner-up>
<issue>
Double-encryption of Nexus-Keeper comms (in case mTLS gets compromised, or
SPIRE is configured to use an upstream authority that is compromised, this
will provide end-to-end encryption and an additional layer of security
over the existing PKI).
</issue>
<issue>
Minimally Delightful Product Requirements:
- A containerized SPIKE deployment
- A Kubernetes SPIKE deployment
- Minimal policy enforcement
- Minimal integration tests
- A demo workload that uses SPIKE to test things out as a consumer.
- A golang SDK (we can start at github/zerotohero-dev/spike-sdk-go
and then move it under spiffe once it matures)
</issue>
<issue>
Kubernetification
</issue>
<issue>
v.1.0.0 Requirements:
- Having S3 as a backing store
</issue>
<issue>
Consider a health check / heartbeat between Nexus and Keeper.
This can be more frequent than the root key sync interval.
</issue>
<issue>
Unit tests and coverage reports.
Create a solid integration test suite beforehand.
</issue>
<issue>
Test automation.
</issue>
<issue>
Assigning secrets to SPIFFE IDs or SPIFFE ID prefixes.
</issue>
</runner-up>
<backlog>
<issue kind="v1.0-requirement">
- Run SPIKE in Kubernetes too.
</issue>
<issue kind="v1.0-requirement">
- Postgres support as a backing store.
</issue>
<issue kind="v1.0-requirement">
- Ability to channel audit logs to a log aggregator.
</issue>
<issue kind="v1.0-requirement">
- OIDC integration: Ability to connect to an identity provider.
</issue>
<issue kind="v1.0-requirement">
- ESO (External Secrets Operator) integration
</issue>
<issue kind="v1.0-requirement">
- An ADMIN UI (linked to OIDC probably)
</issue>
<issue kind="v1.0-requirement">
- Ability to use the RESTful API without needing an SDK.
That could be hard though since we rely on SPIFFE authentication and
SPIFFE workload API to gather certs: We can use a tool to automate that
part. But it's not that hard either if I know where my certs are:
`curl --cert /path/to/svid_cert.pem --key /path/to/svid_key.pem
https://mtls.example.com/resource`
</issue>
<issue kind="v1.0-requirement">
> 80% unit test coverage
</issue>
<issue kind="v1.0-requirement">
Fuzzing for the user-facing API
</issue>
<issue kind="v1.0-requirement">
100% integration test coverage (all features will have automated
integration tests in all possible environments)
</issue>
<issue>
By design, we regard memory as the source of truth.
This means that the backing store might miss some secrets.
Find ways to reduce the likelihood of this happening:
1. Implement exponential retries (see the sketch below).
2. Implement a health check to ensure the backing store is up.
3. Create background jobs to sync the backing store.
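A small sketch of item 1 as a generic helper; the attempt count and delays
are arbitrary, and the backing-store write itself would be passed in as
the op callback:

package retry

import "time"

// withExponentialBackoff retries op until it succeeds or attempts run out,
// doubling the wait between tries.
func withExponentialBackoff(attempts int, initial time.Duration,
  op func() error) error {

  delay := initial
  var err error
  for remaining := attempts; remaining > 0; remaining-- {
    if err = op(); err == nil {
      return nil
    }
    time.Sleep(delay)
    delay *= 2
  }
  return err
}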
</issue>
<issue>
Test the db backing store.
</issue>
<issue>
Ability to add custom metadata to secrets.
</issue>
<issue>
We need use cases on the website:
- Policy-based access control for workloads
- Secret CRUD operations
- etc.
</issue>
<issue>
Fleet management:
- There is a management plane cluster
- There is a control plane cluster
- There are workload clusters connected to the control plane
- All of those are their own trust domains.
- There is MP-CP connectivity
- There is CP-WL connectivity
- MP has a central secrets store
- WL and CP need secrets
- Securely dispatch them without "ever" using Kubernetes secrets.
- Have an alternative that uses ESO and a restricted secrets namespace
that no one other than SPIKE components can see into.
</issue>
<issue>
To docs:
How do we manage the root key?
I.e., it never leaves memory, and we keep it alive via replication.
</issue>
<issue>
API for SPIKE Nexus to save its shard, encrypted with a passphrase, for
emergency backup.
This will be optional, and the admin will be advised to save it securely
outside the machine.
(Requires Shamir secret sharing to be implemented.)
</issue>
<issue>
PostgreSQL support for the backing store.
</issue>
<issue>
Maybe a default auditor SPIFFE ID that can only read stuff (for Pilot;
not for named admins; named admins will use the policy system instead).
</issue>
<issue>
Optionally skip creating tables and other DDL during backing store creation.
</issue>
<issue>
What if a keeper instance crashes and goes back up?
If there is an "initialized" Nexus, the keeper can hint Nexus to send its
share again.
</issue>
<issue>
Think about DR scenarios.
</issue>
<issue>
SPIKE Pilot to ingest policy YAML file(s) to create policies
(similar to kubectl).
</issue>
<issue>
- SPIKE Keep Sanity Tests
- Ensure that the root key is stored in SPIKE Keep's memory.
- Ensure that SPIKE Keep can return the root key back to SPIKE Nexus.
</issue>
<issue>
Demo: root key recovery.
</issue>
<issue>
If there is a backing store, load all secrets from the backing store
after a crash, which will also populate the key list.
After recovery, all secrets will be there and the system will be
operational.
After recovery, the admin will lose their session and will need to
log in again.
</issue>
<issue>
Test edge cases:
* call api method w/o token.
* call api method w/ invalid token.
* call api method w/o initializing the nexus.
* call init twice.
* call login with bad password.
^ all these cases should return meaningful errors and
the user should be informed of what went wrong.
</issue>
<issue>
Try SPIKE on a Mac.
</issue>
<issue>
Try SPIKE on an x86 Linux.
</issue>
<issue>
Based on the following, maybe move the SQLite "create table" DDLs to a
separate file. A "tool" or a "job" can still do that out-of-band.
Update: for SQLite it does not matter, as SQLite does not have a concept
of RBAC; creating a database is equivalent to creating a file.
For other databases it can be considered, so maybe write an ADR for that.
ADR:
It's generally considered better security practice to create the schema
out-of-band (separate from the application) for several reasons:
Principle of Least Privilege:
The application should only have the permissions it needs for runtime
(INSERT, UPDATE, SELECT, etc.)
Schema modification rights (CREATE TABLE, ALTER TABLE, etc.) are not
needed during normal operation
This limits potential damage if the application is compromised
Change Management:
Database schema changes can be managed through proper migration tools
Changes can be reviewed, versioned, and rolled back if needed
Prevents accidental schema modifications during application restarts
Environment Consistency:
Ensures all environments (dev, staging, prod) have identical schemas
Reduces risk of schema drift between environments
Makes it easier to track schema changes in version control
</issue>
<qa>
<issue>
- SPIKE Nexus Sanity Tests
- Ensure SPIKE Nexus caches the root key in memory.
- Ensure SPIKE Nexus reads from SPIKE Keep if it does not have the root
key.
- Ensure SPIKE Nexus saves the encrypted root key to the database.
- Ensure SPIKE Nexus caches the user's session key.
- Ensure SPIKE Nexus removes outdated session keys.
- Ensure SPIKE Nexus does not re-init (without manual intervention)
after
being initialized.
- Ensure SPIKE Nexus adheres to the bootstrapping sequence diagram.
- Ensure SPIKE Nexus backs up the admin token by encrypting it with the
root key and storing it in the database.
- Ensure SPIKE Nexus stores the initialization tombstone in the
database.
</issue>
<issue>
- SPIKE Pilot Sanity Tests
- Ensure SPIKE Pilot denies any operation if SPIKE Nexus is not
initialized.
- Ensure SPIKE Pilot can warn if SPIKE Nexus is unreachable
- Ensure SPIKE Pilot does not hang indefinitely if SPIRE is not there.
- Ensure SPIKE Pilot can get and set a secret.
- Ensure SPIKE Pilot can do a force reset.
- Ensure SPIKE Pilot can recover the root password.
- Ensure that after `spike init` you have a password-encrypted root key
in the db.
- Ensure that you can recover the password-encrypted root key.
</issue>
</qa>
</backlog>
<future>
<issue>
multiple keeper clusters:
keepers:
- nodes: [n1, n2, n3, n4, n5]
- nodes: [dr1, dr2]
If it can't reassemble the root key from the first pool, it could try the
next pool, which could be stood up only during disaster recovery.
</issue>
<issue>
a tool to read from one cluster of keepers to hydrate a different
cluster of keepers.
</issue>
<issue>
since OPA knows REST, can we expose a policy evaluation endpoint to
help OPA augment/extend SPIKE policy decisions?
</issue>
<issue>
Maybe create an interface for the KV store, so we can have thread-safe variants too.
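A minimal interface sketch with a mutex-wrapped variant; the method set is
illustrative only:

package kv

import "sync"

// Store is a minimal KV interface; the real one would likely also cover
// versions, metadata, and soft-delete semantics.
type Store interface {
  Get(path string) (map[string]string, bool)
  Put(path string, values map[string]string)
  Delete(path string)
}

// SafeStore wraps any Store with a mutex to get a thread-safe variant.
type SafeStore struct {
  mu    sync.RWMutex
  inner Store
}

func (s *SafeStore) Get(path string) (map[string]string, bool) {
  s.mu.RLock()
  defer s.mu.RUnlock()
  return s.inner.Get(path)
}

func (s *SafeStore) Put(path string, values map[string]string) {
  s.mu.Lock()
  defer s.mu.Unlock()
  s.inner.Put(path, values)
}

func (s *SafeStore) Delete(path string) {
  s.mu.Lock()
  defer s.mu.Unlock()
  s.inner.Delete(path)
}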
</issue>
<issue>
maybe create a password manager tool as an example use case
</issue>
<issue>
A `stats` endpoint to show the overall
system utilization
(how many secrets; how much memory, etc)
</issue>
<issue>
Maybe take inspiration for the admin UI from Keybase:
https://keybase.io/v0lk4n/devices
For that, we need an admin UI first :)
And for that, we need Keycloak to experiment with first.
</issue>
<issue>
The current docs are good and all, but they are not good for SEO; we might
want to convert to something like Zola later down the line.
</issue>
<issue>
Regarding ADR-0014:
Maybe we should use something S3-compatible as primary storage
instead of SQLite.
But that can wait until we implement other features.
Besides, Postgres support will be something that some of the stakeholders
want to see too.
</issue>
<issue>
SPIKE Dev Mode:
* Single binary
* `keeper` functionality runs in memory
* `nexus` uses an in-memory store, and its functionality is in the single
binary too.
* The only networking is between the binary and the SPIRE Agent.
* For development only.
The design should be maintainable with code reuse and should not turn into
maintaining two separate projects.
</issue>
<issue>
Rate limiting for API endpoints.
</issue>
<issue>
* super admin can create regular admins and other super admins.
* super admin can assign backup admins.
(see drafts.txt for more details)
</issue>
<issue>
Each keeper is backed by a TPM.
</issue>
<issue>
Do some static analysis.
</issue>
<to-plan>
<issue>
S3 (or compatible) backing store
</issue>
<issue>
File-based backing store
</issue>
<issue>
In memory backing store
</issue>
<issue>
Kubernetes Deployment
</issue>
</to-plan>
<issue>
The initial super admin can create other admins.
So that, if an admin leaves, the super admin can delete them;
or if an admin's password is compromised, the super admin can
reset it.
</issue>
<issue>
- Security Measures (SPIKE Nexus)
- Encrypting the root key with the admin password is good;
consider adding a salt to the password encryption.
- Maybe add a key rotation mechanism for the future.
</issue>
<issue>
- Error Handling
- Good use of exponential retries
- Consider adding specific error types/codes for different failure
scenarios
- Might want to add cleanup steps for partial initialization failures
</issue>
<issue>
Ability to stream logs and audit trails outside of stdout.
</issue>
<issue>
Audit logs should write to a separate location.
</issue>
<issue>
Create a dedicated OIDC resource server that acts like Pilot but exposes a
RESTful API for things like CI/CD integration.
</issue>
<issue>
HSM integration (i.e., the root key is managed/provided by an HSM, and the
key never leaves the trust boundary of the HSM).
</issue>
<issue>
Ability to rotate the root key (automatic via Nexus).
</issue>
<issue>
Ability to rotate the admin token (manual).
</issue>
<issue>
Admin tokens can expire.
</issue>
<issue>
Encourage creating users instead of relying on the system user.
</issue>
</future>
</stuff>