Add cluster leader election #6

Open
woodsaj opened this issue Apr 4, 2016 · 7 comments

Comments

@woodsaj
Contributor

woodsaj commented Apr 4, 2016

Issue by woodsaj
Friday Jun 19, 2015 at 20:22 GMT
Originally opened as raintank/grafana#228


A few features in the code base should only run on one node at a time.
This requires the nodes to coordinate that role among themselves.

Raft seems to be the new hotness when it comes to these things, so we should use that. CoreOS's etcd project includes a Raft implementation:
https://godoc.org/github.com/coreos/etcd/raft

Comment by Dieterbe
Tuesday Jun 23, 2015 at 00:06 GMT


  1. Mind sharing a bit about what those features are?
  2. Can we get away with transactions on the database?
  3. For alerting, I noticed you mentioned somewhere running only one job producer, but I thought we had decided to run multiple alert job producers for HA: if jobs are consistently routed (by key), the consumers will drop jobs they have already processed anyway. This is a fairly simplistic method of HA. If you are thinking of running only one producer, and it dies and restarts somewhere else, then we also need to keep track of the last timestamp at which jobs were scheduled. If it takes several seconds to restart a producer, the new producer should also process the ticks missed during those seconds. (I actually like this approach. It seems more efficient, but it also requires more operations/automation. Perhaps we should postpone this improvement until multiple producers bring too much overhead?)
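
To illustrate the multiple-producer variant in point 3, here is a minimal sketch of consumer-side de-duplication. It assumes jobs are routed consistently by key, so every copy of a job reaches the same consumer. The `Job` and `Deduper` names and the unix-second `Tick` field are hypothetical and do not come from the code base.

```go
package main

import "sync"

// Job is a hypothetical alert job. Key identifies the check and Tick is the
// scheduled timestamp (unix seconds) the job was produced for.
type Job struct {
	Key  string
	Tick int64
}

// Deduper remembers the latest tick processed per key, so that duplicate jobs
// emitted by redundant producers can be dropped.
type Deduper struct {
	mu   sync.Mutex
	last map[string]int64
}

// NewDeduper returns an empty Deduper.
func NewDeduper() *Deduper {
	return &Deduper{last: make(map[string]int64)}
}

// ShouldProcess reports whether the job is new for its key and, if so,
// records its tick as the latest one seen.
func (d *Deduper) ShouldProcess(j Job) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if seen, ok := d.last[j.Key]; ok && j.Tick <= seen {
		return false // already handled this tick (or a later one)
	}
	d.last[j.Key] = j.Tick
	return true
}
```

With this, ten producers emitting the same `(Key, Tick)` pair result in one processed job per consumer, though all ten messages still travel through the queue.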

Comment by woodsaj
Tuesday Jun 23, 2015 at 15:10 GMT


This is a long-term goal to meet future scalability needs.

  1. The alerting scheduler, and also collector session management.

  2. Yes, and that is likely what will be deployed first, but it does not scale. So long term we need a better solution.

  3. Also true, but as with 2, it does not scale. If we are running 10 instances of Grafana, we don't want all 10 pushing the same messages into the queue.
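
For point 2, a database-backed election could take the form of a lease that one node holds and periodically renews. The sketch below is an assumption about how that might look, not the planned implementation: `LeaseStore` is a hypothetical in-memory stand-in for a single row updated inside a transaction, and timestamps are plain unix seconds to keep the example self-contained.

```go
package main

import "sync"

// LeaseStore stands in for one database row that is read and updated inside
// a transaction. It is a hypothetical substitute for illustration only.
type LeaseStore struct {
	mu      sync.Mutex
	holder  string // node currently holding the lease, "" if none
	expires int64  // unix seconds at which the lease lapses
}

// TryAcquire makes node the leader if the lease is free, expired, or already
// held by node (a renewal). It returns true if node holds the lease afterwards.
func (s *LeaseStore) TryAcquire(node string, now, ttl int64) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.holder != "" && s.holder != node && now < s.expires {
		return false // another node holds an unexpired lease
	}
	s.holder = node
	s.expires = now + ttl
	return true
}
```

Each instance would call `TryAcquire` on a timer and run the single-node features only while it returns true. Every instance polls the same row, which is one way to read the "does not scale" concern.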

Comment by Dieterbe
Tuesday Jun 23, 2015 at 19:59 GMT


> This is a long term goal to meet future scalability needs.

I believe @nopzor1200 described the raft leader election as a high-prio item that was a must before we can launch.

I agree with your reasoning, @woodsaj, but we should make sure we're on the same page regarding the urgency and timeline of this.
Also, to make this viable I will need to make the alerting scheduler stateful, keeping track of the last successfully processed timestamp. Perhaps this could go into the Raft log, into etcd, or into the database. Will we have an HA transactional database?
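
As a rough sketch of what a stateful scheduler would do on takeover, a new leader could compute the ticks it missed from the last successfully processed timestamp. The `MissedTicks` function, the fixed interval, and the unix-second timestamps are assumptions for illustration. Where the timestamp is persisted (Raft log, etcd, or database) is left open, as above.

```go
package main

// MissedTicks returns every tick timestamp (unix seconds) in the half-open
// range (last, now] that is aligned to interval. last is the last tick that
// was successfully processed. Timestamps are assumed to be non-negative.
func MissedTicks(last, now, interval int64) []int64 {
	if interval <= 0 {
		return nil
	}
	var ticks []int64
	// First aligned tick strictly after last.
	next := (last/interval + 1) * interval
	for t := next; t <= now; t += interval {
		ticks = append(ticks, t)
	}
	return ticks
}
```

A restarted producer would schedule jobs for each returned tick and then persist the final one as the new last-processed timestamp.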

Comment by Dieterbe
Wednesday Jul 08, 2015 at 02:30 GMT


Just saw a Docker talk at the pre-GopherCon party about libkv, which provides a nice abstraction for leader election (it supports etcd, Consul and ZooKeeper).

Comment by nopzor1200
Saturday Jul 18, 2015 at 05:23 GMT


I originally misunderstood whether this was a high-prio or a low-prio item. @woodsaj, can you confirm that it (Raft or the like) is not something we need to worry about for now?

Comment by woodsaj
Sunday Jul 19, 2015 at 13:24 GMT


This is low prio.

Comment by Dieterbe
Friday Jul 31, 2015 at 15:54 GMT


(Interestingly, this ticket was in "to do" in Codetree. When I moved it to backlog, it removed the backlog milestone. I guess that's because it doesn't use milestones for backlog.)
