Run test by simulating #41
Conversation
Hi Guangda! Thanks for the PR! This is very cool. I'm absolutely willing to merge something like this in the future.

So, a while back I did a bit of my own experimenting with deterministic replay. I tried cloning all events before delivery and adding them to a per-node log. I found there were two basic problems. First, it slowed down execution and required lots of memory. Second, the logs were so long that none of the machinery around search tests was all that useful. That approach might be worth reconsidering one day, but I like this because it's much lighter weight.

I'll leave some comments in the code and mark this as a draft for now. When you're ready to try to merge it, mark it as ready for review.
```java
public void send(MessageEnvelope messageEnvelope) {
    var event = new VirtualTimeEvent();
    event.virtualTimeNanos = (long) (virtualTimeNanos +
            (38 * 1000 + 20 * 1000 * rand.nextGaussian()));
```
These constants should be declared at the top of the file for easier tuning and readability.
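For illustration, something like the following could work (the constant names here are made up, not taken from the PR; the values mirror the literals in the snippet above):

```java
// Hypothetical named constants for the latency model, in nanoseconds.
private static final double MESSAGE_LATENCY_MEAN_NANOS = 38 * 1000;
private static final double EVENT_JITTER_STDDEV_NANOS = 20 * 1000;

public void send(MessageEnvelope messageEnvelope) {
    var event = new VirtualTimeEvent();
    event.virtualTimeNanos = (long) (virtualTimeNanos +
            MESSAGE_LATENCY_MEAN_NANOS +
            EVENT_JITTER_STDDEV_NANOS * rand.nextGaussian());
    // ... enqueue the event as before
}
```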
```java
var event = new VirtualTimeEvent();
event.virtualTimeNanos = (long) (virtualTimeNanos +
        (double) timerEnvelope.timerLengthMillis() * 1000 * 1000 +
        (20 * 1000 * rand.nextGaussian()));
```
This won't work as-is. The framework guarantees that timers with the same durations are delivered in the order they're set. (https://github.com/emichael/dslabs/tree/master/handout-files#search-tests)
I agree. The consideration here is to make sure that if multiple nodes set timers with the same length at the same time (for example, when multiple clients send their first requests at virtual time 0), those timers can be delivered in any possible order. There should be per-node bookkeeping on timers to ensure the constraint you mentioned.
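A rough sketch of what that bookkeeping could look like (not code from this PR; `Address` is assumed to be the framework's node identifier):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: clamp a timer's virtual delivery time so it never fires before
// an earlier-set timer of the same length from the same node. Same-node
// ordering is preserved; cross-node interleaving still depends on the random
// jitter (and therefore on the seed).
final class PerNodeTimerBookkeeping {
    // Keyed by "node/timerLength"; value is the last virtual time handed out.
    private final Map<String, Long> lastAssigned = new HashMap<>();

    long clampedVirtualTime(Address node, long timerLengthMillis,
                            long proposedVirtualNanos) {
        String key = node + "/" + timerLengthMillis;
        long floor = lastAssigned.getOrDefault(key, Long.MIN_VALUE) + 1;
        long clamped = Math.max(proposedVirtualNanos, floor);
        lastAssigned.put(key, clamped);
        return clamped;
    }
}
```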
```java
singleThreaded =
        Boolean.parseBoolean(lookupWithDefault("singleThreaded", "false")),

startVisualization =
        Boolean.parseBoolean(lookupWithDefault("startViz", "false"));

@Getter private static final long seed =
        Long.parseLong(lookupWithDefault("seed", "2568"));
```
Any significance to 2568? :-)
```java
}

MessageEnvelope pollMessage(Address address) {
    synchronized (this) {
```
This is equivalent to just putting `synchronized` on the method.
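For reference, a minimal illustration of the equivalence (`doPoll` is just a placeholder for the method body):

```java
// A synchronized instance method...
synchronized MessageEnvelope pollMessage(Address address) {
    return doPoll(address);
}

// ...locks the same monitor (this) as wrapping the body in a synchronized block.
MessageEnvelope pollMessageExplicit(Address address) {
    synchronized (this) {
        return doPoll(address);
    }
}
```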
```java
 * A deterministic simulated network.
 */
// the inheritance is for easily reusing `RunState`. consider extract common
// behavior of `Network` and `SimulatedNetwork` into interface(s)
```
Yes, absolutely there should be an interface. It doesn't look like you're reusing any of the Network.java logic.
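As a purely illustrative sketch of what extracting an interface might look like (the interface name is hypothetical; the method signatures are taken from the snippets in this PR):

```java
// Hypothetical common interface; does not exist in dslabs today.
interface MessageBus {
    void send(MessageEnvelope messageEnvelope);

    MessageEnvelope pollMessage(Address address);
}

// Both classes could then implement it, and RunState could depend only on
// the interface:
//   class Network implements MessageBus { ... }
//   class SimulatedNetwork implements MessageBus { ... }
```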
Having read through the code, I think this could probably be cleaner if we refactored a few things.

I'm very interested to learn if this helps students. I don't have a good intuition on whether or not it will. They certainly want to know if their 1-out-of-100 failure will mean that they lose points. (I would argue that the whole point is to design systems which always work.) It's ultimately up to each instructor, but I wouldn't give students the seeds they will be tested against ahead of time. But then, in terms of uncertainty around grading outcomes, without the test seeds they're back in the same situation with simulated time that they were already in.

Certainly this will help with reproducibility, but I'm also a little concerned that a simulated network might not capture the kinds of failures that happen in multi-threaded runs. One of the things I'm concerned about is that single-threaded runs will be slower and might not have nodes make enough progress to get into bad states.

I'm not sure how to solve the external thread problem. One thing we could do is simply disable virtual time for those tests. That's not satisfactory. The "right" solution is to model external actions (e.g., the thread in the lab 2 repeated crashes test) as events which get put onto the same priority queue. This would require major refactoring of the framework and many/most tests.
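To make that last suggestion concrete, here is a rough sketch of what modeling external actions as queue events could look like (all names below are hypothetical, not part of the current framework):

```java
import java.util.PriorityQueue;

// Hypothetical event hierarchy: messages, timers, and external test actions
// all become entries on one virtual-time-ordered queue, so the simulation
// never races against a separate wall-clock thread.
abstract class SimEvent implements Comparable<SimEvent> {
    final long virtualTimeNanos;

    SimEvent(long virtualTimeNanos) {
        this.virtualTimeNanos = virtualTimeNanos;
    }

    @Override
    public int compareTo(SimEvent other) {
        return Long.compare(virtualTimeNanos, other.virtualTimeNanos);
    }

    abstract void fire();
}

// An external action (e.g., the repeated-crash thread in lab 2) expressed as
// an event instead of a live thread.
class ExternalActionEvent extends SimEvent {
    private final Runnable action;

    ExternalActionEvent(long virtualTimeNanos, Runnable action) {
        super(virtualTimeNanos);
        this.action = action;
    }

    @Override
    void fire() {
        action.run();
    }
}

class Simulation {
    private final PriorityQueue<SimEvent> queue = new PriorityQueue<>();
    private long virtualTimeNanos = 0;

    void schedule(SimEvent e) {
        queue.add(e);
    }

    // Drain events in virtual-time order; time advances only when an event fires.
    void run() {
        while (!queue.isEmpty()) {
            SimEvent e = queue.poll();
            virtualTimeNanos = e.virtualTimeNanos;
            e.fire();
        }
    }
}
```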
Thanks for the detailed review and comments! I'm glad to get such valuable feedback on this quick-and-dirty hack ^^

On the intuition about helping students, let me explain the rationale a little more. I totally agree that the lab assignments should require students to implement systems that always work, and that is also the only answer we have ever given to related questions. But deep down we know it is extremely hard to verify whether a solution will always work against certain run tests. I myself had a hard time with the constant-repartition run test in the Paxos lab: (1) I wrote a script to repeatedly run the test overnight to see whether it would fail again, although I was not sure what to do the next morning if some runs actually failed, and (2) it did not fail during the overnight repetition, but it failed again (by chance) when I later ran all of lab 3's tests in a row (a phenomenon students have also confirmed, and we still don't know the reason). It was a nightmare, and it is still the first thing I think of about the labs. With deterministic simulated run tests, for (1) the next morning I could check the seed that caused the failure and reliably reproduce it with that seed, and (2) would never happen at all because of deterministic execution.

And yes, we should not disclose the grading seeds, and students still cannot verify that a solution always works under the strictest definition of "always work" (which I think is impossible to achieve anyway). But they can know for sure that their solutions work for certain seeds, and they can decide for themselves how many seeds to test before submitting, i.e., define their own "always work". In my opinion this turns the black-box notion of "always work" into something a little more gray-box.

Also, even though simulation cannot eliminate uncertainty completely, it can still improve the workflow. For example, occasionally students appeal for points because they cannot reproduce a failure on their own machines. Today the best we can do is have the student run the tests live on their machine several times and give them the points (and even this has become less feasible because of the pandemic), but we never know whether the solution also fails on the student's machine; the chance is just smaller. With failing seeds this can be completely resolved.

In conclusion, as you mentioned, a run test can always be treated as a special kind of random DFS search test with a single probe and a super long trace. The labs benefit a lot from controllable search tests, and they would surely benefit from more controllable run tests as well.

Some other points:
The deterministic simulated mode for run tests is finally done and ready for use. (I will call it "replayable mode" for short.) Since it has been some time since this pull request was opened, I would like to briefly summarize my proposal and share the new thoughts I have had in the meantime.

Motivation. First of all, this PR is focused on debuggability; it does not change how or what this framework tests. It is intended to help students who are embarking on the journey of spending a whole week on the repeated partition test, and hopefully it makes their life easier, primarily by addressing the following two problems:
It also has some additional features, such as being usable for grading (after regression testing, discussed below), but most importantly it is a helper tool that lets students debug their solutions less painfully, especially for the heavyweight run tests. I believe this tool will be helpful. The following table compares three run modes:
Example. (Test summary omitted.)
Design. The goal is to keep the implementation modular, clearly separated, and fully compatible with the tests. I apologize for not making use of subtyping (as I dislike it). Instead, I place all the implementation in a … Previously I talked about the problem with … This solves the cases where interactive threads block on …

I also made some misc changes, such as fixing randomness in various places. I left comments on the detailed design choices.

Model latency and processing time. Previously, only message delivery latency and timer length were counted in virtual time. However, if we don't account for the time nodes spend processing messages and timers, the simulated result will diverge from a real run for an inefficient implementation. On the other hand, processing time is not deterministic, so we cannot simply add it to virtual time. The current time model is defined in …
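As a purely hypothetical illustration of the tension described above (this is not the PR's actual time model): one deterministic option would be to charge a fixed nominal processing cost per delivered event, so slow handlers do not make the trace diverge between runs.

```java
// Hypothetical sketch, not the PR's time model: charge each delivered event a
// fixed, deterministic processing cost in virtual time instead of measuring
// real handler runtime (which is non-deterministic).
final class FixedCostTimeModel {
    static final long NOMINAL_PROCESSING_NANOS = 10 * 1000; // assumed value

    long afterProcessing(long virtualTimeNanos) {
        return virtualTimeNanos + NOMINAL_PROCESSING_NANOS;
    }
}
```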
Debugging experience. During development, I completed the solutions for labs 3 and 4 using the replayable mode. I did all of the run-test debugging in replayable mode with the default seed, and the run tests in real mode never failed once they passed in replayable mode. The logged traces look reasonable and took less time to capture.

Regression test for grading. I plan to add a regression script that will run a batch of solutions in both the replayable and real modes side by side. For each solution it will run real mode 3 times and replayable mode with 10 different seeds. A test is considered failed in real/simulated mode if at least one run fails. The script will look for any case that fails in real mode but passes in simulated mode. If the regression passes for a sufficient number of solutions from past semesters, replayable mode may be suitable for use in grading. Alternatively, we could grade in real mode, and if a student requests clarification, we can search for a seed that also fails the tests and provide it to the student.

Other features.
Conclusion. I hope this work is useful. I still need to write the regression script and handout-side instructions, but the implementation itself is stable and ready to be code reviewed. I plan to maintain the replayable mode downstream and introduce it to students in January, and I would be glad to see it merged, tested more widely, and helping more students. Alternatively, I propose merging …
Seems like …
Updated time model benchmark
This is an early-stage, ongoing effort to support deterministic simulated execution of run tests. I would like to hear comments before moving on, e.g., whether this feature is wanted and would be merged. Planned tasks and work progress are listed at the end.
Motivation. In teaching CS5223 at NUS, one critical issue we faced is the poor reproducibility of complicated run tests. Specifically:
Introducing deterministic simulation may help a lot. With the same random seed, a solution will always reproduce the same result (as long as the solution itself is deterministic), independent of machine performance and of whether logging is enabled. For grading, we can simply select several random seeds for each submission and, for each failed solution, tell the student the failing seed(s), so they can reproduce the failure reliably.
How is this different from single-threaded execution? The current simulation implementation uses a single thread and shares a lot in common with single-threaded mode; in fact, simulation is currently implemented as an "enhanced" single-threaded mode. The key differences include:
The interface. Simulation can be enabled by passing the `--simulated` flag to the `run-tests.py` script. All run tests are then executed in simulation mode, with additional logging. If run multiple times, the virtual timestamps of all messages and timers stay the same. However, if the seed is customized with the other command-line argument, `--seed <n>`, the timestamps will change, and concurrent messages may be received in a different order.

Internally, `dslabs.framework.testing.runner.SimulatedNetwork` is added as a drop-in replacement for the plain `Network`. Minimal adaptation is applied to `RunState`, and there is no change to the labs' test cases.

Semantics and compatibility. In simulation, message delivery time is normally distributed with mean 38us and standard deviation 20us. The mean is based on a benchmark of the lab 1, part 3 long-running workload: on my machine, both simulation and real execution finish ~40K requests per client with this mean. The standard deviation was chosen tentatively and may be refined with further evaluation. For timers, a 20us standard deviation is also added, to model a small time skew and to prevent potential trouble if multiple events happen at the exact same virtual time.
In the test cases, `RunState.run` is called when we want to block until an ending condition is satisfied, and `RunState.start` is called when we want to start running in the background while the current thread concurrently operates on the system, e.g., sends commands and checks results. The latter case means the run test is not "isolated" from the open world: we can never know whether the earliest event we gathered from the nodes will actually be the next event, or whether the current thread will concurrently create an even earlier event while the simulation proceeds.

To solve this, the simulation takes the following approach:

- When `RunState.run` is called, no external event can be created, and the simulation always advances virtual time to the next earliest event.
- When `RunState.start` is called, the simulation never advances virtual time faster than real time; the pseudocode sketch below illustrates why this matters and where it still falls short.

The simulation takes 500ms of wall-clock time in the normal case, and client requests are sent every 100ms in both wall-clock time and virtual time. On a slow machine, the simulation running in a background thread may not finish processing each 100ms virtual-time interval within 100ms of wall-clock time, but the current thread will still submit client requests every 100ms of wall-clock time. This results in virtual send times earlier than expected and breaks the determinism guarantee.
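A rough, hypothetical sketch of the kind of test described above (the method names on `runState` and `client` are guesses for illustration, not the actual test code):

```java
// Hypothetical test shape for the RunState.start case.
runState.start(runSettings);           // simulation runs in a background thread

for (int i = 0; i < 5; i++) {
    Thread.sleep(100);                 // current thread paces itself by wall clock
    client.sendCommand(nextCommand()); // injects an "external" event mid-run
}
// ~500ms of wall-clock time total. The cap (virtual time never ahead of wall
// clock) keeps a command injected at wall-clock 100ms from arriving "in the
// past" of the simulation; but on a slow machine the simulation may lag behind
// wall clock, so the injected command gets a virtual timestamp earlier than
// expected, which is the remaining limitation described above.
```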
I hope that in the future a thorough refactor of the test cases can be applied so that the simulation is fully decoupled from wall-clock time, which would address the above issue and let every test advance virtual time automatically. Apart from this, simulation should be fully compatible with the current system.
Evaluation and plans. Currently the implementation is tested against my solutions to labs 1 and 3. Later it will be applied to students' solutions from last semester to check whether there is any regression. I think it will be ready to be merged when all of the following are done:

- … `Random` instance from multiple threads

Some potential long-term tasks: