forked from radical-cybertools/radical.pilot
-
Notifications
You must be signed in to change notification settings - Fork 0
/
TODO
246 lines (210 loc) · 10.2 KB
/
TODO
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
+ fix invokation frequency of idle callbacks
+ clean pilot cancelation
+ client agent startup
+ exec worker/watcher concurrency
+ lrms/scheduler dependency
+ profile names
+ log file name uniquenenss
+ state history (update ordering)
+ multiple updater instances will compete for state updates and push them
out of order, thus screwing with the CU states which end up in the DB.\
+ limit to one update instance
+ pilot updates via updater (different pubsub channel)
+ make sure finalize is called on terminate (via atexit?)
+ make sure all profile entries exist, merge from module profiling branch
+ merge from devel
+ merge from osgng
+ forward tunnel endpoint from bootstrap_1 to agent
+ make sure opaque_slot is a dict
+ make sure that the lrms_config_hook is called only once
(this is implicit now, as we only have one scheduler instance, and thus
only one (set of) LRMS instances)
- At this point it might be simpler to have one LRMS instance which handles
MPI and non-MPI? Not sure about the diversity of combinations which need
supporting...
= component failures need to propagate to the agent to trigger shutdown
= torus scheduler is broken in the split arch
- advance() implicitly publishes unit state, but the publisher needs to be
declared explicitly.
+ rename 'worker' to 'sub-agent' to avoid confusion?
- executor should write CU script in one go, not incrementally
- component: collapse declare_input and declare_worker
+ cu['_id'] -> cu['uid'] etc
+ state pubsub now has [cmd, ttype, thing] tuples fed from advance.
+ self.advance(cu, FAILED, ...) will not find a suitable output channel.
Handle gracefully!
+ Always push final things fully to DB!
- we mostly don't need create() methods -- in many cases we can just use
constructors.
+ default output staging component: __init__ calls Component __init__, but
should call base __init__, which then calls Component __init__
Same holds for other component implementations.
- component's base.create calls should pick up all component implementations
automatically, instead of having a static type map
+ Bridges should be created by session, both on client and on agent side.
+ on the component, add 'declare_thread' for actions which are needed more
frequent than idle callbacks, and also need some guarantees on the
frequency. Prominent example: execution watcher threads. Handle the
'work()' loop as just one of those threads (the default one).
+ register_idle_callback -> register_timed_callback
+ component watcher needs to clean out component threads even for failed
components
+ shutdown command propagation is broken
+ unit cancelation is best done on component level
+ subscribe to control to get cancel requests
+ for each cancel request
+ add to self._cancel_requested
+ whenver we see an advance or work request:
+ if uid in self._cancel_requested: advance(CANCEL) + drop
- we can still additionally add cancel actions during staging and
execution -- but the component base should provide support for that.
- use heartbeat mechanism also for bridges, agents, sub_agents
+ define _cfg['heart'] and use that for heartbeat src instead of owner,
limiting traffic and noise
- component.stop() should *first* unregister component_checker_cb
(and probably all other callbacks)
+ in fact, we should be very careful to unregister callbacks in reverse
order of registration.
- pilot default error cb calls sys.exit -> no shutdown because exit does not
exit when called in a thread
+ update the backfilling scheduler
- state notifications should be obtained via a tailed cursor
- update worker could check if a unit has multiple state updates pending. If
so, only send the last one. The client side cann still play callbacks for
all intermediate states. That implies delayed filling of the bulk.
+ update worker should not push state_hist on state updates, only on full
udpdates.
- pilot failing during wait_units results in timeout.
+ the update worker should have a list of keys which it needs to push to
MongoDB, and it should ignore other dict entries. Alternatively, components
should use underscore-prefixed keys for things which don't need to be stored
in MongoDB.
+ separate single-process components from multi-processed, in terms of
initialization
+ make sure we don't carry locks across fork calls. If that turns out to be
difficult (hi logging), we have to execve after fork and start afresh.
+ idle_cb -> timed_cb (we make no promise on idleness, really)
- kill pilots on pmgr.launching.finalize_child?
+ saga_pid in pmgr.launcher control_cb
+ heartbeat interval assertion
- dying agent bridge causes agent shutdown to hang
- lrms -> rm?
+ dying pilot watcher cb remains undetected
- cancellation leads to double wait, which is ok, but also reports twice,
which is not
- pilots canceled but state is not picked up during shutdown, should still
call callbacks.
- umgr wait and pmgr wait report different things, differently
(inclusion of previously final things, final state marker)
- staging directives should be expanded in umgr_input_staging -- at that point
we have enough information to make *all* src and tgt names into full URLs
(including fully quualified paths), and can then rely on those being URLs in
all places.
+ ctrl
+ things -> watchable
- advance -> remove push/pull args, and make those separate calls?
Probably not, but the advance call is a tad too complex and powerful...
+ comment in cb in sch base
+ remove release_slot in rr
+ add_pilot -> add_pilots
+ start/stop in mgrs...
+ open saga ticket: condor staging target should point to the job cwd, not
to intermediate storage
- units can end up in 'CANCELED + agent_pending', so agent pulls unusable
units. Final states should *always* set 'control' to 'umgr' or 'pmgr',
respectively
+ worker/agent.py -> agent/agent_n.py
+ agent_n controller has uid of pilot.000.ctrl -- should be agent_n.ctrl
+ sub-agents: catch stdout/err for popen procs
- agent_n does not need a child process really.
+ agent state ACTIVE is recorded twice
+ agent state DONE is recorded twice on timeout
+ we should never have a wildcard `except`, as that will also catch inetrrupts
and system exits, which we though expect to fall through to the component
layer for lifetime management.
- rpu.Component (via ru.Process) allows
self.start(spawn=False)
cases like this should not need a *process* class in the first place. It
would be cleaner to separate out the initialization/finalization, callback
and communication layer from Process
- separation of controller.[uid,heart,owner] is unclean
- the Controller class does not really do much apart from making sure the
sub-components and bridges are started according to a config. For that
little work it is rather complex.
- document minimal setuptools/pip version
fails with
pip 1.3 from
setuptools-0.6c11-py2.7
Expected Termination Behavior:
==============================
application calls session close
-------------------------------
- session.close() can be called at any point in time
- in main thread:
- close all umgrs
- cancel any non-final units and wait for them
- advance them to CANCELED
- issue 'cancel' commands (no reschedule)
- we want notifications for the unit cancel
? do we rely on UpdateWorker to register/store CANCELED advance?
- stop all components (delegated to session controller)
- stop all bridges (delegated to session controller)
- terminate all threads
- close all pmgrs
- cancel any non-final pilots and wait for them
- we want notifications for the pilot cancel
- stop all components (delegated to session controller)
- stop all bridges (delegated to session controller)
- terminate all threads
- terminate all threads
- close session controller
- terminate all threads
- stop all bridges
- stop all components
- in other threads
- thread should exit
- missing thread is detected by thread watcher
-> watcher calling session.close + raise
- in other processes / components
- stop the component
- no message is sent, we expect the pmgr, umgr or session watcher to
detect the closed component (if needed), and to start teardown
-> watcher calling session.close + raise
user interrupts via Ctrl-C
--------------------------
- application must except, call session.close()
- no guarantees on clean termination are made: multithreaded Python can
swallow interrupts.
pilot fails, exit_on_error is set
---------------------------------
- pmgr will invoke callback which will exit the callback thread
- missing thread is detected by thread watcher
-> watcher calling session.close + raise
pilot times out
---------------
- the agent_0 starts an idle callback (agent_command_cb) which checks runtime
- if runtime limit is reached
- set self._final_cause to 'timeout'
- call self.stop()
- agent_0 is a worker w/o child. stop() triggers this sequence:
- call finalize_parent
- stop self._lrms
- update db for final state
- call session.close()
- call finalize_common
- terminate and join all subscribers
- close profiler
- sys.exit()
term iv
-------
- bridges and components are started and owned by session. This should not be
the case: they should be owned *and watched* by the starting component, and
the session should only be *informed* about any available bridges, so that
these can be use by other components in the same session. Similarly,
bridges only need starting if they don't yet exist in the session.
- can base bridges still be owned by the session?
- move pmgr/umgr start_components calls to the component base class, into
initialize_child / parent (depending on spawn).
- make sure to also call start_bridges, before start_components.
- keep component and bridge handles in the component and clean out after
fork