forked from ioi/isolate
-
Notifications
You must be signed in to change notification settings - Fork 0
/
isolate.1.txt
452 lines (362 loc) · 18.9 KB
/
isolate.1.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
ISOLATE(1)
==========
NAME
----
isolate - Isolate a process using Linux Containers
SYNOPSIS
--------
*isolate* 'options' *--init*
*isolate* 'options' *--run* +--+ 'program' 'arguments'
*isolate* 'options' *--cleanup*
DESCRIPTION
-----------
Run 'program' within a sandbox, so that it cannot communicate with the
outside world and its resource consumption is limited. This can be used
for example in a programming contest to run untrusted programs submitted
by contestants in a controlled environment.
The sandbox is used in the following way:
* Run *isolate --init*, which initializes the sandbox, creates its working directory and
prints its name to the standard output. If the sandbox already existed, it
is reset.
* Populate the directory with the executable file of the program and its
input files.
* Call *isolate --run* to run the program. A single line describing the
status of the program is written to the standard error stream.
* Fetch the output of the program from the directory.
* Run *isolate --cleanup* to remove temporary files. Does nothing if the sandbox
was already cleaned up.
Please note that by default, the program is not allowed to start multiple
processes of threads. If you need that, turn on the control group mode
(see below).
BASIC OPTIONS
-------------
*-b, --box-id=*'id'::
When you run multiple sandboxes in parallel, you have to assign unique
IDs to them by this option. See the discussion on UIDs in the INSTALLATION
section. The ID defaults to 0.
*-M, --meta=*'file'::
Output meta-data on the execution of the program to a given file.
See below for syntax of the meta-files.
*-i, --stdin=*'file'::
Redirect standard input from 'file'. The 'file' has to be accessible
inside the sandbox (which means that the sandboxed program can manipulate
it arbitrarily). If not specified, standard input is inherited from the
parent process.
*-o, --stdout=*'file'::
Redirect standard output to 'file'. The 'file' has to be accessible
inside the sandbox (which means that the sandboxed program can manipulate
it arbitrarily). If not specified, standard output is inherited from the
parent process and the sandbox manager does not write anything to it.
*-r, --stderr=*'file'::
Redirect standard error output to 'file'. The 'file' has to be accessible
inside the sandbox (which means that the sandboxed program can manipulate
it arbitrarily). If not specified, standard error output is inherited from the
parent process. See also *--stderr-to-stdout*.
*--stderr-to-stdout*::
Redirect standard error output to standard output. This is performed after
the standard output is redirected by *--stdout*. Mutually exclusive with *--stderr*.
*-c, --chdir=*'dir'::
Change directory to 'dir' before executing the program. This path must be
relative to the root of the sandbox.
*-v, --verbose*::
Tell the sandbox manager to be verbose and report on what is going on.
Using *-v* multiple times produces even more jabber.
*-s, --silent*::
Tell the sandbox manager to keep silence. No status messages are printed
to stderr except for fatal errors of the sandbox itself. The combination of
*--verbose* and *--silent* has an undefined effect.
*--wait*::
Multiple instances of Isolate cannot manage the same sandbox simultaneously.
If you attempt to do that, the new instance refuses to run. With this option,
the new instance waits for the other instance to finish.
LIMITS
------
The following options can limit system resources consumed by the program.
*-m, --mem=*'size'::
Limit address space of the program to 'size' kilobytes. If more processes
are allowed, this applies to each of them separately. If this limit is reached,
further memory allocations fail (e.g., malloc returns NULL).
*-t, --time=*'time'::
Limit run time of the program to 'time' seconds. Fractional numbers are allowed.
Time in which the OS assigns the processor to other tasks is not counted.
If this limit is exceeded, the program is killed (after *--extra-time*, if set).
*-w, --wall-time=*'time'::
Limit wall-clock time to 'time' seconds. Fractional values are allowed.
This clock measures the time from the start of the program to its exit,
so it does not stop when the program has lost the CPU or when it is waiting
for an external event. We recommend to use *--time* as the main limit,
but set *--wall-time* to a much higher value as a precaution against
sleeping programs.
If this limit is exceeded, the program is killed.
*-x, --extra-time=*'time'::
When the *--time* limit is exceeded, do not kill the program immediately,
but wait until *--extra-time* seconds elapse since the start of the program.
This allows to report the real execution time, even if it exceeds the limit
slightly. Fractional numbers are allowed.
*-k, --stack=*'size'::
Limit process stack to 'size' kilobytes. By default, the whole address
space is available for the stack, but it is subject to the *--mem* limit.
If this limit is exceeded, the program receives the SIGSEGV signal.
*-n, --open-files=*'max'::
Limit number of open files to 'max'. The default value is 64. Setting this
option to 0 will result in unlimited open files.
If this limit is reached, system calls creating file descriptors fail
with error EMFILE.
*-f, --fsize=*'size'::
Limit size of each file created (or modified) by the program to 'size' kilobytes.
In most cases, it is better to restrict overall disk usage by a disk quota
(see below). This option can help in cases when quotas are not enabled
on the underlying filesystem.
If this limit is reached, system calls expanding files fail with error
EFBIG and the program receives the SIGXFSZ signal.
*-q, --quota=*'blocks'*,*'inodes'::
Set disk quota to a given number of blocks and inodes. This requires the
filesystem to be mounted with support for quotas. Unlike other options,
this one must be given to *isolate --init*. Please note that this
currently works only on the ext family of filesystems (other filesystems
use other interfaces for setting quotas).
If the quota is reached, system calls expanding files fail with error EDQUOT.
*--core=*'size'::
Limit size of core files created when a process crashes to 'size' kilobytes.
Defaults to zero, meaning that no core files are produced inside the sandbox.
*-p, --processes*[*=*'max']::
Permit the program to create up to 'max' processes and/or threads. Please
keep in mind that time and memory limit do not work with multiple processes
unless you enable the control group mode. If 'max' is not given, an arbitrary
number of processes can be run. By default, only one process is permitted.
If this limit is exceeded, system calls creating processes fail with error
EAGAIN.
ENVIRONMENT RULES
-----------------
UNIX processes normally inherit all environment variables from their parent. The
sandbox however passes only those variables which are explicitly requested by
environment rules:
*-E, --env=*'var'::
Inherit the variable 'var' from the parent.
*-E, --env=*'var'*=*'value'::
Set the variable 'var' to 'value'. When the 'value' is empty, the
variable is removed from the environment.
*-e, --full-env*::
Inherit all variables from the parent.
The rules are applied in the order in which they were given, except for
*--full-env*, which is applied first.
The list of rules is automatically initialized with *-ELIBC_FATAL_STDERR_=1*.
DIRECTORY RULES
---------------
The sandboxed process gets its own filesystem namespace, which contains only subtrees
requested by directory rules:
*-d, --dir=*'in'*=*'out'[*:*'options']::
Bind the directory 'out' as seen by the caller to the path 'in' inside the sandbox.
If there already was a directory rule for 'in', it is replaced.
*-d, --dir=*'dir'[*:*'options']::
Bind the directory +/+'dir' to 'dir' inside the sandbox.
If there already was a directory rule for 'in', it is replaced.
*-d, --dir=*'in'*=*::
Remove a directory rule for the path 'in' inside the sandbox.
By default, all directories are bound read-only and restricted (no devices,
no setuid binaries). This behavior can be modified using the 'options':
*rw*::
Allow read-write access.
*dev*::
Allow access to character and block devices.
*noexec*::
Disallow execution of binaries.
*maybe*::
Silently ignore the rule if the directory to be bound does not exist.
*fs*::
Instead of binding a directory, mount a device-less filesystem called 'in'.
For example, this can be 'proc' or 'sysfs'.
*tmp*::
Bind a freshly created temporary directory writeable for the sandbox user.
Accepts no 'out', implies *rw*.
*norec*::
Do not bind recursively. Without this option, mount points in the outside
directory tree are automatically propagated to the sandbox.
Unless *--no-default-dirs* is specified, the default set of directory rules binds +/bin+,
+/dev+ (with devices allowed), +/lib+, +/lib64+ (if it exists), and +/usr+. It also binds
the working directory to +/box+ (read-write), mounts the proc filesystem at +/proc+, and
creates a temporary directory +/tmp+.
*-D, --no-default-dirs*::
Do not bind the default set of directories. Care has to be taken to specify
the correct set of rules (using *--dir*) for the executed program to run
correctly. In particular, +/box+ has to be bound.
The rules are executed in the order in which they are given. Default rules come before
all user rules. When a rule is replaced, it retains the original position
in the order. This matters when one rule's 'in' is a sub-directory of another
rule's 'in'. For example if you first bind to 'a' and then to 'a/b', it will work as
expected, but a sub-directory 'b' must have existed in the directory bound to 'a' (isolate
never creates subdirectories in bound directories for security reasons). If the
order is 'a/b' before 'a', then the directory bound to 'a/b' becomes invisible
by the later binding on 'a'.
CONTROL GROUPS
--------------
Isolate can make use of system control groups provided by the kernel
to constrain programs consisting of multiple processes. Please note
that this feature needs special system setup described in the INSTALLATION
section.
*--cg*::
Enable use of control groups. This should be specified with *--init*,
*--run* and *--cleanup*.
*--cg-mem=*'size'::
Limit total memory usage by the whole control group to 'size' kilobytes.
This should be specified with *--run*.
Effect of reaching this limit depends on circumstances.
If it happens during memory allocation, the allocation can fail or memory
can be over-committed by the kernel.
If it happens when handling a page fault, the whole process is killed
by the OOM killer with the SIGSEGV signal.
*--print-cg-root*::
Print the root of the control group hierarchy in */sys/* and exit.
This is used by the *isolate-check-environment* script.
SPECIAL OPTIONS
---------------
The following options can be useful in special cases.
*--share-net*::
By default, isolate creates a new network namespace for its child process.
This namespace contains no network devices except for a per-namespace loopback.
This prevents the program from communicating with the outside world. If you want
to permit communication, you can use this switch to keep the child process
in parent's network namespace.
*--inherit-fds*::
By default, isolate closes all file descriptors passed from its parent
except for descriptors 0, 1, and 2.
This prevents unintentional descriptor leaks. In some cases, passing extra
descriptors to the sandbox can be desirable, so you can use this switch
to make them survive.
*--tty-hack*::
Try to handle interactive programs communicating over a tty.
The sandboxed program will run in a separate process group, which will temporarily
become the foreground process group of the terminal. When the program exits, the
process group will be switched back to the caller. Please note that the program
can do many nasty things including (but not limited to) changing terminal settings,
changing the line discipline, and stuffing characters to the terminal's input queue
using the TIOCSTI ioctl. Use with extreme caution.
*--special-files*::
By default, Isolate removes all special files (other than regular files
and directories) created inside the sandbox. If you need them, this option disables
that behavior, but you need to carefully check what you open.
*--as-uid=*'uid', *--as-gid=*'gid'::
Act on behalf of the specified user and group (only if Isolate was invoked by root).
This is used in scenarios where a root-controlled process manages creation of sandboxes
for regular users, usually in conjunction with the *restricted_init* option in
the configuration file.
META-FILES
----------
The meta-file contains miscellaneous meta-information on execution of the
program within the sandbox. It is a textual file consisting of lines
of format 'key'*:*'value'. The following keys are defined:
*cg-mem*::
When control groups are enabled, this is the total memory use
by the whole control group (in kilobytes). If you use *isolate --run*
multiple times in the same sandbox, the control group retains cached
data from the previous runs, which also contributes to *cg-mem*.
*cg-oom-killed*::
Present when the program was killed by the out-of-memory killer
(e.g., because it has exceeded the memory limit of its control group).
This is reported only on Linux 4.13 and later.
*csw-forced*::
Number of context switches forced by the kernel.
*csw-voluntary*::
Number of context switches caused by the process giving up the CPU
voluntarily.
*exitcode*::
The program has exited normally with this exit code.
*exitsig*::
The program has exited after receiving this fatal signal.
*killed*::
Present when the program was terminated by the sandbox
(e.g., because it has exceeded the time limit).
*max-rss*::
Maximum resident set size of the process (in kilobytes).
*message*::
Status message, not intended for machine processing.
E.g., "Time limit exceeded."
*status*::
Two-letter status code:
* *RE* -- run-time error, i.e., exited with a non-zero exit code
* *SG* -- program died on a signal
* *TO* -- timed out
* *XX* -- internal error of the sandbox
*time*::
Run time of the program in fractional seconds.
*time-wall*::
Wall clock time of the program in fractional seconds.
Please note that not all keys have to be present.
For example, no *status* nor *message* is reported upon normal termination.
RETURN VALUE
------------
When the program inside the sandbox finishes correctly, the sandbox returns 0.
If it finishes incorrectly, it returns 1.
All other return codes signal an internal error.
INSTALLATION
------------
Isolate depends on several advanced features of the Linux kernel, like different
kinds of namespaces and control groups. These features are available in kernels
of most Linux distributions now, but if you are building your own kernel, you
have to be careful.
Isolate is designed to run setuid to root. The sub-process inside the sandbox
then switches to a non-privileged user ID (different for each *--box-id*).
The range of UIDs available and several filesystem paths are set in a configuration
file, by default located in /usr/local/etc/isolate.
For control group mode:
- Linux supports two incompatible implementations of control groups: cgroup v1 and v2.
This version of Isolate requires v2, which is the default on recent systems.
- If you are using systemd, you need to start the `isolate.service` (see service files
in the `systemd` directory in Isolate's source tree). It establishes
`isolate.scope` whose cgroup subtree is delegated to Isolate by systemd.
The service runs a simple daemon `isolate-cg-keeper` to keep the scope alive.
- If you are not using systemd, make sure that Isolate's configuration file
refers to the correct location where you have the cgroup filesystem mounted.
Also make sure that whatever service manager you are using, it does not
interfere with Isolate's use of control groups.
- Running Isolate in containers is not recommended, since container managers
usually do not delegate control groups properly. Besides, you do not want
to share the machine with other workloads, which would influence measurement
of execution time. If you still want to use containers, you are on your own
and you probably have to make them privileged.
- Reporting memory usage requires Linux kernel 5.19 or newer.
- Since memory limits do not affect swapped-out data, we recommend turning off
swap completely.
Isolate expects that the root directory "/" is a mount point. When running
isolate inside a chroot, this may not be the case, and isolate may fail with
"Cannot privatize mounts". A workaround for this is to convert the root
directory of the chroot into a mount point using a bind mount, prior to
entering the chroot and running isolate. For example:
mount --bind /path/to/chroot /path/to/chroot
It is recommended to have +sys.fs.protected_hardlinks+ sysctl set to 1
(which is probably default on modern Linux systems). Otherwise, the user running
the sandbox could trick isolate to changing the owner of unrelated files.
If you have systemd-coredump installed, please keep in mind that it records core
files even for processes inside the sandbox. As it configures the kernel to deliver
core dumps using a pipe, it is not affected by the *--core* limit.
REPRODUCIBILITY
---------------
The reproducibility of results can be improved by tuning some kernel
parameters, listed below. Some of these parameters can be checked using the
program isolate-check-environment.
* Disable address space randomization: +sysctl kernel.randomize_va_space=0+.
Address space randomization can affect timing, memory usage, and program
behavior. This setting can be made persistent through /etc/sysctl.d/.
* Disable dynamic CPU frequency scaling. This is done by setting the cpufreq
scaling governor in /sys/device/system/cpu/cpufreq/*/scaling_governor to +performance+.
(On Intel CPUs, frequency scaling can be controlled by the `intel_pstate` driver,
but it still provides its own +performance+ controller to the cpufreq subsystem.)
* Consider disabling frequency boosting on CPUs that might support it (this
includes most i3/i5/i7 Intel CPUs and the AMD Zen architecture). This is done
either by writing 1 to /sys/devices/system/cpu/intel_pstate/no_turbo (on Intel CPUs)
or by writing 0 to /sys/devices/system/cpu/cpufreq/boost (other machines).
* Run evaluations on a single CPU (core). The Linux scheduler has a tendency to randomly
migrate tasks between CPUs, incurring cache migration costs. You can use isolate's
configuration file to pin the process to a specified CPU.
* Disable automatic kernel support for transparent huge pages. Both /sys/kernel/mm/transparent_hugepage/enabled
and /sys/kernel/mm/transparent_hugepage/defrag should be set to "madvise" or "never", and
/sys/kernel/mm/transparent_hugepage/khugepaged/defrag to 0.
* Disable swapping. If you really need swap space and you are using cgroups,
make sure that you have the memsw controller enabled, so that swap space is
properly accounted for.
See further suggestions in the https://ioi.github.io/checklist/[IOI Technical Checklist].
LICENSE
-------
Isolate was written by Martin Mares and Bernard Blackham.
It can be distributed and used under the terms of the GNU
General Public License version 2 or any later version.