-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement multithreaded stochastic swap in rust #7658
Implement multithreaded stochastic swap in rust #7658
Commits on Feb 14, 2022
-
Implement multithreaded stochastic swap in rust
This commit is a rewrite of the core swap trials functionality in the StochasticSwap transpiler pass. Previously this core routine was written using Cython (see Qiskit#1789) which had great performance, but that implementation was single threaded. The core of the stochastic swap algorithm by it's nature is well suited to be executed in parallel, it attempts a number of random trials and then picks the best result from all the trials and uses that for that layer. These trials can easily be run in parallel as there is no data dependency between the trials (there are shared inputs but read-only). As the algorithm generally scales exponentially the speed up from running the trials in parallel can offset this and improve the scaling of the pass. Running the pass in parallel was previously tried in Qiskit#4781 using Python multiprocessing but the overhead of launching an additional process and serializing the input arrays for each trial was significantly larger than the speed gains. To run the algorithm efficiently in parallel multithreading is needed to leverage shared memory on shared inputs. This commit rewrites the cython routine using rust. This was done for two reasons. The first is that rust's safety guarantees make dealing with and writing parallel code much easier and safer. It's also multiplatform because the rust language supports native threading primatives in language. The second is while writing parallel cython code using open-mp there are limitations with it, mainly on windows. In practice it was also difficult to write and maintain parallel cython code as it has very strict requirements on python and c code interactions. It was much faster and easier to port it to rust and the performance for each iteration (outside of parallelism) is the same (in some cases marginally faster) in rust. The implementation here reuses the data structures that the previous cython implementation introduced (mainly flattening all the terra objects into 1d or 2d numpy arrays for efficient access from C). The speedups from this PR can be significant, calling transpile() on a 400 qubit (with a depth of 10) QV model circuit targetting a 409 heavy hex coupling map goes from ~200 seconds with the single threaded cython to ~60 seconds with this PR locally on a 32 core system, When transpiling a 1000 qubit (also with a depth of 10) QV model circuit targetting a 1081 qubit heavy hex coupling map goes from taking ~6500 seconds to ~720 seconds. The tradeoff with this PR is for local qiskit-terra development a rust compiler needs to be installed. This is made trivial using rustup (https://rustup.rs/), but it is an additional burden and one that we might not want to make. If so we can look at turning this PR into a separate repository/package that qiskit-terra can depend on. The tradeoff here is that we'll be adding friction to the api boundary between the pass and the core swap trials interface. But, it does ease the dependency on development for qiskit-terra.
Configuration menu - View commit details
-
Copy full SHA for 9d9c2dc - Browse repository at this point
Copy the full SHA 9d9c2dcView commit details
Commits on Feb 16, 2022
-
Sanitize packaging to support future modules
This commit fixes how we package the compiled rust module in qiskit-terra. As a single rust project only gives us a single compiled binary output we can't use the same scheme we did previously with cython with a separate dynamic lib file for each module. This shifts us to making the rust code build a `qiskit._accelerate` module and in that we have submodules for everything we need from compiled code. For this PR there is only one submodule, `stochastic_swap`, so for example the parallel swap_trials routine can be imported from `qiskit._accelerate.stochastic_swap.swap_trials`. In the future we can have additional submodules for other pieces of compiled code in qiskit. For example, the likely next candidate is the pauli expectation value cython module, which we'll likely port to rust and also make parallel (for sufficiently large number of qubits). In that case we'd add a new submodule for that functionality.
Configuration menu - View commit details
-
Copy full SHA for 7af1c43 - Browse repository at this point
Copy the full SHA 7af1c43View commit details -
Adjust random normal distribution to use correct mean
This commit corrects the use of the normal distribution to have the mean set to 1.0. Previously we were doing this out of band for each value by adding 1 to the random value which wasn't necessary because we could just generate it with a mean of 1.0.
Configuration menu - View commit details
-
Copy full SHA for 2f8de01 - Browse repository at this point
Copy the full SHA 2f8de01View commit details -
Remove unecessary extra scope from locked read
This commit removes an unecessary extra scope around the locked read for where we store the best solution. The scope was previously there to release the lock after we check if there is a solution or not. However this wasn't actually needed as we can just do the check inline and the lock will release after the condition block.
Configuration menu - View commit details
-
Copy full SHA for 6da1b7b - Browse repository at this point
Copy the full SHA 6da1b7bView commit details -
Configuration menu - View commit details
-
Copy full SHA for 3cf6e04 - Browse repository at this point
Copy the full SHA 3cf6e04View commit details -
Fix indices typo in NLayout constructor
Co-authored-by: Jake Lishman <jake@binhbar.com>
Configuration menu - View commit details
-
Copy full SHA for f838ff6 - Browse repository at this point
Copy the full SHA f838ff6View commit details -
Remove explicit lifetime annotation from swap_trials
Previously the swap_trials() function had an explicit lifetime annotation `'p` which wasn't necessary because the compiler can determine this on it's own. Normally when dealing with numpy views and a Python object (i.e. a GIL handle) we need a lifetime annotation to tell the rust compiler the numpy view and the python gil handle will have the same lifetime. But since swap_trials doesn't take a gil handle and operates purely in rust we don't need this lifetime and the rust compiler can deal with the lifetime of the numpy views on their own.
Configuration menu - View commit details
-
Copy full SHA for 780f327 - Browse repository at this point
Copy the full SHA 780f327View commit details -
Configuration menu - View commit details
-
Copy full SHA for 35fba96 - Browse repository at this point
Copy the full SHA 35fba96View commit details -
Fix lint and add rust style and lint checks to CI
This commit fixes the python lint failures and also updates the ci configuration for the lint job to also run rust's style and lint enforcement.
Configuration menu - View commit details
-
Copy full SHA for d72d6c5 - Browse repository at this point
Copy the full SHA d72d6c5View commit details
Commits on Feb 17, 2022
-
Fix returned layout mapping from NLayout
This commit fixes the output list from the `layout_mapping()` method of `NLayout`. Previously, it incorrectly would return the wrong indices it should be a list of virtual -> physical to qubit pairs. This commit corrects this error Co-authored-by: georgios-ts <45130028+georgios-ts@users.noreply.github.com>
Configuration menu - View commit details
-
Copy full SHA for 0b6e4d7 - Browse repository at this point
Copy the full SHA 0b6e4d7View commit details -
Configuration menu - View commit details
-
Copy full SHA for a9a879d - Browse repository at this point
Copy the full SHA a9a879dView commit details -
Make swap_trials parallelization configurable
This commit makes the parallelization of the swap_trials() configurable. This is dones in two ways, first a new argument parallel_threshold is added which takes an optional int which is the number of qubits to switch between a parallel and serial version. The second is that it takes into account the the state of the QISKIT_IN_PARALLEL environment variable. This variable is set to TRUE by parallel_map() when we're running in a multiprocessing context. In those cases also running stochastic swap in parallel will likely just cause too much load as we're potentially oversubscribing work to the number of available CPUs. So, if QISKIT_IN_PARALLEL is set to True we run swap_trials serially.
Configuration menu - View commit details
-
Copy full SHA for 57790c8 - Browse repository at this point
Copy the full SHA 57790c8View commit details -
Configuration menu - View commit details
-
Copy full SHA for ea23eae - Browse repository at this point
Copy the full SHA ea23eaeView commit details -
Revert "Make swap_trials parallelization configurable"
This reverts commit 57790c8. That commit attempted to sovle some issues in test running, mainly around multiple parallel dispatch causing exceess load. But in practice it was broken and caused more issues than it fixed. We'll investigate and add control for the parallelization in a future commit separately after all the tests are passing so we have a good baseline.
Configuration menu - View commit details
-
Copy full SHA for 18a8b56 - Browse repository at this point
Copy the full SHA 18a8b56View commit details -
Configuration menu - View commit details
-
Copy full SHA for 6a63d11 - Browse repository at this point
Copy the full SHA 6a63d11View commit details
Commits on Feb 18, 2022
-
Fix race condition leading to non-deterministic behavior
Previously, in the case of circuits that had multiple best possible depth == 1 solutions for a layer, there was a race condition in the fast exit path between the threads which could lead to a non-deterministic result even with a fixed seed. The output was always valid, but which result was dependent on which parallel thread with an ideal solution finished last and wrote to the locked best result last. This was causing weird non-deterministic test failures for some tests because of Qiskit#1794 as the exact match result would change between runs. This could be a bigger issue because user expectations are that with a fixed seed set on the transpiler that the output circuit will be deterministically reproducible. To address this is issue this commit trades off some performance to ensure we're always returning a deterministic result in this case. This is accomplished by updating/checking if a depth==1 solution has been found in another trial thread we only act (so either exit early or update the already found depth == 1 solution) if that solution already found has a trial number that is less than this thread's trial number. This does limit the effectiveness of the fast exit, but in practice it should hopefully not effect the speed too much. As part of this commit some tests are updated because the new deterministic behavior is slightly different from the previous results from the cython serial implementation. I manually verified that the new output circuits are still valid (it also looks like the quality of the results in some of those cases improved, but this is strictly anecdotal and shouldn't be taken as a general trend with this PR).
Configuration menu - View commit details
-
Copy full SHA for 873de02 - Browse repository at this point
Copy the full SHA 873de02View commit details -
Configuration menu - View commit details
-
Copy full SHA for 2d9aae0 - Browse repository at this point
Copy the full SHA 2d9aae0View commit details -
Apply suggestions from code review
Co-authored-by: georgios-ts <45130028+georgios-ts@users.noreply.github.com>
Configuration menu - View commit details
-
Copy full SHA for 76167c7 - Browse repository at this point
Copy the full SHA 76167c7View commit details -
Configuration menu - View commit details
-
Copy full SHA for 2685c01 - Browse repository at this point
Copy the full SHA 2685c01View commit details -
Revert accidental commit of parallel reduction in compute_cost
This was only a for local testing to prove it was a bad idea and was accidently included in the branch. We should not nest the parallel execution like this.
Configuration menu - View commit details
-
Copy full SHA for 610b306 - Browse repository at this point
Copy the full SHA 610b306View commit details -
Eliminate short circuit for depth == 1 swap_trial() result
This commit eliminates the short circuit fast return in swap_trial() when another trial thread has found an ideal solution. Trying to do this in a parallel context is tricky to make deterministic because in cases of >1 depth == 1 solutions there is an inherent race condition between the threads for writing out their depth == 1 result to the shared location. Different strategies were tried to make this reliably deterministic but there wa still a race condition. Since this was just a performance optimization to avoid doing unnecessary work this commit removes this step. Weighing improved performance against repeatability in the output of the compiler, the reproducible results are more important. After we've adopted a multithreaded stochastic swap we can investigate adding this back as a potential future optimization.
Configuration menu - View commit details
-
Copy full SHA for c510764 - Browse repository at this point
Copy the full SHA c510764View commit details -
Configuration menu - View commit details
-
Copy full SHA for b7607c1 - Browse repository at this point
Copy the full SHA b7607c1View commit details -
Configuration menu - View commit details
-
Copy full SHA for 5e31fb2 - Browse repository at this point
Copy the full SHA 5e31fb2View commit details -
Configuration menu - View commit details
-
Copy full SHA for cfc8fcb - Browse repository at this point
Copy the full SHA cfc8fcbView commit details -
Configuration menu - View commit details
-
Copy full SHA for d726101 - Browse repository at this point
Copy the full SHA d726101View commit details -
Configuration menu - View commit details
-
Copy full SHA for fb70158 - Browse repository at this point
Copy the full SHA fb70158View commit details -
Revert "Eliminate short circuit for depth == 1 swap_trial() result"
This reverts commit c510764. The removal there was premature and we had a fix for the non-determinism in place, ignoring a typo which was preventing it from working. Co-Authored-By: Georgios Tsilimigkounakis <45130028+georgios-ts@users.noreply.github.com>
Configuration menu - View commit details
-
Copy full SHA for 25187b8 - Browse repository at this point
Copy the full SHA 25187b8View commit details -
Configuration menu - View commit details
-
Copy full SHA for b2a0536 - Browse repository at this point
Copy the full SHA b2a0536View commit details -
Configuration menu - View commit details
-
Copy full SHA for f2737aa - Browse repository at this point
Copy the full SHA f2737aaView commit details -
Configuration menu - View commit details
-
Copy full SHA for 2588837 - Browse repository at this point
Copy the full SHA 2588837View commit details
Commits on Feb 19, 2022
-
Disable multiprocessing parallelism in unit tests
This commit disables the multiprocessing based parallelism when running unittest jobs in CI. We historically have defaulted the use of multiprocessing in environments only where the "fork" start method is available because this has the best performance and has no caveats around how it is used by users (you don't need an `if __name__ == "__main__"` guard). However, the use of the "fork" method isn't always 100% reliable (see https://bugs.python.org/issue40379), which we saw on Python 3.9 Qiskit#6188. In unittest CI (and tox) by default we use stestr which spawns (not using fork) parallel workers to run tests in parallel. With this PR this means in unittest we're now running multiple test runner subprocesses, which are executing parallel dispatched code using multiprocessing's fork start method, which is executing multithreaded rust code. This three layers of nesting is fairly reliably hanging as Python's fork doesn't seem to be able to handle this many layers of nested parallelism. There are 2 ways I've been able to fix this, the first is to change the start method used by `parallel_map()` to either "spawn" or "forkserver" either of these does not suffer from random hanging. However, doing this in the unittest context causes significant overhead and slows down test executing significantly. The other is to just disable the multiprocessing which fixes the hanging and doesn't impact runtime performance signifcantly (and might actually help in CI so we're not oversubscribing the limited resources. As I have not been able to reproduce `parallel_map()` hanging in a standalone context with multithreaded stochastic swap this commit opts for just disabling multiprocessing in CI and documenting the known issue in the release notes as this is the simpler solution. It's unlikely that users will nest parallel processes as it typically hurts performance (and parallel_map() actively guards against it), we only did it in testing previously because the tests which relied on it were a small portion of the test suite (roughly 65 tests) and typically did not have a significant impact on the total throughput of the test suite.
Configuration menu - View commit details
-
Copy full SHA for 93ded7f - Browse repository at this point
Copy the full SHA 93ded7fView commit details
Commits on Feb 20, 2022
-
Configuration menu - View commit details
-
Copy full SHA for e7a6b93 - Browse repository at this point
Copy the full SHA e7a6b93View commit details -
Configuration menu - View commit details
-
Copy full SHA for 91f0da9 - Browse repository at this point
Copy the full SHA 91f0da9View commit details -
Add test script to explicitly verify parallel dispatch
In an earlier commit we disabled the use of parallel dispatch in parallel_map() to avoid a bug in cpython associated with their fork() based subprocess launch. Doing this works around the bug which was reliably triggered by running multiprocessing in parallel subprocesses. It also has the side benefit of providing a ~2x speed up for test suite execution in CI. However, this meant we lost our test coverage in CI for running parallel_map() with actual multiprocessing based parallel dispatch. To ensure we don't inadvertandtly regress this code path moving forward this commit adds a dedicated test script which runs a simple transpilation in parallel and verifies that everything works as expected with the default parallelism settings.
Configuration menu - View commit details
-
Copy full SHA for 6da2a3a - Browse repository at this point
Copy the full SHA 6da2a3aView commit details
Commits on Feb 22, 2022
-
Avoid multi-threading when run in a multiprocessing context
This commit adds a switch on running between a single threaded and a multithreaded variant of the swap_trials loop based on whether the QISKIT_IN_PARALLEL flag is set. If QISKIT_IN_PARALLEL is set to TRUE this means the `parallel_map()` function is running in the outer python context and we're running in multiprocessing already. This means we do not want to be running in multiple threads generally as that will lead to potential resource exhaustion by spawn n processes each potentially running with m threads where `n` is `min(num_phys_cpus, num_tasks)` and `m` is num_logical_cpus (although only `min(num_logical_cpus, num_trials)` will be active) which on the typical system there aren't enough cores to leverage both multiprocessing and multithreading. However, in case a user does have such an environment they can set the `QISKIT_FORCE_THREADS` env variable to `TRUE` which will use threading regardless of the status of `QISKIT_IN_PARALLEL`.
Configuration menu - View commit details
-
Copy full SHA for 3d1d561 - Browse repository at this point
Copy the full SHA 3d1d561View commit details -
Configuration menu - View commit details
-
Copy full SHA for 25b2747 - Browse repository at this point
Copy the full SHA 25b2747View commit details
Commits on Feb 23, 2022
-
Apply suggestions from code review
Co-authored-by: Jake Lishman <jake@binhbar.com>
Configuration menu - View commit details
-
Copy full SHA for fc109d2 - Browse repository at this point
Copy the full SHA fc109d2View commit details -
Minor fixes from review comments
This commits fixes some minor details found during code review. It expands the section on building from source to explain how to build a release optimized binary with editable mode, makes the QISKIT_PARALLEL env variable usage consistent across all jobs, and adds a missing shebang to the `install_rush.sh` script which is used to install rust in the manylinux container environment.
Configuration menu - View commit details
-
Copy full SHA for eaf92f2 - Browse repository at this point
Copy the full SHA eaf92f2View commit details -
Configuration menu - View commit details
-
Copy full SHA for b318a2a - Browse repository at this point
Copy the full SHA b318a2aView commit details -
In earlier commits the tox configuration was changed to try and fix the docs CI job by going to great effort to try and enforce that setuptools-rust was installed in all situations, even before it was actually needed. However, the problem with the docs ci job was unrelated to the tox configuration and this reverts the configuration to something that works with more versions of tox and setuptools-rust.
Configuration menu - View commit details
-
Copy full SHA for 4c4c9d4 - Browse repository at this point
Copy the full SHA 4c4c9d4View commit details -
Configuration menu - View commit details
-
Copy full SHA for 73296a4 - Browse repository at this point
Copy the full SHA 73296a4View commit details
Commits on Feb 28, 2022
-
Configuration menu - View commit details
-
Copy full SHA for eb69288 - Browse repository at this point
Copy the full SHA eb69288View commit details