Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement multithreaded stochastic swap in rust #7658

Merged
merged 42 commits into from
Feb 28, 2022

Commits on Feb 14, 2022

  1. Implement multithreaded stochastic swap in rust

    This commit is a rewrite of the core swap trials functionality in the
    StochasticSwap transpiler pass. Previously this core routine was written
    using Cython (see Qiskit#1789) which had great performance, but that
    implementation was single threaded. The core of the stochastic swap
    algorithm by it's nature is well suited to be executed in parallel, it
    attempts a number of random trials and then picks the best result
    from all the trials and uses that for that layer. These trials can
    easily be run in parallel as there is no data dependency between the
    trials (there are shared inputs but read-only). As the algorithm
    generally scales exponentially the speed up from running the trials in
    parallel can offset this and improve the scaling of the pass. Running
    the pass in parallel was previously tried in Qiskit#4781 using Python
    multiprocessing but the overhead of launching an additional process and
    serializing the input arrays for each trial was significantly larger
    than the speed gains. To run the algorithm efficiently in parallel
    multithreading is needed to leverage shared memory on shared inputs.
    
    This commit rewrites the cython routine using rust. This was done for
    two reasons. The first is that rust's safety guarantees make dealing
    with and writing parallel code much easier and safer. It's also
    multiplatform because the rust language supports native threading
    primatives in language. The second is while writing parallel cython
    code using open-mp there are limitations with it, mainly on windows. In
    practice it was also difficult to write and maintain parallel cython
    code as it has very strict requirements on python and c code
    interactions. It was much faster and easier to port it to rust and the
    performance for each iteration (outside of parallelism) is the same (in
    some cases marginally faster) in rust. The implementation here reuses
    the data structures that the previous cython implementation introduced
    (mainly flattening all the terra objects into 1d or 2d numpy arrays for
    efficient access from C).
    
    The speedups from this PR can be significant, calling transpile() on a
    400 qubit (with a depth of 10) QV model circuit targetting a 409 heavy
    hex coupling map goes from ~200 seconds with the single threaded cython
    to ~60 seconds with this PR locally on a 32 core system, When transpiling
    a 1000 qubit (also with a depth of 10) QV model circuit targetting a 1081
    qubit heavy hex coupling map goes from taking ~6500 seconds to ~720
    seconds.
    
    The tradeoff with this PR is for local qiskit-terra development a rust
    compiler needs to be installed. This is made trivial using rustup
    (https://rustup.rs/), but it is an additional burden and one that we
    might not want to make. If so we can look at turning this PR into a
    separate repository/package that qiskit-terra can depend on. The
    tradeoff here is that we'll be adding friction to the api boundary
    between the pass and the core swap trials interface. But, it does ease
    the dependency on development for qiskit-terra.
    mtreinish committed Feb 14, 2022
    Configuration menu
    Copy the full SHA
    9d9c2dc View commit details
    Browse the repository at this point in the history

Commits on Feb 16, 2022

  1. Sanitize packaging to support future modules

    This commit fixes how we package the compiled rust module in
    qiskit-terra. As a single rust project only gives us a single compiled
    binary output we can't use the same scheme we did previously with cython
    with a separate dynamic lib file for each module. This shifts us to
    making the rust code build a `qiskit._accelerate` module and in that we
    have submodules for everything we need from compiled code. For this PR
    there is only one submodule, `stochastic_swap`, so for example the
    parallel swap_trials routine can be imported from
    `qiskit._accelerate.stochastic_swap.swap_trials`. In the future we can
    have additional submodules for other pieces of compiled code in qiskit.
    For example, the likely next candidate is the pauli expectation value
    cython module, which we'll likely port to rust and also make parallel
    (for sufficiently large number of qubits). In that case we'd add a new
    submodule for that functionality.
    mtreinish committed Feb 16, 2022
    Configuration menu
    Copy the full SHA
    7af1c43 View commit details
    Browse the repository at this point in the history
  2. Adjust random normal distribution to use correct mean

    This commit corrects the use of the normal distribution to have the mean
    set to 1.0. Previously we were doing this out of band for each value by
    adding 1 to the random value which wasn't necessary because we could
    just generate it with a mean of 1.0.
    mtreinish committed Feb 16, 2022
    Configuration menu
    Copy the full SHA
    2f8de01 View commit details
    Browse the repository at this point in the history
  3. Remove unecessary extra scope from locked read

    This commit removes an unecessary extra scope around the locked read for
    where we store the best solution. The scope was previously there to
    release the lock after we check if there is a solution or not. However
    this wasn't actually needed as we can just do the check inline and the
    lock will release after the condition block.
    mtreinish committed Feb 16, 2022
    Configuration menu
    Copy the full SHA
    6da1b7b View commit details
    Browse the repository at this point in the history
  4. Configuration menu
    Copy the full SHA
    3cf6e04 View commit details
    Browse the repository at this point in the history
  5. Fix indices typo in NLayout constructor

    Co-authored-by: Jake Lishman <jake@binhbar.com>
    mtreinish and jakelishman authored Feb 16, 2022
    Configuration menu
    Copy the full SHA
    f838ff6 View commit details
    Browse the repository at this point in the history
  6. Remove explicit lifetime annotation from swap_trials

    Previously the swap_trials() function had an explicit lifetime
    annotation `'p` which wasn't necessary because the compiler can
    determine this on it's own. Normally when dealing with numpy views and a
    Python object (i.e. a GIL handle) we need a lifetime annotation to tell
    the rust compiler the numpy view and the python gil handle will have the
    same lifetime. But since swap_trials doesn't take a gil handle and
    operates purely in rust we don't need this lifetime and the rust
    compiler can deal with the lifetime of the numpy views on their own.
    mtreinish committed Feb 16, 2022
    Configuration menu
    Copy the full SHA
    780f327 View commit details
    Browse the repository at this point in the history
  7. Use sum() instead of fold()

    mtreinish committed Feb 16, 2022
    Configuration menu
    Copy the full SHA
    35fba96 View commit details
    Browse the repository at this point in the history
  8. Fix lint and add rust style and lint checks to CI

    This commit fixes the python lint failures and also updates the ci
    configuration for the lint job to also run rust's style and lint
    enforcement.
    mtreinish committed Feb 16, 2022
    Configuration menu
    Copy the full SHA
    d72d6c5 View commit details
    Browse the repository at this point in the history

Commits on Feb 17, 2022

  1. Fix returned layout mapping from NLayout

    This commit fixes the output list from the `layout_mapping()`
    method of `NLayout`. Previously, it incorrectly would return the
    wrong indices it should be a list of virtual -> physical to
    qubit pairs. This commit corrects this error
    
    Co-authored-by: georgios-ts <45130028+georgios-ts@users.noreply.github.com>
    mtreinish and georgios-ts authored Feb 17, 2022
    Configuration menu
    Copy the full SHA
    0b6e4d7 View commit details
    Browse the repository at this point in the history
  2. Configuration menu
    Copy the full SHA
    a9a879d View commit details
    Browse the repository at this point in the history
  3. Make swap_trials parallelization configurable

    This commit makes the parallelization of the swap_trials() configurable.
    This is dones in two ways, first a new argument parallel_threshold is
    added which takes an optional int which is the number of qubits to
    switch between a parallel and serial version. The second is that it
    takes into account the the state of the QISKIT_IN_PARALLEL environment
    variable. This variable is set to TRUE by parallel_map() when we're
    running in a multiprocessing context. In those cases also running
    stochastic swap in parallel will likely just cause too much load as
    we're potentially oversubscribing work to the number of available CPUs.
    So, if QISKIT_IN_PARALLEL is set to True we run swap_trials serially.
    mtreinish committed Feb 17, 2022
    Configuration menu
    Copy the full SHA
    57790c8 View commit details
    Browse the repository at this point in the history
  4. Configuration menu
    Copy the full SHA
    ea23eae View commit details
    Browse the repository at this point in the history
  5. Revert "Make swap_trials parallelization configurable"

    This reverts commit 57790c8. That
    commit attempted to sovle some issues in test running, mainly around
    multiple parallel dispatch causing exceess load. But in practice it was
    broken and caused more issues than it fixed. We'll investigate and add
    control for the parallelization in a future commit separately after all
    the tests are passing so we have a good baseline.
    mtreinish committed Feb 17, 2022
    Configuration menu
    Copy the full SHA
    18a8b56 View commit details
    Browse the repository at this point in the history
  6. Configuration menu
    Copy the full SHA
    6a63d11 View commit details
    Browse the repository at this point in the history

Commits on Feb 18, 2022

  1. Fix race condition leading to non-deterministic behavior

    Previously, in the case of circuits that had multiple best possible
    depth == 1 solutions for a layer, there was a race condition in the fast
    exit path between the threads which could lead to a non-deterministic
    result even with a fixed seed. The output was always valid, but which
    result was dependent on which parallel thread with an ideal solution
    finished last and wrote to the locked best result last. This was causing
    weird non-deterministic test failures for some tests because of Qiskit#1794 as
    the exact match result would change between runs. This could be a bigger
    issue because user expectations are that with a fixed seed set on the
    transpiler that the output circuit will be deterministically
    reproducible.
    
    To address this is issue this commit trades off some performance to
    ensure we're always returning a deterministic result in this case. This
    is accomplished by updating/checking if a depth==1 solution has been
    found in another trial thread we only act (so either exit early or
    update the already found depth == 1 solution) if that solution already
    found has a trial number that is less than this thread's trial number.
    This does limit the effectiveness of the fast exit, but in practice it
    should hopefully not effect the speed too much.
    
    As part of this commit some tests are updated because the new
    deterministic behavior is slightly different from the previous results
    from the cython serial implementation. I manually verified that the
    new output circuits are still valid (it also looks like the quality
    of the results in some of those cases improved, but this is strictly
    anecdotal and shouldn't be taken as a general trend with this PR).
    mtreinish committed Feb 18, 2022
    Configuration menu
    Copy the full SHA
    873de02 View commit details
    Browse the repository at this point in the history
  2. Configuration menu
    Copy the full SHA
    2d9aae0 View commit details
    Browse the repository at this point in the history
  3. Apply suggestions from code review

    Co-authored-by: georgios-ts <45130028+georgios-ts@users.noreply.github.com>
    mtreinish and georgios-ts authored Feb 18, 2022
    Configuration menu
    Copy the full SHA
    76167c7 View commit details
    Browse the repository at this point in the history
  4. Configuration menu
    Copy the full SHA
    2685c01 View commit details
    Browse the repository at this point in the history
  5. Revert accidental commit of parallel reduction in compute_cost

    This was only a for local testing to prove it was a bad idea and was
    accidently included in the branch. We should not nest the parallel
    execution like this.
    mtreinish committed Feb 18, 2022
    Configuration menu
    Copy the full SHA
    610b306 View commit details
    Browse the repository at this point in the history
  6. Eliminate short circuit for depth == 1 swap_trial() result

    This commit eliminates the short circuit fast return in swap_trial()
    when another trial thread has found an ideal solution. Trying to do this
    in a parallel context is tricky to make deterministic because in cases
    of >1 depth == 1 solutions there is an inherent race condition between
    the threads for writing out their depth == 1 result to the shared
    location. Different strategies were tried to make this reliably
    deterministic but there wa still a race condition. Since this was just a
    performance optimization to avoid doing unnecessary work this commit
    removes this step. Weighing improved performance against repeatability
    in the output of the compiler, the reproducible results are more
    important. After we've adopted a multithreaded stochastic swap we can
    investigate adding this back as a potential future optimization.
    mtreinish committed Feb 18, 2022
    Configuration menu
    Copy the full SHA
    c510764 View commit details
    Browse the repository at this point in the history
  7. Add missing docstrings

    mtreinish committed Feb 18, 2022
    Configuration menu
    Copy the full SHA
    b7607c1 View commit details
    Browse the repository at this point in the history
  8. Configuration menu
    Copy the full SHA
    5e31fb2 View commit details
    Browse the repository at this point in the history
  9. Configuration menu
    Copy the full SHA
    cfc8fcb View commit details
    Browse the repository at this point in the history
  10. Configuration menu
    Copy the full SHA
    d726101 View commit details
    Browse the repository at this point in the history
  11. Configuration menu
    Copy the full SHA
    fb70158 View commit details
    Browse the repository at this point in the history
  12. Revert "Eliminate short circuit for depth == 1 swap_trial() result"

    This reverts commit c510764. The
    removal there was premature and we had a fix for the non-determinism in
    place, ignoring a typo which was preventing it from working.
    
    Co-Authored-By: Georgios Tsilimigkounakis <45130028+georgios-ts@users.noreply.github.com>
    mtreinish and georgios-ts committed Feb 18, 2022
    Configuration menu
    Copy the full SHA
    25187b8 View commit details
    Browse the repository at this point in the history
  13. Configuration menu
    Copy the full SHA
    b2a0536 View commit details
    Browse the repository at this point in the history
  14. Fix rust lint

    mtreinish committed Feb 18, 2022
    Configuration menu
    Copy the full SHA
    f2737aa View commit details
    Browse the repository at this point in the history
  15. Fix docs job definition

    mtreinish committed Feb 18, 2022
    Configuration menu
    Copy the full SHA
    2588837 View commit details
    Browse the repository at this point in the history

Commits on Feb 19, 2022

  1. Disable multiprocessing parallelism in unit tests

    This commit disables the multiprocessing based parallelism when running
    unittest jobs in CI. We historically have defaulted the use of
    multiprocessing in environments only where the "fork" start method is
    available because this has the best performance and has no caveats
    around how it is used by users (you don't need an
    `if __name__ == "__main__"` guard). However, the use of the "fork"
    method isn't always 100% reliable (see
    https://bugs.python.org/issue40379), which we saw on Python 3.9 Qiskit#6188.
    In unittest CI (and tox) by default we use stestr which spawns (not using
    fork) parallel workers to run tests in parallel. With this PR this means
    in unittest we're now running multiple test runner subprocesses, which
    are executing parallel dispatched code using multiprocessing's fork
    start method, which is executing multithreaded rust code. This three layers
    of nesting is fairly reliably hanging as Python's fork doesn't seem to
    be able to handle this many layers of nested parallelism. There are 2
    ways I've been able to fix this, the first is to change the start method
    used by `parallel_map()` to either "spawn" or "forkserver" either of
    these does not suffer from random hanging. However, doing this in the
    unittest context causes significant overhead and slows down test
    executing significantly. The other is to just disable the
    multiprocessing which fixes the hanging and doesn't impact runtime
    performance signifcantly (and might actually help in CI so we're not
    oversubscribing the limited resources.
    
    As I have not been able to reproduce `parallel_map()` hanging in
    a standalone context with multithreaded stochastic swap this commit opts
    for just disabling multiprocessing in CI and documenting the known issue
    in the release notes as this is the simpler solution. It's unlikely that
    users will nest parallel processes as it typically hurts performance
    (and parallel_map() actively guards against it), we only did it in
    testing previously because the tests which relied on it were a small
    portion of the test suite (roughly 65 tests) and typically did not have
    a significant impact on the total throughput of the test suite.
    mtreinish committed Feb 19, 2022
    Configuration menu
    Copy the full SHA
    93ded7f View commit details
    Browse the repository at this point in the history

Commits on Feb 20, 2022

  1. Configuration menu
    Copy the full SHA
    e7a6b93 View commit details
    Browse the repository at this point in the history
  2. Configuration menu
    Copy the full SHA
    91f0da9 View commit details
    Browse the repository at this point in the history
  3. Add test script to explicitly verify parallel dispatch

    In an earlier commit we disabled the use of parallel dispatch in
    parallel_map() to avoid a bug in cpython associated with their fork()
    based subprocess launch. Doing this works around the bug which was
    reliably triggered by running multiprocessing in parallel subprocesses.
    It also has the side benefit of providing a ~2x speed up for test suite
    execution in CI. However, this meant we lost our test coverage in CI for
    running parallel_map() with actual multiprocessing based parallel
    dispatch. To ensure we don't inadvertandtly regress this code path
    moving forward this commit adds a dedicated test script which runs a
    simple transpilation in parallel and verifies that everything works as
    expected with the default parallelism settings.
    mtreinish committed Feb 20, 2022
    Configuration menu
    Copy the full SHA
    6da2a3a View commit details
    Browse the repository at this point in the history

Commits on Feb 22, 2022

  1. Avoid multi-threading when run in a multiprocessing context

    This commit adds a switch on running between a single threaded and a
    multithreaded variant of the swap_trials loop based on whether the
    QISKIT_IN_PARALLEL flag is set. If QISKIT_IN_PARALLEL is set to TRUE
    this means the `parallel_map()` function is running in the outer python
    context and we're running in multiprocessing already. This means we do
    not want to be running in multiple threads generally as that will lead
    to potential resource exhaustion by spawn n processes each potentially
    running with m threads where `n` is `min(num_phys_cpus, num_tasks)` and
    `m` is num_logical_cpus (although only
    `min(num_logical_cpus, num_trials)` will be active) which on the typical
    system there aren't enough cores to leverage both multiprocessing and
    multithreading. However, in case a user does have such an environment
    they can set the `QISKIT_FORCE_THREADS` env variable to `TRUE` which
    will use threading regardless of the status of `QISKIT_IN_PARALLEL`.
    mtreinish committed Feb 22, 2022
    Configuration menu
    Copy the full SHA
    3d1d561 View commit details
    Browse the repository at this point in the history
  2. Configuration menu
    Copy the full SHA
    25b2747 View commit details
    Browse the repository at this point in the history

Commits on Feb 23, 2022

  1. Apply suggestions from code review

    Co-authored-by: Jake Lishman <jake@binhbar.com>
    mtreinish and jakelishman authored Feb 23, 2022
    Configuration menu
    Copy the full SHA
    fc109d2 View commit details
    Browse the repository at this point in the history
  2. Minor fixes from review comments

    This commits fixes some minor details found during code review. It
    expands the section on building from source to explain how to build a
    release optimized binary with editable mode, makes the QISKIT_PARALLEL
    env variable usage consistent across all jobs, and adds a missing
    shebang to the `install_rush.sh` script which is used to install rust in
    the manylinux container environment.
    mtreinish committed Feb 23, 2022
    Configuration menu
    Copy the full SHA
    eaf92f2 View commit details
    Browse the repository at this point in the history
  3. Configuration menu
    Copy the full SHA
    b318a2a View commit details
    Browse the repository at this point in the history
  4. Simplify tox configuration

    In earlier commits the tox configuration was changed to try and fix the
    docs CI job by going to great effort to try and enforce that
    setuptools-rust was installed in all situations, even before it was
    actually needed. However, the problem with the docs ci job was unrelated
    to the tox configuration and this reverts the configuration to something
    that works with more versions of tox and setuptools-rust.
    mtreinish committed Feb 23, 2022
    Configuration menu
    Copy the full SHA
    4c4c9d4 View commit details
    Browse the repository at this point in the history
  5. Configuration menu
    Copy the full SHA
    73296a4 View commit details
    Browse the repository at this point in the history

Commits on Feb 28, 2022

  1. Configuration menu
    Copy the full SHA
    eb69288 View commit details
    Browse the repository at this point in the history