
Crow causes process hang after main() successfully exits if run_async() future not waited on and a Crow route calls notify_all() on a std::condition_variable, with 1.2, master #849

Open
twfry opened this issue Jul 11, 2024 · 2 comments
Labels
bug Something isn't working discussion The viability / implementation of the issue is up for debate

Comments


twfry commented Jul 11, 2024

First, thanks for continuing this project.

The following is a hard-to-identify bug that took a couple of days to track down, caused by a combination of usage patterns specific to how I instantiate and run Crow. I identified fixes; however, I think some minor improvements to run_async() would better harden Crow and prevent others from running into this issue. The Resolution section at the bottom contains the fixes if you want to skip the reproduction below, which I only added for completeness.

The issue seems to be: if notify_all() is called on a std::condition_variable within a sub-blueprint route (a blueprint registered within a blueprint) when Crow was launched using run_async(), then every few runs Crow leaves dangling threads after successfully stopping, which prevent the process from exiting after main() returns.

The following is a minimal reproduction of the issue, demonstrating the hang on Ubuntu 22.04 when compiled with gcc 11.4.0 using g++ -std=c++20 crow_notify_issue.cpp

After the main thread configures and launches Crow with run_async(), it blocks on a std::condition_variable waiting for a shutdown command. The shutdown command issues notify_all() to unblock the main thread, which then stops Crow and exits. There are two ways to issue the shutdown command: 1) sending a SIGINT signal, whose handler calls notify_all(), or 2) having a Crow route call notify_all() after receiving a request.

When SIGINT is sent and notify_all() is called from the signal handler, the program reliably exits 100% of the time (which is how I stopped the application for a while).

However, I recently added a Crow route that receives a request and calls notify_all(). Then, once every few runs, the process hung on exit: the main thread correctly receives the notice, stops Crow, cleans up, and exits successfully, but Crow leaves dangling threads that hang the program after main() exits.

#include <iostream>
#include <signal.h>
#include <string.h>
#include <atomic>
#include <condition_variable>
#include <future>
#include <memory>
#include <mutex>

#include <crow.h>

// Global shutdown variables
std::atomic<bool>           shutdown_requested {false};
std::mutex                  shutdown_mutex;
std::condition_variable     shutdown_cv;

// Crow variables
std::unique_ptr<crow::SimpleApp>        crow_app_ptr;
std::future<void>                       crow_async_app;
std::unique_ptr<crow::Blueprint>        bp_ptr;
std::unique_ptr<crow::Blueprint>        bp_sub_ptr;


// Blocks until shutdown is requested
void wait_for_shutdown() {
  std::unique_lock<std::mutex> lock(shutdown_mutex);
  if( shutdown_requested ) { return; }
  shutdown_cv.wait(lock, [] { return shutdown_requested.load(); });
}

// Notifies blocked thread shutdown was requested
void request_shutdown() {
  {
      std::lock_guard<std::mutex> lock(shutdown_mutex);
      shutdown_requested = true;
  }
  shutdown_cv.notify_all();
}

// Signal handler set to issue shutdown notice
void SIGINT_handler( int signal ) {
  std::cerr << "SIGINT_handler() - Sending Shutdown Notice" << std::endl;
  request_shutdown();
}

int main() {

  // Establish signal handler
  if( std::signal( SIGINT, SIGINT_handler ) == SIG_ERR ) {
    throw std::runtime_error( "Unable to register SIGINT handler" );
  }

  // Create crow objects (references needed for CROW_BP_ROUTE macros)
  crow_app_ptr  = std::make_unique<crow::SimpleApp>();
  bp_ptr     = std::make_unique<crow::Blueprint>( "v1", "cstat", "ctemplate" );
  bp_sub_ptr = std::make_unique<crow::Blueprint>( "server", "cstat", "ctemplate" );
  crow::SimpleApp &crow_app = *crow_app_ptr;
  crow::Blueprint &bp = *bp_ptr;
  crow::Blueprint &bp_sub = *bp_sub_ptr;

  // Route to request a shutdown
  CROW_BP_ROUTE( bp_sub, "/state/<string>" )
  ([]( crow::response& res, std::string request ) {
    crow::json::wvalue msg;
    if( request == "shutdown" ) {
      // ********ISSUE HERE WHEN CALLING notify_all() FROM CROW ROUTE**********
      std::cout << "Crow route issuing shutdown request" << std::endl;
      request_shutdown();
      msg["command"] = "success";
    } else {
      msg["command"] = "Error";
      res.code = crow::status::BAD_REQUEST;
    }
    res.write( msg.dump() );
    res.end();
  });

  // Register crow blueprints
  bp.register_blueprint( bp_sub );
  crow_app.register_blueprint( bp );

  crow_app.port( 9100 )
               .concurrency( 8 )
               .loglevel( crow::LogLevel::Info )
               .signal_clear();

  // Start the rest server
  crow_async_app = crow_app.run_async();
  crow_app.wait_for_server_start();

  std::cout << "Crow successfully started, waiting for shutdown signal" << std::endl;

  // Block main thread until shutdown notice
  wait_for_shutdown();

  std::cout << "Shutdown requested, stopping and deleting crow" << std::endl;

  // Stop crow
  crow_app.stop();

  // Delete crow objects
  crow_app_ptr.reset( nullptr );
  bp_ptr.reset( nullptr );
  bp_sub_ptr.reset( nullptr );

  std::cout << "Shutdown complete, exit main() with EXIT_SUCCESS" << std::endl;

  return EXIT_SUCCESS;
}

After compiling and running the above code, if SIGINT is sent by pressing Ctrl-C, the program reliably exits with the following console output:

$ ./a.out
(2024-07-11 17:38:51) [INFO    ] Crow/1.2.0 server is running at http://0.0.0.0:9100 using 8 threads
(2024-07-11 17:38:51) [INFO    ] Call `app.loglevel(crow::LogLevel::Warning)` to hide Info level logs.
Crow successfully started, waiting for shutdown signal
^C SIGINT_handler() - Sending Shutdown Notice
Shutdown requested, stopping and deleting crow
(2024-07-11 17:39:09) [INFO    ] Closing IO service 0x7fdd78000d50
(2024-07-11 17:39:09) [INFO    ] Closing IO service 0x7fdd78000d58
(2024-07-11 17:39:09) [INFO    ] Closing IO service 0x7fdd78000d60
(2024-07-11 17:39:09) [INFO    ] Closing IO service 0x7fdd78000d68
(2024-07-11 17:39:09) [INFO    ] Closing IO service 0x7fdd78000d70
(2024-07-11 17:39:09) [INFO    ] Closing IO service 0x7fdd78000d78
(2024-07-11 17:39:09) [INFO    ] Closing IO service 0x7fdd78000d80
(2024-07-11 17:39:09) [INFO    ] Closing main IO service (0x7fdd780018a8)
(2024-07-11 17:39:09) [INFO    ] Exiting.
Shutdown complete, exit main() with EXIT_SUCCESS

However, if you use curl localhost:9100/v1/server/state/shutdown to have the Crow route send the shutdown notice, then once every few runs the notice will be received and the main thread will stop Crow and exit successfully, yet several Crow threads will remain suspended, which prevents the program from exiting after main() returns. From there, sending SIGKILL is the only way to end the process. This can be seen in the following output, where the program hung after exiting and had to be killed.

$ ./a.out
(2024-07-11 17:39:18) [INFO    ] Crow/1.2.0 server is running at http://0.0.0.0:9100 using 8 threads
(2024-07-11 17:39:18) [INFO    ] Call `app.loglevel(crow::LogLevel::Warning)` to hide Info level logs.
Crow successfully started, waiting for shutdown signal
(2024-07-11 17:39:21) [INFO    ] Request: 127.0.0.1:60340 0x7fe308004080 HTTP/1.1 GET /v1/server/state/shutdown
Crow route issuing shutdown request
Shutdown requested, stopping and deleting crow
(2024-07-11 17:39:21) [INFO    ] Response: 0x7fe308004080 /v1/server/state/shutdown 200 0
(2024-07-11 17:39:21) [INFO    ] Closing IO service 0x7fe308000d50
(2024-07-11 17:39:21) [INFO    ] Closing IO service 0x7fe308000d58
(2024-07-11 17:39:21) [INFO    ] Closing IO service 0x7fe308000d60
(2024-07-11 17:39:21) [INFO    ] Closing IO service 0x7fe308000d68
(2024-07-11 17:39:21) [INFO    ] Closing IO service 0x7fe308000d70
(2024-07-11 17:39:21) [INFO    ] Closing IO service 0x7fe308000d78
(2024-07-11 17:39:21) [INFO    ] Closing IO service 0x7fe308000d80
(2024-07-11 17:39:21) [INFO    ] Closing main IO service (0x7fe308001cc8)
(2024-07-11 17:39:21) [INFO    ] Exiting.
Shutdown complete, exit main() with EXIT_SUCCESS
Segmentation fault (core dumped)     - (this appears after SIGKILL was sent, since the process hung here)

This was hard to track down because it seems to depend on the exact usage in the example above. For example, moving the Crow route from a sub-blueprint to a top-level blueprint, or to the Crow app itself, seemed to fix the issue, so at first I thought it was a blueprint bug.

Resolution

The root issue seems to be not calling crow_async_app.wait() on the run_async() future after Crow is stopped but before the Crow objects are deleted. If I add crow_async_app.wait(); after the stop() call, the issue resolves, and the dangling Crow threads that remained after stop() get cleaned up.

However, since the future does not return anything and the docs do not say wait() needs to be called, I did not originally think it necessary, and instead spent a while trying different ways of invoking Crow with varying degrees of success. Somehow there is a strange combination where, if you use a std::condition_variable within a Crow blueprint route and do not wait on the future, you run into undefined behavior when main() exits.

It might be more robust for Crow's run_async() not to use std::async, which returns a future the user has to manage and might mess up (as I did), and instead simply use std::thread and detach it. The thread would then block while Crow is running and destroy itself automatically after Crow stops, with nothing else to manage on the user side, i.e. something simple like this:

        void run_async()
        {
            std::thread t([&] {
                this->run();
            });
            t.detach();
        }

Again, this probably falls into user error, but as-is it is easy to miss and run into issues. For now I have switched from run_async() to creating and detaching a thread, as above, that calls Crow's run(). Thanks.

@gittiver gittiver added bug Something isn't working discussion The viability / implementation of the issue is up for debate labels Jul 12, 2024

gittiver commented Jul 19, 2024

void wait_for_shutdown() {
  std::unique_lock<std::mutex> lock(shutdown_mutex);
  while ( !shutdown_requested )
    shutdown_cv.wait(lock);
}

Could you check whether this helps with the issue? (I took the solution from another PR.)

Another question: are there other requests or WebSocket requests running in parallel, or perhaps started while shutting down?


twfry commented Jul 23, 2024

Thanks for the reply.

I tried the modification above, but it did not change the behavior; the process still reliably locks up after exit with several dangling Crow threads.

Other question here, are there other requests or web socket requests running in parallel or maybe are started while shutting down?

No, in this case only the shutdown route is called once with zero other activity.

After poking at this a bit more, the issue seems to be that after stop() returns, some number of Crow worker threads are still active, and some amount of time is required for them to clean up. If you delete the Crow instance before the worker threads have had a chance to clean up, and then exit, the application will reliably hang on exit. In this case I delete the Crow instance immediately after calling stop(), but I also saw the same behavior when the instances were deleted by C++ global object cleanup.

If I add 100 µs of delay after stop() and before the Crow objects are deleted, the worker threads have a chance to clean up and the process shuts down correctly. With only 50 µs or less of delay, the worker threads do not clean up and the process hangs. Waiting on the async future, as stated in the first post, also forces thread cleanup.

It seems there is a race condition: you need to wait some period of time after stop() before deleting the Crow object, or the worker threads will not have a chance to clean up properly. I am not sure whether I missed something in the docs saying you need to wait on some event before deletion, i.e. some sort of wait_for_stop_to_complete() call.

To prevent this, a mechanism to tell when Crow is fully stopped and ready for deletion would be helpful, or perhaps this could be handled in the destructor. Thanks again.
