models during enumeration omitted #80

Closed
rkaminsk opened this issue Jun 22, 2022 · 6 comments

Comments

@rkaminsk
Member

rkaminsk commented Jun 22, 2022

The following program, run on queens.aspif.gz, sometimes omits models. Unfortunately, it's a bit hard to reproduce. It happens quite frequently in my Ubuntu 20.04 virtual machine with 2 cores. I managed to trigger the problem with between 3 and 64 threads; the higher the number of threads, the higher the chance that solutions are omitted.

I'll try to dig a bit deeper. Just wanted to put the example somewhere safe.

#include <clasp/clasp_facade.h>
#include <clasp/solver.h>

#include <cassert>
#include <iostream>
#include <fstream>

int main() {
    while (true) {
        Clasp::ClaspConfig config;
        config.solve.numModels = 0; // enumerate all models
        config.solve.setSolvers(64);
        Clasp::ClaspFacade libclasp;

        std::ifstream input("queens.aspif");
        libclasp.start(config, input);

        // expected models per incremental step
        // (these match the n-queens solution counts for n = 1..10)
        size_t num_models[] = { 1, 0, 0, 2, 10, 4, 40, 92, 352, 724 };
        for (size_t i = 0; libclasp.read(); ++i) {
            libclasp.prepare();
            libclasp.solve();
            size_t n = libclasp.summary().numEnum;
            if (num_models[i] != n) {
                std::cerr
                    << "error: expected " << num_models[i]
                    << " models in call " << (i + 1)
                    << " but got " << n << std::endl;
            }
            assert(num_models[i] == n);
        }
    }
}
@rkaminsk
Member Author

The problem seems to be related to splitting. For example, competition + recording works, while splitting with either backtracking or recording fails randomly. This happens both in configurations that use splitting right from the start and in ones that switch from competition to splitting for enumeration. One gets relatively short failing runs using:

#include <clasp/clasp_facade.h>
#include <clasp/solver.h>

#include <cassert>
#include <iostream>
#include <fstream>

int main() {
    while (true) {
        Clasp::ClaspConfig config;
        config.solve.enumMode = Clasp::EnumOptions::enum_record; // enumeration via recording instead of backtracking
        config.solve.numModels = 0;                              // enumerate all models
        config.solve.setSolvers(4);
        config.solve.algorithm.mode = Clasp::mt::ParallelSolveOptions::Algorithm::mode_split; // splitting-based search
        Clasp::ClaspFacade libclasp;

        std::ifstream input("queens.aspif");
        libclasp.start(config, input);

        // expected models per incremental step (n-queens solution counts);
        // only the first six steps are checked here
        size_t num_models[] = { 1, 0, 0, 2, 10, 4, 40, 92, 352, 724 };
        for (size_t i = 0; libclasp.read() && i < 6; ++i) {
            libclasp.prepare();
            for (size_t j = 0; j < 1; ++j) { // bound left at 1; increase to repeat each step
                libclasp.solve();
                size_t n = libclasp.summary().numEnum;
                if (num_models[i] != n) {
                    std::cerr
                        << "error: expected " << num_models[i]
                        << " models in call " << (i + 1)
                        << " but got " << n << std::endl;
                }
                assert(num_models[i] == n);
            }
        }
    }
}

@rkaminsk
Member Author

I added a few tracing commands, and it seems like a terminate event is sent even though there are still solvers that have not exhausted their search:

  thread[1]: basic solve!!!
  thread[1]: splitting work from solver!
  thread[1]: split off/continuing to work on:  123 87 ~5 ~38 ~18 ~4 6 39/~39
  thread[0]: got work of size:  123 87 ~5 ~38 ~18 ~4 6 39
  thread[1]: finished with result false
  thread[3]: no work left
  thread[1]: no work left
  thread[0]: got a terminate event
  thread[4]: stopping
  thread[2]: splitting work from solver!
  thread[2]: split off/continuing to work on:  123 87 ~5 ~38 ~18 ~58 ~4 ~6 ~2 ~17 20/~20
  thread[2]: got a terminate event
  thread[2]: finished with result false

Let's see if I can find the reason. 😄

@BenKaufmann
Contributor

@rkaminsk Thanks for looking into this. Unfortunately, I don't have time this and next week to investigate further :(
But I'll look into this asap.

@rkaminsk
Member Author

@rkaminsk Thanks for looking into this. Unfortunately, I don't have time this and next week to investigate further :( But I'll look into this asap.

No worries. I'll keep looking a bit and post here if I find something.

@rkaminsk
Member Author

rkaminsk commented Jul 20, 2022

I dug a little further, and it seems that in the ParallelSolve::SharedData::requestWork function the counter of the BarrierSemaphore gets decreased by calling down more than once for the same solver within a single request. This results in a terminate event even though a solver is still working.

Question: Could spurious wakeups in BarrierSemaphore::unsafe_wait be a problem?
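
For context, here is a minimal self-contained sketch of the general hazard (my own illustration, not code from clasp): if a counting construct treats every return from wait() as a real notification, a spurious wakeup silently corrupts the count, whereas a predicate-based wait re-checks the condition and simply goes back to sleep.

#include <condition_variable>
#include <mutex>

class counting_gate {
public:
    void up() {
        std::lock_guard<std::mutex> guard(m_);
        ++count_;
        cv_.notify_one();
    }
    void down_buggy() {
        std::unique_lock<std::mutex> lock(m_);
        if (count_ == 0)
            cv_.wait(lock);  // may return spuriously while count_ is still 0
        --count_;            // can drive count_ below its true value
    }
    void down_fixed() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return count_ > 0; }); // condition re-checked on every wakeup
        --count_;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    int count_ = 0;
};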

It seems like the problem is indeed caused by spurious wakeups. With the patched condition variable below, all models are enumerated. I opened #82 to handle this problem directly in the BarrierSemaphore class.

#include <chrono>
#include <condition_variable>
#include <mutex>

class condition_variable {
public:
    void wait(std::unique_lock<std::mutex>& lock) {
        {
            std::lock_guard<std::mutex> guard_{mut_};
            wait_ += 1;
        }
        // Only return once a recorded notification is consumed; a spurious
        // wakeup finds notify_ == 0 and puts the thread back to sleep.
        var_.wait(lock, [this]() {
            std::lock_guard<std::mutex> guard_{mut_};
            if (notify_ > 0) {
                --notify_;
                --wait_;
                return true;
            }
            return false;
        });
    }
    bool wait_for(std::unique_lock<std::mutex>& lock, double timeInSecs) {
        // TODO: spurious wakeups should be suppressed here, too.
        return var_.wait_for(lock, std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::duration<double>(timeInSecs)))
            == std::cv_status::no_timeout;
    }
    void notify_one() noexcept {
        std::lock_guard<std::mutex> guard_{mut_};
        if (notify_ < wait_) {
            notify_ += 1;
        }
        var_.notify_one();
    }
    void notify_all() noexcept {
        std::lock_guard<std::mutex> guard_{mut_};
        notify_ = wait_;
        var_.notify_all();
    }
private:
    std::mutex mut_;
    std::condition_variable var_;
    size_t notify_ = 0;
    size_t wait_ = 0;
};

BenKaufmann pushed a commit that referenced this issue Jul 24, 2022
* BarrierSemaphore failed to handle spurious wakeups correctly. In
  particular, a spurious wakeup in BarrierSemaphore::down() could result
  in an early termination from requestWork() since one thread could
  decrease the semaphore's counter multiple times thereby activating
  the barrier condition although not all threads are blocked.

* Drop unnecessary BarrierSemaphore in favor of two specific wait ops on
  a shared condition variable and a wait counter. On sync requests,
  waitSync() is used to wait until the sync flag is no longer set. When
  a thread needs work, it calls waitWork() waiting until the work queue
  becomes non-empty. In either case, the wait fails if the caller is the
  last non-waiting thread. Since all waits are now coupled to a specific
  condition, spurious wakeups are benign.
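
A rough sketch of what such a waitWork-style operation could look like (type names, members, and the Path placeholder are illustrative only and do not mirror the actual clasp code; queue handling is reduced to the bare minimum):

#include <condition_variable>
#include <deque>
#include <mutex>
#include <vector>

using Path = std::vector<int>; // placeholder for a split-off guiding path

struct SharedWork {
    std::mutex mutex;
    std::condition_variable cond;
    std::deque<Path> queue;    // split-off work waiting to be picked up
    unsigned waiting = 0;      // threads currently blocked in waitWork()
    unsigned threads;          // total number of solver threads
    bool done = false;         // set once the search space is exhausted

    explicit SharedWork(unsigned numThreads) : threads(numThreads) {}

    // Blocks until work is available or the search is finished. Returns false
    // if the caller was the last non-waiting thread and no work is left.
    // Spurious wakeups are benign because the wait is coupled to a condition.
    bool waitWork() {
        std::unique_lock<std::mutex> lock(mutex);
        ++waiting;
        if (queue.empty() && waiting == threads) {
            done = true;       // last active thread found no work: exhausted
            cond.notify_all(); // release all other waiters
        }
        cond.wait(lock, [this] { return !queue.empty() || done; });
        --waiting;
        return !done;
    }

    // Publishes a split-off path and wakes one waiting thread.
    void pushWork(Path path) {
        std::lock_guard<std::mutex> guard(mutex);
        queue.push_back(std::move(path));
        cond.notify_one();
    }
};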
@BenKaufmann
Contributor

BenKaufmann commented Jul 24, 2022

Many thanks for your help. Hopefully fixed in https://github.com/potassco/clasp/tree/dev
