- every entry stood in for a library tsan could not see into; with libzmq,
libsodium and libstdc++ now tsan-instrumented in the tsan CI job, the
happens-before edges they establish are visible and nothing is left to
suppress
- suppressions were blunt (a race: entry matches any frame in the stack),
so they could also mask real races passing through those frames
- mirror spack-latest.yaml, with -fsanitize=thread on the libzmq and
libsodium nodes so tsan can observe the happens-before edges established
inside libzmq's lock-free queues, plus the libstdcxx-tsan root spec
- flags are applied per node instead of via the propagating '==' operator,
which could reach the gcc node and trigger a compiler rebuild
- unchanged roots (fairlogger, boost, ninja, cmake) keep their spec hashes,
so they are shared with the regular buildcache entries; the instrumented
nodes hash differently and coexist in the content-addressed cache
- exclude libstdcxx-tsan from concretizer reuse so recipe changes always
take effect; unchanged recipes still hit the buildcache because the spec
hash is identical
- add the tsan env to the buildcache matrix (rebuilding also on spack_repo
changes) so the instrumented binaries are cached instead of rebuilt on
every CI run
- gcc ships no supported switch to build libstdc++ with -fsanitize=thread,
and spack's gcc recipe filters all flags out of the target-library build
(CXXFLAGS_FOR_TARGET is owned by its generated --with-build-config=spack
makefile), so provide a dedicated libstdcxx-tsan package in a custom repo
- build only the libstdc++-v3 subtree from the matching gcc release tarball,
configured standalone against the already-installed toolchain (recipe
modeled on https://iree.dev/developers/debugging/sanitizers/), instead of
rebuilding all of gcc
- the result is a drop-in runtime replacement for the compiler's libstdc++
(same soname and symbol versions), to be loaded only by the instrumented
test executables
- normalize the install layout after make install: the standalone build puts
the runtime libraries into the multilib os dir (lib64 on x86_64) regardless
of --libdir, and --with-toolexeclibdir only applies to cross builds
- register the repo in the setup-deps action before creating the env
- introduce FAIRMQ_TEST_LD_LIBRARY_PATH, which prepends a directory to
each test's environment via ctest, so the tests can run against an
alternative runtime library (e.g. a tsan-instrumented libstdc++)
- LD_LIBRARY_PATH rather than an injected rpath: an rpath added via the
linker flags cannot precede the rpath spack's gcc adds through its
specs file, so the compiler's own libstdc++ would keep winning the
runtime search order
- scoped per test on purpose: an instrumented library has unresolved
__tsan_* symbols and must not be loaded into uninstrumented tools
like cmake, ctest or ninja
- fail the configuration instead of silently dropping the injection on
CMake < 3.22 (ENVIRONMENT_MODIFICATION)
- cover the example tests too; they share the instrumented runtime but
not the locale-cache warmup (their main() is the installed public
header). The custom-controller env block was dead before: it tested
lsan_options, which only ever existed in the add_example() function
scope, so the test also never received the LSan suppressions
- std::ctype<char> caches narrow()/widen() results per character in
plain char arrays of the global classic-locale facet, written without
synchronization from header-inlined code (locale_facets.h); two
threads exercising an uncached character concurrently (e.g. compiling
a std::regex in Channel::Validate) constitute a true data race that
ThreadSanitizer rightfully reports
- the stores are real and unsynchronized, so a tsan-instrumented
libstdc++ cannot help here; instead fill the caches before any thread
is spawned, which turns every later access into a pure read
- warm the lazily-installed num_put/num_get caches used by stream
insertion/extraction as well, via a small format/parse round-trip
- wire the warm-up into the gtest runner main() and, via a static
initializer, into the test device runner
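A minimal sketch of such a warm-up, using only standard library calls. The
function name WarmUpLocaleCaches is hypothetical, and the remark about which
calls fill which cache describes libstdc++ behaviour as an assumption, not a
guarantee of the standard:

```cpp
#include <limits>
#include <locale>
#include <sstream>

// Hypothetical helper, meant to be called once before any thread is spawned.
// Returns the value parsed back from the stream so the round-trip can be checked.
int WarmUpLocaleCaches()
{
    auto const& ct = std::use_facet<std::ctype<char>>(std::locale::classic());

    // Touch every char value once so that later narrow()/widen() calls are
    // expected to find their result already cached (assumed libstdc++ behaviour).
    // The default '\0' is chosen because a result equal to the default is
    // assumed not to be cached.
    for (int c = std::numeric_limits<char>::min();
         c <= std::numeric_limits<char>::max(); ++c) {
        char const ch = static_cast<char>(c);
        (void)ct.widen(ch);
        (void)ct.narrow(ch, '\0');
    }

    // Format and parse one number to exercise num_put and num_get.
    std::stringstream ss;
    ss.imbue(std::locale::classic());
    ss << 42;
    int parsed = 0;
    ss >> parsed;
    return parsed;
}
```

Calling this at the top of main(), or from a static initializer, runs it while
the process is still single-threaded, which is the property the fix relies on.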
- the subscriber threads captured the loop counter by reference while
the spawning loop kept incrementing it: a genuine data race
- depending on timing, threads could also end up with duplicate
subscriber names; capture the counter by value instead
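The fix can be illustrated with a small standalone sketch. The helper
SpawnNamedSubscribers and the "subscriber_" name prefix are hypothetical and
not taken from the test itself:

```cpp
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Hypothetical helper: spawns `count` threads, each recording its own name.
std::vector<std::string> SpawnNamedSubscribers(std::size_t count)
{
    std::vector<std::string> names(count);
    std::vector<std::thread> threads;
    threads.reserve(count);

    for (std::size_t i = 0; i < count; ++i) {
        // [i] copies the counter at the moment the lambda is created.
        // With [&i] the thread would read i while the loop keeps incrementing
        // it, which is a data race and can yield duplicate names.
        threads.emplace_back([i, &names] {
            // Each thread writes only its own element, so `names` is safe to share.
            names[i] = "subscriber_" + std::to_string(i);
        });
    }

    for (auto& t : threads) {
        t.join();
    }
    return names;
}
```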
Add two public entry points needed by the ALICE use case, where shmem
messages are allocated via a transport but never sent. Their metadata
is instead serialised into Arrow tables and delivered over a separate
channel, allowing consumer devices to resolve the payload pointer
without taking ownership.
shmem::Message::GetMeta() returns the MetaHeader of the message,
mirroring the existing positional-init pattern already used in Socket.h.
shmem::GetDataAddressFromHandle(TransportFactory&, const MetaHeader&)
is a free function declared in Common.h and defined in Manager.cxx.
Keeping it out of the TransportFactory class body means callers only
need to include Common.h (available transitively via Message.h) and do
not drag in Socket.h or zmq.h. The implementation handles both managed
segments and unmanaged regions, and throws SharedMemoryError with a
typed message on a bad segment or region id. TransportFactory also
gains a same-named member for callers that already have the concrete
type. Lifetime of the returned pointer is the caller's responsibility;
the cache device is expected to hold the messages alive.
A SideChannel test covers the GetMeta/GetDataAddressFromHandle
round-trip for both standard and expanded-metadata configurations.
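The resolution step can be pictured with a simplified model. Every name below
(MetaModel, MappingsModel, ResolveAddress) is a hypothetical stand-in, and the
field layout is an assumption; none of it is the actual FairMQ MetaHeader or
Manager code:

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the metadata that travels over the side channel.
struct MetaModel {
    bool managed;          // true: payload lives in a managed segment
    std::uint16_t id;      // segment id or region id, depending on `managed`
    std::ptrdiff_t handle; // assumed: offset of the payload from the base address
};

// Hypothetical stand-in for the mappings known to the consumer process.
struct MappingsModel {
    std::map<std::uint16_t, char*> segments; // id -> base address
    std::map<std::uint16_t, char*> regions;  // id -> base address
};

// Resolves the payload address without taking ownership of the message.
char* ResolveAddress(MappingsModel const& m, MetaModel const& meta)
{
    auto const& table = meta.managed ? m.segments : m.regions;
    auto const it = table.find(meta.id);
    if (it == table.end()) {
        // The real code is described as throwing SharedMemoryError here.
        throw std::runtime_error(
            std::string(meta.managed ? "unknown segment id " : "unknown region id ")
            + std::to_string(meta.id));
    }
    return it->second + meta.handle;
}
```

As in the description above, the returned pointer is only valid while something
else keeps the underlying message alive.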
- libzmq is not tsan-instrumented, so tsan cannot see the happens-before
edges its queues establish between user threads and libzmq I/O threads,
producing false-positive data races on message buffers
- add test/thread_sanitizer_suppressions.txt and point TSAN_OPTIONS at it
via the sanitizers job env so it reaches the tests and their device
subprocesses
- suppress: accesses made directly from libzmq, the zero-copy message
deleters libzmq runs from msg_t::close, shmem receive-side metadata
reads, and std::regex/locale lazy-init races in libstdc++
- LSan symbolizes the leak as the C++ method `zmq::msg_t::init_size`
in the Debug sanitizer build, no longer the C wrapper
`zmq_msg_init_size`
- the substring match failed (`_` vs `::`), so the suppression no longer
applied and the asan+lsan+ubsan job failed in Pair/PubSub/Poller tests
- add the demangled frame, keep the old pattern for older libzmq
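Why the old pattern stopped matching can be shown with a simplified model of
the lookup. IsSuppressed is a hypothetical helper that only does plain
substring matching; the sanitizer runtimes also support wildcards, which this
sketch leaves out:

```cpp
#include <string>
#include <vector>

// Simplified model: a frame is suppressed if any pattern occurs in it as a
// substring.
bool IsSuppressed(std::string const& frame, std::vector<std::string> const& patterns)
{
    for (auto const& p : patterns) {
        if (frame.find(p) != std::string::npos) {
            return true;
        }
    }
    return false;
}
```

With only `zmq_msg_init_size` in the list, the frame `zmq::msg_t::init_size`
is not suppressed; with both patterns present, either symbolization matches.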
Committed lockfiles pinned gcc as a host-path external (from spack compiler
find), which is not portable across runners and broke CI. Cache the gcc
compiler itself as a buildcache node instead, so CI pulls it (~1 min) rather
than building it from source (~1 h).
- push the freshly-built gcc node in setup-deps BEFORE spack compiler find
(which marks it external and excludes it from buildcache push), gated behind
a push-gcc input used only by the buildcache workflow
- drop the committed-lockfile approach: remove test/ci/locks, the lockfile
install path in setup-deps, and the lockfile export in the buildcache workflow
- drop the ignored ref input from setup-spack (v3 renamed it to spack_ref)
Reusing concretization between the weekly buildcache (fresh) and weekday CI
(reuse) can drift if runner externals change, causing avoidable cache misses.
- setup-deps installs from test/ci/locks/<env>-gcc<N>.lock when it exists,
skipping concretization for byte-identical hashes; falls back to the spec
yaml otherwise
- buildcache exports each env's spack.lock as a downloadable artifact so the
lockfiles can be regenerated on the ubuntu-24.04 runner and committed
- document the manual regeneration flow in test/ci/locks/README.md
Boost 1.88 replaced Boost.Process with v2, breaking the v1 API.
Boost 1.89 restores v1 compatibility via <boost/process/v1.hpp>.
- Fail configuration if Boost 1.88 is detected
- Define FAIRMQ_BOOST_PROCESS_V1_HEADER for Boost >= 1.89
- Use conditional includes to select v1.hpp or process.hpp
- Add namespace aliases (bp, bp_this) for portable API access
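The version decision can be modelled as a small constexpr function. The enum
and function names are hypothetical, and the sketch only mirrors the rule
stated above rather than the project's build code; the one external fact used
is that BOOST_VERSION encodes major * 100000 + minor * 100 + patch:

```cpp
// Hypothetical result type for the header choice.
enum class ProcessHeader { Process, V1, Unsupported };

// Maps a BOOST_VERSION-style integer (e.g. 108900 for 1.89.0) to the header
// that should be included, following the rule described in the list above.
constexpr ProcessHeader SelectProcessHeader(int boostVersion)
{
    int const major = boostVersion / 100000;
    int const minor = (boostVersion / 100) % 1000;
    if (major == 1 && minor == 88) {
        return ProcessHeader::Unsupported; // configuration is expected to fail
    }
    if (major > 1 || minor >= 89) {
        return ProcessHeader::V1; // <boost/process/v1.hpp>
    }
    return ProcessHeader::Process; // <boost/process.hpp>
}

static_assert(SelectProcessHeader(108700) == ProcessHeader::Process, "1.87 uses process.hpp");
static_assert(SelectProcessHeader(108800) == ProcessHeader::Unsupported, "1.88 is rejected");
static_assert(SelectProcessHeader(108900) == ProcessHeader::V1, "1.89 uses v1.hpp");
```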
- rename mirror to ghcr-buildcache
- find system compiler before building gcc
- separate update-index job to avoid race condition
- always attempt push even on partial failure
BREAKING CHANGE
Due to a lack of users, we remove the experimental code. The
latest implementation can be found in release v1.4.56. This does
not mean it will never be picked up again, but for now there are
no plans.
* Optimize appending another Parts container
* Remove redundant/verbose comments
* Change r-value args to move-only types into l-value args for
readability
* Deprecate `AtRef(int)`, redundant, just dereference at call site
* Deprecate `AddPart(Message*)`, avoid owning raw pointer args
* Add various const overloads
* Add `Empty()` and `Clear()` member functions
* Add `noexcept` where applicable
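A rough illustration of a few of these points on a toy container. PartsModel
and MessageModel are hypothetical stand-ins, not the FairMQ classes, and the
append strategy shown (reserve once, then move the elements) is one possible
optimization, not necessarily the one that was implemented:

```cpp
#include <cstddef>
#include <iterator>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a message.
struct MessageModel { int payload; };

// Hypothetical stand-in for a container of message parts.
class PartsModel {
  public:
    // Takes ownership through a smart pointer instead of a raw pointer.
    void AddPart(std::unique_ptr<MessageModel> msg) { fParts.push_back(std::move(msg)); }

    // Appends all parts of another container: one reservation, then moves.
    void AddParts(PartsModel other)
    {
        fParts.reserve(fParts.size() + other.fParts.size());
        fParts.insert(fParts.end(),
                      std::make_move_iterator(other.fParts.begin()),
                      std::make_move_iterator(other.fParts.end()));
    }

    MessageModel& At(std::size_t i) { return *fParts.at(i); }
    MessageModel const& At(std::size_t i) const { return *fParts.at(i); }

    std::size_t Size() const noexcept { return fParts.size(); }
    bool Empty() const noexcept { return fParts.empty(); }
    void Clear() noexcept { fParts.clear(); }

  private:
    std::vector<std::unique_ptr<MessageModel>> fParts;
};
```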
The logic of the GetNumberOfConnectedPeers test case relies on sleeping
for a fixed time. We have observed that the 10ms sleep is sometimes too
short. Increasing it to 100ms should improve test stability.