9.5 KiB
Scheduler Integration Status
Branch: scheduler-integration
PR: #44810
✅ Completed
Per-App Arena Isolation (SIGSEGV Fix)
The scheduler integration exposed a latent bug where multiple test sessions shared a single thread-local element arena (ELEMENT_ARENA). When the scheduler correctly allowed cross-session task interleaving during block(), one session could clear the arena while another held references, causing SIGSEGV crashes.
Problem Scenario:
- App A starts drawing, allocates elements in the shared
ELEMENT_ARENA - App A's task yields during some async operation
- App B's task runs, draws, and clears
ELEMENT_ARENA - App A's task resumes and tries to use its now-invalid
ArenaBoxreferences → SIGSEGV
Solution: Each App now owns its own arena. A thread-local pointer (CURRENT_ELEMENT_ARENA) is set during Window::draw() to route element allocations to the correct arena.
Files Changed:
crates/gpui/src/app.rs- Added
element_arena: RefCell<Arena>field toAppstruct - Added
Arenaimport from crate - Initialize arena in
App::new_app()
- Added
crates/gpui/src/window.rs- Added
CURRENT_ELEMENT_ARENAthread-local (points to current app's arena during draw) - Added
with_element_arena()helper function for element allocation - Added
ElementArenaScopeRAII guard for setting/restoring the current arena - Updated
ArenaClearNeededto store a pointer to the arena that needs clearing - Modified
Window::draw()to set upElementArenaScopeand return arena-awareArenaClearNeeded
- Added
crates/gpui/src/element.rs- Changed
AnyElement::new()to usewith_element_arena()instead of directly accessingELEMENT_ARENA
- Changed
Scheduler Integration
The scheduler crate is fully integrated into GPUI:
TestSchedulerprovides deterministic async execution for testsPlatformSchedulerwraps platform dispatchers for production- Session-based foreground isolation prevents same-session reentrancy during blocking
- Cross-session task execution works correctly (different "machines" can make progress)
✅ Fixed: test_host_disconnect
Root Cause
The test used insufficient time (RECEIVE_TIMEOUT = 10s) after allow_connections() for client A to reconnect. The reconnection logic uses exponential backoff with jitter:
- 500ms → 1s → 2s → 4s → 8s → 16s → 30s (max)
- After 40s of failed attempts, the next retry could be 30-60s in the future
- 10s was not enough; other similar tests use
RECONNECT_TIMEOUT(30s)
Why It Passed on Main
The old scheduler processed tasks/timers in a slightly different order, which happened to allow reconnection within 10s by chance. The new scheduler's different (but equally valid) task ordering exposed the timing bug.
Fix Applied
Changed line 154 in crates/collab/src/tests/editor_tests.rs:
// Before:
cx_a.background_executor.advance_clock(RECEIVE_TIMEOUT);
// After:
cx_a.background_executor.advance_clock(RECONNECT_TIMEOUT);
Test now passes on all seeds tested (0-9).
✅ Fixed: test_following_to_channel_notes_without_a_shared_project
Root Cause
The test had a race condition where client B started following client A before client A's buffer edits were synced to the server.
The Race:
- Client A opens channel notes, inserts "Hello from A." (13 chars), sets selection to 3..4
- These operations are sent via
client.send()(fire-and-forget, non-blocking) - Before the server processes these operations, client B starts following
- Client B sends
JoinChannelBufferrequest - Server responds with buffer state that doesn't include A's latest edits
- Client B's ChannelView is created with empty/partial buffer
- The selection state (3..4) from the follow response can't be applied properly
- Client B ends up with cursor at end of buffer (13..13) instead of 3..4
Why It Passed on Main
The old scheduler's task ordering happened to process the buffer sync messages before client B's follow request. The new scheduler's different (but equally valid) ordering exposed this missing synchronization point.
Fix Applied
Added deterministic.run_until_parked() after client A's edits in crates/collab/src/tests/following_tests.rs:
channel_notes_1_a.update_in(cx_a, |notes, window, cx| {
assert_eq!(notes.channel(cx).unwrap().name, "channel-1");
notes.editor.update(cx, |editor, cx| {
editor.insert("Hello from A.", window, cx);
editor.change_selections(SelectionEffects::no_scroll(), window, cx, |selections| {
selections.select_ranges(vec![MultiBufferOffset(3)..MultiBufferOffset(4)]);
});
});
});
// Ensure client A's edits are synced to the server before client B starts following.
deterministic.run_until_parked();
// Client B follows client A.
workspace_b
.update_in(cx_b, |workspace, window, cx| {
workspace.start_following(client_a.peer_id().unwrap(), window, cx).unwrap()
})
.await
.unwrap();
Test now passes on all seeds tested (0-9).
🔄 Known Issue: test_joining_channel_ancestor_member (Pre-existing)
Symptom
This test fails at seed 3 when running all collab tests together, but passes when run in isolation.
Error
panicked at crates/livekit_client/src/test.rs:710:45:
called `Result::unwrap()` on an `Err` value: no server found for url
Analysis
This is a test isolation issue in the LiveKit test infrastructure, not related to our changes:
Room::test_server()tries to find a registered test server for a URL- When running all tests, some previous test's cleanup or the scheduler's task ordering causes the server to not be found
- The test passes in isolation because there's no interference from other tests
Recommendation
This is a pre-existing issue exposed by the new scheduler's task ordering. It should be investigated separately as a test infrastructure problem in the LiveKit mock client.
Test Results Summary
# All these pass:
SEED=0..9 cargo test -p collab test_host_disconnect
SEED=0..9 cargo test -p collab test_following_to_channel_notes_without_a_shared_project
cargo test -p gpui # 73 passed
# Full collab test suite:
SEED=0 cargo test -p collab # 155 passed
SEED=1 cargo test -p collab # 155 passed
SEED=2 cargo test -p collab # 155 passed
SEED=3 cargo test -p collab # 154 passed, 1 failed (test_joining_channel_ancestor_member - pre-existing issue)
# Clippy passes
./script/clippy
Quick Reference
# Run specific fixed tests
SEED=0 cargo test -p collab test_host_disconnect -- --nocapture
SEED=0 cargo test -p collab test_following_to_channel_notes_without_a_shared_project -- --nocapture
# Run GPUI tests
cargo test -p gpui
# Run full collab suite
SEED=0 cargo test -p collab
# Run clippy
./script/clippy
# Debug scheduler (conditional logging)
DEBUG_SCHEDULER=1 SEED=0 cargo test -p collab <test_name> -- --nocapture
Architecture
Per-App Arena Design
Each App now has:
├── element_arena: RefCell<Arena> (isolated per App)
├── ForegroundExecutor (with unique SessionId)
└── Windows (allocate from App's arena during draw)
Thread-locals:
├── ELEMENT_ARENA: RefCell<Arena> (fallback, used when no app arena is active)
└── CURRENT_ELEMENT_ARENA: Cell<Option<*const RefCell<Arena>>> (points to active app's arena)
During Window::draw():
1. ElementArenaScope::enter() sets CURRENT_ELEMENT_ARENA to this App's arena
2. Element allocation via with_element_arena() uses CURRENT_ELEMENT_ARENA
3. ElementArenaScope::drop() restores previous arena (handles nesting)
4. ArenaClearNeeded::clear() clears this App's arena (not the global one)
Key Scheduler Differences (OLD vs NEW)
OLD TestDispatcher:
delayed: Vec<(Duration, RunnableVariant)>- runnables stored directlytick()moves expired runnables to background queue at start- Flat random selection among all foreground + background tasks
- Uses
HashMap<TestDispatcherId, VecDeque>for foreground tasks
NEW TestScheduler:
timers: Vec<Timer>where Timer hasoneshot::Sender- dropping wakes futuresstep()expires timers first, then selects task- Priority-weighted random selection among candidates
- Session-based foreground isolation (first task per session is candidate)
- Blocked sessions are excluded from candidate selection
Implications: The new scheduler is more correct but may execute tasks in different (valid) orders. This exposes latent race conditions and timing assumptions in tests that happened to pass by accident with the old scheduler.
Files Modified in This Session
-
crates/gpui/src/app.rs- Added
Arenaimport - Added
element_arena: RefCell<Arena>field toAppstruct - Initialize arena in constructor
- Added
-
crates/gpui/src/window.rs- Added
CURRENT_ELEMENT_ARENAthread-local - Added
with_element_arena()function - Added
ElementArenaScopestruct withenter()andDropimpl - Modified
ArenaClearNeededto track arena pointer - Modified
Window::draw()to useElementArenaScope
- Added
-
crates/gpui/src/element.rs- Changed
AnyElement::new()to usewith_element_arena()
- Changed
-
crates/collab/src/tests/editor_tests.rs- Changed
RECEIVE_TIMEOUTtoRECONNECT_TIMEOUTon line 154
- Changed
-
crates/collab/src/tests/following_tests.rs- Added
deterministic.run_until_parked()after client A's edits
- Added
Files with Debug Logging (to be removed before merge)
crates/scheduler/src/test_scheduler.rs- Has conditionalDEBUG_SCHEDULERlogging inadvance_clockandstep_filtered