Files
zed/STATUS.md
2025-12-14 14:27:25 -07:00

9.5 KiB

Scheduler Integration Status

Branch: scheduler-integration

PR: #44810


Completed

Per-App Arena Isolation (SIGSEGV Fix)

The scheduler integration exposed a latent bug where multiple test sessions shared a single thread-local element arena (ELEMENT_ARENA). When the scheduler correctly allowed cross-session task interleaving during block(), one session could clear the arena while another held references, causing SIGSEGV crashes.

Problem Scenario:

  1. App A starts drawing, allocates elements in the shared ELEMENT_ARENA
  2. App A's task yields during some async operation
  3. App B's task runs, draws, and clears ELEMENT_ARENA
  4. App A's task resumes and tries to use its now-invalid ArenaBox references → SIGSEGV

Solution: Each App now owns its own arena. A thread-local pointer (CURRENT_ELEMENT_ARENA) is set during Window::draw() to route element allocations to the correct arena.

Files Changed:

  • crates/gpui/src/app.rs
    • Added element_arena: RefCell<Arena> field to App struct
    • Added Arena import from crate
    • Initialize arena in App::new_app()
  • crates/gpui/src/window.rs
    • Added CURRENT_ELEMENT_ARENA thread-local (points to current app's arena during draw)
    • Added with_element_arena() helper function for element allocation
    • Added ElementArenaScope RAII guard for setting/restoring the current arena
    • Updated ArenaClearNeeded to store a pointer to the arena that needs clearing
    • Modified Window::draw() to set up ElementArenaScope and return arena-aware ArenaClearNeeded
  • crates/gpui/src/element.rs
    • Changed AnyElement::new() to use with_element_arena() instead of directly accessing ELEMENT_ARENA

Scheduler Integration

The scheduler crate is fully integrated into GPUI:

  • TestScheduler provides deterministic async execution for tests
  • PlatformScheduler wraps platform dispatchers for production
  • Session-based foreground isolation prevents same-session reentrancy during blocking
  • Cross-session task execution works correctly (different "machines" can make progress)

Fixed: test_host_disconnect

Root Cause

The test used insufficient time (RECEIVE_TIMEOUT = 10s) after allow_connections() for client A to reconnect. The reconnection logic uses exponential backoff with jitter:

  • 500ms → 1s → 2s → 4s → 8s → 16s → 30s (max)
  • After 40s of failed attempts, the next retry could be 30-60s in the future
  • 10s was not enough; other similar tests use RECONNECT_TIMEOUT (30s)

Why It Passed on Main

The old scheduler processed tasks/timers in a slightly different order, which happened to allow reconnection within 10s by chance. The new scheduler's different (but equally valid) task ordering exposed the timing bug.

Fix Applied

Changed line 154 in crates/collab/src/tests/editor_tests.rs:

// Before:
cx_a.background_executor.advance_clock(RECEIVE_TIMEOUT);
// After:
cx_a.background_executor.advance_clock(RECONNECT_TIMEOUT);

Test now passes on all seeds tested (0-9).


Fixed: test_following_to_channel_notes_without_a_shared_project

Root Cause

The test had a race condition where client B started following client A before client A's buffer edits were synced to the server.

The Race:

  1. Client A opens channel notes, inserts "Hello from A." (13 chars), sets selection to 3..4
  2. These operations are sent via client.send() (fire-and-forget, non-blocking)
  3. Before the server processes these operations, client B starts following
  4. Client B sends JoinChannelBuffer request
  5. Server responds with buffer state that doesn't include A's latest edits
  6. Client B's ChannelView is created with empty/partial buffer
  7. The selection state (3..4) from the follow response can't be applied properly
  8. Client B ends up with cursor at end of buffer (13..13) instead of 3..4

Why It Passed on Main

The old scheduler's task ordering happened to process the buffer sync messages before client B's follow request. The new scheduler's different (but equally valid) ordering exposed this missing synchronization point.

Fix Applied

Added deterministic.run_until_parked() after client A's edits in crates/collab/src/tests/following_tests.rs:

channel_notes_1_a.update_in(cx_a, |notes, window, cx| {
    assert_eq!(notes.channel(cx).unwrap().name, "channel-1");
    notes.editor.update(cx, |editor, cx| {
        editor.insert("Hello from A.", window, cx);
        editor.change_selections(SelectionEffects::no_scroll(), window, cx, |selections| {
            selections.select_ranges(vec![MultiBufferOffset(3)..MultiBufferOffset(4)]);
        });
    });
});

// Ensure client A's edits are synced to the server before client B starts following.
deterministic.run_until_parked();

// Client B follows client A.
workspace_b
    .update_in(cx_b, |workspace, window, cx| {
        workspace.start_following(client_a.peer_id().unwrap(), window, cx).unwrap()
    })
    .await
    .unwrap();

Test now passes on all seeds tested (0-9).


🔄 Known Issue: test_joining_channel_ancestor_member (Pre-existing)

Symptom

This test fails at seed 3 when running all collab tests together, but passes when run in isolation.

Error

panicked at crates/livekit_client/src/test.rs:710:45:
called `Result::unwrap()` on an `Err` value: no server found for url

Analysis

This is a test isolation issue in the LiveKit test infrastructure, not related to our changes:

  • Room::test_server() tries to find a registered test server for a URL
  • When running all tests, some previous test's cleanup or the scheduler's task ordering causes the server to not be found
  • The test passes in isolation because there's no interference from other tests

Recommendation

This is a pre-existing issue exposed by the new scheduler's task ordering. It should be investigated separately as a test infrastructure problem in the LiveKit mock client.


Test Results Summary

# All these pass:
SEED=0..9 cargo test -p collab test_host_disconnect
SEED=0..9 cargo test -p collab test_following_to_channel_notes_without_a_shared_project
cargo test -p gpui  # 73 passed

# Full collab test suite:
SEED=0 cargo test -p collab  # 155 passed
SEED=1 cargo test -p collab  # 155 passed
SEED=2 cargo test -p collab  # 155 passed
SEED=3 cargo test -p collab  # 154 passed, 1 failed (test_joining_channel_ancestor_member - pre-existing issue)

# Clippy passes
./script/clippy

Quick Reference

# Run specific fixed tests
SEED=0 cargo test -p collab test_host_disconnect -- --nocapture
SEED=0 cargo test -p collab test_following_to_channel_notes_without_a_shared_project -- --nocapture

# Run GPUI tests
cargo test -p gpui

# Run full collab suite
SEED=0 cargo test -p collab

# Run clippy
./script/clippy

# Debug scheduler (conditional logging)
DEBUG_SCHEDULER=1 SEED=0 cargo test -p collab <test_name> -- --nocapture

Architecture

Per-App Arena Design

Each App now has:
├── element_arena: RefCell<Arena>  (isolated per App)
├── ForegroundExecutor (with unique SessionId)
└── Windows (allocate from App's arena during draw)

Thread-locals:
├── ELEMENT_ARENA: RefCell<Arena>         (fallback, used when no app arena is active)
└── CURRENT_ELEMENT_ARENA: Cell<Option<*const RefCell<Arena>>>  (points to active app's arena)

During Window::draw():
1. ElementArenaScope::enter() sets CURRENT_ELEMENT_ARENA to this App's arena
2. Element allocation via with_element_arena() uses CURRENT_ELEMENT_ARENA
3. ElementArenaScope::drop() restores previous arena (handles nesting)
4. ArenaClearNeeded::clear() clears this App's arena (not the global one)

Key Scheduler Differences (OLD vs NEW)

OLD TestDispatcher:

  • delayed: Vec<(Duration, RunnableVariant)> - runnables stored directly
  • tick() moves expired runnables to background queue at start
  • Flat random selection among all foreground + background tasks
  • Uses HashMap<TestDispatcherId, VecDeque> for foreground tasks

NEW TestScheduler:

  • timers: Vec<Timer> where Timer has oneshot::Sender - dropping wakes futures
  • step() expires timers first, then selects task
  • Priority-weighted random selection among candidates
  • Session-based foreground isolation (first task per session is candidate)
  • Blocked sessions are excluded from candidate selection

Implications: The new scheduler is more correct but may execute tasks in different (valid) orders. This exposes latent race conditions and timing assumptions in tests that happened to pass by accident with the old scheduler.


Files Modified in This Session

  1. crates/gpui/src/app.rs

    • Added Arena import
    • Added element_arena: RefCell<Arena> field to App struct
    • Initialize arena in constructor
  2. crates/gpui/src/window.rs

    • Added CURRENT_ELEMENT_ARENA thread-local
    • Added with_element_arena() function
    • Added ElementArenaScope struct with enter() and Drop impl
    • Modified ArenaClearNeeded to track arena pointer
    • Modified Window::draw() to use ElementArenaScope
  3. crates/gpui/src/element.rs

    • Changed AnyElement::new() to use with_element_arena()
  4. crates/collab/src/tests/editor_tests.rs

    • Changed RECEIVE_TIMEOUT to RECONNECT_TIMEOUT on line 154
  5. crates/collab/src/tests/following_tests.rs

    • Added deterministic.run_until_parked() after client A's edits

Files with Debug Logging (to be removed before merge)

  • crates/scheduler/src/test_scheduler.rs - Has conditional DEBUG_SCHEDULER logging in advance_clock and step_filtered