mixa/zed

Fork 0

Files

Nathan Sobo e007b3227f Use per-app element arena only and scope test draws

2025-12-14 14:27:25 -07:00

9.5 KiB

Raw Blame History

Scheduler Integration Status

Branch: `scheduler-integration`

PR: #44810

✅ Completed

Per-App Arena Isolation (SIGSEGV Fix)

The scheduler integration exposed a latent bug where multiple test sessions shared a single thread-local element arena (ELEMENT_ARENA). When the scheduler correctly allowed cross-session task interleaving during block(), one session could clear the arena while another held references, causing SIGSEGV crashes.

Problem Scenario:

App A starts drawing, allocates elements in the shared ELEMENT_ARENA
App A's task yields during some async operation
App B's task runs, draws, and clears ELEMENT_ARENA
App A's task resumes and tries to use its now-invalid ArenaBox references → SIGSEGV

Solution: Each App now owns its own arena. A thread-local pointer (CURRENT_ELEMENT_ARENA) is set during Window::draw() to route element allocations to the correct arena.

Files Changed:

crates/gpui/src/app.rs
- Added element_arena: RefCell<Arena> field to App struct
- Added Arena import from crate
- Initialize arena in App::new_app()
crates/gpui/src/window.rs
- Added CURRENT_ELEMENT_ARENA thread-local (points to current app's arena during draw)
- Added with_element_arena() helper function for element allocation
- Added ElementArenaScope RAII guard for setting/restoring the current arena
- Updated ArenaClearNeeded to store a pointer to the arena that needs clearing
- Modified Window::draw() to set up ElementArenaScope and return arena-aware ArenaClearNeeded
crates/gpui/src/element.rs
- Changed AnyElement::new() to use with_element_arena() instead of directly accessing ELEMENT_ARENA

Scheduler Integration

The scheduler crate is fully integrated into GPUI:

TestScheduler provides deterministic async execution for tests
PlatformScheduler wraps platform dispatchers for production
Session-based foreground isolation prevents same-session reentrancy during blocking
Cross-session task execution works correctly (different "machines" can make progress)

✅ Fixed: test_host_disconnect

Root Cause

The test used insufficient time (RECEIVE_TIMEOUT = 10s) after allow_connections() for client A to reconnect. The reconnection logic uses exponential backoff with jitter:

500ms → 1s → 2s → 4s → 8s → 16s → 30s (max)
After 40s of failed attempts, the next retry could be 30-60s in the future
10s was not enough; other similar tests use RECONNECT_TIMEOUT (30s)

Why It Passed on Main

The old scheduler processed tasks/timers in a slightly different order, which happened to allow reconnection within 10s by chance. The new scheduler's different (but equally valid) task ordering exposed the timing bug.

Fix Applied

Changed line 154 in crates/collab/src/tests/editor_tests.rs:

// Before:
cx_a.background_executor.advance_clock(RECEIVE_TIMEOUT);
// After:
cx_a.background_executor.advance_clock(RECONNECT_TIMEOUT);

Test now passes on all seeds tested (0-9).

✅ Fixed: test_following_to_channel_notes_without_a_shared_project

Root Cause

The test had a race condition where client B started following client A before client A's buffer edits were synced to the server.

The Race:

Client A opens channel notes, inserts "Hello from A." (13 chars), sets selection to 3..4
These operations are sent via client.send() (fire-and-forget, non-blocking)
Before the server processes these operations, client B starts following
Client B sends JoinChannelBuffer request
Server responds with buffer state that doesn't include A's latest edits
Client B's ChannelView is created with empty/partial buffer
The selection state (3..4) from the follow response can't be applied properly
Client B ends up with cursor at end of buffer (13..13) instead of 3..4

Why It Passed on Main

The old scheduler's task ordering happened to process the buffer sync messages before client B's follow request. The new scheduler's different (but equally valid) ordering exposed this missing synchronization point.

Fix Applied

Added deterministic.run_until_parked() after client A's edits in crates/collab/src/tests/following_tests.rs:

channel_notes_1_a.update_in(cx_a, |notes, window, cx| {
    assert_eq!(notes.channel(cx).unwrap().name, "channel-1");
    notes.editor.update(cx, |editor, cx| {
        editor.insert("Hello from A.", window, cx);
        editor.change_selections(SelectionEffects::no_scroll(), window, cx, |selections| {
            selections.select_ranges(vec![MultiBufferOffset(3)..MultiBufferOffset(4)]);
        });
    });
});

// Ensure client A's edits are synced to the server before client B starts following.
deterministic.run_until_parked();

// Client B follows client A.
workspace_b
    .update_in(cx_b, |workspace, window, cx| {
        workspace.start_following(client_a.peer_id().unwrap(), window, cx).unwrap()
    })
    .await
    .unwrap();

Test now passes on all seeds tested (0-9).

🔄 Known Issue: test_joining_channel_ancestor_member (Pre-existing)

Symptom

This test fails at seed 3 when running all collab tests together, but passes when run in isolation.

Error

panicked at crates/livekit_client/src/test.rs:710:45:
called `Result::unwrap()` on an `Err` value: no server found for url

Analysis

This is a test isolation issue in the LiveKit test infrastructure, not related to our changes:

Room::test_server() tries to find a registered test server for a URL
When running all tests, some previous test's cleanup or the scheduler's task ordering causes the server to not be found
The test passes in isolation because there's no interference from other tests

Recommendation

This is a pre-existing issue exposed by the new scheduler's task ordering. It should be investigated separately as a test infrastructure problem in the LiveKit mock client.

Test Results Summary

# All these pass:
SEED=0..9 cargo test -p collab test_host_disconnect
SEED=0..9 cargo test -p collab test_following_to_channel_notes_without_a_shared_project
cargo test -p gpui  # 73 passed

# Full collab test suite:
SEED=0 cargo test -p collab  # 155 passed
SEED=1 cargo test -p collab  # 155 passed
SEED=2 cargo test -p collab  # 155 passed
SEED=3 cargo test -p collab  # 154 passed, 1 failed (test_joining_channel_ancestor_member - pre-existing issue)

# Clippy passes
./script/clippy

Quick Reference

# Run specific fixed tests
SEED=0 cargo test -p collab test_host_disconnect -- --nocapture
SEED=0 cargo test -p collab test_following_to_channel_notes_without_a_shared_project -- --nocapture

# Run GPUI tests
cargo test -p gpui

# Run full collab suite
SEED=0 cargo test -p collab

# Run clippy
./script/clippy

# Debug scheduler (conditional logging)
DEBUG_SCHEDULER=1 SEED=0 cargo test -p collab <test_name> -- --nocapture

Architecture

Per-App Arena Design

Each App now has:
├── element_arena: RefCell<Arena>  (isolated per App)
├── ForegroundExecutor (with unique SessionId)
└── Windows (allocate from App's arena during draw)

Thread-locals:
├── ELEMENT_ARENA: RefCell<Arena>         (fallback, used when no app arena is active)
└── CURRENT_ELEMENT_ARENA: Cell<Option<*const RefCell<Arena>>>  (points to active app's arena)

During Window::draw():
1. ElementArenaScope::enter() sets CURRENT_ELEMENT_ARENA to this App's arena
2. Element allocation via with_element_arena() uses CURRENT_ELEMENT_ARENA
3. ElementArenaScope::drop() restores previous arena (handles nesting)
4. ArenaClearNeeded::clear() clears this App's arena (not the global one)

Key Scheduler Differences (OLD vs NEW)

OLD TestDispatcher:

delayed: Vec<(Duration, RunnableVariant)> - runnables stored directly
tick() moves expired runnables to background queue at start
Flat random selection among all foreground + background tasks
Uses HashMap<TestDispatcherId, VecDeque> for foreground tasks

NEW TestScheduler:

timers: Vec<Timer> where Timer has oneshot::Sender - dropping wakes futures
step() expires timers first, then selects task
Priority-weighted random selection among candidates
Session-based foreground isolation (first task per session is candidate)
Blocked sessions are excluded from candidate selection

Implications: The new scheduler is more correct but may execute tasks in different (valid) orders. This exposes latent race conditions and timing assumptions in tests that happened to pass by accident with the old scheduler.

Files Modified in This Session

crates/gpui/src/app.rs
- Added Arena import
- Added element_arena: RefCell<Arena> field to App struct
- Initialize arena in constructor
crates/gpui/src/window.rs
- Added CURRENT_ELEMENT_ARENA thread-local
- Added with_element_arena() function
- Added ElementArenaScope struct with enter() and Drop impl
- Modified ArenaClearNeeded to track arena pointer
- Modified Window::draw() to use ElementArenaScope
crates/gpui/src/element.rs
- Changed AnyElement::new() to use with_element_arena()
crates/collab/src/tests/editor_tests.rs
- Changed RECEIVE_TIMEOUT to RECONNECT_TIMEOUT on line 154
crates/collab/src/tests/following_tests.rs
- Added deterministic.run_until_parked() after client A's edits

Files with Debug Logging (to be removed before merge)

crates/scheduler/src/test_scheduler.rs - Has conditional DEBUG_SCHEDULER logging in advance_clock and step_filtered

9.5 KiB Raw Blame History

Scheduler Integration Status

Branch: scheduler-integration

PR: #44810

✅ Completed

Per-App Arena Isolation (SIGSEGV Fix)

Scheduler Integration

✅ Fixed: test_host_disconnect

Root Cause

Why It Passed on Main

Fix Applied

✅ Fixed: test_following_to_channel_notes_without_a_shared_project

Root Cause

Why It Passed on Main

Fix Applied

🔄 Known Issue: test_joining_channel_ancestor_member (Pre-existing)

Symptom

Error

Analysis

Recommendation

Test Results Summary

Quick Reference

Architecture

Per-App Arena Design

Key Scheduler Differences (OLD vs NEW)

Files Modified in This Session

Files with Debug Logging (to be removed before merge)

9.5 KiB

Raw Blame History

Branch: `scheduler-integration`