clash-verge-rev/src-tauri/src/cmd/profile_switch/driver.rs
Sline c2dcd86722 refactor: profile switch (#5197)
* refactor: proxy refresh

* fix(proxy-store): properly hydrate and filter backend provider snapshots

* fix(proxy-store): add monotonic fetch guard and event bridge cleanup

* fix(proxy-store): tweak fetch sequencing guard to prevent snapshot invalidation from wiping fast responses

* docs: UPDATELOG.md

* fix(proxy-snapshot, proxy-groups): restore last-selected proxy and group info

* fix(proxy): merge static and provider entries in snapshot; fix Virtuoso viewport height

* fix(proxy-groups): restrict reduced-height viewport to chain-mode column

* refactor(profiles): introduce a state machine

* refactor: replace state machine with reducer

* refactor: introduce a profile switch worker

* refactor: hooked up a backend-driven profile switch flow

* refactor(profile-switch): serialize switches with async queue and enrich frontend events

* feat(profiles): centralize profile switching with reducer/driver queue to fix stuck UI on rapid toggles

* chore: translate comments and log messages to English to avoid encoding issues

* refactor: migrate backend queue to SwitchDriver actor

* fix(profile): unify error string types in validation helper

* refactor(profile): make switch driver fully async and handle panics safely

* refactor(cmd): move switch-validation helper into new profile_switch module

* refactor(profile): modularize switch logic into profile_switch.rs

* refactor(profile_switch): modularize switch handler

- Break monolithic switch handler into proper module hierarchy
- Move shared globals, constants, and SwitchScope guard to state.rs
- Isolate queue orchestration and async task spawning in driver.rs
- Consolidate switch pipeline and config patching in workflow.rs
- Extract request pre-checks/YAML validation into validation.rs

* refactor(profile_switch): centralize state management and add cancellation flow

- Introduced SwitchManager in state.rs to unify mutex, sequencing, and SwitchScope handling.
- Added SwitchCancellation and SwitchRequest wrappers to encapsulate cancel tokens and notifications.
- Updated driver to allocate task IDs via SwitchManager, cancel old tokens, and queue next jobs in order.
- Updated workflow to check cancellation and sequence at each phase, replacing global flags with manager APIs.

* feat(profile_switch): integrate explicit state machine for profile switching

- workflow.rs:24 now delegates each switch to SwitchStateMachine, passing an owned SwitchRequest.
  Queue cancellation and state-sequence checks are centralized inside the machine instead of scattered guards.
- workflow.rs:176 replaces the old helper with `SwitchStateMachine::new(manager(), None, profiles).run().await`,
  ensuring manual profile patches follow the same workflow (locking, validation, rollback) as queued switches.
- workflow.rs:180 & 275 expose `validate_profile_yaml` and `restore_previous_profile` for reuse inside the state machine.

- workflow/state_machine.rs:1 introduces a dedicated state machine module.
  It manages global mutex acquisition, request/cancellation state, YAML validation, draft patching,
  `CoreManager::update_config`, failure rollback, and tray/notification side-effects.
  Transitions check for cancellations and stale sequences; completions release guards via `SwitchScope` drop.

* refactor(profile-switch): integrate stage-aware panic handling

- src-tauri/src/cmd/profile_switch/workflow/state_machine.rs:1
  Defines SwitchStage and SwitchPanicInfo as crate-visible, wraps each transition in with_stage(...) with catch_unwind, and propagates CmdResult<bool> to distinguish validation failures from panics while keeping cancellation semantics.

- src-tauri/src/cmd/profile_switch/workflow.rs:25
  Updates run_switch_job to return Result<bool, SwitchPanicInfo>, routing timeout, validation, config, and stage panic cases separately. Reuses SwitchPanicInfo for logging/UI notifications; patch_profiles_config maps state-machine panics into user-facing error strings.

- src-tauri/src/cmd/profile_switch/driver.rs:1
  Adds SwitchJobOutcome to unify workflow results: normal completions carry bool, and panics propagate SwitchPanicInfo. The driver loop now logs panics explicitly and uses AssertUnwindSafe(...).catch_unwind() to guard setup-phase panics.

* refactor(profile-switch): add watchdog, heartbeat, and async timeout guards

- Introduce SwitchHeartbeat for stage tracking and timing; log stage transitions with elapsed durations.
- Add watchdog in driver to cancel stalled switches (5s heartbeat timeout).
- Wrap blocking ops (Config::apply, tray updates, profiles_save_file_safe, etc.) with time::timeout to prevent async stalls.
- Improve logs for stage transitions and watchdog timeouts to clarify cancellation points.
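
The heartbeat/watchdog pairing described above can be sketched with the standard library alone. This is a minimal illustration and not the project's `SwitchHeartbeat`: the `Heartbeat` type, its `touch` and `is_stalled` methods, and the caller-supplied millisecond timestamps are assumptions made for the example. Only the 5s timeout is taken from the notes above.

```rust
use std::sync::atomic::{AtomicU32, AtomicU64, Ordering};

/// Hypothetical heartbeat: remembers the current stage and when it last advanced.
pub struct Heartbeat {
    stage_code: AtomicU32,
    last_beat_ms: AtomicU64,
}

impl Heartbeat {
    pub fn new(now_ms: u64) -> Self {
        Self {
            stage_code: AtomicU32::new(0),
            last_beat_ms: AtomicU64::new(now_ms),
        }
    }

    /// Called by the workflow whenever it enters a new stage.
    pub fn touch(&self, stage_code: u32, now_ms: u64) {
        self.stage_code.store(stage_code, Ordering::Relaxed);
        self.last_beat_ms.store(now_ms, Ordering::Relaxed);
    }

    pub fn stage_code(&self) -> u32 {
        self.stage_code.load(Ordering::Relaxed)
    }

    /// Called by the watchdog on every tick; true once the timeout is exceeded.
    pub fn is_stalled(&self, now_ms: u64, timeout_ms: u64) -> bool {
        now_ms.saturating_sub(self.last_beat_ms.load(Ordering::Relaxed)) > timeout_ms
    }
}
```

A watchdog loop would call `is_stalled(now, 5_000)` on each tick and cancel the switch when it returns true. Passing the clock in as a parameter keeps the check deterministic and easy to test.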

* refactor(profile-switch): async post-switch tasks, early lock release, and spawn_blocking for IO

* feat(profile-switch): track cleanup and coordinate pipeline

- Add explicit cleanup tracking in the driver (`cleanup_profiles` map + `CleanupDone` messages) to know when background post-switch work is still running before starting a new workflow. (driver.rs:29-50)
- Update `handle_enqueue` to detect “cleanup in progress”: same-profile retries are short-circuited; other requests collapse the pending queue, cancelling old tokens so only the latest intent survives. (driver.rs:176-247)
- Rework scheduling helpers: `start_next_job` refuses to start while cleanup is outstanding; discarded requests release cancellation tokens; cleanup completion explicitly restarts the pipeline. (driver.rs:258-442)
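
The gating rule above (no new job while a switch is active or cleanup is outstanding) can be shown in a few lines. This is a simplified sketch under stated assumptions: the `Pipeline` type and its method names are hypothetical, profiles are plain `String`s, and cancellation tokens and async tasks are left out.

```rust
use std::collections::{HashSet, VecDeque};

/// Hypothetical, simplified stand-in for the driver's scheduling state.
#[derive(Default)]
pub struct Pipeline {
    active: Option<String>,
    queue: VecDeque<String>,
    cleanup: HashSet<String>,
}

impl Pipeline {
    pub fn enqueue(&mut self, profile: &str) {
        self.queue.push_back(profile.to_string());
        self.start_next();
    }

    /// Starts the next queued job unless a job is active or cleanup is pending.
    pub fn start_next(&mut self) -> bool {
        if self.active.is_some() || !self.cleanup.is_empty() {
            return false;
        }
        match self.queue.pop_front() {
            Some(profile) => {
                self.active = Some(profile);
                true
            }
            None => false,
        }
    }

    /// The active job finished; its post-switch cleanup is now outstanding.
    pub fn complete_active(&mut self) {
        if let Some(profile) = self.active.take() {
            self.cleanup.insert(profile);
        }
        self.start_next();
    }

    /// Cleanup finished; this is what restarts the pipeline.
    pub fn cleanup_done(&mut self, profile: &str) {
        self.cleanup.remove(profile);
        self.start_next();
    }

    pub fn active(&self) -> Option<&str> {
        self.active.as_deref()
    }
}
```

With two queued profiles, the second one starts only after `cleanup_done` is called for the first, which mirrors the role of the `CleanupDone` message.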

* feat(profile-switch): unify post-switch cleanup handling

- workflow.rs (25-427) returns `SwitchWorkflowResult` (success + CleanupHandle) or `SwitchWorkflowError`.
  All failure/timeout paths stash post-switch work into a single CleanupHandle.
  Cleanup helpers (`notify_profile_switch_finished` and `close_connections_after_switch`) run inside that task for proper lifetime handling.

- driver.rs (29-439) propagates CleanupHandle through `SwitchJobOutcome`, spawns a bridge to wait for completion, and blocks `start_next_job` until done.
  Direct driver-side panics now schedule failure cleanup via the shared helper.

* tmp

* Revert "tmp"

This reverts commit e582cf4a65.

* refactor: queue frontend events through async dispatcher

* refactor: queue frontend switch/proxy events and throttle notices

* chore: frontend debug log

* fix: re-enable only ProfileSwitchFinished events - keep others suppressed for crash isolation

- Re-enabled only ProfileSwitchFinished events; RefreshClash, RefreshProxy, and ProfileChanged remain suppressed (they log suppression messages)
- Allows frontend to receive task completion notifications for UI feedback while crash isolation continues
- src-tauri/src/core/handle.rs now only suppresses notify_profile_changed
- Serialized emitter, frontend logging bridge, and other diagnostics unchanged

* refactor: refreshClashData

* refactor(proxy): stabilize proxy switch pipeline and rendering

- Add coalescing buffer in notification.rs to emit only the latest proxies-updated snapshot
- Replace nextTick with queueMicrotask in asyncQueue.ts for same-frame hydration
- Hide auto-generated GLOBAL snapshot and preserve optional metadata in proxy-snapshot.ts
- Introduce stable proxy rendering state in AppDataProvider (proxyTargetProfileId, proxyDisplayProfileId, isProxyRefreshPending)
- Update proxy page to fade content during refresh and overlay status banner instead of showing incomplete snapshot

* refactor(profiles): move manual activating logic to reducer for deterministic queue tracking

* refactor: replace proxy-data event bridge with pure polling and simplify proxy store

- Replaced the proxy-data event bridge with pure polling: AppDataProvider now fetches the initial snapshot and drives refreshes from the polled switchStatus, removing verge://refresh-* listeners (src/providers/app-data-provider.tsx).
- Simplified proxy-store by dropping the proxies-updated listener queue and unused payload/normalizer helpers; relies on SWR/provider fetch path + calcuProxies for live updates (src/stores/proxy-store.ts).
- Trimmed layout-level event wiring to keep only notice/show/hide subscriptions, removing obsolete refresh listeners (src/pages/_layout/useLayoutEvents.ts).

* refactor(proxy): streamline proxies-updated handling and store event flow

- AppDataProvider now treats `proxies-updated` as the fast path: the listener
  calls `applyLiveProxyPayload` immediately and schedules only a single fallback
  `fetchLiveProxies` ~600 ms later (replacing the old 0/250/1000/2000 cascade).
  Expensive provider/rule refreshes run in parallel via `Promise.allSettled`, and
  the multi-stage queue on profile-update completion was removed
  (src/providers/app-data-provider.tsx).

- Rebuilt proxy-store to support the event flow: restored `setLive`, provider
  normalization, and an animation-frame + async queue that applies payloads without
  blocking. Exposed `applyLiveProxyPayload` so providers can push events directly
  into the store (src/stores/proxy-store.ts).

* refactor: switch delay

* refactor(app-data-provider): trigger getProfileSwitchStatus revalidation on profile-switch-finished

- AppDataProvider now listens to `profile-switch-finished` and calls `mutate("getProfileSwitchStatus")` to immediately update state and unlock buttons (src/providers/app-data-provider.tsx).
- Retain existing detailed timing logs for monitoring other stages.
- Frontend success notifications remain instant; background refreshes continue asynchronously.

* fix(profiles): prevent duplicate toast on page remount

* refactor(profile-switch): make active switches preemptible and prevent queue piling

- Add notify mechanism to SwitchCancellation to await cancellation without busy-waiting (state.rs:82)
- Collapse pending queue to a single entry in the driver; cancel in-flight task on newer request (driver.rs:232)
- Update handle_update_core to watch cancel token and 30s timeout; release locks, discard draft, and exit early if canceled (state_machine.rs:301)
- Providers revalidate status immediately on profile-switch-finished events (app-data-provider.tsx:208)

* refactor(core): make core reload phase controllable, reduce 0xcfffffff risk

- CoreManager::apply_config now calls `reload_config_with_retry`, each attempt waits up to 5s, retries 3 times; on failure, returns error with duration logged and triggers core restart if needed (src-tauri/src/core/manager/config.rs:175, 205)
- `reload_config_with_retry` logs attempt info on timeout or error; if error is a Mihomo connection issue, fallback to original restart logic (src-tauri/src/core/manager/config.rs:211)
- `reload_config_once` retains original Mihomo call for retry wrapper usage (src-tauri/src/core/manager/config.rs:247)
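
The bounded-retry shape described above can be sketched as a small generic helper. This is an illustration only and not the code in `config.rs`: the name `retry_with_limit` is hypothetical, the per-attempt 5s timeout and the restart fallback are omitted, and "retries 3 times" is read here as three attempts in total, which is an assumption.

```rust
/// Hypothetical helper: runs `attempt` up to `max_attempts` times and
/// returns the first success, or the last error once attempts are exhausted.
/// The closure receives the 1-based attempt number so it can log it.
pub fn retry_with_limit<T, E>(
    max_attempts: u32,
    mut attempt: impl FnMut(u32) -> Result<T, E>,
) -> Result<T, E> {
    assert!(max_attempts > 0, "at least one attempt is required");
    let mut n = 1;
    loop {
        match attempt(n) {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error to the caller.
            Err(err) if n >= max_attempts => return Err(err),
            // Otherwise try again.
            Err(_) => n += 1,
        }
    }
}
```

A caller would wrap the single reload call in the closure and decide from the returned error whether a core restart is needed.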

* chore(frontend-logs): downgrade routine event logs from info to debug

- Logs like `emit_via_app entering spawn_blocking`, `Async emit…`, `Buffered proxies…` are now debug-level (src-tauri/src/core/notification.rs:155, :265, :309…)
- Genuine warnings/errors (failures/timeouts) remain at warn/error
- Core stage logs remain info to keep backend tracking visible

* refactor(frontend-emit): make emit_via_app fire-and-forget async

- `emit_via_app` now a regular function; spawns with `tokio::spawn` and logs a warn if `emit_to` fails, caller returns immediately (src-tauri/src/core/notification.rs:269)
- Removed `.await` at Async emit and flush_proxies calls; only record dispatch duration and warn on failure (src-tauri/src/core/notification.rs:211, :329)

* refactor(ui): restructure profile switch for event-driven speed + polling stability

- Backend
  - SwitchManager maintains a lightweight event queue: added `event_sequence`, `recent_events`, and `SwitchResultEvent`; provides `push_event` / `events_after` (state.rs)
  - `handle_completion` pushes events on success/failure and keeps `last_result` (driver.rs) for frontend incremental fetch
  - New Tauri command `get_profile_switch_events(after_sequence)` exposes `events_after` (profile_switch/mod.rs → profile.rs → lib.rs)
- Notification system
  - `NotificationSystem::process_event` only logs debug, disables WebView `emit_to`, fixes 0xcfffffff
  - Related emit/buffer functions now safe no-op, removed unused structures and warnings (notification.rs)
- Frontend
  - services/cmds.ts defines `SwitchResultEvent` and `getProfileSwitchEvents`
  - `AppDataProvider` holds `switchEventSeqRef`, polls incremental events every 0.25s (busy) / 1s (idle); each event triggers:
      - immediate `globalMutate("getProfiles")` to refresh current profile
      - background refresh of proxies/providers/rules via `Promise.allSettled` (failures logged, non-blocking)
      - forced `mutateSwitchStatus` to correct state
  - original switchStatus effect calls `handleSwitchResult` as fallback; other toast/activation logic handled in profiles.tsx
- Commands / API cleanup
  - removed `pub use profile_switch::*;` in cmd::mod.rs to avoid conflicts; frontend uses new command polling
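
The `push_event` / `events_after` queue from the Backend notes above can be sketched as a bounded log with a monotonically increasing sequence number. This is a minimal illustration, not the `SwitchManager` implementation: the `EventLog` and `Event` types, their fields, and the capacity parameter are assumptions made for the example.

```rust
use std::collections::VecDeque;

/// Hypothetical event record; the real `SwitchResultEvent` carries more fields.
#[derive(Debug, Clone, PartialEq)]
pub struct Event {
    pub sequence: u64,
    pub task_id: u64,
    pub success: bool,
}

/// Bounded log of recent events that pollers read incrementally.
pub struct EventLog {
    next_sequence: u64,
    capacity: usize,
    recent: VecDeque<Event>,
}

impl EventLog {
    pub fn new(capacity: usize) -> Self {
        Self { next_sequence: 0, capacity, recent: VecDeque::new() }
    }

    /// Appends an event, evicting the oldest entries beyond `capacity`.
    pub fn push_event(&mut self, task_id: u64, success: bool) -> u64 {
        self.next_sequence += 1;
        let sequence = self.next_sequence;
        self.recent.push_back(Event { sequence, task_id, success });
        while self.recent.len() > self.capacity {
            self.recent.pop_front();
        }
        sequence
    }

    /// Returns every retained event newer than `after_sequence`.
    pub fn events_after(&self, after_sequence: u64) -> Vec<Event> {
        self.recent
            .iter()
            .filter(|event| event.sequence > after_sequence)
            .cloned()
            .collect()
    }
}
```

A poller keeps the highest sequence it has seen and passes it back on the next call, so each event is delivered once even if a poll is delayed, provided the log has not evicted it in the meantime.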

* refactor(frontend): optimize profile switch with optimistic updates

* refactor(profile-switch): switch to event-driven flow with Profile Store

- SwitchManager pushes events; frontend polls get_profile_switch_events
- Zustand store handles optimistic profiles; AppDataProvider applies updates and background-fetches
- UI flicker removed

* fix(app-data): re-hook profile store updates during switch hydration

* fix(notification): restore frontend event dispatch and non-blocking emits

* fix(app-data-provider): restore proxy refresh and seed snapshot after refactor

* fix: ensure switch completion events are received and handle proxies-updated

* fix(app-data-provider): dedupe switch results by taskId and fix stale profile state

* fix(profile-switch): ensure patch_profiles_config_by_profile_index waits for real completion and handle join failures in apply_config_with_timeout

* docs: UPDATELOG.md

* chore: add necessary comments

* fix(core): always dispatch async proxy snapshot after RefreshClash event

* fix(proxy-store, provider): handle pending snapshots and proxy profiles

- Added pending snapshot tracking in proxy-store so `lastAppliedFetchId` no longer jumps on seed. Profile adoption is deferred until a qualifying fetch completes. Exposed `clearPendingProfile` for rollback support.
- Cleared pending snapshot state whenever live payloads apply or the store resets, preventing stale optimistic profile IDs after failures.
- In provider integration, subscribed to the pending proxy profile and fed it into target-profile derivation. Cleared it on failed switch results so hydration can advance and UI status remains accurate.

* fix(proxy): re-hook tray refresh events into proxy refresh queue

- Reattached listen("verge://refresh-proxy-config", …) at src/providers/app-data-provider.tsx:402 and registered it for cleanup.
- Added matching window fallback handler at src/providers/app-data-provider.tsx:430 so in-app dispatches share the same refresh path.

* fix(proxy-snapshot/proxy-groups): address review findings on snapshot placeholders

- src/utils/proxy-snapshot.ts:72-95 now derives snapshot group members solely from proxy-groups.proxies, so provider ids under `use` no longer generate placeholder proxy items.
- src/components/proxy/proxy-groups.tsx:665-677 lets the hydration overlay capture pointer events (and shows a wait cursor) so users can’t interact with snapshot-only placeholders before live data is ready.

* fix(profile-switch): preserve queued requests and avoid stale connection teardown

- Keep earlier queued switches intact by dropping the blanket “collapse” call: after removing duplicates for the same profile, new requests are simply appended, leaving other profiles pending (driver.rs:376). Resolves queue-loss scenario.
- Gate connection cleanup on real successes so cancelled/stale runs no longer tear down Mihomo connections; success handler now skips close_connections_after_switch when success == false (workflow.rs:419).
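
The queueing rule in the first bullet above (drop duplicates for the same profile, then append, leaving other profiles pending) comes down to a `retain` followed by a `push_back`. A minimal sketch, assuming a hypothetical `Pending` record in place of `SwitchRequest`:

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for a queued switch request.
#[derive(Debug, Clone, PartialEq)]
pub struct Pending {
    pub task_id: u64,
    pub profile: String,
}

/// Removes older entries for the same profile, then appends the new request.
/// Requests for other profiles keep their position in the queue.
pub fn enqueue_latest(queue: &mut VecDeque<Pending>, request: Pending) {
    queue.retain(|queued| queued.profile != request.profile);
    queue.push_back(request);
}
```

Enqueueing tasks 1 (profile `a`), 2 (profile `b`), and 3 (profile `a`) leaves tasks 2 and 3 in the queue, in that order.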

* fix(profile-switch, layout): improve profile validation and restore backend refresh

- Hardened profile validation using `tokio::fs` with a 5s timeout and offloading YAML parsing to `AsyncHandler::spawn_blocking`, preventing slow disks or malformed files from freezing the runtime (src-tauri/src/cmd/profile_switch/validation.rs:9, 71).
- Restored backend-triggered refresh handling by listening for `verge://refresh-clash-config` / `verge://refresh-verge-config` and invoking shared refresh services so SWR caches stay in sync with core events (src/pages/_layout/useLayoutEvents.ts:6, 45, 55).

* feat(profile-switch): handle cancellations for superseded requests

- Added a `cancelled` flag and constructor so superseded requests publish an explicit cancellation instead of a failure (src-tauri/src/cmd/profile_switch/state.rs:249, src-tauri/src/cmd/profile_switch/driver.rs:482)
- Updated the profile switch effect to log cancellations as info, retain the shared `mutate` call, and skip emitting error toasts while still refreshing follow-up work (src/pages/profiles.tsx:554, src/pages/profiles.tsx:581)
- Exposed the new flag on the TypeScript contract to keep downstream consumers type-safe (src/services/cmds.ts:20)

* fix(profiles): wrap logging payload for Tauri frontend_log

* fix(profile-switch): add rollback and error propagation for failed persistence

- Added rollback on apply failure so Mihomo restores to the previous profile
  before exiting the success path early (state_machine.rs:474).
- Reworked persist_profiles_with_timeout to surface timeout/join/save errors,
  convert them into CmdResult failures, and trigger rollback + error propagation
  when persistence fails (state_machine.rs:703).

* fix(profile-switch): prevent mid-finalize reentrancy and lingering tasks

* fix(profile-switch): preserve pending queue and surface discarded switches

* fix(profile-switch): avoid draining Mihomo sockets on failed/cancelled switches

* fix(app-data-provider): restore backend-driven refresh and reattach fallbacks

* fix(profile-switch): queue concurrent updates and add bounded wait/backoff

* fix(proxy): trigger live refresh on app start for proxy snapshot

* refactor(profile-switch): split flow into layers and centralize async cleanup

- Introduced `SwitchDriver` to encapsulate queue and driver logic while keeping the public Tauri command API.
- Added workflow/cleanup helpers for notification dispatch and Mihomo connection draining, re-exported for API consistency.
- Replaced monolithic state machine with `core.rs`, `context.rs`, and `stages.rs`, plus a thin `mod.rs` re-export layer; stage methods are now individually testable.
- Removed legacy `workflow/state_machine.rs` and adjusted visibility on re-exported types/constants to ensure compilation.
2025-10-30 17:29:15 +08:00

684 lines
22 KiB
Rust

use super::{
    CmdResult,
    state::{
        ProfileSwitchStatus, SwitchCancellation, SwitchManager, SwitchRequest, SwitchResultStatus,
        SwitchTaskStatus, current_millis, manager,
    },
    workflow::{self, SwitchPanicInfo, SwitchStage},
};
use crate::{logging, utils::logging::Type};
use futures::FutureExt;
use once_cell::sync::OnceCell;
use smartstring::alias::String as SmartString;
use std::{
    collections::{HashMap, VecDeque},
    panic::AssertUnwindSafe,
    time::Duration,
};
use tokio::{
    sync::{
        Mutex as AsyncMutex,
        mpsc::{self, error::TrySendError},
        oneshot,
    },
    time::{self, MissedTickBehavior},
};

// Single shared queue so profile switches are executed sequentially and can
// collapse redundant requests for the same profile.
const SWITCH_QUEUE_CAPACITY: usize = 32;
static SWITCH_QUEUE: OnceCell<mpsc::Sender<SwitchDriverMessage>> = OnceCell::new();

type CompletionRegistry = AsyncMutex<HashMap<u64, oneshot::Sender<SwitchResultStatus>>>;
static SWITCH_COMPLETION_WAITERS: OnceCell<CompletionRegistry> = OnceCell::new();

/// Global map of task id -> completion channel sender used when callers await the result.
fn completion_waiters() -> &'static CompletionRegistry {
    SWITCH_COMPLETION_WAITERS.get_or_init(|| AsyncMutex::new(HashMap::new()))
}

/// Register a oneshot sender so `switch_profile_and_wait` can be notified when its task finishes.
async fn register_completion_waiter(task_id: u64) -> oneshot::Receiver<SwitchResultStatus> {
    let (sender, receiver) = oneshot::channel();
    let mut guard = completion_waiters().lock().await;
    if guard.insert(task_id, sender).is_some() {
        logging!(
            warn,
            Type::Cmd,
            "Replacing existing completion waiter for task {}",
            task_id
        );
    }
    receiver
}

/// Remove an outstanding completion waiter; used when enqueue fails or the request is rejected.
async fn remove_completion_waiter(task_id: u64) -> Option<oneshot::Sender<SwitchResultStatus>> {
    completion_waiters().lock().await.remove(&task_id)
}

/// Fire-and-forget notify helper so we do not block the driver loop.
fn notify_completion_waiter(task_id: u64, result: SwitchResultStatus) {
    tokio::spawn(async move {
        let sender = completion_waiters().lock().await.remove(&task_id);
        if let Some(sender) = sender {
            let _ = sender.send(result);
        }
    });
}

// A switch whose heartbeat has not advanced for this long is cancelled by the watchdog.
const WATCHDOG_TIMEOUT: Duration = Duration::from_secs(5);
// How often the watchdog checks the heartbeat of the running switch.
const WATCHDOG_TICK: Duration = Duration::from_millis(500);

// Mutable snapshot of the driver's world; all mutations happen on the driver task.
#[derive(Debug, Default)]
struct SwitchDriverState {
    active: Option<SwitchRequest>,
    queue: VecDeque<SwitchRequest>,
    latest_tokens: HashMap<SmartString, SwitchCancellation>,
    cleanup_profiles: HashMap<SmartString, tokio::task::JoinHandle<()>>,
    last_result: Option<SwitchResultStatus>,
}

// Messages passed through SWITCH_QUEUE so the driver can react to events in order.
#[derive(Debug)]
enum SwitchDriverMessage {
    Request {
        request: SwitchRequest,
        respond_to: oneshot::Sender<bool>,
    },
    Completion {
        request: SwitchRequest,
        outcome: SwitchJobOutcome,
    },
    CleanupDone {
        profile: SmartString,
    },
}

// Unified workflow result: a normal completion carries the success flag, a panic
// carries its `SwitchPanicInfo`; both carry the handle for post-switch cleanup.
#[derive(Debug)]
enum SwitchJobOutcome {
    Completed {
        success: bool,
        cleanup: workflow::CleanupHandle,
    },
    Panicked {
        info: SwitchPanicInfo,
        cleanup: workflow::CleanupHandle,
    },
}

pub(super) async fn switch_profile(
    profile_index: impl Into<SmartString>,
    notify_success: bool,
) -> CmdResult<bool> {
    switch_profile_impl(profile_index.into(), notify_success, false).await
}

pub(super) async fn switch_profile_and_wait(
    profile_index: impl Into<SmartString>,
    notify_success: bool,
) -> CmdResult<bool> {
    switch_profile_impl(profile_index.into(), notify_success, true).await
}

async fn switch_profile_impl(
    profile_index: SmartString,
    notify_success: bool,
    wait_for_completion: bool,
) -> CmdResult<bool> {
    // wait_for_completion is used by CLI flows that must block until the switch finishes.
    let manager = manager();
    let sender = switch_driver_sender();
    let request = SwitchRequest::new(
        manager.next_task_id(),
        profile_index.clone(),
        notify_success,
    );
    logging!(
        info,
        Type::Cmd,
        "Queue profile switch task {} -> {} (notify={})",
        request.task_id(),
        profile_index,
        notify_success
    );
    let task_id = request.task_id();
    let mut completion_rx = if wait_for_completion {
        Some(register_completion_waiter(task_id).await)
    } else {
        None
    };
    let (tx, rx) = oneshot::channel();
    let enqueue_result = match sender.try_send(SwitchDriverMessage::Request {
        request,
        respond_to: tx,
    }) {
        Ok(_) => match rx.await {
            Ok(result) => Ok(result),
            Err(err) => {
                logging!(
                    error,
                    Type::Cmd,
                    "Failed to receive enqueue result for profile {}: {}",
                    profile_index,
                    err
                );
                Err("switch profile queue unavailable".into())
            }
        },
        Err(TrySendError::Full(msg)) => {
            logging!(
                warn,
                Type::Cmd,
                "Profile switch queue is full; waiting for space: {}",
                profile_index
            );
            match sender.send(msg).await {
                Ok(_) => match rx.await {
                    Ok(result) => Ok(result),
                    Err(err) => {
                        logging!(
                            error,
                            Type::Cmd,
                            "Failed to receive enqueue result after wait for {}: {}",
                            profile_index,
                            err
                        );
                        Err("switch profile queue unavailable".into())
                    }
                },
                Err(err) => {
                    logging!(
                        error,
                        Type::Cmd,
                        "Profile switch queue closed while waiting ({}): {}",
                        profile_index,
                        err
                    );
                    Err("switch profile queue unavailable".into())
                }
            }
        }
        Err(TrySendError::Closed(_)) => {
            logging!(
                error,
                Type::Cmd,
                "Profile switch queue is closed, cannot enqueue: {}",
                profile_index
            );
            Err("switch profile queue unavailable".into())
        }
    };
    let accepted = match enqueue_result {
        Ok(result) => result,
        Err(err) => {
            if completion_rx.is_some() {
                remove_completion_waiter(task_id).await;
            }
            return Err(err);
        }
    };
    if !accepted {
        if completion_rx.is_some() {
            remove_completion_waiter(task_id).await;
        }
        return Ok(false);
    }
    if let Some(rx_completion) = completion_rx.take() {
        match rx_completion.await {
            Ok(status) => Ok(status.success),
            Err(err) => {
                logging!(
                    error,
                    Type::Cmd,
                    "Switch task {} completion channel dropped: {}",
                    task_id,
                    err
                );
                Err("profile switch completion unavailable".into())
            }
        }
    } else {
        Ok(true)
    }
}

fn switch_driver_sender() -> &'static mpsc::Sender<SwitchDriverMessage> {
    SWITCH_QUEUE.get_or_init(|| {
        let (tx, rx) = mpsc::channel::<SwitchDriverMessage>(SWITCH_QUEUE_CAPACITY);
        let driver_tx = tx.clone();
        tokio::spawn(async move {
            let manager = manager();
            let driver = SwitchDriver::new(manager, driver_tx);
            driver.run(rx).await;
        });
        tx
    })
}

struct SwitchDriver {
    manager: &'static SwitchManager,
    sender: mpsc::Sender<SwitchDriverMessage>,
    state: SwitchDriverState,
}

impl SwitchDriver {
    fn new(manager: &'static SwitchManager, sender: mpsc::Sender<SwitchDriverMessage>) -> Self {
        let state = SwitchDriverState::default();
        manager.set_status(state.snapshot(manager));
        Self {
            manager,
            sender,
            state,
        }
    }

    async fn run(mut self, mut rx: mpsc::Receiver<SwitchDriverMessage>) {
        while let Some(message) = rx.recv().await {
            match message {
                SwitchDriverMessage::Request {
                    request,
                    respond_to,
                } => {
                    self.handle_enqueue(request, respond_to);
                }
                SwitchDriverMessage::Completion { request, outcome } => {
                    self.handle_completion(request, outcome);
                }
                SwitchDriverMessage::CleanupDone { profile } => {
                    self.handle_cleanup_done(profile);
                }
            }
        }
    }

    fn handle_enqueue(&mut self, request: SwitchRequest, respond_to: oneshot::Sender<bool>) {
        // Each new request supersedes older ones for the same profile to avoid thrashing the core.
        let mut responder = Some(respond_to);
        let accepted = true;
        let profile_key = request.profile_id().clone();
        let cleanup_pending =
            self.state.active.is_none() && !self.state.cleanup_profiles.is_empty();
        if cleanup_pending && self.state.cleanup_profiles.contains_key(&profile_key) {
            logging!(
                debug,
                Type::Cmd,
                "Cleanup running for {}; queueing switch task {} -> {} to run afterwards",
                profile_key,
                request.task_id(),
                profile_key
            );
            if let Some(previous) = self
                .state
                .latest_tokens
                .insert(profile_key.clone(), request.cancel_token().clone())
            {
                previous.cancel();
            }
            self.state
                .queue
                .retain(|queued| queued.profile_id() != &profile_key);
            self.state.queue.push_back(request);
            if let Some(sender) = responder.take() {
                let _ = sender.send(accepted);
            }
            self.publish_status();
            return;
        }
        if cleanup_pending {
            logging!(
                debug,
                Type::Cmd,
                "Cleanup running for {} profile(s); queueing task {} -> {} to run after cleanup without clearing existing requests",
                self.state.cleanup_profiles.len(),
                request.task_id(),
                profile_key
            );
        }
        if let Some(previous) = self
            .state
            .latest_tokens
            .insert(profile_key.clone(), request.cancel_token().clone())
        {
            previous.cancel();
        }
        if let Some(active) = self.state.active.as_mut()
            && active.profile_id() == &profile_key
        {
            active.cancel_token().cancel();
            active.merge_notify(request.notify());
            self.state
                .queue
                .retain(|queued| queued.profile_id() != &profile_key);
            self.state.queue.push_front(request.clone());
            if let Some(sender) = responder.take() {
                let _ = sender.send(accepted);
            }
            self.publish_status();
            return;
        }
        if let Some(active) = self.state.active.as_ref() {
            logging!(
                debug,
                Type::Cmd,
                "Cancelling active switch task {} (profile={}) in favour of task {} -> {}",
                active.task_id(),
                active.profile_id(),
                request.task_id(),
                profile_key
            );
            active.cancel_token().cancel();
        }
        self.state
            .queue
            .retain(|queued| queued.profile_id() != &profile_key);
        self.state.queue.push_back(request.clone());
        if let Some(sender) = responder.take() {
            let _ = sender.send(accepted);
        }
        self.start_next_job();
        self.publish_status();
    }

    fn handle_completion(&mut self, request: SwitchRequest, outcome: SwitchJobOutcome) {
        // Translate the workflow result into an event the frontend can understand.
        let result_record = match &outcome {
            SwitchJobOutcome::Completed { success, .. } => {
                logging!(
                    info,
                    Type::Cmd,
                    "Switch task {} completed (success={})",
                    request.task_id(),
                    success
                );
                if *success {
                    SwitchResultStatus::success(request.task_id(), request.profile_id())
                } else {
                    SwitchResultStatus::failed(request.task_id(), request.profile_id(), None, None)
                }
            }
            SwitchJobOutcome::Panicked { info, .. } => {
                logging!(
                    error,
                    Type::Cmd,
                    "Switch task {} panicked at stage {:?}: {}",
                    request.task_id(),
                    info.stage,
                    info.detail
                );
                SwitchResultStatus::failed(
                    request.task_id(),
                    request.profile_id(),
                    Some(format!("{:?}", info.stage)),
                    Some(info.detail.clone()),
                )
            }
        };
        if let Some(active) = self.state.active.as_ref()
            && active.task_id() == request.task_id()
        {
            self.state.active = None;
        }
        if let Some(latest) = self.state.latest_tokens.get(request.profile_id())
            && latest.same_token(request.cancel_token())
        {
            self.state.latest_tokens.remove(request.profile_id());
        }
        let cleanup = match outcome {
            SwitchJobOutcome::Completed { cleanup, .. } => cleanup,
            SwitchJobOutcome::Panicked { cleanup, .. } => cleanup,
        };
        self.track_cleanup(request.profile_id().clone(), cleanup);
        let event_record = result_record.clone();
        self.state.last_result = Some(result_record);
        notify_completion_waiter(request.task_id(), event_record.clone());
        self.manager.push_event(event_record);
        self.start_next_job();
        self.publish_status();
    }

    fn handle_cleanup_done(&mut self, profile: SmartString) {
        if let Some(handle) = self.state.cleanup_profiles.remove(&profile) {
            handle.abort();
        }
        self.start_next_job();
        self.publish_status();
    }

    fn start_next_job(&mut self) {
        if self.state.active.is_some() || !self.state.cleanup_profiles.is_empty() {
            self.publish_status();
            return;
        }
        while let Some(request) = self.state.queue.pop_front() {
            if request.cancel_token().is_cancelled() {
                self.discard_request(request);
                continue;
            }
            self.state.active = Some(request.clone());
            self.start_switch_job(request);
            break;
        }
        self.publish_status();
    }

    fn track_cleanup(&mut self, profile: SmartString, cleanup: workflow::CleanupHandle) {
        if let Some(existing) = self.state.cleanup_profiles.remove(&profile) {
            existing.abort();
        }
        let driver_tx = self.sender.clone();
        let profile_clone = profile.clone();
        let handle = tokio::spawn(async move {
            let profile_label = profile_clone.clone();
            if let Err(err) = cleanup.await {
                logging!(
                    warn,
                    Type::Cmd,
                    "Cleanup task for profile {} failed: {}",
                    profile_label.as_str(),
                    err
                );
            }
            if let Err(err) = driver_tx
                .send(SwitchDriverMessage::CleanupDone {
                    profile: profile_clone,
                })
                .await
            {
                logging!(
                    error,
                    Type::Cmd,
                    "Failed to push cleanup completion for profile {}: {}",
                    profile_label.as_str(),
                    err
                );
            }
        });
        self.state.cleanup_profiles.insert(profile, handle);
    }

    fn start_switch_job(&self, request: SwitchRequest) {
        // Run the workflow in a background task while the driver keeps processing messages.
        let driver_tx = self.sender.clone();
        let manager = self.manager;
        let completion_request = request.clone();
        let heartbeat = request.heartbeat().clone();
        let cancel_token = request.cancel_token().clone();
        let task_id = request.task_id();
        let profile_label = request.profile_id().clone();
        tokio::spawn(async move {
            let mut watchdog_interval = time::interval(WATCHDOG_TICK);
            watchdog_interval.set_missed_tick_behavior(MissedTickBehavior::Skip);
            let workflow_fut =
                AssertUnwindSafe(workflow::run_switch_job(manager, request)).catch_unwind();
            tokio::pin!(workflow_fut);
            let job_result = loop {
                tokio::select! {
                    res = workflow_fut.as_mut() => {
                        break match res {
                            Ok(Ok(result)) => SwitchJobOutcome::Completed {
                                success: result.success,
                                cleanup: result.cleanup,
                            },
                            Ok(Err(error)) => SwitchJobOutcome::Panicked {
                                info: error.info,
                                cleanup: error.cleanup,
                            },
                            Err(payload) => {
                                let info = SwitchPanicInfo::driver_task(
                                    workflow::describe_panic_payload(payload.as_ref()),
                                );
                                let cleanup = workflow::schedule_post_switch_failure(
                                    profile_label.clone(),
                                    completion_request.notify(),
                                    completion_request.task_id(),
                                );
                                SwitchJobOutcome::Panicked { info, cleanup }
                            }
                        };
                    }
                    _ = watchdog_interval.tick() => {
                        if cancel_token.is_cancelled() {
                            continue;
                        }
                        let elapsed = heartbeat.elapsed();
                        if elapsed > WATCHDOG_TIMEOUT {
                            let stage = SwitchStage::from_code(heartbeat.stage_code())
                                .unwrap_or(SwitchStage::Workflow);
                            logging!(
                                warn,
                                Type::Cmd,
                                "Switch task {} watchdog timeout (profile={} stage={:?}, elapsed={:?}); cancelling",
                                task_id,
                                profile_label.as_str(),
                                stage,
                                elapsed
                            );
                            cancel_token.cancel();
                        }
                    }
                }
            };
            let request_for_error = completion_request.clone();
            if let Err(err) = driver_tx
                .send(SwitchDriverMessage::Completion {
                    request: completion_request,
                    outcome: job_result,
                })
                .await
            {
                logging!(
                    error,
                    Type::Cmd,
                    "Failed to push switch completion to driver: {}",
                    err
                );
                notify_completion_waiter(
                    request_for_error.task_id(),
                    SwitchResultStatus::failed(
                        request_for_error.task_id(),
                        request_for_error.profile_id(),
                        Some("driver".to_string()),
                        Some(format!("completion dispatch failed: {}", err)),
                    ),
                );
            }
        });
    }

    /// Publish a cancellation result for a request that a newer request superseded.
    fn discard_request(&mut self, request: SwitchRequest) {
        let key = request.profile_id().clone();
        let should_remove = self
            .state
            .latest_tokens
            .get(&key)
            .map(|latest| latest.same_token(request.cancel_token()))
            .unwrap_or(false);
        if should_remove {
            self.state.latest_tokens.remove(&key);
        }
        if !request.cancel_token().is_cancelled() {
            request.cancel_token().cancel();
        }
        let event = SwitchResultStatus::cancelled(
            request.task_id(),
            request.profile_id(),
            Some("request superseded".to_string()),
        );
        self.state.last_result = Some(event.clone());
        notify_completion_waiter(request.task_id(), event.clone());
        self.manager.push_event(event);
    }

    fn publish_status(&self) {
        self.manager.set_status(self.state.snapshot(self.manager));
    }
}

impl SwitchDriverState {
    /// Build a lightweight status snapshot suitable for sharing across the command boundary.
    fn snapshot(&self, manager: &SwitchManager) -> ProfileSwitchStatus {
        let active = self
            .active
            .as_ref()
            .map(|req| SwitchTaskStatus::from_request(req, false));
        let queue = self
            .queue
            .iter()
            .map(|req| SwitchTaskStatus::from_request(req, true))
            .collect::<Vec<_>>();
        let cleanup_profiles = self
            .cleanup_profiles
            .keys()
            .map(|key| key.to_string())
            .collect::<Vec<_>>();
        ProfileSwitchStatus {
            is_switching: manager.is_switching(),
            active,
            queue,
            cleanup_profiles,
            last_result: self.last_result.clone(),
            last_updated: current_millis(),
        }
    }
}