1. RADICE: Causal Graph Based Root Cause Analysis for System Performance Diagnostic
- Author
-
Tonon, Andrea, Zhang, Meng, Caglayan, Bora, Shen, Fei, Gui, Tong, Wang, MingXue, and Zhou, Rong
- Subjects
Computer Science - Software Engineering - Abstract
Root cause analysis is one of the most crucial operations in software reliability regarding system performance diagnostic. It aims to identify the root causes of system performance anomalies, allowing the resolution or the future prevention of issues that can cause millions of dollars in losses. Common existing approaches relying on data correlation or full domain expert knowledge are inaccurate or infeasible in most industrial cases, since correlation does not imply causation, and domain experts may not have full knowledge of complex and real-time systems. In this work, we define a novel causal domain knowledge model representing causal relations about the underlying system components to allow domain experts to contribute partial domain knowledge for root cause analysis. We then introduce RADICE, an algorithm that through the causal graph discovery, enhancement, refinement, and subtraction processes is able to output a root cause causal sub-graph showing the causal relations between the system components affected by the anomaly. We evaluated RADICE with simulated data and reported a real data use case, sharing the lessons we learned. The experiments show that RADICE provides better results than other baseline methods, including causal discovery algorithms and correlation based approaches for root cause analysis., Comment: Accepted at IEEE SANER 2025
- Published
- 2025