In the era of Industry 4.0, fault tolerance is essential for maintaining the robustness and resilience of industrial systems facing unforeseen or undesirable disturbances. Current methodologies for the fault tolerance stages, namely detection, diagnosis, and recovery, have not kept pace with the accelerated technological evolution of the past two decades. Given the advent of digital technologies such as the Internet of Things, cloud and edge computing, and artificial intelligence, together with enhanced computational processing and communication capabilities, local or monolithic centralized fault tolerance methodologies are out of sync with contemporary and future systems. Consequently, these methodologies cannot fully realize the benefits that integrating these technologies enables, such as improvements in accuracy and performance.
Accordingly, this paper proposes a collaborative fault tolerance methodology for cyber–physical systems, named Collaborative Fault * (CF*). The methodology takes advantage of the inherent data analysis and communication capabilities of cyber–physical components. It is based on multi-agent system principles: key components are self-fault-tolerant and adopt collaborative and distributed intelligent behavior when necessary to improve their fault tolerance capabilities. Experiments were conducted focusing on the fault detection stage for temperature and humidity sensors in warehouse racks. The experimental results confirmed that CF* improves accuracy and performance compared with the local methodology and is competitive with a centralized approach.
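
To make the local-first, collaborate-when-necessary idea concrete, the following is a minimal illustrative sketch and not the CF* implementation. It assumes a hypothetical `SensorAgent` class, a simple standard-deviation score for local detection, and a comparison against the median of neighbouring sensors when the local result is ambiguous. All names, thresholds, and the detection rule itself are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from statistics import mean, median, pstdev


@dataclass
class SensorAgent:
    """Hypothetical agent wrapping one temperature sensor on a warehouse rack."""
    name: str
    history: list[float] = field(default_factory=list)
    neighbours: list["SensorAgent"] = field(default_factory=list)

    def local_score(self, reading: float) -> float:
        """Distance of a reading from this agent's own history, in standard deviations."""
        spread = pstdev(self.history) or 1e-9  # guard against a zero spread
        return abs(reading - mean(self.history)) / spread

    def detect(self, reading: float, clear: float = 2.0,
               faulty: float = 4.0, tolerance: float = 0.5) -> str:
        """Decide locally when the score is clear-cut; otherwise consult neighbours.

        The thresholds `clear`, `faulty`, and `tolerance` are assumed values.
        """
        score = self.local_score(reading)
        if score < clear:
            return "normal (local)"
        if score > faulty:
            return "fault (local)"
        if not self.neighbours:
            return "suspect (local)"  # ambiguous, and nobody to ask
        # Ambiguous case: compare with the neighbours' latest readings.
        peer_value = median(n.history[-1] for n in self.neighbours)
        if abs(reading - peer_value) > tolerance:
            return "fault (collaborative)"
        return "normal (collaborative)"


# Usage: one agent with two neighbours that also report a warmer rack.
a = SensorAgent("a", history=[20.0, 20.5, 19.5, 20.0])
b = SensorAgent("b", history=[20.0, 21.1])
c = SensorAgent("c", history=[20.0, 20.9])
a.neighbours = [b, c]
print(a.detect(20.2))  # clear-cut, decided locally
print(a.detect(21.0))  # ambiguous locally, resolved with neighbours
print(a.detect(25.0))  # clear-cut, decided locally
```

In this sketch the agent only communicates when its own evidence is inconclusive, which is one plausible reading of "collaborative behavior when necessary"; a real deployment would also need message passing, trust in neighbours, and handling of simultaneous faults.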