Using precision reduction to efficiently improve mixed-precision GPUs reliability
Visualizar/abrir
Data
2021Autor
Orientador
Co-orientador
Nível acadêmico
Graduação
Assunto
Abstract
Duplication With Comparison (DWC) is a traditional and accepted method for improving systems’ reliability. DWC consists of duplicating critical regions in Software or in Hardware level by creating redundant operations in order to decrease the probability of an unwanted event. However, this technique introduces an expensive overhead in power consumption, processing time and in resources allocation. This obstacle is due to the fact that the critical operations are computed at least two times in t ...
Duplication With Comparison (DWC) is a traditional and accepted method for improving systems’ reliability. DWC consists of duplicating critical regions in Software or in Hardware level by creating redundant operations in order to decrease the probability of an unwanted event. However, this technique introduces an expensive overhead in power consumption, processing time and in resources allocation. This obstacle is due to the fact that the critical operations are computed at least two times in this process. Reduced Precision Duplication With Comparison (RP-DWC) is an effective software level solution to improve the performance of the conventional DWC. RP-DWC aims to mitigate these overheads by enabling parallel processing in underused Floating Point Units (FPUs) in mixed precision Graphic Processing Units (GPUs). By making use of precision reduction to efficiently improve the reliability in mixed precision GPUs, RPDWC extends the DWC technique, introducing proper ways to handle redundancy with different precision operations. Improving GPUs reliability is an extremely valuable challenge in the fault tolerance field since GPUs are adopted in both High-Performance Computing (HPC) and in automotive real-time applications. When GPUs are exposed to a natural environment, such as the surface of the Earth at sea level, they are also exposed to the Earth’s surface radiation. Furthermore, this exposure can be critical, given that these radiation particles may hit the GPU’s internal circuit, corrupt sensitive data and consequently generate undesired outputs. Introducing duplication with reduced precision in a trustworthy manner to maintain reliability in safety-critical systems is an arduous task that we propose to further investigate in this work. ...
Instituição
Universidade Federal do Rio Grande do Sul. Instituto de Informática. Curso de Ciência da Computação: Ênfase em Engenharia da Computação: Bacharelado.
Coleções
-
TCC Ciência da Computação (1024)
Este item está licenciado na Creative Commons License