Explainable Artificial Intelligence (XAI) is essential for deploying complex Computer Vision (CV) models in domains such as medical diagnosis, where transparency and accountability are required. This paper explores a hybrid interpretability framework that balances faithfulness (how well an explanation matches the model's decision) against computational efficiency. We assess three main families of XAI methods: attribution-based (Grad-CAM), perturbation-based (RISE), and transformer-based attention methods. Our evaluation shows that perturbation-based methods such as RISE achieve the highest fidelity (Insertion AUC 0.727, Pointing Game accuracy 91.9%) but are too slow for real-time clinical use (0.05 FPS). Transformer-based attention methods, by contrast, align more closely with expert annotations in medical tasks (IoU 0.099) and run at a moderate speed (25.0 FPS). We propose combining the localisation accuracy of attention-based methods with the efficiency required in clinical settings to produce high-quality, clinically useful saliency maps for medical diagnosis.
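To make the fidelity metric concrete, the following is a minimal sketch of the insertion-style evaluation behind Insertion AUC: pixels are revealed onto a baseline image in order of decreasing saliency, and the model's confidence is averaged over the resulting curve. The `model_fn`, image shapes, and step count here are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

def insertion_auc(model_fn, image, saliency, baseline, steps=20):
    """Insertion-AUC sketch: progressively reveal the most salient
    pixels on top of a baseline and average the model's confidence.
    `model_fn` is a hypothetical callable mapping an image (H, W)
    to a scalar class probability."""
    h, w = saliency.shape
    order = np.argsort(saliency.ravel())[::-1]  # most salient pixels first
    canvas = baseline.copy()
    scores = [model_fn(canvas)]
    chunk = max(1, (h * w) // steps)
    for i in range(0, h * w, chunk):
        idx = order[i:i + chunk]
        # copy the next batch of salient pixels from the real image
        canvas.reshape(-1)[idx] = image.reshape(-1)[idx]
        scores.append(model_fn(canvas))
    # the mean of the confidence curve approximates its AUC
    return float(np.mean(scores))

# Toy demo: the "model" scores the mean intensity of the top-left quadrant,
# and the saliency map correctly flags that quadrant.
rng = np.random.default_rng(0)
img = rng.random((8, 8))
sal = np.zeros((8, 8))
sal[:4, :4] = 1.0
model = lambda x: float(x[:4, :4].mean())
print(round(insertion_auc(model, img, sal, np.zeros((8, 8))), 3))
```

A faithful saliency map reveals the decision-relevant pixels early, so the confidence curve rises quickly and the resulting AUC is higher than for a random map over the same image.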