SecureVision: Real-Time Multimodal Cyber Deepfake Identification System
**DOI:** https://doi.org/10.5281/zenodo.19017743
Dr. R. Kaviarasan, D. Mahammad Rafi, K. Thulasi Teja, C. Devendra Obulareddy, 2026, SecureVision: Real-Time Multimodal Cyber Deepfake Identification System, INTERNATIONAL JOURNAL OF ENGINEERING RESEARCH & TECHNOLOGY (IJERT) Volume 15, Issue 03, March – 2026
* **Open Access**
* **Authors:** Dr. R. Kaviarasan, D. Mahammad Rafi, K. Thulasi Teja, C. Devendra Obulareddy
* **Paper ID:** IJERTV15IS030284
* **Volume & Issue:** Volume 15, Issue 03, March – 2026
* **Published (First Online):** 14-03-2026
* **ISSN (Online):** 2278-0181
* **Publisher Name:** IJERT
* **License:** This work is licensed under a Creative Commons Attribution 4.0 International License
#### SecureVision: Real-Time Multimodal Cyber Deepfake Identification System
Dr. R. Kaviarasan
Associate Professor, Dept of CSE(CS), RGM College of Engineering and Technology, Nandyal, AP
D. Mahammad Rafi
UG Scholar Dept of CSE(CS) RGM College of Engineering and Technology, Nandyal, AP
K. Thulasi Teja
UG Scholar Dept of CSE(CS) RGM College of Engineering and Technology, Nandyal, AP
C. Devendra Obulareddy
UG Scholar Dept of CSE(CS) RGM College of Engineering and Technology, Nandyal, AP
Abstract – Deepfake technology has rapidly evolved into a serious cybersecurity concern, making it possible to create highly convincing fake audio and video content that is difficult to distinguish from real media. These manipulations can lead to misinformation, identity theft, and financial fraud. To address this growing challenge, this project introduces SecureVision, a smart and reliable multimodal deepfake detection framework. SecureVision combines deep learning, self-supervised learning, Vision Transformers (ViT), and big data analytics to build a strong defense against digital manipulation. Instead of analyzing only one type of media, the system simultaneously examines both audio and images, improving overall detection accuracy and reliability. For audio deepfake detection, the model leverages SpecRNet architecture, while image classification is performed using a Vision Transformer-based approach.
The system is trained on large-scale datasets such as ASVspoof 2021, multilingual audio datasets, and diverse web-scraped facial image collections. Experimental results show promising performance, achieving 92.34% accuracy for audio detection and 89.35% for image detection. Despite its advanced capabilities, SecureVision is designed to operate efficiently with moderate GPU requirements. Overall, the framework offers a scalable, practical, and real-world solution to combat the increasing threat of deepfake attacks.
Keywords – Deepfake videos; Multimodal Learning; Vision Transformer (ViT); SpecRNet
1. INTRODUCTION
SecureVision shows strong performance, but it still has some important limitations [1][3]. Even though it achieves around 89.35% accuracy in image detection, this is slightly lower than some top-performing deepfake models [2], and the training results indicate possible overfitting, which means it might not perform as well on completely new data. The audio dataset mainly focuses on ten Indian languages, so the system may face challenges when analyzing voices from other parts of the world. Additionally, since some images were collected from the web, there may be noise or incorrect labels in the dataset, which can affect reliability and consistency. From a practical standpoint, the system needs about 8GB of RAM and GPU support, making it harder to run on low-power devices, and because it has only been tested in controlled settings, its real-time performance at large scale is still uncertain.
SecureVision combines deep learning, large-scale data processing, and cybersecurity features into one integrated system [3]. It analyzes multilingual audio using neural networks to detect synthetic speech and applies a Vision Transformer model to identify manipulated facial images. Supported by large datasets and protected with security measures like authentication and encryption, it functions as both a detection tool and a secure platform.
The paper highlights that combining audio and visual analysis improves deepfake detection accuracy while remaining practical without high-end hardware. It also shows strong potential for reducing fraud, misinformation, and fake media, while encouraging future improvements such as real-time detection and broader language coverage [5] [6].
Challenges and Issues
* Rapid Evolution of Deepfakes: Deepfake technology is advancing very quickly, creating more realistic fake videos and voices. Because of this, detection systems can become outdated and must be updated regularly to remain effective.
* Generalization Issues: A model might perform well during testing, but in real-world situations with different lighting, accents, or background noise, its accuracy may decrease.
* Limited Dataset Diversity & Overfitting: If the training data lacks enough variety, the system may not work equally well for all users. Overfitting can also cause the model to memorize training data instead of learning general patterns.
* High Hardware & Real-Time Challenges: Many detection systems require powerful GPUs and fast processors, making it difficult to use them on low-end devices or for live, large-scale monitoring.
* Accuracy, Security & Ethical Risks: Incorrect results, targeted attacks to bypass detection, the need for continuous retraining, and concerns about privacy and data protection remain significant challenges.
Highlights of Audio and Image Deepfake Detection
* The system combines advanced audio analysis (SpecRNet with LFCC and Whisper features) and image analysis (Vision Transformer) trained on large multilingual datasets, allowing it to understand deeper patterns and generalize better to real-world data.
* Previous models often used limited datasets, required expensive GPUs, focused only on small visual details, supported only one language in audio, and depended on heavy manual labelling.
* The proposed system is more reliable, reaching 92.34% accuracy for audio and 89.35% for images, along with high precision, a strong F1-score, and a high AUC score, meaning it can correctly distinguish real and fake content with very few mistakes.
The remaining sections of the paper are organized as follows: Section I introduces the problem and its challenges and issues. Section II discusses the literature survey with its pros and cons. Section III presents the highlights of the proposed method. Section IV discusses the experimental results and simulation environment. Section V presents the conclusion and future enhancements, followed by references.
2. LITERATURE SURVEY
1. Naresh Kumar and Ankit Kundu (2024) proposed a multimodal deepfake detection framework named SecureVision, which integrates Vision Transformer (ViT) for video frame analysis and SpecRNet for audio spoof detection. In this approach, facial frames are first extracted from videos and preprocessed before being fed into the Vision Transformer [1], where images are divided into patches to capture global contextual relationships effectively. At the same time, the corresponding audio signals are transformed into LFCC-based spectrogram features and processed through SpecRNet to identify spectral inconsistencies commonly found in synthetic speech [7][8]. The features extracted from both visual and audio modalities are then fused to perform the final classification, improving detection reliability. The experimental results demonstrated 92.34% accuracy for audio detection and 89.35% accuracy for video detection, showing improved robustness in multimodal deepfake scenarios. The main advantages of this framework include its ability to detect both audio and video manipulations, scalability for big data environments, and strong feature representation capability. However, the approach has certain limitations, such as the requirement for large labeled datasets, high training time, and significant computational cost due to the complexity of transformer-based architectures.
2. Xin Wang and Junichi Yamagishi (2022) proposed a Self-Supervised Spoof Detection method that leverages large amounts of unlabeled speech data to learn robust speech representations before fine-tuning the model for spoof detection tasks. Instead of relying entirely on labeled data, the model first undergoes self-supervised pretraining to capture intrinsic speech characteristics and then applies anomaly detection techniques to distinguish between genuine and spoofed speech [2]. This approach improves feature learning efficiency and reduces dependence on manually annotated datasets. The method was evaluated using the ASVspoof 2021 benchmark dataset, where it achieved an Equal Error Rate (EER) of less than 5%, demonstrating strong detection
performance. However, the system shows limitations when exposed to unseen spoofing attacks that differ from the training distribution, and it may face domain adaptation challenges when applied to different recording environments or speech conditions.
3. Alexei Baevski et al. (2020) introduced wav2vec 2.0, a self-supervised learning framework that extracts rich contextual speech embeddings directly from raw waveform inputs using transformer-based encoders [9]. Unlike traditional methods that rely on handcrafted features such as MFCC, wav2vec 2.0 learns latent speech representations through large-scale pretraining and then fine-tunes them for downstream tasks like spoof detection. The model was initially pretrained on the LibriSpeech corpus and later fine-tuned on the ASVspoof 2019 dataset for spoof detection tasks [7][8]. Experimental results demonstrated significant performance improvements compared to conventional MFCC-based systems, while also reducing the requirement for large amounts of labeled data. However, the approach has limitations, including heavy pretraining computational cost and high hardware demand, making it resource-intensive for real-time or low-resource environments.
4. Jung Jee-weon et al. (2022) proposed AASIST (Audio Anti-Spoofing using Integrated Spectro-Temporal Graph Attention Networks), a deep learning framework designed to enhance spoof speech detection by modeling both spectral and temporal dependencies. In this method, input speech signals are first converted into spectrogram representations, which are then transformed into graph structures to capture relationships between time-frequency components. A Graph Attention Network (GAT) is applied to learn discriminative spoof patterns by assigning adaptive importance weights to different nodes in the graph. The model was evaluated on the ASVspoof 2019 and ASVspoof 2021 datasets, achieving above 95% detection accuracy, demonstrating strong robustness against various spoofing attacks [10]. However, the architecture is relatively complex due to the integration of graph-based learning mechanisms, and it may suffer from slower real-time inference performance because of high computational and memory requirements.
5. Hemlata Tak et al. (2021) proposed RawNet2, an end-to-end deep learning model designed for spoof speech detection by directly processing raw waveform signals without relying on handcrafted acoustic features. The architecture employs deep convolutional neural network (CNN) layers to automatically learn discriminative representations from the raw audio input, enabling the model to capture subtle artifacts introduced by spoofing techniques. By eliminating traditional feature extraction methods such as MFCC, RawNet2 allows the network to learn task-specific features directly from the signal domain. The model was evaluated on the ASVspoof 2019 dataset, where it achieved approximately 94% accuracy, demonstrating strong detection capability [7][11]. However, the system is sensitive to background noise and may experience performance degradation under channel mismatch conditions, such as variations in recording devices or transmission environments.
6. Junichi Yamagishi et al. (2021) introduced the ASVspoof Evaluation Framework, a standardized benchmark platform designed to evaluate automatic speaker verification (ASV)
systems against spoofing attacks. The framework provides well-structured datasets, clearly defined protocols, and standardized evaluation metrics to ensure fair comparison among different spoof detection approaches. It primarily uses datasets released under the ASVspoof Challenge, which include various types of spoofing attacks such as text-to-speech (TTS), voice conversion (VC), and replay attacks. The experimental results are reported using Equal Error Rate (EER) as the primary evaluation metric, enabling consistent performance comparison across research works. Although the framework significantly improves benchmarking consistency and research reproducibility, it has limitations such as limited real-world diversity in attack scenarios and potential dataset bias that may not fully represent practical deployment environments.
7. Parth Patel et al. (2020) proposed Trans-DF, a transfer learning-based deepfake detection framework that utilizes pre-trained convolutional neural network (CNN) models fine-tuned specifically for manipulated face detection. In this approach, a CNN pre-trained on large-scale image datasets is adapted to detect deepfake artifacts by learning discriminative facial manipulation features [12]. Transfer learning helps reduce training time and improves performance when labeled data is limited. The model was evaluated on the FaceForensics++ and Celeb-DF datasets, achieving around 90% detection accuracy. While the method benefits from faster convergence and efficient feature reuse, it has certain limitations, including overfitting to specific datasets and limited generalization performance when tested across different datasets or unseen manipulation techniques.
8. Umur Aybars Ciftci et al. (2020) proposed FakeCatcher, a deepfake detection approach that leverages biological signals to identify manipulated videos. Instead of relying solely on visual artifacts, the method analyzes subtle photoplethysmography (PPG) signals (variations in facial blood flow patterns) captured from video frames [14][15]. Authentic videos naturally contain consistent pulse signals across facial regions, whereas deepfake videos often fail to replicate these physiological patterns accurately. The model was evaluated using the FaceForensics++ and Celeb-DF datasets, achieving approximately 96% detection accuracy. Although FakeCatcher demonstrates high effectiveness and robustness against visual manipulation techniques, it has limitations, including the requirement for high-resolution and high-quality video to accurately extract biological signals, as well as increased computational cost due to complex signal processing and analysis.
9. Andreas Rössler et al. (2019) introduced FaceForensics++, a large-scale benchmark dataset designed to support research in face manipulation detection. The dataset contains a wide variety of manipulated videos generated using different facial manipulation techniques, enabling researchers to train and evaluate convolutional neural network (CNN) models effectively [12]. By providing both original and tampered video samples with varying compression levels, the dataset facilitates robust training and fair comparison among deepfake detection methods. The study utilized the FaceForensics++ dataset, and models trained on it achieved accuracy levels ranging from 85% to 95%, depending on the architecture and manipulation type. The primary advantages of this work
include establishing a standardized benchmark dataset and incorporating multiple manipulation methods for comprehensive evaluation. However, its limitations include a focus primarily on face-based manipulations and limited real-world diversity, which may affect generalization to more complex, real-world deepfake scenarios.
10. Joel Frank and Lea Schönherr (2021) proposed WaveFake, a spoof speech detection approach that focuses on identifying synthetic audio by analyzing frequency-domain artifacts introduced by generative speech models. The method examines spectral inconsistencies and abnormal frequency patterns that commonly occur in AI-generated speech but are less prevalent in genuine human recordings. By leveraging signal processing techniques along with machine learning classifiers, the system distinguishes between real and fake audio samples. The model was evaluated on the WaveFake Dataset and ASVspoof 2019 datasets, achieving around 90% detection accuracy on known speech generation models [7][16]. However, the approach has limitations, including poor generalization to unseen or newly developed generative models and sensitivity to audio compression artifacts, which may reduce detection performance in real-world scenarios.
3. PROPOSED METHODOLOGY
SecureVision is a multimodal framework for detecting deepfakes by analyzing both audio and image content together [17]. Instead of relying on a single type of media, the system strengthens detection accuracy by combining advanced deep learning models with big data analytics, making it more robust against modern deepfake techniques.
For audio deepfake detection, the system uses the SpecRNet architecture, which integrates Whisper-based embeddings with LFCC (Linear Frequency Cepstral Coefficients) features extracted from multilingual and ASVspoof datasets [7]. When
an input audio signal is received, it first undergoes signal processing steps such as Short-Time Fourier Transform (STFT), filter bank analysis, and Discrete Cosine Transform (DCT) to extract meaningful spectral representations:

LFCC(x) = DCT( log( FilterBank( |STFT(x)|^2 ) ) )

These extracted features form a vector x, which is then passed into a neural network classifier. The model calculates the probability of the audio being real or fake using the softmax function:

P(y = k | x) = exp(z_k) / Σ_j exp(z_j)

where z_k is the logit for class k. To train the model effectively, Cross-Entropy Loss is used to measure the difference between predicted and actual labels:

L = − Σ_i y_i · log(ŷ_i)

Here, y_i represents the true class label (real or fake) and ŷ_i the predicted probability. This process allows the system to learn subtle inconsistencies present in synthetic audio.
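The feature-extraction chain described above (STFT, linear filter bank analysis, and DCT) can be sketched in Python with NumPy and SciPy. The frame length, hop size, filter count, and coefficient count below are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np
from scipy.fftpack import dct

def lfcc(signal, n_fft=512, hop=160, n_filters=20, n_coeffs=13):
    """Toy LFCC extractor: STFT -> linear filter bank -> log -> DCT.

    All hyperparameters here are illustrative, not the paper's values.
    """
    # Frame the signal with a Hann window and take the power spectrum (STFT).
    window = np.hanning(n_fft)
    frames = [signal[i:i + n_fft] * window
              for i in range(0, len(signal) - n_fft, hop)]
    spectrum = np.abs(np.fft.rfft(np.array(frames), axis=1)) ** 2

    # Linearly spaced triangular filters (unlike the mel scale used by MFCC).
    edges = np.linspace(0, n_fft // 2, n_filters + 2).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        fbank[m - 1, lo:mid] = np.linspace(0, 1, mid - lo, endpoint=False)
        fbank[m - 1, mid:hi] = np.linspace(1, 0, hi - mid, endpoint=False)

    # Log filter-bank energies, then DCT to decorrelate into cepstral coefficients.
    energies = np.log(spectrum @ fbank.T + 1e-10)
    return dct(energies, type=2, axis=1, norm='ortho')[:, :n_coeffs]

features = lfcc(np.random.randn(16000))  # one second of audio at 16 kHz
print(features.shape)                    # (frames, coefficients)
```

In the full system, these coefficients would be concatenated with Whisper embeddings before classification, as described in Algorithm 1.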
For image deepfake detection, the system employs a Vision
Transformer (ViT) model [1]. An input image is divided into smaller fixed-size patches. These patches are flattened and converted into embeddings before being processed through the transformer architecture. The core of ViT is the self-attention mechanism:

Attention(Q, K, V) = softmax( QKᵀ / √d_k ) V

Here, Q, K, and V represent the query, key, and value matrices derived from image embeddings, and d_k is the dimensionality of the key vectors. This attention mechanism helps the model focus on important spatial relationships and detect subtle visual manipulations. The final output is passed through a fully connected layer with softmax activation, and the model is optimized using the same cross-entropy loss as the audio branch:

L = − Σ_i y_i · log(ŷ_i)
Finally, the system combines predictions from both audio and image models to make a more reliable decision. The fusion strategy balances both modalities using:

P_final = α · P_audio + (1 − α) · P_image

where α ∈ [0, 1] controls the contribution of each modality. By integrating multimodal learning, transformer-based architectures, and large-scale data processing, SecureVision achieves strong detection performance (92.34% accuracy for audio and 89.35% for image). This combined approach improves scalability, adaptability, and overall cybersecurity resilience against increasingly sophisticated deepfake attacks.
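The self-attention computation at the core of the ViT branch can be sketched in NumPy as follows; the token count and embedding dimension are illustrative choices, not the paper's configuration:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # pairwise token similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # numerically stable softmax
    return weights @ V

rng = np.random.default_rng(0)
tokens = rng.normal(size=(197, 64))      # e.g. 196 patch tokens + 1 [CLS] token
out = attention(tokens, tokens, tokens)  # self-attention: Q = K = V = embeddings
print(out.shape)
```

A real ViT applies learned linear projections to form Q, K, and V and uses multiple attention heads; this single-head sketch only shows the core formula.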
Algorithm 1: Audio Deepfake Detection
1. audio_signal ← Load(A)
2. cleaned_signal ← Preprocess(audio_signal)
3. spectral_features ← STFT(cleaned_signal)
4. lfcc_features ← Compute_LFCC(spectral_features)
5. whisper_embeddings ← Extract_Whisper(cleaned_signal)
6. feature_vector ← Concatenate(lfcc_features, whisper_embeddings)
7. logits ← SpecRNet_Model(feature_vector)
8. probabilities ← Softmax(logits)
9. if probabilities[FAKE] > probabilities[REAL] then
10. return "FAKE"
11. else
12. return "REAL"
13. end if
The Audio Deepfake Detection algorithm begins by loading the input audio file and performing preprocessing steps such as noise removal and normalization to improve signal quality. The cleaned audio signal is then converted into the frequency domain using the Short-Time Fourier Transform (STFT) to capture important time-frequency characteristics. From these spectral representations, LFCC features are computed to model detailed acoustic patterns that may indicate manipulation. At the same time, Whisper embeddings are extracted to capture high-level contextual and speech
representations from the audio. Both LFCC features and Whisper embeddings are combined to form a single comprehensive feature vector [17]. This fused feature vector is then passed into the SpecRNet deep learning model for classification. The model generates output scores (logits), which are converted into probabilities using the Softmax function. Finally, the algorithm compares the probabilities of the REAL and FAKE classes and returns the label corresponding to the higher probability, thereby determining whether the audio is genuine or deepfake.
Algorithm 2: Image Deepfake Detection
1. image ← Load(I)
2. image ← Resize_Normalize(image)
3. patches ← Split_into_Patches(image)
4. embeddings ← Linear_Projection(patches)
5. embeddings ← Add_Positional_Encoding(embeddings)
6. transformer_output ← Vision_Transformer(embeddings)
7. cls_token ← Extract_CLS(transformer_output)
8. logits ← FullyConnected(cls_token)
9. probabilities ← Softmax(logits)
10. if probabilities[FAKE] > probabilities[REAL] then
11. return "FAKE"
12. else
13. return "REAL"
14. end if
The Image Deepfake Detection algorithm begins by loading the input image and performing preprocessing steps such as resizing and normalization to ensure a consistent input format. The image is then divided into fixed-size patches, and each patch is flattened and converted into embeddings. Positional encoding is added so the model can retain spatial information about patch locations. These embeddings are passed through a Vision Transformer encoder, where multi-head self-attention captures relationships between different image regions. The classification token output is then processed through a fully connected layer with softmax to compute class probabilities. Finally, the image is labeled as REAL or FAKE based on the highest predicted probability.
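The patch-splitting and embedding steps (Algorithm 2, steps 3–5) can be sketched in NumPy; the 16×16 patch size, 64-dimensional projection, and random weights are illustrative stand-ins for the learned parameters of a real ViT:

```python
import numpy as np

def patchify(image, patch=16):
    """Split an H x W x C image into flattened non-overlapping patches."""
    H, W, C = image.shape
    rows, cols = H // patch, W // patch
    return (image[:rows * patch, :cols * patch]
            .reshape(rows, patch, cols, patch, C)
            .swapaxes(1, 2)                      # group pixels by patch
            .reshape(rows * cols, patch * patch * C))

rng = np.random.default_rng(0)
img = rng.random((224, 224, 3))              # a 224 x 224 RGB input
patches = patchify(img)                      # 14 * 14 = 196 patches of 16*16*3
W_proj = rng.normal(size=(768, 64)) * 0.02   # stand-in for the learned projection
pos = rng.normal(size=(196, 64)) * 0.02      # stand-in positional encodings
embeddings = patches @ W_proj + pos          # patch embeddings fed to the encoder
print(patches.shape, embeddings.shape)
```

In a trained ViT, W_proj and the positional encodings are learned, and a [CLS] token is prepended before the transformer encoder.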
Algorithm 3: MultimodalFusion(P_audio, P_image, alpha)
1. P_final ← alpha * P_audio + (1 – alpha) * P_image
2. if P_final[FAKE] > P_final[REAL] then
3. return “FAKE”
4. else
5. return “REAL”
6. end if
The Multimodal Fusion Decision algorithm combines the prediction probabilities obtained from both the audio and image detection models. A weighted average of these two scores is calculated, where a parameter alpha controls how much importance is given to each modality. The combined class scores are then compared to determine authenticity. If the final probability indicates a higher likelihood of manipulation, the content is labeled as FAKE; otherwise, it is classified as REAL. This fusion approach improves reliability by leveraging evidence from both audio and visual sources.
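Algorithm 3 reduces to a few lines of Python; the weight value used here (alpha = 0.6) is an illustrative assumption, not the paper's tuned setting:

```python
def fuse(p_audio, p_image, alpha=0.6):
    """Weighted late fusion of per-class probabilities (Algorithm 3).

    p_audio, p_image: dicts with 'REAL' and 'FAKE' probabilities.
    alpha: contribution of the audio modality (illustrative value).
    """
    p_final = {c: alpha * p_audio[c] + (1 - alpha) * p_image[c]
               for c in ('REAL', 'FAKE')}
    return 'FAKE' if p_final['FAKE'] > p_final['REAL'] else 'REAL'

# Audio is confident the clip is fake; the image model mildly disagrees.
print(fuse({'REAL': 0.1, 'FAKE': 0.9}, {'REAL': 0.6, 'FAKE': 0.4}))  # FAKE
```

With alpha = 0.6 the audio evidence dominates, so the example resolves to FAKE (0.6·0.9 + 0.4·0.4 = 0.70 versus 0.30 for REAL).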
The proposed system is a smart multi-modal deepfake detection framework that combines audio and image analysis with big data and cybersecurity support. It starts with an input layer where audio and image data are collected from various datasets. In the preprocessing stage, audio files are cleaned by removing noise, segmenting waveforms, and extracting important features like LFCC and other spectral characteristics. At the same time, images are resized, normalized, and enhanced using data augmentation techniques. After preprocessing, the refined data are sent to specialized deep learning models. The audio branch uses the SpecRNet model with self-supervised learning to detect manipulated voice content. The image branch applies a Vision Transformer (ViT) model to identify visual deepfakes. The results from both branches are then combined using a multimodal fusion strategy, which improves overall detection accuracy and reliability.
To handle large-scale data efficiently, the system integrates a big data layer for scalability. A cybersecurity layer is also included to ensure secure authentication and protect sensitive information. Finally, the system provides a clear REAL or FAKE output with high accuracy and efficient resource usage.
4. EXPERIMENTAL RESULTS
The deepfake detection system was built using a smart combination of modern programming tools and powerful deep learning frameworks to achieve both high accuracy and real-world usability. The development was mainly carried out in Python because of its flexibility and strong support for artificial intelligence applications. PyTorch was chosen as the primary framework for building and training the models, while TensorFlow was used in certain stages to compare and validate results. For audio analysis, Librosa helped extract important sound features, and OpenCV was used to preprocess images through resizing and normalization. The Vision Transformer model was implemented using the HuggingFace Transformers library, which made transfer learning efficient and practical.
For audio deepfake detection, the system adopted the SpecRNet architecture, combining LFCC and Whisper-based features to strengthen multilingual and spoof detection capability. The models were trained using Adam and SGD optimizers with Cross-Entropy Loss to ensure accurate classification between real and fake samples. On the image side, a Vision Transformer (ViT) model with pretrained weights improved detection performance and generalization through data augmentation techniques [1][18]. Beyond model accuracy, the system also emphasized security by integrating multi-factor authentication with OTP-based login in a web platform. Model checkpointing was included to allow future updates and retraining as deepfake techniques evolve. Overall, the implementation balances innovation, efficiency, and security, making the system reliable and suitable for practical cybersecurity deployment.
The system was trained and evaluated in a practical and resource-conscious simulation environment to demonstrate its real-world applicability. The experiments were conducted on a 64-bit Windows 11 Home operating system powered by an 11th Generation Intel Core i5 processor. The system was equipped with 8 GB of RAM, a 512 GB SSD for fast storage access, and an integrated Intel HD Graphics 620 GPU. Rather than relying on high-end dedicated GPUs, the model was intentionally tested on moderate hardware to assess its efficiency and deployment feasibility.
One of the most significant observations from this setup is that the proposed model achieved high detection accuracy even with moderate GPU resources. This highlights the computational efficiency of the architecture and confirms that the system does not require expensive hardware to function effectively. As a result, SecureVision is suitable for real-time deployment in resource-limited environments such as small organizations, educational institutions, and mid-scale enterprises, making it both cost-effective and scalable for practical cybersecurity applications.
For the audio dataset, data were collected from reliable and widely used sources such as ASVspoof 2021, a multilingual dataset covering 10 Indian languages, VoxCeleb2, and LibriSpeech [1][18]. In total, 60,000 audio samples were used, with 70% (42,000 samples) allocated for training and 30% (18,000 samples) reserved for testing. The dataset was carefully balanced, maintaining an equal distribution of real and fake samples to avoid bias during model training. To enhance model robustness and simulate real-world variations, several augmentation techniques were applied, including pitch shifting, noise addition, and time stretching. These methods helped the model learn to detect deepfakes even under different recording conditions and distortions [8].
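As an illustration of the noise-addition augmentation mentioned above, the following NumPy sketch adds white Gaussian noise at a chosen signal-to-noise ratio; the 20 dB target is an assumed value, not taken from the paper:

```python
import numpy as np

def add_noise(signal, snr_db=20, rng=None):
    """Add white Gaussian noise at a target signal-to-noise ratio (in dB)."""
    if rng is None:
        rng = np.random.default_rng(0)
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))  # from SNR definition
    noise = rng.normal(scale=np.sqrt(noise_power), size=signal.shape)
    return signal + noise

# One second of a 440 Hz tone at 16 kHz as a stand-in for real speech.
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
noisy = add_noise(clean, snr_db=20)
print(noisy.shape)
```

Libraries such as Librosa (which the paper uses) provide similar effects for pitch shifting and time stretching.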
Similarly, the image dataset was compiled from diverse sources such as CelebA, FFHQ, web-scraped image collections, and the Deepfake Detection Challenge dataset. A total of 50,653 images were used, with 70% (35,452 images) for training and 30% (15,201 images) for testing, ensuring a balanced mix of real and manipulated images. To improve generalization and reduce overfitting, various image augmentation techniques were applied, including rotation, resizing, color modifications, and random oversampling. This diverse and augmented dataset significantly strengthened the model's ability to detect deepfakes under different lighting conditions, facial expressions, and image qualities, making the system more reliable for real-world deployment.
The image deepfake detection results compare three models: FakeCatcher, XceptionNet, and the proposed Vision Transformer (ViT) model. Although FakeCatcher achieves the highest accuracy at 96%, it depends on high GPU resources, which may not always be practical. The proposed ViT model reaches an accuracy of 89.35%, which is clearly better than XceptionNet's 81%, while only requiring moderate GPU usage. This indicates that the proposed model maintains a good balance between strong performance and efficient use of computational resources, making it more suitable for real-world applications.
The audio deepfake detection comparison includes three models: COVAREP + LSTM, CAPTCHA, and the proposed SpecRNet model. Among them, the SpecRNet model performs the best, achieving an accuracy of 92.34%, which is higher than COVAREP + LSTM (89%) and significantly better than CAPTCHA (71%). This improvement shows that training the model on ASVspoof and multilingual datasets helps it detect fake audio more effectively. In addition to its strong performance, the model only requires moderate GPU resources, making it suitable and practical for real-world applications.
The audio model demonstrates very low misclassification, correctly identifying 2500 fake samples (true positives) and 3200 real samples (true negatives). Only 42 samples were wrongly classified in each of the false positive and false negative categories, which is a very small number compared to the total predictions. This shows that the model is reliable and maintains balanced performance between real and fake classes. The minimal error rate also suggests that the model can effectively handle new and unseen audio deepfake samples.
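As a quick check, the standard classification metrics can be derived directly from these reported audio counts (TP = 2500, TN = 3200, FP = 42, FN = 42); the helper below is just arithmetic on the confusion matrix:

```python
def metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)          # of predicted fakes, how many were fake
    recall = tp / (tp + fn)             # of actual fakes, how many were caught
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = metrics(tp=2500, tn=3200, fp=42, fn=42)
print(f"acc={acc:.4f} precision={prec:.4f} recall={rec:.4f} f1={f1:.4f}")
```

Because FP equals FN here, precision and recall coincide, which is consistent with the balanced behaviour described above.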
The image model shows very balanced and consistent classification results, correctly identifying 4731 fake images and 4728 real images. The number of misclassifications is quite low, with only 30 false positives and 32 false negatives. This small error count highlights the model's strong ability to accurately detect deepfake images. Since the true positive and true negative values are almost equal, it is clear that the model treats both classes fairly. Overall, the results reflect high precision, strong recall, and stable overall performance.
5. CONCLUSION
In conclusion, SecureVision provides a highly effective and efficient framework for detecting multimodal deepfake media content. This has been achieved through the effective integration of state-of-the-art deep learning models and realistic cybersecurity principles. By employing audio and image-based detection techniques, the proposed framework has been able to achieve higher accuracy compared to existing single-modal-based frameworks. Furthermore, the proposed model has been able to achieve higher accuracy through the effective utilization of SpecRNet model-based audio detection and image-based detection employing a Vision Transformer (ViT) model.
The experimental results obtained have been highly promising, achieving 92.34% accuracy in audio-based deepfake detection and 89.35% accuracy in image-based deepfake detection. At the same time, the proposed model operates with only moderate hardware requirements. Furthermore, the balanced confusion matrix results confirm the effectiveness of the proposed model in providing balanced results with minimal false negative and false positive rates. Unlike existing models requiring high-end hardware to operate, the proposed model has been able to operate in a resource-conscious environment.
Despite these strengths, the framework has a few limitations: a risk of overfitting, limited diversity in the datasets used, and uncertainty about large-scale real-time deployment. Moreover, the ever-changing nature of deepfake generation techniques demands that the model be retrained periodically to remain effective.
Future improvements to the model include:
* Real-Time Deployment Optimization: Apply techniques such as pruning and knowledge distillation to optimize the model for real-time deployment.
* Expanded Language Coverage: Train on datasets covering more global languages to strengthen audio deepfake detection.
* Advanced Fusion Strategies: Replace the current simple weighted fusion with advanced techniques such as attention-based fusion.
* Adversarial Robustness: Incorporate adversarial training to make the model more robust against bypass attacks.
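For context on the fusion point above, the simple weighted fusion that an attention-based upgrade would replace can be sketched as a fixed-weight average of the per-modality fake probabilities. The weights below (0.5 each) are illustrative assumptions, not values from the paper; in practice they would be tuned on a validation set.

```python
def weighted_fusion(p_audio: float, p_image: float,
                    w_audio: float = 0.5, w_image: float = 0.5) -> float:
    """Late fusion: combine per-modality fake probabilities with fixed weights."""
    assert abs(w_audio + w_image - 1.0) < 1e-9, "fusion weights must sum to 1"
    return w_audio * p_audio + w_image * p_image

# Example: a clip whose audio track looks fake but whose frames look real
fused = weighted_fusion(p_audio=0.91, p_image=0.22)
label = "fake" if fused >= 0.5 else "real"
```

An attention-based fusion would instead learn, per sample, how much to trust each modality, rather than applying the same fixed weights to every input.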
In summary, the SecureVision framework makes a notable contribution to AI-based cybersecurity, providing a scalable, secure, and multimodal solution to combat the menace of deepfakes. With further refinement and validation, it has strong potential for real-world application.
REFERENCES
1. N. Kumar and A. Kundu, SecureVision: Advanced Cybersecurity Deepfake Detection with Big Data Analytics, Sensors, vol. 24, no. 19, p. 6300, Sep. 2024, doi: 10.3390/s24196300.
2. G. Wang, F. Lin, T. Wu, Z. Yan, and K. Ren, Scalable face security vision foundation model for deepfake, diffusion, and spoofing detection, arXiv preprint, arXiv:2510.10663, 2025.
3. M. S. Afgan, B. Liu, A. Shifa, and M. N. Asghar, SecureFace: A controlled deepfake generation framework for exposing detector vulnerabilities, in Proc. 2025 Cyber Research Conference-Ireland (Cyber-RCI), 2025, pp. 1–8.
4. K. Jayashree, S. Chakaravarthi, J. Samyuktha, J. Savitha, M. Chaarulatha, Yogeswari, and G. Samyuktha, Secure Vision: Integrated anti-spoofing and deep-fake detection system using knowledge distillation approach, Signal Processing: Image Communication, vol. 117, p. 117481, 2026, doi: 10.1016/j.image.2026.117481.
5. J. Kultan, S. Meruyert, T. Danara, and D. Nassipzhan, Application of computer vision methods for information security, R&E-SOURCE, pp. 161–177, 2025.
6. Y. Zhao, B. Liu, M. Ding, B. Liu, T. Zhu, and X. Yu, Proactive deepfake defence via identity watermarking, in Proc. IEEE/CVF Winter Conf. Applications of Computer Vision (WACV), 2023, pp. 4602–4611, doi: 10.1109/WACV56688.2023.00456.
7. J.-W. Jung et al., AASIST: Audio Anti-Spoofing using Integrated Spectro-Temporal Graph Attention Networks, in Proc. 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr. 2022, pp. 6367–6371, doi: 10.1109/icassp43922.2022.9747766.
8. H. Tak, J. Patino, M. Todisco, A. Nautsch, N. Evans, and A. Larcher, End-to-End Anti-Spoofing with RawNet2, arXiv preprint arXiv:2011.01108, 2021.
9. L. Verdoliva, Media Forensics and DeepFakes: An Overview, arXiv preprint arXiv:2001.06564, 2020.
10. B. Dolhansky et al., The DeepFake Detection Challenge Dataset, arXiv preprint, Jun. 2020. [Online]. Available: https://arxiv.org/pdf/2006.07397
11. A. Rössler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner, FaceForensics++: Learning to Detect Manipulated Facial Images, in Proc. IEEE/CVF Int. Conf. Computer Vision (ICCV), 2019.
12. L. E. Demir and Y. Canbay, Deepfake Image Detection with Transfer Learning Models, Bitlis Eren Üniversitesi Fen Bilimleri Dergisi, vol. 14, no. 1, pp. 546–560, Mar. 2025, doi: 10.17798/bitlisfen.1610300.
13. Y. Lee, N. Kim, J. Jeong, and I.-Y. Kwak, Experimental case study of Self-Supervised Learning for Voice Spoofing Detection, IEEE Access, vol. 11, pp. 24216–24226, Jan. 2023, doi: 10.1109/access.2023.3254880.
14. K. Bhagtani, A. K. S. Yadav, E. R. Bartusiak, Z. Xiang, R. Shao, S. Baireddy, and E. J. Delp, An Overview of Recent Work in Media Forensics: Methods and Threats, arXiv preprint arXiv:2204.12067, 2022.
15. A. Dosovitskiy et al., An image is worth 16×16 words: Transformers for image recognition at scale, arXiv preprint, arXiv:2010.11929, 2020.
16. H. H. Nguyen, F. Fang, J. Yamagishi, and I. Echizen, Multi-task learning for detecting and segmenting manipulated facial images and videos, in Proc. 2019 IEEE 10th Int. Conf. Biometrics Theory, Applications and Systems (BTAS), Tampa, FL, USA, Sep. 2019, pp. 1–8, doi: 10.1109/BTAS46853.2019.9185972.
17. A. Kaur, A. Noori Hoshyar, V. Saikrishna, S. Firmin, and F. Xia, Deepfake video detection: Challenges and opportunities, Artificial Intelligence Review, vol. 57, no. 6, p. 159, 2024, doi: 10.1007/s10462-024-10639-3.
18. A. Ibnouzaher and N. Moumkine, Enhanced deepfake detection using a multi-model approach, in Proc. Int. Conf. Digital Technologies and Applications, Cham, Switzerland: Springer Nature, 2024, pp. 317–325, doi: 10.1007/978-3-031-53363-0_31.