This project introduces an efficient Vision Transformer (ViT) model built from the ground up to classify plant diseases with high accuracy. It was created as a submission for the FIT Data Science Competition 2025.
The primary challenge addressed is the accurate and timely identification of plant diseases in agriculture, which is crucial for preventing significant crop losses. Traditional Convolutional Neural Networks (CNNs) often fall short in this area because they focus on local features and may miss the global patterns of many plant diseases. Furthermore, agricultural datasets are often imbalanced, leading to models that are biased toward more common diseases. This project aims to create a diagnostic tool that is not only highly accurate and efficient but also robust to data imbalance.
A Vision Transformer (ViT) architecture was built from scratch to address the problem of plant disease classification. By treating images as a sequence of patches, the ViT model can effectively learn the long-range dependencies and global context of plant diseases, which is a key advantage over traditional CNNs. To tackle the issue of class imbalance, a suite of advanced training techniques was integrated into the methodology.
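To make the patch-based view concrete, the sketch below shows one common way to implement patch embedding with a strided convolution in PyTorch. It is an illustrative minimal version rather than the notebook's exact `PatchEmbedding` class; the 16x16 patch size and 768-dimensional embedding are assumptions.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into fixed-size patches and project each patch to an embedding vector."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution extracts and linearly projects non-overlapping patches in one step.
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        # x: (batch, 3, 224, 224) -> (batch, embed_dim, 14, 14)
        x = self.proj(x)
        # Flatten the spatial grid into a sequence: (batch, embed_dim, 196) -> (batch, 196, embed_dim)
        return x.flatten(2).transpose(1, 2)

# Example: a batch of two RGB images becomes a sequence of 196 patch tokens.
tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 196, 768])
```

In a full ViT, learnable class and position embeddings are typically added to this token sequence before it passes through the transformer blocks.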
The implementation details are provided in the Jupyter Notebook and include:
- Custom ViT Components: The core components of the ViT model, including `PatchEmbedding`, `MultiHeadAttention`, `MLP`, and `Block`, were implemented from scratch.
- Custom Training Components: To optimize the model's training, custom classes were created for the loss function (`CustomCrossEntropyLoss`), the optimizer (`CustomAdam`), and the learning rate scheduler (`CustomCosineAnnealingLR`); an illustrative sketch of such components follows this list.
- Data Preprocessing: The "PlantVillage" dataset was used for training and validation. The images were resized to 224x224 pixels and normalized before being fed into the model, as shown in the second sketch after this list.
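As a rough illustration of what the custom training components do, the sketch below pairs a cosine-annealed learning-rate schedule with a class-weighted cross-entropy loss, one common way to counter class imbalance. It is a minimal sketch under assumed hyperparameters, not the notebook's `CustomCrossEntropyLoss`, `CustomAdam`, or `CustomCosineAnnealingLR` implementations.

```python
import math
import torch

def cosine_annealing_lr(step, total_steps, base_lr=3e-4, min_lr=1e-6):
    """Cosine-annealed learning rate: decays smoothly from base_lr to min_lr over total_steps."""
    progress = step / max(1, total_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

def weighted_cross_entropy(logits, targets, class_weights):
    """Cross-entropy with per-class weights, so rare disease classes contribute more to the loss."""
    log_probs = torch.log_softmax(logits, dim=-1)
    picked = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    weights = class_weights[targets]
    return -(weights * picked).sum() / weights.sum()

# Illustrative training step (model, loader, optimizer, and class_weights are assumed to exist):
# for step, (images, labels) in enumerate(loader):
#     for group in optimizer.param_groups:
#         group["lr"] = cosine_annealing_lr(step, total_steps)
#     loss = weighted_cross_entropy(model(images), labels, class_weights)
#     loss.backward(); optimizer.step(); optimizer.zero_grad()
```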
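The preprocessing step can be expressed compactly with torchvision transforms. The normalization statistics below are the common ImageNet values and are an assumption; the notebook may use dataset-specific means and standard deviations.

```python
from torchvision import transforms

# Resize to 224x224, convert to a tensor, and normalize each channel.
# Mean/std are the standard ImageNet statistics (assumed; the notebook may differ).
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```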
The trained ViT model delivers strong results in both predictive accuracy and computational efficiency. The key results are as follows:
- Accuracy: The model achieved an accuracy of 97% on the curated dataset of plant images.
- Inference Time: The model is highly efficient, with an average inference time of only 2.46 milliseconds per image.
- Performance Metrics: A comprehensive analysis of the model's performance was conducted, evaluating it on accuracy, precision, F1-score, and recall.
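As an illustration of how such metrics can be computed, the snippet below uses scikit-learn with macro averaging, which weights every disease class equally and is a reasonable choice under class imbalance. It is a hedged sketch with toy labels, not the notebook's exact evaluation code.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Toy labels for illustration; in practice these would be the validation set's
# ground-truth classes and the model's argmax predictions.
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 1, 2, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)
# Macro averaging computes each metric per class and then takes the unweighted mean.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```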
The results demonstrate that a from-scratch Vision Transformer can outperform traditional CNNs in automated agricultural diagnostics, providing a more accurate and efficient solution for plant disease classification. The complete code for reproducing these results, including model training, validation, and evaluation, is available in the provided Jupyter Notebook.