Creating a Vehicle Detection and Classification ML pipeline using YOLO and MobileNet transfer learning

Prakhar Gurawa
7 min read · Oct 24, 2021


Introduction to Computer Vision pipeline

We will build a vehicle detection and classification pipeline using machine learning techniques. The pipeline counts the cars in each frame and classifies each one as either an SUV or a sedan, all in real time. This article assumes familiarity with the basics of machine learning, deep learning, convolutional networks, and transfer learning. To design and implement the computer vision pipeline, we split it into the following subtasks:

  1. Video Reading: We use OpenCV to read the video frame by frame, display the original video to the user, and pass a queue of frames to the next task.
  2. Object Detection: We use TinyYOLO, a pretrained state-of-the-art object detection model trained on the COCO dataset. Its role is to detect cars in each frame; car is one of the classes in the COCO dataset.
  3. Car Type Classifier: Once we have the cars and their bounding boxes for every frame, we classify each car as either SUV or Sedan. For this we use the MobileNetV2 classifier and apply transfer learning to adapt the model to our use case.
Figure 1. Description of the computer vision pipeline
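
The video-reading subtask can be sketched as a small loop that drains a capture object into a queue. This is a minimal sketch rather than the project's actual code: the function name `read_frames`, the `None` end-of-stream sentinel, and the file name in the usage comment are assumptions made for illustration. The function relies only on the `read()` contract of OpenCV's `cv2.VideoCapture`, which returns a `(success, frame)` pair.

```python
from queue import Queue


def read_frames(capture, frame_queue):
    """Push every frame from `capture` onto `frame_queue` and return the count.

    `capture` is any object whose read() returns (ok, frame), which is the
    contract of cv2.VideoCapture.read().
    """
    count = 0
    while True:
        ok, frame = capture.read()
        if not ok:  # end of the video (or a read error)
            break
        frame_queue.put(frame)
        count += 1
    frame_queue.put(None)  # assumed sentinel: tells the next stage to stop
    return count


# With OpenCV installed (the file name is a placeholder):
#   import cv2
#   frames = Queue()
#   n_frames = read_frames(cv2.VideoCapture("input_video.mp4"), frames)
```

Because the function depends only on `read()`, it can be exercised with a fake capture object in tests, without a video file.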

Pipeline Design, Model Configurations, and Working

  • For video reading, we use OpenCV, which reads the video frame by frame at 30 FPS.
  • Dataset Preparation: The dataset contains images of two classes, SUV and Sedan, with 1540 Sedan images and 1519 SUV images. The images were collected by scraping Google Images and from the Stanford Car Dataset[1], which contains images of many kinds of cars, including SUVs and sedans. As part of data preparation, all collected images were manually checked once to remove those that were irrelevant or highly dubious.
  • For object detection, we chose TinyYOLO over YOLO because it is lighter and faster. TinyYOLO can process 220 FPS, whereas YOLO processes 20–40 FPS. Since detection has to run in real time, TinyYOLO is the better fit.
  • Once all the frames have been read with OpenCV, we pass the queue of frames to the detection function, which uses YOLO’s detect_image function to detect the classes present in each frame.
  • From the detected classes and positions returned by YOLO, we keep only the objects whose class is car.
  • Finally, the detected cars and their positions in each frame are passed to a function that classifies each car image as either SUV or Sedan. This function uses a model trained on thousands of SUV and Sedan images with MobileNet, which can be found in .
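
The last three steps (detect, keep only cars, classify each car) can be sketched as a per-frame loop. This is an illustrative sketch, not the project's code: `count_cars_in_frame`, `process_video`, and the `detect` and `classify` callables are hypothetical stand-ins for the YOLO and MobileNet stages, and the `(label, box)` detection format is an assumption.

```python
def count_cars_in_frame(frame, detect, classify):
    """Return the Sedan / SUV / Total Cars counts for one frame.

    detect(frame)        -> list of (label, box), box = (x1, y1, x2, y2)
    classify(frame, box) -> "SUV" or "Sedan"
    Both callables are assumed interfaces, not the real model APIs.
    """
    counts = {"Sedan": 0, "SUV": 0}
    for label, box in detect(frame):
        if label != "car":  # keep only the COCO "car" class
            continue
        counts[classify(frame, box)] += 1
    counts["Total Cars"] = counts["Sedan"] + counts["SUV"]
    return counts


def process_video(frames, detect, classify):
    """Build one result row per frame, numbered from 1."""
    rows = []
    for index, frame in enumerate(frames, start=1):
        row = {"Frames": index}
        row.update(count_cars_in_frame(frame, detect, classify))
        rows.append(row)
    return rows
```

Passing the two models in as callables keeps the loop independent of any particular detector or classifier, so it can be tested with simple stubs.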

MobileNet Model and Transfer Learning

  • The basic structure of the MobileNet model, which we reuse through transfer learning, is shown in figure 2.
Figure 2. Structure of the original MobileNetV2 model
  • We use the MobileNet model as a starting point: we discarded its last layer and added our own layers, as shown in figure 3. These are two hidden layers with 512 neurons each, one hidden layer with 256 neurons, and finally an output layer with a single node, since this is a binary classification task.
Figure 3. Structure of the modified model after adding the extra hidden layers
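
One way to express this modified model in Keras is sketched below. Only the layer sizes (512, 512, 256, 1) come from the description above; the dropout probability of 0.2, the Adam optimizer, the learning rate of 0.0001, and the binary cross-entropy loss are the values reported elsewhere in this article. The 224×224 input, global average pooling, ReLU activations, the frozen backbone, and the function names are assumptions. The small helper counts the parameters in the new layers and needs no TensorFlow.

```python
HEAD_UNITS = (512, 512, 256)  # hidden layers added on top of MobileNetV2


def head_parameter_count(feature_dim=1280, hidden=HEAD_UNITS):
    """Weights plus biases of the dense head, ending in one output node."""
    sizes = (feature_dim, *hidden, 1)
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


def build_classifier(input_shape=(224, 224, 3), weights="imagenet",
                     dropout=0.2, learning_rate=1e-4):
    """Sketch of a MobileNetV2 backbone with the new dense head."""
    import tensorflow as tf  # lazy import: the helper above works without TF

    base = tf.keras.applications.MobileNetV2(
        input_shape=input_shape,
        include_top=False,   # discard the original classification layer
        weights=weights,
        pooling="avg",       # assumption: global average pooling (1280 features)
    )
    base.trainable = False   # assumption: keep the pretrained weights frozen

    model = tf.keras.Sequential([base])
    for units in HEAD_UNITS:
        model.add(tf.keras.layers.Dense(units, activation="relu"))
        model.add(tf.keras.layers.Dropout(dropout))
    model.add(tf.keras.layers.Dense(1, activation="sigmoid"))  # SUV vs Sedan

    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

Under these assumptions, the new layers add 1,050,113 trainable parameters on top of the frozen backbone.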

Hyperparameters and Model Tuning

  1. Dropout: We applied dropout regularization with a probability of 0.2 to the new layers. This helped reduce overfitting, as seen in the smaller gap between training and validation accuracy and loss.
  2. Data Augmentation: We tried data augmentation, which applies operations such as zooming and shifting, but it did not give satisfactory results, so we dropped it.
  3. Epochs: We trained the model for 20 epochs. More epochs did not give satisfactory results on the ground truth, and accuracy improved only slowly, possibly because the model was overfitting.
  4. L1/L2 Regularization: We experimented with L1 and L2 weight regularization, but both gave worse results, with very high validation error and low accuracy.
  5. Model compilation: The loss function is binary cross-entropy, with accuracy as the metric used to track progress. Finally, the model weights are saved under a descriptive name in the saved_models folder.
  6. Gradient descent optimization algorithms: We experimented with two optimizers, RMSProp and Adam. Adam gave satisfactory results for our use case, so we used it. The learning rate was 0.0001; a lower learning rate slowed learning and required more epochs.
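
To make the loss in step 5 concrete, here is a minimal pure-Python sketch of binary cross-entropy for a single sigmoid output. The function name, the clipping constant, and the label encoding (1 for one class, 0 for the other) are assumptions; in practice the framework's built-in loss would be used.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over labels (0 or 1) and predicted probabilities.

    The encoding is assumed: 1 stands for one class (say SUV), 0 for the other.
    """
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)
```

The loss penalizes confident mistakes heavily, so a model can keep a moderate validation accuracy while its validation loss grows large.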

Accuracy and Cross-Entropy vs Number of Epochs

Figure 4 below shows how accuracy and cross-entropy change over the epochs for the Adam and RMSProp optimizers. We preferred Adam over RMSProp because it gives a less overfitted model: the difference between training and validation accuracy is smaller than with RMSProp. As expected, it also gave better results on the ground truth, with better F1 scores for the number of sedans and SUVs, which confirms that it is the better model overall.

Figure 4. Variation of accuracy and cross-entropy (loss) with respect to epoch for Adam (left) and RMSProp (right)

Final Accuracy for deep learning models

  • Adam optimizer (lr=0.0001): training loss = 0.0772, validation loss = 0.8120, training accuracy = 0.9733, validation accuracy = 0.7605.
  • RMSProp optimizer (lr=0.0001): training loss = 0.0887, validation loss = 1.8487, training accuracy = 0.9721, validation accuracy = 0.5905.
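
The overfitting comparison between the two optimizers can be checked with simple arithmetic on the accuracies reported above. The sketch below is only a worked calculation; the dictionary layout and the function name are illustrative.

```python
# Final accuracies reported in this section.
results = {
    "Adam": {"train_acc": 0.9733, "val_acc": 0.7605},
    "RMSProp": {"train_acc": 0.9721, "val_acc": 0.5905},
}


def accuracy_gap(metrics):
    """Training accuracy minus validation accuracy (larger = more overfitting)."""
    return metrics["train_acc"] - metrics["val_acc"]


gaps = {name: round(accuracy_gap(m), 4) for name, m in results.items()}
# Adam: 0.9733 - 0.7605 = 0.2128; RMSProp: 0.9721 - 0.5905 = 0.3816
```

Both models overfit, but the gap for Adam (0.2128) is clearly smaller than the gap for RMSProp (0.3816).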

Pipeline Optimization

  • Producer-Consumer: We tried the producer-consumer design pattern, but it made no significant difference in running time. The producer (the video frame reader) is much faster than the consumer (the object detector and classifier), so the two stages cannot stay synchronized; the pattern is therefore not used in this project.
  • We experimented with the Non-Maximal Suppression (NMS) and Intersection over Union (IOU) threshold parameters of the YOLO model to get better results. These parameters control which detected objects are selected: boxes with low probability are removed, as are boxes that overlap too heavily with a higher-scoring box. For our use case with TinyYOLO, an NMS score of 0.2 and an IOU of 0.2 gave better results, which in turn improved the F1 score on the ground truth.
  • We first tried detecting cars with Haar cascades, but the results were not satisfactory, so we switched to YOLO, whose object detection performance was much better. The implementation of the first method can be found in .
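
The effect of the two YOLO thresholds can be illustrated with a generic non-maximal suppression routine. This is a textbook-style sketch in pure Python, not the code inside the YOLO implementation; the function names and the `(box, score)` format are assumptions, and mapping the "NMS score" of 0.2 to a minimum confidence score is an interpretation of the text.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, score_threshold=0.2, iou_threshold=0.2):
    """Filter a list of (box, score) detections; return the survivors, best first.

    1. Drop boxes whose score is below score_threshold.
    2. Visit the rest by descending score and keep a box only if it does not
       overlap any already-kept box by more than iou_threshold.
    """
    candidates = sorted(
        (d for d in detections if d[1] >= score_threshold),
        key=lambda d: d[1],
        reverse=True,
    )
    kept = []
    for box, score in candidates:
        if all(iou(box, kept_box) <= iou_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept
```

A low IOU threshold such as 0.2 is strict: two boxes that share even a modest part of their area are treated as the same car, which reduces double counting when cars are well separated.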

Pipeline Output

The input to the pipeline is any video, which is displayed to the user. The output is the video annotated with the detected cars and their type (SUV/Sedan), as shown in figure 5. The system also produces an Excel sheet with the columns Frames, Sedan, SUV, and Total Cars. This sheet is compared with the ground truth to calculate the F1 score for the number of cars per frame, the F1 score for sedans, and the F1 score for SUVs, which are shown in figure 6.

Figure 5. Snapshot of the output video with the number of cars and the type attached to each car
Figure 6. F1 scores for Total Cars, SUV, and Sedan compared with ground truth on 900 frames in total
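
The article does not spell out how the F1 score is derived from per-frame counts, so the sketch below shows one plausible convention rather than the method actually used. For each frame, the overlap between the predicted and true counts is treated as true positives, any excess predictions as false positives, and any shortfall as false negatives. The function name is hypothetical.

```python
def count_f1(predicted, truth):
    """F1 score from per-frame object counts (assumed counting convention).

    predicted, truth: equal-length sequences of non-negative integers,
    one entry per frame.
    """
    tp = sum(min(p, t) for p, t in zip(predicted, truth))
    fp = sum(max(p - t, 0) for p, t in zip(predicted, truth))
    fn = sum(max(t - p, 0) for p, t in zip(predicted, truth))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

The same function can be applied separately to the Sedan, SUV, and Total Cars columns of the output sheet and the ground truth.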

Execution Time

Our system is slow: it takes 450 seconds to process the 900 frames, which is 2 frames per second. The time spent per frame on object detection by YOLO and on classification by the ML model (where the number of cars in the frame is a crucial factor) is shown below.

Figure 7. Detection and classification time for each frame
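
Per-frame timings like those in figure 7 can be collected with a small timing wrapper. This is a generic sketch using the standard library; the helper names are illustrative and not taken from the project. The arithmetic at the end uses the figures reported in this article: 900 frames in 450 seconds, and a 30 FPS source video.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def frames_per_second(n_frames, total_seconds):
    """Average processing throughput."""
    return n_frames / total_seconds


throughput = frames_per_second(900, 450)  # 2.0 frames per second
slowdown = 30 / throughput                # 15.0: the video plays 15x faster
```

Wrapping the detection and classification calls separately with `timed` gives the two per-frame series needed for a plot of this kind.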

Design Strengths and Weaknesses

  • Although this is a computer vision pipeline, the code follows object-oriented programming standards, with separate classes for video reading, object detection, car classification, etc. This promotes reuse and will make the project easier to extend.
  • Even with multithreading and a thread pool, the time to process a video is significant. It can be improved by running the application on a high-performance system with CUDA-enabled GPUs.
  • Although the dataset collected from web scraping and the Stanford dataset was checked manually, a number of images still lower its quality and thus the overall performance of the system. They cannot simply be removed, because their count is significant.
  • We created a separate dependency.yml file to make it easier for new users to set up the project, since many issues arise from dependency problems and version clashes.
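
As an illustration of the thread-pool idea mentioned above, the crops of the cars found in one frame could be classified concurrently. This is a sketch under assumptions: the article does not say how the project's thread pool is organized, and `classify_crops`, the `classify` callable, and the worker count are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def classify_crops(crops, classify, max_workers=4):
    """Apply classify to every crop using a thread pool.

    Results are returned in the same order as the input crops.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(classify, crops))
```

Threads help only when `classify` spends its time outside the Python interpreter (for example inside a native inference call), which is consistent with the observation that multithreading alone did not make the pipeline fast.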

Project Setup

The code can be downloaded from the GitHub link.

In the same repository, Car_Results.xls is the output Excel file containing the number of cars, sedans, and SUVs detected and classified per frame, and Output.avi is the output video.

Setup steps :




