NUST COLLEGE OF
ELECTRICAL AND MECHANICAL ENGINEERING
LIDAR BASED OBJECT DETECTION AND TRACKING FOR
NUSTAG ELECTRICAL VEHICLE
A PROJECT REPORT
DE-43 (DC & SE)
Submitted by
NS SYED MUHAMMAD IRTAZA HYDER
NS MUHAMMAD ZAKRIA MEHMOOD
NS MUHAMMAD ABDULLAH
BACHELORS
IN
COMPUTER ENGINEERING
YEAR
2025
PROJECT SUPERVISORS
PROF. USMAN AKRAM
PROF. FAHAD MUMTAZ MALIK
YEAR 2025
DEPARTMENT OF COMPUTER & SOFTWARE ENGINEERING
COLLEGE OF ELECTRICAL & MECHANICAL ENGINEERING
NATIONAL UNIVERSITY OF SCIENCES AND TECHNOLOGY,
ISLAMABAD, PAKISTAN
Certification
This is to certify that Syed Muhammad Irtaza Hyder [378514], Muhammad Zakria Mehmood
[391449] and Muhammad Abdullah [372567] have successfully completed the final project
LiDAR based Object Detection and Tracking for NUSTAG Electrical Vehicle, at the
NUST College of Electrical and Mechanical Engineering, in partial fulfillment of the requirements of the degree of Bachelors in Computer Engineering.
Signature of Project Supervisor
Prof. Usman Akram
Head of Department
Sustainable Development Goals (SDGs)
SDG No    Description of SDG
SDG 1     No Poverty
SDG 2     Zero Hunger
SDG 3     Good Health and Well Being
SDG 4     Quality Education
SDG 5     Gender Equality
SDG 6     Clean Water and Sanitation
SDG 7     Affordable and Clean Energy
SDG 8     Decent Work and Economic Growth
SDG 9     Industry, Innovation, and Infrastructure
SDG 10    Reduced Inequalities
SDG 11    Sustainable Cities and Communities
SDG 12    Responsible Consumption and Production
SDG 13    Climate Action
SDG 14    Life Below Water
SDG 15    Life on Land
SDG 16    Peace, Justice and Strong Institutions
SDG 17    Partnerships for the Goals
Sustainable Development Goals
Complex Engineering Problem
Range of Complex Problem Solving
1. Range of conflicting requirements: Involve wide-ranging or conflicting technical, engineering and other issues.
2. Depth of analysis required: Have no obvious solution and require abstract thinking and originality in analysis to formulate suitable models.
3. Depth of knowledge required: Require research-based knowledge, much of which is at, or informed by, the forefront of the professional discipline, and which allows a fundamentals-based, first-principles analytical approach.
4. Familiarity of issues: Involve infrequently encountered issues.
5. Extent of applicable codes: Are outside problems encompassed by standards and codes of practice for professional engineering.
6. Extent of stakeholder involvement and level of conflicting requirements: Involve diverse groups of stakeholders with widely varying needs.
7. Consequences: Have significant consequences in a range of contexts.
8. Interdependence: Are high-level problems including many component parts or sub-problems.

Range of Complex Problem Activities

1. Range of resources: Involve the use of diverse resources (and for this purpose, resources include people, money, equipment, materials, information and technologies).
2. Level of interaction: Require resolution of significant problems arising from interactions between wide-ranging and conflicting technical, engineering or other issues.
3. Innovation: Involve creative use of engineering principles and research-based knowledge in novel ways.
4. Consequences to society and the environment: Have significant consequences in a range of contexts, characterized by difficulty of prediction and mitigation.
5. Familiarity: Can extend beyond previous experiences by applying principles-based approaches.
Dedicated to the family, friends, and
professors who supported us throughout our
university journey.
Acknowledgment
First, we express our humblest gratitude to the Divine, Allah Almighty, for giving us the
strength and fortitude to carry out this endeavor. All our efforts were guided towards a
positive outcome, and His help sustained us in times of uncertainty.
We thank our project supervisor, Dr. Usman Akram, for his unwavering help and support
throughout the project. His mentorship and guidance provided us with the motivation to
give our best and strive for excellence.
We also acknowledge our co-supervisor, Dr. Fahad Mumtaz Malik, for his crucial assistance and support during the project. He provided all the equipment, including the Ouster
LiDAR, the NUSTAG EV, the Jetson Xavier AGX, and access to the UAV lab. Furthermore, we
thank Dr. Usman Akbar, who provided us with the resources to train our deep learning
models. We also extend our thanks to Muneeb Ahmad from MTS-43 (C) for his help in
designing the LiDAR mount and attaching it to the Autonomous Vehicle.
Finally, we want to thank our family and friends for their encouragement and support
throughout this effort.
Abstract
LiDAR (Light Detection and Ranging) is a remote sensing technique that estimates distance via laser pulses to obtain a range image. From this range image, a point cloud is derived, from which pseudo-BEV (bird's eye view) images are created. LiDAR is a crucial sensor in the perception stack for autonomous driving, as it provides vehicles with a very detailed view of their environment regardless of lighting conditions and with minimal interference from weather. Major industry players such as Waymo and Zoox make extensive use of LiDAR for safe and reliable navigation.

This final year project builds on this foundation to improve real-time autonomous perception for the NUSTAG Electric Vehicle through LiDAR-based object detection and tracking. Fast inference on an energy-efficient platform, the Nvidia Jetson Xavier AGX, was achieved through an efficient processing pipeline: an object detector based on the YOLO model, acceleration of YOLO on Tensor Cores via TensorRT, and ByteTrack object tracking.

The system has been installed on the NUSTAG EV, an electric vehicle assembled by the NUSTAG team, and tested in simulated environments. Test results demonstrate high accuracy and real-time performance, making this approach well suited for autonomous vehicle applications. This work is part of a broader effort to build fully autonomous electric vehicles in later phases of development.

Keywords: Autonomous Systems, Complex YOLO, Embedded AI, Embedded Systems, LiDAR, NVIDIA Xavier, Object Detection, Object Tracking, Perception Pipeline, Real-time Processing, TensorRT Acceleration, YOLO
Contents
Acknowledgment
Abstract
Contents
List of Figures
List of Tables
Chapter 1: Introduction
1.1 Introduction
1.2 Motivation
1.3 Problem Statement
1.4 Scope
1.5 Aims and Objectives
1.5.1 Object Detection
1.5.2 Object Tracking
1.5.3 Deployment Goal
1.6 Outcomes
1.6.1 Object Detector
1.6.2 Object Tracker
1.6.3 Simulations
1.7 Report Organization
1.7.1 Chapter 2
1.7.2 Chapter 3
1.7.3 Chapter 4
1.7.4 Chapter 5
1.7.5 Chapter 6
1.7.6 Chapter 7
1.7.7 Chapter 8
Chapter 2: Background & Related Work
2.1 LiDAR Technology for Autonomous Perception
2.1.1 Challenges of LiDAR Processing
2.2 Edge Computing for Autonomous Systems
2.2.1 Edge Computing for Autonomous Vehicles: Motivation and Platform
2.3 Detection of Objects in 3D Point Clouds
2.3.1 Voxel-Based Methods
2.3.2 Point-Based Methods
2.3.3 Pillar-Based Methods
2.3.4 Projection-Based (Bird's Eye View - BEV) Methods
2.3.5 Treating Sparsity and Uncertainty
2.3.6 Model Efficiency and Real-Time Performance
2.4 Benchmark Datasets for 3D Object Detection
2.4.1 Other Notable Datasets
2.5 Conclusion
Chapter 3: Components & Materials
3.1 Technologies
3.2 Sensor
3.3 Edge Device
3.4 Dataset
3.4.1 KITTI Dataset
3.4.2 Preprocessing
3.4.3 Dataset Preparation and Label Conversion
3.5 Conclusion
Chapter 4: Object Detection
4.1 YOLO (You Only Look Once) for Object Detection Architecture
4.1.1 Core Principles of YOLO
4.1.2 Evolution and Variants Used
4.1.3 Oriented Bounding Boxes (OBB) for BEV
4.2 Model Training
4.2.1 Training Setup and Pipeline
4.2.2 The Need for Optimization on Edge Devices
4.3 Experimental Evaluation of Detection Models
4.3.1 Models Evaluated
4.3.2 Performance Measures
4.3.3 Result Analysis
4.4 Conclusion
Chapter 5: Object Tracking
5.1 Tracker
5.2 ByteTrack MOT
5.2.1 Tracklet Interpolation
5.3 Multi-step Look ahead
5.4 Conclusion
Chapter 6: Hardware in Loop Simulations
6.1 Gazebo Simulator
6.2 Digital Twin Development
6.2.1 Measurements
6.2.2 URDF Development
6.2.3 Ackermann Controller
6.2.4 Sensor Integration
6.3 Simulation Environments
6.3.1 Car-Only Environment
6.3.2 Pedestrian-Only Environment
6.3.3 Mixed Environment: Cars and Pedestrians
6.4 Conclusion
Chapter 7: ROS2 Architecture and Visualization Tools
7.1 ROS2 Framework
7.1.1 ROS2
7.1.2 Visualizations with Rviz2
7.2 Nodes and Topics
7.3 Features
7.3.1 BEV Conversion
7.3.2 Visualize Live Sensor Data
7.3.3 Record Data
7.3.4 Replay Data
7.4 Conclusion
Chapter 8: Deployment & Validation
8.1 Deployment on Nvidia Jetson
8.1.1 Limitations of TensorRT
8.1.2 Benefits of TensorRT
8.2 Live Detections on Jetson
8.3 Integration with NUSTAG Autonomous Vehicle
8.3.1 Mount Design
8.3.2 Mounting LiDAR on Vehicle
8.4 Conclusion
Chapter 9: Conclusion and Future Work
9.1 Conclusion
9.1.1 Key Contributions
9.2 Future Work
References
List of Figures
Figure 1     Sustainable Development Goals
Figure 3.1   Point Cloud to 2D Bird Eye View Conversion
Figure 4.1   YOLOv11 base architecture
Figure 4.2   YOLOv11 small training curve
Figure 4.3   Examples of poor model performance
Figure 4.4   Examples of good model performance
Figure 4.5   RT-DETR model performance
Figure 4.6   Performance of object detection models on Xavier
Figure 5.1   Tracking test on LiDAR data from the San Francisco drive
Figure 6.1   Chassis Dimensions
Figure 6.2   Side View of Digital Twin
Figure 6.3   Top View of Digital Twin
Figure 6.4   Ackermann Diagram
Figure 6.5   Simulated car-only environment in Gazebo
Figure 6.6   Model detection results in the car-only environment
Figure 6.7   Simulated pedestrian-only environment in Gazebo
Figure 6.8   Model detection results in the pedestrian-only environment
Figure 6.9   Simulated environment with both cars and pedestrians
Figure 6.10  Model detection results in the mixed environment
Figure 7.1   Rviz2 Output for Simulation
Figure 7.2   Rviz2 Output for LiDAR sensor
Figure 7.3   Point Cloud Data visualized on Rviz2: reflectivity (top left), intensity (top right), range (bottom left), ambient (bottom right)
Figure 7.4   Bounding Boxes using Line Strip and Line List Markers
Figure 7.5   Rviz2 Display Markers
Figure 7.6   Bounding Boxes using Line Strip and Line List Markers
Figure 7.7   Node and Topics interaction in the Gazebo Simulation
Figure 7.8   Node and Topics interaction with the LiDAR
Figure 7.9   Overview of ROS2 Workflow
Figure 8.1   Object Detection and Tracking Models running on Xavier
Figure 8.2   CAD Models of the LiDAR Mount
Figure 8.3   LiDAR mounted on Autonomous Vehicle
List of Tables
Table 3.1   LiDAR Specifications
Table 3.2   Jetson AGX Xavier (32GB) Specifications
Table 4.1   Comparison of Object Detection Models Performance on Xavier (Trained at 325 Epochs)
Table 4.2   3D Object Detection Models Performance Comparison on Nvidia Xavier on the KITTI Dataset
Table 6.1   Car Specifications
Chapter 1
Introduction
1.1 Introduction
Recent progressive steps in autonomous systems and edge computing have fostered a
plethora of intelligent technologies like self-driving cars, robotics, and smart cities. Among
these technologies, LiDAR-based sensing has become a key enabler of reliable object detection due to its ability to provide accurate 3D spatial representations of complex environments [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]. However, LiDAR poses numerous challenges for real-time deployment on edge devices and widely used embedded platforms due to its heavy computational requirements [16]. So far, the field has relied either on powerful GPUs hosted in the cloud or on lightweight 2D camera systems [17]; these, however, do not provide the robust, real-time, on-device 360-degree perception needed for autonomous vehicles operating in dynamic environments.
The present work describes an optimized pipeline for LiDAR-based object detection and tracking built on a YOLO architecture [18], leveraging TensorRT for efficient deployment on edge devices such as the NVIDIA Jetson series [19]. We have integrated the proposed system into the NUSTAG EV, an electric vehicle developed at the National University of Sciences and Technology, and evaluated it through real-time testing within a campus environment. The results showed that the system delivers high accuracy and low latency, establishing it as a viable solution for real-world applications. While basic automation functionalities such as braking and throttle control are supported in the current setup, future versions will push towards higher levels of autonomy, targeting Level 3 and Level 4 autonomous driving.
1.2 Motivation

Camera-based perception systems have long dominated the autonomous driving market due to their low cost and high-resolution imaging capabilities. However, in recent years other perception sensors, such as LiDAR, have matured significantly. By using laser light to generate a 3D map of the environment, LiDAR has become a popular choice for environmental awareness and depth perception.
With this improvement, two main schools of thought have emerged. On one hand we have Waymo's sensor-fusion model, which heavily integrates LiDAR, and on the other Tesla's vision-only strategy, which depends solely on cameras. Despite the camera's affordability and availability, camera systems struggle with low-light conditions and depth estimation. In contrast, a LiDAR-centric strategy improves detection and navigation accuracy by providing precise 3D spatial data.

In light of this, the project aims to incorporate LiDAR-based perception to replace or enhance the NUSTAG autonomous vehicle's perception stack, which previously used only a front-view camera. The end goal is to develop a dependable and safe system for efficient detection and tracking of pedestrians, cyclists, and cars.
1.3 Problem Statement

Our project, LiDAR based Object Detection and Tracking for NUSTAG Autonomous Vehicle, tackles the challenge of perception. Initially, the vehicle used a single mono-camera for road segmentation and path planning. Our goal was to introduce the LiDAR sensor to enhance the vehicle's perception. LiDAR provides the vehicle with a 360-degree field of view while remaining tolerant and robust in tougher weather conditions (such as rain and fog) compared to the camera.
1.4 Scope

The scope of this project was to utilize the LiDAR as the primary sensor in the perception stack of the NUSTAG autonomous vehicle. It tackles the computational bottlenecks faced by LiDAR-based perception on edge devices by means of a hardware-accelerated architecture based on a TensorRT-optimized YOLO [16, 20, 18, 21].

The high-performance pipeline that we have deployed on the NUSTAG EV performed well on campus roads, proving to be scalable and efficient. On top of that, the system's low-latency and high-accuracy performance makes it applicable in other areas such as robotics, security, and agriculture, especially where edge devices operate in remote or resource-constrained settings. Running complex models like YOLO on low-power platforms also contributes, in a broader sense, to sustainable AI.
1.5 Aims and Objectives

1.5.1 Object Detection

The first objective of this project was to detect and classify objects using the point cloud data generated by the LiDAR.
1.5.2 Object Tracking

The second objective of this project was to track the detected objects so that their trajectories can be estimated and collisions avoided.
1.5.3 Deployment Goal

The third objective was to deploy the perception system on a low-powered edge device. The system therefore has to achieve high accuracy, low latency, and low resource consumption. As the edge device also runs many other processes, such as navigation and communication with other sensors, the developed system must be as small and efficient as possible.
1.6 Outcomes

1.6.1 Object Detector

We processed the 3D point cloud into 2D RGB-BEV images and used yolo11s-obb as the 2D backbone of our object detector.
1.6.2 Object Tracker

After detection comes tracking. For this purpose, we used the ByteTrack multi-object tracking algorithm and extended its functionality to perform what we call a Multi-step Look-ahead.
1.6.3 Simulations

The deep learning model was tested and validated in the Gazebo simulation environment. Test environments were developed containing pedestrians, cars, and a mix of both to evaluate the model's accuracy in detecting objects and tracking their trajectories.
1.7 Report Organization

The organization of the thesis is as follows:
1.7.1 Chapter 2

This chapter lists and references the literature reviewed before starting the project. It is further divided into four parts: the LiDAR sensor, edge computing, detection, and datasets.

1.7.2 Chapter 3

This chapter covers the requirements for developing the deep learning models, the edge device, and the dataset used.

1.7.3 Chapter 4

This chapter discusses the object detection model used, how it was trained and optimized, and its performance on the KITTI dataset.

1.7.4 Chapter 5

This chapter discusses the object tracking algorithm, its foundations, and its implementation.

1.7.5 Chapter 6

This chapter discusses the Gazebo simulator, the development of the digital twin, the simulation environments, and the testing of the models in the simulated environments.

1.7.6 Chapter 7

This chapter discusses the ROS2 workflow used to develop the visualizer. It also introduces features such as viewing live sensor data, and recording and replaying data obtained from the simulator and the sensor.

1.7.7 Chapter 8

This chapter details the deployment of our model on the Jetson Xavier and discusses future work on the Jetson.
Chapter 2
Background & Related Work
This chapter covers the essential concepts and previous research on LiDAR-based object detection and tracking, particularly solutions suited to edge computing platforms. It discusses how LiDAR point clouds are interpreted, the advances made in object detection algorithms, model efficiency for real-time performance, and the role of benchmark datasets. This provides the context for the contributions of this thesis, specifically an optimized pipeline for LiDAR object detection and tracking on an edge device such as the NVIDIA Jetson family. The challenges of deploying fast perception algorithms at this level, and how they are addressed, form a major component of the comparative study related to this work. The generation of dense, accurate 3D point clouds of the environment provides an abundance of spatial information crucial for tasks ranging from object detection to localization and mapping [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]. Processing LiDAR data is very challenging, requiring high computing power due to the very large number of points output per second, as well as the complexity of the algorithms used to extract meaning from the point cloud. This poses real challenges, especially for real-time applications on low-resource configurations [16]. Conventional approaches have used high-end GPU processing power in the cloud or relatively less costly systems based on 2D cameras; cloud processing is effective because it provides huge computational capacity [17].
2.1 LiDAR Technology for Autonomous Perception

Self-driving vehicles are made possible by combining robotic perception with actuator control. LiDAR-based systems that detect obstacles and differentiate them from free space enable real-time autonomous navigation, even in dynamic environments.

LiDAR has become a crucial sensing technology for autonomous vehicles and robotics, with distinct advantages over other perception modalities such as cameras and radar. It captures the environment in all directions around the sensor, providing a much broader comprehension of the surrounding space and making it essential for understanding large environments.

This section examines LiDAR technology in terms of its operational principles, advantages, and limitations in real-world applications. These limitations include cost, integration with the rest of the vehicle platform, and the requirement to process substantial amounts of LiDAR data quickly and efficiently.

Combined with high-performance computing and real-time data processing, LiDAR-based perception equips an autonomous platform for smart decision-making and autonomous mobility.
2.1.1 Challenges of LiDAR Processing

Despite its advantages, LiDAR technology also presents several challenges:

• Cost of Computation: A massive amount of data is continuously generated by the LiDAR sensor, which requires several processing steps such as filtering, segmentation, object detection, and tracking [16] for real-time applications. This becomes a significant obstacle to deployment on embedded or edge devices, which typically have limited processing power and memory.

• Sensitivity to Harsh Weather: In many ways LiDAR performs better than cameras, but its performance can degrade in severe weather conditions such as heavy rain, snowstorms, or fog, all of which scatter or absorb the laser beams, resulting in very noisy or sparse data.

• Sparsity over Long Distances: The density of the point cloud reduces with distance, which makes it increasingly difficult to detect and classify objects at long ranges.

• Cost and Mechanical Complexity: High-performance LiDAR sensors have, in the past, been costly and contained moving mechanical parts, although solid-state LiDAR is becoming common and addressing some of these issues.

• Absence of Color/Texture Information: The LiDAR point cloud lacks the color and texture information available from cameras, which capture appearance detail beneficial for classifying objects. This often motivates sensor fusion approaches.
The work presented in this thesis directly addresses the computational cost challenge by
developing an efficient pipeline optimized for edge devices.
2.2 Edge Computing for Autonomous Systems

Although edge computing generally refers to the processing of data near its capture source rather than relaying it to a centralized cloud or data center, for autonomous vehicles the term edge usually refers to computing platforms onboard the vehicle itself.

2.2.1 Edge Computing for Autonomous Vehicles: Motivation and Platform
Edge computing is crucial for bringing intelligent, real-time operation to autonomous vehicles. The following motivations make processing data locally on the vehicle important:

• Low Latency: Real-time decision making is tied directly to safety requirements, for example emergency braking or avoiding an object in a critical situation. Processing on the vehicle itself incurs far lower latency than offloading the computation to the cloud, which would add to the overall system delay.

• Less Bandwidth Consumption: Sensors such as LiDAR and cameras generate a huge amount of data. Continuously streaming raw sensor data to the cloud requires high bandwidth, which is not always available or reliable in all driving conditions.

• Reliability and Availability: Thanks to edge processing, the perception and control systems of the vehicle remain active in the case of network outages or poor connectivity.

• Increased Privacy and Security: Local processing avoids exposing sensitive sensor data to external networks, enhancing data privacy while minimizing the cyberattack surface.
The NVIDIA Jetson Xavier AGX was designed as an edge computing module specifically for robotic and autonomous systems [16, 19, 20]. The critical architectural and software elements that enable efficient AI inference on edge devices include:

• GPU Acceleration: The platform integrates an NVIDIA Volta architecture GPU with Tensor Cores, providing the massively parallel processing power suited for deep learning applications.

• Multi-core CPU: Multiple ARM CPU cores support general-purpose computation.

• Deep Learning Accelerators (DLAs): Purpose-built accelerators perform neural network inference at low power consumption.

• Large Memory: 32 GB of RAM allows large models and high-resolution sensor data to be loaded efficiently.

• Strong Software Stack: The Jetson Xavier AGX is well supported by CUDA, cuDNN, and TensorRT, an inference optimization library essential for deploying models like YOLO with low latency and high throughput.
TensorRT optimization techniques such as layer fusion, precision calibration (FP16/INT8), kernel auto-tuning, and dynamic memory management enable deep learning models to run efficiently on resource-constrained edge devices [19, 20, 21].
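As a concrete illustration, the sketch below shows one way a detector exported to ONNX could be converted into an FP16 TensorRT engine using the TensorRT Python API. This is a minimal, hedged example assuming a TensorRT 8.x installation (as shipped with JetPack on the Xavier); the file names are placeholders, and the project's actual conversion path may differ (for example, using trtexec or a framework-level exporter).

# Minimal sketch (assumption: TensorRT 8.x Python API): building an FP16 engine
# from an ONNX-exported detector. File names are placeholders.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("detector.onnx", "rb") as f:        # placeholder ONNX model
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)         # enable FP16 precision on Tensor Cores

engine = builder.build_serialized_network(network, config)
with open("detector.engine", "wb") as f:      # serialized engine for deployment
    f.write(engine)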
2.3 Detection of Objects in 3D Point Clouds

Detecting objects in 3D point clouds is a core task for autonomous navigation, enabling vehicles to recognize and react to cars, pedestrians, cyclists, and other obstacles. This section presents an analysis of some prominent approaches found in the literature, many of which are cited throughout this report.
2.3.1 Voxel-Based Methods

Voxel-based methods discretize the 3D point cloud space into a regular grid of volumetric elements, so-called voxels.
• VoxelNet [6]: One of the early works in this area, VoxelNet divides the point cloud into 3D voxels, encodes the points within each non-empty voxel into a feature vector using a mini-PointNet structure, and then applies 3D convolutions to these voxel features. A Region Proposal Network (RPN) is used to generate 3D bounding box detections. VoxelNet demonstrated the feasibility of end-to-end learning for 3D object detection but was computationally expensive.

• SECOND (Sparsely Embedded Convolutional Detection) [14]: SECOND improves on VoxelNet by introducing sparse convolution operations, which execute computation only on the non-empty voxels. This avoids the computation and memory that standard dense 3D convolutions spend on empty voxels, achieving much higher speeds while maintaining similar accuracy.
These methods leverage the well-established convolutional architectures from 2D image
processing by extending them to 3D. However, the voxelization step can lead to information loss and quantization errors, and the computational cost of 3D convolutions, even
sparse ones, can still be high.
2.3.2 Point-Based Methods

Point-based methods operate directly on the raw, irregular point data, avoiding the information loss usually associated with voxelization.

• PointNet [5]: The pioneering work that proposed a neural network architecture designed to operate directly on unordered point sets. It learns point-wise features using shared Multi-Layer Perceptrons (MLPs), which are then aggregated into a global shape feature using a symmetric function (e.g., max pooling) to guarantee permutation invariance. It was originally designed for classification and segmentation but inspired many approaches to more complex tasks such as detection.

• PointNet++: PointNet++ extends PointNet by hierarchically capturing local geometric structures: PointNet is applied recursively on nested partitions of the input point set instead of learning features from the whole set at a single level.

• PointRCNN [11]: A two-stage approach: first, 3D region proposals are generated directly from the point cloud through a segmentation network based on PointNet++; then each proposal is refined and classified. PointRCNN is accurate but somewhat slower because of its two-stage processing and point-wise operations.

• Part-A^2 Net [12]: A newer model aimed, in particular, at identifying intra-object parts and thus improving detection accuracy. It has two main stages, one for the generation of part-aware proposals and a second for part aggregation in RoI feature learning, showcasing the benefits of finer feature representations.

Point-based methods preserve precise geometric information but often have trouble extracting contextual information efficiently and can be computationally expensive for large point clouds.
2.3.3 Pillar-Based Methods

Pillar-based algorithms provide a middle ground between speed and accuracy, filling the gap between voxel-based and point-based approaches.
• PointPillars [1]: An important reference for this work, PointPillars discretizes the point cloud into vertical columns (pillars) rather than 3D voxels. Each pillar's points are encoded into a feature vector using a simplified PointNet-like architecture, and these pillar features are scattered into a 2D pseudo-image so that efficient 2D convolutional networks can be used for detection. PointPillars strikes a good balance between speed (reported at 62 Hz) and accuracy on benchmarks like KITTI [22], making it well suited for real-time applications. A PointPillars-style voxelization scheme is used in the NUSTAG EV's pipeline to map points to pixels during BEV generation.
2.3.4 Projection-Based (Bird's Eye View - BEV) Methods

These methods project the 3D point cloud onto a 2D plane, most commonly the Bird's Eye View (BEV), and then apply 2D object detection techniques, greatly reducing computational complexity.
• MV3D (Multi-View 3D Object Detection Network) [7]: One of the earlier works that fused features from multiple views (BEV, front view, and camera images) to perform 3D object detection. Our RGB-BEV feature map generation follows ideas proposed by Chen et al. [7].
• Pixor [13]: This method performs 3D object detection directly from a BEV representation of the point cloud. It uses a fully convolutional network to predict objectness and bounding box parameters for each cell in the BEV grid, enabling fast,
single-stage detection.
• Complex-YOLO [2]: This is a key reference in the paper. Complex-YOLO adapts
the YOLO (You Only Look Once) architecture for 3D object detection from BEV
images derived from point clouds. It encodes height, intensity, and density information into the RGB channels of the BEV image. A significant contribution is
the Euler-Region Proposal Network (E-RPN), which uses complex regression for
orientation estimation, avoiding singularities associated with traditional angle computation. Complex-YOLO demonstrated very high speed (over 5x faster than some
contenders) and accuracy.
• BirdNet+ [10]: Another BEV-based approach that focuses on end-to-end 3D object
detection directly in bird’s eye view and aims for efficient and accurate detection.
Projection-based methods are fast because they use mature two-dimensional CNNs. However, projection often results in a loss of information, especially about how objects extend vertically. Consequently, the BEV encoding presented in Section 3.4.2 is important for preserving salient features.
2.3.5 Treating Sparsity and Uncertainty

LiDAR point clouds are sparse, and ground truth annotations may be uncertain, since objects can be occluded or labels may contain human errors.
• Sparse Convolutional Networks (SCN) [4]: As noted for SECOND, SCNs (for example, Submanifold Sparse Convolutional Networks) are designed for the efficient processing of spatially sparse data. Convolution is applied only at active sites, that is, non-empty voxels/points and their neighbours, saving a large amount of computation compared to dense convolution. They work very well for tasks such as semantic segmentation, but deploying them on resource-constrained edge devices can be challenging without library support for the target architecture (aarch64), as noted in the future work of this report.

• GLENet [3]: GLENet addresses the issue of uncertain labels in 3D object detection. It presents a generative model based on conditional variational autoencoders (CVAEs) to incorporate label uncertainty into the detection pipeline, with the potential to improve robustness under ambiguous conditions or occlusions. However, while powerful, such advanced models still face challenges for deployment on edge devices due to framework limitations.
2.3.6 Model Efficiency and Real-Time Performance

Model efficiency for real-time performance on edge devices is a recurring theme throughout the literature discussed above and a central focus of this work.
• PointPillars [1] and Complex-YOLO [2] stand out for their emphasis on processing efficiency: PointPillars through pillar encoding and a 2D CNN backbone, and Complex-YOLO through BEV projection with a lightweight YOLO architecture.

• The choice among different YOLO variants (YOLOv3 Tiny, YOLOv4 Tiny, YOLOv11 Small OBB, YOLOv12 Nano/Small OBB, RT-DETR) [18, 23, 24] reflects the same pursuit. Since YOLO is a single-stage detector that predicts bounding boxes and class probabilities in a single pass, it offers favorable inference speeds.

• Furthermore, TensorRT optimization [19, 20, 21] proves vital for bridging the gap between trained models and deployable real-time systems on NVIDIA hardware, as demonstrated throughout this work.

However, the accuracy-speed trade-off is a constant consideration. Models such as SA-SSD [15] achieve higher accuracy using structure-aware single-stage detection but tend to be computationally heavier than the lightweight YOLO variants chosen here for edge deployment.
2.4 Benchmark Datasets for 3D Object Detection

Standardized datasets are crucial for training models and evaluating their performance objectively. The KITTI dataset, a cornerstone of autonomous driving research, is the dataset featured most prominently in this work; it is described in detail in Chapter 3.
2.4.1 Other Notable Datasets

While KITTI is central to our work, other large-scale datasets have also continued to advance research in 3D perception.
• Waymo Open Motion Dataset [25]: Used, for example, by GLENet, the Waymo dataset is much larger than KITTI and offers data collected from a wider range of sensors in varied scenarios, with high-resolution LiDAR data and camera images together with detailed annotations for 3D object detection, tracking, and motion forecasting.

• nuScenes: Another large autonomous-driving dataset, nuScenes features a complete sensor suite that includes LiDAR, cameras, and radar, together with comprehensive annotations. It provides a more diverse range of weather and lighting conditions than KITTI.

• Argoverse: This dataset provides extensive sensor data and detailed maps, focusing primarily on motion forecasting and 3D tracking.

Larger and more complex datasets challenge the research community to develop perception algorithms that are robust and generalize across different environments. Nevertheless, for the edge-deployment-focused model optimization pursued in this work, KITTI remains a relevant and manageable benchmark.
2.5 Conclusion

This chapter discussed the relevant research papers and models studied for undertaking the project. Three main types of research papers were reviewed: those on the LiDAR sensor, on detection using point cloud data, and on processing point cloud data on edge devices. Additionally, the datasets used in these papers were discussed.
Chapter 3
Components & Materials
3.1 Technologies

1. Python3: The primary programming language used to develop the models.
2. PyTorch: Used for training the deep learning models.
3. Ultralytics: Provided the models and the functions to train them on custom data.
4. TensorRT: Used to deploy trained models on the edge device.
5. ROS2: Used for developing the simulations and the visualizer.
3.2 Sensor

In this work, the Ouster OS-1-64 LiDAR sensor served as the primary and sole source of environmental perception data. The general specifications of this LiDAR model are detailed in Table 3.1.

The Ouster OS-1-64 is a high-performance digital LiDAR known for its robust design, high resolution, and reliability, making it well suited for demanding applications such as autonomous driving and robotics.
Table 3.1: LiDAR Specifications

Parameter                               Value
Range (80% Lambertian Reflectivity)     110 m @ >90% detection probability, 100 klx sunlight
Minimum Range                           0.8 m for point cloud data
Range Accuracy                          ±5 cm for Lambertian targets, ±10 cm for retroreflectors
Range Resolution                        0.3 cm
Vertical Resolution                     64 channels
Horizontal Resolution                   512, 1024, or 2048 (configurable)
Vertical Field of View                  +16.6° to -16.6° (33.2°)
Horizontal Field of View                360°
3.3 Edge Device
The NVIDIA Jetson AGX Xavier served as the central processing unit and "brain" of the
NUSTAG autonomous vehicle in this project.
The key specifications of the Jetson AGX Xavier (32GB version) utilized in this project
are detailed in Table 3.2:
Table 3.2: Jetson AGX Xavier (32GB) Specifications

Parameter             Value
AI Performance        32 TOPS (Trillion Operations Per Second)
GPU                   512-core NVIDIA Volta architecture GPU with 64 Tensor Cores
GPU Max Frequency     1377 MHz
CPU                   8-core NVIDIA Carmel Arm v8.2 64-bit CPU, 8 MB L2 + 4 MB L3
CPU Max Frequency     2.2 GHz
Power                 10 W to 30 W
An edge device like the Jetson AGX Xavier is crucial for autonomous systems as it provides the necessary computational power to process sensor data, run complex algorithms
(such as perception, planning, and control), and make real-time decisions directly on the
vehicle, minimizing reliance on external cloud infrastructure and reducing latency.
3.4 Dataset

3.4.1 KITTI Dataset

The KITTI Vision Benchmark Suite [22, 26] is one of the most widely used datasets in mobile robotics and autonomous driving research. It was the primary dataset used for training the models.
• Sensor Suite: Data was collected from a standard station wagon equipped with high-resolution stereo cameras (one color, one grayscale), a Velodyne HDL-64E LiDAR scanner, and a high-precision GPS/IMU localization unit.

• Data Variety: It contains hours of driving data collected across different environments, including urban streets, rural roads, and highways, under different weather and lighting conditions.

• Annotations: KITTI provides carefully annotated 3D bounding boxes for several object classes (cars, pedestrians, cyclists) in both the LiDAR point cloud and the camera images. These annotations contain the object class, 3D location, dimensions, and orientation.

• Tasks: The benchmark covers stereo vision, optical flow, visual odometry, 3D object detection, and 3D tracking.

• Impact: KITTI has contributed greatly to the advancement of 3D perception. Almost all state-of-the-art models discussed here, such as PointPillars and Complex-YOLO, are evaluated and benchmarked on KITTI. This work also trains its YOLO models on the KITTI dataset.
3.4.2 Preprocessing
Raw LiDAR point clouds are rich in 3D information; however, they are sparse, unordered, and high-dimensional, which is contrary to how standard CNNs operate. Therefore, an important first step is to convert the point clouds into a more efficient structure that deep learning models can process. The approach adopted here produces a 2D Bird's Eye View (BEV) representation.

Preprocessing densifies the input data and improves the inference speed of the object detector. The 3D point cloud of a single frame, acquired by the LiDAR, is converted into a single RGB-BEV image covering an area of 50 m x 50 m in front of the sensor. Inspired by Chen et al. (MV3D) [7], the RGB map encodes height, intensity, and density.
The region of interest is defined as

P_\Omega = \{ P = [x, y, z]^T \mid x \in [0\,\mathrm{m}, 50\,\mathrm{m}],\; y \in [-25\,\mathrm{m}, 25\,\mathrm{m}],\; z \in [-1.75\,\mathrm{m}, 2\,\mathrm{m}] \}

We define a mapping function S_j = f_{PS}(P_{\Omega i}, g), with S \in \mathbb{R}^{m \times n}, mapping each point with index i into a specific grid cell S_j of our RGB-BEV. The set of all points mapped into a specific grid cell is

P_{\Omega i \to j} = \{ P_{\Omega i} = [x, y, z]^T \mid S_j = f_{PS}(P_{\Omega i}, g) \}

Hence, we can calculate the three channels of each pixel, denoting the signal (intensity) strength by I(P_\Omega):

z_g(S_j) = \max\big(P_{\Omega i \to j} \cdot [0, 0, 1]^T\big)
z_b(S_j) = \max\big(I(P_{\Omega i \to j})\big)
z_r(S_j) = \min\big(1.0,\; \log_{64}(N + 1)\big), \quad \text{where } N = |P_{\Omega i \to j}|

Here, N is the number of points mapped from P_{\Omega i} to S_j. Hence, z_g encodes the maximum height, z_b the maximum intensity, and z_r the normalized density of all points mapped into S_j.
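To make the encoding concrete, the following is a minimal sketch of how a single LiDAR frame could be converted into such an RGB-BEV image. It assumes an N x 4 NumPy array of points (x, y, z, intensity) and a 608 x 608 output grid; the function name and the simple per-point loop are illustrative rather than the exact implementation used in our pipeline.

# Minimal sketch of the RGB-BEV encoding described above. Assumes an Nx4 array
# of points [x, y, z, intensity]; grid size and ranges follow the text.
import numpy as np

def pointcloud_to_bev(points, size=608, x_range=(0.0, 50.0),
                      y_range=(-25.0, 25.0), z_range=(-1.75, 2.0)):
    x, y, z, intensity = points[:, 0], points[:, 1], points[:, 2], points[:, 3]

    # Keep only points inside the region of interest P_Omega.
    mask = ((x >= x_range[0]) & (x < x_range[1]) &
            (y >= y_range[0]) & (y < y_range[1]) &
            (z >= z_range[0]) & (z <= z_range[1]))
    x, y, z, intensity = x[mask], y[mask], z[mask], intensity[mask]

    # Discretize each point into a grid cell S_j (the mapping f_PS).
    col = ((x - x_range[0]) / (x_range[1] - x_range[0]) * size).astype(int)
    row = ((y - y_range[0]) / (y_range[1] - y_range[0]) * size).astype(int)
    col = np.clip(col, 0, size - 1)
    row = np.clip(row, 0, size - 1)

    bev = np.zeros((size, size, 3), dtype=np.float32)
    counts = np.zeros((size, size), dtype=np.float32)

    for r, c, zi, ii in zip(row, col, z, intensity):
        bev[r, c, 1] = max(bev[r, c, 1], zi)   # z_g: maximum height
        bev[r, c, 2] = max(bev[r, c, 2], ii)   # z_b: maximum intensity
        counts[r, c] += 1                      # N: points per cell

    # z_r: normalized point density per cell.
    bev[:, :, 0] = np.minimum(1.0, np.log(counts + 1) / np.log(64))
    return bev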
3.4.3 Dataset Preparation and Label Conversion
The RGB-BEV images generated from the KITTI dataset were used for training the models. KITTI provides 3D bounding box annotations for objects in the point clouds. These 3D labels were converted into 2D oriented bounding boxes for training the OBB YOLO models on the BEV images as follows:

• Project the corners of the 3D bounding box onto the 2D BEV plane.

• Parameterize the resulting OBB into the (x, y, w, h, r) representation required by the YOLO model.

Training images were kept at a resolution of 608x608 pixels, a standard input size for various versions of YOLO. During training, the augmentations provided by the Ultralytics framework were also employed.
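A simplified sketch of this conversion is shown below. It assumes the 3D box center and yaw have already been transformed into the LiDAR/BEV frame (a full KITTI pipeline would first apply the camera-to-LiDAR calibration) and uses the same 50 m x 50 m region and 608-pixel grid as the BEV encoding; the exact label-file layout expected by the training framework may differ.

# Illustrative sketch: turning a 3D box (already in the LiDAR/BEV frame) into a
# normalized (x, y, w, h, r) oriented-box label for the BEV image. Assumes the
# same region of interest and grid size as the BEV encoding above.
def box3d_to_bev_obb(cx, cy, length, width, yaw, size=608,
                     x_range=(0.0, 50.0), y_range=(-25.0, 25.0)):
    # Box center in pixel coordinates.
    px = (cx - x_range[0]) / (x_range[1] - x_range[0]) * size
    py = (cy - y_range[0]) / (y_range[1] - y_range[0]) * size

    # Box extent in pixels (pixels-per-metre resolution along each axis).
    w_px = length * size / (x_range[1] - x_range[0])
    h_px = width * size / (y_range[1] - y_range[0])

    # Normalize to [0, 1]; the rotation r is kept in radians.
    return px / size, py / size, w_px / size, h_px / size, yaw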
Figure 3.1: Point Cloud to 2D Bird Eye View Conversion.
3.5 Conclusion

This chapter discussed the technologies used and the detailed specifications of the Ouster Gen1 OS-1 LiDAR, the Nvidia Jetson Xavier, and the KITTI dataset.
Chapter 4
Object Detection
Autonomous systems, especially autonomous vehicles, depend on object detection for environmental perception. Object detection involves determining, from sensor data, where each object of interest is located, how big it is, and which class it belongs to, for instance identifying cars, pedestrians, and cyclists. For LiDAR systems in particular, this task amounts to real-time processing of large volumes of 3D point cloud data to extract this vital information. The performance of the object detection module therefore directly affects the safety and reliability of all downstream operations, including path planning, collision avoidance, and higher-level decision-making.
This chapter describes the object detection pipeline developed in this work within the overall perception system. The chapter begins by detailing the preprocessing of raw LiDAR data into representations suitable for deep learning-based detectors, most notably Bird's Eye View (BEV) projections.
A thorough account is then provided of the YOLO (You Only Look Once) family of object detection models. They offer a balanced trade-off between speed and accuracy, which is why they form the backbone of our detection strategy. The chapter also discusses the training procedure, including dataset preparation, data augmentation techniques, and Oriented Bounding Boxes (OBB) to improve object localization accuracy in BEV space.
Special attention is also given to the optimization methods employed to efficiently deploy these models on edge computing hardware, particularly the NVIDIA Jetson Xavier
AGX platform. Optimization with TensorRT, which includes techniques like layer fusion, precision calibration (FP16/INT8), and memory tuning, is highlighted as essential
for achieving real-time inference on constrained hardware.
Lastly, the chapter outlines the experimental setup used for evaluating different YOLO
configurations. Performance is analyzed on the KITTI dataset, and results are used to
justify the final model selection for integration into the perception system of the NUSTAG
EV.
4.1 YOLO (You Only Look Once) for Object Detection Architecture

The network takes a bird's-eye-view RGB map as input and uses a YOLO11-obb CNN architecture to accurately detect multi-class oriented objects while operating in real time. The YOLO (You Only Look Once) [18] family of object detectors was selected for this project because it has proven in the literature to be a very good compromise between speed and accuracy, and is therefore well suited for real-time applications on embedded systems.
4.1.1 Core Principles of YOLO
YOLO treats object detection as a single end-to-end regression problem: from image pixels to bounding box coordinates and class probabilities. This monolithic structure stands in contrast to two-stage detectors such as Faster R-CNN, which first propose regions of interest and then classify them. The core features of YOLO are:

• Single Pass: The entire image is processed in a single pass through the neural network.

• Grid-Based Detection: The input image is divided into an S × S grid. If an object's center falls within a grid cell, that grid cell is responsible for detecting that object.

• Bounding Box Prediction: Every grid cell predicts a fixed number of bounding boxes (B) and confidence scores for those boxes. The confidence score reflects how confident the model is that the box contains an object and how accurate it thinks the box is.

• Class Probabilities: Each grid cell also predicts conditional class probabilities for C classes, P(Class_i | Object).

• Unified Prediction: Predictions include the bounding box coordinates (x, y, w, h) (center coordinates, width, and height), a confidence score, and class probabilities for each predicted box.
This one-stage approach offers a drastically improved inference speed over a two-stage
detection mechanism.
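The sketch below illustrates the classic (YOLOv1-style) decoding of such a grid prediction. The tensor layout, class count, and threshold are illustrative assumptions; modern variants such as YOLO11 use different, anchor-free output heads, but the single-pass principle is the same.

# Illustrative decoding of the classic YOLO formulation described above: an
# S x S x (B*5 + C) tensor, where each of the B boxes carries
# (x, y, w, h, confidence) and each cell carries C class probabilities.
import numpy as np

def decode_yolo_grid(pred, num_boxes=2, num_classes=3, conf_thresh=0.25):
    S = pred.shape[0]
    detections = []
    for row in range(S):
        for col in range(S):
            cell = pred[row, col]
            class_probs = cell[num_boxes * 5:]               # C class probabilities
            for b in range(num_boxes):
                x, y, w, h, conf = cell[b * 5:(b + 1) * 5]
                scores = conf * class_probs                  # objectness * P(class)
                cls = int(np.argmax(scores))
                if scores[cls] >= conf_thresh:
                    # (x, y) are offsets within the cell; convert to image-relative coords.
                    cx = (col + x) / S
                    cy = (row + y) / S
                    detections.append((cx, cy, w, h, scores[cls], cls))
    return detections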
Figure 4.1: YOLOv11 base architecture.
4.1.2 Evolution and Variants Used
Since its inception, YOLO has been through many iterations, becoming more accurate, faster, and more adept at detecting objects at different scales. In this work, we analyze several variants:
• Complex-YOLOv3 Tiny & Complex-YOLOv4 Tiny [2]: Adaptations of the lightweight "tiny" versions of YOLOv3 and YOLOv4 explicitly tailored for 3D object detection on BEV images, as proposed in the Complex-YOLO line of work. They use very small network backbones to maximize inference speed on edge devices.

• YOLOv11 Small OBB [27]: The small Oriented Bounding Box variant of the Ultralytics YOLO11 family. "Small" indicates a compact architecture, and "OBB" indicates its capability to predict oriented bounding boxes. This was the strongest model in our experiments.

• YOLOv12 Nano OBB & YOLOv12 Small OBB [23]: Nano and Small OBB variants of the newer YOLOv12 family. "Nano" and "Small" denote two levels of model capacity, with "OBB" again denoting oriented boxes.

• RT-DETR (Real-Time DEtection TRansformer) [24]: An alternative architecture that uses transformers for object detection. Although DETR-based models perform well, real-time processing on edge devices remains an active research area for them. Their inclusion provides a comparison against modern detectors not based purely on convolutional architectures.

The choice of tiny, nano, and small versions reflects the project's performance goals for resource-constrained edge deployment.
4.1.3 Oriented Bounding Boxes (OBB) for BEV
Traditional object detectors predict axis-aligned bounding boxes (AABBs). Vehicles in a bird's eye view (BEV) image, however, can be oriented at arbitrary angles, so AABBs generally produce poor fits: they include large amounts of background and may miss portions of the object, especially for elongated objects lying at an angle.

Oriented Bounding Boxes (OBBs) fit the object more tightly by allowing the box to rotate. Typically defined by a center (x, y), width (w), height (h), and rotation angle (θ), an OBB can represent a rotated bounding box. The YOLO variants used here were chosen and configured to predict these five parameters instead of the four used for AABBs.

There are three advantages to employing OBBs in BEV object detection:

• Improved Localization: Tighter boxes provide better IoU with the ground truth and increase the accuracy of object localization.

• Less Overlap Ambiguity: Overlap between the boxes of nearby objects is kept to a minimum, reducing ambiguity at object boundaries.

• Improved Size and Shape Estimation: OBBs better represent the geometric footprint and orientation of the object, both of which are important for tracking and path planning.
The E-RPN from the Complex-YOLO paper, for example, targeted robust angle regression relevant to 3D object detection [2]. Modifying the output layer of the YOLO model
for 2D BEV OBB detection allows for the prediction of an additional angle parameter.
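For completeness, the small helper below expands an OBB parameterized as (cx, cy, w, h, θ) into its four corner points, which is useful for drawing rotated boxes on the BEV image or for computing rotated-box IoU. It is a generic geometric sketch rather than code taken from the detection model itself.

# Sketch: expanding an oriented box (cx, cy, w, h, theta) into its four corner
# points, e.g. for visualization or rotated-box IoU computation.
import numpy as np

def obb_to_corners(cx, cy, w, h, theta):
    # Half-extents of the box along its local axes.
    local = np.array([[ w / 2,  h / 2],
                      [ w / 2, -h / 2],
                      [-w / 2, -h / 2],
                      [-w / 2,  h / 2]])
    # Rotate by theta and translate to the box center.
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    return local @ rot.T + np.array([cx, cy])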
4.2 Model Training
Training robust, accurate object detection models requires careful data preparation, appropriate augmentation strategies, and a well-configured training pipeline.
4.2.1 Training Setup and Pipeline

Training took place on a powerful workstation:

• CPU: Intel Core i9 (32 logical processors)
• GPU: NVIDIA RTX 4090 (24 GB)
• RAM: 128 GB
Such hardware ensures reasonably fast training iterations. The paper mentions using the Ultralytics training function; the Ultralytics framework provides implementations and training scripts for most YOLO versions (e.g., YOLOv5, YOLOv8). All models used here were trained with standard training arguments, for 325 epochs at an image size of 608 pixels. Specifics of hyperparameter tuning are not given but can be assumed to follow standard training procedures aimed at maximizing detection mean Average Precision (mAP) and generalization.
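For illustration, a training run along these lines with the Ultralytics framework might look as follows; the epoch count and image size are taken from the report, while the weight file and dataset YAML names are hypothetical placeholders:

from ultralytics import YOLO

# Load a pretrained OBB checkpoint (hypothetical weight file name)
model = YOLO("yolo11s-obb.pt")

# Train with the settings reported above: 325 epochs, 608 px images
model.train(
    data="kitti_bev_obb.yaml",  # hypothetical dataset description file
    epochs=325,
    imgsz=608,
)

# Evaluate mAP on the validation split after training
metrics = model.val()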
Typically, the core of a training pipeline consists of the following:
• Loss Function: Most YOLO models employ a composite loss that combines bounding box coordinate regression (e.g., CIoU loss), object confidence (binary cross-entropy), and class prediction (binary cross-entropy or cross-entropy) terms.
• Optimizer: Adam or SGD with momentum are commonly used optimizers.
28
• Learning Rate Schedule: Learning rate decay and cyclic learning rates are common strategies.
• Batch Size: This is picked from the limitations imposed by the GPU memory.
While training mostly takes place on high-performance workstations, inference on edge devices such as the NVIDIA Jetson Xavier requires substantial optimization to reach real-time performance. One of the key enabling technologies for this is NVIDIA TensorRT.
4.2.2
The Need for Optimization on Edge Devices
Edge devices have far less compute, memory, and energy budget than servers or workstations. Without optimization, simply deploying a trained PyTorch or TensorFlow model onto an edge device results in unacceptably high latency and low throughput. Optimization is therefore necessary to:
• Reduce Latency: Ensure predictions are available fast enough for live decision-making.
• Increase Throughput: Process more frames per second.
• Minimize Resource Usage: Reduce CPU, GPU, and memory footprint.
4.3
Experimental Evaluation of Detection Models
The paper then presents an experimental comparison of the performance of different YOLO variants after TensorRT optimization, deployed on the NVIDIA Jetson Xavier.
29
Figure 4.2: YOLOv11 small training curve.
4.3.1
Models Evaluated
The following models were trained on KITTI BEV dataset and later tested on TensorRT
for Jetson Xavier performance evaluation:
1. Complex YOLOv3 Tiny
2. Complex YOLOv4 Tiny
3. YOLOv11 Small OBB (best performing)
4. YOLOv12 Nano OBB
5. YOLOv12 Small OBB
6. RT-DETR
The model configurations were chosen such that their computational requirements (GFLOPS)
are within performance specifications of the NVIDIA Jetson Xavier (mentioned in the
paper as under 32 GFLOPS, although this seems low for the peak of the Xavier AGX;
perhaps a sustained target was meant).
30
Figure 4.3: Examples of poor model performance.
Figure 4.4: Examples of good model performance.
31
Figure 4.5: RT-DETR model performance.
4.3.2
Performance Measures
Some main measures of evaluation were:
• TensorRT Inference Time (ms): The time to perform inference using an optimized
TensorRT engine on a single BEV image by the Jetson Xavier. This is a direct
measure of speed.
• mean Average Precision (mAP): The common metric for measuring the accuracy
of object detection.
– mAP @ 0.5 IoU: mAP computed at an Intersection over Union (IoU) threshold of 0.5. This is the common baseline used for PASCAL VOC-style evaluation.
– mAP @ 0.5:0.95 IoU: mAP averaged over multiple IoU thresholds, from 0.5 to 0.95 in steps of 0.05. This stricter metric, used for COCO-style evaluation, favors more precise localization.
32
IoU is the measure of overlap between the predicted bounding box and the ground truth
bounding box, as defined:
IoU = Area(PredictedBox ∩ GroundTruthBox) / Area(PredictedBox ∪ GroundTruthBox)    (4.1)

4.3.3
Result Analysis
The summary of the evaluated models is presented in Table 4.1 (Table I in the paper).
Table 4.1: Reiteration of Table I: Comparison of Object Detection Models Performance
on Xavier (Trained at 325 Epochs)
#   Model                  TRT Inference Time (ms)   mAP (%) @ 0.5 IoU   mAP (%) @ 0.5:0.95 IoU
1   Complex YOLOv3 Tiny    –                         –                   –
2   Complex YOLOv4 Tiny    –                         –                   –
3   YOLOv11 Small OBB      15                        93.0                71.2
4   YOLOv12 Nano OBB       –                         –                   –
5   YOLOv12 Small OBB      18                        –                   –
6   RT-DETR                50                        –                   –
Key observations from the results:
• yolo11s-obb achieved the best values on all major parameters:
– Lowest inference time: 15 ms.
– Highest mAP @ 0.5 IoU: 93.0%.
– Highest mAP @ 0.5:0.95 IoU: 71.2%.
• The Complex-YOLO tiny variants, although designed to be fast, turned out to be much slower (45–51 ms), which illustrates how far the latest YOLO architectures have advanced.
• yolo12s was slightly slower, with an inference time of 18 ms, and gave slightly lower accuracy than yolo11s.
• RT-DETR had a high inference time (50 ms), and its mAP fell below that of the best YOLO variants on this specific task.
33
Consequently, YOLOv11 Small OBB was adopted as the flagship detection model in the NUSTAG EV perception pipeline, since it offered the best trade-off between speed and accuracy on the edge device. Beyond the experiments above, its performance is further contextualized by a comparison against existing 3D object detection models from the literature (Table 4.2).
Table 4.2: 3D Object Detection Models Performance Comparison on Nvidia Xavier on
KITTI Dataset
Model              Inference (ms)   mAP (%) @ 0.5 IoU   mAP (%) @ 0.5:0.95 IoU
BirdNet+ [10]      –                –                   –
PointRCNN [11]     –                –                   –
Part-A2 [12]       –                –                   –
Pixor [13]         –                –                   –
Second [14]        –                –                   –
SA-SSD [15]        –                –                   –
PointPillars [1]   –                –                   –
YOLOv11 small      10               93.0                71.2
Figure 4.6: Performance of Object detection models on Xavier.
34
4.4
Conclusion
This chapter discussed the YOLO OBB model used for object detection, its training and
performance evaluation. It was further compared to the other models found in literature
such as BirdNet, PointRCNN, Pixor, and PointPillars. It is observed that the YOLOv11s model gave the lowest inference time of 10 ms and a high mAP of 93.0%.
35
Chapter 5
Object Tracking
5.1
Tracker
Tracking is performed with the ByteTrack algorithm, a multi-object tracking (MOT) algorithm that makes effective use of low-confidence detection boxes by associating them with tracklets, leading to improved performance, especially in crowded scenes and under occlusion.
5.2
ByteTrack MOT
ByteTrack uses a Kalman filter at its core for state estimation and prediction of the tracked objects. As a multi-object tracker (MOT), it maintains a history of the tracked objects in the form of data records called tracklets. Each tracklet stores positional data for the past 30 frames of its object. We extended the prediction of tracked objects to estimate their trajectory. The Kalman filter uses an 8-dimensional state vector to estimate the position of the objects being tracked.
In the Kalman filter, the state predicted from object dynamics (e.g., velocity and direction), x̂k|k−1, is fused with the noisy measurement zk produced by the YOLO detector, whose accuracy may vary over time, as given in [Eq: 5.1].
36
The following equation is at the core of Kalman filters,
x̂k = x̂k|k−1 + Kk (zk − H x̂k|k−1 )
(5.1)
here,
x̂k : Updated state estimate at time k
x̂k|k−1 : Predicted state estimate at time k given information up to k − 1
Kk : Kalman gain at time k
zk : Measurement vector at time k
H: Observation matrix
(zk − H x̂k|k−1 ): Innovation or measurement residual
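A minimal NumPy sketch of this update step, assuming an 8-dimensional state and a 4-dimensional measurement as used by ByteTrack (the matrices shown are placeholders):

import numpy as np

def kalman_update(x_pred, P_pred, z, H, R):
    """Fuse the predicted state x_pred (covariance P_pred) with the
    measurement z, following Eq. (5.1)."""
    y = z - H @ x_pred                      # innovation (z_k - H x_hat)
    S = H @ P_pred @ H.T + R                # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)     # Kalman gain K_k
    x_new = x_pred + K @ y                  # updated state estimate
    P_new = (np.eye(len(x_pred)) - K @ H) @ P_pred
    return x_new, P_new

# Example: observe (x, y, a, h) out of the 8-dimensional state
H = np.hstack([np.eye(4), np.zeros((4, 4))])
R = np.eye(4) * 0.01                        # measurement noise (placeholder)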
5.2.1
Tracklet Interpolation
A tracklet is a sequence of tracked object positions. Tracklet interpolation is useful when the object detector misses detections in intermediate frames.
Tracklet interpolation estimates the bounding boxes of these missed objects. If a tracklet
T loses its detection at time step t such that t1 < t < t2 and t2 − t1 < σ then the bounding
box Bt of the tracklet T is linearly interpolated between Bt1 and the first redetected box
Bt2 . The hyperparameter σ defines the acceptable time of occlusion. The following is the
formula for linear interpolation [Eq: 5.2].
Bt = Bt1 + (Bt2 − Bt1) · (t − t1) / (t2 − t1)    (5.2)
Tracklet interpolation helps maintain the structure of the tracklets, since each tracklet requires thirty positional data points from consecutive frames.
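A small sketch of this interpolation for a missed frame, assuming boxes are stored as NumPy arrays (the box format and the σ value are illustrative):

import numpy as np

def interpolate_box(box_t1, box_t2, t, t1, t2, sigma=30):
    """Linearly interpolate a missing box at frame t between the last
    detection at t1 and the first re-detection at t2 (Eq. 5.2).
    Returns None if the gap exceeds the occlusion threshold sigma."""
    if not (t1 < t < t2) or (t2 - t1) >= sigma:
        return None
    alpha = (t - t1) / (t2 - t1)
    return box_t1 + (box_t2 - box_t1) * alpha

# Example: object last seen at frame 10, re-detected at frame 14
b10 = np.array([100.0, 50.0, 40.0, 20.0])
b14 = np.array([120.0, 54.0, 40.0, 20.0])
b12 = interpolate_box(b10, b14, t=12, t1=10, t2=14)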
37
5.3
Multi-step Look ahead
The Kalman filter inherently performs a forward prediction of the state for a single time
step. We used [Eq: 5.3] with n = 20 to predict the state two seconds into the future.
x̂k+n = F^n x̂k    (5.3)
In the above equation [Eq: 5.3],
x̂k+n is the predicted state n time steps into the future,
x̂k is the current state estimate,
F is the 8 × 8 state transition matrix, and F^n is its n-th power.

x̂ = [x, y, a, h, ẋ, ẏ, ȧ, ḣ]ᵀ

here,
x and y are the abscissa and ordinate of the center of the bounding box,
a is the aspect ratio between the width w and height h of the bounding box such that w = a × h,
ẋ and ẏ are the velocities in the x- and y-directions respectively, while ȧ and ḣ are the rates of change of a and h respectively.
F =
⎡ 1  0  0  0  ∆t  0   0   0  ⎤
⎢ 0  1  0  0  0   ∆t  0   0  ⎥
⎢ 0  0  1  0  0   0   ∆t  0  ⎥
⎢ 0  0  0  1  0   0   0   ∆t ⎥
⎢ 0  0  0  0  1   0   0   0  ⎥
⎢ 0  0  0  0  0   1   0   0  ⎥
⎢ 0  0  0  0  0   0   1   0  ⎥
⎣ 0  0  0  0  0   0   0   1  ⎦

38
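A short sketch of the n-step look-ahead using this constant-velocity transition matrix; ∆t = 0.1 s (10 Hz) and n = 20 follow the report, while the state values are placeholders:

import numpy as np

dt, n = 0.1, 20                      # 10 Hz LiDAR, 2 s look-ahead

# Constant-velocity transition matrix: identity plus dt * velocity block
F = np.eye(8)
F[:4, 4:] = dt * np.eye(4)

# Current state estimate [x, y, a, h, vx, vy, va, vh] (placeholder values)
x_k = np.array([10.0, 5.0, 2.5, 1.8, 1.2, 0.0, 0.0, 0.0])

# Predict the state n steps ahead: x_{k+n} = F^n x_k
x_future = np.linalg.matrix_power(F, n) @ x_k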
Figure 5.1: Tracking test on LiDAR data from the San Francisco drive.
5.4
Conclusion
This chapter discussed the ByteTrack algorithm and tracklet interpolation. It described how the Kalman filter is used for forward prediction and how the tracker utilizes it to predict the trajectory of detected objects up to 2 seconds into the future.
39
Chapter 6
Hardware in Loop Simulations
In this chapter we discuss the Gazebo simulator, why it was chosen, and how the simulations were built using Gazebo Ignition and ROS2. Furthermore, the outputs of the deep learning models on the simulated environments are shown and discussed.
6.1
Gazebo Simulator
Gazebo is a collection of open-source software libraries designed to simplify the development of simulations for robotics applications. It is widely used by robot developers, designers, and educators. Each library within Gazebo has minimal dependencies and is made to be as modular as possible, making the libraries suitable for tasks such as solving mathematical transforms for kinematics and inverse kinematics, sensor imitation, and managing physics simulation [28].
Simulations provided a safe and efficient way to test the deep learning model’s performance under different conditions. Gazebo was chosen as the simulation platform for this
project for testing and evaluation of the NUSTAG electric vehicle in a digital environment.
40
6.2
6.2.1
Digital Twin Development
Measurements
All the measurements are either manually calculated or taken from the Teknofest documentation [29].
Table 6.1: Car Specifications

Parameter                 Value
Weight                    80 kg
Height                    169 cm
Width                     90 cm
Length                    258 cm
Ground Clearance          13 cm

Wheel Positions
Front Wheels Offset       46 ± 12.5 cm from arc center
Back Wheel Offset         61 ± 12.5 cm from arc center

Front Area
Thickness                 5 cm
Width                     63 cm
Length                    190 cm

Back Area
Length                    35 cm
Thickness                 5 cm
Width                     70 cm

Wheels
Thickness                 7 cm
Wheelbase                 170 cm
Front Wheel Opening       101 cm
Back Wheel Opening        82 cm
Diameter                  60 cm
41
Figure 6.1: Chassis Dimensions
Figure 6.2: Side View of Digital Twin
Figure 6.3: Top View of Digital Twin
42
6.2.2
URDF Development
URDF (Unified Robot Description Format) is an XML-based file format used to represent robot properties such as physical configuration, joints, and link structure. It is used by the ROS2 ecosystem to describe the kinematic and dynamic properties of a robot, such as its links, joints, inertial data, visual geometry, and collision models.
The URDF model of the digital twin of the car was designed to mimic the physical and
structural specifications of the real vehicle. The key features of the URDF model are as
follows:
1. Accurate Physical Dimensions:
The URDF contains the electric vehicle's geometry, including exact measurements of length, width, height, wheelbase, and ground clearance. This ensures that the simulated robot behaves similarly to its physical counterpart in terms of spatial interaction and kinematics.
2. Four Continuous Wheel Joints:
The vehicle is equipped with four continuous joints representing its four rotating
wheels. They essentially simulate realistic rolling behavior of the wheels during
forward and reverse motion.
3. Front Wheel Steering via Revolute Joints:
The two front wheels are connected via revolute joints, enabling them to rotate about
their vertical axes for steering. These joints are configured according to the Ackermann steering geometry [30], allowing for realistic driving dynamics and vehicle
mobility.
4. LiDAR Mount:
A specific link is defined in the URDF to represent the mounting point for the simulated LiDAR. This link is fixed to the vehicle's chassis at a height of 164 cm.
43
5. IMU Mount:
Similar to the LiDAR, a fixed link is assigned for the Inertial Measurement Unit
(IMU) sensor. The simulated IMU gives orientation, acceleration, and angular velocity readings in the virtual world. These readings are required for state estimation
and control algorithms.
6.2.3
Ackermann Controller
Ackermann steering geometry is a configuration which reduces tire slippage and enhances
turning efficiency, especially at low speeds, by ensuring all wheels follow concentric turning paths. Ackermann geometry ensures that the inner front wheel turns at a sharper angle
than the outer front wheel. This setup enables both front wheels to align with the instantaneous center of rotation, improving maneuverability and minimizing lateral tire slip.
Since Ackermann steering is implemented on the NUSTAG Electric Vehicle, an Ackermann controller was designed for the digital twin to mimic the steering and turning behavior of the vehicle.
Ackermann Geometry Calculations
The following parameters and formulas were used to calculate inner and outer wheel
steering angles and the turning radius based on Ackermann principles. All measurements
were taken manually after turning the car wheels to their maximum turning angle.
44
Figure 6.4: Ackermann Diagram
Given Data:
• Wheelbase: 170 cm
• Rear wheel track width: 82 cm
• Front wheel track width: 100 cm
• Distance from Rear Left wheel to Centre of Turning Radius: D = 284 cm
• Distance between Front and Rear wheel = 9.25 cm
Inner Wheel Steering Angle (θin): Using trigonometry:

θin = tan⁻¹( Wheelbase / (D − Offset) )

Substituting the values:

θin = tan⁻¹( 170 / (284 − 9.25) ) = tan⁻¹( 170 / 274.75 ) ≈ 31.8◦

45

Outer Wheel Steering Angle (θout): Using trigonometry:

θout = tan⁻¹( Wheelbase / (D + Rear wheel track width + 9.25) )

Substituting the values:

θout = tan⁻¹( 170 / (284 + 82 + 9.25) ) = tan⁻¹( 170 / 375.25 ) ≈ 24.4◦

Turning Radius Calculation

The turning radius R is computed using the Pythagorean theorem, measured from the centre of the car:

R = √( (170/2)² + (284 + 82/2)² ) = √( 85² + 325² ) = √( 7225 + 105625 ) ≈ 335.93 cm
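A short sketch reproducing these calculations with the measured values (variable names are illustrative):

import math

wheelbase = 170.0          # cm
rear_track = 82.0          # cm
offset = 9.25              # cm, front/rear wheel offset
D = 284.0                  # cm, rear-left wheel to centre of turning radius

# Inner and outer steering angles (degrees)
theta_in = math.degrees(math.atan(wheelbase / (D - offset)))
theta_out = math.degrees(math.atan(wheelbase / (D + rear_track + offset)))

# Turning radius measured from the centre of the car (cm)
R = math.hypot(wheelbase / 2, D + rear_track / 2)

print(round(theta_in, 1), round(theta_out, 1), round(R, 2))
# approximately 31.8, 24.4, 335.93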
6.2.4
Sensor Integration
The gz_sensor library was used to create dummy sensors to imitate sensor data. The
Electric Vehicle only utilizes the Ouster Gen 1 OS-1 LiDAR, which consists of:
1. LiDAR:
Generates a 3D map of the environment using point clouds. The simulated LiDAR shares the specifications of the Ouster Gen 1 OS-1 LiDAR, operating at 10 Hz with a resolution of 1024 horizontal samples x 64 vertical beams over a vertical field of view of ±16.6◦.
2. IMU:
Ouster LiDARs contain an integrated, calibrated IMU, which supports motion compensation and Simultaneous Localization and Mapping (SLAM). The IMU is
46
simulated for the sake of completeness.
6.3
Simulation Environments
To evaluate the accuracy of the trained models, several basic simulated environments were
created in Gazebo. Due to the significant RAM and GPU requirements of the simulator,
the environments were kept simple, using static objects with lightweight meshes.
6.3.1
Car-Only Environment
This environment simulates a street with a total of seven parked cars—six on the left side
and one on the right—all facing downward. The autonomous vehicle is programmed to
drive down the road and perform a U-turn to the left upon reaching the end.
Figure 6.5: Simulated car-only environment in Gazebo
6.3.2
Figure 6.6: Model detection results in
the car-only environment
Pedestrian-Only Environment
This environment simulates a market area populated with fifteen static pedestrian models. The pedestrians are evenly spaced across the grid, with four facing right, four facing
47
downward, and seven facing left. The vehicle navigates through the pedestrians and returns to its starting position.
Figure 6.7: Simulated pedestrian-only
environment in Gazebo
6.3.3
Figure 6.8: Model detection results in
the pedestrian-only environment
Mixed Environment: Cars and Pedestrians
This environment features a simulated intersection populated with twelve cars and seven
pedestrians. A turn is located at the edge of the map. The autonomous vehicle is tasked
with navigating around all static obstacles to reach the end of the scene.
Figure 6.9: Simulated environment with
both cars and pedestrians
Figure 6.10: Model detection results in
the mixed environment
48
6.4
Conclusion
This chapter discussed the Gazebo simulator and the background of Gazebo Fortress, the simulator used for this final year project. Furthermore, details of the digital twin developed for the simulations were discussed, including the measurements and the calculations of the Ackermann steering angles (θin = 31.8◦ and θout = 24.4◦) and the turning radius of approximately 3.36 m for the car.
The ROS2 pipeline was also discussed, which dictates how sensor data from the Gazebo simulator is stored and processed before being sent to the Jetson Xavier. Finally, detailed simulation environments were created. Three cases (car only, pedestrian only, and mixed cars and pedestrians) were built, and the output of the model was observed in all three environments.
49
Chapter 7
ROS2 Architecture and Visualization
Tools
7.1
ROS2 Framework
This section discusses the Robotic Operating System (ROS2), its use in developing the
visualizations for the sensor data using Rviz2 and the features programmed for easier
user interaction with the system.
7.1.1
ROS2
Robot Operating System 2 (ROS2) is an open-source framework of libraries and tools for building robot applications. It provides a modular publish/subscribe and service/client architecture for inter-node communication, drivers for hardware, and integration with simulation tools such as Gazebo. ROS2 was chosen for its support of distributed computing and real-time control in robotics.
50
7.1.2
Visualizations with Rviz2
Rviz2 is a ROS2 package which allows the user to display different types of data. It
provides support for viewing images, point cloud data, laser scanner data, robot states
(link and joint positions and orientations) and much more [31].
Figure 7.1: Rviz2 Output for Simulation
Figure 7.2: Rviz2 Output for LiDAR sensor
51
7.1.2.1
Topics
Topics in ROS2 carry the data that follows the publisher/subscriber architecture. A publisher pushes data to a topic, and all subscribers listening on that topic receive the data simultaneously. Each topic is strongly typed, meaning it uses static datatypes with well-defined semantics (e.g., all topics carrying angles use radians rather than degrees).
All the sensors (LiDAR and IMU) are publishers and Rviz2 is the subscriber (receiving and displaying the sensor data).
The sensor datatypes are referred to as interfaces. Interfaces define the structure of the message.
1. PointCloud2:
The interface consists of a header (which contains the timestamp and id of the message), the height and width of the 2D structure of the point cloud, the fields (e.g., range, reflectivity, signal, and NIR for the Ouster LiDAR), the length of each field in bytes, and the actual data.
Rviz2 Output for Simulation shows the type of PointCloud2 interface utilized by
the Simulation and Ouster LiDAR. Note that the Ouster LiDAR offers more fields
in comparison to the simulated LiDAR. This does not affect our BEV image as we
are utilizing the xyz coordinates and intensity values retrieved from the LiDAR.
52
Figure 7.3: Point Cloud Data visualized on Rviz2.
reflectivity (top left), intensity (top right),
range (bottom left), ambient (bottom right)
2. IMU:
The interface consists of a header, orientation (quaternion), orientation covariance
matrix, angular velocity (rad/s), angular velocity covariance matrix, linear acceleration and the linear acceleration covariance matrix. The matrices show the uncertainty in the sensor measurements and are a compact way to represent error in 3D
space.
3. Image:
The interface consists of a header, the height and width of the image, the encoding (e.g., RGB, HSV), the size of the image in bytes, and the pixel-wise data. This topic is used by the LiDAR to publish the NIR, range, signal, and reflectivity images shown in the figure below. The BEV node also utilizes this interface to publish the BEV image.
53
Figure 7.4: Bounding Boxes using Line Strip and Line List Markers
4. Markers:
Markers are a special type of Rviz2 interface that allows programmatic addition of different shapes to the 3D view [32]. Two main types of markers were utilized for generating the bounding boxes on Rviz2: LINE_STRIP (for generating a connected 2D square) and LINE_LIST (for generating vertical lines to represent height). A minimal publisher sketch is given after Figure 7.6 below.
(a) Line strip Marker
(b) Line list Marker
Figure 7.5: Rviz2 Display Markers
54
Figure 7.6: Bounding Boxes using Line Strip and Line List Markers
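As noted in the Markers item above, a minimal sketch of publishing such a box outline from a ROS2 Python node could look like the following (node name, topic, frame, and line width are illustrative; a full program would also call rclpy.init() and spin the node):

import rclpy
from rclpy.node import Node
from visualization_msgs.msg import Marker
from geometry_msgs.msg import Point

class BoxMarkerPublisher(Node):
    def __init__(self):
        super().__init__("box_marker_publisher")
        self.pub = self.create_publisher(Marker, "bbox_markers", 10)

    def publish_box(self, corners, frame_id="base_link"):
        """Publish a closed 2D rectangle as a LINE_STRIP marker.
        corners: list of four (x, y, z) tuples."""
        m = Marker()
        m.header.frame_id = frame_id
        m.header.stamp = self.get_clock().now().to_msg()
        m.type = Marker.LINE_STRIP
        m.action = Marker.ADD
        m.scale.x = 0.05                    # line width in metres
        m.color.g = 1.0
        m.color.a = 1.0
        # Close the loop by repeating the first corner at the end
        for x, y, z in list(corners) + [corners[0]]:
            m.points.append(Point(x=float(x), y=float(y), z=float(z)))
        self.pub.publish(m)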
7.2
Nodes and Topics
The RQT graph shows the interaction between the different nodes and topics in our system. The Gazebo simulation and LiDAR RQT graphs are shown below. In the figures, rectangles represent topics and ovals represent nodes.
Gazebo RQT
Figure 7.7: Node and Topics interaction in Gazebo Simulation
In the graph we observe that there are two main topics, joint_state and lidar/points. In the simulation, since we receive the point cloud directly, we convert the x, y, z and intensity values straight to the BEV image. The joint states update the car's joints (mainly the
55
wheels and front-wheel steering joints) so that the links and joints of the robot shown in Rviz2 stay up to date.
LiDAR RQT
Figure 7.8: Node and Topics interaction in the LiDAR
In the above graph we observe that the Ouster LiDAR uses the signal_image, range_image
and ouster_points to generate the BEV Image. There are two main methods to generate
the BEV from the LiDAR output:
1. Convert Range Image to xyz points using XYZLut provided by the ouster-sdk which
requires the LiDAR metadata file. Then concatenate it with the normalized signal
image.
2. Directly take the ouster/points message and filter the relevant fields (x, y, z and intensity). The conversions are done implicitly.
The first approach is used to stay consistent with the preprocessing described in the Preprocessing section.
7.3
Features
This section describes the different nodes and processes programmed in ROS2 that give users a way to visualize and interact with the data. The figure below shows an abstract workflow of how the ROS2 pipeline provides these features.
56
Figure 7.9: Overview of ROS2 Workflow
7.3.1
BEV Conversion
To convert the point cloud data to a BEV image, we first extract the relevant LiDAR features, then filter them, and finally convert them to the RGB BEV format (see Preprocessing).
7.3.1.1
Extraction
The Simulation and Hardware feature extraction is a bit different. After the features are
extracted the rest of the process is similar for both workflows.
1. Simulation
From the simulation we extract the x, y, z and intensity fields from the lidar/points topic (using the PointCloud2 interface). The obtained numpy array has dimensions 1 x 262144. We first convert all the data to the float32 datatype, then reshape the array to 1024 x 64 x 4 and transpose it to a 64 x 1024 x 4 matrix (a NumPy sketch of this reshape is given after this list).
57
In the above matrix the dimensions refer to the 64 vertical LiDAR beams, the 1024 horizontal samples, and the 4 channels. These are the configurations used by the Ouster LiDAR sensor.
2. Hardware
The Ouster LiDAR does not directly send point cloud data. Since it uses the Ethernet protocol, the data received by the hardware is in the form of packets. These packets are converted into RANGE and SIGNAL images, which are then converted to point cloud points (x, y, z) using the XYZ look-up table built from the metadata file that the LiDAR produces upon initialization.
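As referenced above, a rough NumPy sketch of the simulation-side reshape (the function name is illustrative; the sizes follow the text):

import numpy as np

def reshape_scan(flat_points):
    """Reshape a flat (1 x 262144) array of x, y, z, intensity values
    into a 64 x 1024 x 4 grid of LiDAR returns."""
    pts = np.asarray(flat_points, dtype=np.float32)
    # 1024 horizontal columns x 64 vertical beams x 4 channels
    pts = pts.reshape(1024, 64, 4)
    # Re-order to (beams, columns, channels)
    return pts.transpose(1, 0, 2)

# Example with dummy data of the expected size
scan = reshape_scan(np.zeros(1024 * 64 * 4))
assert scan.shape == (64, 1024, 4)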
7.3.1.2
Filtering
The following boundary conditions have been used to filter point cloud points. These
configuration values are taken from .
minX   maxX   minY   maxY   minZ    maxZ
0      50     -25    25     -1.73   2.27
The above values give the distances (in meters) captured by the LiDAR: it looks 50 m ahead and 25 m to the left and right. As the height of our car is 1.73 m, minZ is set to the negative of the car's height and maxZ to 2.27 m.
A mask is generated to keep only the points that satisfy these conditions, and the minZ value is then subtracted from the z channel.
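A sketch of this filtering step, assuming points is an (N, 4) NumPy array of x, y, z, intensity values (function and variable names are illustrative):

import numpy as np

# Boundary conditions from the table above (metres)
minX, maxX = 0.0, 50.0
minY, maxY = -25.0, 25.0
minZ, maxZ = -1.73, 2.27

def filter_points(points):
    """Keep only points inside the BEV region and shift z by minZ."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    mask = (
        (x >= minX) & (x <= maxX) &
        (y >= minY) & (y <= maxY) &
        (z >= minZ) & (z <= maxZ)
    )
    filtered = points[mask].copy()
    filtered[:, 2] -= minZ        # ground becomes z = 0
    return filtered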
7.3.1.3
Generating Features
The filtered point cloud data is discretized into a fixed grid of 608 x 608 pixels (the BEV image). The X and Y coordinates of each point are converted into discrete image coordinates,
and points are sorted to prioritize those with the highest elevation. From this discretized
58
and sorted point cloud, three feature maps are generated: a height map representing the
normalized vertical position (Z-axis) of the topmost point in each cell, an intensity map
capturing the reflectance value of the first point in each cell, and a density map that encodes the number of points per grid cell using a logarithmic normalization. These three
maps are stacked as separate channels in a single RGB image, where red corresponds to
density, green to height, and blue to intensity.
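A simplified NumPy sketch of building the three feature maps described above; the 608-pixel grid, channel assignment, and ranges follow the text and the boundary table, while the helper names and the height/density normalisation constants are illustrative:

import numpy as np

BEV_SIZE = 608

def make_bev(points, x_range=50.0, y_range=50.0, max_z=4.0):
    """points: (N, 4) array of x, y, z, intensity after filtering."""
    res_x = x_range / BEV_SIZE
    res_y = y_range / BEV_SIZE
    xi = np.clip((points[:, 0] / res_x).astype(int), 0, BEV_SIZE - 1)
    yi = np.clip(((points[:, 1] + y_range / 2) / res_y).astype(int), 0, BEV_SIZE - 1)

    bev = np.zeros((BEV_SIZE, BEV_SIZE, 3), dtype=np.float32)
    # Sort by elevation so the highest point in each cell is written last
    order = np.argsort(points[:, 2])
    xi, yi, pts = xi[order], yi[order], points[order]

    bev[xi, yi, 1] = pts[:, 2] / max_z        # green: normalised height
    bev[xi, yi, 2] = pts[:, 3]                # blue: intensity of the kept point
    # red: log-normalised point density per cell
    counts = np.zeros((BEV_SIZE, BEV_SIZE), dtype=np.float32)
    np.add.at(counts, (xi, yi), 1.0)
    bev[:, :, 0] = np.minimum(1.0, np.log(counts + 1) / np.log(64))
    return bev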
7.3.2
Visualize Live Sensor Data
The script first initializes the LiDAR's UDP ports to start transmission of data. It then converts the packets sent by the LiDAR into the PointCloud2 format and publishes them on the ouster/points topic. It also publishes on the range_image, reflectivity_image, NIR_image, and signal_image topics.
The following command’s can be used to for this feature:
ros2 launch ouster_ros sensor.launch.xml \
sensor_hostname:=
Just need to specify the IPv4 address of the LiDAR
ros2 launch ouster_ros driver.launch.py \
params_file:=
requires path to configuration file which specifies LiDAR properties such as its IPv4 address, its beam configuration, frequency etc.
ros2 launch av_car sensor_launch
Wrapper on the above command to automatically generate and link the configuration files
for launching sensor.
ros2 launch av_car sensor_with_bev_launch.py sensor:=true
Launch the sensor visualizer with BEV image.
59
7.3.3
Record Data
In Overview of ROS2 Workflow we observe that the simulation data is stored in rosbag format, whereas the LiDAR data is stored as pcap.
7.3.3.1
Simulation
ROS2 provides support for storing data recorded in simulations. This includes the messages, the topics they were published on, and their timestamps. The data is stored as a compressed .sqlite3 file.
To record in rosbag format the command
ros2 bag record -a -o <bag name>
is used. The command listens to and stores data published on all topics.
To record only selected topics,
ros2 bag record -o <bag name> <topic 1> <topic 2> ...
can be used, where the list of desired topics is provided.
7.3.3.2
Hardware
Upon establishing connection with the LiDAR, there are two main methods to record the
LiDAR.
1. rosbag: Rosbag saves specific topics published by the LiDAR. The recorded data is saved in the .sqlite3 format.
ros2 launch ouster_ros record.launch.xml \
    sensor_hostname:=<lidar IPv4 address> \
    bag_file:=<output bag file> \
    metadata:=<path to metadata file>
60
2. pcap: Raw LiDAR packets are stored directly.
ouster-cli source <sensor hostname> save <filename>.pcap
7.3.4
Replay Data
7.3.4.1
rosbag:
This command replays all the data stored on the saved topics.
ros2 bag play <path to bag>
7.3.4.2
pcap:
Reads the pcap file and converts the data into the topics ouster/points, range_image, reflectivity_image, NIR_image, and signal_image to visualize it on Rviz2.
ros2 launch ouster_ros replay_pcap.launch.xml \
    pcap_file:=<path to pcap file> \
    metadata:=<path to metadata file>
7.4
Conclusion
This chapter discussed the ROS2 framework, the Rviz2 visualizer used to show the sensor data (point clouds and images), the ROS2 pipeline for converting point cloud data to the RGB BEV image, and finally the features implemented using ROS2 (recording, replaying and live viewing of the LiDAR data).
61
Chapter 8
Deployment & Validation
8.1
Deployment on Nvidia Jetson
TensorRT (an inference optimizer and runtime developed by Nvidia) was used to get efficient real-time performance on the Nvidia Jetson. TensorRT builds a hardware-optimized
inference graph of our trained object detection model. Various optimizations (specific to
the target Nvidia GPU architecture) such as layer fusion and quantization are applied to
the input model to make the model run more smoothly on the hardware.
The deployment pipeline first converts the trained PyTorch model into the ONNX (Open
Neural Network Exchange) format. ONNX serves as an intermediary representation that
allows interoperability between different deep learning frameworks. Finally, this ONNX graph is compiled into a TensorRT engine for runtime inference.
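As a sketch, this export chain can be driven with the Ultralytics tooling roughly as follows; the ONNX opset and image size follow the report, while the weight path and the engine-building step are illustrative:

from ultralytics import YOLO

# Load the trained OBB detector (hypothetical weights path)
model = YOLO("runs/obb/train/weights/best.pt")

# Step 1: export the PyTorch model to ONNX with opset 16
model.export(format="onnx", opset=16, imgsz=608)

# Step 2 (on the Jetson): build a TensorRT engine from the ONNX graph,
# e.g. with the trtexec tool shipped with TensorRT:
#   trtexec --onnx=best.onnx --saveEngine=best.engine --fp16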
8.1.1
Limitations of TensorRT
The latest TensorRT version supported for Jetson Xavier (running Jetpack 5.1) is 8.5.2.
This is why ONNX Opset version 16 was used. This version was chosen because it
represents one of the latest ONNX operator sets with comprehensive compatibility with
the version of TensorRT available.
62
8.1.2
Benefits of TensorRT
Employing TensorRT to significantly speed up inference, as we observed a 330% increase in our FPS (Frames Per Second). This metric jumped from 18 FPS to 62 FPS, for
yolo11s-obb model for PyTorch to TensorRT.
8.2
Live Detections on Jetson
Utilizing the ouster-sdk, a Python script was written to capture the LiDAR frames, convert them to RGB BEV images and finally feed them into the model. The model then generated bounding boxes for each frame. An average inference time of 20 ms was observed when a car, cyclist, or pedestrian was detected.
Figure 8.1: Object Detection and Tracking Models running on Xavier
8.3
Integration with NUSTAG Autonomous Vehicle
The Jetson and LiDAR were to be mounted on the autonomous vehicle for complete
integration. As the mount for Jetson was already present, only a mount for the LiDAR
63
was required.
8.3.1
Mount Design
The CAD model for the mount was designed keeping in mind the vehicle's limitations. The mount is hollow and has two openings, at the front and back, to accommodate the mono-camera already mounted on the car.
The mount had two parts, the top part acted as a holder for the LiDAR and the bottom part
was bolted to the vehicle to provide a strong base for holding the structure together.
All credit for the designs go to Muneeb Ahmad from 43-MTS.
(a) Top Part of Mount
(b) Bottom Part of Mount
Figure 8.2: CAD Models of the LiDAR Mount
8.3.2
Mounting LiDAR on Vehicle
The top and bottom parts were drilled together, making the complete mount. The lower
portion of the mount was further drilled and then bolted to attach it to the vehicle’s chassis.
The top part was made using acrylic on account of it being cheap, strong and resistant to
thermal deformation (unlike the 3D printed materials such as PPG). The bottom part was
made with hollow scrap metal bars welded together.
64
Figure 8.3: LiDAR mounted on Autonomous Vehicle
8.4
Conclusion
Firstly, despite the benefits of TensorRT, we encountered limitations in fully exploiting the
sparsity inherent in our data. TensorRT’s optimization tactics do not inherently account
for the sparsity of the input data, and the closed-source nature of its internal optimization
strategies makes it challenging to guide the compilation process to specifically leverage
data sparsity. Consequently, we were unable to directly implement a Spatial Sparse Convolution algorithm as a custom TensorRT plugin within the constraints of this project.
While spconv2 exists, its reliance on a PyTorch backend did not provide the desired performance gains beyond what the Xavier’s native TensorRT capabilities offered for our
specific sparse data.
Finally, an overview of how the LiDAR was attached to the Autonomous Vehicle was
discussed, which includes the development of the CAD models and finally creating the
mounts.
65
Chapter 9
Conclusion and Future Work
9.1
Conclusion
This investigation has addressed an important problem in autonomous systems: achieving
efficient, real-time LiDAR-based object detection and tracking on power-limited edge
devices. The foremost aim was to close the chasm between the perception prowess of
LiDAR sensors and the computational limitations of embedded platforms. This objective
has set the stage for building a united environmental perception pipeline designed to be
integrated into an autonomous agent, that is, the NUSTAG Electric Vehicle.
A complete yet optimized pipeline was built to deal with all the crucial aspects, from raw
LiDAR point cloud coordinate transformation, Bird-Eye View (BEV) projection, and object detection using YOLO-based models to multi-object tracking with trajectory prediction. The whole system was designed and optimized for deployment based on NVIDIA’s
TensorRT framework for low-latency and high-throughput inference on the Jetson Xavier
AGX platform.
Extensive evaluations were conducted with RGB-BEV images rendered from the KITTI
dataset, and oriented bounding boxes (OBB) provided the spatial accuracy. Among the
different models of the YOLO algorithm, YOLOv11 Small OBB has been chosen for its
66
best compromise of accuracy and speed. After optimization with TensorRT, the mean
Average Precision achieved was 93.0% at IoU 0.5 and 71.2% at IoU 0.5–0.95, with an impressive inference time of 10 ms on the Jetson.
Beyond simulation, the system was also integrated into the NUSTAG Electric Vehicle prototype developed at NUST. Campus
tests confirmed the pipeline’s full capability for real-time and 360-degree object detection
and tracking in the real world within the NUST campus. Multi-Object Tracking using
implementation of the ByteTrack algorithm together with Kalman filtering for trajectory
prediction up to two seconds ahead delivers the time-consistency and foresight critical to
downstream autonomous functionalities.
Deployment and validation of this system thus do not only confirm the individual performance of all components but also prove that they can cohere well into a unified perception
solution in real time. Such a vision implements an important step towards autonomy at
scale within constrained edge environments.
9.1.1
Key Contributions
• Efficient Edge Deployment: Developed a lightweight, high-performance LiDAR perception system with YOLO OBB models optimized through TensorRT for real-time inference on power-constrained edge devices.
• RGB-BEV Image Pipeline: This pipeline set forth a practical process to produce
RGB-BEV images from raw LiDAR point clouds, thus honoring a delicate balance of
information richness versus computational efficiency for YOLO-based detection.
• Real-World Validation: Successfully rolled out aboard the NUSTAG EV autonomous
vehicle, thus proving real-world applicability for such features as automated braking
and lane change assistance.
• Proactive Safety: Predicting the future paths of detected objects as soon as they are
67
recognized supports the development of proactive safety systems within the autonomous driving scheme.
In reality, this research has certain limitations. The system was implemented and optimized for the NVIDIA Jetson Xavier AGX platform, and its universality remains subject to further examination with respect to generalizing into other edge hardware. Moreover, the training and evaluation were basically conducted using the KITTI dataset, which
though extensive, does not represent the entire range available in highly dynamic urban
or off-road environments. Real-world testing at the NUST campus proved to be a fairly
realistic evaluation ground, but larger field validation in a wide range of random traffic
scenarios would boost claims of robustness and general applicability of the system.
To summarize, this Final Year Project has successfully realized the design, implementation, and validation of a fast and efficient YOLO-based LiDAR object detection and tracking pipeline, suitable for real-time deployment on edge devices. The system proved fast, accurate, and reliable, significantly enhancing the perceptual capabilities of autonomous systems under computational constraints. This research tackles important issues in embedded LiDAR processing and opens the pathway for resource-conscious, intelligent, and reactive autonomous agents.
9.2
Future Work
With the successful modification of the vehicle and the promising results of the LiDAR perception system currently on the NUSTAG EV, there is an exciting road of future developments ahead. These extensions would strengthen the capabilities, robustness, and applicability of the system toward higher autonomy and further deployment scenarios.
1. Advancing Towards Higher Levels of Autonomy (Level 3 and Level 4): The
present configuration is a sound basis for Level 2 autonomous driving features (such
as adaptive cruise control and basic lane-keeping assistance based on perceived ob-
68
jects). A pivotal future challenge, as alluded to in the introduction, is to build up
the perception and decision-making ability to allow operation at SAE Level 3 (conditional automation, wherein the driver can cede all safety-critical functions to the
automated driving system under certain conditions) and Level 4 (high automation,
where an automated driving system can control the vehicle within its defined operational design domain).
2. The Development of Lightweight 3D Sparse Convolution Libraries for aarch64 Platforms: Advanced 3D object detection models, for example GLENet [3], operate directly on point clouds and rely on sparse convolutions, yet optimized, ready-to-use sparse convolution libraries are not available for aarch64 architectures, which are common on edge devices such as those in the NVIDIA Jetson series. A promising direction for future work is contributing lightweight, efficient sparse convolution libraries for these platforms. This could open the door to deploying a variety of state-of-the-art perception models directly at the edge, with potential gains in accuracy or robustness from richer 3D feature learning compared with BEV projections.
3. An Alternative to nvidia-smi for Jetson Devices: Many voxelization-based 3D object detection algorithms depend on functionality of the kind exposed by nvidia-smi. On embedded Jetson platforms such tools and libraries are typically unavailable, which prevents many of these methods from being ported directly or implemented optimally. Future work could include the development or adaptation of tools and libraries providing similar core functionality for GPU resource management and introspection on Jetson.
4. TensorRT-optimized Sparse Spatial Convolution: A custom TensorRT plugin, or alternative methods to explicitly leverage data sparsity within the TensorRT framework, could be developed, making the deployment of more computationally
69
intensive 3D object detection models directly on the Jetson AGX Xavier more feasible. This would represent a significant advancement, potentially enabling the use of richer 3D information for perception tasks on resource-constrained edge devices, moving beyond the limitations of relying solely on 2D convolution-based YOLO models for real-time autonomous applications on the Xavier platform.
5. Exploring Multi-Modal Sensor Fusion for Improved Robustness: While LiDAR excels at providing 3D geometric information, its performance degrades in bad weather (e.g., dense fog, heavy rain), and it does not provide color or texture information. Future work should investigate fusing LiDAR data with other sensing modalities:
• Cameras: Enrich the color, texture, and semantic data aiding in object classification and understanding of road signs/markings.
• Radar: Robust against adverse weather and direct measurement of velocities,
complementing LiDAR spatial accuracy.
• Proximity Sensors (Ultrasonics): Helpful during very short-range obstacle
detection, for example, during parking maneuvers.
It is thus crucial to develop lightweight, synchronized fusion architectures (early, mid, or late fusion) capable of running efficiently on edge devices. The paper recommends utilizing frameworks similar to LangChain for embedded platforms so that multiple perception models from different modalities can run in parallel and be managed efficiently. This would ultimately contribute to a perception stack that is more robust and faster than before.
6. Continuous Model Optimization and Quantization: Incorporating model optimization techniques such as aggressive quantization (for example, INT8 or lower where accuracy is not considerably sacrificed), pruning, and knowledge distillation may further reduce the computational footprint and power consumption of the
70
neural network, improving overall efficiency for deeply embedded applications.
Future investigations in these directions would greatly expand the foundations of this work and result in more powerful, reliable, and generally applicable perception systems for the next generation of autonomous technology. It has been a journey that continues toward further improvements and innovations, with many opportunities ahead for these contributions to remain highly relevant.
71
References
[1] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom, "Pointpillars: Fast encoders for object detection from point clouds," in Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), pp-,
2019.
[2] M. Simony, S. Milz, K. Amende, and H.-M. Gross, "Complex-yolo: An euler-region-proposal for
real-time 3d object detection on point clouds,” IEEE Transactions on Robotics,
vol. 35, no. 6, pp-, 2019.
[3] W. Zhou, L. Liu, X. Wu, and Z. Li, “GLENet: Boosting 3d object detectors with generative label uncertainty estimation,” in Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR), pp-, 2021.
[4] B. Graham, M. Engelcke, and L. van der Maaten, "3d semantic segmentation with submanifold sparse convolutional networks," in Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), pp. 922–929, 2018.
[5] R. Qi, L. Yi, H. Su, and L. Guibas, “PointNet: Deep learning on point sets for 3d
classification and segmentation,” in Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR), pp. 652–660, 2017.
[6] Y. Zhou and O. Tuzel, “VoxelNet: End-to-end learning for point cloud-based 3d
object detection,” CoRR, vol. abs/-, 2017.
[7] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia, “Multi-view 3d object detection network
for autonomous driving,” CoRR, vol. abs/-, 2016.
[8] M. Engelcke, D. Rao, D. Z. Wang, C. H. Tong, and I. Posner, “Vote3Deep: Fast
object detection in 3d point clouds using efficient convolutional neural networks,”
CoRR, vol. abs/-, 2016.
[9] X. Ma, C. Qin, H. You, H. Ran, and Y. Fu, “Rethinking network design and local geometry in point cloud: a simple residual MLP framework.” arXiv preprint
arXiv:-, 2022.
[10] A. Barrera, C. Guindel, J. Beltrán, and F. García, “BirdNet+: End-to-end 3d object
detection in LiDAR bird’s eye view,” in Proceedings of the IEEE International Conference on Intelligent Transportation Systems (ITSC), (Rhodes, Greece), pp. 1–6,
IEEE, Sept. 2020.
[11] S. Shi, X. Wang, and H. Li, “PointRCNN: 3d object proposal generation and detection from point cloud,” in Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), (Long Beach, CA, USA), pp. 770–779,
IEEE/CVF, June 2019.
[12] S. Shi, Z. Wang, X. Wang, and H. Li, “Part-A^2 Net: 3d part-aware and aggregation neural network for object detection from point cloud.” arXiv preprint
arXiv:-, 2019. The volume 2, number 3 in the original citation appears to
be incorrect for an arXiv paper and is omitted.
[13] B. Yang, W. Luo, and R. Urtasun, “Pixor: Real-time 3d object detection from point
clouds,” in Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pp-, IEEE, 2018.
[14] Y. Yan, Y. Mao, and B. Li, “Second: Sparsely embedded convolutional detection,”
Sensors, vol. 18, no. 10, p. 3337, 2018.
[15] C. He, H. Zeng, J. Huang, X.-S. Hua, and L. Zhang, “Structure aware single-stage 3d
object detection from point cloud,” in Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR), pp-, IEEE/CVF,
2020.
[16] M. Bhandari, S. Srivastava, and V. Kumar, “Real-time high performance computing
using a jetson xavier AGX,” International Journal of Innovative Technology and
Exploring Engineering (IJITEE), vol. 9, pp. 256–261, Apr. 2020.
[17] G. Welch and G. Bishop, “An introduction to the kalman filter,” Technical Report
TR 95-041, University of North Carolina at Chapel Hill, Department of Computer
Science, July 1995.
[18] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified,
real-time object detection,” in Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), (Las Vegas, NV, USA), pp. 779–788, IEEE,
2016.
[19] F. Ahmed and Z. Liu, “TensorRT-based framework and optimization methodology
for deep learning inference on jetson boards,” in Proceedings of the International
Conference on Embedded Systems (ICES), (Austin, TX, USA), pp. 29–36, 2020.
[20] C. Peng, J. Wang, and L. Xu, “Benchmark analysis of deep learning-based 3d object
detectors on NVIDIA Jetson platforms,” IEEE Access, vol. 8, pp-,
2020.
[21] M. Abadi, A. Agarwal, and P. Barham, “TensorRT inference with TensorFlow,” in
Proceedings of the International Conference on Machine Learning (ICML), (Vienna,
Austria), pp-, 2020. This reference appears to be problematic as cited;
an ICML paper with this exact title and authors is not readily found for 2020.
[22] A. Geiger, P. Lenz, and R. Urtasun, “Vision meets robotics: The KITTI dataset,”
International Journal of Robotics Research, vol. 32, no. 11, pp-, 2013.
[23] Y. Tian, Q. Ye, and D. Doermann, “YOLOv12: Attention-Centric Real-Time Object
Detectors.” arXiv preprint arXiv:- (Placeholder), Feb. 2025. This appears
to be a placeholder for future work with a future date and hypothetical arXiv ID.
[24] Y. Zhao, W. Lv, S. Xu, J. Wei, G. Wang, Q. Dang, Y. Liu, and J. Chen, “DETRs
Beat YOLOs on Real-time Object Detection.” arXiv preprint arXiv:-,
Apr. 2023.
[25] S. Ettinger, S. Cheng, B. Caine, C. Liu, H. Zhao, S. Pradhan, Y. Chai, B. Sapp,
C. R. Qi, Y. Zhou, Z. Yang, A. Chouard, P. Sun, J. Ngiam, V. Vasudevan, A. McCauley, J. Shlens, and D. Anguelov, “Large scale interactive motion forecasting for autonomous driving: The waymo open motion dataset.” arXiv preprint
arXiv:-, 2021.
[26] A. Geiger, “Are we ready for autonomous driving? The KITTI Vision Benchmark
Suite,” in Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), (Washington, DC, USA), pp-, IEEE, 2012.
[27] Y. Sun, P. Li, and J. He, “YOLOv11: An overview of the key architectural enhancements,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (Long Beach, CA, USA), pp-, IEEE, 2021.
YOLOv11 is not a widely recognized official release. The conference location for
CVPR 2021 was virtual; Long Beach was CVPR 2019.
[28] O. S. R. Foundation, “Gazebo.” https://gazebosim.org/about, Jan. 2020.
[29] N. A. Group, “TEKNOFEST: ROBOTAXI-FULL SCALE AUTONOMOUS VEHICLE COMPETITION PRELIMINARY DESIGN AND SIMULATION REPORT,”
June 2022.
[30] J. Vogel, “Tech Explained: Ackermann Steering Geometry.” Racecar Engineering,
Apr. 2021. Retrieved January 24, 2025.
[31] O. S. R. Foundation, “Rviz2.” https://github.com/ros2/rviz, Jan. 2022.
[32] O.
S.
R.
Foundation,
“Marker-Display.”
https://docs.
ros.org/en/humble/Tutorials/Intermediate/RViz/
Marker-Display-types/Marker-Display-types.html, Jan. 2020.