The increasing deployment of traffic cameras provides an opportunity to utilize them for smart city applications. However, the efficacy of such systems is determined by their ability to accurately detect and track objects of interest. This is challenging due to diverse viewpoints, elevations, and the distinct properties of camera sensors. Thus, to ensure robust performance, the training dataset should cover many variations, including viewpoints, illumination changes, and diverse weather conditions. However, constructing such a dataset is expensive in terms of both data collection and annotation. This paper proposes an unsupervised domain adaptation approach in which a synthetic dataset is generated using a simulator and subsequently used to ensure consistent performance of multi-object tracking (MOT) algorithms across a diverse range of manually annotated natural scenes. Towards this end, we emphasize achieving domain-invariant object detection by combining image stylization and class-balancing augmentation. Furthermore, we extend the robust detection algorithm to track detected objects over long time scales using feature embeddings generated by the detector. Based on qualitative and quantitative results, we demonstrate the viability of such a system, which is invariant to illumination, weather, viewpoint, and scene changes, while providing a baseline for future research. The codebase and datasets will be made available at https://github.com/pranjay-dev/IS2R.
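As a rough illustration of the embedding-based tracking idea mentioned above, the sketch below associates new detections with existing tracks by cosine similarity of their feature embeddings, resolved with Hungarian assignment. This is a minimal sketch under assumed conventions: the function name `associate_by_embedding`, the similarity threshold, and the embedding shapes are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate_by_embedding(track_embs, det_embs, sim_threshold=0.5):
    """Match detections to tracks via cosine similarity of embeddings.

    track_embs: (T, D) array of per-track appearance embeddings.
    det_embs:   (N, D) array of per-detection embeddings from the detector.
    Returns a list of (track_idx, det_idx) pairs whose similarity
    exceeds sim_threshold (an assumed, tunable value).
    """
    # L2-normalize so the dot product equals cosine similarity.
    t = track_embs / np.linalg.norm(track_embs, axis=1, keepdims=True)
    d = det_embs / np.linalg.norm(det_embs, axis=1, keepdims=True)
    sim = t @ d.T  # (T, N) cosine-similarity matrix

    # Hungarian algorithm minimizes cost, so negate similarity to maximize it.
    rows, cols = linear_sum_assignment(-sim)

    # Keep only sufficiently similar pairs; the rest spawn or terminate tracks.
    return [(r, c) for r, c in zip(rows, cols) if sim[r, c] >= sim_threshold]
```

In practice such appearance matching is usually combined with a motion cue (e.g., IoU or a Kalman-filter gate); the embedding term is what lets a tracker re-identify objects over long gaps.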