Vehicular accidents pose a substantial risk to drivers, underscoring the persistent need for heightened safety measures. Early accident anticipation enables proactive intervention, while detection accuracy is pivotal for prompt response and effective post-accident mitigation. Accurate and early anticipation of accidents, whether for automated driving assistance systems in vehicles or CCTV surveillance in cities, remains a complex task due to the intricate spatial-temporal interactions within traffic videos. This study presents text-informed magnitude enhancement in contrastive multiple-instance feature learning for vehicle accident detection and anticipation (TIME-VAD). Text represents concepts more explicitly than the images in a video, making multi-modal learning well suited to this task. Moreover, the traditional assumption in magnitude-based multiple-instance learning under weak supervision, that accident frames yield larger feature magnitudes than normal frames, may not hold. This motivates a novel weakly supervised learning strategy that enhances feature magnitudes using textual concepts. For better frame-level perception of accident risk in videos, dynamic temporal attention is refined using the proposed dilated temporal conv-attention (DTCA) block. An in-depth component-level analysis showcases the model's efficacy while elucidating its operational mechanisms. Evaluation is conducted on three benchmark datasets, considering both earliness- and accuracy-related metrics. Extensive experiments demonstrate that TIME-VAD outperforms existing models: compared to the previous top-performing supervised model's 84.7% accuracy, TIME-VAD achieves 94.44% accuracy (measured by ROC-AUC) on the largest dataset, DoTA. Notably, our model also anticipates accidents earlier than previous methods.