To address the problems of poor multi-modal fusion and the failure to fully mine key emotional information from specific time periods and multiple views, this paper proposes a multi-view, time-series multi-modal sentiment classification model that extracts key information from multiple views within a specific time period. First, the model performs low-dimensional word embedding and sequence encoding on text from two views, the title and the body, and extracts multi-modal time-series features from these views; it likewise extracts image features from two views, cropping and horizontal mirroring. Second, it uses a recurrent neural network to construct time-series interaction features across the multi-modal data, increasing their mutual information. Finally, the model is jointly trained with a contrastive learning objective to perform sentiment classification. The model was evaluated on two multi-modal sentiment classification benchmark datasets, Yelp and Multi-ZOL, achieving accuracies of 73.92% and 69.15%, respectively. Comprehensive experiments show that multi-view, multi-modal sentence sequences from specific time periods can improve model performance. [ABSTRACT FROM AUTHOR]
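The abstract does not specify the form of the joint contrastive training objective. As a minimal illustrative sketch (not the authors' implementation), one common choice is to add an InfoNCE-style contrastive term over paired text and image embeddings to a standard cross-entropy classification loss; the function names, the weighting parameter `alpha`, and the temperature below are all assumptions:

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE-style contrastive loss; row i of `anchors` is assumed
    to be paired with row i of `positives` (e.g. text vs. image views).
    This is an illustrative stand-in, not the paper's exact loss."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                    # (N, N) similarities
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                # matched pairs on the diagonal

def joint_loss(text_emb, image_emb, class_logits, labels, alpha=0.5):
    """Hypothetical joint objective: cross-entropy classification loss
    plus an alpha-weighted contrastive term between the two modalities."""
    z = class_logits - class_logits.max(axis=1, keepdims=True)
    log_softmax = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    ce = -np.mean(log_softmax[np.arange(len(labels)), labels])
    return ce + alpha * info_nce(text_emb, image_emb)
```

In such a setup, the contrastive term pulls embeddings of the same sample's text and image views together while pushing apart mismatched pairs, which is one way to "increase mutual information" between modalities as the abstract describes.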