In this post, we explain the basics behind our method for calculating video similarity based on audio information. This work was carried out in the context of Near-Duplicate Detection, a key component of the WeVerify project. Based on our results, we have published a research paper titled “Audio-based Near-Duplicate Video Retrieval with Audio Similarity Learning,” which has been accepted for publication at this year’s International Conference on Pattern Recognition (ICPR 2020).
To compare two videos, we extract feature descriptors from their audio signals and then compute the similarity between those descriptors, which serves as the final similarity score of the video pair. In this work, we have designed two processes that implement these steps: (i) a feature extraction scheme based on transfer learning from a pre-trained Convolutional Neural Network (CNN), and (ii) a similarity calculation process based on video similarity learning.
Let’s begin with feature extraction. We employ a pre-trained CNN designed for transfer learning. The network is trained on AudioSet, a large-scale dataset consisting of approximately 2.1 million weakly-labeled YouTube videos annotated with 527 audio event classes.
To extract features, we first generate a Mel-filtered spectrogram from the audio of each video. The generated spectrograms are divided into overlapping time frames, which are then fed to the feature extraction CNN. To obtain a compact audio representation for each spectrogram frame, we apply Maximum Activation of Convolutions (MAC) on the activations of the intermediate convolutional layers. To improve the discriminative capabilities of the audio descriptors, we then apply PCA whitening to decorrelate the extracted feature vectors and an attention-based scheme to weight them.
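The framing and MAC pooling steps can be sketched roughly as follows. This is a minimal illustration, not the AuSiL implementation: the frame length, hop size, array shapes, and the helper names `split_into_frames`, `mac_pool`, and `l2_normalize` are assumptions, random arrays stand in for the real spectrogram and CNN activations, and only a single layer is pooled. PCA whitening and attention weighting are left out.

```python
import numpy as np

def split_into_frames(spectrogram, frame_len, hop):
    """Cut a (time, mel_bins) spectrogram into overlapping frames along time."""
    starts = range(0, spectrogram.shape[0] - frame_len + 1, hop)
    return np.stack([spectrogram[s:s + frame_len] for s in starts])

def mac_pool(activations):
    """MAC: keep the maximum of each channel of a (channels, height, width) map."""
    return activations.max(axis=(1, 2))

def l2_normalize(x, eps=1e-12):
    """Scale each row vector to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(0)

# Stand-in for a Mel spectrogram: 500 time steps, 64 Mel bins (assumed sizes).
spectrogram = rng.random((500, 64))

# Frames of 96 time steps with 50% overlap (assumed values).
frames = split_into_frames(spectrogram, frame_len=96, hop=48)

# Stand-in for the activations of one intermediate conv layer, one map per frame.
layer_activations = rng.random((len(frames), 128, 6, 4))

# One compact descriptor per frame.
descriptors = l2_normalize(np.stack([mac_pool(a) for a in layer_activations]))
```

Each video then ends up as a sequence of frame-level vectors, here an array of shape `(number_of_frames, channels)`.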
To measure the similarity between the two compared videos, we employ a video similarity learning scheme that yields a robust and accurate similarity calculation. More precisely, having extracted the audio representations of the two videos, we calculate the similarity between all descriptor pairs by applying the dot product to their feature vectors. In this way, we generate a pairwise similarity matrix that contains the similarities between all vectors of the two videos.
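As a small illustration of this step, the pairwise similarity matrix is a single matrix product when each video is stored as an array with one descriptor per row. The toy descriptors and the function name `pairwise_similarity` below are made up for the example.

```python
import numpy as np

def pairwise_similarity(desc_a, desc_b):
    """Dot product between every descriptor pair: (n, d) and (m, d) -> (n, m)."""
    return desc_a @ desc_b.T

# Toy unit-length descriptors: 3 frames for video A, 2 frames for video B.
video_a = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
video_b = np.array([[1.0, 0.0], [0.8, 0.6]])

# Entry (i, j) is the similarity between frame i of A and frame j of B.
sim_matrix = pairwise_similarity(video_a, video_b)
```

With unit-length descriptors, each entry is the cosine similarity of the corresponding frame pair.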
Then, to calculate the similarity between the two videos, we feed the pairwise similarity matrix to a CNN, which we call AuSiL. The network captures the temporal similarity structures present in the similarity matrix and is capable of learning robust patterns of within-video similarities. To calculate the final video similarity, we apply the hard tanh activation function to the network output and then apply Chamfer Similarity to derive a single value, which is taken as the final similarity between the two videos.
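The final two operations can be sketched as below. The sketch assumes the usual definitions: hard tanh clips values to the range [-1, 1], and Chamfer Similarity takes the maximum of each row and averages those maxima. The `network_output` matrix is a made-up stand-in for what the CNN would produce, not real AuSiL output.

```python
import numpy as np

def hard_tanh(x):
    """Clip every value to the range [-1, 1]."""
    return np.clip(x, -1.0, 1.0)

def chamfer_similarity(sim_matrix):
    """Average, over rows, of the best-matching (maximum) column value."""
    return float(sim_matrix.max(axis=1).mean())

# Hypothetical network output for a pair of videos (rows: A, columns: B).
network_output = np.array([[1.4, 0.2], [-0.3, 0.5], [0.1, -2.0]])

# Single similarity score for the video pair.
video_similarity = chamfer_similarity(hard_tanh(network_output))
```

Because the values are clipped first, the resulting score also stays within [-1, 1].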
For the evaluation of the proposed approach, we employ two datasets compiled for fine-grained incident and near-duplicate video retrieval, namely FIVR-200K and SVD. We have manually annotated the videos in the datasets according to whether their audio duplicates that of the query videos. We also evaluate the robustness of our approach to audio speed transformations by artificially generating audio duplicates.
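To give a feel for what a speed transformation does to a signal, here is one simple way to produce a sped-up or slowed-down copy by resampling with linear interpolation. This is only an illustrative assumption and not necessarily how the duplicates in our experiments were generated. The function name `change_speed`, the speed factors, and the synthetic 440 Hz tone are all made up for the example, and this naive approach also shifts the pitch.

```python
import numpy as np

def change_speed(signal, factor):
    """Resample a 1-D signal so it plays `factor` times faster (pitch shifts too)."""
    n_out = int(round(len(signal) / factor))
    # Positions in the original signal at which the output is sampled.
    new_positions = np.linspace(0, len(signal) - 1, n_out)
    return np.interp(new_positions, np.arange(len(signal)), signal)

sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
audio = np.sin(2 * np.pi * 440 * t)  # one second of a synthetic tone

faster = change_speed(audio, 1.25)  # shorter signal, plays 25% faster
slower = change_speed(audio, 0.8)   # longer signal, plays 20% slower
```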
In the following table, we compare the retrieval performance of AuSiL against Dejavu, a publicly available Shazam-like system. Performance is measured in terms of mean Average Precision (mAP) on the two annotated datasets under two settings: the original videos and the artificially generated videos with speed transformations. AuSiL outperforms Dejavu by a considerable margin on three out of four runs, while Dejavu achieves marginally better results on the original version of FIVR-200K. The results indicate that our approach is robust to speed transformations, unlike the competing method.
mAP comparison of the proposed approach against Dejavu, a publicly available Shazam-like system. Superscript T indicates the runs with audio speed transformations.
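For readers unfamiliar with the metric, mAP can be sketched as follows. For each query, the database videos are ranked by similarity, precision is computed at every position where a relevant video appears, and those precisions are averaged; mAP is the mean over all queries. The relevance lists below are toy data, not results from our experiments, and the sketch assumes that every relevant video appears somewhere in the ranked list.

```python
import numpy as np

def average_precision(ranked_relevance):
    """AP for one query, given 0/1 relevance labels in ranked order."""
    rel = np.asarray(ranked_relevance, dtype=float)
    if rel.sum() == 0:
        return 0.0
    # Precision after each position of the ranking.
    precision_at_k = np.cumsum(rel) / np.arange(1, len(rel) + 1)
    # Average the precision values at the positions of the relevant items.
    return float((precision_at_k * rel).sum() / rel.sum())

def mean_average_precision(per_query_relevance):
    """Mean of the per-query AP values."""
    return float(np.mean([average_precision(r) for r in per_query_relevance]))

# Two toy queries: 1 marks a relevant (audio-duplicate) video in the ranking.
queries = [[1, 0, 1, 0], [0, 1, 0, 0]]
map_score = mean_average_precision(queries)
```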
For more details on the architecture and training of the model, as well as comprehensive experimental results, feel free to have a look at the AuSiL paper. The implementation of AuSiL is publicly available.
Author: Giorgos Kordopatis Zilos (CERTH).
Editor: Olga Papadopoulou (CERTH).
Image credits: respective persons named. Usage rights have been obtained by the authors named above for publication in this article. Copyright / IPR remains with the respective originators.
Note: This post is an adaptation of the Video similarity based on audio blog post, which was originally prepared for the CERTH Media Verification team (MeVer) website.