Near-duplicate video similarity detection in H.264/AVC compressed domain

Rouhi, A 2018, Near-duplicate video similarity detection in H.264/AVC compressed domain, Doctor of Philosophy (PhD), Science, RMIT University.

Document type: Thesis
Collection: Theses

Attached Files
Name Description MIMEType Size
Rouhi.pdf Thesis application/pdf 3.46MB
Title Near-duplicate video similarity detection in H.264/AVC compressed domain
Author(s) Rouhi, A
Year 2018
Abstract Efficient content-based multimedia searching to retrieve near-duplicate video segments is required in large multimedia archives and data banks. A major challenge of content-based video retrieval is its computational cost. Efficiency was identified as a major issue in the overview of the final TRECVID content-based copy detection (CCD) task in 2011, as the top 10 most effective systems were slower than in the previous year. The contribution of this research is therefore to improve the efficiency of video similarity detection with minimal impact on effectiveness.

Compressed domain processing is a promising approach that saves considerable computational resources by avoiding full decompression of the video signal, with the advantages of faster processing of less data, lower storage requirements, and reduced bandwidth utilisation. However, compressed domain algorithms generally suffer from a lack of spatial information, which is the most important factor in image/video processing. In most cases, compression shifts information: the spatial relationships among pixels are converted to frequency information, and that is what is stored in the compressed domain. Some compression standards, however, such as H.264 (MPEG-4 AVC), keep spatial information in the compressed file. To achieve the benefits of the compressed domain, this research addresses four major challenges: which pixel domain method is the appropriate baseline, what spatial information can be extracted from the compressed domain, what structure is suitable for those features to construct a descriptor, and how sensitive the descriptor is to spatio-temporal parameters, visual transformations, and compression settings.

To answer these questions, the first step is extracting global visual features from the pixel domain of the video stream. This step makes it possible to compare the performance of the proposed global features extracted from the compressed domain in the following steps. The global pixel domain descriptors selected as the baseline in this research were introduced by the CRIM (Centre de Recherche Informatique de Montréal) team, which achieved the best results in TRECVID/CCD on content-preserving visual transformations. In addition, this thesis introduces an alternative pixel domain approach utilising luminance, which improves on the efficiency of the baseline.

The second step is selecting the format of the compressed video and the appropriate feature from that compressed domain. H.264/AVC is used as the compressed video standard in this research. This video codec utilises three types of frames: I-, P-, and B-frames. I-frames, also known as key pictures (frames), keep the spatial information of the whole image content. The compression technique used in I-frames is known as intra-prediction. Unlike other elements stored in the compressed domain, such as DCT (Discrete Cosine Transform) coefficients, which store only frequency information, intra-predictions store spatial information in the compressed domain. This information is stored directly in the I-frames of an H.264 video file in the form of 8 directions (modes). The availability of this spatial information in the compressed domain is the main motivation for proposing three efficient yet effective Intra Prediction Mode-based (IPM-based) visual descriptors.

The third step is the spatio-temporal structure of the proposed feature descriptors. Novel combinations of the intra-prediction modes, combined over a group of frames, are introduced in this thesis in the form of three IPM-based visual descriptors (named IPMH, e-IPMH, and IPMC in this research), and the thesis investigates which descriptors are most effective for video copy detection tasks. We compare the efficiency and effectiveness of intra-prediction modes with two other widely used global spatial features (intensity and colour histogram- and auto-correlogram-based features) in a single region as well as in multiple regions (ordinal approach). The final step is a sensitivity test of the proposed descriptors to the spatio-temporal parameters, visual transformations, and compression settings. Compressed domain-based features are generally sensitive to pixel distortions and encoder settings. Multiple sensitivity tests are conducted to investigate and measure the impact of visual transformations as well as compression parameters on the proposed visual descriptors. For sensitivity to compression settings, these tests show that the impact of encoding profiles, the most common compression variable, is not significant; however, using different types of encoders can affect the performance of the proposed feature descriptors. The tests also show the sensitivity of the proposed features to spatio-temporal parameters, such as frame sampling rate and frame partitioning, and to visual transformations, including four major content-preserving transformations. Content-altering transformations are outside the scope of what global descriptors can handle, and consequently outside the scope of this research.
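
As a rough illustration of the idea of combining intra-prediction modes over a group of frames, in one region or in several, the following Python sketch builds a normalised mode histogram per region and compares two descriptors with an L1 distance. It is not the thesis's definition of IPMH, e-IPMH, or IPMC, which this record does not give. The names `mode_histogram` and `l1_distance`, the grid partitioning, the choice of 9 mode indices (8 directional modes plus DC, as in H.264 Intra 4x4 prediction), and the input layout (one 2-D grid of per-block mode indices per I-frame, assumed to be already parsed from the bitstream) are all assumptions made for illustration.

```python
from typing import List, Sequence

# Assumption: 9 mode indices, i.e. 8 directional modes plus DC (H.264 Intra 4x4).
NUM_MODES = 9

# Hypothetical input type: one 2-D grid of per-block mode indices per I-frame.
Frame = List[List[int]]


def mode_histogram(frames: Sequence[Frame], grid: int = 1) -> List[float]:
    """Accumulate mode counts over a group of frames, per region of a
    grid x grid partition, and return the concatenated normalised histograms."""
    counts = [[0] * NUM_MODES for _ in range(grid * grid)]
    for frame in frames:
        rows, cols = len(frame), len(frame[0])
        for r, row in enumerate(frame):
            for c, mode in enumerate(row):
                # Map the block position to its region index in the partition.
                region = (r * grid // rows) * grid + (c * grid // cols)
                counts[region][mode] += 1
    descriptor: List[float] = []
    for region_counts in counts:
        total = sum(region_counts) or 1  # guard against empty regions
        descriptor.extend(n / total for n in region_counts)
    return descriptor


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of absolute differences between two equal-length descriptors."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Toy group of two tiny "frames" (2x2 blocks each); values are mode indices.
group_a = [[[0, 0], [1, 2]], [[0, 1], [1, 2]]]
group_b = [[[0, 0], [1, 2]], [[0, 0], [1, 2]]]
single_region = l1_distance(mode_histogram(group_a), mode_histogram(group_b))
multi_region = l1_distance(mode_histogram(group_a, grid=2),
                           mode_histogram(group_b, grid=2))
```

With `grid=1` the descriptor is a single global histogram; a larger `grid` keeps coarse layout information at the cost of a longer descriptor, which is the kind of trade-off the frame-partitioning sensitivity tests would probe.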

The contribution of this thesis is improving the efficiency of video copy detection with minimal impact on effectiveness. As the testing scope of the global features is content-preserving (non-geometric) visual distortions, the effectiveness of the global features of both domains was investigated on content-preserving video distortions. The experimental results show that the proposed descriptors are as effective as the baseline on content-preserving visual transformations, with a much shorter search time. Efficiency analysis shows the proposed descriptors are 3.3 times faster than the baseline, even though they run on a much slower programming platform (CPU-based, versus the GPU-based programming used for the baseline). On the same hardware and software platforms, the IPM-based descriptors are far more efficient still: 153 to 320 times faster than the baseline, depending on the descriptor. Effectiveness analysis also shows that the baseline method is more effective in minNDCR but less effective in F1 measures than the most effective IPM-based descriptor (IPMC). Consequently, this research suggests that the proposed descriptors can be utilised effectively as robust feature descriptors in video similarity detection applications where efficiency plays an important role.

Degree Doctor of Philosophy (PhD)
Institution RMIT University
School, Department or Centre Science
Subjects Information Retrieval and Web Search
Image Processing
Signal Processing
Keyword(s) Content-based video retrieval
Compressed domain
Intra prediction modes
Copy detection
Created: Fri, 30 Nov 2018, 10:28:46 EST by Keely Chapman