Abstract (English) | Subjective assessments are frequently used as they most accurately reflect user experience. However, it is difficult to recruit and motivate test subjects to participate in subjective assessments. Traditional subjective video quality assessment methods also have high costs, as they are conducted in controlled laboratory conditions. These drawbacks of conducting subjective video quality assessments are the reason why objective measures are most often used in system and algorithm optimization, although there are no universally accepted objective measures, especially for 3D video sequences. Conducting subjective video quality assessment over the internet on a crowdsourcing platform provides access to a large pool of diverse international workers and enables faster and cheaper evaluations. This dissertation presents the results of a new method and a grade comparison study for crowdsourced subjective 3D video quality assessment. The crowdsourced subjective 3D quality assessment was conducted using the newly formed 3DVCL@FER 3D video sequence database, which contains a large number of degradation types, especially ones specific to 3D video systems. Subjective grades for quality, depth and comfort were collected on the crowdsourcing platform. This dissertation compares the results of crowdsourced and laboratory tests on the 3DVCL@FER 3D video database with several objective measures. This dissertation contains six chapters. Chapter 1 gives an overview of subjective and objective video quality assessment methods. Chapter 2 describes the framework for crowdsourced video quality assessment, while Chapter 3 presents the new 3DVCL@FER Video Database. Chapters 4 and 5 present the results from the laboratory and crowdsourced video quality assessments. Finally, Chapter 6 presents the conclusions from this research and future work.
Subjective evaluations of video quality are frequently used in the research and development of video technology as they most accurately reflect the user experience, which is a complex combination of context, colour, texture, motion and other perceptually relevant factors. Test subjects participating in subjective video quality assessments watch a number of video sequences and rate their quality on either an absolute or a relative scale, following one of the protocols defined in specialized recommendations. One such recommendation is ITU-R BT.500-13, which defines a methodology for the subjective assessment of the quality of television pictures, but with a scope of application limited to 2D video content. An extension of this methodology to stereoscopic 3DTV systems has been developed and is available as recommendation ITU-R BT.2021. In subjective 2D video quality assessment the test subjects rate video on a single dimension that quantifies quality, but in subjective 3D video quality assessment other quality indicators specific to 3D video, such as depth quality and visual comfort, have to be rated as well. This means that for each 3D video sequence, test subjects have to indicate three different grades, as opposed to only one in the 2D case, making the evaluation procedure longer and more prone to inter- and intra-observer variability. To improve the reliability of the quality grades collected in subjective 3D video quality assessments, the test subjects need to pass a set of stereoscopic vision screenings, alongside colour vision and visual acuity tests. These are the main differences between subjective assessment of 2D and 3D video sequences. However, traditional subjective video quality assessment methods based on either ITU-R BT.500-13 or ITU-R BT.2021 have high costs and complex logistic requirements, as they have to be conducted in controlled laboratory conditions and involve numerous human graders.
In Chapter 2 the framework for crowdsourced video quality assessment is described. Recent developments in crowdsourced image and video quality assessment and the availability of crowdsourcing platforms such as Microworkers and Amazon Mechanical Turk have provided an alternative to laboratory-based quality evaluations. Using crowdsourced evaluators, it is now possible to have 3D video quality assessments done by many observers at multiple locations, extending the evaluator recruitment domain and thus solving one of the problems of this type of study: the assembly of a medium to large set of graders. However, the diversity of observers and their locations, and other specificities of this type of grade collection platform, introduce several new technical and conceptual challenges that need to be solved before crowdsourced 3D video quality assessment campaigns can be effective and their results trustworthy. Chapter 3 presents the new 3DVCL@FER Video Database. To form the new 3D video database, we selected 8 original 3D stereo video sequences (src01-src08): Car and barrier gate, Basketball training, Boxers, Hall, Laboratory, News report, Phone call, and Soccer, and derived from them sequences with specific degradations. All 8 original sequences are in full HD stereo format, with a 25 fps frame rate, and are 16 seconds long. The dynamic characteristics of the reference sequences, as measured by spatial and temporal activity indices, were computed on the left and right views according to the procedure defined in ITU-T recommendation P.910. The computed indices show that the sequences are very diverse in terms of their dynamic characteristics. These eight original sequences were complemented with sequences showing the effect of specific types of degradation: 22 degradation types were applied to each of the eight original sequences, so that besides the eight originals, our 3D video database contains 176 degraded sequences.
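The spatial (SI) and temporal (TI) activity indices mentioned above can be computed directly from the luminance frames. The following Python sketch (function names are illustrative, not taken from the dissertation; frames are assumed to be NumPy arrays) follows the definitions in ITU-T P.910: SI is the maximum over time of the spatial standard deviation of the Sobel-filtered frame, and TI is the maximum over time of the standard deviation of successive frame differences.

```python
import numpy as np
from scipy import ndimage

def spatial_information(frame):
    """SI of one luminance frame: std-dev of the Sobel-filtered image (ITU-T P.910)."""
    f = frame.astype(np.float64)
    sob_h = ndimage.sobel(f, axis=0)
    sob_v = ndimage.sobel(f, axis=1)
    return float(np.std(np.hypot(sob_h, sob_v)))

def temporal_information(prev_frame, frame):
    """Per-frame TI contribution: std-dev of the frame difference (ITU-T P.910)."""
    return float(np.std(frame.astype(np.float64) - prev_frame.astype(np.float64)))

def si_ti(frames):
    """P.910 indices for a sequence: maximum over time of per-frame SI and TI."""
    si = max(spatial_information(f) for f in frames)
    ti = max(temporal_information(a, b) for a, b in zip(frames, frames[1:]))
    return si, ti
```

In the dissertation these indices are computed separately on the left and right views of each stereo sequence; the sketch above handles a single view.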
We called our new 3D video database "3DVCL@FER". Depending on their type, the degradations were generated either using ffmpeg-x64 (version 22.11.2014) or the H.264/AVC encoder from the JVT JM18.6 reference software package. Degradation types 21 and 22, based on 3D-HEVC encoding, differ from degradation types 11 and 12, based on the H.264/AVC reference encoder, because 3D-HEVC encodes the left view as a base view and the right view as a dependent view, using HEVC core encoding tools for both intra-view and inter-view prediction, whereas H.264/AVC as used in this work encodes the left and right views independently. The grade information lists 146 grades collected in a laboratory of the Faculty of Electrical Engineering and Computing in Croatia. The data provided in the database comprises the sequences in uncompressed format (.avi, separate left+right) and in a near-lossless compressed format (x264+.mp4 and vp8+.webm, combined left+right). The entire 3DVCL@FER database, including all the sequences and grades described above, is publicly available. Chapter 4 presents the results from the laboratory video quality assessments. Firstly, a traditional laboratory 2D video quality assessment was conducted in accordance with the ITU-R BT.500-13 recommendation. The aim of this research was to compare the subjective quality of H.265 versus H.264 video coding for high-definition video systems and to assess the performance of both standards. For the purposes of the research, a database consisting of 4 original HD video sequences was prepared, with 30 degraded HD video sequences each, using various compression steps in both H.265/HEVC and H.264/AVC. The subjective assessment was conducted in one research laboratory in Croatia. This assessment will help in the future decision on the coding standard that is going to be used in DVB-T2 networks in Croatia.
Additionally, the status of the preparation activities for the allocation of the second digital dividend band in Croatia is given and further developments are described in this chapter. The results have shown that the x265 encoder achieves similar subjective scores at half the bitrate (or less) of the x264 encoder. We also compared different spatial resolutions using the same encoder. Average DMOS scores were similar for the x264 encoder and nearly so for the x265 encoder. This means that broadcasters, depending on their equipment, can choose the final spatial resolution (1080p, 1080i or 720p) of the broadcast video stream. Further research is needed on the availability, compatibility and performance of equipment supporting H.265 in the complete content production and delivery chain. Different network parameters could also be incorporated in future research, such as the influence of packet losses on final video quality. A web-based application was developed for conducting the laboratory subjective assessments of the 3DVCL@FER database contents. The application is easily customizable and can be used with different web browsers; in the case study reported here it was set up to be used with the Google Chrome and Mozilla Firefox web browsers. It was programmed using the JavaScript and PHP languages and customized to display 3D video on computers equipped with a 3D monitor. The application collects and saves the subjective scores in a results database. Several control mechanisms are implemented in the application to ensure the validity of the scores collected, the most important one being that the application switches automatically to full screen for the whole duration of the assessment.
The assessment of the subjective quality of the 3D videos from the 3DVCL@FER database performed using the system is based on Absolute Category Rating with hidden reference (ACR-HR). In ACR-HR, each original unimpaired signal is included in the experiment but not identified as such; the ratings for the original signals are removed from the scores of the associated processed video sequences during data processing. The grading is done on three different dimensions, each graded on a continuous scale from 0 to 5 with a step of 0.1. The three dimensions represent picture quality, depth quality and visual comfort. For picture quality and depth quality, grade 0 represents bad, while 5 represents excellent; for visual comfort, grade 0 represents extremely uncomfortable, while 5 represents very comfortable. After the conclusion of the grading sessions, the scores collected in the Croatian laboratory were converted to DMOS and compared with 7 objective measures. As it is not yet well known how to interpret and process raw scores for depth and comfort to calculate DMOS values, those scores were treated as usual quality scores. Overall, we gathered 146 observations, resulting in an average of 146/8 ≈ 18 grades per video sequence, before elimination of outliers. Our findings show that the correlation between subjective grades and objective quality estimation methods for 3D video is still inadequate, especially when comparing widely different degradation types. New objective methods are being developed which will hopefully be better adapted to 3D video quality assessment. The research on such better 3D video quality estimation drives a need for new 3D video sequence databases complemented with subjective assessment grades, which can then be compared to grades obtained from current and new objective methods.
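The ACR-HR conversion from raw scores to DMOS described above can be sketched as follows. This is a minimal illustration assuming per-observer differential scoring as defined in ITU-T P.910 (the differential score is V(PVS) − V(REF) + 5, so an unimpaired sequence maps to the top of the 0-5 scale); the data layout and function name are hypothetical, not taken from the dissertation's implementation.

```python
from collections import defaultdict

def acr_hr_dmos(scores, reference_of):
    """Convert raw ACR-HR scores to DMOS.

    scores: dict {(observer, sequence): raw score on the 0-5 scale}
    reference_of: dict {degraded sequence: its hidden-reference sequence}
    For each observer, the differential score of a degraded sequence is
    V(PVS) - V(REF) + 5; DMOS is the mean of differential scores over observers.
    """
    diffs = defaultdict(list)
    for (obs, seq), v in scores.items():
        ref = reference_of.get(seq)
        if ref is None or (obs, ref) not in scores:
            continue  # hidden reference itself, or missing reference grade
        diffs[seq].append(v - scores[(obs, ref)] + 5.0)
    return {seq: sum(d) / len(d) for seq, d in diffs.items()}
```

As noted above, the raw depth and comfort scores were processed with the same procedure as the quality scores, so the same conversion applies to all three dimensions.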
Future research could also be directed towards comparison of MOS and DMOS quality scores collected using pure web-based setups with scores obtained in more controlled laboratory-based grading sessions. Chapter 5 presents the crowdsourced video quality assessments. As a first step, a subjective quality assessment of 2D video sequences based on crowdsourced testing over the internet was conducted in order to test the crowdsourcing platform and to gather experience. Additionally, a description of the crowdsourcing application design and usage is given, and its further development for subjective quality assessment of 3D video sequences based on crowdsourced testing over the internet is outlined. In this research, we have described a crowdsourced subjective video quality method which evaluates various degradation types. In order to test this method, a web crowdsourcing application was developed. The results from testing this method were compared to a conventional subjective video quality assessment. To achieve this comparison we used an existing video database (the LIVE video quality database) and obtained a maximal Pearson's correlation of 0.8923. It is possible that with a higher number of observers the correlation would be even higher. In this chapter a new method for crowdsourced subjective 3D video quality assessment, Crowd3D, is proposed. All technical aspects of the proposed method for video quality assessment are described in detail, including pointers to a web-based implementation of the crowdsourced grade collection platform. Results obtained using the new method, complemented with a comparative study of the crowdsourced and laboratory subjective 3D video quality assessments, are presented.
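The Pearson and Spearman correlations used throughout these comparisons measure, respectively, linear agreement and rank (monotonic) agreement between two sets of scores. A minimal sketch, assuming the two score vectors are already aligned per sequence (the function name is illustrative):

```python
import numpy as np
from scipy import stats

def agreement(lab_dmos, crowd_dmos):
    """Pearson (linear) and Spearman (rank) correlation between two aligned
    DMOS vectors, e.g. laboratory vs. crowdsourced scores per sequence."""
    lab = np.asarray(lab_dmos, dtype=float)
    crowd = np.asarray(crowd_dmos, dtype=float)
    pearson, _ = stats.pearsonr(lab, crowd)
    spearman, _ = stats.spearmanr(lab, crowd)
    return float(pearson), float(spearman)
```

Spearman's coefficient is insensitive to monotonic rescaling of one grading scale against the other, which is why both coefficients are reported when comparing grading setups.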
The crowdsourced subjective 3D quality assessment was conducted using as test data the contents of the publicly available 3DVCL@FER 3D video database, annotated with grades obtained through a traditional laboratory-based subjective 3D video quality assessment. The grades from the crowdsourced subjective assessment were collected using a web-based platform developed specifically to support 3D-capable video monitors and to collect grades for quality, depth and comfort, which were then compared to the laboratory results for the same 3DVCL@FER 3D video database. Besides the present study, this 3D video quality assessment platform can be used to advantage in further research activities, as it reduces the time and cost compared to traditional laboratory-based quality assessments. From the presented results it can be concluded that by using the proposed framework of the Crowd3D method it is possible to obtain DMOS quality scores similar to those from laboratory experiments, provided all of the additional reliability mechanisms (ARMs) implemented and described are used. Pearson's and Spearman's correlations between the crowdsourced and laboratory tests are about 0.95. However, the correlation is somewhat lower for DMOS comfort and depth scores: e.g. Pearson's correlation is about 0.9 for DMOS comfort and about 0.86 for DMOS depth. The lower correlations for comfort and especially depth scores can be due to several factors which are very difficult to control in crowdsourced tests: different illumination, different 3D monitor types and different monitor settings. In addition, depth and comfort scores, as added grades in 3D subjective experiments, may require the use of different subjective assessment approaches (in our work we have used ACR-HR). Possibly, observers are more uncertain when giving depth and comfort grades than quality grades. Finally, Chapter 6 presents the conclusions from this research and future work. In this thesis, the following scientific contributions are achieved: 1.
A new method for crowdsourced subjective 3D video quality assessment, Crowd3D. 2. The 3DVCL@FER Video Database, accompanied by results from subjective video quality assessments (both laboratory and Crowd3D results). 3. Verification of the results obtained by the Crowd3D method against traditional laboratory subjective video quality assessment methods. Further research may be needed to fully understand the new quality dimensions associated with 3D video and their respective scores (depth, comfort). This could be achieved by using similar equipment in different conditions in both laboratory and crowdsourced environments, using more observers and possibly changing the methodology used in 3D video subjective tests. As an additional contribution to this research area, the video sequences used in this work and the related DMOS scores for quality, depth and comfort (calculated using overall results) are publicly available. The whole dataset can be found in an on-line repository, which includes the compressed video sequences together with the collected quality grade information.