387 results for '"STREAMING media"'
Search Results
2. Perceptual Quality Improvement in Videoconferencing Using Keyframes-Based GAN.
- Author
- Agnolucci, Lorenzo, Galteri, Leonardo, Bertini, Marco, and Bimbo, Alberto Del
- Published
- 2024
3. Towards Real-Time Video Caching at Edge Servers: A Cost-Aware Deep Q-Learning Solution.
- Author
- Cui, Laizhong, Ni, Erchao, Zhou, Yipeng, Wang, Zhi, Zhang, Lei, Liu, Jiangchuan, and Xu, Yuedong
- Abstract
Given the rapid growth of user-generated videos, internet traffic has been heavily dominated by online video streaming. Caching videos on edge servers in close proximity to users has been an effective approach to reduce the backbone traffic and the request response time, as well as to improve the video quality on the user side. Video popularity, however, can be highly dynamic over time. The cost of cache replacement at edge servers, particularly that related to service interruption during replacement, is not yet well understood. This paper presents a novel lightweight video caching algorithm for edge servers, seeking to optimize the hit rate with real-time decisions and minimized cost. Inspired by recent advances in deep Q-learning, our DQN-based online video caching (DQN-OVC) makes effective use of the rich and readily available information from users and networks. We decompose the Q-value function as a product of the video value function and the action function, which significantly reduces the state space. We instantiate the action function for cost-aware caching decisions with low complexity so that the cached videos can be updated continuously and instantly with dynamic video popularity. We used video traces from Tencent, one of the largest online video providers in China, to evaluate the performance of our DQN-OVC and to compare it with state-of-the-art solutions. The results demonstrate that DQN-OVC significantly outperforms the baseline algorithms in the edge caching context. [ABSTRACT FROM AUTHOR]
- Published
- 2023
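To make the decomposition concrete, here is a minimal Python sketch of the idea the abstract describes: the Q-value is factored into a per-video value function and a low-complexity, cost-aware action rule. The feature names, the linear value model, and the eviction rule are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def video_value(features, w):
    """Score a video's long-term caching value from observable
    features (e.g., recent request count, age, size)."""
    return float(np.dot(w, features))

def caching_decision(cached, candidate, feats, w, replace_cost):
    """Admit `candidate` by evicting the least valuable cached video,
    but only when the value gain exceeds the replacement cost."""
    values = {v: video_value(feats[v], w) for v in cached}
    victim = min(values, key=values.get)
    gain = video_value(feats[candidate], w) - values[victim]
    if gain > replace_cost:  # cost-aware action function
        cached.remove(victim)
        cached.add(candidate)
    return cached
```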
4. Predictive Adaptive Streaming to Enable Mobile 360-Degree and VR Experiences.
- Author
- Hou, Xueshi, Dey, Sujit, Zhang, Jianzhong, and Budagavi, Madhukar
- Abstract
As 360-degree videos and virtual reality (VR) applications become popular for consumer and enterprise use cases, the desire to enable truly mobile experiences also increases. Delivering 360-degree videos and cloud/edge-based VR applications requires ultra-high bandwidth and ultra-low latency, which are challenging to achieve with mobile networks. A common approach to reduce bandwidth is streaming only the field of view (FOV). However, extracting and transmitting the FOV in response to user head motion can add high latency, adversely affecting user experience. In this paper, we propose a predictive adaptive streaming approach, where the predicted view with high predictive probability is adaptively encoded in relatively high quality according to bandwidth conditions and transmitted in advance, leading to a simultaneous reduction in bandwidth and latency. The predictive adaptive streaming method is based on a deep-learning-based viewpoint prediction model we develop, which uses past head motions to predict where a user will be looking in the 360-degree view. Using a very large dataset consisting of head motion traces from over 36,000 viewers for nineteen 360-degree/VR videos, we validate the ability of our predictive adaptive streaming method to offer high-quality views while simultaneously significantly reducing bandwidth. [ABSTRACT FROM AUTHOR]
- Published
- 2021
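A minimal sketch of the kind of deep-learning viewpoint predictor the abstract describes, assuming PyTorch, a yaw/pitch input encoding, and a coarse 12x6 tile grid (all assumptions; the paper's architecture may differ):

```python
import torch
import torch.nn as nn

class ViewpointPredictor(nn.Module):
    """Past head-motion samples in, probability over view tiles out."""
    def __init__(self, hidden=128, tiles=12 * 6):
        super().__init__()
        self.lstm = nn.LSTM(input_size=2, hidden_size=hidden,
                            batch_first=True)
        self.head = nn.Linear(hidden, tiles)

    def forward(self, head_motion):  # (batch, time, 2) yaw/pitch angles
        _, (h, _) = self.lstm(head_motion)
        return torch.softmax(self.head(h[-1]), dim=-1)
```

Tiles with high predicted probability would then be encoded at higher quality and sent ahead of time, per the scheme the abstract outlines.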
5. TCLiVi: Transmission Control in Live Video Streaming Based on Deep Reinforcement Learning.
- Author
- Cui, Laizhong, Su, Dongyuan, Yang, Shu, Wang, Zhi, and Ming, Zhong
- Abstract
Currently, video content accounts for the majority of network traffic. With increased live streaming, rigorous requirements have been introduced for better Quality of Experience (QoE). It is challenging to achieve satisfactory QoE in live streaming, where the aim is to balance 1) enhancing the video quality and stability and 2) reducing the rebuffering time and end-to-end delay, under different scenarios with various network conditions and user preferences, where fluctuations in network throughput severely degrade the QoE. In this paper, we propose an approach to improve the QoE for live video streaming based on Deep Reinforcement Learning (DRL). The new approach jointly adjusts the streaming parameters, including the video bitrate and target buffer size. With the basic DRL framework, TCLiVi can automatically generate the inference model based on the playback information, to achieve the joint optimization of the video quality, stability, rebuffering time and latency parameters. We evaluate our framework on real-world data in different live streaming broadcast scenarios, such as a talent show and a sports competition under different network conditions. We compare TCLiVi with other algorithms, such as the Double DQN, MPC and Buffer-based algorithms. The simulation results show that TCLiVi significantly improves the video quality and decreases the rebuffering time, consequently increasing the QoE score by 40.84% on average. We also show that TCLiVi is self-adaptive in different scenarios. [ABSTRACT FROM AUTHOR]
- Published
- 2021
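The joint optimization target can be illustrated with a hypothetical per-step QoE reward combining the four terms the abstract says TCLiVi optimizes jointly; the weights below are placeholders, not the paper's tuned values.

```python
def qoe_reward(bitrate, prev_bitrate, rebuffer_s, latency_s,
               w_q=1.0, w_s=1.0, w_r=4.0, w_l=0.5):
    quality = w_q * bitrate                          # reward high quality
    stability = -w_s * abs(bitrate - prev_bitrate)   # penalize switches
    rebuffering = -w_r * rebuffer_s                  # penalize stalls
    delay = -w_l * latency_s                         # penalize latency
    return quality + stability + rebuffering + delay
```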
6. Learned Resolution Scaling Powered Gaming-as-a-Service at Scale.
- Author
- Chen, Hao, Lu, Ming, Ma, Zhan, Zhang, Xu, Xu, Yiling, Shen, Qiu, and Zhang, Wenjun
- Abstract
Built on the explosive advancement of cloud and telecommunication technologies, Gaming-as-a-Service (GaaS), or cloud gaming, is expected to revolutionize the traditional multi-billion-dollar video game market in the near future. This wave is analogous to the rise of streaming-based Netflix, which replaced the conventional DVD rental business for movies and TV. In practice, a successful GaaS platform needs to operate in a transparent mode without requiring substantial effort from either content providers or end users, and offer pristine quality of experience (QoE) at an affordable cost. Our analysis suggests that GaaS provisioning cost can be reduced significantly by rendering and streaming the game video at a lower resolution (so as to increase user concurrency in the cloud and reduce the streaming bandwidth over the network). However, streaming video at a lower resolution may deteriorate the QoE. To maintain the client QoE at the level of default native-resolution streaming, or even enhance it, we introduce learned resolution scaling (LRS), which leverages the computational capabilities at clients/edges to restore and improve the reconstructed image/video quality via stacked deep neural networks (DNN). We integrate LRS into a commercialized GaaS platform, AnyGame, to study its efficiency and complexity quantitatively. Extensive real-life experiments have shown that LRS-powered AnyGame offers state-of-the-art performance at lower operational cost, paving the way for a potential success of GaaS over the Internet. Additionally, we examine the proposed LRS via ablation studies to further demonstrate its consistent performance, including discussions of the trade-off between efficiency and complexity, alternative training sets, etc. [ABSTRACT FROM AUTHOR]
- Published
- 2021
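A toy illustration of client-side learned resolution scaling in the spirit of LRS: a small convolutional upscaler that restores detail in the low-resolution stream. The depth, width, and 2x scale factor are assumptions, not the paper's stacked-DNN design.

```python
import torch.nn as nn

class Upscaler(nn.Module):
    """Restore a low-resolution frame to 2x resolution."""
    def __init__(self, channels=32, scale=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, 3 * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale))  # rearrange channels into 2x pixels

    def forward(self, low_res_frame):  # (batch, 3, H, W)
        return self.net(low_res_frame)
```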
7. BR2Net: Defocus Blur Detection Via a Bidirectional Channel Attention Residual Refining Network.
- Author
- Tang, Chang, Liu, Xinwang, An, Shan, and Wang, Pichao
- Abstract
Due to the remarkable potential applications, defocus blur detection, which aims to separate blurry regions from an image, has attracted much attention. Although significant progress has been made by many methods, various challenges still hinder the results, e.g., confusing background areas, sensitivity to scale, and missing boundary details of the defocus blur regions. To solve these issues, in this paper, we propose a deep convolutional neural network (CNN) for defocus blur detection via a Bi-directional Residual Refining network (BR2Net). Specifically, a residual learning and refining module (RLRM) is designed to correct the prediction errors in the intermediate defocus blur map. Then, we develop a bidirectional residual feature refining network with two branches by embedding multiple RLRMs into it for recurrently combining and refining the residual features. One branch of the network refines the residual features from the shallow layers to the deep layers, and the other branch refines the residual features from the deep layers to the shallow layers. In such a manner, both the low-level spatial details and high-level semantic information can be encoded step by step in two directions to suppress background clutter and enhance the detected region details. The outputs of the two branches are fused to generate the final results. In addition, with the observation that different feature channels have different extents of discrimination for detecting blurred regions, we add a channel attention module to each feature extraction layer to select more discriminative features for residual learning. To promote further research on defocus blur detection, we create a new dataset with various challenging images and manually annotate their corresponding pixelwise ground truths. The proposed network is validated on two commonly used defocus blur detection datasets and our newly collected dataset by comparing it with 10 other state-of-the-art methods. Extensive experiments with ablation studies demonstrate that BR2Net consistently and significantly outperforms the competitors in terms of both efficiency and accuracy. [ABSTRACT FROM AUTHOR]
- Published
- 2021
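The channel attention idea the abstract attaches to each feature extraction layer can be sketched in a squeeze-and-excitation style; the reduction ratio and layout below are assumptions, not BR2Net's exact module.

```python
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Reweight feature channels by their global response."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                 # (batch, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))   # squeeze to per-channel weights
        return x * w.unsqueeze(-1).unsqueeze(-1)
```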
8. Beyond Vision: A Multimodal Recurrent Attention Convolutional Neural Network for Unified Image Aesthetic Prediction Tasks.
- Author
- Zhang, Xiaodan, Gao, Xinbo, Lu, Wen, He, Lihuo, and Li, Jie
- Abstract
Over the past few years, image aesthetic prediction has attracted increasing attention because of its wide applications, such as image retrieval, photo album management and aesthetic-driven image enhancement. However, previous studies in this area achieve only limited success because 1) they primarily depend on visual features and ignore textual information, and 2) they tend to focus equally on each part of an image, ignoring the selective attention mechanism. This paper overcomes these limitations by proposing a novel multimodal recurrent attention convolutional neural network (MRACNN). More specifically, the MRACNN consists of two streams: the vision stream and the language stream. The former employs the recurrent attention network to tune out irrelevant information and focus on key regions to extract visual features. The latter utilizes the Text-CNN to capture the high-level semantics of user comments. Finally, a multimodal factorized bilinear (MFB) pooling approach is used to achieve effective fusion of textual and visual features. Extensive experiments demonstrate that the proposed MRACNN significantly outperforms state-of-the-art methods for unified aesthetic prediction tasks: (i) aesthetic quality classification; (ii) aesthetic score regression; and (iii) aesthetic score distribution prediction. [ABSTRACT FROM AUTHOR]
- Published
- 2021
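Multimodal factorized bilinear (MFB) pooling, the fusion step named in the abstract, can be sketched as follows; the dimensions and factor count k are assumptions, and the full method also applies power normalization.

```python
import torch.nn as nn
import torch.nn.functional as F

class MFB(nn.Module):
    """Fuse a visual vector v and a textual vector t bilinearly."""
    def __init__(self, dim_v, dim_t, out_dim=256, k=5):
        super().__init__()
        self.proj_v = nn.Linear(dim_v, out_dim * k)
        self.proj_t = nn.Linear(dim_t, out_dim * k)
        self.out_dim, self.k = out_dim, k

    def forward(self, v, t):
        joint = self.proj_v(v) * self.proj_t(t)  # element-wise product
        joint = joint.view(-1, self.out_dim, self.k).sum(dim=2)  # sum pool
        return F.normalize(joint, dim=-1)        # L2 normalization
```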
9. Mobile Streaming of Live 360-Degree Videos.
- Author
- Eltobgy, Omar, Arafa, Omar, and Hefeeda, Mohamed
- Abstract
Live streaming of immersive multimedia content, e.g., 360-degree videos, is getting popular due to the recent availability of commercial devices that support interacting with such content, such as smartphones/tablets and head-mounted displays. Streaming live content to mobile users using individual connections (i.e., unicast) consumes substantial network resources and does not scale to a large number of users. Multicast, on the other hand, offers a scalable solution but introduces multiple challenges, including handling user interactivity, ensuring smooth quality, conserving the energy of mobile receivers, and achieving fairness among users. We propose a new solution for the problem of live multicast streaming of 360-degree videos to mobile users, which addresses the aforementioned challenges. The proposed solution, referred to as VRCast, is designed for cellular networks that support multicast, such as LTE. We show through trace-driven simulations that VRCast outperforms the closest algorithms in the literature by wide margins across several performance metrics. For example, compared to the state-of-the-art, VRCast improves the viewport quality by up to 2.5 dB. We have implemented VRCast in an LTE testbed to show its practicality. Our experimental results show that VRCast ensures smooth video quality and saves energy for mobile devices. [ABSTRACT FROM AUTHOR]
- Published
- 2020
10. DeepQoE: A Multimodal Learning Framework for Video Quality of Experience (QoE) Prediction.
- Author
- Zhang, Huaizheng, Dong, Linsen, Gao, Guanyu, Hu, Han, Wen, Yonggang, and Guan, Kyle
- Abstract
Recently, many models have been developed to predict video Quality of Experience (QoE), yet the applicability of these models still faces significant challenges. Firstly, many models rely on features that are unique to a specific dataset and thus lack the capability to generalize. Due to the intricate interactions among these features, a unified representation that is independent of datasets with different modalities is needed. Secondly, existing models often lack the configurability to perform both classification and regression tasks. Thirdly, the sample size of the available datasets to develop these models is often very small, and the impact of limited data on the performance of QoE models has not been adequately addressed. To address these issues, in this work we develop a novel end-to-end framework termed DeepQoE. The proposed framework first uses a combination of deep learning techniques, such as word embedding and a 3D convolutional neural network (C3D), to extract generalized features. Next, these features are combined and fed into a neural network for representation learning. The learned representation then serves as input for classification or regression tasks. We evaluate the performance of DeepQoE with three datasets. The results show that for small datasets (e.g., WHU-MVQoE2016 and Live-Netflix Video Database), the performance of state-of-the-art machine learning algorithms is greatly improved by using the QoE representation from DeepQoE (e.g., from 35.71% to 44.82%); for the large dataset (e.g., VideoSet), our DeepQoE framework achieves significant performance improvement in comparison to the best baseline method (90.94% vs. 82.84%). In addition to the much improved performance, DeepQoE has the flexibility to fit different datasets, to learn QoE representations, and to perform both classification and regression tasks. We also develop a DeepQoE-based adaptive bitrate streaming (ABR) system to verify that our framework can be easily applied to multimedia communication services. The software package of the DeepQoE framework has been released to facilitate the current research on QoE. [ABSTRACT FROM AUTHOR]
- Published
- 2020
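The pipeline the abstract describes (modality features, a shared learned representation, switchable classification/regression heads) reduces to a skeleton like the following. The feature dimensions are placeholders; the real framework extracts them with word embeddings and C3D.

```python
import torch
import torch.nn as nn

class QoENet(nn.Module):
    """Fuse modality features into one representation, then branch."""
    def __init__(self, dims, hidden=128, n_classes=5):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(sum(dims), hidden), nn.ReLU())
        self.cls_head = nn.Linear(hidden, n_classes)  # e.g., rating classes
        self.reg_head = nn.Linear(hidden, 1)          # continuous QoE score

    def forward(self, feats, task="classify"):
        rep = self.fuse(torch.cat(feats, dim=-1))     # shared representation
        return self.cls_head(rep) if task == "classify" else self.reg_head(rep)
```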
11. A New Method and Benchmark for Detecting Co-Saliency Within a Single Image.
- Author
- Yu, Hongkai, Zheng, Kang, Fang, Jianwu, Guo, Hao, and Wang, Song
- Abstract
Recently, saliency detection in a single image and co-saliency detection in multiple images have drawn extensive research interest in the vision and multimedia communities. In this paper, we investigate a new problem of co-saliency detection within a single image, i.e., detecting within-image co-saliency. By identifying common saliency within an image, e.g., highlighting multiple occurrences of an object class with similar appearance, this work can benefit many important applications, such as the detection of objects of interest, more robust object recognition, reduction of information redundancy, and animation synthesis. We propose a new bottom-up method to address this problem. Specifically, a large number of object proposals are first detected from the image. Then we develop an optimization algorithm to derive a set of proposal groups, each of which contains multiple proposals showing good common saliency in the image. For each proposal group, we calculate a co-saliency map and then use a low-rank based algorithm to fuse the maps calculated from all the proposal groups for the final co-saliency map in the image. In the experiments, we collect a new benchmark dataset of 664 color images (two subsets) for within-image co-saliency detection. Experimental results show that the proposed method can detect within-image co-saliency better than existing algorithms. The experimental results also show that the proposed method can be applied to detect repetitive patterns in a single image and co-saliency in multiple images. [ABSTRACT FROM AUTHOR]
- Published
- 2020
12. Tamper-Proofing Video With Hierarchical Attention Autoencoder Hashing on Blockchain.
- Author
- Bui, Tu, Cooper, Daniel, Collomosse, John, Bell, Mark, Green, Alex, Sheridan, John, Higgins, Jez, Das, Arindra, Keller, Jared Robert, and Thereaux, Olivier
- Abstract
We present ARCHANGEL, a novel distributed-ledger-based system for assuring the long-term integrity of digital video archives. First, we introduce a novel deep network architecture using a hierarchical attention autoencoder (HAAE) to compute temporal content hashes (TCHs) from minutes- or hours-long audio-visual streams. Our TCHs are sensitive to accidental or malicious content modification (tampering). The focus of our self-supervised HAAE is to guard against content modification such as frame truncation or corruption while ensuring invariance to format shift (i.e., codec change). This is necessary due to the curatorial requirement for archives to format-shift video over time to ensure future accessibility. Second, we describe how the TCHs (and the models used to derive them) are secured via a proof-of-authority blockchain distributed across multiple independent archives. We report on the efficacy of ARCHANGEL within the context of a trial deployment in which the national government archives of the United Kingdom, United States of America, Estonia, Australia and Norway participated. [ABSTRACT FROM AUTHOR]
- Published
- 2020
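For illustration only, a temporal content hash in this spirit might chain quantized per-segment embeddings (in the paper, HAAE bottleneck features) into one digest, so frame tampering changes the hash while mild, encoder-absorbed format shift ideally does not. The quantization step is a guessed simplification, not ARCHANGEL's scheme.

```python
import hashlib

def temporal_content_hash(segment_embeddings, precision=1):
    """Chain one feature vector per video segment into a single digest."""
    h = hashlib.sha256()
    for emb in segment_embeddings:
        quantized = [round(float(x), precision) for x in emb]
        h.update(repr(quantized).encode())
    return h.hexdigest()
```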
13. Vabis: Video Adaptation Bitrate System for Time-Critical Live Streaming.
- Author
- Feng, Tongtong, Sun, Haifeng, Qi, Qi, Wang, Jingyu, and Liao, Jianxin
- Abstract
With the rise of time-critical and interactive scenarios, ultra-low latency has become the most urgent requirement. Adaptive bitrate (ABR) schemes have been widely used to reduce latency for live streaming services. However, traditional solutions suffer from a key limitation: they only utilize coarse-grained chunks to solve the I-frame misalignment problem when switching between bitrates, at the cost of increased latency. As a result, existing schemes cannot fundamentally guarantee the timeliness and granularity of control. In this paper, we use a frame-based approach to solve the I-frame misalignment problem and propose a frame-level video adaptation bitrate system (Vabis) for time-critical live streaming to obtain the optimal quality of experience (QoE). On the server side, a Few-Wait ABR algorithm based on Reinforcement Learning (RL) is designed to adaptively select the bitrate of future frames from observable state information, which subtly solves the I-frame misalignment problem. A rule-based ABR algorithm is designed to optimize the Vabis system for weak networks. On the client side, three delay control mechanisms are designed to achieve frame-based fine-grained control. We construct a trace-driven simulator and a real live platform to evaluate the comprehensive live streaming performance. The results show that Vabis is significantly better than existing methods, reducing average delay by 32%–77% and improving average QoE by 28%–67%. [ABSTRACT FROM AUTHOR]
- Published
- 2020
14. Representing Modifiable and Reusable Musical Content on the Web With Constrained Multi-Hierarchical Structures.
- Author
- Thalmann, Florian, Wiggins, Geraint A., and Sandler, Mark B.
- Abstract
The most commonly used formats for exchanging musical information today are limited in that they represent music as flat and rigid streams of events or as raw audio signals without any structural information about the content. Such files can only be listened to in a linear way and reused and manipulated in manners determined by a target application such as a Digital Audio Workstation. The publisher has no means to incorporate their intentions or understanding of the content. This article introduces an extension of the music formalism CHARM for the representation of modifiable and reusable musical content on the Web. It discusses how various kinds of multi-hierarchical graph structures together with logical constraints can be useful to model different musical situations. In particular, we focus on presenting solutions on how to interpret, navigate and schedule such structures in order for them to be played back. We evaluate the versatility of the representation in a number of practical examples created with a Web-based implementation based on Semantic Web technologies. [ABSTRACT FROM AUTHOR]
- Published
- 2020
15. Character-Oriented Video Summarization With Visual and Textual Cues.
- Author
- Zhou, Peilun, Xu, Tong, Yin, Zhizhuo, Liu, Dong, Chen, Enhong, Lv, Guangyi, and Li, Changliang
- Abstract
With the boom of content “re-creation” on social media platforms, character-oriented video summaries have become a crucial form of user-generated video content. However, manual extraction can be time-consuming with a high miss rate, while traditional person search techniques may incur a heavy burden on computing resources. At the same time, on social media platforms, videos are usually accompanied by rich textual information, e.g., subtitles or bullet-screen comments, which provide a multi-view description of videos. Thus, there exists a potential to leverage textual information to enhance character-oriented video summarization. To that end, in this paper, we propose a novel framework for jointly modeling visual and textual information. Specifically, we first locate characters indiscriminately through detection methods, and then identify these characters via re-identification to extract potential key-frames, in which the appropriate source of textual information is automatically selected and integrated based on the features of the specific frame. Finally, key-frames are aggregated into the character-oriented summarization. Experiments on real-world data sets validate that our solution outperforms several state-of-the-art baselines on both person search and summarization tasks, which proves the effectiveness of our solution on the character-oriented video summarization problem. [ABSTRACT FROM AUTHOR]
- Published
- 2020
16. 2-D Skeleton-Based Action Recognition via Two-Branch Stacked LSTM-RNNs.
- Author
- Avola, Danilo, Cascio, Marco, Cinque, Luigi, Foresti, Gian Luca, Massaroni, Cristiano, and Rodola, Emanuele
- Abstract
Action recognition in video sequences is an interesting field for many computer vision applications, including behavior analysis, event recognition, and video surveillance. In this article, a method based on 2D skeleton and two-branch stacked Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) cells is proposed. Unlike 3D skeletons, usually generated by RGB-D cameras, the 2D skeletons adopted in this article are reconstructed starting from RGB video streams, therefore allowing the use of the proposed approach in both indoor and outdoor environments. Moreover, any case of missing skeletal data is managed by exploiting 3D-Convolutional Neural Networks (3D-CNNs). Comparative experiments with several key works on KTH and Weizmann datasets show that the method described in this paper outperforms the current state-of-the-art. Additional experiments on UCF Sports and IXMAS datasets demonstrate the effectiveness of our method in the presence of noisy data and perspective changes, respectively. Further investigations on UCF Sports, HMDB51, UCF101, and Kinetics400 highlight how the combination between the proposed two-branch stacked LSTM and the 3D-CNN-based network can manage missing skeleton information, greatly improving the overall accuracy. Moreover, additional tests on KTH and UCF Sports datasets also show the robustness of our approach in the presence of partial body occlusions. Finally, comparisons on UT-Kinect and NTU-RGB+D datasets show that the accuracy of the proposed method is fully comparable to that of works based on 3D skeletons. [ABSTRACT FROM AUTHOR]
- Published
- 2020
17. Deep Multimodality Learning for UAV Video Aesthetic Quality Assessment.
- Author
- Kuang, Qi, Jin, Xin, Zhao, Qinping, and Zhou, Bin
- Abstract
Despite the growing number of unmanned aerial vehicles (UAVs) and aerial videos, there is a paucity of studies focusing on the aesthetics of aerial videos that could provide valuable information for improving the aesthetic quality of aerial photography. In this article, we present a method of deep multimodality learning for UAV video aesthetic quality assessment. More specifically, a multistream framework is designed to exploit aesthetic attributes from multiple modalities, including spatial appearance, drone camera motion, and scene structure. A novel, specially designed motion stream network is proposed for this new multistream framework. We construct a dataset with 6,000 UAV video shots captured by drone cameras. Our model can judge whether a UAV video was shot by a professional photographer or an amateur, and it also performs scene type classification. The experimental results reveal that our method outperforms video classification methods and traditional SVM-based methods for video aesthetics. In addition, we present three application examples of the proposed method: UAV video grading, professional segment detection, and aesthetic-based UAV path planning. [ABSTRACT FROM AUTHOR]
- Published
- 2020
18. Statistical Learning Based Congestion Control for Real-Time Video Communication.
- Author
- Dai, Tongyu, Zhang, Xinggong, Zhang, Yihang, and Guo, Zongming
- Abstract
Existing congestion control schemes struggle to simultaneously achieve low latency, high throughput, good adaptability and fair bandwidth allocation, mainly because of their hardwired control strategies and egocentric convergence objectives. To address these issues, we propose an end-to-end statistical learning based congestion control, named Iris. By exploring the underlying principles of self-inflicted delay, we find that RTT variation is linearly related to the difference between sending rate and receiving rate, which inspires us to control video bit rate using a statistical-learning congestion control model. The key idea of Iris is to force all flows to converge to the same queue load and adjust bit rate by the model. All flows keep a small and fixed number of packets queuing in the network, thus fair bandwidth allocation and low latency are both achieved. Besides, the adjustment step size of the sending rate is updated by online learning, to better adapt to dynamically changing networks. We carried out extensive experiments to evaluate the performance of Iris, with implementations over the transport layer and the application layer respectively. The testing environments include an emulated network, the real-world Internet and commercial cellular networks. Compared against Transmission Control Protocol (TCP) flavors and state-of-the-art protocols, Iris is able to achieve high bandwidth utilization, low latency and good fairness concurrently. Especially for HyperText Transfer Protocol (HTTP) video streaming services, Iris is able to increase the video bitrate by up to 25% and the Peak Signal to Noise Ratio (PSNR) by up to 1 dB. [ABSTRACT FROM AUTHOR]
- Published
- 2020
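The reported linear relation between RTT variation and the send/receive rate difference suggests a control loop that steers every flow toward a common target queue load. The sketch below is a toy reading of that idea; the target and gain are assumptions, and Iris learns its step size online rather than fixing it.

```python
def next_rate(rate_pps, rtt_s, base_rtt_s, recv_rate_pps,
              target_queue_pkts=10, step=0.05):
    """One update of a queue-load-targeting sender, rates in pkts/s."""
    queued = (rtt_s - base_rtt_s) * recv_rate_pps  # est. packets queued
    error = target_queue_pkts - queued
    return max(0.0, rate_pps + step * error)
```

Because every flow targets the same small queue load, flows sharing a bottleneck converge to similar shares, which is how fairness and low delay are obtained together.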
19. A Distance-Driven Alliance for a P2P Live Video System.
- Author
- Zhang, Jinyu, Zhang, Yifan, and Shen, Mengru
- Abstract
In peer-to-peer (P2P) networks, free-riders and redundant streams, including overlapped and folded streams, dramatically degrade playback quality and network performance, respectively. Although a locality-aware P2P live video system can reduce topological complexity, it cannot effectively avoid redundant streams while denying free-riders. In this paper, we first model free-riders, redundant streams and a distance-driven P2P system. Based on that model, a distance-driven alliance algorithm is proposed to construct not only an alliance that directly prevents any utility gains of free-riders through inter-user constraints but also a small-world network or a multicast tree that effectively reduces redundant streams. Finally, simulations confirm its advantages in functionality and performance over several existing strategies and distance-driven P2P live video systems. [ABSTRACT FROM AUTHOR]
- Published
- 2020
20. Intra Coding Strategy for Video Error Resiliency: Behavioral Analysis.
- Author
- Kazemi, Mohammad, Ghanbari, Mohammad, and Shirmohammadi, Shervin
- Abstract
One challenge in video transmission is dealing with packet loss. Since compressed video streams are sensitive to data loss, the error resiliency of the encoded video becomes important. When video data is lost and retransmission is not possible, the missing data should be concealed. However, loss concealment causes distortion in the lossy frame, which also propagates into subsequent frames even if their data are received correctly. One promising solution to mitigate this error propagation is intra coding. There are three approaches to intra coding: intra coding of a number of blocks selected randomly or regularly, intra coding of some specific blocks selected by an appropriate cost function, or intra coding of a whole frame. Intra coding, however, reduces the compression ratio; therefore, there exists a trade-off between bitrate and the error resiliency achieved by intra coding. In this paper, we study and show the best strategy for obtaining the best rate-distortion performance. Considering the error propagation, an objective function is formulated, and with some approximations, this objective function is simplified and solved. The solution demonstrates that periodic I-frame coding is preferred over coding only a number of blocks in intra mode in P-frames. Through examination of various test sequences, it is shown that the best intra frame period depends on the coding bitrate as well as the packet loss rate. We then propose a scheme to estimate this period from curve fitting of the experimental results, and show that our proposed scheme outperforms other methods of intra coding, especially at higher loss rates and coding bitrates. [ABSTRACT FROM AUTHOR]
- Published
- 2020
21. A Fast FoV-Switching DASH System Based on Tiling Mechanism for Practical Omnidirectional Video Services.
- Author
- Song, Jiarun, Yang, Fuzheng, Zhang, Wei, Zou, Wenjie, Fan, Yuqun, and Di, Peiyun
- Abstract
With the development of multimedia technologies and virtual reality display devices, omnidirectional videos have gained popularity. To reduce the bandwidth requirement for omnidirectional video transmission, tile-based viewport adaptive streaming methods have been proposed in the literature. Challenges related to decoding the tiles simultaneously with a limited number of decoders, and to ensuring the user's viewing experience during viewport switches, remain to be solved. In this paper, a two-layer fast viewport-switching dynamic adaptive streaming over HTTP (DASH) system based on a tiling mechanism is proposed, which incorporates the viewing trajectory of end users. To deal with the simultaneous decoding problem, an open group of pictures (GOP) technique is proposed to enable merging different types of tiles into a composite stream at the client side. To reduce the quality recovery duration after a viewport change, a fast-switching strategy is also proposed. Moreover, considering the priorities of different types of chunks, a download strategy is further proposed to adapt to bandwidth fluctuations and viewport changes. Experimental results show that the proposed system can reduce the recovery duration of high-quality video by approximately 90%, providing a better viewing experience to end users. [ABSTRACT FROM AUTHOR]
- Published
- 2020
22. Multimedia Intelligence: When Multimedia Meets Artificial Intelligence.
- Author
- Zhu, Wenwu, Wang, Xin, and Gao, Wen
- Abstract
Owing to the rich emerging multimedia applications and services of the past decade, a vast amount of multimedia data has been produced for the purpose of advanced multimedia research. Furthermore, multimedia research has made great progress on image/video content analysis, multimedia search and recommendation, multimedia streaming, multimedia content delivery, etc. At the same time, Artificial Intelligence (AI) has undergone a “new” wave of development since being officially regarded as an academic discipline in the 1950s, with much of the credit going to the extreme success of deep learning. Thus, one question naturally arises: What happens when multimedia meets Artificial Intelligence? To answer this question, we introduce the concept of Multimedia Intelligence by investigating the mutual influence between multimedia and Artificial Intelligence. We explore this mutual influence from two aspects: i) multimedia drives Artificial Intelligence to experience a paradigm shift towards more explainability, and ii) Artificial Intelligence in turn injects new ways of thinking into multimedia research. As such, these two aspects form a loop in which multimedia and Artificial Intelligence interactively enhance each other. In this paper, we discuss what efforts have been made in the literature and how, and share our insights on research directions that deserve further study to produce a potentially profound impact on multimedia intelligence. [ABSTRACT FROM AUTHOR]
- Published
- 2020
23. Sensor-Augmented Neural Adaptive Bitrate Video Streaming on UAVs.
- Author
- Xiao, Xuedou, Wang, Wei, Chen, Taobin, Cao, Yang, Jiang, Tao, and Zhang, Qian
- Abstract
Recent advances in unmanned aerial vehicle (UAV) technology have revolutionized a broad class of civil and military applications. However, the designs of wireless technologies that enable real-time streaming of high-definition video between UAVs and ground clients present a conundrum. Most existing adaptive bitrate (ABR) algorithms are not optimized for the air-to-ground links, which usually fluctuate dramatically due to the dynamic flight states of the UAV. In this paper, we present SA-ABR, a new sensor-augmented system that generates ABR video streaming algorithms with the assistance of various kinds of inherent sensor data that are used to pilot UAVs. By incorporating the inherent sensor data with network observations, SA-ABR trains a deep reinforcement learning (DRL) model to extract salient features from the flight state information and automatically learn an ABR algorithm to adapt to the varying UAV channel capacity through the training process. SA-ABR does not rely on any assumptions or models about UAV's flight states or the environment, but instead, it makes decisions by exploiting temporal properties of past throughput through the long short-term memory (LSTM) to adapt itself to a wide range of highly dynamic environments. We have implemented SA-ABR in a commercial UAV and evaluated it in the wild. We compare SA-ABR with a variety of existing state-of-the-art ABR algorithms, and the results show that our system outperforms the best known existing ABR algorithm by 21.4% in terms of the average quality of experience (QoE) reward. [ABSTRACT FROM AUTHOR]
- Published
- 2020
24. Part-Aware Fine-Grained Object Categorization Using Weakly Supervised Part Detection Network.
- Author
- Zhang, Yabin, Jia, Kui, and Wang, Zhixin
- Abstract
Fine-grained object categorization aims at distinguishing objects of subordinate categories that belong to the same entry-level object category. It is a rapidly developing subfield in multimedia content analysis. The task is challenging because (1) training images with ground-truth labels are difficult to obtain, and (2) variations among different subordinate categories are subtle. It is well established that the characterizing features of different subordinate categories are located on local parts of object instances. However, manually annotating object parts requires expertise, and such annotation is difficult to generalize to new fine-grained categorization tasks. In this work, we propose a Weakly Supervised Part Detection Network (PartNet) that is able to detect discriminative local parts for use in fine-grained categorization. A vanilla PartNet builds, on top of a base subnetwork, two parallel streams of upper network layers, which respectively compute scores of classification probabilities (over subordinate categories) and detection probabilities (over a specified number of discriminative part detectors) for local regions of interest (RoIs). The image-level prediction is obtained by aggregating element-wise products of these region-level probabilities, and meanwhile diverse part detectors can be learned in an end-to-end fashion under image-level supervision. To generate a diverse set of RoIs as inputs to PartNet, we propose a simple Discretized Part Proposals module (DPP) that directly targets proposing candidates of discriminative local parts, with no bridging via object-level proposals. Experiments on the benchmark datasets CUB-200-2011, Oxford Flower 102 and Oxford-IIIT Pet show the efficacy of our proposed method for both discriminative part detection and fine-grained categorization. In particular, we achieve new state-of-the-art performance on the CUB-200-2011 and Oxford-IIIT Pet datasets when ground-truth part annotations are not available. [ABSTRACT FROM AUTHOR]
- Published
- 2020
25. Multi-Party WebRTC Services Using Delay and Bandwidth Aware SDN-Assisted IP Multicasting of Scalable Video Over 5G Networks.
- Author
- Kirmizioglu, Riza Arda and Tekalp, A. Murat
- Abstract
At present, multi-party WebRTC videoconferencing between peers with heterogeneous network resources and terminals is enabled over the best-effort Internet using a central selective forwarding unit (SFU), where each peer sends a scalable encoded video stream to the SFU. This connection model avoids the upload bandwidth bottleneck associated with mesh connections; however, it increases peer delay and overall network load (resource consumption), in addition to requiring investment in servers, since all video traffic must go through SFU servers. To this end, we propose a new multi-party WebRTC service model over future 5G networks, where a video service provider (VSP) collaborates with a network service provider (NSP) to offer an NSP-managed service that streams scalable video layers using software-defined networking (SDN)-assisted Internet protocol (IP) multicasting between peers over NSP infrastructure. In the proposed service model, each peer sends a scalable coded video upstream, which is selectively duplicated and forwarded as layer streams at SDN switches in the network, instead of at a central SFU, in a multi-party WebRTC session managed by multicast trees maintained by the SDN controller. Experimental results show that the proposed SDN-assisted IP multicast service architecture is more efficient than the SFU model in terms of end-to-end service delay and overall network resource consumption, while avoiding the peer upload bandwidth bottleneck and distributing traffic more evenly across the network. The proposed architecture enables efficient provisioning of premium managed WebRTC services over bandwidth-reserved SDN slices to provide a videoconferencing experience with guaranteed video quality over 5G networks. [ABSTRACT FROM AUTHOR]
- Published
- 2020
26. Sentiment Recognition for Short Annotated GIFs Using Visual-Textual Fusion.
- Author
- Liu, Tianliang, Wan, Junwei, Dai, Xiubin, Liu, Feng, You, Quanzeng, and Luo, Jiebo
- Abstract
With the rapid development of social media, visual sentiment analysis from images and videos has become a hot spot in visual understanding research. In this work, we propose an effective approach using visual and textual fusion for sentiment analysis of short GIF videos with textual descriptions. We extract both sequence-level and frame-level visual features for each given GIF video. Next, we build a visual sentiment classifier using the extracted features. We also define a mapping function, which converts the sentiment probability from the classifier to a sentiment score used in our fusion function. At the same time, for the accompanying textual annotations, we employ the Synset forest to extract the sets of meaningful sentiment words and utilize the SentiWordNet3.0 model to obtain the textual sentiment score. Then, we design a joint visual-textual sentiment score function that weights the visual and textual sentiment components. To make the function more robust, we introduce a noticeable difference threshold to further process the fused sentiment score. Finally, we adopt a grid search technique to obtain the relevant model hyper-parameters by optimizing a sentiment-aware score function. Experimental results and analysis extensively demonstrate the effectiveness of the proposed sentiment recognition scheme on three benchmark datasets: the T-GIF dataset, the GSO-2016 dataset and the Adjusted-GIFGIF dataset. [ABSTRACT FROM AUTHOR]
- Published
- 2020
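One plausible reading of the weighted fusion plus noticeable-difference step is sketched below; the weights, the fallback to the visual score, and the threshold value are all assumptions rather than the paper's exact rule.

```python
def fused_sentiment(visual_score, textual_score,
                    w_v=0.6, w_t=0.4, jnd=0.1):
    """Weighted visual-textual fusion with a noticeable-difference gate."""
    fused = w_v * visual_score + w_t * textual_score
    # Ignore fusion adjustments smaller than the noticeable difference.
    if abs(fused - visual_score) < jnd:
        return visual_score
    return fused
```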
27. Snapshot High Dynamic Range Imaging via Sparse Representations and Feature Learning.
- Author
- Fotiadou, Konstantina, Tsagkatakis, Grigorios, and Tsakalides, Panagiotis
- Abstract
Bracketed High Dynamic Range (HDR) imaging architectures acquire a sequence of Low Dynamic Range (LDR) images in order to either produce a HDR image or an “optimally” exposed LDR image, achieving impressive results under static camera and scene conditions. However, in real world conditions, ghost-like artifacts and noise effects limit the quality of HDR reconstruction. We address these limitations by introducing a post-acquisition snapshot HDR enhancement scheme that generates a bracketed sequence from a small set of LDR images, and in the extreme case, directly from a single exposure. We achieve this goal via a sparse-based approach where transformations between differently exposed images are encoded through a dictionary learning process, while we learn appropriate features by employing a stacked sparse autoencoder (SSAE) based framework. Via experiments with real images, we demonstrate the improved performance of our method over the state-of-the-art, while our single-shot based HDR formulation provides a novel paradigm for the enhancement of LDR imaging and video sequences. [ABSTRACT FROM AUTHOR]
- Published
- 2020
28. Optimizing Fixation Prediction Using Recurrent Neural Networks for 360° Video Streaming in Head-Mounted Virtual Reality.
- Author
- Fan, Ching-Ling, Yen, Shou-Cheng, Huang, Chun-Ying, and Hsu, Cheng-Hsin
- Abstract
We study the problem of predicting the viewing probability of different parts of 360° videos when streaming them to head-mounted displays. We propose a fixation prediction network based on recurrent neural network, which leverages sensor and content features. The content features are derived by computer vision (CV) algorithms, which may suffer from inferior performance due to various types of distortion caused by diverse 360° video projection models. We propose a unified approach with overlapping virtual viewports to eliminate such negative effects, and we evaluate our proposed solution using several CV algorithms, such as saliency detection, face detection, and object detection. We find that overlapping virtual viewports increase the performance of these existing CV algorithms that were not trained for 360° videos. We next fine-tune our fixation prediction network with diverse design options, including: 1) with or without overlapping virtual viewports, 2) with or without future content features, and 3) different feature sampling rates. We empirically choose the best fixation prediction network and use it in a 360° video streaming system. We conduct extensive trace-driven simulations with a large-scale dataset to quantify the performance of the 360° video streaming system with different fixation prediction algorithms. The results show that our proposed fixation prediction network outperforms other algorithms in several aspects, such as: 1) achieving comparable video quality (average gaps between −0.05 and 0.92 dB), 2) consuming much less bandwidth (average bandwidth reduction by up to 8 Mb/s), 3) reducing the rebuffering time (on average 40 s in bandwidth-limited 4G cellular networks), and 4) running in real-time (at most 124 ms). [ABSTRACT FROM AUTHOR]
- Published
- 2020
29. Unmanned Aircraft System Aided Adaptive Video Streaming: A Joint Optimization Approach.
- Author
- Zhan, Cheng, Hu, Han, Wang, Zhi, Fan, Rongfei, and Niyato, Dusit
- Abstract
Due to the coverage constraints of wireless base stations, mobile users suffer from unstable network connections and poor service quality, especially for the prevalent video services. As an alternative solution, an unmanned aerial vehicle (UAV) is able to reach the cell edge and serve ground users (GUs). In this paper, we extend UAV applications to the more challenging adaptive streaming service over fading channels. First, we decompose the system into different modules and present mathematical models for each of them, including a trajectory model of the UAV, fading channels between the UAV and GUs, and video streaming utility. Second, we formulate the problem as a non-convex optimization problem by optimizing the UAV trajectory and transmit power allocation, jointly with the transmission schedule and rate allocation for multiple users. The objective is to maximize the overall utility while guaranteeing fairness among multiple users under the UAV energy budget and rate-outage probability constraints. Third, to tackle this problem, we first analyze the relationship between transmission rate and rate-outage probability over the fading channel, and then divide the original problem into three subproblems, which can be solved by leveraging the successive convex approximation technique. Furthermore, an overall iterative algorithm over the three subproblems is proposed to obtain a locally optimal solution by applying the block coordinate descent technique. Finally, through extensive experiments, we demonstrate that the proposed design can achieve an almost 30% performance gain in terms of max-min streaming utility for all users, compared with other benchmark schemes. [ABSTRACT FROM AUTHOR]
- Published
- 2020
30. MLC STT-MRAM-Aware Memory Subsystem for Smart Image Applications.
- Author
- Jang, Wooyoung
- Abstract
Next-generation memories with high storage capacity, high performance, and low power consumption are being researched due to the ever-growing demand for artificial intelligence and high-definition applications. Among such future memories, the multi-level cell (MLC) spin-transfer torque magnetic random access memory (STT-MRAM) attracts considerable attention as an alternative to static or dynamic random access memories. An MLC STT-MRAM has the advantages of capacity and non-volatility, but the disadvantages of performance, power consumption, and endurance resulting from complicated resistance state transition and detection processes. In particular, such issues are exacerbated in the latest smart image applications employing block-based processing algorithms. In this paper, we propose a memory subsystem that mitigates the MLC STT-MRAM disadvantages in smart image applications. Our main idea is threefold: MLC-aware image buffer composing, block-aware pixel-to-memory mapping, and prediction-aware image-to-buffer allocating techniques that all reduce the need for multi-step resistance state transition and detection processes. Experimental results show that the proposed memory subsystem achieves 24.5% shorter application execution time and 96.4% lower memory power consumption than conventional memory subsystems for industrial smart image applications. In addition, our memory subsystem increases the lifetime of MLC STT-MRAMs via 93.8% fewer multi-step resistance state transition processes. [ABSTRACT FROM AUTHOR]
- Published
- 2020
31. Incentive Mechanism for Cooperative Scalable Video Coding (SVC) Multicast Based on Contract Theory.
- Author
- Xu, Zeyu, Cao, Yang, Wang, Wei, Jiang, Tao, and Zhang, Qian
- Abstract
In scalable video coding (SVC) multicast, videos are encoded into several layers that represent multiple quality levels. Mobile users with different wireless channel conditions can obtain different numbers of layers and thus have different quality of experience (QoE). To enhance the QoE of users that suffer from worse channel quality, it is beneficial to stimulate users’ cooperation in relaying enhancement layers. However, potential relays may be unwilling to truthfully cooperate with receivers, which results in an asymmetric information problem in relay selection. In this paper, we model video relay selection as a market with multiple receivers (principals) and relays (agents), and solve the problem using contract theory. The proposed solution consists of the following two steps: first, contract design and item preselection, and second, matching between each principal and agent. We propose a contract parameter determination method termed the Matching-Aware strategy. Different from traditional strategies, the proposed Matching-Aware strategy makes the contract competitive in principal-agent matching without knowing the probability distribution of relays’ types. The matching step is undertaken by the base station with the purpose of maximizing social welfare. Numerical results corroborate that the contract-based video relaying scheme can tackle the asymmetric information problem. Besides, compared with the other two baseline strategies, the proposed Matching-Aware strategy achieves higher QoE. [ABSTRACT FROM AUTHOR]
- Published
- 2020
32. Robust QoE-Driven DASH Over OFDMA Networks.
- Author
- Xiao, Kefan, Mao, Shiwen, and Tugnait, Jitendra K.
- Abstract
In this paper, the problem of effective and robust delivery of Dynamic Adaptive Streaming over HTTP (DASH) videos over an orthogonal frequency-division multiple access (OFDMA) network is studied. Motivated by a measurement study, we propose to explore the request interval and robust rate prediction for DASH over OFDMA. We first formulate an offline cross-layer optimization problem based on a novel quality of experience (QoE) model. Then the online reformulation is derived and proved to be asymptotically optimal. After analyzing the structure of the online problem, we propose a decomposition approach to obtain a user equipment (UE) rate adaptation problem and a base station (BS) resource allocation problem. We introduce stochastic model predictive control (SMPC) to achieve high robustness in video rate adaptation and consider the request interval for more efficient resource allocation. Extensive simulations show that the proposed scheme can achieve better QoE performance compared with other variations and a benchmark algorithm, which is mainly due to its lower rebuffering ratio and more stable bitrate choices. [ABSTRACT FROM AUTHOR]
- Published
- 2020
33. Video Storytelling: Textual Summaries for Events.
- Author
- Li, Junnan, Wong, Yongkang, Zhao, Qi, and Kankanhalli, Mohan S.
- Abstract
Bridging vision and natural language is a longstanding goal in computer vision and multimedia research. While earlier works focus on generating a single-sentence description for visual content, recent works have studied paragraph generation. In this paper, we introduce the problem of video storytelling, which aims at generating coherent and succinct stories for long videos. Video storytelling introduces new challenges, mainly due to the diversity of the story and the length and complexity of the video. We propose novel methods to address the challenges. First, we propose a context-aware framework for multimodal embedding learning, where we design a residual bidirectional recurrent neural network to leverage contextual information from past and future. The multimodal embedding is then used to retrieve sentences for video clips. Second, we propose a Narrator model to select clips that are representative of the underlying storyline. The Narrator is formulated as a reinforcement learning agent, which is trained by directly optimizing the textual metric of the generated story. We evaluate our method on the video story dataset, a new dataset that we have collected to enable the study. We compare our method with multiple state-of-the-art baselines and show that our method achieves better performance, in terms of quantitative measures and user study. [ABSTRACT FROM AUTHOR]
- Published
- 2020
34. Multi-Scale Based Context-Aware Net for Action Detection.
- Author
- Liu, Haijun, Wang, Shiguang, Wang, Wen, and Cheng, Jian
- Abstract
We address the problem of action detection in continuous untrimmed video streams, based on a two-stage framework: one stage for action proposal generation and the other for proposal classification and refinement. The context features inside and outside a candidate region (proposal) are critical for classification in action detection. Therefore, effective integration of these features at different scales has become a fundamental problem. We contend that different action instances and candidate proposals may need different context features. To address this issue, we present a novel multiple-scales-based context-aware net (MSCA-Net) to effectively classify the action proposals for action detection in this paper. For each candidate action proposal, MSCA-Net takes its multiple regions with different temporal scales as input and then generates suitable context features. Based on the “candidate-control” mechanism of LSTM, the proposed MSCA-Net adopts a two-branch structure: Branch1 generates multi-scale context features for each candidate proposal, whereas Branch2 utilizes the context-aware gate function to control the message passing. Extensive experiments on the THUMOS’14, Charades daily and ActivityNet action detection datasets demonstrate the effectiveness of the designed structure and show how these context features influence the detection results. [ABSTRACT FROM AUTHOR]
- Published
- 2020
35. Exploiting Mid-Level Semantics for Large-Scale Complex Video Classification.
- Author
- Zhang, Ji, Mei, Kuizhi, Zheng, Yu, and Fan, Jianping
- Abstract
As the amount of available video data has grown substantially, automatic video classification has become an urgent yet challenging task. Most video classification methods focus on acquiring discriminative spatial visual features and motion patterns for video representation, especially deep learning methods, which have achieved very good results on action recognition problems. However, the performance of most of these methods drastically degenerates for more generic video classification tasks where the video contents are much more complex. Thus, in this paper, the mid-level semantics of videos are exploited to bridge the semantic gap between low-level features and high-level video semantics. Inspired by term frequency-inverse document frequency (TF-IDF), a word weighting method for the problem of text classification, word weighting is introduced to the video domain. The visual objects in videos are regarded as the words in texts, and two new weighting methods are proposed to encode videos by weighting visual objects according to the characteristics of videos. In addition, the semantic similarities between video categories and visual objects are introduced from the text domain as privileged information to facilitate classifier training on the obtained semantic representations of videos. The proposed semantic encoding method (semantic stream) is then fused with the popular two-stream CNN model for the final classification results. Experiments are conducted on two large-scale complex video datasets, CCV and ActivityNet. The experimental results validate the effectiveness of the proposed methods. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
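The tf-idf analogy above translates directly into code. In this Python sketch, detected object labels play the role of words and videos the role of documents; the toy vocabulary is hypothetical, and the paper's two proposed weightings refine this basic scheme.

import numpy as np
from collections import Counter

def tfidf_object_encoding(video_objects, vocab):
    """Encode each video by tf-idf weights over its detected visual objects."""
    n_videos = len(video_objects)
    df = Counter()
    for objs in video_objects:
        df.update(set(objs))                        # object document frequency
    idf = {w: np.log(n_videos / (1 + df[w])) for w in vocab}
    enc = np.zeros((n_videos, len(vocab)))
    for i, objs in enumerate(video_objects):
        tf, total = Counter(objs), max(1, len(objs))
        for j, w in enumerate(vocab):
            enc[i, j] = (tf[w] / total) * idf[w]    # term frequency * idf
    return enc

videos = [["dog", "person", "ball"], ["person", "car", "car"], ["dog", "dog"]]
print(tfidf_object_encoding(videos, vocab=["dog", "person", "ball", "car"]))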
36. AccAnn: A New Subjective Assessment Methodology for Measuring Acceptability and Annoyance of Quality of Experience.
- Author
-
Li, Jing, Krasula, Lukas, Baveye, Yoann, Li, Zhi, and Le Callet, Patrick
- Abstract
User expectations have a crucial impact on the levels of quality of experience (QoE) that they consider acceptable or satisfying. Measuring acceptability and annoyance has mainly been performed in separate or multi-step experiments without any control over participants' expectations. This paper introduces a simple methodology to obtain information about both quantities in a single step and compares several data-processing strategies useful for interpreting the results. A specifically designed subjective experiment, conducted on compressed videos, has shown that the multi-step procedures can be replaced by our proposed single-step approach regardless of the viewing conditions, while the novel approach is significantly preferred by observers for its low time requirements and higher intuitiveness. The test has simultaneously proven that user expectations can be altered by the instructions; it is therefore possible to simulate different user profiles regardless of the participants' real habits. The acceptability/annoyance experimental results are also used to benchmark state-of-the-art objective video quality metrics in predicting the acceptability/annoyance of QoE. A case study on determining the acceptability/annoyance threshold for objective quality metrics is conducted, which can serve as a guideline for video streaming service providers. [ABSTRACT FROM AUTHOR] (An illustrative sketch follows this entry.)
- Published
- 2019
- Full Text
- View/download PDF
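As a toy illustration of the threshold case study mentioned above, this Python sketch estimates the objective-quality value at which a target fraction of observers finds a video acceptable. The linear interpolation, the monotonicity assumption, and the 50% target are all assumptions; the actual AccAnn methodology also collects annoyance in the same single-step trial.

import numpy as np

def acceptability_threshold(quality_scores, votes, target=0.5):
    """quality_scores: objective metric value per test condition;
    votes: (n_conditions, n_observers) binary acceptability judgments.
    Assumes the acceptance ratio is roughly monotone in quality."""
    ratios = np.asarray(votes).mean(axis=1)   # acceptance ratio per condition
    order = np.argsort(quality_scores)
    q, r = np.asarray(quality_scores, float)[order], ratios[order]
    return float(np.interp(target, r, q))     # quality where `target` accept

q = [20, 35, 50, 65, 80]                      # hypothetical metric values
votes = [[0, 0, 0, 1], [0, 0, 1, 1], [0, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1]]
print(acceptability_threshold(q, votes))      # 35.0: ~50% accept here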
37. COCO-CN for Cross-Lingual Image Tagging, Captioning, and Retrieval.
- Author
-
Li, Xirong, Xu, Chaoxi, Wang, Xiaoxu, Lan, Weiyu, Jia, Zhengxiong, Yang, Gang, and Xu, Jieping
- Abstract
This paper contributes to cross-lingual image annotation and retrieval in terms of both data and baseline methods. We propose COCO-CN, a novel dataset enriching MS-COCO with manually written Chinese sentences and tags. For effective annotation acquisition, we develop a recommendation-assisted collective annotation system that automatically provides an annotator with several tags and sentences deemed relevant to the pictorial content. With 20,342 images annotated with 27,218 Chinese sentences and 70,993 tags, COCO-CN is currently the largest Chinese-English dataset providing a unified and challenging platform for cross-lingual image tagging, captioning, and retrieval. We develop conceptually simple yet effective methods per task for learning from cross-lingual resources. Extensive experiments on the three tasks justify the viability of the proposed dataset and methods. Data and code are publicly available at https://github.com/li-xirong/coco-cn. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
38. S-MDP: Streaming With Markov Decision Processes.
- Author
-
Khan, Koffka and Goodridge, Wayne
- Abstract
In recent years, adaptive video players competing for network resources at a single bottleneck link have become common. This competition generally occurs at routers in household local area networks and severely degrades viewers' quality of experience in terms of fairness and stability. Researchers harness Markov decision process (MDP) models to optimize the adaptive video streaming process. Typically, players follow a policy based on numerous parameters, such as buffer occupancy or average bandwidth. In this study, we depart from this traditional decentralized client-side MDP approach by allowing players to share a network state among themselves, which we call a streaming MDP (S-MDP). This state includes a discrete data rate measurement (DRM) value, a normalized value of a player's incoming bitrate on an interval measurement scale. Players use video bitrates to produce unique state transition matrices. The S-MDP reward matrix penalizes excessive switching along the DRM interval scale and thus encourages stability. At intervals during streaming, players create unique policies, and the near-real-time update of these policies enables players' DRM values to converge. S-MDP performs well in emulation experiments compared with four streaming methods: k-chunk MDP, stochastic dynamic programming for adaptive streaming over HTTP (sdpDASH), MDP-based DASH, and RTRA_S. In Internet experiments, we compare the performance of streaming methods on a Roku, an Amazon Fire TV, and an Apple TV, as well as against users who play online games, download files, and/or chat online. S-MDP outperforms the other methods in terms of both objective and subjective visual quality, except in the presence of long-lived TCP flows, such as Skype. [ABSTRACT FROM AUTHOR] (An illustrative sketch follows this entry.)
- Published
- 2019
- Full Text
- View/download PDF
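A minimal Python sketch of the MDP machinery described above: value iteration over discrete DRM-like levels with a reward that penalizes jumps along the scale. The uniform transition model, penalty weight, and utility values are toy assumptions, not the S-MDP specification.

import numpy as np

def smdp_policy(P, levels, switch_penalty=1.5, gamma=0.95, iters=200):
    """P: (n, n) network-driven state transition matrix; levels: utility of
    each bitrate level. Reward = chosen level's utility minus a penalty
    proportional to the jump along the scale, discouraging switching."""
    n = len(levels)
    V = np.zeros(n)
    for _ in range(iters):
        Q = np.empty((n, n))
        for s in range(n):
            for a in range(n):
                Q[s, a] = levels[a] - switch_penalty * abs(a - s) + gamma * P[s] @ V
        V = Q.max(axis=1)
    return Q.argmax(axis=1)       # bitrate level to pick in each state

P = np.full((4, 4), 0.25)         # toy uniform transitions
print(smdp_policy(P, levels=np.array([1.0, 2.0, 3.0, 4.0])))  # [0 1 2 3]: stay put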
39. A High-Efficiency Compressed Sensing-Based Terminal-to-Cloud Video Transmission System.
- Author
-
Zheng, Shuai, Zhang, Xiao-Ping, Chen, Jian, and Kuo, Yonghong
- Abstract
With the rapid popularization of mobile intelligent terminals, mobile video and cloud service applications are widely used in people's lives. However, the resource-constrained nature of the terminals and the enormous amount of video information make efficient terminal-to-cloud data upload a challenge. To solve this problem, this paper proposes a compressed sensing-based high-efficiency video upload system for the terminal-to-cloud upload network. The system contains two main new components. First, to effectively remove inter-frame redundant information, a high-efficiency encoder sampling scheme is developed by applying skip block-based residual compressed sensing sampling. Under time-varying channel conditions, the encoder can adaptively allocate the sampling rate to different video frames via the proposed adaptive sampling scheme. Second, a local secondary reconstruction-based multi-reference-frame cross-recovery algorithm is developed at the decoder. It further improves the reconstruction quality and reduces quality fluctuation in the recovered video frames to improve the user experience. Compared with the state-of-the-art reference systems reported in the literature, the proposed system achieves high-efficiency, high-quality terminal-to-cloud transmission. [ABSTRACT FROM AUTHOR] (An illustrative sketch follows this entry.)
- Published
- 2019
- Full Text
- View/download PDF
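A Python sketch of the skip block-based residual sampling idea, assuming a Gaussian measurement matrix, a fixed skip threshold, and a single sampling rate; in the paper the rate is adapted per frame to the time-varying channel state.

import numpy as np

rng = np.random.default_rng(0)

def sample_residual_blocks(frame, ref, block=16, rate=0.25, skip_thresh=5.0):
    """Measure only blocks whose residual against the reference frame has
    enough energy; nearly static blocks are skipped (flagged, no samples)."""
    H, W = frame.shape
    n = block * block
    m = max(1, int(rate * n))                    # measurements per block
    Phi = rng.standard_normal((m, n)) / np.sqrt(m)
    measurements, skipped = {}, []
    for y in range(0, H, block):
        for x in range(0, W, block):
            res = (frame[y:y+block, x:x+block] - ref[y:y+block, x:x+block]).ravel()
            if np.linalg.norm(res) < skip_thresh:
                skipped.append((y, x))
            else:
                measurements[(y, x)] = Phi @ res  # compressed measurements
    return Phi, measurements, skipped

ref = rng.standard_normal((64, 64))
frame = ref.copy(); frame[:16, :16] += 3.0        # only one block changes
_, meas, skipped = sample_residual_blocks(frame, ref)
print(len(meas), "sampled,", len(skipped), "skipped")   # 1 sampled, 15 skipped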
40. Dynamic Cross-Layer Signaling Exchange for Real-Time and On-Demand Multimedia Streams.
- Author
-
Shamieh, Fuad and Wang, Xianbin
- Abstract
Multimedia streams consume a significant share of consumer Internet traffic and will continue to do so due to the ever-increasing connections among people, businesses, and industries. To cope with the deviation from the Internet's intended use, unreliable underlying infrastructure, and best-effort protocols while leveraging existing technologies, HTTP Adaptive Streaming (HAS) is utilized by numerous multimedia services. The performance of HAS-based streaming services is limited by the growing control overhead generated by the Transmission Control Protocol/Internet Protocol (TCP/IP) stack as the stream length, multimedia fidelity, and network conditions vary. In this paper, a novel cross-layer steganographic signaling scheme is proposed to reduce service-provider costs while improving multimedia session performance and maintaining the expected Quality of Service (QoS). The proposed scheme encodes control messages from any TCP/IP layer within payload messages to reduce the total overhead exchanged, thereby decreasing resource utilization at source and intermediate nodes. Furthermore, the encoding scheme probes network conditions and session statistics for adaptive decision-making, enabling real-time adaptability of the proposed process. A utility function is developed to find the optimal cost savings, and simulations are conducted to verify the design. The proposed solution is then implemented using VideoLAN media player transceivers residing in Linux containers, where a multimedia file is exchanged in the popular Advanced Video Coding (H.264) format. The results show decreases in bandwidth and average queue-waiting-time costs of 4.71% and 29.61%, respectively, with a throughput increase of 5.77%. [ABSTRACT FROM AUTHOR] (An illustrative sketch follows this entry.)
- Published
- 2019
- Full Text
- View/download PDF
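For illustration only: the paper embeds signaling steganographically, whereas this Python sketch shows the simpler, related idea of piggybacking a control message in spare payload room so that no separate control packet is sent. The framing (magic value, length trailer) is entirely hypothetical.

import struct

MAGIC = 0xC0DE

def piggyback(payload: bytes, ctrl: bytes, mtu: int = 1400) -> bytes:
    """Layout: payload | ctrl | magic(2B) | ctrl_len(2B), if it fits the MTU."""
    trailer = ctrl + struct.pack("!HH", MAGIC, len(ctrl))
    if len(payload) + len(trailer) > mtu:
        raise ValueError("no spare room; send the control message separately")
    return payload + trailer

def extract(packet: bytes):
    """Split a received packet back into media payload and control message."""
    magic, n = struct.unpack("!HH", packet[-4:])
    if magic != MAGIC:
        return packet, b""                 # nothing piggybacked
    return packet[:-4 - n], packet[-4 - n:-4]

media, ctrl = extract(piggyback(b"\x00" * 1200, b"RATE_UP 2500"))
print(len(media), ctrl)                    # 1200 b'RATE_UP 2500'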
41. Cache Less for More: Exploiting Cooperative Video Caching and Delivery in D2D Communications.
- Author
-
Wu, Dapeng, Liu, Qianru, Wang, Honggang, Yang, Qing, and Wang, Ruyan
- Abstract
The ever-increasing demand for videos on mobile devices poses a significant challenge to existing cellular network infrastructures. To cope with this challenge, we propose a user-centric video transmission mechanism based on device-to-device (D2D) communications that allows mobile users to cache and share videos with each other in a cooperative manner. The proposed solution jointly considers users' similarity in accessing videos, sharing willingness, location distribution, and quality of experience (QoE) requirements in order to achieve a QoE-guaranteed video streaming service in a cellular network. Specifically, a service set consisting of several service providers and mobile users is dynamically configured to provide timely service according to the probability of successful service. Numerical results show that when the numbers of providers and demanded videos are 40 and 2, respectively, the improved user experience rate of the proposed solution is approximately 85%, and the data offload rate at the base station(s) is about 78%. [ABSTRACT FROM AUTHOR] (An illustrative sketch follows this entry.)
- Published
- 2019
- Full Text
- View/download PDF
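A small Python sketch of configuring a service set by probability of successful service, assuming independent provider successes and a greedy selection rule; the paper's mechanism additionally weighs access similarity, sharing willingness, location, and QoE requirements.

def service_set(providers, target=0.95):
    """providers: (name, p_success) pairs. Greedily add the most reliable
    providers until P(at least one serves the request) reaches the target."""
    chosen, p_fail = [], 1.0
    for name, p in sorted(providers, key=lambda x: -x[1]):
        chosen.append(name)
        p_fail *= 1.0 - p              # probability that all chosen fail
        if 1.0 - p_fail >= target:
            break
    return chosen, 1.0 - p_fail

peers = [("u1", 0.6), ("u2", 0.5), ("u3", 0.4), ("u4", 0.7)]
print(service_set(peers))              # (['u4', 'u1', 'u2', 'u3'], 0.964)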
42. Fine-Grained Land Use Classification at the City Scale Using Ground-Level Images.
- Author
-
Zhu, Yi, Deng, Xueqing, and Newsam, Shawn
- Abstract
Multimedia researchers have exploited large collections of community-contributed geo-referenced images to better understand a particular image, such as its subject matter or where it was taken, as well as to better understand a geographic location, such as the most visited tourist spots in a city or what the local cuisine is like. The goal of this paper is to better understand location. In particular, we use geo-referenced image collections to understand what occurs in different parts of a city at fine spatial and activity-class scales, a problem known as land use mapping in the geographical sciences. We propose a novel framework to perform fine-grained land use mapping at the city scale using ground-level images. Mapping land use is considerably more difficult than mapping land cover and is generally not possible using overhead imagery, as it requires close-up views and seeing inside buildings. We postulate that growing collections of geo-referenced, ground-level images suggest an alternate approach to this geographic knowledge discovery problem. We develop a general framework that uses Flickr images to map 45 different land-use classes for the city of San Francisco, CA, USA. Individual images are classified using a novel convolutional neural network containing two streams: one for recognizing objects and another for recognizing scenes. This network is trained end-to-end directly on the labeled training images. We propose several novel strategies to overcome the noisiness of our user-generated data, including search-based training-set augmentation and online adaptive training. We derive a ground-truth map of San Francisco in order to evaluate our method and demonstrate the effectiveness of our approach through geovisualization and quantitative analysis. Our framework achieves over 29% recall at the individual land-parcel level, which represents a strong baseline for the challenging 45-way land use classification problem, especially given the noisiness of the image data. [ABSTRACT FROM AUTHOR] (An illustrative sketch follows this entry.)
- Published
- 2019
- Full Text
- View/download PDF
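A PyTorch sketch of the two-stream design: one backbone for objects and one for scenes, fused by concatenation before a 45-way land-use classifier. Both backbones here are untrained ResNet-18 stand-ins; the paper uses networks pretrained for object and scene recognition, respectively.

import torch
import torch.nn as nn
from torchvision import models

class TwoStreamLandUse(nn.Module):
    """Late fusion of an object stream and a scene stream."""
    def __init__(self, n_classes=45):
        super().__init__()
        self.obj = models.resnet18(weights=None)    # stand-in object stream
        self.scene = models.resnet18(weights=None)  # stand-in scene stream
        dim = self.obj.fc.in_features
        self.obj.fc = nn.Identity()                 # expose pooled features
        self.scene.fc = nn.Identity()
        self.head = nn.Linear(2 * dim, n_classes)

    def forward(self, img):
        fused = torch.cat([self.obj(img), self.scene(img)], dim=1)
        return self.head(fused)

logits = TwoStreamLandUse()(torch.randn(2, 3, 224, 224))
print(logits.shape)                                 # torch.Size([2, 45])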
43. A Hierarchical Approach for Associating Body-Worn Sensors to Video Regions in Crowded Mingling Scenarios.
- Author
-
Cabrera-Quiros, Laura and Hung, Hayley
- Abstract
We address the complex problem of associating several wearable devices with the spatio-temporal regions of their wearers in video during crowded mingling events, using only acceleration and proximity. This is a particularly important first step for multi-sensor behavior analysis using video and wearable technologies, where the privacy of the participants must be maintained. Most state-of-the-art works using these two modalities perform the association manually, which becomes practically infeasible as the number of people in the scene increases. We propose an automatic association method based on hierarchical linear assignment optimization, which exploits the spatial context of the scene. Moreover, we present extensive experiments on matching from 2 to more than 69 acceleration and video streams, showing significant improvements over a random baseline in a real-world crowded mingling scenario. We also show the effectiveness of our method for incomplete or missing streams (up to a certain limit) and analyze the tradeoff between the length of the streams and the number of participants. Finally, we provide an analysis of failure cases, showing that a deep understanding of the social actions within the context of the event is necessary to further improve performance on this intriguing task. [ABSTRACT FROM AUTHOR] (An illustrative sketch follows this entry.)
- Published
- 2019
- Full Text
- View/download PDF
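The core association step can be illustrated with SciPy's assignment solver: match acceleration streams to video tracks by maximizing the correlation between each wearable's acceleration magnitude and each track's visual motion magnitude. The correlation cost and synthetic data are assumptions; the paper wraps this step in a hierarchical scheme that exploits spatial context.

import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(accel, video_motion):
    """accel: (n_devices, T); video_motion: (n_tracks, T).
    Cost = negative Pearson correlation; solve for a one-to-one matching."""
    a = (accel - accel.mean(1, keepdims=True)) / (accel.std(1, keepdims=True) + 1e-9)
    v = (video_motion - video_motion.mean(1, keepdims=True)) / (video_motion.std(1, keepdims=True) + 1e-9)
    corr = a @ v.T / accel.shape[1]
    rows, cols = linear_sum_assignment(-corr)   # maximize total correlation
    return list(zip(rows.tolist(), cols.tolist()))

rng = np.random.default_rng(1)
truth = rng.standard_normal((3, 500))
accel = truth + 0.1 * rng.standard_normal((3, 500))
video = truth[[2, 0, 1]] + 0.1 * rng.standard_normal((3, 500))  # shuffled tracks
print(associate(accel, video))    # [(0, 1), (1, 2), (2, 0)]: mapping recovered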
44. Fast H.264 to HEVC Transcoding: A Deep Learning Method.
- Author
-
Xu, Jingyao, Xu, Mai, Wei, Yanan, Wang, Zulin, and Guan, Zhenyu
- Abstract
With the development of video coding technology, High Efficiency Video Coding (HEVC) has become a promising successor to previous coding standards such as H.264. In general, H.264 to HEVC transcoding is accomplished by full H.264 decoding followed by full HEVC encoding, which suffers from considerable time consumption due to the brute-force search of the HEVC coding tree unit (CTU) partition for rate-distortion optimization (RDO). In this paper, we propose a deep learning method to predict the HEVC CTU partition, instead of the brute-force RDO search, for H.264 to HEVC transcoding. First, we build a large-scale H.264 to HEVC transcoding database. Second, we investigate the correlation between the HEVC CTU partition and H.264 features, and analyze both temporal and spatial-temporal similarities of the CTU partition across video frames. Third, we propose a hierarchical long short-term memory (H-LSTM) architecture to predict the CTU partition of HEVC. The brute-force RDO search of the CTU partition is then replaced by the H-LSTM prediction, significantly reducing computational time for fast H.264 to HEVC transcoding. Finally, the experimental results verify that the proposed H-LSTM method achieves a better tradeoff between coding efficiency and complexity than state-of-the-art H.264 to HEVC transcoding methods. [ABSTRACT FROM AUTHOR] (An illustrative sketch follows this entry.)
- Published
- 2019
- Full Text
- View/download PDF
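A simplified PyTorch stand-in for the hierarchical LSTM idea: one LSTM per quadtree depth, each consuming per-frame CU features plus the previous depth's hidden sequence and emitting a split probability, so thresholding the outputs replaces the RDO search. Feature dimensions and the three-depth structure are illustrative assumptions, not the paper's exact H-LSTM.

import torch
import torch.nn as nn

class SplitPredictor(nn.Module):
    def __init__(self, feat_dim=64, hidden=128, depths=3):
        super().__init__()
        self.lstms = nn.ModuleList(
            nn.LSTM(feat_dim + (hidden if d > 0 else 0), hidden, batch_first=True)
            for d in range(depths))
        self.heads = nn.ModuleList(nn.Linear(hidden, 1) for _ in range(depths))

    def forward(self, feats):                 # feats: (B, T, feat_dim)
        probs, prev = [], None
        for lstm, head in zip(self.lstms, self.heads):
            x = feats if prev is None else torch.cat([feats, prev], dim=-1)
            prev, _ = lstm(x)                 # hidden sequence feeds next depth
            probs.append(torch.sigmoid(head(prev)))
        return probs                          # split probability per depth

out = SplitPredictor()(torch.randn(2, 10, 64))
print([tuple(o.shape) for o in out])          # [(2, 10, 1)] * 3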
45. Video Big Data Retrieval Over Media Cloud: A Context-Aware Online Learning Approach.
- Author
-
Feng, Yinan, Zhou, Pan, Xu, Jie, Ji, Shouling, and Wu, Dapeng
- Abstract
Online video sharing (e.g., via YouTube or YouKu) has emerged as one of the most important services on the current Internet, where billions of videos on the cloud await exploration. Hence, a personalized video retrieval system is needed to help users find interesting videos in big data content. Two of the main challenges are to efficiently process the increasing amount of video big data and to resolve the accompanying "cold start" issue. Another challenge is to satisfy users' need for personalized retrieval results whose accuracy is unknown. In this paper, we formulate personalized video big data retrieval as an interaction between the user and the system via a stochastic process, rather than just similarity matching under a known accuracy (feedback) model; we introduce users' real-time context into the retrieval system; and we propose a general framework for this problem. Using a novel contextual multi-armed bandit-based algorithm to balance accuracy and efficiency, we propose a context-based, online, big-data-oriented personalized video retrieval system. This system supports datasets that grow dynamically in size and has the property of cross-modal retrieval. Our approach provides accurate retrieval results with sublinear regret and linear storage complexity and significantly improves the learning speed. Furthermore, by learning for a cluster of similar contexts simultaneously, we can achieve sublinear storage complexity with the same regret but slightly poorer performance on the "cold start" issue compared to the previous approach. We validate our theoretical results experimentally on a very large dataset; the results demonstrate that the proposed algorithms outperform existing bandit-based online learning methods in terms of accuracy and efficiency, and that the adaptation of the bandit framework offers additional benefits. [ABSTRACT FROM AUTHOR] (An illustrative sketch follows this entry.)
- Published
- 2019
- Full Text
- View/download PDF
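A toy Python sketch of the contextual bandit viewpoint: contexts are discretized into clusters, and within each cluster a UCB1 rule balances exploring new result arms against exploiting arms with good past feedback. The clusters, reward model, and click probabilities are hypothetical; the paper's algorithm adds adaptive context partitioning with sublinear-regret guarantees.

import math, random
from collections import defaultdict

class ContextualUCB:
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))
        self.values = defaultdict(lambda: defaultdict(float))
        self.total = defaultdict(int)

    def select(self, context, arms):
        self.total[context] += 1
        def ucb(a):
            n = self.counts[context][a]
            if n == 0:
                return float("inf")           # try every arm at least once
            return self.values[context][a] + math.sqrt(
                2 * math.log(self.total[context]) / n)
        return max(arms, key=ucb)

    def update(self, context, arm, reward):
        n = self.counts[context][arm] = self.counts[context][arm] + 1
        self.values[context][arm] += (reward - self.values[context][arm]) / n

bandit = ContextualUCB()
ctr = {"v1": 0.8, "v2": 0.3}                  # hidden per-arm click rates
for _ in range(500):
    arm = bandit.select("sports", ["v1", "v2"])
    bandit.update("sports", arm, float(random.random() < ctr[arm]))
print(dict(bandit.counts["sports"]))          # "v1" dominates after learning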
46. Saliency Detection via Multi-Scale Global Cues.
- Author
-
Lin, Xiao, Wang, Zhi-Jie, Ma, Lizhuang, and Wu, Xiabao
- Abstract
Saliency detection technologies are very useful for analyzing and extracting important information from given multimedia data and have been used extensively in many multimedia applications. Past studies have revealed that utilizing global cues is effective for saliency detection. Nevertheless, most prior works considered only single-scale segmentation when employing global cues. In this paper, we incorporate multi-scale global cues into the saliency detection problem. Realizing this proposal is interesting but also challenging (e.g., how to obtain appropriate foreground and background seeds effectively, and how to merge rough saliency results into the final saliency map efficiently). To address these challenges, we present a three-phase solution that integrates several targeted strategies: first, a self-adaptive strategy for obtaining appropriate filter parameters; second, a cross-validation scheme for selecting appropriate background and foreground seeds; and third, a weight-based approach for merging the rough saliency maps. Our solution is easy to understand and implement, without loss of effectiveness. Extensive experimental results on benchmark datasets demonstrate the feasibility and competitiveness of the proposed solution. [ABSTRACT FROM AUTHOR] (An illustrative sketch follows this entry.)
- Published
- 2019
- Full Text
- View/download PDF
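A NumPy sketch of the third phase above, weight-based merging of rough saliency maps, using each normalized map's variance as a crude quality proxy for its weight. The paper derives its weights differently; this only illustrates the merging structure.

import numpy as np

def merge_saliency(maps):
    """Normalize each rough map to [0, 1], weight it by a quality proxy,
    and blend into a single final saliency map (weights sum to one)."""
    norm_maps, weights = [], []
    for m in maps:
        m = (m - m.min()) / (m.max() - m.min() + 1e-9)
        norm_maps.append(m)
        weights.append(m.var())              # crude separability proxy
    w = np.array(weights) / (np.sum(weights) + 1e-9)
    return sum(wi * m for wi, m in zip(w, norm_maps))

rough = [np.random.rand(64, 64) ** p for p in (1, 2, 4)]  # toy per-scale maps
print(merge_saliency(rough).shape)                        # (64, 64)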
47. Stochastic Analysis of DASH-Based Video Service in High-Speed Railway Networks.
- Author
-
Jiang, Zhongbai, Xu, Changqiao, Guan, Jianfeng, Liu, Yang, and Muntean, Gabriel-Miro
- Abstract
The increasing popularity of high-speed railways (HSR) has stimulated growing demand for wireless Internet services in HSR networks, especially for video streaming. However, due to the high variability and unpredictability of wireless communications in HSR networks, existing solutions still struggle to provide high-quality video streaming services to HSR passengers. This paper addresses this crucial problem, first by reporting on field experiments performed to investigate the characteristics of HSR networks. The paper then formulates an intractable optimization problem for dynamic adaptive streaming over HTTP (DASH)-enabled service in HSR networks, considering factors including packet loss, energy consumption, and video service quality. By leveraging Lyapunov optimization, the formulated problem is transformed into a queue stability problem of high scalability and generality. Moreover, to overcome the intractability of the initial optimization problem, the queue stability problem is further decomposed into three subproblems that can be solved easily and individually. Finally, a novel joint stochastic DASH optimization (JSDO) mechanism, consisting of three algorithms for the derived subproblems, is proposed. Rigorous theoretical analyses and realistic dataset-based simulations demonstrate the effectiveness of the proposed JSDO mechanism. [ABSTRACT FROM AUTHOR] (An illustrative sketch follows this entry.)
- Published
- 2019
- Full Text
- View/download PDF
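The Lyapunov transformation above leads to drift-plus-penalty controllers of roughly the following shape: at each step, pick the bitrate maximizing V times utility minus queue backlog times cost. The single energy queue, the utility and energy models, and the value of V are toy assumptions in this Python sketch, not the JSDO mechanism itself.

def choose_bitrate(levels, energy, quality, Q_energy, V=10.0):
    """One drift-plus-penalty step: a larger virtual-queue backlog Q_energy
    (accumulated past overspending) steers the choice toward cheaper levels."""
    return max(levels, key=lambda r: V * quality[r] - Q_energy * energy[r])

levels = [1.0, 2.5, 5.0, 8.0]                 # Mbps
quality = {r: r ** 0.5 for r in levels}       # concave utility of bitrate
energy = {r: 0.4 * r for r in levels}         # toy per-segment energy cost
for Q in (0.0, 2.0, 8.0):                     # growing energy backlog
    print(Q, choose_bitrate(levels, energy, quality, Q))  # 8.0, 8.0, then 2.5

After each segment, the virtual queue would be updated as Q = max(0, Q + energy_used - energy_budget), so persistent overspending pushes later decisions toward lower bitrates while keeping the long-term energy constraint stable.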
48. A Study of High Frame Rate Video Formats.
- Author
-
Mackin, Alex, Zhang, Fan, and Bull, David R.
- Abstract
High frame rates are acknowledged to increase the perceived quality of certain video content. However, the lack of high frame rate test content has previously restricted the scope of research in this area, especially in the context of immersive video formats. This problem has been addressed through the publication of a high frame rate video database, BVI-HFR, captured natively at 120 fps. BVI-HFR spans a variety of scenes, motions, and colors, and is shown to be representative of BBC broadcast content. In this paper, temporal down-sampling is utilized to enable both subjective and objective comparisons across a range of frame rates. A large-scale subjective experiment demonstrates that high frame rates increase perceived quality and that a degree of content dependence exists, notably related to camera motion. Various image and video quality metrics have been benchmarked against these subjective evaluations, and the analysis shows that metrics which explicitly account for temporal distortions (e.g., FRQM) correlate better with subjective opinions than generic quality metrics such as PSNR. [ABSTRACT FROM AUTHOR] (An illustrative sketch follows this entry.)
- Published
- 2019
- Full Text
- View/download PDF
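Temporal down-sampling as used above can be sketched in a few lines of Python: derive lower frame rates either by dropping frames or by averaging each group of frames, where averaging approximates a longer shutter so motion blur stays consistent across the derived rates. The array layout and frame rates are illustrative.

import numpy as np

def downsample(frames, src_fps=120, dst_fps=30, average=True):
    """frames: (T, H, W) video at src_fps. Returns dst_fps video either by
    keeping the first frame of each group (drop) or averaging the group."""
    assert src_fps % dst_fps == 0, "use integer frame-rate ratios"
    step = src_fps // dst_fps
    T = (len(frames) // step) * step
    groups = frames[:T].reshape(-1, step, *frames.shape[1:])
    return groups.mean(axis=1) if average else groups[:, 0]

video = np.random.rand(120, 32, 32)       # 1 s of toy 120 fps content
print(downsample(video).shape)            # (30, 32, 32): 30 fps version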
49. Energy-Efficient Multipath TCP for Quality-Guaranteed Video Over Heterogeneous Wireless Networks.
- Author
-
Wu, Jiyan, Tan, Rui, and Wang, Ming
- Abstract
Prompted by technological advancements in wireless systems and handheld devices, concurrent multipath transfer is a promising solution for streaming high-quality mobile videos over heterogeneous access media. Multipath TCP (MPTCP) is a transport-layer protocol recommended by the Internet Engineering Task Force (IETF) for concurrent data transmission to multi-radio terminals. However, streaming high-quality real-time video with existing MPTCP solutions remains challenging because of the tradeoff between energy efficiency and video quality. To deliver real-time video in an energy-efficient manner, this paper presents a Delay-Energy-quAlity-aware MPTCP (DEAM) solution. First, an analytical framework is developed to characterize the delay-constrained energy-quality tradeoff for multipath video delivery over heterogeneous access networks. Second, a subflow allocation algorithm is proposed to minimize device energy consumption while achieving the target video quality within the imposed deadline. The performance of DEAM is verified by means of extensive EXata emulations with real-time streaming videos. Evaluation results demonstrate that DEAM achieves appreciable improvements over the reference MPTCP solutions in mobile energy conservation and user-perceived video quality. [ABSTRACT FROM AUTHOR] (An illustrative sketch follows this entry.)
- Published
- 2019
- Full Text
- View/download PDF
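A brute-force Python sketch of the subflow allocation idea: enumerate bitrate splits across the available paths, keep those meeting capacity and deadline constraints, and return the split with minimum energy. The path parameters and grid step are hypothetical, and the paper's algorithm is far more efficient than enumeration.

import itertools

def allocate(paths, demand_mbps, deadline_ms, step=0.5):
    """paths: name -> (capacity_mbps, delay_ms, joules_per_mb)."""
    names = list(paths)
    grid = [k * step for k in range(int(demand_mbps / step) + 1)]
    best, best_energy = None, float("inf")
    for split in itertools.product(grid, repeat=len(names)):
        if abs(sum(split) - demand_mbps) > 1e-9:
            continue                       # must carry the full video bitrate
        feasible = all(
            r <= paths[n][0] and (r == 0 or paths[n][1] <= deadline_ms)
            for n, r in zip(names, split))
        if not feasible:
            continue
        e = sum(r * paths[n][2] for n, r in zip(names, split))
        if e < best_energy:
            best, best_energy = dict(zip(names, split)), e
    return best, best_energy

paths = {"wifi": (6.0, 40.0, 0.5), "lte": (10.0, 60.0, 1.2)}
print(allocate(paths, demand_mbps=8.0, deadline_ms=80.0))
# ({'wifi': 6.0, 'lte': 2.0}, 5.4): fill the cheaper path first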
50. Dual Pursuit for Subspace Learning.
- Author
-
Yi, Shuangyan, Liang, Yingyi, He, Zhenyu, Li, Yi, and Cheung, Yiu-Ming
- Abstract
In general, low-rank representation (LRR) aims to find the lowest-rank representation of data with respect to a dictionary, and the dictionary is a key aspect of low-rank representation. However, many low-rank representation methods use the data itself as the dictionary (i.e., a fixed dictionary), which may degrade their performance because a fixed dictionary lacks clustering ability. To this end, we propose learning a locality-preserving dictionary for low-rank representation instead of using a fixed one. The locality-preserving dictionary is constructed with a graph regularization technique that captures the intrinsic geometric structure of the dictionary, giving it an underlying clustering ability. In this way, the low-rank representation obtained via the locality-preserving dictionary has a better grouping effect. Conversely, a representation with a better grouping effect helps to learn a good dictionary. The locality-preserving dictionary and the grouping-effect representation thus interact with each other, hence the name dual pursuit. The proposed method, Dual Pursuit for Subspace Learning, provides a robust method for simultaneous clustering and classification and compares favorably with other state-of-the-art methods. [ABSTRACT FROM AUTHOR] (A schematic formulation follows this entry.)
- Published
- 2019
- Full Text
- View/download PDF
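Schematically, and only as an assumed form (the paper's exact objective may differ), a locality-preserving-dictionary LRR problem can be written as

\min_{Z, D, E} \; \|Z\|_{*} + \lambda_{1}\,\|E\|_{2,1} + \lambda_{2}\,\operatorname{tr}\!\left(D L D^{\top}\right) \quad \text{s.t.} \quad X = D Z + E,

where \|Z\|_{*} is the nuclear norm encouraging a low-rank (grouping-effect) representation, E absorbs sample-specific errors, and L is a graph Laplacian whose trace term keeps nearby dictionary atoms close, giving D its locality-preserving structure. Alternating minimization over Z and D would realize the dual pursuit: a better dictionary sharpens the grouping effect of Z, and a better-grouped Z in turn guides the dictionary update.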