[Shanghai, China, July 7, 2020] Media services enable us to perceive and connect to the world by recording, processing, and distributing information. With the rapid development of information technology, media networks are evolving iteratively to meet people's growing spiritual and cultural needs. To better understand the technology trends of media networks, it is necessary to study the dynamic relationships between the key components of the system from the perspective of visual perception.
History and Future of Video Media
1.1 Visual Perception System of Human Beings
Humans perceive the world through their sensory reactions to stimuli: by seeing, hearing, smelling, tasting, and touching. Of these, visual perception is the most important channel, through which humans obtain more than 90% of their information. In the structure of the human eye, the retina is the medium by which people perceive light. The cone cells and rod cells on the retina determine the range of visual perception, including spectrum, space, time, and brightness. This perception range, together with human visual characteristics such as light adaptation, persistence of vision, visual concentration, and psychology, constitutes the human visual perception system.
1.2 History and Future
Video media has developed on the basis of the human visual perception system. The invention of television made it possible to transmit moving images remotely by means of electrical signals, turning sequences of static images into moving TV pictures. TV video technology has evolved from black and white to color, from analog to digital, from standard definition to high definition, and from interlaced to progressive scanning. Improvements in resolution, frame rate, color gamut, and field of view (FOV) continue to enhance the quality of the visuals we perceive.
The advent of UHD video technology has taken visual perception to a new level. The major UHD specifications (ITU-R BT.2020, “Parameter values for ultra-high definition television systems for production and international programme exchange”, and GY/T 307-2017, “Ultra-high definition television system program production and exchange of parameter values”) have expanded the description of video quality into a multi-dimensional indicator system. The system specifies not only spatial features (effective pixels: 7,680 x 4,320 and 3,840 x 2,160; aspect ratio: 16:9), but also a wide array of other indicators, such as temporal features (frame rate: 24/1.001 to 120 Hz; progressive scanning), color gamut coverage (coverage by the three primary colors: 57.3%), and quantization bit depth (10 bits and 12 bits). In home, mobile, and handheld application scenarios, when an 8K UHD video is delivered at 7,680 x 4,320 and 100/120 fps with HDR, WCG, and 10-bit depth, the individual pixels of static images and the stutter and judder of moving images are no longer visible. In addition, users can perceive a much fuller range of brightness and color. This means that visual perception of 8K UHD video approximates reality.
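The indicator dimensions listed above can be read as a simple parameter check. The following Python sketch is illustrative only: the class `VideoFormat` and the function `uhd_violations` are hypothetical names, and the allowed values are taken from the figures quoted in this section rather than from a full reading of either standard. In particular, the frame rate is treated as a continuous range, which is a simplification.

```python
from dataclasses import dataclass
from typing import List

# Values quoted in the text above; not a complete transcription of the standards.
ALLOWED_RESOLUTIONS = {(7680, 4320), (3840, 2160)}
ALLOWED_BIT_DEPTHS = {10, 12}
MIN_FRAME_RATE_HZ = 24 / 1.001
MAX_FRAME_RATE_HZ = 120.0


@dataclass(frozen=True)
class VideoFormat:
    """Hypothetical container for the indicators discussed in this section."""
    width: int
    height: int
    frame_rate_hz: float
    bit_depth: int
    progressive: bool


def uhd_violations(fmt: VideoFormat) -> List[str]:
    """Return the names of the indicators that fall outside the quoted values."""
    problems = []
    if (fmt.width, fmt.height) not in ALLOWED_RESOLUTIONS:
        problems.append("resolution")
    # 16:9 check done with integers to avoid floating-point comparison.
    if fmt.width * 9 != fmt.height * 16:
        problems.append("aspect ratio")
    # Simplification: any rate inside the quoted range is accepted.
    if not (MIN_FRAME_RATE_HZ <= fmt.frame_rate_hz <= MAX_FRAME_RATE_HZ):
        problems.append("frame rate")
    if fmt.bit_depth not in ALLOWED_BIT_DEPTHS:
        problems.append("bit depth")
    if not fmt.progressive:
        problems.append("scanning")
    return problems


if __name__ == "__main__":
    eight_k = VideoFormat(7680, 4320, 120.0, 10, True)
    full_hd = VideoFormat(1920, 1080, 60.0, 8, False)
    print("8K@120:", uhd_violations(eight_k))
    print("Full HD:", uhd_violations(full_hd))
```

Returning a list of failed indicators, rather than a single pass/fail flag, mirrors the multi-dimensional nature of the specification: a format can meet the spatial requirements while still falling short on bit depth or scanning mode.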
Once two-dimensional visual perception reaches the same level as reality, users will place ever higher demands on interaction, real-time performance, and flexibility. These demands will drive video media technologies toward visual communications and interactive media that enhance user engagement. In the post-UHD era, the technical evaluation indicators of media content will leap from two-dimensional to three-dimensional; from a single viewpoint to multiple viewpoints; from zero and three degrees of freedom to six degrees of freedom; from three-dimensional modeling to real-time rendering; from high latency to ultra-low latency; and from a combination of virtual and real to digital twins. These next-generation intelligent media network technologies will reshape application scenarios and business models, and create media services with high user stickiness. Figure 1 shows how a media service senses and connects to the real world.
Definitions and Key Technologies of the MC2 Intelligent Media Network
2.1 Definitions and Relations
The exponential growth of media content in both quality and quantity is driving a boom in communication networks and media computing. It also gives rise to the “E = MC2” intelligent media network model in a broad sense.
In the E = MC2 model, “E” refers to efficiency, which includes the quality of the user's visual experience, the service distribution rate, and the latency in obtaining content in different application scenarios. “M” refers to media, indicating the change in data load caused by increases in the quality and quantity of content. As for C2, the first “C” refers to communication, representing the media distribution architecture formed when different nodes are organically connected by protocols. The second “C” refers to computing, indicating the capability achieved by using various high-performance intelligent algorithms and engines to analyze, understand, and process media data.
Improving visual perception quality inevitably increases the data load, which puts great pressure on network bandwidth. To improve the network distribution rate and reduce service latency, a new communication network architecture and distribution mechanism are required to sense, understand, and predict media content, network modes, and user behaviors, and to aggregate and distribute content in a way that relieves the bandwidth pressure. A high-performance, low-power, and highly integrated media computing platform is also required to implement various media application scenarios anytime, anywhere, and on any terminal. Together, these requirements form the “E = MC2” intelligent media network model, as shown in Figure 2.
2.2 Key Technologies
The key to improving media service efficiency in “E = MC2” lies in achieving high fidelity, high throughput, and low latency.
2.2.1 High-Fidelity Media Production
Producing media with high fidelity requires an evaluation system that covers both objective and subjective aspects. Objective evaluation focuses on hardware test data such as luminance, contrast, display uniformity, dynamic definition, and static definition. Subjective evaluation covers a range of perception indicators such as skin color display, motion estimation/motion compensation (MEMC), area-based light control, high dynamic range (HDR), and color gamut coverage of the three primary colors. Combining objective and subjective evaluations makes it possible to effectively improve the performance of content production and display devices. High-fidelity media production requires consistency control throughout the entire process, including capture, storage, editing, color tuning, encoding, and output. It also requires ongoing compatibility with video encoding standards in multiple formats, such as AVC, HEVC, AVS/AVS+/AVS2, VP9, and AV1, as well as the next-generation standards AVS3, VVC, EVC, and LCEVC. AVS3 further improves encoding efficiency through architectural design elements such as basic block division, inter-/intra-frame encoding, and filters, which makes it one of the main technical solutions for future 8K UHD video encoding. Moreover, media production based on a new AI real-time rendering engine restructures the production process into a new mode of media generation, which leverages machine learning to detect, analyze, understand, classify, and predict media and user behavior in real time.
2.2.2 Cloud-Edge-Device Network Architecture
High-fidelity media production inevitably generates massive amounts of data. The bit rate of an uncompressed 8K@120 fps UHD video exceeds 40 Gbit/s, and it still reaches 200 Mbit/s after compression. 3D point cloud encoding involves a large amount of differentiated, sparse, irregular, and random feature data in 3D space. This exponential increase in data consumes a large amount of network bandwidth and media computing resources, which undermines media service efficiency. To boost network service efficiency and distribution performance, a cloud-edge-device network architecture is needed. The architecture breaks the complex media processing on the source side down into simple subunits, aggregates and classifies these subunits, dynamically allocates them to the edge, and then uses the computing capability of the edge to process media and user behavior. This edge processing capability significantly raises the peak network service rate, reduces forwarding route latency, shortens the computing response time, reduces the computing power consumption and storage requirements of terminals, and relieves the computing pressure on both the cloud and the device. In this way, the intelligent media network can provide high-throughput, low-latency services, ensuring an optimal user experience, high-fidelity distribution of mass media, and transmission with millisecond-level encoding latency. Moreover, the new media network can use network awareness to implement hierarchical encoding and parallel encoding, further enhancing service efficiency.
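The uncompressed bit rate quoted above can be checked with simple arithmetic. The sketch below is a back-of-the-envelope calculation, not a statement about any particular production pipeline: the function name `uncompressed_bit_rate` is hypothetical, the chroma subsampling schemes are assumptions (the text does not say which one is used), and blanking intervals and container overhead are ignored.

```python
def uncompressed_bit_rate(width, height, fps, bit_depth, samples_per_pixel):
    """Raw video bit rate in bit/s, ignoring blanking and container overhead."""
    return width * height * fps * bit_depth * samples_per_pixel


# Average number of samples per pixel for common chroma subsampling schemes.
# Which scheme applies is an assumption; the text does not specify it.
SAMPLES_PER_PIXEL = {"4:2:0": 1.5, "4:2:2": 2.0, "4:4:4": 3.0}

# Compressed bit rate quoted in the text.
COMPRESSED_BIT_RATE = 200e6

if __name__ == "__main__":
    for scheme, spp in SAMPLES_PER_PIXEL.items():
        raw = uncompressed_bit_rate(7680, 4320, 120, 10, spp)
        ratio = raw / COMPRESSED_BIT_RATE
        print(f"{scheme}: {raw / 1e9:.1f} Gbit/s raw, "
              f"about {ratio:.0f}:1 to reach 200 Mbit/s")
```

Under these assumptions, 10-bit 4:2:0 video at 7,680 x 4,320 and 120 fps comes to roughly 59.7 Gbit/s, which is consistent with the "exceeds 40 Gbit/s" figure, and reaching 200 Mbit/s implies a compression ratio on the order of 300:1.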
2.2.3 High-Performance, All-Scenario Local Intelligent Computing
The construction of application scenarios depends on local media computing and processing on the user side. This requires four capabilities: computing and expression, video decoding, computing and processing, and perceptual computing. Computing and expression controls the pixels of each video, while the AI picture quality (PQ) curve improves the contrast and brightness of the images. Video decoding implements the mainstream international and domestic high-efficiency video coding standards. Computing and processing optimizes the luminance, contrast, saturation, sharpness, and noise levels of decoded YUV data. It implements parameter optimization and dynamic compensation for the PQ, and provides massive NPU computing power to support application scenarios such as AI facial recognition and scene detection. Perceptual computing will become a key hub that connects the physical and intelligent worlds, expanding connections between users and smart homes, cities, and automobiles. These capabilities are implemented by the multi-core CPU, GPU, AI NPU, high-performance multi-format decoding engine, and AI PQ video and image processing core. Together, they form a chipset solution with high performance, high integration, and low power consumption, providing powerful computing support for the intelligent media network system to improve service efficiency and create flexible application scenarios.
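To make the idea of adjusting the luminance and contrast of decoded YUV data more concrete, the sketch below applies a textbook linear gain and offset to a luma (Y) plane. It is an illustration only and is not the AI PQ processing described above: the function name `adjust_luma`, the mid-gray pivot of 512, and the use of full-range 10-bit code values are all assumptions made for the example.

```python
from typing import List

MAX_CODE_10BIT = 1023  # assumes full-range 10-bit samples


def adjust_luma(y_plane: List[List[int]], contrast: float = 1.0,
                brightness: int = 0, pivot: int = 512) -> List[List[int]]:
    """Apply a linear contrast gain around `pivot` plus a brightness offset.

    Textbook gain/offset adjustment for illustration; real picture-quality
    engines use far more elaborate, content-adaptive processing.
    """
    adjusted = []
    for row in y_plane:
        new_row = []
        for y in row:
            value = round((y - pivot) * contrast + pivot + brightness)
            # Clamp to the valid 10-bit code range.
            new_row.append(max(0, min(MAX_CODE_10BIT, value)))
        adjusted.append(new_row)
    return adjusted


if __name__ == "__main__":
    tiny_plane = [[0, 256, 512], [768, 1000, 1023]]
    print(adjust_luma(tiny_plane, contrast=1.2, brightness=16))
```

Scaling around a pivot rather than around zero keeps mid-tones roughly in place while stretching highlights and shadows apart, and the clamp prevents overflow beyond the valid code range.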
3.1 8K@120 fps UHD Videos
In Japan, 8K broadcast signals are already transmitted via satellite. In China, TV stations and production organizations represented by the China Media Group have conducted 8K UHD video production and distribution tests. Organizations represented by the National Engineering Research Center of Digital Television (NERC-DTV) have also completed the production of the first 8K@120 fps UHD demonstration film in China. The video clips support 7,680 x 4,320, 120 fps, 10-bit, BT.2020 color gamut, and HDR specifications. Different features, such as high-speed motion, comparison of skin and building textures, and HDR detail presentation, are shown in Figures 3 to 8. In September 2019, at the International Broadcasting Convention (IBC) in Amsterdam, the Netherlands, HiSilicon (Shanghai) released the Hi3796C V300, the world's first SoC that supports 8K@120 fps. The company demonstrated real-time decoding of 8K@120 fps streams, showing how smooth video can be at ultra-high definition and a high frame rate. The Audio and Video Coding Standard Workgroup of China has also completed its research into the AVS3 standard. These breakthroughs indicate that China has taken a global lead in the field of high-quality 8K technology.
3.2 8K+AI+AR Converged Network
The high-performance media computing platform combines 8K decoding, AI computing, and AR presentation to form various UHD intelligent augmented reality (AR) application modes. In home entertainment, users can watch sports games from multiple perspectives simultaneously and enjoy multiple services at the same time. They can also play multi-player motion-sensing games with people who are thousands of miles away. Furthermore, users can use AI cameras to capture UHD video and perform intelligent motion capture, skeleton-based recognition, motion analysis, and posture correction, which enables brand-new home fitness services, as shown in Figures 9 and 10. When traveling, users can share panoramic or multi-view videos in real time through network-connected cloud AI servers.
3.3 Visual Interactive Media Communication Based on Point Clouds
A new generation of visual interactive media based on point cloud encoding will be an important application of the media interaction experience in the future. Multiple sensors can be used to obtain information such as the color, depth, and texture of an object. 3D point cloud encoding is then used to render a geometric structure that carries the attributes and features of the object and the scene, so as to construct a realistic space that combines the virtual and the real, as shown in Figure 11. Real-time remote interaction with objects will be further realized with features including multiple degrees of freedom, multi-view perception, high spatial and temporal resolution, and low visual communication latency. These new visual interactive media can be widely used in intelligent media network application scenarios such as stereoscopic conference systems, roundtable forums, live broadcasts of sports matches, distance education, and medical diagnosis, as shown in Figure 12. With the help of the AI-based real-time rendering engine, deep learning algorithms, and massive cloud-edge-device services, endless visual applications and information interaction scenarios will become available.
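As a rough illustration of what point cloud data looks like, the sketch below stores each sample as a position plus a color and then quantizes the positions onto a voxel grid, averaging the colors that fall into the same voxel. Voxel quantization is one common preprocessing step in point cloud compression; it is shown here only as an example and is not the specific encoding scheme this section refers to. The names `Point` and `voxelize`, the metre units, and the 5 cm voxel size are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple

VoxelKey = Tuple[int, int, int]
Colour = Tuple[int, int, int]


@dataclass(frozen=True)
class Point:
    """Hypothetical point sample: position (assumed metres) plus an RGB colour."""
    x: float
    y: float
    z: float
    rgb: Colour


def voxelize(points: List[Point], voxel_size: float) -> Dict[VoxelKey, Colour]:
    """Quantize positions to a voxel grid and average the colours per voxel."""
    buckets = defaultdict(list)
    for p in points:
        key = (int(p.x // voxel_size),
               int(p.y // voxel_size),
               int(p.z // voxel_size))
        buckets[key].append(p.rgb)
    voxels = {}
    for key, colours in buckets.items():
        n = len(colours)
        # Integer average of each colour channel.
        voxels[key] = tuple(sum(c[i] for c in colours) // n for i in range(3))
    return voxels


if __name__ == "__main__":
    cloud = [
        Point(0.01, 0.02, 0.03, (255, 0, 0)),
        Point(0.04, 0.01, 0.02, (0, 0, 255)),
        Point(0.12, 0.00, 0.00, (0, 255, 0)),
    ]
    for key, colour in sorted(voxelize(cloud, voxel_size=0.05).items()):
        print(key, colour)
```

Merging nearby points in this way trades spatial detail for a smaller, more regular data set, which is one reason the sparse and irregular nature of point clouds makes them demanding to encode and transmit.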
The “E = MC2” intelligent media network architecture will drive continuous breakthroughs in, and mutual reinforcement among, the key technologies of media, communication networks, and media computing. This in turn will dramatically improve the service efficiency of media networks, and all kinds of emerging application scenarios will continue to flourish. The intelligent media network era is still in its early stages, but its potential is considerable. In addition to the broadcast and TV industry, more industries will benefit from the development of intelligent media network technologies in the future. The perception and connection capabilities of media services will drive various industries to embrace a new phase of development. In the future, intelligent media networks will not only lead technological innovation in various fields, but also assume more social responsibilities to create better life experiences and foster social development.