This is a guest post written by the ByteDance team.
ByteDance is a technology company that operates a range of content platforms to inform, educate, entertain, and inspire people across languages, cultures, and regions. Users trust and enjoy these content platforms because of the rich, intuitive, and safe experiences they offer. Those experiences are made possible by our machine learning (ML) backend engine, which builds ML models for video understanding, search, recommendations, advertising, and novel visual effects.
To support our mission of “inspiring creativity and enriching life,” we make it easy and enjoyable for people to engage with, create, and consume content. Users can also discover and transact across dozens of products and services, including CapCut, E-Shop, Lark, Pico, and Mobile Legends: Bang Bang.
ByteDance has worked with Amazon Web Services (AWS) to deploy multimodal large language models (LLMs) for video understanding on AWS Inferentia2 in multiple AWS Regions around the world. Using sophisticated ML algorithms, the platform efficiently scans billions of videos every day. This process identifies and flags content that violates community guidelines, enabling a better experience for all users. By running these video understanding workloads on Amazon EC2 Inf2 instances, we were able to reduce inference costs by half.
This post covers our use of multimodal LLMs for video understanding, the solution architecture, and techniques for performance optimization.
Overcoming the hurdles of video understanding with multimodal LLMs
Multimodal LLMs provide a richer understanding of the world by accepting diverse forms of digital content as input, greatly expanding the range of useful applications that can be built today. The need for AI systems that can handle a variety of content formats is becoming increasingly apparent. Multimodal LLMs address this challenge by ingesting multiple data modalities, such as text, images, audio, and video (see the following diagram). This allows them to understand content more completely and to mimic human perception of and interaction with the world. The capabilities of these models far exceed those of traditional models on tasks ranging from sophisticated virtual assistants to advanced content creation. By expanding the boundaries of AI capabilities and paving the way for more natural and intuitive interactions with technology, these models not only improve existing applications but also open the door to entirely new possibilities in AI and user experience.
In our operations, adopting multimodal LLMs for video understanding represents a major shift in how we think about AI-driven content analysis. It addresses the daily challenge of processing billions of videos and overcomes the efficiency limits of traditional AI models. We developed our own multimodal LLM architecture, designed to deliver state-of-the-art performance on single-image, multi-image, and video tasks. Unlike traditional ML models, this new generation of AI systems integrates multiple input streams into a unified representation space. Cross-modal attention mechanisms let information flow between modalities, and a fusion layer combines the representations of the different modalities. A decoder then generates output from the fused multimodal representation, allowing for more nuanced, context-aware analysis of the content.
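To make the fusion step concrete, the following is a minimal PyTorch sketch of a cross-modal attention and fusion layer. It is an illustrative assumption rather than our production architecture: the module name, hidden size, and toy tensor shapes are invented for the example.

```python
# Minimal sketch of cross-modal attention plus a fusion layer (illustrative only;
# CrossModalFusion and all dimensions are assumptions, not the production model).
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Fuse text and vision token sequences into a shared representation."""

    def __init__(self, d_model: int = 1024, n_heads: int = 8):
        super().__init__()
        # Text tokens attend to vision tokens (queries = text, keys/values = vision).
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        # Fusion layer: concatenate attended features with the original text and project.
        self.fuse = nn.Linear(2 * d_model, d_model)

    def forward(self, text_tokens: torch.Tensor, vision_tokens: torch.Tensor) -> torch.Tensor:
        attended, _ = self.cross_attn(query=text_tokens, key=vision_tokens, value=vision_tokens)
        fused = self.fuse(torch.cat([text_tokens, self.norm(attended)], dim=-1))
        return fused  # (batch, text_len, d_model), fed to the language decoder

# Toy usage: 2 videos, 16 text tokens, 256 vision patch tokens, hidden size 1024.
fusion = CrossModalFusion()
text = torch.randn(2, 16, 1024)
vision = torch.randn(2, 256, 1024)
print(fusion(text, vision).shape)  # torch.Size([2, 16, 1024])
```

In this sketch the decoder would consume the fused sequence; in practice the fusion design, number of layers, and attention pattern depend on the specific model.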
Solution overview
We have been working with AWS since its first-generation AWS Inferentia chips. Our video understanding team has been committed to finding more cost-effective solutions that deliver better performance to keep up with ever-growing business needs. During this period, we found that AWS continually invents and adds features and functionality to the AWS Neuron software development kit (SDK). Popular models such as Meta Llama and Mistral were supported with high performance on Inferentia soon after their open source release. Therefore, we began to evaluate the Inferentia-based solution shown in the following diagram.
We made the strategic decision to deploy a fine-tuned, medium-sized LLM on Inferentia2, providing a performant and cost-effective solution that can process billions of videos every day. This was a comprehensive effort aimed at optimizing end-to-end response time for video understanding. The team explored a wide range of parameters, including tensor parallel size, compilation configuration, sequence length, and batch size. We employed various parallelization techniques, including multithreading and model replication (for the non-LLM models) across multiple NeuronCores. Through these optimizations, including parallelizing sequential steps, reusing devices, and using automated benchmarking and profiling tools, we achieved significant performance gains and maintained our position at the forefront of industry performance standards.
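As an illustration of how such a model can be compiled and deployed across NeuronCores, the following sketch uses the transformers-neuronx library from the AWS Neuron SDK with a Llama-style decoder. The checkpoint path, tensor parallel degree, batch size, and sequence lengths are placeholder assumptions, not our production settings.

```python
# Hedged sketch: compiling a Llama-style decoder for Inf2 with transformers-neuronx.
# The checkpoint path and all numeric settings below are illustrative assumptions.
import torch
from transformers_neuronx.llama.model import LlamaForSampling

model = LlamaForSampling.from_pretrained(
    "/opt/models/video-llm",   # hypothetical local checkpoint path
    batch_size=4,              # static batch size chosen during benchmarking
    tp_degree=8,               # tensor parallelism across 8 NeuronCores
    n_positions=2048,          # maximum sequence length to compile for
    amp="bf16",                # BF16 weights and activations
)
model.to_neuron()              # compiles the model for the NeuronCores

# Run generation on device (token IDs would normally come from the tokenizer).
input_ids = torch.randint(0, 32000, (4, 128))
output_ids = model.sample(input_ids, sequence_length=512)
```

Sweeping parameters such as tp_degree, batch_size, and n_positions, then benchmarking each compiled variant, is how the latency and throughput trade-offs described above were explored.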
Tensor parallelism was used to distribute and scale the model efficiently across multiple accelerators in an Inf2 instance. We used static batching, which improved model latency and throughput by ensuring that data is processed in uniform, fixed-size batches during inference. Repeated n-gram filtering significantly improved the quality of the generated text and reduced inference time (see the sketch that follows). By quantizing the weights of the multimodal model from FP16/BF16 to INT8, we made inference more efficient, reducing device memory usage without compromising accuracy. Using these techniques together with model serialization, we optimized the throughput of our inf2.48xlarge instances by maximizing the batch size while still fitting each model on a single accelerator, so that multiple model replicas could be deployed on the same instance. This comprehensive optimization strategy helped us meet latency requirements while providing optimal throughput and cost savings. In particular, the Inferentia-based solution delivered important economic benefits for large-scale video understanding, cutting costs in half compared to comparable Amazon Elastic Compute Cloud (Amazon EC2) instances.
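The following is a minimal sketch of the repeated n-gram filtering idea: during decoding, any token that would recreate an n-gram already present in the generated sequence is masked out of the next-token distribution. The function name and toy values are assumptions for illustration only, not our production decoding code.

```python
# Minimal sketch of repeated n-gram filtering during decoding (illustrative only).
import torch

def ban_repeated_ngrams(logits: torch.Tensor, generated: list, n: int = 3) -> torch.Tensor:
    """Set logits of tokens that would complete an already-seen n-gram to -inf."""
    if len(generated) < n - 1:
        return logits
    prefix = tuple(generated[-(n - 1):])          # last n-1 generated tokens
    banned = set()
    for i in range(len(generated) - n + 1):       # scan n-grams already emitted
        if tuple(generated[i:i + n - 1]) == prefix:
            banned.add(generated[i + n - 1])      # this token would repeat an n-gram
    for token_id in banned:
        logits[token_id] = float("-inf")
    return logits

# Toy usage: vocabulary of 10 tokens; the sequence already contains the 3-gram (1, 2, 3).
logits = torch.zeros(10)
filtered = ban_repeated_ngrams(logits, generated=[1, 2, 3, 4, 1, 2], n=3)
print(filtered[3])  # -inf: emitting token 3 would repeat the 3-gram (1, 2, 3)
```

Besides improving text quality, blocking degenerate repetition tends to shorten generations, which is one reason it also reduced inference time in our workloads.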
The following image shows how an LLM container is deployed to an Amazon EC2 Inf2 instance using the AWS Neuron SDK.
In summary, our collaboration with AWS has revolutionized video understanding, setting new industry standards for efficiency and accuracy. The ability of multimodal LLMs to adapt to global market demands, combined with their scalable performance on Inferentia2 chips, highlights the profound impact of this technology in protecting the platform’s global community.
Future plans
Looking ahead, the development of a unified multimodal LLM represents a significant change in video understanding technology. This ambitious project aims to create a universal content tokenizer that can handle all content types and place them within a common semantic space. After tokenization, the content is analyzed by a sophisticated large model that produces output reflecting an understanding of the content regardless of its original format (as shown in the following image). This unified approach can streamline the content understanding process and improve both efficiency and consistency across diverse content types.
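As a conceptual sketch of this direction (an assumption for illustration, not our final design), the snippet below shows how modality-specific encoders could map text, image, and audio inputs into one shared semantic space whose tokens can be concatenated into a single sequence for a downstream large model.

```python
# Conceptual sketch of a universal content tokenizer (illustrative assumption only):
# each modality has its own encoder, but all encoders emit tokens in one shared space.
import torch
import torch.nn as nn

class UniversalTokenizer(nn.Module):
    def __init__(self, d_model: int = 1024):
        super().__init__()
        self.encoders = nn.ModuleDict({
            "text": nn.Embedding(32000, d_model),   # token IDs -> shared space
            "image": nn.Linear(768, d_model),       # patch features -> shared space
            "audio": nn.Linear(128, d_model),       # spectrogram frames -> shared space
        })

    def forward(self, modality: str, inputs: torch.Tensor) -> torch.Tensor:
        return self.encoders[modality](inputs)      # (batch, seq, d_model)

tokenizer = UniversalTokenizer()
text_tokens = tokenizer("text", torch.randint(0, 32000, (1, 16)))
image_tokens = tokenizer("image", torch.randn(1, 256, 768))
# Tokens from different modalities now live in the same space and can be concatenated
# into a single sequence for the downstream large model.
unified = torch.cat([text_tokens, image_tokens], dim=1)
print(unified.shape)  # torch.Size([1, 272, 1024])
```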
For additional background, see the paper The Evolution of Multimodal Model Architectures.
Implementing this comprehensive strategy sets new benchmarks for video understanding technology, balancing accuracy, speed, and cultural sensitivity in an increasingly complex digital ecosystem. This forward-looking approach not only addresses current challenges in video understanding, but also positions the system at the forefront of AI-driven content analysis and management for the near future.
By using cutting-edge AI techniques and a holistic approach to content understanding, this next-generation system aims to provide a safer and more inclusive online environment while setting new industry standards and adapting to the ever-evolving landscape of digital communication. At the same time, AWS is investing in next-generation AI chips such as AWS Trainium2, which continue to push performance boundaries while keeping costs down. ByteDance plans to test this new generation of AWS AI chips and adopt them where appropriate as models and workloads continue to evolve.
Conclusion
The collaboration between ByteDance and AWS has revolutionized video understanding through the deployment of multimodal LLMs on Inferentia2 chips. This partnership has achieved notable results: the ability to process billions of videos every day, along with significant cost savings and higher performance compared to comparable EC2 instances.
As we continue to innovate with projects such as the unified multimodal large model, we remain committed to pushing the boundaries of AI-driven content analysis. Our goal is to ensure that our platform remains a safe, inclusive, and creative space for our global community while setting new industry standards for efficient video understanding.
For more information about Inf2 instances, see Amazon EC2 Inf2 Architecture.
About the authors
Wangpeng An is a lead algorithm engineer at TikTok, specializing in multimodal LLMs for video understanding, advertising, and recommendations. He leads major projects in model acceleration, content moderation, and the ads LLM pipeline, strengthening TikTok’s real-time machine learning systems.
Haotian Zhang is a technical lead MLE at TikTok, specializing in content understanding, search, and recommendations. He received his PhD in ML from the University of Waterloo. At TikTok, he leads a group of engineers to improve the efficiency, robustness, and effectiveness of training and inference for LLMs and multimodal LLMs, especially in large distributed ML systems.
Xiaojie Ding is a senior engineer at TikTok who focuses on content moderation system development, model resource and deployment optimization, and the engineering stability of algorithm systems. In his free time, he likes to play single-player games.
Natuan Yang is a senior engineer at TikTok who focuses on content security and moderation. He works on building moderation systems, model applications, and ongoing deployment and performance optimization.
Chiron Sun is a senior SRE on the ByteDance AML team. His role focuses on maintaining seamless operation and efficient allocation of resources within the cluster, specializing in cluster machine maintenance and resource optimization.
The authors would like to thank ByteDance team members Xi Dai, Kaili Zhao, Zhixin Zhang, Jin Ye, and Yann Xia, and AWS team members Jia Dong, Bingyang Huang, Kamran Khan, Shruti Koparkar, and Diwakar Bansal for their contributions.