Since launching in 2018, Amazon’s Just Walk Out technology has transformed the shopping experience by allowing customers to enter a store, pick up items, and leave without waiting in line to pay. This checkout-free technology can be found in more than 180 third-party locations around the world, including travel retailers, sports stadiums, entertainment venues, conference centers, theme parks, convenience stores, hospitals, and university campuses. The Just Walk Out technology’s end-to-end system automatically determines which items each customer selects in-store and provides a digital receipt, eliminating the need for checkout lines.
In this post, we introduce the latest generation of Amazon’s Just Walk Out technology, powered by a multimodal foundation model (FM). This multimodal FM was designed for brick-and-mortar stores and uses a Transformer-based architecture similar to the one that underpins many generative artificial intelligence (AI) applications. The model helps retailers generate highly accurate shopping receipts using data from multiple inputs, including networks of overhead video cameras, specialized weight sensors on shelves, digital floor plans, and catalog images of products. Simply put, a multimodal model is one that learns from several types of input data.
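As a rough illustration of what “multimodal” means in practice, the following PyTorch sketch projects several input modalities into a shared token space that a Transformer can consume. Every module name, feature dimension, and shape here is an illustrative assumption, not the production Just Walk Out architecture.

```python
import torch
import torch.nn as nn

class MultimodalEncoder(nn.Module):
    """Toy sketch: map each input modality into a shared d_model token
    space so one Transformer can attend over all of them together.
    Dimensions and modules are illustrative assumptions only."""

    def __init__(self, d_model: int = 512):
        super().__init__()
        self.video_proj = nn.Linear(2048, d_model)   # overhead-camera clip features
        self.weight_proj = nn.Linear(16, d_model)    # shelf weight-sensor readings
        self.layout_proj = nn.Linear(128, d_model)   # digital floor-plan features
        self.catalog_proj = nn.Linear(768, d_model)  # product catalog image embeddings

    def forward(self, video, weights, layout, catalog):
        # Each modality becomes a token sequence; concatenating them
        # yields a single multimodal token stream.
        return torch.cat([
            self.video_proj(video),
            self.weight_proj(weights),
            self.layout_proj(layout),
            self.catalog_proj(catalog),
        ], dim=1)  # (batch, total_tokens, d_model)

encoder = MultimodalEncoder()
tokens = encoder(torch.randn(1, 16, 2048), torch.randn(1, 4, 16),
                 torch.randn(1, 2, 128), torch.randn(1, 8, 768))
print(tokens.shape)  # torch.Size([1, 30, 512])
```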
Our research and development (R&D) investments in cutting-edge multimodal FMs enable us to deploy the Just Walk Out system in a wide range of shopping situations with greater accuracy and at lower cost. Much as large language models (LLMs) generate text, the new Just Walk Out system is designed to generate an accurate sales receipt for every shopper who visits a store.
Challenge: Tackling complex long-tail shopping scenarios
Just Walk Out stores present a unique technical challenge because they are innovative, checkout-free environments. Retailers, shoppers, and Amazon demand near-100 percent checkout accuracy in even the most complex shopping situations, including unusual shopping behaviors that produce long, complicated sequences of activity and require extra effort to determine what happened.
Previous generations of the Just Walk Out system used a modular architecture that addressed complex shopping situations by breaking down a shopper visit into separate tasks such as detecting shopper interactions, tracking items, identifying products, and counting what was selected. These individual components were then integrated sequentially into a pipeline to enable system-wide functionality. While this approach produced highly accurate receipts, it required significant engineering effort to handle new, never-before-encountered situations and complex shopping scenarios, which limited its scalability.
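To make the contrast concrete, here is a toy Python sketch of a modular pipeline of the kind described above. The stage functions are trivial stand-ins invented for illustration, not our production components; the point is that each stage’s output feeds the next, so a new edge case can require coordinated changes across several stages.

```python
from typing import Dict, List

# Trivial stand-ins for the four pipeline stages (illustration only).
def detect_interactions(frames: List[str]) -> List[str]:
    return [f"interaction({f})" for f in frames]

def track_items(interactions: List[str]) -> List[str]:
    return [f"track({i})" for i in interactions]

def identify_products(tracks: List[str]) -> List[str]:
    return [f"sku({t})" for t in tracks]

def count_selections(products: List[str]) -> Dict[str, int]:
    counts: Dict[str, int] = {}
    for p in products:
        counts[p] = counts.get(p, 0) + 1
    return counts

def modular_receipt(frames: List[str]) -> Dict[str, int]:
    # Sequential hand-offs: an unusual behavior that confuses stage 1
    # degrades every stage after it.
    return count_selections(identify_products(track_items(detect_interactions(frames))))

print(modular_receipt(["frame_0", "frame_1"]))
```

An end-to-end FM replaces this hand-engineered chain with a single learned mapping from raw multimodal inputs to a receipt.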
Solution: Just Walk Out Multimodal AI
To address these challenges, we introduced a new multimodal FM designed specifically for retail environments, enabling Just Walk Out technology to handle complex real-world shopping scenarios. The new multimodal FM further enhances the capabilities of the Just Walk Out system by generalizing more effectively to new store formats, products, and customer behaviors, which is essential for scaling the technology.
Incorporating continuous learning allows model training to automatically adapt and learn from new and challenging scenarios as they arise. This self-improvement capability ensures that the system maintains high performance even as the shopping environment continues to evolve.
The combination of end-to-end learning and enhanced generalization enables the Just Walk Out system to scale to a wider range of dynamic and complex retail environments. Retailers can deploy this technology with confidence, knowing it will provide their customers with a frictionless checkout experience.
The following video shows the architecture of our system in action.
Key elements of the Just Walk Out multimodal AI model include:
- Flexible data inputs – The system tracks how shoppers interact with products and fixtures such as shelves and refrigerators. It primarily uses multi-view video feeds as input, with weight sensors used only to track smaller items. The model maintains a digital 3D representation of the store and can consult catalog images to identify products, even when a shopper returns an item to the wrong spot on a shelf.
- Multimodal AI tokens that represent shopper behavior – Multimodal data input is processed by an encoder and compressed into Transformer tokens, the basic unit of input for the receipt model, allowing the model to interpret hand movements, distinguish items, and quickly and accurately count the number of items picked up or put back on the shelf.
- Continuously updated receipts – The system uses these tokens to create a digital receipt for each shopper, distinguishing concurrent shopping sessions and dynamically updating each receipt as items are picked up or returned; a toy sketch of this receipt-prediction step follows this list.
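The following is a minimal sketch of how a Transformer over session tokens might emit receipt updates, assuming a hypothetical head that predicts a product and a signed quantity change (for example, +1 for a take, -1 for a return). None of the layer sizes or heads reflect the production design.

```python
import torch
import torch.nn as nn

class ReceiptDecoder(nn.Module):
    """Toy sketch: attend over a shopper session's multimodal tokens and
    predict a receipt update as (product logits, signed count change).
    Architecture details are assumptions, not the production model."""

    def __init__(self, d_model: int = 512, num_products: int = 10_000):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=6)
        self.product_head = nn.Linear(d_model, num_products)  # which item
        self.delta_head = nn.Linear(d_model, 1)               # +1 take, -1 put back

    def forward(self, session_tokens):
        pooled = self.backbone(session_tokens).mean(dim=1)  # pool the session
        return self.product_head(pooled), self.delta_head(pooled)

decoder = ReceiptDecoder()
product_logits, delta = decoder(torch.randn(1, 30, 512))
```

Each time new tokens arrive for a session, the receipt can be re-predicted, which is what allows it to update dynamically as the shopper picks up or returns items.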
Just Walk Out FM Training
We found that by feeding the Just Walk Out FM large amounts of multimodal data, it can consistently generate (technically, “predict”) accurate receipts for shoppers. To improve accuracy, we designed over 10 auxiliary tasks, including detection, tracking, image segmentation, grounding (linking abstract concepts to real-world objects), and activity recognition. Learning all of these within a single model improves the model’s ability to adapt to new and emerging store formats, products, and customer behaviors, which is critical as we deploy Just Walk Out technology in new locations.
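A common way to combine a primary objective with auxiliary objectives is a weighted sum of per-task losses; the sketch below shows that pattern with made-up task names and weights, not our actual loss formulation.

```python
import torch
from typing import Dict

def combined_loss(task_losses: Dict[str, torch.Tensor],
                  task_weights: Dict[str, float]) -> torch.Tensor:
    # Weighted sum of the primary receipt-prediction loss and the
    # auxiliary-task losses; the weights are illustrative assumptions.
    return sum(task_weights.get(name, 1.0) * loss
               for name, loss in task_losses.items())

# Example with made-up loss values:
losses = {"receipt": torch.tensor(0.8),
          "detection": torch.tensor(0.3),
          "activity": torch.tensor(0.5)}
print(combined_loss(losses, {"receipt": 1.0, "detection": 0.2, "activity": 0.2}))
```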
To train an AI model, carefully curated data is fed into a selected algorithm, allowing the system to improve itself and produce accurate results. We quickly discovered that we could accelerate model training by using a data flywheel that continually mines and labels high-quality data in a self-reinforcing cycle. The system is designed to integrate these incremental improvements with minimal manual intervention. The following diagram illustrates the process:
To effectively train the FM, we invested in a robust infrastructure that can efficiently handle the vast amounts of data required to train a large-scale neural network that mimics human decision-making. We built the infrastructure for the Just Walk Out model with the help of several Amazon Web Services (AWS) services, including Amazon Simple Storage Service (Amazon S3) for data storage and Amazon SageMaker for training.
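As a rough illustration (not our actual job configuration), a distributed training job might be launched with the SageMaker Python SDK like this; the script name, IAM role, instance settings, and S3 paths are all placeholders.

```python
import sagemaker
from sagemaker.pytorch import PyTorch

# Placeholder values throughout; running this requires valid AWS
# credentials, an IAM role, and an S3 bucket you own.
estimator = PyTorch(
    entry_point="train_fm.py",                            # hypothetical training script
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role
    framework_version="2.1",
    py_version="py310",
    instance_count=8,                     # illustrative multi-node setup
    instance_type="ml.p4d.24xlarge",      # GPU instances suited to large models
    sagemaker_session=sagemaker.Session(),
)

# Training data is read from Amazon S3.
estimator.fit({"train": "s3://example-bucket/jwo-training-data/"})
```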
Below are the key steps we took in training the FM.
- Selecting challenging data – To train the AI models in our Just Walk Out technology, we focus on training data with particularly challenging shopping scenarios that test the limits of the models. Although these complex cases make up only a small portion of shopping data, they are the most useful for helping our models learn from their mistakes.
- Leveraging automatic labeling – To improve operational efficiency, we developed algorithms and models that automatically label data. Beyond receipt prediction, the auto-labeling algorithms also cover the auxiliary tasks, enabling the model to gain comprehensive multimodal understanding and reasoning capabilities.
- Pre-training the model – Our FM is pre-trained on a vast collection of multimodal data across a range of tasks, which improves the model’s ability to generalize to store environments it has never encountered before.
- Fine-tuning the model – Finally, we further refined the model and applied quantization techniques to create a more compact, efficient model that can run on edge computing hardware in stores; a toy example of this kind of compression appears after this list.
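The compression step in the last item can be illustrated with PyTorch’s post-training dynamic quantization, shown here on a toy model; our actual compression pipeline is not described in this post.

```python
import torch
import torch.nn as nn

# Toy model standing in for a much larger network.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 128))

# Replace Linear weights with int8 versions to shrink the model and
# speed up CPU inference, a common fit for edge deployment.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
print(quantized)  # Linear layers become DynamicQuantizedLinear
```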
As the data flywheel continues to run, it gradually identifies more high-quality, challenging cases that test the robustness of the model. Incorporating these additional challenging examples into the training set further improves the model’s accuracy and applicability across new brick-and-mortar environments.
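One iteration of such a flywheel might look like the following toy loop; every class and function here is a hypothetical stand-in so the cycle of mining, auto-labeling, and retraining can be read end to end.

```python
import random

class ToyModel:
    """Hypothetical stand-in so the flywheel loop below runs end to end."""
    def confidence(self, example):   # pretend per-example confidence score
        return random.random()
    def auto_label(self, example):   # pretend auto-labeling model
        return f"label_for({example})"
    def train_on(self, dataset):     # pretend retraining step
        print(f"retraining on {len(dataset)} examples")

def flywheel_iteration(model, unlabeled_pool, training_set, threshold=0.1):
    # 1. Mine hard cases: keep only the small fraction of examples the
    #    model is least confident about.
    hard = [x for x in unlabeled_pool if model.confidence(x) < threshold]
    # 2. Auto-label them with the current best models.
    training_set += [(x, model.auto_label(x)) for x in hard]
    # 3. Retrain; each pass sharpens the miner and labeler for the next round.
    model.train_on(training_set)

flywheel_iteration(ToyModel(), [f"clip_{i}" for i in range(100)], [])
```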
Conclusion
In this post, we’ve shown how our multimodal AI system brings great new possibilities to Just Walk Out technology. Our innovative approach moves away from modular AI systems that rely on human-defined subcomponents and interfaces, building simpler, more scalable AI systems that can be trained end-to-end. Though we’re only just scratching the surface, multimodal AI will raise the bar on our already highly accurate receipt system and further improve the shopping experience in Just Walk Out technology stores around the world.
To read the official announcement about the new multimodal AI system and learn more about the latest improvements to Just Walk Out technology, please visit About Amazon.
To find a Just Walk Out Technology location, visit Just Walk Out Technology Locations near you. Learn more about how you can enhance your store or venue with Amazon’s Just Walk Out Technology on our Just Walk Out Technology product page.
To learn more about how AWS can reinvent customer experiences with the most comprehensive set of AI and ML services, see Build and Scale the Next Wave of AI Innovation on AWS.
About the Authors
Tian Lan is a Principal Scientist at AWS. He currently leads research and development of the next-generation Just Walk Out 2.0 technology, converging end-to-end learning into a multimodal foundation model with a focus on the store domain.
Chris Broaddus is a Senior Manager at AWS. He currently manages all research efforts for Just Walk Out technology, including projects such as the multimodal AI model, deep learning for human pose estimation, and Radio Frequency Identification (RFID) receipt prediction.