
The Future of Productivity Agents with NinjaTech AI and AWS Trainium

This is a guest post by Arash Sadrieh, Tahir Azim, and Tengfei Xue from NinjaTech AI.

At NinjaTech AI, our mission is to make everyone more productive by handling time-consuming, complex tasks with fast, affordable artificial intelligence (AI) agents. We recently launched MyNinja.ai, one of the world’s first multi-agent personal AI assistants, to make this mission a reality. MyNinja.ai is built from the ground up with expert agents that can complete tasks on your behalf, such as scheduling meetings, digging through the web, generating code, and assisting with writing. These agents are able to break down complex, multi-step tasks into branching solutions and dynamically evaluate the solutions generated while continually learning from past experience. All of these tasks are performed fully autonomously and asynchronously, so you are free to continue with your day while Ninja works on these tasks in the background, picking up work when it needs your input.

Because no single large language model (LLM) is optimal for all tasks, we knew that building a personal AI assistant would require multiple LLMs, each optimized for a specific task. We also knew that these models would need to work together to deliver the accuracy and functionality our users expect. Finally, we needed a scalable, cost-effective way to train these different models, which has historically been a costly endeavor for most startups. In this post, we describe how we used AWS Trainium chips to build NinjaLLM, the state-of-the-art productivity agent that serves as the backbone of MyNinja.ai.

Building the Dataset

We realized early on that working on tasks on behalf of our users would require multiple models optimized for specific tasks, such as our Deep Researcher, Deep Coder, and Advisor models. After testing the available open source models, we concluded that their out-of-the-box capabilities and responses fell short of our needs with prompt engineering alone. Specifically, we wanted each model to be optimized for ReAct/chain-of-thought style prompts. Additionally, when the models were deployed as part of our Retrieval Augmented Generation (RAG) system, we wanted them to cite each source accurately, not fall back on "I don't know," and not generate incorrect answers. To that end, we decided to fine-tune our models for the different downstream tasks.
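To make the ReAct/chain-of-thought requirement concrete, here is a minimal sketch of the kind of prompt template we mean; the wording and field names are illustrative, not our production prompts:

```python
# Illustrative ReAct-style RAG prompt template. The wording and field names
# are hypothetical examples, not NinjaTech's production prompts.
REACT_RAG_TEMPLATE = """Answer the question using only the numbered sources below,
citing them as [1], [2], ... Ground every statement in a cited source.

Sources:
{sources}

Question: {question}

Thought: reason step by step about which sources are relevant.
Answer: the final answer, with citations."""


def build_prompt(question: str, passages: list[str]) -> str:
    """Format retrieved passages as numbered sources and fill in the template."""
    sources = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return REACT_RAG_TEMPLATE.format(sources=sources, question=question)
```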

In building the training dataset, our goal was twofold: adapt each model to its downstream task and persona (researcher, advisor, coder, and so on), and adapt the models to follow a specific output structure. To achieve this, we followed the LIMA (Less Is More for Alignment) approach to fine-tuning: a training sample of roughly 20 million tokens that focuses on the form and tone of the output, using a diverse yet relatively small set of examples. To build the supervised fine-tuning dataset, we started by creating initial seed tasks for each model and used them to generate a synthetic dataset with Meta's Llama 2 model. That synthetic dataset supported an initial round of fine-tuning. To evaluate this first fine-tuned model, we crowdsourced user feedback to iteratively create more samples, and we used a set of internal and public benchmarks to evaluate model performance as we continued to iterate.
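As a rough illustration of the seed-task expansion, the sketch below shows the general shape of the process; the seed tasks, prompts, and the `teacher` callable are hypothetical stand-ins for our internal tooling and the Llama 2 endpoint we called:

```python
import json
from typing import Callable

# Hypothetical sketch of expanding hand-written seed tasks into a synthetic
# supervised fine-tuning set. `teacher` stands in for a call to a hosted
# Llama 2 endpoint; the seed tasks and prompts are illustrative only.

SEED_TASKS = [
    {"persona": "researcher",
     "instruction": "Summarize the key findings of the attached article and cite each source."},
    {"persona": "coder",
     "instruction": "Write a Python function that deduplicates a list while preserving order."},
]


def expand_seed(seed: dict, teacher: Callable[[str], str], n_variants: int = 5) -> list[dict]:
    """Ask the teacher model for paraphrased task variants plus reference answers."""
    prompt = (
        f"You are a {seed['persona']} assistant. Produce {n_variants} varied "
        f"instruction/response pairs as a JSON list, in the style of: {seed['instruction']}"
    )
    return json.loads(teacher(prompt))  # expect [{"instruction": ..., "response": ...}, ...]


def build_sft_dataset(teacher: Callable[[str], str], path: str = "sft_round0.jsonl") -> None:
    """Write one JSON line per synthetic example, tagged with its persona."""
    with open(path, "w") as f:
        for seed in SEED_TASKS:
            for example in expand_seed(seed, teacher):
                f.write(json.dumps({"persona": seed["persona"], **example}) + "\n")
```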

Trainium Tweaks

We decided to start with the Llama model as our pre-trained base model for several reasons: notably its excellent out-of-the-box performance, strong ecosystem support from various libraries, and its truly open and permissive license. At the time, we started with Llama 2 and tested it at different sizes (7B, 13B, and 70B). For training, we chose a cluster of 32 trn1.32xlarge instances to take advantage of the Trainium chips and to parallelize training efficiently, and we used AWS ParallelCluster to manage cluster orchestration. With this Trainium cluster, each fine-tuning iteration took less than 3 hours and cost less than $1,000. This short iteration time and low cost let us quickly tune and test our model to improve its accuracy; in total, it cost us only about $30,000 to reach the accuracy described in the next section. This can save hundreds of thousands, potentially millions, of dollars compared to training on traditional accelerators.
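As a minimal sketch of what a single fine-tuning step looks like on a Trainium (XLA) device through torch-neuronx, assuming a causal-LM model and a dataloader that yields tokenized batches with labels; the multi-node sharding handled by torchrun and the Neuron Distributed library is omitted here:

```python
import torch
import torch_xla.core.xla_model as xm  # provided by torch-neuronx on Trainium

# Minimal single-worker sketch of fine-tuning on a Trainium-backed XLA device.
# Real multi-node runs are launched across the ParallelCluster and shard the
# model with the Neuron Distributed library; that orchestration is omitted.

def fine_tune(model, data_loader, epochs: int = 1, lr: float = 1e-5):
    device = xm.xla_device()                 # selects the NeuronCore-backed XLA device
    model = model.to(device).train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

    for _ in range(epochs):
        for batch in data_loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            optimizer.zero_grad()
            loss = model(**batch).loss       # causal-LM loss from the labels in the batch
            loss.backward()
            xm.optimizer_step(optimizer)     # steps and syncs gradients across replicas
            xm.mark_step()                   # flushes the lazily built XLA graph
```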

The following diagram illustrates the training architecture:

With a fine-tuning pipeline built on Trainium and the Neuron Distributed training library in place, we could fine-tune and improve our models quickly. This proved extremely useful and timely, because Meta released its Llama 3 model shortly before the launch of MyNinja.ai. Because Llama 3 and Llama 2 share a similar architecture, we were able to upgrade to the new model rapidly, take advantage of its inherent accuracy improvements, and fine-tune on the Llama 3 weights in time for release.
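Because the pipeline treats the base checkpoint largely as configuration, the swap mostly meant pointing fine-tuning at the new weights. The following generic transformers sketch (public Hugging Face model IDs, not our pipeline code) illustrates the change:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Swapping the base checkpoint is largely a configuration change when the
# architectures are similar; this is a generic transformers sketch, not our pipeline.
BASE_MODEL = "meta-llama/Meta-Llama-3-8B"   # previously "meta-llama/Llama-2-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)
```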

Model Evaluation

In evaluating our model, we had two objectives: evaluate the model's ability to answer user questions, and evaluate the system's ability to answer questions from provided sources, because this is the primary interface of our personal AI assistant. We chose the HotpotQA and Natural Questions (NQ) Open datasets, both of which are well suited because they are open benchmarks with public leaderboards.

We calculated accuracy by matching the model's answer against the expected answer, using the top 10 sentences retrieved from the Wikipedia corpus as context. We performed content filtering and ranking with ColBERTv2, a BERT-based retrieval model. Using the enhanced Llama 3 RAG model, we achieved an accuracy of 62.22% on NQ Open and 58.84% on HotpotQA, a notable improvement over other baseline models. The following figure summarizes the results.
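For reference, here is a simplified sketch of the exact-match style scoring used by these open-domain QA benchmarks; the normalization follows the common SQuAD-style convention and may differ in detail from our evaluation harness:

```python
import re
import string

# Simplified exact-match scoring in the style of open-domain QA benchmarks
# such as NQ Open and HotpotQA; not necessarily identical to our harness.

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """A prediction counts if it normalizes to any accepted gold answer."""
    return normalize(prediction) in {normalize(g) for g in gold_answers}


def accuracy(predictions: list[str], gold: list[list[str]]) -> float:
    hits = sum(exact_match(p, g) for p, g in zip(predictions, gold))
    return 100.0 * hits / len(predictions)
```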

Future Work

Going forward, we are working on several developments to continue improving model performance and user experience. First, we plan to fine-tune our models with ORPO (Odds Ratio Preference Optimization), which combines traditional supervised fine-tuning and preference alignment in a single step, using one preference-tuning dataset for both. We believe this will let us align our models more closely with what delivers the best results for our users.
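For illustration, a single record in the kind of preference dataset ORPO consumes might look like the following; the field names follow the common chosen/rejected convention (as used, for example, by TRL's ORPOTrainer), and the contents are invented:

```python
# One record of the single preference dataset ORPO consumes: a prompt paired
# with a preferred ("chosen") and dispreferred ("rejected") response, so one
# dataset drives both the supervised and the preference signal. Example data
# is invented; our production schema may differ.
preference_record = {
    "prompt": "Summarize this meeting transcript and list the action items.",
    "chosen": "Summary: ...\nAction items:\n1. ...\n2. ...",                  # concise, well-structured
    "rejected": "The meeting was about stuff and some things were decided.",  # vague, unstructured
}
```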

Additionally, we plan to build a custom ensemble from the various models we have fine-tuned so far. Inspired by the Mixture of Experts (MoE) architecture, we plan to introduce a routing layer across these models, which we believe will significantly simplify our model serving and scaling architecture while maintaining the quality users expect from a personal AI assistant across a variety of tasks.
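A rough sketch of what such a routing layer could look like in front of the fine-tuned expert models follows; the expert names, keyword heuristic, and `invoke` callable are hypothetical placeholders, and a production router would use a trained classifier rather than keywords:

```python
from typing import Callable

# Hypothetical routing layer in front of the fine-tuned expert models.
# Expert names, keywords, and the `invoke` callable are illustrative placeholders.

EXPERT_ENDPOINTS = {
    "research": "deep-researcher-endpoint",
    "code": "deep-coder-endpoint",
    "advice": "advisor-endpoint",
}


def route(task: str) -> str:
    """Pick the expert endpoint for a task using a simple keyword heuristic."""
    lowered = task.lower()
    if any(kw in lowered for kw in ("bug", "function", "code", "script")):
        return EXPERT_ENDPOINTS["code"]
    if any(kw in lowered for kw in ("research", "sources", "compare", "summarize")):
        return EXPERT_ENDPOINTS["research"]
    return EXPERT_ENDPOINTS["advice"]


def handle(task: str, invoke: Callable[[str, str], str]) -> str:
    """Dispatch the task to the chosen expert; `invoke` is the model-serving call."""
    return invoke(route(task), task)
```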

Conclusion

Building the next generation of AI agents to make everyone more productive is how NinjaTech AI will achieve its mission. Democratizing access to this transformative technology depends on access to high-performance computing, open source models, and an ecosystem of tools that make training new agents affordable and fast. AWS's purpose-built AI chips, access to leading open source models, and training architectures make this possible.

To learn more about how NinjaTech AI builds multi-agent personal AI, read our whitepaper, or try out these AI agents for free at MyNinja.ai.


About the Authors

Arash Sadrieh is Co-Founder and Chief Scientific Officer at NinjaTech AI. Arash co-founded NinjaTech AI with the vision of making everyone more productive by using AI agents to handle time-consuming tasks. That vision was shaped during his six years as a Senior Applied Scientist at AWS, where he drove research initiatives that markedly improved infrastructure efficiency and earned multiple patents for core infrastructure optimization. His academic background includes a PhD in computer modeling and simulation conducted in collaboration with institutions such as the University of Oxford, the University of Sydney, and CSIRO. Before his time in industry, Arash was a postdoctoral researcher who published in high-impact journals such as Nature Communications.

Tahir Azim is a Staff Software Engineer at NinjaTech AI. Tahir focuses on NinjaTech's Inf2- and Trn1-based training and inference platforms, the unified gateways for accessing these platforms, and its RAG-based research skills. Previously, he was a Senior Software Engineer at Amazon, building data-driven systems that optimally utilize Amazon's global internet edge infrastructure to reduce cost, congestion, and latency. Before moving to industry, Tahir received his MSc and PhD in Computer Science from Stanford University, taught as an Assistant Professor at NUST (Pakistan) for three years, and completed a postdoctoral fellowship in high-speed data analytics systems at EPFL. Tahir has authored numerous publications presented at top conferences such as VLDB, USENIX ATC, MobiCom, and MobiHoc.

Tengfei Xue is an Applied Scientist at NinjaTech AI. His current research interests are natural language processing and multimodal learning, particularly large language models and large multimodal models. Tengfei completed his PhD at the School of Computer Science, University of Sydney, focusing on deep learning for healthcare across different modalities. He was also a visiting PhD candidate at the Laboratory of Mathematics in Imaging (LMI) at Harvard University, where he worked on 3D computer vision for complex geometric data.
