Providing effective multilingual customer support in a global business presents significant operational challenges. Through a collaboration between AWS and DXC Technology, we developed a scalable voice-to-voice (V2V) translation prototype that transforms how contact centers handle multilingual customer interactions.
This post describes how AWS and DXC used Amazon Connect and other AWS AI services to deliver near real-time V2V translation capabilities.
Challenge: Serve customers in multiple languages
In the third quarter of 2024, DXC Technology approached AWS with a critical business challenge. Their global contact centers needed to serve customers in multiple languages without the exponential cost of hiring language-specific agents for every language they support. DXC had previously evaluated several existing alternatives but found limitations in each approach, ranging from latency constraints to infrastructure requirements that affected reliability, scalability, and operational costs. DXC and AWS decided to organize focused hackathons in which DXC and AWS Solutions Architects collaborated to:
- Define key requirements for real-time translation
- Establish benchmarks for latency and accuracy
- Create a seamless integration path with existing systems
- Develop a phased implementation strategy
- Build and test an initial proof-of-concept setup
Business impact
For DXC, this prototype served as an enabler, unlocking better use of technical talent, greater operational flexibility, and cost improvements:
- Best technical expertise delivery – Hiring and matching agents based on technical knowledge rather than spoken language makes sure customers get the best technical support regardless of language barriers
- Global operational flexibility – Removing geographic and linguistic constraints on hiring, staffing, and support delivery while maintaining consistent quality of service across languages
- Cost reduction – Eliminating the multilingual expertise premium, specialized language training, and associated infrastructure costs through a pay-per-conversation model
- Native speaker-like experience – Maintaining a natural conversation flow with near real-time translation and audio feedback while providing premium technical support in the customer's preferred language
Solution overview
The Amazon Connect V2V translation prototype uses advanced AWS speech recognition and machine translation technologies to enable near real-time translation of conversations between agents and customers, allowing each party to speak in their preferred language while having a natural conversation. It consists of the following key components:
- Speech recognition – The customer's speech is captured and converted to text using Amazon Transcribe, which acts as the speech recognition engine. The transcript is then fed to the machine translation engine.
- Machine translation – Amazon Translate, the machine translation engine, translates the customer's transcript into the agent's preferred language in near real time. The translated transcript is converted to speech using Amazon Polly, which acts as the text-to-speech engine.
- Two-way translation – The process reverses for the agent's response, translating the agent's speech into the customer's language and delivering the translated audio to the customer.
- Seamless integration – The V2V translation sample project integrates with Amazon Connect using the Amazon Connect Streams JS and Amazon Connect RTC JS libraries, so agents can handle customer interactions in multiple languages without any additional effort or training.
The prototype can be extended with other AWS AI services to further customize translation capabilities. It is open source and ready for customization to meet your specific needs.
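To make the pipeline concrete, the following TypeScript sketch shows the translate-and-synthesize step using the AWS SDK for JavaScript v3. It is a minimal illustration under stated assumptions, not the sample project's actual code: it assumes a transcript segment has already been produced by Amazon Transcribe, and the function name, language codes, and voice are placeholders.

```typescript
import { TranslateClient, TranslateTextCommand } from "@aws-sdk/client-translate";
import { PollyClient, SynthesizeSpeechCommand, VoiceId } from "@aws-sdk/client-polly";

const translateClient = new TranslateClient({});
const pollyClient = new PollyClient({});

// Translate a transcript segment (already produced by Amazon Transcribe)
// and synthesize it in the listener's preferred language.
async function translateAndSynthesize(
  transcript: string,
  sourceLanguage: string, // e.g. "es" for a Spanish-speaking customer
  targetLanguage: string, // e.g. "en" for an English-speaking agent
  voiceId: VoiceId        // e.g. "Joanna"; choose a voice that matches targetLanguage
) {
  // Machine translation with Amazon Translate
  const { TranslatedText } = await translateClient.send(
    new TranslateTextCommand({
      Text: transcript,
      SourceLanguageCode: sourceLanguage,
      TargetLanguageCode: targetLanguage,
    })
  );

  // Text-to-speech with Amazon Polly; the returned AudioStream is what the
  // application plays back to the listening party (agent or customer).
  const { AudioStream } = await pollyClient.send(
    new SynthesizeSpeechCommand({
      Text: TranslatedText ?? "",
      VoiceId: voiceId,
      OutputFormat: "mp3",
    })
  );

  return AudioStream;
}
```

In the prototype, this kind of processing runs in near real time for each finished utterance, in both directions of the conversation.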
The following diagram illustrates the solution architecture.
The following screenshot shows the sample agent web application:
The user interface consists of three sections:
- Contact Control Panel – The Amazon Connect softphone client
- Customer controls – Controls for the customer-to-agent interaction, including transcribing, translating, and synthesizing the customer's speech
- Agent controls – Controls for the agent-to-customer interaction, including transcribing, translating, and synthesizing the agent's speech
Challenges when implementing near real-time voice translations
The Amazon Connect V2V sample project was designed to minimize the audio processing time from the moment a customer or agent finishes speaking until the translated audio stream begins. However, even with minimal audio processing time, the user experience still doesn't match a conversation in which both parties speak the same language. This is because of a specific pattern in which the customer only hears the agent's translated speech, and the agent only hears the customer's translated speech. The following diagram shows that pattern:
The example workflow consists of the following steps:
- The customer starts speaking in their own language and talks for 10 seconds.
- Because the agent only hears the customer's translated speech, the agent first hears 10 seconds of silence.
- When the customer finishes speaking, audio processing takes 1-2 seconds, during which both the customer and the agent hear silence.
- The customer's translated speech is streamed to the agent. Meanwhile, the customer hears silence.
- After the customer's translated audio playback is complete, the agent starts speaking and talks for 10 seconds.
- Because the customer only hears the agent's translated speech, the customer hears 10 seconds of silence.
- When the agent finishes speaking, audio processing takes 1-2 seconds, during which both the customer and the agent hear silence.
- The agent's translated speech is streamed to the customer. Meanwhile, the agent hears silence.
In this scenario, the customer hears 22-24 seconds of complete silence from the moment they finish speaking until they hear the agent's translated voice (1-2 seconds of processing, about 10 seconds of translated playback to the agent, 10 seconds of the agent's reply, and another 1-2 seconds of processing). This creates a suboptimal experience, because during those 22-24 seconds the customer can't tell what is happening, for example whether the agent heard them or whether there is a technical issue.
Audio streaming add-ons
In a face-to-face conversation between two people who don't speak the same language, a third person often acts as a translator or interpreter. That conversation typically follows these steps:
- Person A speaks in their own language, and both person B and the translator hear it.
- The translator translates what person A said into person B's language, and both person B and person A hear the translation.
Essentially, person A and person B each hear the other person speaking and also hear the translation from the translator; nobody waits in silence. This is even more important in conversations that aren't face to face, such as contact center interactions.
To optimize the customer and agent experience, the Amazon Connect V2V sample project implements audio streaming add-ons that simulate this more natural conversation experience. The following diagram shows an example workflow:
The workflow consists of the following steps:
- The customer starts speaking in their own language and talks for 10 seconds.
- The agent hears the customer's original voice at a lower volume (the "Stream customer mic to agent" add-on is enabled).
- When the customer finishes speaking, audio processing takes 1-2 seconds. Meanwhile, both the customer and the agent hear subtle audio feedback (contact center background noise) at a very low volume (the "Audio feedback" add-on is enabled).
- The customer's translated speech is streamed to the agent. Meanwhile, the customer hears their own translated speech at a lower volume (the "Stream customer translation to customer" add-on is enabled).
- After the customer's translated audio playback is complete, the agent starts speaking and talks for 10 seconds.
- The customer hears the agent's original voice at a lower volume (the "Stream agent mic to customer" add-on is enabled).
- When the agent finishes speaking, audio processing takes 1-2 seconds. Meanwhile, both the customer and the agent hear subtle audio feedback (contact center background noise) at a very low volume (the "Audio feedback" add-on is enabled).
- The agent's translated speech is streamed to the customer. Meanwhile, the agent hears their own translated speech at a lower volume (the "Stream agent translation to agent" add-on is enabled).
In this scenario, instead of a single 22-24 second block of total silence, the customer hears only two short blocks (1-2 seconds each) of subtle audio feedback. This pattern is much closer to a face-to-face conversation that involves a translator.
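Conceptually, the add-ons amount to a client-side audio mix in which each source passes through its own volume control. The following browser-side TypeScript sketch shows the general idea using the Web Audio API; it is only an illustration of the technique, not the sample project's implementation, and the gain values and names are arbitrary.

```typescript
// Illustrative only: mix the remote party's original voice at a low volume
// with the translated speech at full volume, using the Web Audio API.
function mixOriginalAndTranslated(
  originalStream: MediaStream,       // the other party's untranslated voice
  translatedAudio: HTMLAudioElement  // playback of the synthesized translation
) {
  const audioContext = new AudioContext();

  // Original voice, attenuated so it stays in the background.
  const originalSource = audioContext.createMediaStreamSource(originalStream);
  const originalGain = audioContext.createGain();
  originalGain.gain.value = 0.2; // arbitrary "lower volume" level

  // Translated, synthesized voice at normal volume.
  const translatedSource = audioContext.createMediaElementSource(translatedAudio);
  const translatedGain = audioContext.createGain();
  translatedGain.gain.value = 1.0;

  // Both sources feed the same output, so the listener hears the original
  // voice quietly underneath the translated speech, similar to hearing a
  // speaker and an interpreter at the same time.
  originalSource.connect(originalGain).connect(audioContext.destination);
  translatedSource.connect(translatedGain).connect(audioContext.destination);
}
```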
The audio streaming add-ons provide additional benefits, including:
- Voice characteristics – If the agent and customer only hear the translated, synthesized speech, the original voice characteristics are lost. For example, the agent can't hear whether the customer is speaking slowly or quickly, or whether the customer is upset or calm, because the translated, synthesized speech doesn't carry that information.
- Quality assurance – When call recording is enabled, because translation and synthesis are performed on the agent (client) side, only the customer's original voice and the agent's synthesized voice are recorded, and the recording contains many silent blocks. This makes it difficult for QA teams to properly evaluate and audit conversations. With the audio streaming add-ons enabled, there are no silent blocks, and the QA team can hear the agent's original voice, the customer's original voice, and their respective translated speech, all in a single audio file.
- Transcription and translation accuracy – Having both the original and translated speech available in call recordings makes it easier to detect specific words that would improve transcription accuracy (using Amazon Transcribe custom vocabularies) or translation accuracy (using Amazon Translate custom terminologies), so that brand names, character names, model names, and other unique content are transcribed and translated to the desired result (see the sketch after this list).
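As a sketch of how those accuracy improvements can be applied, the following example passes a pre-created Amazon Translate custom terminology to a translation request with the AWS SDK for JavaScript v3. The terminology and vocabulary names are placeholders you would create in your own account, not resources from the sample project.

```typescript
import { TranslateClient, TranslateTextCommand } from "@aws-sdk/client-translate";

const translateClient = new TranslateClient({});

// Translate a transcript while enforcing brand and product names through a
// pre-created Amazon Translate custom terminology (the name is a placeholder).
async function translateWithTerminology(transcript: string) {
  const { TranslatedText } = await translateClient.send(
    new TranslateTextCommand({
      Text: transcript,
      SourceLanguageCode: "en",
      TargetLanguageCode: "es",
      TerminologyNames: ["anycompany-product-names"], // placeholder name
    })
  );
  return TranslatedText;
}

// On the transcription side, an Amazon Transcribe streaming request can
// similarly reference a custom vocabulary (for example,
// VocabularyName: "anycompany-product-names") so that unique terms are
// transcribed correctly before they reach the translation step.
```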
Get started with Amazon Connect V2V
Ready to transform your contact center's communication? The Amazon Connect V2V sample project is now available on GitHub. We encourage you to explore, deploy, and experiment with this prototype and use it as the foundation for developing innovative multilingual communication solutions in your own contact center, following these key steps:
- Clone the GitHub repository.
- Test different configurations for audio streaming add-ons.
- Review the sample project's limitations in the README.
- Develop an implementation strategy:
  - Implement robust security and compliance controls that meet your organization's standards.
  - Work with your customer experience team to define requirements for your specific use cases.
  - Balance automation and manual agent control (for example, use an Amazon Connect contact flow to automatically set contact attributes for preferred languages and audio streaming add-ons, as sketched after this list).
  - Use your preferred transcription, translation, and text-to-speech engines based on your specific language support requirements and business, legal, and regional preferences.
  - Plan a phased rollout, starting with a pilot group, and iteratively optimize your custom vocabularies and translation terminologies.
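As a sketch of the automation mentioned in the implementation strategy, the following snippet shows how a web application built on the Amazon Connect Streams JS library could read contact attributes set by a contact flow and use them to preconfigure languages and add-ons. The attribute names and the configureTranslation helper are hypothetical, not part of the sample project, and the snippet assumes the CCP has already been initialized with connect.core.initCCP.

```typescript
import "amazon-connect-streams";

// Hypothetical sketch: read contact attributes that a contact flow has set
// (attribute names are placeholders, not the sample project's actual keys).
connect.contact((contact) => {
  contact.onConnecting(() => {
    const attributes = contact.getAttributes();

    const customerLanguage = attributes["CustomerLanguage"]?.value ?? "es";
    const agentLanguage = attributes["AgentLanguage"]?.value ?? "en";
    const streamMicToOtherParty = attributes["StreamMicToOtherParty"]?.value === "true";

    // Apply the configuration to the V2V translation controls instead of
    // asking the agent to select languages and add-ons manually.
    configureTranslation(customerLanguage, agentLanguage, streamMicToOtherParty);
  });
});

// Placeholder for the application's own configuration logic.
function configureTranslation(
  customerLanguage: string,
  agentLanguage: string,
  streamMicToOtherParty: boolean
): void {
  console.log({ customerLanguage, agentLanguage, streamMicToOtherParty });
}
```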
Conclusion
The Amazon Connect V2V sample project demonstrates how Amazon Connect and advanced AWS AI services can break down language barriers, increase operational flexibility, and reduce support costs. Get started today and revolutionize how your contact center communicates across languages!
About the authors
Milos Kozic is a Solutions Architect at AWS.
eJFerror is a Senior Solutions Architect at AWS.
Adam El Tambouri is a Technical Program Manager for prototyping and support services at DXC Modern Workplace.