Time series data is a distinct category that incorporates time as a fundamental element of its structure. In a time series, data points are collected sequentially, often at regular intervals, and typically exhibit a particular pattern, such as a trend, seasonality, or cyclical behavior. Common examples of time series data include sales revenue, system performance data (such as CPU utilization and memory usage), credit card transactions, sensor readings, and user activity analysis.
Time series anomaly detection is the process of identifying unexpected or unusual patterns in data that evolve over time. An anomaly, also called an outlier, occurs when a data point deviates significantly from the expected pattern.
For some time series with well-defined prediction ranges, such as a machine’s operating temperature or CPU utilization, threshold-based approaches may be sufficient. However, in areas such as fraud detection and sales, simple rules are insufficient as they cannot detect anomalies across complex relationships, and more sophisticated techniques are required to identify unexpected occurrences.
In this post, I show you how to build a robust real-time anomaly detection solution for streaming time series data using Amazon Managed Service for Apache Flink and other AWS managed services.
Solution overview
The following diagram shows the core architecture of the anomaly detection stack solution:
The solution employs machine learning (ML) for anomaly detection and does not require users to have prior AI expertise. It provides an AWS CloudFormation template that you can easily deploy into your AWS account. The CloudFormation template deploys an application stack that contains the AWS resources required for anomaly detection. Configuring a single stack creates an application with a single anomaly detection task, or detector. You can run multiple such stacks simultaneously, each analyzing data and reporting anomalies.
Once the application is deployed, it builds an ML model using the Random Cut Forest (RCF) algorithm. It first consumes the live stream of input time series data from Amazon Managed Streaming for Apache Kafka (Amazon MSK) to train the model. After training, the model continues to process incoming data points from the stream, evaluating each point against the historical trends of the corresponding time series. The model generates an initial raw anomaly score during processing and maintains an internal threshold to filter out noisy data points. It then produces a normalized anomaly score for each data point it treats as anomalous. These scores, which range from 0 to 100, indicate deviation from the general pattern; a score closer to 100 signifies a stronger anomaly. You have the flexibility to set custom thresholds on these anomaly scores, allowing you to define what you consider anomalous.
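The managed stack trains and scores the RCF model inside the Flink application for you. To build intuition for how RCF assigns higher scores to points that break an established pattern, here is a standalone sketch (not part of this solution) using the open-source rrcf package, following its documented streaming pattern:

```python
import numpy as np
import rrcf  # open-source Random Cut Forest implementation (pip install rrcf)

# Synthetic hourly sales: ~20 units on even hours, ~10 on odd hours,
# with an unexplained spike injected at hour 300.
rng = np.random.default_rng(42)
hours = np.arange(500)
sales = np.where(hours % 2 == 0, 20.0, 10.0) + rng.normal(0, 1, 500)
sales[300] = 60.0  # the anomaly

num_trees, tree_size, shingle_size = 40, 256, 4
forest = [rrcf.RCTree() for _ in range(num_trees)]
avg_codisp = {}

# Stream shingled (rolling-window) points through the forest, evicting the
# oldest point once each tree reaches its maximum size (sliding window).
for index, point in enumerate(rrcf.shingle(sales, size=shingle_size)):
    for tree in forest:
        if len(tree.leaves) > tree_size:
            tree.forget_point(index - tree_size)
        tree.insert_point(point, index=index)
        avg_codisp[index] = avg_codisp.get(index, 0) + tree.codisp(index) / num_trees

# The collusive displacement (CoDisp) score peaks near the injected spike.
peak = max(avg_codisp, key=avg_codisp.get)
print(f"Highest anomaly score at shingle index {peak}: {avg_codisp[peak]:.1f}")
```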
The solution uses CloudFormation templates that take inputs such as MSK broker endpoints and topics, AWS Identity and Access Management (IAM) roles, and other parameters related to virtual private cloud (VPC) configuration. The templates create key resources such as an Apache Flink application and an Amazon SageMaker real-time endpoint in the customer account.
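As an illustration, such a stack could also be created programmatically with boto3. The template URL and parameter keys below are placeholders, since the actual names are defined by the solution's template:

```python
import boto3

cfn = boto3.client("cloudformation")

# NOTE: The template URL and parameter keys are illustrative placeholders;
# use the names defined by the solution's CloudFormation template.
cfn.create_stack(
    StackName="anomaly-detector-sales",
    TemplateURL="https://example-bucket.s3.amazonaws.com/anomaly-stack.yaml",
    Parameters=[
        {"ParameterKey": "MskBootstrapServers", "ParameterValue": "<broker endpoints>"},
        {"ParameterKey": "InputTopic", "ParameterValue": "sales-input"},
        {"ParameterKey": "OutputTopic", "ParameterValue": "anomaly-output"},
        {"ParameterKey": "AnomalyDecisionThreshold", "ParameterValue": "70"},
    ],
    Capabilities=["CAPABILITY_NAMED_IAM"],
)
```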
To request access to this solution, please email anomalydetection-support-canvas@amazon.com.
This post provides an overview of how to build an end-to-end solution using the anomaly detection stack. Consider a fictitious sales scenario where AnyBooks, a bookstore located on the campus of a large university, sells a variety of supplies to college students. Due to the timing of the class schedule, there is seasonality, with approximately 20 units of item A and 30 units of item B sold during even hours, and roughly half that number during odd hours. Recently, there have been unexplained spikes in the quantities of items sold, and management would like to track these quantity anomalies to be able to better plan staffing and inventory levels.
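To make the scenario concrete, here is a small, self-contained generator (not part of the solution) that produces the kind of seasonal sales records described above; the record shape is an assumption for illustration:

```python
import json
import random
from datetime import datetime, timedelta, timezone

# AnyBooks scenario: ~20 units of Item-A and ~30 of Item-B during even hours,
# roughly half that during odd hours.
BASE_QUANTITY = {"Item-A": 20, "Item-B": 30}

def simulate_hour(ts):
    factor = 1.0 if ts.hour % 2 == 0 else 0.5
    return [
        {
            "product_name": item,
            "quantity": max(0, round(random.gauss(base * factor, 2))),
            "timestamp": ts.isoformat(),
        }
        for item, base in BASE_QUANTITY.items()
    ]

start = datetime(2024, 1, 1, tzinfo=timezone.utc)
for h in range(4):
    for record in simulate_hour(start + timedelta(hours=h)):
        print(json.dumps(record))
```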
The following diagram shows the detailed architecture of the end-to-end solution.
The following sections describe each layer shown in the previous diagram.
Ingestion
At the ingestion layer, an AWS Lambda function retrieves the current minute of sales transactions from the PostgreSQL transactional database, converts each record into a JSON message, and publishes it to the input Kafka topic. This Lambda function is configured to run every minute using Amazon EventBridge Scheduler.
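A minimal sketch of such an ingestion function, assuming a `sales` table with `product_name`, `quantity`, and `order_ts` columns and using the `psycopg2` and `kafka-python` client libraries (the solution's actual function may differ):

```python
import json
import os
from datetime import datetime, timedelta, timezone

import psycopg2                  # assumed PostgreSQL driver (e.g., via a Lambda layer)
from kafka import KafkaProducer  # assumed Kafka client (kafka-python)

# Created once per container so the connection is reused across invocations.
# MSK authentication (for example, IAM auth) would need extra configuration.
producer = KafkaProducer(
    bootstrap_servers=os.environ["MSK_BOOTSTRAP_SERVERS"].split(","),
    value_serializer=lambda v: json.dumps(v, default=str).encode("utf-8"),
)

def lambda_handler(event, context):
    # Read the previous full minute of sales transactions.
    window_end = datetime.now(timezone.utc).replace(second=0, microsecond=0)
    window_start = window_end - timedelta(minutes=1)
    conn = psycopg2.connect(os.environ["DB_DSN"])
    with conn, conn.cursor() as cur:
        cur.execute(
            "SELECT product_name, quantity, order_ts FROM sales "
            "WHERE order_ts >= %s AND order_ts < %s",
            (window_start, window_end),
        )
        for product_name, quantity, order_ts in cur:
            producer.send(os.environ["INPUT_TOPIC"], {
                "product_name": product_name,
                "quantity": quantity,
                "timestamp": order_ts,
            })
    producer.flush()
    conn.close()
```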
Anomaly detection stack
The Flink application reads raw data from the input MSK topic, trains the model, detects anomalies, and finally writes them to the output MSK topic. The following is an example of the output JSON; the field names match the descriptions that follow, and the values shown are illustrative:
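```json
{
  "measure": "quantity",
  "aggregatedMeasureValue": 42.0,
  "timeseriesId": "5e02d1b4bd9a",
  "dimensionList": [
    { "name": "product_name", "value": "Item-A" }
  ],
  "anomalyConfidenceScore": 95.5,
  "anomalyScore": 89.2,
  "modelStage": "INFERENCE",
  "anomalyDecisionThreshold": 70,
  "anomalyDecision": 1
}
```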
A brief description of the output fields follows:
- `measure` – This represents the metric that is being tracked for anomalies. In our case, the `measure` field is `quantity`, the number of units of `Item-A` sold.
- `aggregatedMeasureValue` – This represents the aggregated value of `quantity` within the time window.
- `timeseriesId` – This unique identifier corresponds to a unique combination of dimension values and a metric. In this scenario, it is the product name `Item-A` in the `product_name` dimension.
- `anomalyConfidenceScore` – This confidence score improves over time as the model evolves through training and inference.
- `anomalyScore` – This field represents the anomaly score. With the decision threshold set to 70, any value above 70 is considered a potential anomaly.
- `modelStage` – While the model is in the learning phase, the `anomalyScore` is 0.0 and the value of this field is `LEARNING`. Once training is complete, the value changes to `INFERENCE`.
- `anomalyDecisionThreshold` – The decision threshold is provided as an input to the CloudFormation stack. If you determine that you are getting too many false positives, you can increase this threshold to reduce the sensitivity.
- `anomalyDecision` – If the `anomalyScore` exceeds the `anomalyDecisionThreshold`, this field is set to 1 to indicate that an anomaly was detected.
Transformation
At the transformation layer, an Amazon Data Firehose stream is configured to consume data from the output Kafka topic and invoke a Lambda function for transformation. The Lambda function flattens the nested JSON data from the Kafka topic. The transformed results are partitioned by date and stored in an Amazon Simple Storage Service (Amazon S3) bucket in Parquet format. An AWS Glue crawler is used to crawl the data in the Amazon S3 location and catalog it in the AWS Glue Data Catalog, ready for querying and analysis.
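The transformation function follows the standard Firehose record-transformation contract: it receives base64-encoded records and must return each one with its `recordId`, a result status, and the transformed, re-encoded data. A minimal sketch of a flattening transform (the solution's actual function may differ; the Parquet conversion itself is handled by Firehose record format conversion):

```python
import base64
import json

def flatten(obj, parent_key="", sep="_"):
    """Recursively flatten nested JSON into a single-level dict."""
    items = {}
    for key, value in obj.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            items.update(flatten(value, new_key, sep=sep))
        else:
            items[new_key] = value
    return items

def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        flattened = flatten(payload)
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(
                (json.dumps(flattened) + "\n").encode("utf-8")
            ).decode("utf-8"),
        })
    return {"records": output}
```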
Visualization
To visualize the data, I created an Amazon QuickSight dashboard that connects to the data in Amazon S3 through the Data Catalog and runs queries using Amazon Athena. You can refresh the dashboard to view the latest anomalies detected, as shown in the following screenshot.
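For example, once the crawler has cataloged the table, detected anomalies can be queried through Athena. A sketch using boto3, where the database, table, and column names are assumptions based on the flattened schema (use the names your crawler actually creates):

```python
import boto3

athena = boto3.client("athena")

# Assumed database/table/column names; adjust to your Data Catalog.
QUERY = """
SELECT timeseriesid, measure, aggregatedmeasurevalue, anomalyscore
FROM anomaly_results
WHERE anomalydecision = 1
ORDER BY anomalyscore DESC
LIMIT 20
"""

response = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "anomaly_detection_db"},
    ResultConfiguration={"OutputLocation": "s3://your-bucket/athena-results/"},
)
print("Query execution ID:", response["QueryExecutionId"])
```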
In this example, the dark blue line in the line chart shows the seasonality of the `quantity` measure for `Item-A` over time, with higher values during even hours and lower values during odd hours. The pink line represents the `anomalyScore`, plotted on the right y-axis; it approaches 100 when quantity values deviate significantly from the seasonal pattern. The blue line represents the anomaly decision threshold, set to 70. When the `anomalyScore` crosses this threshold, `anomalyDecision` is set to 1.
The “Number of time series being tracked” KPI shows how many time series the model is currently monitoring. Because we are tracking two products (`Item-A` and `Item-B`), the count is 2. The “Number of data points processed” KPI shows the total number of data points the model has processed, and the “Anomaly confidence score” KPI shows the model's confidence in predicting anomalies. Initially, this score is low, but it approaches 100 as the model matures.
Notification
While visualizations are useful for investigating anomalies, data analysts often prefer to receive near real-time notifications about significant anomalies. This is achieved by adding a Lambda function that reads and analyzes the results from the output Kafka topic. If the `anomalyScore` value exceeds the defined threshold, the function publishes to an Amazon Simple Notification Service (Amazon SNS) topic, sending an email or SMS notification to a subscription list and alerting your team to the anomaly in near real time.
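A minimal sketch of such a notification function, assuming an MSK event source mapping on the output topic and an SNS topic ARN supplied through an environment variable (variable names are illustrative):

```python
import base64
import json
import os

import boto3

sns = boto3.client("sns")
# Assumed environment variables; names are illustrative.
TOPIC_ARN = os.environ["SNS_TOPIC_ARN"]
THRESHOLD = float(os.environ.get("ANOMALY_THRESHOLD", "70"))

def lambda_handler(event, context):
    """Triggered by an MSK event source mapping on the output Kafka topic."""
    # MSK events group base64-encoded records by topic-partition.
    for partition_records in event["records"].values():
        for record in partition_records:
            result = json.loads(base64.b64decode(record["value"]))
            if result.get("anomalyScore", 0) > THRESHOLD:
                sns.publish(
                    TopicArn=TOPIC_ARN,
                    Subject="Anomaly detected",
                    Message=json.dumps(result, indent=2),
                )
```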
Conclusion
In this post, I demonstrated how to build a robust real-time anomaly detection solution for streaming time series data using Amazon Managed Service for Apache Flink and other AWS services. I walked through an end-to-end architecture that ingests data from a source database, passes it through an Apache Flink application that trains an ML model and detects anomalies, and then lands the anomaly data in an Amazon S3 data lake. The anomaly scores and decisions are visualized in a QuickSight dashboard connected to the data in Amazon S3 using AWS Glue and Athena. Additionally, a Lambda function analyzes the results and sends notifications in near real time.
Using AWS managed services such as Amazon MSK, Data Firehose, Lambda, and SageMaker, you can rapidly deploy and extend this anomaly detection solution to your own time series use cases, allowing you to automatically identify unexpected behaviors and patterns in your data streams in real time, without manual rules or thresholds.
Try this solution to learn how real-time anomaly detection on AWS can help you gain insights and optimize operations across your business.
About the Authors
Noah Soprala is a Solutions Architect based out of Dallas. He is a trusted advisor to his customers, helping them build innovative solutions using AWS technologies. Noah has over 20 years of experience in consulting, development, solution architecture, and delivery.

Dan Schinreich is a Senior Product Manager with Amazon SageMaker, focused on expanding no-code/low-code offerings. He is passionate about making ML and generative AI more accessible to help solve tough problems. Outside of work, he enjoys playing hockey, scuba diving, and reading sci-fi novels.

Syed Furqhan is a Senior Software Engineer for AI and ML at AWS. He has been part of many AWS service launches, including Amazon Lookout for Metrics, Amazon SageMaker, and Amazon Bedrock. He is currently focused on generative AI initiatives as part of Amazon Bedrock Core Systems. He is a clean code advocate and an expert in serverless and event-driven architectures. You can follow him on LinkedIn at syedfurqhan.

Nirmal Kumar is a Senior Product Manager for the Amazon SageMaker service. Committed to broadening access to AI/ML, he leads the development of no-code and low-code ML solutions. Outside of work, he enjoys traveling and reading non-fiction.