Map Earth’s vegetation in under 20 minutes using Amazon SageMaker

In today’s rapidly changing world, monitoring the health of the Earth’s vegetation is more important than ever. Vegetation plays an important role in maintaining ecological balance, providing nutrients, and acting as a carbon sink. Traditionally, monitoring vegetation health has been a difficult task. Methods such as field surveys and manual satellite data analysis are not only time-consuming, but also require significant resources and expertise. These cumbersome approaches often delay data collection and analysis, making it difficult to track and respond quickly to environmental changes. Furthermore, the high costs associated with these methods limit their availability and frequency, impeding comprehensive and continuous vegetation monitoring at a global scale. In light of these challenges, we have developed a solution that streamlines vegetation monitoring and makes it more efficient worldwide.

Moving away from traditional, labor-intensive methods of monitoring vegetation health, Amazon SageMaker geospatial capabilities provide a streamlined and cost-effective solution. Amazon SageMaker supports geospatial machine learning (ML) capabilities that enable data scientists and ML engineers to build, train, and deploy ML models using geospatial data. These geospatial capabilities open up a new world of environmental monitoring possibilities. SageMaker allows users to access a wide range of geospatial datasets, efficiently process and enrich this data, and accelerate development timelines. Tasks that previously took days or even weeks to complete can now be completed in a fraction of the time.

In this post, we demonstrate the power of SageMaker geospatial capabilities by mapping the world’s vegetation in under 20 minutes. This example highlights not only the efficiency of SageMaker, but also how geospatial ML can be used to monitor the environment for sustainability and conservation purposes.

Identify your area of interest

First, we show how to apply SageMaker to analyze geospatial data on a global scale. To get started, follow the steps described in Getting Started with Amazon SageMaker geospatial capabilities. Begin by specifying the geographic coordinates of a bounding box that covers the area of interest. This bounding box acts as a filter that selects only the satellite images covering the Earth’s landmass.

import os
import json
import time
import boto3
import geopandas
from shapely.geometry import Polygon
import leafmap.foliumap as leafmap
import sagemaker
import sagemaker_geospatial_map

session = boto3.Session()
execution_role = sagemaker.get_execution_role()
sg_client = session.client(service_name="sagemaker-geospatial")
coordinates = (
    (-179.034845, -55.973798),
    (179.371094, -55.973798),
    (179.371094, 83.780085),
    (-179.034845, 83.780085),
    (-179.034845, -55.973798),
)
polygon = Polygon(coordinates)
# index and geometry must be list-like: one row holding the bounding-box polygon
world_gdf = geopandas.GeoDataFrame(index=[0], crs="epsg:4326", geometry=[polygon])
m = leafmap.Map(center=(37, -119), zoom=4)
m.add_basemap('Esri.WorldImagery')
m.add_gdf(world_gdf, layer_name="AOI", style={"color": "red"})
m
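The bounding box above is written out as five explicit corner points. As a minimal sketch, the same closed ring can be generated from the minimum and maximum longitude and latitude. The helper name `bbox_to_ring` is hypothetical and is not part of Shapely or of any SageMaker API.

```python
def bbox_to_ring(min_lon, min_lat, max_lon, max_lat):
    """Return a closed ring of (lon, lat) corners, counter-clockwise,
    starting and ending at the south-west corner."""
    if min_lon >= max_lon or min_lat >= max_lat:
        raise ValueError("minimum values must be smaller than maximum values")
    return (
        (min_lon, min_lat),
        (max_lon, min_lat),
        (max_lon, max_lat),
        (min_lon, max_lat),
        (min_lon, min_lat),  # repeat the first point to close the ring
    )


ring = bbox_to_ring(-179.034845, -55.973798, 179.371094, 83.780085)
print(len(ring), ring[0] == ring[-1])
```

The resulting tuple has the same corner order as the hand-written coordinates, so it could be passed to `Polygon` in the same way.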

Data acquisition

SageMaker geospatial capabilities provide access to a wide range of public geospatial datasets, including Sentinel-2, Landsat 8, Copernicus DEM, and NAIP. We chose Sentinel-2 for our vegetation mapping project because of its global coverage and frequent updates. The Sentinel-2 satellites capture images of the Earth’s surface at a resolution of 10 meters every five days. In this example, we select the first week of December 2023 and filter for images with less than 10% cloud cover, so that most of the ground is visible and the analysis is based on clear, reliable imagery.

search_rdc_args = {
    "Arn": "arn:aws:sagemaker-geospatial:us-west-2:378778860802:raster-data-collection/public/nmqj48dcu3g7ayw8", # sentinel-2 L2A
    "RasterDataCollectionQuery": {
        "AreaOfInterest": {
            "AreaOfInterestGeometry": {
                "PolygonGeometry": {
                    # a list of rings, each ring a list of [longitude, latitude] pairs
                    "Coordinates": [
                        [
                            [-179.034845, -55.973798],
                            [179.371094, -55.973798],
                            [179.371094, 83.780085],
                            [-179.034845, 83.780085],
                            [-179.034845, -55.973798]
                        ]
                    ]
                }
            }
        },
        "TimeRangeFilter": {
            "StartTime": "2023-12-01T00:00:00Z",
            "EndTime": "2023-12-07T23:59:59Z",
        },
        "PropertyFilters": {
            # keep only images with 0-10% cloud cover
            "Properties": [{"Property": {"EoCloudCover": {"LowerBound": 0, "UpperBound": 10}}}],
            "LogicalOperator": "AND",
        },
    }
}

s2_items = []
s2_tile_ids = []
s2_geometries = {
    'id': [],
    'geometry': [],
}
# page through the results until no NextToken is returned
while search_rdc_args.get("NextToken", True):
    search_result = sg_client.search_raster_data_collection(**search_rdc_args)
    for item in search_result["Items"]:
        s2_id = item['Id']
        s2_tile_id = s2_id.split('_')[1]
        # keep only the first image found for each tile, skipping tiles that cover the same area
        if s2_tile_id not in s2_tile_ids:
            s2_tile_ids.append(s2_tile_id)
            s2_geometries['id'].append(s2_id)
            s2_geometries['geometry'].append(Polygon(item['Geometry']['Coordinates'][0]))
            # drop the DateTime field, which is not used later
            del item['DateTime']
            s2_items.append(item)

    search_rdc_args["NextToken"] = search_result.get("NextToken")

print(f"{len(s2_items)} unique Sentinel-2 images found.")
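The search loop combines two ideas: following `NextToken` until the results are exhausted, and keeping one image per tile. The sketch below isolates that logic so it can be run without AWS credentials. `FakeSearchClient`, `collect_unique_tiles`, and the canned item IDs are all assumptions made for illustration; only the `Items`, `Id`, and `NextToken` keys and the `split('_')[1]` convention come from the code above.

```python
class FakeSearchClient:
    """Stand-in for the geospatial client; returns two canned result pages."""

    def __init__(self):
        self._pages = {
            None: {
                "Items": [
                    {"Id": "S2A_10SFH_20231201_0_L2A"},
                    {"Id": "S2B_10SFH_20231203_0_L2A"},  # same tile as above
                ],
                "NextToken": "page-2",
            },
            "page-2": {"Items": [{"Id": "S2A_11SKA_20231202_0_L2A"}]},
        }

    def search_raster_data_collection(self, **kwargs):
        return self._pages[kwargs.get("NextToken")]


def collect_unique_tiles(client, search_args):
    """Follow NextToken until exhausted, keeping one item per tile ID."""
    args = dict(search_args)  # copy so the caller's dict is not modified
    seen_tiles = set()
    unique_items = []
    while True:
        page = client.search_raster_data_collection(**args)
        for item in page["Items"]:
            tile_id = item["Id"].split("_")[1]
            if tile_id not in seen_tiles:
                seen_tiles.add(tile_id)
                unique_items.append(item)
        token = page.get("NextToken")
        if not token:
            break
        args["NextToken"] = token
    return unique_items


items = collect_unique_tiles(FakeSearchClient(), {})
print([item["Id"] for item in items])
```

This variant tracks seen tiles in a set rather than a list, which keeps the membership check cheap as the number of tiles grows into the thousands.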

Using the search_raster_data_collection function of SageMaker geospatial capabilities, we identified 8,581 unique Sentinel-2 images taken during the first week of December 2023. To validate our selection, we plotted the footprints of these images on a map and confirmed that they were the correct images for our analysis.

s2_gdf = geopandas.GeoDataFrame(s2_geometries)
m = leafmap.Map(center=(37, -119), zoom=4)
m.add_basemap('OpenStreetMap')
m.add_gdf(s2_gdf, layer_name="Sentinel-2 Tiles", style={"color": "blue"})
m

SageMaker geospatial processing jobs

When we queried the data using SageMaker geospatial capabilities, we received comprehensive details about the target images, including the data footprint, properties of the spectral bands, and hyperlinks for direct access. These hyperlinks let you bypass the traditional memory- and storage-intensive approach of first downloading the images and then processing them locally, a task made even harder by the size and scale of this dataset, which exceeds 4 TB. Each of the roughly 8,000 images is large, has multiple channels, and is approximately 500 MB in size. Processing several terabytes of data on a single machine would take too long. Setting up a processing cluster is an alternative, but it introduces its own complexities, from data distribution to infrastructure management. SageMaker geospatial streamlines this with Amazon SageMaker Processing: purpose-built geospatial containers combined with SageMaker processing jobs give you a simplified, managed experience for creating and running a cluster. With just a few lines of code, you can scale out your geospatial workloads. You simply specify a script that defines your workload, the location of your geospatial data on Amazon Simple Storage Service (Amazon S3), and the geospatial container. SageMaker Processing then provisions cluster resources to run geospatial ML workloads at city, country, or continent scale.
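The 4 TB figure can be checked with a quick back-of-the-envelope calculation. This is a rough estimate that takes the approximate 500 MB per image at face value and uses decimal units.

```python
num_images = 8_581       # unique Sentinel-2 images found by the search
avg_image_size_mb = 500  # approximate size per image, as stated in the text

# decimal units: 1 TB = 1,000,000 MB
total_tb = num_images * avg_image_size_mb / 1_000_000
print(f"Approximate data volume: {total_tb:.2f} TB")
```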

For our project, we use 25 clusters of 20 instances each to scale out the geospatial workload. We divide the 8,581 images into 25 batches of approximately 340 images each, and each batch is distributed evenly across the machines of one cluster. All batch manifests are uploaded to Amazon S3, ready for the processing jobs, so that each batch can be processed quickly and efficiently.

def s2_item_to_relative_metadata_url(item):
    # derive the item's metadata JSON path, relative to the manifest prefix, from its visual asset URL
    parts = item["Assets"]["visual"]["Href"].split("/")
    tile_prefix = parts[4:-1]
    return "{}/{}.json".format("/".join(tile_prefix), item["Id"])


num_jobs = 25
num_instances_per_job = 20 # maximum 20

manifest_list = {}
for idx in range(num_jobs):
    # every manifest starts with the shared S3 prefix entry
    manifest = [{"prefix": "s3://sentinel-cogs/sentinel-s2-l2a-cogs/"}]
    manifest_list[idx] = manifest
# split the manifest for N processing jobs
for idx, item in enumerate(s2_items):
    job_idx = idx%num_jobs
    manifest_list[job_idx].append(s2_item_to_relative_metadata_url(item))
    
# upload the manifest to S3
sagemaker_session = sagemaker.Session()
s3_bucket_name = sagemaker_session.default_bucket()
s3_prefix = 'processing_job_demo'
s3_client = boto3.client("s3")
s3 = boto3.resource("s3")

manifest_dir = "manifests"
os.makedirs(manifest_dir, exist_ok=True)

for job_idx, manifest in manifest_list.items():
    manifest_file = f"{manifest_dir}/manifest{job_idx}.json"
    s3_manifest_key = s3_prefix + "/" + manifest_file
    with open(manifest_file, "w") as f:
        json.dump(manifest, f)

    s3_client.upload_file(manifest_file, s3_bucket_name, s3_manifest_key)
    print("Uploaded {} to {}".format(manifest_file, s3_manifest_key))
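To make the batching concrete, the sketch below reproduces the round-robin assignment (`idx % num_jobs`) with plain integers and shows the shape of a manifest: a first entry holding the shared S3 prefix, followed by object keys relative to that prefix. The bucket and key names in `example_manifest` are placeholders invented for this illustration.

```python
import json
from collections import Counter

num_items = 8_581
num_jobs = 25

# Round-robin assignment: item i goes to job i % num_jobs.
batch_sizes = Counter(i % num_jobs for i in range(num_items))
size_histogram = Counter(batch_sizes.values())
print(dict(size_histogram))

# A tiny manifest in the same layout as the ones uploaded above.
# The bucket and keys are placeholders, not real objects.
example_manifest = [
    {"prefix": "s3://example-bucket/example-prefix/"},
    "tile-a/item-1.json",
    "tile-b/item-2.json",
]
print(json.dumps(example_manifest, indent=2))
```

With 8,581 items and 25 jobs, six batches receive 344 items and the remaining nineteen receive 343, which is where the figure of roughly 340 images per batch comes from.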

Once the input data is ready, we move on to the core analysis, which reveals insights into vegetation health through the Normalized Difference Vegetation Index (NDVI). NDVI is the difference between the near-infrared (NIR) and red reflectances, normalized by their sum, which yields a value between -1 and 1. Higher NDVI values indicate denser, healthier vegetation, values near 0 indicate no vegetation, and negative values typically correspond to bodies of water. This index serves as an important tool for assessing vegetation health and distribution. The script in this section computes NDVI for each image and writes the result as a GeoTIFF.
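Written out, the index is NDVI = (NIR - Red) / (NIR + Red). The following minimal sketch applies that formula to single pixel values. The reflectance numbers are made up for illustration and do not come from the Sentinel-2 data used in this post.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index for a single pixel."""
    total = nir + red
    if total == 0:
        return float("nan")  # undefined when both reflectances are zero
    return (nir - red) / total


# Illustrative (nir, red) reflectance pairs, not measured values.
samples = {
    "dense vegetation": (0.50, 0.10),
    "bare soil": (0.30, 0.25),
    "water": (0.02, 0.05),
}
for label, (nir_value, red_value) in samples.items():
    print(f"{label}: {ndvi(nir_value, red_value):+.2f}")
```

The three samples land in the ranges described above: clearly positive for vegetation, close to zero for bare soil, and negative for water.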

%%writefile scripts/compute_vi.py

import os
import rioxarray
import json
import gc
import warnings

warnings.filterwarnings("ignore")

if __name__ == "__main__":
    print("Starting processing")

    input_path = "/opt/ml/processing/input"
    output_path = "/opt/ml/processing/output"
    input_files = []
    items = []
    for current_path, sub_dirs, files in os.walk(input_path):
        for file in files:
            if file.endswith(".json"):
                full_file_path = os.path.join(input_path, current_path, file)
                input_files.append(full_file_path)
                with open(full_file_path, "r") as f:
                    items.append(json.load(f))

    print("Received {} input files".format(len(input_files)))

    for item in items:
        print("Computing NDVI for {}".format(item["id"]))
        red_band_url = item["assets"]["red"]["href"]
        nir_band_url = item["assets"]["nir"]["href"]
        scl_mask_url = item["assets"]["scl"]["href"]
        red = rioxarray.open_rasterio(red_band_url, masked=True)
        nir = rioxarray.open_rasterio(nir_band_url, masked=True)
        scl = rioxarray.open_rasterio(scl_mask_url, masked=True)
        scl_interp = scl.interp(
            x=red["x"], y=red["y"]
        )  # interpolate SCL to the same resolution as Red and NIR bands

        # mask out cloudy pixels using SCL (https://sentinels.copernicus.eu/web/sentinel/technical-guides/sentinel-2-msi/level-2a/algorithm-overview)
        # class 8: cloud medium probability
        # class 9: cloud high probability
        # class 10: thin cirrus
        red_cloud_masked = red.where((scl_interp != 8) & (scl_interp != 9) & (scl_interp != 10))
        nir_cloud_masked = nir.where((scl_interp != 8) & (scl_interp != 9) & (scl_interp != 10))

        ndvi = (nir_cloud_masked - red_cloud_masked) / (nir_cloud_masked + red_cloud_masked)
        # save the ndvi as geotiff
        s2_tile_id = red_band_url.split("/")[-2]
        file_name = f"{s2_tile_id}_ndvi.tif"
        output_file_path = f"{output_path}/{file_name}"
        ndvi.rio.to_raster(output_file_path)
        print("Written output: {}".format(output_file_path))

        # keep memory usage low
        del red
        del nir
        del scl
        del scl_interp
        del red_cloud_masked
        del nir_cloud_masked
        del ndvi

        gc.collect()
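The cloud-masking step in the script can be illustrated on a toy grid without rioxarray. The sketch below assumes the scene classification (SCL) values are already aligned with the red and NIR pixels, so it skips the interpolation step, and it uses invented 2x2 rasters. `masked_ndvi` is a hypothetical helper, not part of the script above.

```python
import math

# SCL classes treated as cloud in the script: 8 (cloud medium probability),
# 9 (cloud high probability), 10 (thin cirrus)
CLOUD_CLASSES = {8, 9, 10}


def masked_ndvi(nir, red, scl):
    """Compute NDVI per pixel, returning NaN where the SCL class marks cloud."""
    result = []
    for nir_row, red_row, scl_row in zip(nir, red, scl):
        row = []
        for n, r, c in zip(nir_row, red_row, scl_row):
            if c in CLOUD_CLASSES or (n + r) == 0:
                row.append(math.nan)
            else:
                row.append((n - r) / (n + r))
        result.append(row)
    return result


# Toy 2x2 rasters; the values are invented for illustration.
nir = [[0.5, 0.4], [0.3, 0.6]]
red = [[0.1, 0.2], [0.3, 0.2]]
scl = [[4, 9], [5, 4]]  # 4 = vegetation, 5 = not vegetated, 9 = cloud high probability

for row in masked_ndvi(nir, red, scl):
    print(row)
```

Only the top-right pixel is classified as cloud, so it is the only one that comes back as NaN; the other three keep their NDVI values.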

Now that the calculation logic is defined, you are ready to start your geospatial SageMaker processing job. This involves a simple three-step process: setting up the compute cluster, defining the computation details, and organizing the input and output details.

First, set up the cluster: decide on the number and type of instances your job needs, making sure they are suited to geospatial data processing. The compute environment itself is prepared by selecting the geospatial container image, which comes with the packages commonly used to process geospatial data.

Next, for the input, use the manifest you created earlier, which lists all the image hyperlinks. Also specify an S3 location to save the results.

With these elements configured, you can start multiple processing jobs at once and let them run concurrently for greater efficiency.

from multiprocessing import Process
import sagemaker
import boto3 
from botocore.config import Config
from sagemaker import get_execution_role
# ScriptProcessor is defined in sagemaker.processing
from sagemaker.processing import ProcessingInput, ProcessingOutput, ScriptProcessor

role = get_execution_role()
geospatial_image_uri = '081189585635.dkr.ecr.us-west-2.amazonaws.com/sagemaker-geospatial-v1-0:latest'
# use the retry behaviour of boto3 to avoid throttling issue
sm_boto = boto3.client('sagemaker', config=Config(connect_timeout=5, read_timeout=60, retries={'max_attempts': 20}))
sagemaker_session = sagemaker.Session(sagemaker_client = sm_boto)

def run_job(job_idx):
    s3_manifest = f"s3://{s3_bucket_name}/{s3_prefix}/{manifest_dir}/manifest{job_idx}.json"
    s3_output = f"s3://{s3_bucket_name}/{s3_prefix}/output"
    script_processor = ScriptProcessor(
        command=['python3'],
        image_uri=geospatial_image_uri,
        role=role,
        instance_count=num_instances_per_job,
        instance_type="ml.m5.xlarge",
        base_job_name=f'ca-s2-nvdi-{job_idx}',
        sagemaker_session=sagemaker_session,
    )

    script_processor.run(
        code="scripts/compute_vi.py",
        inputs=[
            ProcessingInput(
                source=s3_manifest,
                destination='/opt/ml/processing/input/',
                s3_data_type="ManifestFile",
                s3_data_distribution_type="ShardedByS3Key"
            ),
        ],
        outputs=[
            ProcessingOutput(
                source="/opt/ml/processing/output/",
                destination=s3_output,
                s3_upload_mode="Continuous"
            )
        ],
    )
    time.sleep(2)

processes = []
for idx in range(num_jobs):
    p = Process(target=run_job, args=(idx,))
    processes.append(p)
    p.start()
    
for p in processes:
    p.join()
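The launch loop above starts one operating-system process per job. Because each worker spends most of its time waiting on a remote service, a thread pool is a possible lighter-weight alternative. The sketch below is not the approach used in this post; `run_job_stub` is a hypothetical placeholder that only simulates the blocking call, so the example runs without AWS access.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_job_stub(job_idx):
    """Placeholder for run_job: pretends to submit a job and wait for it."""
    time.sleep(0.01)  # stands in for the blocking call to the processing service
    return f"job-{job_idx}: done"


demo_num_jobs = 5  # small number for the illustration
with ThreadPoolExecutor(max_workers=demo_num_jobs) as pool:
    # map() preserves input order, so results line up with job indices
    results = list(pool.map(run_job_stub, range(demo_num_jobs)))

for line in results:
    print(line)
```

Leaving the `with` block waits for all workers to finish, which plays the same role as the `join()` calls in the original loop.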

When you launch a job, SageMaker automatically launches the required instances and configures the cluster to process the images listed in the input manifest. This entire setup works seamlessly and requires no manual management. You can monitor and manage your processing jobs on the SageMaker console, which provides real-time updates on the status and completion of each job. In this example, 500 instances took less than 20 minutes to process all 8,581 images. Because SageMaker scales out, you can reduce processing time further by increasing the number of instances as needed.
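The numbers in this section fit together as follows. This is a rough calculation that treats the full 20 minutes as processing time, which overstates the per-image budget because the wall-clock time also includes instance startup.

```python
num_jobs = 25
num_instances_per_job = 20
num_images = 8_581
wall_clock_minutes = 20  # upper bound reported for the whole run

total_instances = num_jobs * num_instances_per_job
images_per_instance = num_images / total_instances
seconds_per_image = wall_clock_minutes * 60 / images_per_instance

print(f"{total_instances} instances, about {images_per_instance:.1f} images each")
print(f"At most about {seconds_per_image:.0f} seconds per image")
```

Each instance handles only about 17 images, which is why adding instances shortens the run almost linearly until startup overhead dominates.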

Conclusion

The power and efficiency of SageMaker geospatial capabilities have opened new doors for environmental monitoring, especially in vegetation mapping. This example showed how to process over 8,500 satellite images in under 20 minutes. We demonstrated not only the technical feasibility, but also the efficiency gains of using the cloud for environmental analysis. This approach represents a major leap from traditional, resource-intensive methods to a more agile, scalable, and cost-effective one. The flexibility to scale processing resources up or down as needed, together with the ease of accessing and analyzing vast datasets, positions SageMaker as a transformational tool for geospatial analysis. By simplifying the complexity of large-scale data processing, SageMaker lets scientists, researchers, and businesses focus on extracting insights rather than on infrastructure and data management.

Looking to the future, the integration of ML and geospatial analysis promises to further deepen our understanding of Earth’s ecosystems. The ability to monitor changes in real time, predict future trends, and respond with better-informed decisions could contribute significantly to global conservation efforts. This vegetation mapping example is just the beginning of planetary-scale ML. For more information, see Amazon SageMaker geospatial capabilities.

About the authors

Shuo is a Senior Applied Scientist at AWS. He leads the science team for Amazon SageMaker geospatial capabilities. His current research interests include LLM evaluation and data generation. In his free time, he enjoys running, playing basketball, and spending time with his family.

Anirudh Viswanathan is a Senior Product Manager, Technical – External Services on the SageMaker geospatial ML team. He holds a master’s degree in robotics from Carnegie Mellon University and an MBA from the Wharton School of Business, and is a named inventor on more than 40 patents. He enjoys long-distance running and visiting art galleries and Broadway shows.

Janos Vositz is a Senior Solutions Architect at AWS, specializing in AI/ML. With over 15 years of experience, he leverages AI and ML to deliver innovative solutions and supports customers around the world in building ML platforms on AWS. His expertise spans machine learning, data engineering, and scalable distributed systems, complemented by a strong background in software engineering and industry expertise in areas such as autonomous driving.

Lee Elan Lee is an Applied Science Manager for Human-in-the-Loop Services at AWS AI, Amazon. His research interests include 3D deep learning and learning visual and linguistic representations. Previously, he served as Senior Scientist at Alexa AI, Head of Machine Learning at Scale AI, and Chief Scientist at Pony.ai. Before that, he worked on Uber ATG’s Perception team and Uber’s Machine Learning Platform team, where he worked on strategic initiatives in machine learning, machine learning systems, and AI for self-driving. He began his career at Bell Laboratories and served as an adjunct professor at Columbia University. He co-taught tutorials at ICML’17 and ICCV’19, and co-organized several workshops at NeurIPS, ICML, CVPR, and ICCV on machine learning for autonomous driving, 3D vision and robotics, machine learning systems, and adversarial machine learning. He holds a PhD in computer science from Cornell University. He is an ACM Fellow and an IEEE Fellow.

Amit Modi is the product lead for SageMaker MLOps, ML Governance, and Responsible AI on AWS. With over 10 years of B2B experience, he drives innovation and builds scalable products and teams that deliver value to customers around the world.

Chris Efland is a visionary technology leader with over 20 years of experience driving product innovation and growth. Chris has helped both startups and large corporations develop new products, including consumer electronics and enterprise software, across many industries. In his current role at Amazon Web Services (AWS), Chris leads the geospatial AI/ML category. He works on the front lines of Amazon SageMaker, Amazon’s fastest growing ML service, serving over 100,000 customers worldwide. He recently led the launch of new geospatial capabilities in Amazon SageMaker, a powerful toolset that enables data scientists and machine learning engineers to build, train, and deploy ML models using satellite imagery, maps, and location data. Prior to joining AWS, Chris was responsible for Lyft’s autonomous vehicle (AV) tools and AV maps, leading the company’s automated mapping efforts and the toolchain used to build and operate Lyft’s self-driving vehicle fleet. He also served as an engineering director at HERE Technologies and Nokia, and co-founded several startups.
