Tuesday, May 30, 2023

Generate a counterfactual analysis of corn response to nitrogen with Amazon SageMaker JumpStart solutions

In his book The Book of Why, Judea Pearl advocates for teaching cause and effect principles to machines in order to enhance their intelligence. The accomplishments of deep learning are essentially just a type of curve fitting, whereas causality could be used to uncover interactions between the systems of the world under various constraints without testing hypotheses directly. This could provide answers that lead us to AGI (artificial general intelligence).

This solution proposes a causal inference framework using Bayesian networks to represent causal dependencies and draw causal conclusions based on observed satellite imagery and experimental trial data in the form of simulated weather and soil conditions. The case study is the causal relationship between nitrogen-based fertilizer application and corn yields.

The satellite imagery is processed using purpose-built Amazon SageMaker geospatial capabilities and enriched with custom-built Amazon SageMaker Processing operations. The causal inference engine is deployed with Amazon SageMaker Asynchronous Inference.

In this post, we demonstrate how to create this counterfactual analysis using Amazon SageMaker JumpStart solutions.

Solution overview

The following diagram shows the architecture for the end-to-end workflow.


You need an AWS account to use this solution.

To run this JumpStart 1P Solution and have the infrastructure deployed to your AWS account, you need to create an active Amazon SageMaker Studio instance (refer to Onboard to Amazon SageMaker Domain). When your Studio instance is ready, follow the instructions in SageMaker JumpStart to launch the Crop Yield Counterfactuals solution.

Note that this solution is currently available in the US West (Oregon) Region only.

Causal inference

Causality is all about understanding change, but how to formalize this in statistics and machine learning (ML) is not a trivial exercise.

In this crop yield study, the nitrogen added as fertilizer and the yield outcomes might be confounded. Similarly, the nitrogen added as fertilizer and the nitrogen leaching outcomes could be confounded as well, in the sense that a common cause can explain their association. However, association is not causation. If we know which observed factors confound the association, we account for them, but what if there are other hidden variables responsible for confounding? Reducing the amount of fertilizer won't necessarily reduce residual nitrogen; similarly, it might not drastically diminish the yield, whereas the soil and climatic conditions could be the observed factors that confound the association. How to handle confounding is the central problem of causal inference. A technique introduced by R. A. Fisher, the randomized controlled trial, aims to break possible confounding.

However, in the absence of randomized controlled trials, there is a need for causal inference purely from observational data. There are ways to connect the causal questions to data in observational studies by writing down a causal graphical model of how we postulate things happen. This involves claiming that the corresponding traverses capture the corresponding dependencies, while satisfying the graphical criterion for conditional ignorability (to what extent we can treat causation as association based on the causal assumptions). After we have postulated the structure, we can use the implied invariances to learn from observational data and plug in causal questions, inferring causal claims without randomized controlled trials.
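As a rough illustration of why adjusting for an observed confounder matters, the following sketch simulates observational data and compares a naive contrast with a backdoor-adjusted one. The variables, probabilities, and effect sizes are hypothetical and are not taken from the crop database; the only point is that stratifying on the confounder recovers the assumed effect.

```python
import random

random.seed(0)
TRUE_EFFECT = 2.0  # assumed causal effect of the treatment on the outcome

rows = []
for _ in range(20000):
    z = 1 if random.random() < 0.5 else 0                    # confounder, e.g. favourable soil
    t = 1 if random.random() < (0.8 if z else 0.2) else 0    # treatment is more likely when z = 1
    y = TRUE_EFFECT * t + 5.0 * z + random.gauss(0, 1)       # outcome depends on both
    rows.append((z, t, y))


def mean_y(z=None, t=None):
    """Average outcome over the rows matching the given confounder/treatment values."""
    vals = [y for (zi, ti, y) in rows
            if (z is None or zi == z) and (t is None or ti == t)]
    return sum(vals) / len(vals)


# Naive contrast: mixes the treatment effect with the effect of z.
naive = mean_y(t=1) - mean_y(t=0)

# Backdoor adjustment: average the stratum-specific contrasts, weighted by P(z).
p_z = {z: sum(1 for r in rows if r[0] == z) / len(rows) for z in (0, 1)}
adjusted = sum(p_z[z] * (mean_y(z=z, t=1) - mean_y(z=z, t=0)) for z in (0, 1))

print(f"naive={naive:.2f} adjusted={adjusted:.2f}")
```

The naive estimate is inflated because treated units are mostly those with the favourable confounder value, whereas the adjusted estimate stays close to the assumed effect.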

This solution uses both data from simulated randomized controlled trials (RCTs) and observational data from satellite imagery. A series of simulations conducted over thousands of fields and multiple years in Illinois (United States) is used to study the corn response to increasing nitrogen rates for a broad combination of weather and soil variation seen in the region. It addresses the limitation of trial data, which can explore only a limited number of soils and years, by using crop simulations of various farming scenarios and geographies. The database was calibrated and validated using data from more than 400 trials in the region. Initial nitrogen concentration in the soil was set randomly within a reasonable range.

Additionally, the database is enhanced with observations from satellite imagery, while zonal statistics are derived from spectral indices in order to represent spatio-temporal changes in vegetation, seen across geographies and phenological phases.

Causal inference with Bayesian networks

Structural causal models (SCMs) use graphical models to represent causal dependencies by incorporating both data-driven and human inputs. A particular type of structural causal model, the Bayesian network, is proposed to model the crop phenology dynamics using probabilistic expressions by representing variables as nodes and relationships between variables as edges. Nodes are indicators of crop growth, soil and weather conditions, and the edges between them represent spatio-temporal causal relationships. The parent nodes are field-related parameters (including the day of sowing and area planted), and the child nodes are yield, nitrogen uptake, and nitrogen leaching metrics.

For more information, refer to the database characterization and the guide for identifying the corn growth stages.

A few steps are required to build a Bayesian network model (with CausalNex) before we can use it for counterfactual and interventional analysis. The structure of the causal model is initially learned from data, while subject matter expertise (trusted literature or empirical beliefs) is used to postulate additional dependencies and independencies between random variables and intervention variables, as well as asserting that the structure is causal.

Using NO TEARS, a continuous optimization algorithm for structure learning, the graph structure describing conditional dependencies between variables is learned from data, with a set of constraints imposed on edges, parent nodes, and child nodes that are not allowed in the causal model. This preserves the temporal dependencies between variables. See the following code:

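# tabu_edges: Imposing edges that are not allowed in the causal model
# tabu_parents: Imposing parent nodes that are not allowed in the causal model
# tabu_child: Imposing child nodes that are not allowed in the causal model
from causalnex.structure.notears import from_pandas

# Assumed arguments: `data` is the training DataFrame; the tabu_* lists hold the constraints above.
g_learned = from_pandas(
    data,
    tabu_edges=tabu_edges,
    tabu_parent_nodes=tabu_parents,
    tabu_child_nodes=tabu_child,
)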
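To give a feel for what "continuous optimization for structure learning" means, the sketch below evaluates the smooth acyclicity function from the NO TEARS formulation, h(W) = trace(exp(W ∘ W)) − d, which is zero exactly when the weighted adjacency matrix W describes a DAG. This is a minimal pure-Python illustration with hypothetical 3 x 3 weight matrices, not the CausalNex implementation.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def acyclicity(w, terms=30):
    """h(W) = trace(exp(W * W)) - d, where W * W is the elementwise square.

    The matrix exponential is approximated by its truncated power series.
    """
    d = len(w)
    sq = [[w[i][j] ** 2 for j in range(d)] for i in range(d)]
    power = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # identity
    trace, factorial = float(d), 1.0
    for k in range(1, terms):
        power = matmul(power, sq)
        factorial *= k
        trace += sum(power[i][i] for i in range(d)) / factorial
    return trace - d


# Hypothetical edge weights: a chain 0 -> 1 -> 2, then the same chain closed into a cycle.
dag = [[0.0, 0.8, 0.0],
       [0.0, 0.0, 0.5],
       [0.0, 0.0, 0.0]]
cyclic = [[0.0, 0.8, 0.0],
          [0.0, 0.0, 0.5],
          [0.7, 0.0, 0.0]]

print(acyclicity(dag), acyclicity(cyclic))
```

Because h is differentiable, a solver can drive it toward zero while fitting the data, instead of searching over discrete graph structures.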

The next step encodes domain knowledge in models and captures phenology dynamics, while avoiding spurious relationships. Multicollinearity analysis, variance inflation factor analysis, and global feature importance using SHAP analysis are conducted to extract insights and constraints on water stress variables (expansion, phenology, and photosynthesis around flowering), weather and soil variables, spectral indices, and the nitrogen-based indicators:

# edges: Modifying the structure by imposing constraints on edges
from causalnex.structure import StructureModel

g = StructureModel()

Bayesian networks in CausalNex support only discrete distributions. Any continuous features, or features with a large number of categories, are discretized prior to fitting the Bayesian network:

from causalnex.discretiser.discretiser_strategy import (
    DecisionTreeSupervisedDiscretiserMethod,
)

discretiser = DecisionTreeSupervisedDiscretiserMethod(
    tree_params={"max_depth": 2, "random_state": 2022},
)
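The effect of discretization can be sketched without CausalNex. In the following minimal example, fixed split points stand in for the thresholds a supervised decision tree would learn; a tree of depth 2 has at most four leaves, so at most three thresholds. The nitrogen rates and thresholds are hypothetical.

```python
from bisect import bisect_right


def discretise(values, thresholds):
    """Map each continuous value to the index of the bin it falls into."""
    cuts = sorted(thresholds)
    return [bisect_right(cuts, v) for v in values]


# Hypothetical nitrogen rates (kg/ha) and three hypothetical split points.
n_rates = [0.0, 45.0, 90.0, 135.0, 180.0, 260.0]
bins = discretise(n_rates, thresholds=[60.0, 120.0, 200.0])
print(bins)  # [0, 0, 1, 2, 2, 3]
```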

After the structure is reviewed, the conditional probability distribution of each variable given its parents can be learned from data, in a step called likelihood estimation:

from causalnex.network import BayesianNetwork

bn = BayesianNetwork(g)
bn = bn.fit_node_states(discretised_data)
bn = bn.fit_cpds(discretised_data)  # assumed argument; estimator settings are not shown here
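Conceptually, likelihood estimation for a discrete node amounts to counting. The sketch below estimates P(child | parent) as relative frequencies, which is the maximum likelihood estimate; the estimator actually used by the solution may differ (for example, it may add a prior), and the records are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical discretised records: (N_fert bin, Y_corn bin).
records = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 0), (1, 1), (0, 0)]


def fit_cpd(pairs):
    """Estimate P(child | parent) as relative frequencies within each parent state."""
    counts = defaultdict(Counter)
    for parent, child in pairs:
        counts[parent][child] += 1
    return {
        parent: {child: n / sum(c.values()) for child, n in c.items()}
        for parent, c in counts.items()
    }


cpd = fit_cpd(records)
print(cpd)
```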

Finally, the structure and likelihoods are used to perform observational inference on the fly, following a deterministic Junction Tree algorithm (JTA), and making interventions using do-calculus. SageMaker Asynchronous Inference allows queuing incoming requests and processes them asynchronously. This option is ideal for both observational and counterfactual inference scenarios, where the process can't be parallelized, thereby taking significant time to update the probabilities throughout the network, although multiple queries can be run in parallel. See the following code:

# Query the marginal likelihood of states in the graph given some observations.
# These observations can be made anywhere in the network,
# and their impact will be propagated through to the node of interest.
from causalnex.inference import InferenceEngine

ie = InferenceEngine(bn)

pseudo_observation = [{"day_sow": 0}, {"day_sow": 1}, {"day_sow": 2}]
marginals_multi = ie.query(pseudo_observation)

# distribution before intervention
marginals_before = ie.query()["Y_corn"]

# updating a node distribution
ie.do_intervention("N_fert", 0)

# effect of do on marginals
marginals_after = ie.query()["Y_corn"]

# resetting the node distribution
ie.reset_do("N_fert")
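The difference between observing a node and intervening on it can be made concrete with a tiny network evaluated by brute-force enumeration. This is a sketch under stated assumptions: the three-node graph (soil -> N_fert, soil -> Y_corn, N_fert -> Y_corn) and all probabilities are hypothetical, and enumeration stands in for the junction tree algorithm used by the inference engine.

```python
from itertools import product

# Hypothetical CPTs; every variable is binary.
p_soil = {0: 0.5, 1: 0.5}
p_nfert = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.2, 1: 0.8}}                   # P(N_fert | soil)
p_yield_high = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.6, (1, 1): 0.8}    # P(Y_corn=1 | soil, N_fert)


def joint(soil, nfert, y, nfert_cpd):
    """Joint probability of one full assignment under the given N_fert CPD."""
    p_y1 = p_yield_high[(soil, nfert)]
    return p_soil[soil] * nfert_cpd[soil][nfert] * (p_y1 if y == 1 else 1 - p_y1)


def p_high_yield(nfert_cpd, observed_nfert=None):
    """P(Y_corn=1), optionally conditioned on an observed N_fert state."""
    num = den = 0.0
    for soil, nfert, y in product((0, 1), repeat=3):
        if observed_nfert is not None and nfert != observed_nfert:
            continue
        p = joint(soil, nfert, y, nfert_cpd)
        den += p
        num += p if y == 1 else 0.0
    return num / den


before = p_high_yield(p_nfert)
observed = p_high_yield(p_nfert, observed_nfert=1)

# do(N_fert=1): replace the node's CPD with a point mass, which cuts the soil -> N_fert edge.
do_cpd = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 1.0}}
intervened = p_high_yield(do_cpd)

print(f"before={before:.2f} observed={observed:.2f} do={intervened:.2f}")
```

In this toy network, conditioning on a high fertilizer rate overstates its benefit because high rates co-occur with favourable soil, while the intervention isolates the effect of the fertilizer itself.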

For further details, refer to the inference script.

The causal model notebook is a step-by-step guide on running the preceding steps.

Geospatial data processing

Earth Observation Jobs (EOJs) are chained together to acquire and transform satellite imagery, while purpose-built operations and pre-trained models are used for cloud removal, mosaicking, band math operations, and resampling. In this section, we discuss the geospatial processing steps in more detail.

Area of interest

In the following figure, green polygons are the selected counties, the orange grid is the database map (a grid of 10 x 10 km cells where trials are conducted in the region), and the grid of grayscale squares is the 100 km x 100 km Sentinel-2 UTM tiling grid.

Spatial files are used to map the simulated database with corresponding satellite imagery, overlaying polygons of 10 km x 10 km cells that divide the state of Illinois (where trials are conducted in the region), county polygons, and 100 km x 100 km Sentinel-2 UTM tiles. To optimize the geospatial data processing pipeline, a few nearby Sentinel-2 tiles are first selected. Next, the aggregated geometries of tiles and cells are overlaid in order to obtain the region of interest (RoI). The counties and the cell IDs that are fully observed within the RoI are selected to form the polygon geometry passed on to the EOJs.
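The selection logic can be sketched with axis-aligned rectangles in place of real polygons. This is a deliberate simplification: the solution works with actual tile, county, and cell geometries, and the coordinates and cell IDs below are hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned rectangle in projected kilometres (a stand-in for real polygons)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def union(boxes):
    """Bounding box of a group of adjacent tiles."""
    return Box(min(b.xmin for b in boxes), min(b.ymin for b in boxes),
               max(b.xmax for b in boxes), max(b.ymax for b in boxes))


def contains(outer, inner):
    """True when `inner` lies fully within `outer`."""
    return (outer.xmin <= inner.xmin and outer.ymin <= inner.ymin
            and outer.xmax >= inner.xmax and outer.ymax >= inner.ymax)


# Two adjacent hypothetical 100 km tiles and three hypothetical 10 km cells.
tiles = [Box(0, 0, 100, 100), Box(100, 0, 200, 100)]
cells = {"c1": Box(95, 40, 105, 50), "c2": Box(195, 40, 205, 50), "c3": Box(10, 10, 20, 20)}

roi = union(tiles)
selected = sorted(cid for cid, cell in cells.items() if contains(roi, cell))
print(selected)  # ['c1', 'c3']
```

Cell c1 straddles the boundary between the two tiles and is kept because the aggregated geometry covers it, whereas c2 extends beyond the RoI and is dropped.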

Time vary

For this exercise, the corn phenology cycle is divided into three stages: the vegetative stages V5 to R1 (emergence, leaf collars, and tasseling), the reproductive stages R1 to R4 (silking, blister, milk, and dough), and the reproductive stages R5 (dented) and R6 (physiological maturity). Consecutive satellite visits are acquired for each phenology stage within a time range of two weeks and a predefined area of interest (selected counties), enabling spatial and temporal analysis of satellite imagery. The following figure illustrates these metrics.
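A two-week acquisition window per stage can be derived from a calendar week as sketched below. The calendar weeks (CW 20, 26, and 33 of 2018) come from the figure captions later in this post; assigning each week to a named stage, and starting each window on the Monday of that week, are assumptions made for illustration.

```python
from datetime import date, timedelta

# Hypothetical stage labels mapped to the calendar weeks mentioned in the figure captions.
stage_weeks = {"vegetative": 20, "reproductive_early": 26, "reproductive_late": 33}


def two_week_window(year, iso_week):
    """Return (start, end) dates spanning two weeks from the Monday of an ISO week."""
    start = date.fromisocalendar(year, iso_week, 1)
    return start, start + timedelta(days=14)


time_ranges = {stage: two_week_window(2018, week) for stage, week in stage_weeks.items()}
for stage, (start, end) in time_ranges.items():
    print(stage, start.isoformat(), end.isoformat())
```

The resulting start and end dates could then be formatted into the StartTime and EndTime values of the TimeRangeFilter used by the EOJ input configuration.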

Cloud removal

Cloud removal for Sentinel-2 data uses an ML-based semantic segmentation model to identify clouds in the image, where cloudy pixels are replaced with the value -9999 (the nodata value):

request_polygon_coordinates = [[(-90.571754, 39.839326), (-90.893651, 39.84092), (-90.916609, 39.845075), (-90.916071, 39.757168), (-91.147678, 39.75707), (-91.265848, 39.757258), (-91.365125, 39.758723), (-91.367962, 39.759124), (-91.365396, 39.777266), (-91.432919, 39.840554), (-91.446385, 39.870394), (-91.455887, 39.945538), (-91.460287, 39.980333), (-91.494865, 40.037421), (-91.510322, 40.127994), (-91.512974, 40.181062), (-91.510332, 40.201142), (-91.258828, 40.197299), (-90.911969, 40.193088), (-90.909756, 40.284394), (-90.450227, 40.276335), (-90.451502, 40.188892), (-90.199556, 40.183945), (-90.118966, 40.235263), (-90.033026, 40.377806), (-89.92468, 40.435921), (-89.717104, 40.435655), (-89.714927, 40.319218), (-89.602979, 40.320129), (-89.601604, 40.122432), (-89.578289, 39.976127), (-89.698259, 39.975309), (-89.701864, 39.916787), (-89.994506, 39.901925), (-89.994405, 39.87286), (-90.583534, 39.87675), (-90.582435, 39.854574), (-90.571754, 39.839326)]]

eoj_input_config = {
    "RasterDataCollectionQuery": {
        "RasterDataCollectionArn": 'arn:aws:sagemaker-geospatial:us-west-2:378778860802:raster-data-collection/public/nmqj48dcu3g7ayw8',
        "AreaOfInterest": {
            "AreaOfInterestGeometry": {
                "PolygonGeometry": {"Coordinates": request_polygon_coordinates}
            }
        },
        "TimeRangeFilter": {"StartTime": start_time, "EndTime": end_time},
        "PropertyFilters": {
            "Properties": [
                {"Property": {"EoCloudCover": {"LowerBound": 0, "UpperBound": 10}}}
            ],
            "LogicalOperator": "AND",
        },
    }
}

eoj_config = {
    "JobConfig": {
        "CloudRemovalConfig": {
            "AlgorithmName": "INTERPOLATION",
            "InterpolationValue": "-9999",
            "TargetBands": ["red", "green", "blue", "nir", "swir16"],
        },
    }
}

eojParams = {
    "Name": "cloudremoval",
    "InputConfig": eoj_input_config,
    **eoj_config,  # supplies the required "JobConfig" key
    "ExecutionRoleArn": role_arn,
}

eoj_response = sg_client.start_earth_observation_job(**eojParams)

After the EOJ is created, the ARN is returned and used to perform the subsequent geomosaic operation.

To get the status of a job, you can run sg_client.get_earth_observation_job(Arn=eoj_response["Arn"]).
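Because each EOJ in the chain consumes the output of the previous one, it can help to wait for a job to finish before starting the next. The following helper is a sketch, not part of the solution: the function name, polling interval, and the set of terminal status strings are assumptions, and it only relies on the get_earth_observation_job call shown above returning a "Status" field.

```python
import time


def wait_for_eoj(client, arn, poll_seconds=30, max_polls=120):
    """Poll an Earth Observation Job until it reaches a terminal status."""
    terminal = {"COMPLETED", "FAILED", "STOPPED"}  # assumed terminal statuses
    for _ in range(max_polls):
        status = client.get_earth_observation_job(Arn=arn)["Status"]
        if status in terminal:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"EOJ {arn} did not finish after {max_polls} polls")
```

With the client used in this post, a call might look like wait_for_eoj(sg_client, eoj_response["Arn"]).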


The geomosaic EOJ is used to merge images from multiple satellite visits into a large mosaic, by overwriting nodata or transparent pixels (including the cloudy pixels) with pixels from other timestamps:

eoj_config = {"JobConfig": {"GeoMosaicConfig": {"AlgorithmName": "NEAR"}}}

eojParams = {
    "Name": "geomosaic",
    "InputConfig": {"PreviousEarthObservationJobArn": eoj_arn},
    **eoj_config,  # supplies the required "JobConfig" key
    "ExecutionRoleArn": role_arn,
}

eoj_response = sg_client.start_earth_observation_job(**eojParams)

After the EOJ is created, the ARN is returned and used to perform the subsequent resampling operation.
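The idea behind mosaicking across visits can be sketched on tiny rasters. The rule used here, taking the first valid value per pixel in visit order, is an assumption chosen for simplicity and is not necessarily how the managed geomosaic operation composites its inputs; the pixel values are hypothetical.

```python
NODATA = -9999


def mosaic(layers):
    """Fill each pixel with the first valid value found across the input layers."""
    rows, cols = len(layers[0]), len(layers[0][0])
    out = [[NODATA] * cols for _ in range(rows)]
    for layer in layers:
        for r in range(rows):
            for c in range(cols):
                if out[r][c] == NODATA and layer[r][c] != NODATA:
                    out[r][c] = layer[r][c]
    return out


# Two hypothetical visits over the same 2 x 2 area; NODATA marks removed cloudy pixels.
visit_1 = [[120, NODATA], [NODATA, 90]]
visit_2 = [[115, 80], [NODATA, 95]]
merged = mosaic([visit_1, visit_2])
print(merged)  # [[120, 80], [-9999, 90]]
```

A pixel that is cloudy in every visit stays at the nodata value, which is why downstream statistics need to skip it.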


Resampling is used to downscale the resolution of the geospatial image in order to match the resolution of the crop mask (10–30 m resolution rescaling):

eoj_config = {
    "JobConfig": {
        "ResamplingConfig": {
            "OutputResolution": {"UserDefined": {"Value": 30, "Unit": "METERS"}},
            "AlgorithmName": "NEAR",
        },
    }
}

eojParams = {
    "Name": "resample",
    "InputConfig": {"PreviousEarthObservationJobArn": eoj_arn},
    **eoj_config,  # supplies the required "JobConfig" key
    "ExecutionRoleArn": role_arn,
}

eoj_response = sg_client.start_earth_observation_job(**eojParams)

After the EOJ is created, the ARN is returned and used to perform the subsequent band math operation.
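As a rough picture of nearest-neighbour resampling from 10 m to 30 m, the sketch below keeps, for each output cell, the source pixel closest to its centre. It is a simplified stand-in for the managed operation (no georeferencing, square pixels, integer scale factor), and the raster values are hypothetical.

```python
def resample_nearest(grid, src_res, dst_res):
    """Downscale a raster by sampling the source pixel nearest to each output cell centre."""
    factor = dst_res / src_res
    rows = int(len(grid) / factor)
    cols = int(len(grid[0]) / factor)
    return [
        [grid[int((r + 0.5) * factor)][int((c + 0.5) * factor)] for c in range(cols)]
        for r in range(rows)
    ]


# A hypothetical 6 x 6 raster at 10 m becomes 2 x 2 at 30 m.
fine = [[10 * r + c for c in range(6)] for r in range(6)]
coarse = resample_nearest(fine, src_res=10, dst_res=30)
print(coarse)  # [[11, 14], [41, 44]]
```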

Band math

Band math operations are used for transforming the observations from multiple spectral bands to a single band. They include the following spectral indices:

  • EVI2 – Two-Band Enhanced Vegetation Index
  • GDVI – Generalized Difference Vegetation Index
  • NDMI – Normalized Difference Moisture Index
  • NDVI – Normalized Difference Vegetation Index
  • NDWI – Normalized Difference Water Index

See the next code:

spectral_indices = [['EVI2', ' 2.5 * ( nir - red ) / ( nir + 2.4 * red + 1.0 ) '],
 ['GDVI', ' ( ( nir * * 2.0 ) - ( red * * 2.0 ) ) / ( ( nir * * 2.0 ) + ( red * * 2.0 ) ) '],
 ['NDMI', ' ( nir - swir16 ) / ( nir + swir16 ) '],
 ['NDVI', ' ( nir - red ) / ( nir + red ) '],
 ['NDWI', ' ( green - nir ) / ( green + nir ) ']]

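eoj_config = {
    "JobConfig": {
        "BandMathConfig": {"CustomIndices": {"Operations": []}},
    }
}

# Register each index, trimming the leading and trailing space of its equation.
for indices in spectral_indices:
    eoj_config["JobConfig"]["BandMathConfig"]["CustomIndices"]["Operations"].append(
        {"Name": indices[0], "Equation": indices[1][1:-1]}
    )

eojParams = {
    "Name": "bandmath",
    "InputConfig": {"PreviousEarthObservationJobArn": eoj_arn},
    **eoj_config,  # supplies the required "JobConfig" key
    "ExecutionRoleArn": role_arn,
}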

eoj_response = sg_client.start_earth_observation_job(**eojParams)
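To see what these equations produce, the sketch below evaluates four of the indices for a single pixel, using the same formulas as the equation strings above. The reflectance values are hypothetical and assumed to be scaled between 0 and 1.

```python
def ndvi(nir, red):
    return (nir - red) / (nir + red)


def evi2(nir, red):
    return 2.5 * (nir - red) / (nir + 2.4 * red + 1.0)


def ndmi(nir, swir16):
    return (nir - swir16) / (nir + swir16)


def ndwi(green, nir):
    return (green - nir) / (green + nir)


# Hypothetical surface reflectances for a densely vegetated pixel.
pixel = {"green": 0.08, "red": 0.05, "nir": 0.45, "swir16": 0.20}
indices = {
    "NDVI": ndvi(pixel["nir"], pixel["red"]),
    "EVI2": evi2(pixel["nir"], pixel["red"]),
    "NDMI": ndmi(pixel["nir"], pixel["swir16"]),
    "NDWI": ndwi(pixel["green"], pixel["nir"]),
}
for name, value in indices.items():
    print(f"{name}: {value:.3f}")
```

Dense vegetation reflects strongly in the near infrared and weakly in red, so NDVI and EVI2 are high for this pixel while NDWI is negative.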

Zonal statistics

The spectral indices are further enriched using Amazon SageMaker Processing, where GDAL-based custom logic is used to do the following:

  • Merge the spectral indices into a single multi-channel mosaic
  • Reproject the mosaic to the crop mask's projection
  • Apply the crop mask and reproject the mosaic to the CRS of the cell polygons
  • Calculate zonal statistics for selected polygons (10 km x 10 km cells)
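The last step in the list above can be sketched in a few lines. This minimal example aggregates pixel values per zone while skipping nodata pixels; it works on flat lists rather than rasters and polygons, and the NDVI values and cell IDs are hypothetical, so it only illustrates the aggregation, not the GDAL-based implementation.

```python
NODATA = -9999


def zonal_stats(values, zones):
    """Aggregate per-zone mean, min, max and count, skipping nodata pixels."""
    grouped = {}
    for value, zone in zip(values, zones):
        if value != NODATA:
            grouped.setdefault(zone, []).append(value)
    return {
        zone: {"mean": sum(v) / len(v), "min": min(v), "max": max(v), "count": len(v)}
        for zone, v in grouped.items()
    }


# Hypothetical NDVI pixels and the ID of the 10 km cell each pixel falls in.
ndvi_pixels = [0.8, 0.6, NODATA, 0.4, 0.2]
cell_ids = ["cell_a", "cell_a", "cell_a", "cell_b", "cell_b"]
stats = zonal_stats(ndvi_pixels, cell_ids)
print(stats)
```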

With parallelized data distribution, manifest files (for each crop phenological stage) are distributed across several instances using the ShardedByS3Key S3 data distribution type. For further details, refer to the feature extraction script.

The geospatial processing notebook is a step-by-step guide on running the preceding steps.

The following figure shows RGB channels of consecutive satellite visits representing the vegetative and reproductive stages of the corn phenology cycle, with (right) and without (left) crop masks (CW 20, 26 and 33, 2018 Central Illinois).

In the following figure, spectral indices (NDVI, EVI2, NDMI) of consecutive satellite visits represent the vegetative and reproductive stages of the corn phenology cycle (CW 20, 26 and 33, 2018 Central Illinois).

Clean up

If you no longer want to use this solution, you can delete the resources it created. After the solution is deployed in Studio, choose Delete all resources to automatically delete all standard resources that were created when launching the solution, including the S3 bucket.


This solution provides a blueprint for use cases where causal inference with Bayesian networks is the preferred methodology for answering causal questions from a combination of data and human inputs. The workflow includes an efficient implementation of the inference engine, which queues incoming queries and interventions and processes them asynchronously. The modular design enables the reuse of various components, including geospatial processing with purpose-built operations and pre-trained models, enrichment of satellite imagery with custom-built GDAL operations, and multimodal feature engineering (spectral indices and tabular data).

In addition, you can use this solution as a template for building gridded crop models where nitrogen fertilizer management and environmental policy analysis are conducted.

For more information, refer to Solution Templates and follow the guide to launch the Crop Yield Counterfactuals solution in the US West (Oregon) Region. The code is available in the GitHub repo.


German Mandrini, Sotirios V. Archontoulis, Cameron M. Pittelkow, Taro Mieno, Nicolas F. Martin,
Simulated dataset of corn response to nitrogen over thousands of fields and multiple years in Illinois,
Data in Brief, Volume 40, 2022, 107753, ISSN 2352-3409

Useful resources

About the Authors

Paul Barna is a Senior Data Scientist with the Machine Learning Prototyping Labs at AWS.


