Monday, June 5, 2023
HomeArtificial IntelligenceRun safe processing jobs utilizing PySpark in Amazon SageMaker Pipelines

Run safe processing jobs utilizing PySpark in Amazon SageMaker Pipelines

Amazon SageMaker Studio will help you construct, prepare, debug, deploy, and monitor your fashions and handle your machine studying (ML) workflows. Amazon SageMaker Pipelines lets you construct a safe, scalable, and versatile MLOps platform inside Studio.

On this put up, we clarify tips on how to run PySpark processing jobs inside a pipeline. This allows anybody that wishes to coach a mannequin utilizing Pipelines to additionally preprocess coaching information, postprocess inference information, or consider fashions utilizing PySpark. This functionality is particularly related when it is advisable course of large-scale information. As well as, we showcase tips on how to optimize your PySpark steps utilizing configurations and Spark UI logs.

Pipelines is an Amazon SageMaker device for constructing and managing end-to-end ML pipelines. It’s a totally managed on-demand service, built-in with SageMaker and different AWS companies, and due to this fact creates and manages sources for you. This ensures that cases are solely provisioned and used when operating the pipelines. Moreover, Pipelines is supported by the SageMaker Python SDK, letting you monitor your information lineage and reuse steps by caching them to ease growth time and price. A SageMaker pipeline can use processing steps to course of information or carry out mannequin analysis.

When processing large-scale information, information scientists and ML engineers typically use PySpark, an interface for Apache Spark in Python. SageMaker gives prebuilt Docker photographs that embrace PySpark and different dependencies wanted to run distributed information processing jobs, together with information transformations and have engineering utilizing the Spark framework. Though these photographs assist you to rapidly begin utilizing PySpark in processing jobs, large-scale information processing typically requires particular Spark configurations with the intention to optimize the distributed computing of the cluster created by SageMaker.

In our instance, we create a SageMaker pipeline operating a single processing step. For extra details about what different steps you’ll be able to add to a pipeline, check with Pipeline Steps.

SageMaker Processing library

SageMaker Processing can run with particular frameworks (for instance, SKlearnProcessor, PySparkProcessor, or Hugging Face). Impartial of the framework used, every ProcessingStep requires the next:

  • Step identify – The identify for use in your SageMaker pipeline step
  • Step arguments – The arguments in your ProcessingStep

Moreover, you’ll be able to present the next:

  • The configuration in your step cache with the intention to keep away from pointless runs of your step in a SageMaker pipeline
  • A listing of step names, step cases, or step assortment cases that the ProcessingStep is determined by
  • The show identify of the ProcessingStep
  • An outline of the ProcessingStep
  • Property recordsdata
  • Retry insurance policies

The arguments are handed over to the ProcessingStep. You should use the sagemaker.spark.PySparkProcessor or sagemaker.spark.SparkJarProcessor class to run your Spark software within a processing job.

Every processor comes with its personal wants, relying on the framework. That is greatest illustrated utilizing the PySparkProcessor, the place you’ll be able to go extra data to optimize the ProcessingStep additional, for example by way of the configuration parameter when operating your job.

Run SageMaker Processing jobs in a safe surroundings

It’s greatest apply to create a personal Amazon VPC and configure it in order that your jobs aren’t accessible over the general public web. SageMaker Processing jobs assist you to specify the non-public subnets and safety teams in your VPC in addition to allow community isolation and inter-container visitors encryption utilizing the NetworkConfig.VpcConfig request parameter of the CreateProcessingJob API. We offer examples of this configuration utilizing the SageMaker SDK within the subsequent part.

PySpark ProcessingStep inside SageMaker Pipelines

For this instance, we assume that you’ve got Studio deployed in a safe surroundings already accessible, together with VPC, VPC endpoints, safety teams, AWS Id and Entry Administration (IAM) roles, and AWS Key Administration Service (AWS KMS) keys. We additionally assume that you’ve got two buckets: one for artifacts like code and logs, and one in your information. The basic_infra.yaml file gives instance AWS CloudFormation code to provision the mandatory prerequisite infrastructure. The instance code and deployment information can also be accessible on GitHub.

For instance, we arrange a pipeline containing a single ProcessingStep through which we’re merely studying and writing the abalone dataset utilizing Spark. The code samples present you tips on how to arrange and configure the ProcessingStep.

We outline parameters for the pipeline (identify, function, buckets, and so forth) and step-specific settings (occasion sort and rely, framework model, and so forth). On this instance, we use a safe setup and likewise outline subnets, safety teams, and the inter-container visitors encryption. For this instance, you want a pipeline execution function with SageMaker full entry and a VPC. See the next code:

	"pipeline_name": "ProcessingPipeline",
	"trial": "test-blog-post",
	"pipeline_role": "arn:aws:iam::<ACCOUNT_NUMBER>:function/<PIPELINE_EXECUTION_ROLE_NAME>",
	"network_subnet_ids": [
	"network_security_group_ids": [
	"pyspark_process_volume_kms": "arn:aws:kms:<REGION_NAME>:<ACCOUNT_NUMBER>:key/<KMS_KEY_ID>",
	"pyspark_process_output_kms": "arn:aws:kms:<REGION_NAME>:<ACCOUNT_NUMBER>:key/<KMS_KEY_ID>",
	"pyspark_helper_code": "s3://<INFRA_S3_BUCKET>/src/helper/",
	"spark_config_file": "s3://<INFRA_S3_BUCKET>/src/spark_configuration/configuration.json",
	"pyspark_process_code": "s3://<INFRA_S3_BUCKET>/src/processing/",
	"process_spark_ui_log_output": "s3://<DATA_S3_BUCKET>/spark_ui_logs/{}",
	"pyspark_framework_version": "2.4",
	"pyspark_process_name": "pyspark-processing",
	"pyspark_process_data_input": "s3a://<DATA_S3_BUCKET>/data_input/abalone_data.csv",
	"pyspark_process_data_output": "s3a://<DATA_S3_BUCKET>/pyspark/data_output",
	"pyspark_process_instance_type": "ml.m5.4xlarge",
	"pyspark_process_instance_count": 6,
	"tags": {
		"Undertaking": "tag-for-project",
		"Proprietor": "tag-for-owner"

To exhibit, the next code instance runs a PySpark script on SageMaker Processing inside a pipeline through the use of the PySparkProcessor:

# import code necessities
# normal libraries import
import logging
import json

# sagemaker mannequin import
import sagemaker
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.pipeline_experiment_config import PipelineExperimentConfig
from sagemaker.workflow.steps import CacheConfig
from sagemaker.processing import ProcessingInput
from sagemaker.workflow.steps import ProcessingStep
from sagemaker.workflow.pipeline_context import PipelineSession
from sagemaker.spark.processing import PySparkProcessor

from helpers.infra.networking.networking import get_network_configuration
from helpers.infra.tags.tags import get_tags_input
from helpers.pipeline_utils import get_pipeline_config

def create_pipeline(pipeline_params, logger):
        pipeline_params ( pipeline parameters
        logger (logger): logger
    # Create SageMaker Session
    sagemaker_session = PipelineSession()

    # Get Tags
    tags_input = get_tags_input(pipeline_params["tags"])

    # get community configuration
    network_config = get_network_configuration(

    # Get Pipeline Configurations
    pipeline_config = get_pipeline_config(pipeline_params)

    # setting processing cache obj"Setting " + pipeline_params["pyspark_process_name"] + " cache configuration 3 to 30 days")
    cache_config = CacheConfig(enable_caching=True, expire_after="p30d")

    # Create PySpark Processing Step"Creating " + pipeline_params["pyspark_process_name"] + " processor")

    # establishing spark processor
    processing_pyspark_processor = PySparkProcessor(
    # establishing arguments
    run_ags =
        # processing input arguments. To add new arguments to this list you need to provide two entrances:
        # 1st is the argument name preceded by "--" and the 2nd is the argument value
        # setting up processing arguments
            "--input_table", pipeline_params["pyspark_process_data_input"],
            "--output_table", pipeline_params["pyspark_process_data_output"]
        inputs = [
                vacation spot="/decide/ml/processing/enter/conf",

    # create step
    pyspark_processing_step = ProcessingStep(

    # Create Pipeline
    pipeline = Pipeline(
        description="Instance pipeline",
    return pipeline

def most important():
    # arrange logging
    logger = logging.getLogger(__name__)
    logger.setLevel(logging.INFO)"Get Pipeline Parameter")

    with open("ml_pipeline/params/pipeline_params.json", "r") as f:
        pipeline_params = json.load(f)
    print(pipeline_params)"Create Pipeline")
    pipeline = create_pipeline(pipeline_params, logger=logger)"Execute Pipeline")
    execution = pipeline.begin()
    return execution

if __name__ == "__main__":
    most important()

As proven within the previous code, we’re overwriting the default Spark configurations by offering configuration.json as a ProcessingInput. We use a configuration.json file that was saved in Amazon Easy Storage Service (Amazon S3) with the next settings:


We will replace the default Spark configuration both by passing the file as a ProcessingInput or through the use of the configuration argument when operating the run() operate.

The Spark configuration relies on different choices, just like the occasion sort and occasion rely chosen for the processing job. The primary consideration is the variety of cases, the vCPU cores that every of these cases have, and the occasion reminiscence. You should use Spark UIs or CloudWatch occasion metrics and logs to calibrate these values over a number of run iterations.

As well as, the executor and driver settings will be optimized even additional. For an instance of tips on how to calculate these, check with Greatest practices for efficiently managing reminiscence for Apache Spark purposes on Amazon EMR.

Subsequent, for driver and executor settings, we suggest investigating the committer settings to enhance efficiency when writing to Amazon S3. In our case, we’re writing Parquet recordsdata to Amazon S3 and setting “spark.sql.parquet.fs.optimized.comitter.optimization-enabled” to true.

If wanted for a connection to Amazon S3, a regional endpoint “spark.hadoop.fs.s3a.endpoint” will be specified inside the configurations file.

On this instance pipeline, the PySpark script (as proven within the following code) masses a CSV file from Amazon S3 right into a Spark information body, and saves the info as Parquet again to Amazon S3.

Word that our instance configuration will not be proportionate to the workload as a result of studying and writing the abalone dataset might be achieved on default settings on one occasion. The configurations we talked about needs to be outlined primarily based in your particular wants.

# import necessities
import argparse
import logging
import sys
import os
import pandas as pd

# spark imports
from pyspark.sql import SparkSession
from pyspark.sql.features import (udf, col)
from pyspark.sql.varieties import StringType, StructField, StructType, FloatType

from data_utils import(

sys.stdout = Unbuffered(sys.stdout)

# Outline customized handler
logger = logging.getLogger(__name__)
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))

def most important(data_path):

    spark = SparkSession.builder.appName("PySparkJob").getOrCreate()

    schema = StructType(
            StructField("sex", StringType(), True),
            StructField("length", FloatType(), True),
            StructField("diameter", FloatType(), True),
            StructField("height", FloatType(), True),
            StructField("whole_weight", FloatType(), True),
            StructField("shucked_weight", FloatType(), True),
            StructField("viscera_weight", FloatType(), True),
            StructField("rings", FloatType(), True),

    df = spark.learn.csv(data_path, header=False, schema=schema)
    return df.choose("intercourse", "size", "diameter", "rings")

if __name__ == "__main__":"===============================================================")"================= Beginning pyspark-processing =================")
    parser = argparse.ArgumentParser(description="app inputs")
    parser.add_argument("--input_table", sort=str, assist="path to the channel information")
    parser.add_argument("--output_table", sort=str, assist="path to the output information")
    args = parser.parse_args()
    df = most important(args.input_table)"Writing remodeled information")
    df.write.csv( a part of(args.output_table, "remodeled.csv"), header=True, mode="overwrite")

    # save information
    df.coalesce(10).write.mode("overwrite").parquet(args.output_table)"================== Ending pyspark-processing ==================")"===============================================================")

To dive into optimizing Spark processing jobs, you need to use the CloudWatch logs in addition to the Spark UI. You possibly can create the Spark UI by operating a Processing job on a SageMaker pocket book occasion. You possibly can view the Spark UI for the Processing jobs operating inside a pipeline by operating the historical past server inside a SageMaker pocket book occasion if the Spark UI logs have been saved inside the similar Amazon S3 location.

Clear up

In case you adopted the tutorial, it’s good apply to delete sources which are now not used to cease incurring costs. Ensure that to delete the CloudFormation stack that you just used to create your sources. This may delete the stack created in addition to the sources it created.


On this put up, we confirmed tips on how to run a safe SageMaker Processing job utilizing PySpark inside SageMaker Pipelines. We additionally demonstrated tips on how to optimize PySpark utilizing Spark configurations and arrange your Processing job to run in a safe networking configuration.

As a subsequent step, discover tips on how to automate your entire mannequin lifecycle and the way clients constructed safe and scalable MLOps platforms utilizing SageMaker companies.

In regards to the Authors

Maren Suilmann is a Information Scientist at AWS Skilled Companies. She works with clients throughout industries unveiling the ability of AI/ML to realize their enterprise outcomes. Maren has been with AWS since November 2019. In her spare time, she enjoys kickboxing, climbing to nice views, and board sport nights.

Maira Ladeira Tanke
is an ML Specialist at AWS. With a background in information science, she has 9 years of expertise architecting and constructing ML purposes with clients throughout industries. As a technical lead, she helps clients speed up their achievement of enterprise worth via rising applied sciences and revolutionary options. In her free time, Maira enjoys touring and spending time along with her household someplace heat.

Pauline Ting
is Information Scientist within the AWS Skilled Companies group. She helps clients in attaining and accelerating their enterprise end result by creating AI/ML options. In her spare time, Pauline enjoys touring, browsing, and making an attempt new dessert locations.

Donald Fossouo
is a Sr Information Architect within the AWS Skilled Companies group, principally working with International Finance Service. He engages with clients to create revolutionary options that deal with buyer enterprise issues and speed up the adoption of AWS companies. In his spare time, Donald enjoys studying, operating, and touring.



Please enter your comment!
Please enter your name here

- Advertisment -

Most Popular

Recent Comments