ETL Articles / Blogs / Perficient (https://blogs.perficient.com/tag/etl/)

Mastering Databricks Jobs API: Build and Orchestrate Complex Data Pipelines
https://blogs.perficient.com/2025/06/06/mastering-databricks-jobs-api-build-and-orchestrate-complex-data-pipelines/ (Fri, 06 Jun 2025)

In this post, we’ll dive into orchestrating data pipelines with the Databricks Jobs API, empowering you to automate, monitor, and scale workflows seamlessly within the Databricks platform.

Why Orchestrate with Databricks Jobs API?

When data pipelines become complex, involving multiple steps—like running notebooks, updating Delta tables, or training machine learning models—you need a reliable way to automate and manage them. The Databricks Jobs API offers a flexible and efficient way to automate your jobs/workflows directly within Databricks or from external systems (for example, AWS Lambda or Azure Functions) using the API endpoints.

Unlike external orchestrators such as Apache Airflow or Dagster, which require separate infrastructure and integration, the Jobs API is built natively into the Databricks platform. And the best part? It doesn’t cost anything extra. The Databricks Jobs API allows you to fully manage the lifecycle of your jobs/workflows using simple HTTP requests.

Below is the list of API endpoints for the CRUD operations on the workflows:

  • Create: Set up new jobs with defined tasks and configurations via the POST /api/2.1/jobs/create endpoint. Define single- or multi-task jobs, specifying the tasks to be executed (e.g., notebooks, JARs, Python scripts), their dependencies, and the compute resources.
  • Retrieve: Access job details, check statuses, and review run logs using GET /api/2.1/jobs/get or GET /api/2.1/jobs/list.
  • Update: Change job settings such as parameters, task sequences, or cluster details through POST /api/2.1/jobs/update and /api/2.1/jobs/reset.
  • Delete: Remove jobs that are no longer required using POST /api/2.1/jobs/delete.

These full CRUD capabilities make the Jobs API a powerful tool to automate job management completely, from creation and monitoring to modification and deletion—eliminating the need for manual handling.
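As a quick illustration of the request pattern these endpoints share, here is a minimal Python sketch that lists the jobs in a workspace; the workspace URL and token are placeholders to replace with your own, and the field names follow the Jobs 2.1 list response:

import requests

host = "https://<databricks-instance>.cloud.databricks.com"  # your workspace URL
token = "<Your-PAT>"                                          # personal access token

# List jobs in the workspace; the 2.1 endpoint returns a paginated "jobs" array
response = requests.get(
    f"{host}/api/2.1/jobs/list",
    headers={"Authorization": f"Bearer {token}"},
)
for job in response.json().get("jobs", []):
    print(job["job_id"], job["settings"]["name"])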

Key components of a Databricks Job

  • Tasks: Individual units of work within a job, such as running a notebook, JAR, Python script, or dbt task. Jobs can have multiple tasks with defined dependencies and conditional execution.
  • Dependencies: Relationships between tasks that determine the order of execution, allowing you to build complex workflows with sequential or parallel steps.
  • Clusters: The compute resources on which tasks run. These can be ephemeral job clusters created specifically for the job or existing all-purpose clusters shared across jobs.
  • Retries: Configuration to automatically retry failed tasks to improve job reliability.
  • Scheduling: Options to run jobs on cron-based schedules, triggered events, or on demand.
  • Notifications: Alerts for job start, success, or failure to keep teams informed.

Getting started with the Databricks Jobs API

Before leveraging the Databricks Jobs API for orchestration, ensure you have access to a Databricks workspace, a valid Personal Access Token (PAT), and sufficient privileges to manage compute resources and job configurations. This guide will walk through key CRUD operations and relevant Jobs API endpoints for robust workflow automation.

1. Creating a New Job/Workflow:

To create a job, you send a POST request to the /api/2.1/jobs/create endpoint with a JSON payload defining the job configuration.

{
  "name": "Ingest-Sales-Data",
  "tasks": [
    {
      "task_key": "Ingest-CSV-Data",
      "notebook_task": {
        "notebook_path": "/Users/name@email.com/ingest_csv_notebook",
        "source": "WORKSPACE"
      },
      "new_cluster": {
        "spark_version": "15.4.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "num_workers": 2
      }
    }
  ],
  "schedule": {
    "quartz_cron_expression": "0 30 9 * * ?",
    "timezone_id": "UTC",
    "pause_status": "UNPAUSED"
  },
  "email_notifications": {
    "on_failure": [
      "name@email.com"
    ]
  }
}

This JSON payload defines a Databricks job that executes a notebook-based task on a newly provisioned cluster, scheduled to run daily at 9:30 AM UTC. The components of the payload are explained below:

  • name: The name of your job.
  • tasks: An array of tasks to be executed. A job can have one or more tasks.
    • task_key: A unique identifier for the task within the job. Used for defining dependencies.
    • notebook_task: Specifies a notebook task. Other task types include spark_jar_task, spark_python_task, spark_submit_task, pipeline_task, etc.
      • notebook_path: The path to the notebook in your Databricks workspace.
      • source: The source of the notebook (e.g., WORKSPACE, GIT).
    • new_cluster: Defines the configuration for a new cluster that will be created for this job run. You can also use existing_cluster_id to use an existing all-purpose cluster (though new job clusters are recommended).
      • spark_version, node_type_id, num_workers: Standard cluster configuration options.
  • schedule: Defines the job schedule using a cron expression and timezone.
  • email_notifications: Configures email notifications for job events.

To create a Databricks workflow, the above JSON payload can be included in the body of a POST request sent to the Jobs API’s create endpoint—either using curl or programmatically via the Python requests library as shown below:

Using Curl:

curl -X POST \
  https://<databricks-instance>.cloud.databricks.com/api/2.1/jobs/create \
  -H "Authorization: Bearer <Your-PAT>" \
  -H "Content-Type: application/json" \
  -d '@workflow_config.json' #Place the above payload in workflow_config.json

Using Python requests library:

import requests
import json
create_response = requests.post("https://<databricks-instance>.cloud.databricks.com/api/2.1/jobs/create", data=json.dumps(your_json_payload), auth=("token", token))
if create_response.status_code == 200:
    job_id = json.loads(create_response.content.decode('utf-8'))["job_id"]
    print("Job created with id: {}".format(job_id))
else:
    print("Job creation failed with status code: {}".format(create_response.status_code))
    print(create_response.text)

The above example demonstrated a basic single-task workflow. However, the full potential of the Jobs API lies in orchestrating multi-task workflows with dependencies. The tasks array in the job payload allows you to configure multiple dependent tasks.
For example, the following workflow defines three tasks that execute sequentially: Ingest-CSV-Data → Transform-Sales-Data → Write-to-Delta.

{
  "name": "Ingest-Sales-Data-Pipeline",
  "tasks": [
    {
      "task_key": "Ingest-CSV-Data",
      "notebook_task": {
        "notebook_path": "/Users/name@email.com/ingest_csv_notebook",
        "source": "WORKSPACE"
      },
      "new_cluster": {
        "spark_version": "15.4.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "num_workers": 2
      }
    },
    {
      "task_key": "Transform-Sales-Data",
      "depends_on": [
        {
          "task_key": "Ingest-CSV-Data"
        }
      ],
      "notebook_task": {
        "notebook_path": "/Users/name@email.com/transform_sales_data",
        "source": "WORKSPACE"
      },
      "new_cluster": {
        "spark_version": "15.4.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "num_workers": 2
      }
    },
    {
      "task_key": "Write-to-Delta",
      "depends_on": [
        {
          "task_key": "Transform-Sales-Data"
        }
      ],
      "notebook_task": {
        "notebook_path": "/Users/name@email.com/write_to_delta_notebook",
        "source": "WORKSPACE"
      },
      "new_cluster": {
        "spark_version": "15.4.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "num_workers": 2
      }
    }
  ],
  "schedule": {
    "quartz_cron_expression": "0 30 9 * * ?",
    "timezone_id": "UTC",
    "pause_status": "UNPAUSED"
  },
  "email_notifications": {
    "on_failure": [
      "name@email.com"
    ]
  }
}

 



2. Updating Existing Workflows:

For modifying existing workflows, we have two endpoints: the update endpoint /api/2.1/jobs/update and the reset endpoint /api/2.1/jobs/reset. The update endpoint applies a partial update to your job. This means you can tweak parts of the job — like adding a new task or changing a cluster spec — without redefining the entire workflow. The reset endpoint, by contrast, does a complete overwrite of the job configuration. Therefore, when resetting a job, you must provide the entire desired job configuration, including any settings you wish to keep unchanged, to avoid them being overwritten or removed. Let us go over a few examples to understand the two endpoints better.
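The JSON payloads in the examples that follow are submitted the same way as the create payload, with a POST to the corresponding endpoint. As a minimal sketch (the host, PAT, and file name are placeholders), an update request can be sent like this; posting the same structure to /api/2.1/jobs/reset performs a full overwrite instead:

import requests
import json

host = "https://<databricks-instance>.cloud.databricks.com"
token = "<Your-PAT>"

# Load the partial-update payload (job_id + new_settings), e.g. the JSON shown in section 2.1
with open("update_config.json") as f:  # illustrative file name
    payload = json.load(f)

update_response = requests.post(f"{host}/api/2.1/jobs/update", json=payload, auth=("token", token))
print(update_response.status_code, update_response.text)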

2.1. Update Workflow Name & Add New Task:

Let us modify the above workflow by renaming it from Ingest-Sales-Data-Pipeline to Sales-Workflow-End-to-End, adding an input parameter source_location to the Ingest-CSV-Data task, and introducing a new task Write-to-Postgres, which runs after the successful completion of Transform-Sales-Data.

{
  "job_id": 947766456503851,
  "new_settings": {
    "name": "Sales-Workflow-End-to-End",
    "tasks": [
      {
        "task_key": "Ingest-CSV-Data",
        "notebook_task": {
          "notebook_path": "/Users/name@email.com/ingest_csv_notebook",
          "base_parameters": {
            "source_location": "s3://<bucket>/<key>"
          },
          "source": "WORKSPACE"
        },
        "new_cluster": {
          "spark_version": "15.4.x-scala2.12",
          "node_type_id": "i3.xlarge",
          "num_workers": 2
        }
      },
      {
        "task_key": "Transform-Sales-Data",
        "depends_on": [
          {
            "task_key": "Ingest-CSV-Data"
          }
        ],
        "notebook_task": {
          "notebook_path": "/Users/name@email.com/transform_sales_data",
          "source": "WORKSPACE"
        },
        "new_cluster": {
          "spark_version": "15.4.x-scala2.12",
          "node_type_id": "i3.xlarge",
          "num_workers": 2
        }
      },
      {
        "task_key": "Write-to-Delta",
        "depends_on": [
          {
            "task_key": "Transform-Sales-Data"
          }
        ],
        "notebook_task": {
          "notebook_path": "/Users/name@email.com/write_to_delta_notebook",
          "source": "WORKSPACE"
        },
        "new_cluster": {
          "spark_version": "15.4.x-scala2.12",
          "node_type_id": "i3.xlarge",
          "num_workers": 2
        }
      },
      {
        "task_key": "Write-to-Postgres",
        "depends_on": [
          {
            "task_key": "Transform-Sales-Data"
          }
        ],
        "notebook_task": {
          "notebook_path":"/Users/name@email.com/write_to_postgres_notebook",
          "source": "WORKSPACE"
        },
        "new_cluster": {
          "spark_version": "15.4.x-scala2.12",
          "node_type_id": "i3.xlarge",
          "num_workers": 2
        }
      }
    ],
    "schedule": {
      "quartz_cron_expression": "0 30 9 * * ?",
      "timezone_id": "UTC",
      "pause_status": "UNPAUSED"
    },
    "email_notifications": {
      "on_failure": [
        "name@email.com"
      ]
    }
  }
}


2.2. Update Cluster Configuration:

Cluster startup can take several minutes, especially for larger, more complex clusters. Sharing the same cluster allows subsequent tasks to start immediately after previous ones complete, speeding up the entire workflow. Parallel tasks can also run concurrently, sharing the same cluster resources efficiently. Let’s update the above workflow to share the same cluster across all the tasks.

{
  "job_id": 947766456503851,
  "new_settings": {
    "name": "Sales-Workflow-End-to-End",
    "job_clusters": [
      {
        "job_cluster_key": "shared-cluster",
        "new_cluster": {
          "spark_version": "15.4.x-scala2.12",
          "node_type_id": "i3.xlarge",
          "num_workers": 2
        }
      }
    ],
    "tasks": [
      {
        "task_key": "Ingest-CSV-Data",
        "notebook_task": {
          "notebook_path": "/Users/name@email.com/ingest_csv_notebook",
          "base_parameters": {
            "source_location": "s3://<bucket>/<key>"
          },
          "source": "WORKSPACE"
        },
        "job_cluster_key": "shared-cluster"
      },
      {
        "task_key": "Transform-Sales-Data",
        "depends_on": [
          {
            "task_key": "Ingest-CSV-Data"
          }
        ],
        "notebook_task": {
          "notebook_path": "/Users/name@email.com/transform_sales_data",
          "source": "WORKSPACE"
        },
        "job_cluster_key": "shared-cluster"
      },
      {
        "task_key": "Write-to-Delta",
        "depends_on": [
          {
            "task_key": "Transform-Sales-Data"
          }
        ],
        "notebook_task": {
          "notebook_path": "/Users/name@email.com/write_to_delta_notebook",
          "source": "WORKSPACE"
        },
        "job_cluster_key": "shared-cluster"
      },
      {
        "task_key": "Write-to-Postgres",
        "depends_on": [
          {
            "task_key": "Transform-Sales-Data"
          }
        ],
        "notebook_task": {
          "notebook_path":"/Users/name@email.com/write_to_postgres_notebook",
          "source": "WORKSPACE"
        },
        "job_cluster_key": "shared-cluster"
      }
    ],
    "schedule": {
      "quartz_cron_expression": "0 30 9 * * ?",
      "timezone_id": "UTC",
      "pause_status": "UNPAUSED"
    },
    "email_notifications": {
      "on_failure": [
        "name@email.com"
      ]
    }
  }
}


2.3. Update Task Dependencies:

Let’s add a new task named Enrich-Sales-Data and update the dependencies as shown below:
Ingest-CSV-Data → Enrich-Sales-Data → Transform-Sales-Data → [Write-to-Delta, Write-to-Postgres]
Since we are updating the dependencies of existing tasks, we need to use the reset endpoint /api/2.1/jobs/reset.

{
  "job_id": 947766456503851,
  "new_settings": {
    "name": "Sales-Workflow-End-to-End",
    "job_clusters": [
      {
        "job_cluster_key": "shared-cluster",
        "new_cluster": {
          "spark_version": "15.4.x-scala2.12",
          "node_type_id": "i3.xlarge",
          "num_workers": 2
        }
      }
    ],
    "tasks": [
      {
        "task_key": "Ingest-CSV-Data",
        "notebook_task": {
          "notebook_path":"/Users/name@email.com/ingest_csv_notebook",
          "base_parameters": {
            "source_location": "s3://<bucket>/<key>"
          },
          "source": "WORKSPACE"
        },
        "job_cluster_key": "shared-cluster"
      },
      {
        "task_key": "Enrich-Sales-Data",
        "depends_on": [
          {
            "task_key": "Ingest-CSV-Data"
          }
        ],
        "notebook_task": {
          "notebook_path":"/Users/name@email.com/enrich_sales_data",
          "source": "WORKSPACE"
        },
        "job_cluster_key": "shared-cluster"
      },
      {
        "task_key": "Transform-Sales-Data",
        "depends_on": [
          {
            "task_key": "Enrich-Sales-Data"
          }
        ],
        "notebook_task": {
          "notebook_path":"/Users/name@email.com/transform_sales_data",
          "source": "WORKSPACE"
        },
        "job_cluster_key": "shared-cluster"
      },
      {
        "task_key": "Write-to-Delta",
        "depends_on": [
          {
            "task_key": "Transform-Sales-Data"
          }
        ],
        "notebook_task": {
          "notebook_path":"/Users/name@email.com/write_to_delta_notebook",
          "source": "WORKSPACE"
        },
        "job_cluster_key": "shared-cluster"
      },
      {
        "task_key": "Write-to-Postgres",
        "depends_on": [
          {
            "task_key": "Transform-Sales-Data"
          }
        ],
        "notebook_task": {
          "notebook_path":"/Users/name@email.com/write_to_postgres_notebook",
          "source": "WORKSPACE"
        },
        "job_cluster_key": "shared-cluster"
      }
    ],
    "schedule": {
      "quartz_cron_expression": "0 30 9 * * ?",
      "timezone_id": "UTC",
      "pause_status": "UNPAUSED"
    },
    "email_notifications": {
      "on_failure": [
        "name@email.com"
      ]
    }
  }
}


The update endpoint is useful for minor modifications such as renaming the workflow, updating a notebook path, changing input parameters to tasks, updating the job schedule, or changing cluster configurations like node count, while the reset endpoint should be used for deleting existing tasks, redefining task dependencies, renaming tasks, and other structural changes.
The update endpoint does not delete tasks or settings you omit (tasks not mentioned in the request remain unchanged), while the reset endpoint removes any fields or tasks not included in the request.

3. Trigger an Existing Job/Workflow:

Use the /api/2.1/jobs/run-now endpoint to trigger a job run on demand. Pass the input parameters to your notebook tasks using the notebook_params field.

curl -X POST https://<databricks-instance>/api/2.1/jobs/run-now \
  -H "Authorization: Bearer <DATABRICKS_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{
    "job_id": 947766456503851,
    "notebook_params": {
      "source_location": "s3://<bucket>/<key>"
    }
  }'

4. Get Job Status:

To check the status of a specific job run, use the /api/2.1/jobs/runs/get endpoint with the run_id. The response includes details about the run, including its life cycle state (e.g., PENDING, RUNNING, TERMINATED) and, once finished, its result state (e.g., SUCCESS, FAILED).

curl -X GET \
  https://<databricks-instance>.cloud.databricks.com/api/2.1/jobs/runs/get?run_id=<your-run-id> \
  -H "Authorization: Bearer <Your-PAT>"

5. Delete Job:

To remove an existing Databricks workflow, simply send a POST request to the /api/2.1/jobs/delete endpoint with the job_id. This allows you to programmatically clean up outdated or unnecessary jobs as part of your pipeline management strategy.

curl -X POST https://<databricks-instance>/api/2.1/jobs/delete \
  -H "Authorization: Bearer <DATABRICKS_PERSONAL_ACCESS_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{ "job_id": 947766456503851 }'

 

Conclusion:

The Databricks Jobs API empowers data engineers to orchestrate complex workflows natively, without relying on external scheduling tools. Whether you’re automating notebook runs, chaining multi-step pipelines, or integrating with CI/CD systems, the API offers fine-grained control and flexibility. By mastering this API, you’re not just building workflows—you’re building scalable, production-grade data pipelines that are easier to manage, monitor, and evolve.

PWC-IDMC Migration Gaps
https://blogs.perficient.com/2025/06/05/pwc-idmc-migration-gaps/ (Thu, 05 Jun 2025)

In an age of technological advancements happening almost every minute, upgrading is essential for a business to survive competition, offering a customer experience beyond expectations while deploying fewer resources to derive value from any process or business.

Platform upgrades, software upgrades, security upgrades, architectural enhancements, and so on are required to ensure stability, agility, and efficiency.

Customers prefer to move from legacy systems to the Cloud due to the offerings it brings. Across cost, monitoring, maintenance, operations, ease of use, and landscape, the Cloud has transformed D&A businesses significantly over the last decade.

The movement from Informatica PowerCenter to IDMC has been perceived as the need of the hour due to the significant advantages it offers. Developers must understand both flavors to perform this code transition effectively.

This post explains the gaps between PowerCenter (PWC) and IDMC CDI from different perspectives:

  • Development
  • Data
  • Operations

Development

  • A difference in native datatypes can be observed in IDMC when importing a Source, Target, or Lookup. Workaround:
    • If any inconsistency is observed in IDMC mappings with the native datatype/precision/scale, edit the metadata to keep them in sync between the DDL and the CDI mappings.
  • In CDI, taskflow/workflow parameter values experience read and consumption issues. Workaround:
    • A Dummy Mapping task has to be created in which the list of Parameters/Variables is defined for further consumption by tasks within the taskflows (e.g., Command task, Email task).
    • Make sure to limit the number of Dummy Mapping tasks during this process.
    • A best practice is to create one Dummy Mapping task per folder to capture all the Parameters/Variables required for that entire folder.
    • For Variables whose value needs to persist for the next taskflow run, make sure the Variable value is mapped to the Dummy Mapping task via an Assignment task. This Dummy Mapping task would be used at the start and end of the taskflow to ensure that the overall taskflow processing is enabled for incremental data processing.
  • All mapping tasks/sessions in IDMC are reusable and can be used in any taskflow. If some Audit sessions are expected to run concurrently within other taskflows, ensure that the property “Allow the mapping task to be executed simultaneously” is enabled.
  • Sequence generator: data overlap issues occur in CDI. Workaround:
    • If a sequence generator is likely to be used in multiple sessions/workflows, it’s better to make it a reusable/shared Sequence.
  • VSAM Sources/Normalizer is not available in CDI. Workaround:
    • Use the Sequential File connector type for mappings using mainframe VSAM Sources/Normalizer.
  • Sessions configured with STOP ON ERRORS > 0. Workaround:
    • Set the LINK condition for the next task to “PreviousTask.TaskStatus – STARTS WITH ANY OF 1, 2” within CDI taskflows.
  • Partitions are not supported with Sources in Query mode. Workaround:
    • Create multiple sessions and run them in parallel.
  • Currently, parameterization of Schema/Table is not possible for Mainframe DB2. Workaround:
    • Use an ODBC-type connection to access DB2 with Schema/Table parameterization.
  • A mapping with a LOOKUP transformation used across two sessions cannot be overridden at the session or mapping task level to enable or disable caching. Workaround:
    • Use two different mappings with LOOKUP transformations if one mapping/session must have caching enabled and the other must have caching disabled.

Data

  • IDMC output data contains additional double quotes. Workaround:
    • Session level – use the property __PMOV_FFW_ESCAPE_QUOTE=No
    • Administrator settings level – use the property UseCustomSessionConfig = Yes
  • IDMC output data contains additional scale values with the Decimal datatype (e.g., 11.00). Workaround:
    • Use an IF-THEN-ELSE statement to remove the unwanted trailing zeros in the data (output: 11.00 -> 11).

Operations

  • CDI doesn’t store logs beyond 1,000 mapping task runs in 3 days on Cloud (it does store logs on the Secure Agent). Workaround:
    • To retain Cloud job run stats, create Audit tables and use the Data Marketplace utility to load the audit info (volumes processed, start/end time, etc.) into the Audit tables by scheduling this job at regular intervals (hourly or daily).
  • Generic restartability issues occur during IDMC operations. Workaround:
    • Ensure a Dummy Assignment task is introduced whenever the code contains a custom error handling flow.
  • SKIP FAILED TASK and RESUME FROM NEXT TASK operations have issues in IDMC. Workaround:
    • Ensure every LINK condition has an additional condition appended, “Mapping task. Fault.Detail.ErrorOutputDetail.TaskStatus=1”.
  • In PWC, any task can be run from anywhere within a workflow; this is not possible in IDMC. Workaround:
    • A feature request is being worked on by GCS to update the software.
  • Suffixing mapping task log file names with the concurrent-run workflow instance name is not possible at the IDMC mapping task config level due to parameter concatenation issues. Workaround:
    • Use a separate parameter within the parameter file to have the mapping task log file names suffixed with the concurrent-run workflow instance name.
  • IDMC doesn’t honour the “Save Session log for these runs” property set at the mapping task level when the session log file name is parameterized. Workaround:
    • Copy the mapping task log files on the Secure Agent server after the job run.
  • If the Session Log File Directory contains a / (slash) when used along with parameters (e.g., $PMSessionLogDir/ABC) under Session Log Directory Path, every run log is appended to the same log file. Workaround:
    • Use a separate parameter within the parameter file for $PMSessionLogDir.
  • In IDMC, the @numAffectedRows feature is not available to get the source and target success rows and load them into the audit table. Workaround:
    • Use @numAppliedRows instead of @numAffectedRows.
  • Concurrent runs cannot be performed on taskflows from the CDI Data Integration UI. Workaround:
    • Use the Paramset utility to upload concurrent paramsets and use the runAJobCli utility to run taskflows with multiple concurrent run instances from the command prompt.

Conclusion

While performing PWC to IDMC conversions, the Development, Data, and Operations workarounds above will help avoid rework and save effort, thereby achieving customer satisfaction in delivery.

IDMC – CDI Best Practices
https://blogs.perficient.com/2025/06/05/idmc-cdi-best-practices/ (Thu, 05 Jun 2025)

Every end product must meet and exceed customer expectations. For a successful delivery, it is not just about doing what matters, but also about how it is done by following and implementing the desired standards.

This post outlines the best practices to consider with IDMC CDI ETL during the following phases.

  • Development
  • Operations 

Development Best Practices

  • Check native datatypes between the database table DDLs and the IDMC CDI mapping Source, Target, and Lookup objects.
    • If any inconsistency is observed in IDMC mappings with the native datatype/precision/scale, edit the metadata to keep them in sync between the DDL and the CDI mappings.
  • In CDI, for workflow parameter values to be consumed by taskflows, a Dummy Mapping task has to be created in which the list of Parameters/Variables is defined for further consumption by tasks within the taskflows (e.g., Command task, Email task).
    • Make sure to limit the number of Dummy Mapping tasks during this process.
    • A best practice is to create one Dummy Mapping task per folder to capture all the Parameters/Variables required for that entire folder.
    • For Variables whose value needs to persist for the next taskflow run, make sure the Variable value is mapped to the Dummy Mapping task via an Assignment task. This Dummy Mapping task would be used at the start and end of the taskflow to ensure that the overall taskflow processing is enabled for incremental data processing.
  • If some Audit sessions are expected to run concurrently within other taskflows, ensure that the property “Allow the mapping task to be executed simultaneously” is enabled.
  • Avoid using the SUSPEND TASKFLOW option, as it requires manual intervention during job restarts. Additionally, this property may cause issues during job restarts.
  • Ensure correct parameter representation using Single Dollar/Double Dollar. Incorrect representation will cause the parameters not to be read by CDI during Job runs.
  • While working with Flatfiles in CDI mappings, always enable the property “Retain existing fields at runtime”.
  • If a sequence generator is likely to be used in multiple sessions/workflows, it’s better to make it a reusable/SHARED Sequence.
  • Use the Sequential File connector type for mappings using Mainframe VSAM Sources/Normalizer.
  • If a session is configured to have STOP ON ERRORS >0, ensure the LINK conditions for the next task to be “PreviousTask.TaskStatus – STARTS WITH ANY OF 1, 2” within CDI taskflows.
  • For mapping task failure flows, set the LINK conditions for the next task to be “PreviousTask.Fault.Detail.ErrorOutputDetail.TaskStatus – STARTS WITH ANY OF 1, 2” within CDI taskflows.
  • Partitions are not supported with Sources under Query mode. Ensure multiple sessions are created and run in parallel as a workaround.
  • Currently, parameterization of Schema/Table is not possible for Mainframe DB2. Use an ODBC-type connection to access DB2 with Schema/Table parameterization.

Operations Best Practices

  • Use Verbose data Session log config only if absolutely required, and then only in the lower environment.
  • Ensure the Sessions pick the parameter values properly during job execution
    • This can be verified by changing the parameter names and values to incorrect values and determining if the job fails during execution. If the job fails, it means that the parameters are READ correctly by the CDI sessions.
  • Ensure the Taskflow name and API name always match. If different, the job will face issues during execution via the runAJobCli utility from the command prompt.
  • CDI doesn’t store logs beyond 1000 mapping tasks run in 3 days on Cloud (it does store logs in Secure Agent). To retain Cloud job run stats, create Audit tables and use the Data Marketplace utility to get the Audit info (Volume processes, Start/End time, etc) loaded to the Audit tables by scheduling this job at regular intervals (Hourly or Daily).
  • To avoid issues with generic restartability during Operations, ensure a Dummy Assignment task is introduced whenever the code contains a custom error handling flow.
  • To facilitate SKIP FAILED TASK and RESUME FROM NEXT TASK operations, ensure every LINK condition has an additional condition appended, “Mapping task. Fault.Detail.ErrorOutputDetail.TaskStatus=1”.
  • If mapping task log file names are to be suffixed with the concurrent-run workflow instance name, ensure it is done within the parameter file; this cannot be configured at the IDMC mapping task config level due to parameter concatenation issues.
  • Copy the mapping task log files on the Secure Agent server after the job run, since IDMC doesn’t honour the “Save Session log for these runs” property set at the mapping task level when the session log file name is parameterized.
  • Ensure the Session Log File Directory doesn’t contain a / (slash) when used along with parameters (e.g., $PMSessionLogDir/ABC) under Session Log Directory Path. When used this way, every run log is appended to the same log file.
  • Concurrent runs cannot be performed on taskflows from the CDI Data Integration UI. Use the Paramset utility to upload concurrent paramsets and use the runAJobCli utility to run taskflows with multiple concurrent run instances from the command prompt.

Conclusion

In addition to coding best practices, following these Development and Operations best practices will help avoid rework and save effort, thereby achieving customer satisfaction with the delivery.

Azure SQL Server Performance Check Automation
https://blogs.perficient.com/2024/04/11/azure-sql-server-performance-check-automation/ (Thu, 11 Apr 2024)

On operational projects that involve heavy data processing on a daily basis, there’s a need to monitor DB performance. Over a period of time, the workload grows, causing potential issues. While there are best practices to handle the processing by adopting DBA strategies (indexing, partitioning, collecting stats, reorganizing tables/indexes, purging data, allocating bandwidth separately for ETL/DWH users, peak-time optimization, effective query rewrites, etc.), it is necessary to be aware of DB performance and monitor it consistently for further action.

If admin access is not available to validate performance on Azure, building automations can help monitor performance and take the necessary steps before the DB runs into performance issues/failures.

For DB performance monitoring, an IICS (Informatica) job can be created with a Data Task that executes a query against SQL Server metadata tables to check performance, and emails can be triggered once CPU/IO utilization exceeds the threshold percentage (e.g., 80%).

The IICS mapping design is shown below (scheduled to run hourly). Email alerts contain the metric percentage values.

[Figure: IICS mapping design for the SQL Server performance check automation]

Note: Email alerts will be triggered only if the threshold limit is exceeded.

                                             

IICS ETL Design:

[Figure: IICS ETL design for the SQL Server performance check automation]

IICS ETL Code Details : 

 

  1. A Data Task is used to get the SQL Server performance metrics (CPU and IO percent).

[Figure: SQL Server performance check query (part 1)]

A query checks whether CPU/IO utilization exceeds 80%. If utilization exceeds the threshold limit (the user can set this to a specific value, such as 80%), an email alert is sent.

                                                            

[Figure: SQL Server performance check query (part 2)]
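Since the original query appears only as a screenshot, the following is a hedged sketch of a comparable check that can be tested outside IICS. The DMV used (sys.dm_db_resource_stats, available on Azure SQL Database), the connection details, and the 80% threshold are assumptions to adapt; the SQL portion is the kind of statement the Data Task would execute.

import pyodbc

# Placeholder connection details for the Azure SQL database
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<server>.database.windows.net;DATABASE=<database>;UID=<user>;PWD=<password>"
)

# sys.dm_db_resource_stats holds recent resource-utilization snapshots for the database
query = """
SELECT MAX(avg_cpu_percent)       AS max_cpu_percent,
       MAX(avg_data_io_percent)   AS max_data_io_percent,
       MAX(avg_log_write_percent) AS max_log_write_percent
FROM sys.dm_db_resource_stats
WHERE end_time >= DATEADD(HOUR, -1, GETUTCDATE());
"""

row = conn.cursor().execute(query).fetchone()
threshold = 80  # alert threshold in percent (assumption)
if any(value is not None and value > threshold for value in row):
    print(f"ALERT: CPU/IO utilization exceeded {threshold}% in the last hour: {tuple(row)}")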

If Azure_SQL_Server_Performance_Info.dat has data (data is populated when CPU/IO processing exceeds 80%), the Decision task is activated and an email alert is triggered.

[Figure: SQL Server performance check result output]

Email Alert:

[Figure: SQL Server performance email alert]

Step by step guide to secure JDBC SSL connection with Postgres in AWS Glue
https://blogs.perficient.com/2024/04/05/step-by-step-guide-to-secure-jdbc-ssl-connection-with-postgre-in-aws-glue/ (Sat, 06 Apr 2024)

Have you ever tried connecting a database to AWS Glue using a JDBC SSL encryption connection? It can be quite a puzzle. A few months ago, I faced this exact challenge. I thought it would be easy, but  I was wrong! When I searched for help online, I couldn’t find much useful guidance. So, I rolled up my sleeves and experimented until I finally figured it out.

Now, I am sharing my learnings with you. In this blog, I’ll break down the steps in a clear, easy-to-follow way. By the end, you’ll know exactly how to connect your database to AWS Glue with SSL encryption. Let’s make this complex task a little simpler together.

Before moving ahead, let’s briefly discuss how SSL encryption works:

  1. The client sends a connection request (Client Hello).
  2. The server responds, choosing encryption (Server Hello).
  3. The client verifies the server’s identity using its certificate and root certificate.
  4. Key exchange establishes a shared encryption key.
  5. Encrypted data exchanged securely.
  6. Client may authenticate with its certificate before encrypted data exchange.
  7. Connection terminates upon session end or timeout.

[Figure: SSL encryption handshake flow]

Now that you have a basic understanding, let’s continue with configuring AWS Glue for SSL encryption.

The steps above are the basic steps of the SSL encryption process. Before we start the configuration process, we need to convert the following files into DER format, which is the format suitable for AWS Glue:

1) Client Certificate

2) Root Certificate

3) Certificate Key

DER (Distinguished Encoding Rules) is a binary encoding format used in cryptographic protocols like SSL/TLS to represent and exchange data structures defined by ASN.1. It ensures unambiguous and minimal-size encoding of cryptographic data such as certificates.

Here’s how you can do it for each component:

1. Client Certificate (PEM):

This certificate is used by the client (in this case, AWS Glue) to authenticate itself to the server (e.g., the PostgreSQL database) during the SSL handshake. It includes the public key of the client and is usually signed by a trusted Certificate Authority (CA) or an intermediate CA.

If your client certificate is not already in DER format, you can convert it using the OpenSSL command-line tool:

openssl x509 -in client_certificate.pem -outform der -out client_certificate.der

Replace client_certificate.pem with the filename of your client certificate in PEM format, and client_certificate.der with the desired filename for the converted DER-encoded client certificate.

 

2. Root Certificate (PEM):

The root certificate belongs to the Certificate Authority (CA) that signed the server’s certificate (in this case, the PostgreSQL database’s certificate). It’s used by the client to verify the authenticity of the server’s certificate during the SSL handshake.

Convert the root certificate to DER format using the following command:

openssl x509 -in root_certificate.pem -outform der -out root_certificate.der

Replace root_certificate.pem with the filename of your root certificate in PEM format, and root_certificate.der with the desired filename for the converted DER-encoded root certificate.

 

3. Certificate Key (PKCS#8):

This is the private key corresponding to the client certificate. It’s used to prove the ownership of the client certificate during the SSL handshake.

Convert the certificate key to PKCS#8 DER format using the OpenSSL command-line tool:

openssl pkcs8 -topk8 -inform PEM -outform DER -in certificate_key.pem -out certificate_key.pk8 -nocrypt

Replace certificate_key.pem with the filename of your certificate key in PEM format, and certificate_key.pk8 with the desired filename for the converted PKCS#8 DER-encoded certificate key.

 

Store the above certificates and key in an S3 bucket. We will need these files while configuring the AWS Glue job.
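If you prefer to script the upload, here is a minimal boto3 sketch; the bucket name mirrors the one used later in the example and the file names are the converted outputs above, so adjust both as needed:

import boto3

s3 = boto3.client("s3")
bucket = "etl-test-bucket1"  # example bucket from this walkthrough; replace with your own

# Upload the converted certificates and key so the Glue job can reference them
for file_name in ["client_certificate.der", "root_certificate.der", "certificate_key.pk8"]:
    s3.upload_file(file_name, bucket, file_name)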

 

[Figure: converted certificate and key files stored in the S3 bucket]

 

To connect AWS Glue to a PostgreSQL database over SSL using PySpark, you’ll need to provide the necessary SSL certificates and configure the connection properly. Here’s an example PySpark script demonstrating how to achieve this:

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from pyspark.sql import SparkSession

# Initialize Spark and Glue contexts
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Define PostgreSQL connection properties
jdbc_url = "jdbc:postgresql://your_postgresql_host:5432/your_database"
connection_properties = {
    "user": "your_username",
    "password": "your_password",
    "ssl": "true",
    "sslmode": "verify-ca",  # SSL mode: verify-ca or verify-full
    "sslrootcert": "s3://etl-test-bucket1/root_certificate.der",  # S3 Path to root certificate
    "sslcert": "s3://etl-test-bucket1/client_certificate.der",     # S3 Path to client certificate
    "sslkey": "s3://etl-test-bucket1/certificate_key.pk8"         # S3 Path to client certificate key
}

# Load data from PostgreSQL table
dataframe = spark.read.jdbc(url=jdbc_url, table="your_table_name", properties=connection_properties)

# Perform data processing or analysis
# For example:
dataframe.show()

# Stop Spark session
spark.stop()

 

Now, inside your Glue job, click on the Job details tab and scroll down until you see the Dependent JARs path and Referenced files path options. Under Dependent JARs path, put the S3 path where you stored the required JAR file (e.g., the PostgreSQL JDBC driver), and in Referenced files path, add the S3 paths of the converted client, root, and key certificate files, separated by commas “,”.

[Figure: AWS Glue job details showing Dependent JARs path and Referenced files path]

 

Now click on the Save option and you are ready to go.

 

This concludes the steps to configure a secure JDBC connection with a database in AWS Glue. To summarize, in this blog we:

1) Explained how SSL encryption can be used for secure data exchange between AWS Glue and your database (here, PostgreSQL)

2) Walked through the steps to configure SSL encryption in AWS Glue to secure a JDBC connection with a database

 

You can read my other blogs here

read more about AWS Glue

Navigating Snaplogic Integration: A Beginner’s Guide
https://blogs.perficient.com/2024/03/05/navigating-snaplogic-integration-a-beginners-guide/ (Tue, 05 Mar 2024)

With the rapid growth of businesses going digital, the need to develop scalable and reliable functionality to connect applications, cloud environments, and on-premises assets has grown. To resolve these complex scenarios, iPaaS is a perfect fit.

For example, if a developer needs to connect and transfer huge data from an e-commerce platform to a CRM system, writing custom code to handle data transfer would be tedious. Instead, the developer can simply consume APIs deployed to iPaaS, significantly reducing development time and effort.

But What Exactly is iPaaS?

Integration Platform as a Service (iPaaS) is a cloud-based solution that makes integrating different applications, data sources and systems easier. It typically provides built-in connectors, reusable components, and tools for designing, executing, and monitoring integrations. This helps businesses enhance operational efficiency, reduce manual efforts, and quickly adapt to changing technology landscapes.

Today, we will talk about one of the iPaaS solutions that stands as a Visionary in Gartner’s 2023 Magic Quadrant: SnapLogic.


What is SnapLogic?

SnapLogic is an iPaaS (Integration Platform as a Service) tool, that allows organization to connect various applications, data sources, and APIs to facilitate data integration, automation, and workflows.

It provides a visual interface for designing integration pipelines, making it easier for both technical and non-technical users to create and manage data integrations. SnapLogic supports hybrid cloud and on-premises deployment and is used for tasks such as data migration, ETL (Extract, Transform and Load) processes and application integration.

Getting Started with the Basics of SnapLogic

To kick-start your journey, spend 5-10 minutes for setup. Here are the steps to quickly setup your training environment.

  1. Sign Up for SnapLogic: You must sign up for an account. For training and better hands-on experience, SnapLogic provides a training account for 4 weeks. You can start with the training account to explore its features. Here is the link to get the training account: SnapLogic User Login.
  2. Access SnapLogic designer: SnapLogic designer is the heart of its integration capabilities. Once you have signed up, you can access it from your account.
  3. Course suitable for beginners: Click this link to enroll in the “SnapLogic Certified Enterprise Automation Professional” entry-level course to quickly get up to speed on SnapLogic.

Features of SnapLogic

SnapLogic is an integration platform that makes connecting different data sources and applications easier. Some key features include:

  1. Multi-cloud Integration: Supports integration across various cloud platforms.
  2. Low-Code Approach: Reduces the requirement for advanced coding knowledge.
  3. API Management: Helps manage APIs and create custom APIs between different applications.
  4. Real-time Integration: Supports real-time data integration.

Overview of Use Case

Done with sign-up and setup! Lessons that are theoretical are never easy to learn until you continue to do hands-on in parallel. Let’s look at a practical use case to simplify learning.

The customer must automatically insert the employee records from a CSV file in a shared directory into the Salesforce CRM end system.

How Can We Achieve This Using SnapLogic?

SnapLogic provides pre-built snaps, such as file reader, CSV parser, mapper, salesforce create, and many more.

To achieve the below use case, we need to add the File Reader Snap to fetch the CSV file, the CSV Parser to parse the data, the Mapper Snap to transform the data, and lastly, Salesforce Create to insert the data into Salesforce.

Creating the pipeline

  1. Upload your CSV file to the SnapLogic file system, as we need to read the CSV file.
  2. Create a pipeline: Creating a pipeline is the first step in building an integration. Click the “+” sign at the top of the middle canvas, then fill in the pipeline name and parent project and click “Save”.
  3. Add and configure the File Reader Snap: For the File field, select the file you uploaded in step 1. Because you are accessing the file system, no authentication information is needed.
  4. Add a CSV Parser Snap; you will use the default configuration.
  5. Add the Mapper: It transforms the incoming data using specific mappings and produces new output data.
  6. Salesforce Create: It creates the records in a Salesforce Account object using the REST API.
  7. After saving, SnapLogic will automatically validate the changes; you can click on the green document icon to view what your data looks like.
  8. Test the pipeline: After the build is done, we can test the pipeline. To do that, click on the “play” icon in the pipeline menu and wait for the pipeline to finish executing. Notice how the color of the snaps turns yellow while executing, indicating they are currently running.
  9. Validate the results: Once the execution finishes, the pipeline turns dark green. If there is any exception, the failing snap turns red.
  10. Results: Log in to the Salesforce account > Accounts > click on the recently viewed accounts. You will be able to see the records that were fetched from the Employee_Data.csv file.

Conclusion

Congratulations on completing your first SnapLogic integration! In this blog, we went through the basics of iPaaS and SnapLogic. We also went through a practical use case to gain confidence and better understand. Our journey in SnapLogic has just started, and we’ll be exploring more in the future to expand on the knowledge we accumulated in this article.

Perficient and SnapLogic

At Perficient, we develop scalable and robust integrations within the SnapLogic Platform. With our expertise in SnapLogic, we resolve customers’ complex business problems, which helps them grow their business efficiently.

Contact us today to explore more options for elevating your business.

Data Virtualization with Oracle Enterprise Semantic Models
https://blogs.perficient.com/2024/02/22/data-virtualization-with-oracle-enterprise-semantic-models/ (Thu, 22 Feb 2024)

A common symptom of organizations operating at suboptimal performance is a persistent struggle with data fragmentation. The fact that enterprise data is siloed within disparate business and operational systems is not, in itself, the problem to resolve, since there will always be multiple systems. In fact, businesses must adapt to an ever-growing need for additional data sources. However, with this comes the challenge of mashing up data across systems to provide a holistic view of the business. This is the case, for example, for a customer 360 view that provides insight into all aspects of customer interactions, no matter where that information comes from, or whether it’s financial, operational, or customer experience related. In addition, data movements are complex and costly. Organizations need the agility to adapt quickly to additional sources while maintaining a unified business view.

Data Virtualization As a Key Component Of a Data Fabric

That’s where the concept of data virtualization provides an adequate solution. Data stays where it is, but we report on it as if it’s stored together. This concept plays a key role in a data fabric architecture which aims at isolating the complexity of data management and minimizing disruption for data consumers. Besides data-intensive activities such as data storage management and data transformation, a robust data fabric requires a data virtualization layer as a sole interfacing logical layer that integrates all enterprise data across various source applications. While complex data management activities may be decentralized across various cloud and on-premises systems maintained by various teams, the virtual layer provides a centralized metadata layer with well-defined governance and security.

How Does This Relate To a Data Mesh?

What I’m describing here is also compatible with a data mesh approach whereby a central IT team is supplemented with products owners of diverse data assets that relate to various business domains.  It’s referred to as the hub-and-spoke model where business domain owners are the spokes, but the data platforms and standards are maintained by a central IT hub team. Again, the data mesh decentralizes data assets across different subject matter experts but centralizes enterprise analytics standards. Typically, a data mesh is applicable for large scale enterprises with several teams working on different data assets. In this case, an advanced common enterprise semantic layer is needed to support collaboration among the different teams while maintaining segregated ownerships. For example, common dimensions are shared across all product owners allowing them to report on the company’s master data such as product hierarchies and organization rollups. But the various product owners are responsible for consuming these common dimensions and providing appropriate linkages within their domain-specific data assets, such as financial transactions or customer support requests.

Oracle Analytics for Data Virtualization

Data Virtualization is achieved with the Oracle Analytics Enterprise Semantic Model. Both the Cloud version, Oracle Analytics Cloud (OAC) and the on-premises version, Oracle Analytics Server (OAS), enable the deployment of the semantic model. The semantic model virtualizes underlying data stores to simplify data access by consumers. In addition, it defines metadata for linkages across the data sources and enterprise standards such as common dimensions, KPIs and attribute/metric definitions. Below is a schematic of how the Oracle semantic model works with its three layers.

[Figure: the Oracle Enterprise Semantic Model and its three layers]

Outcomes of Implementing the Oracle Semantic Model

Whether you have a focused data intelligence initiative or a wide-scale program covering multi-cloud and on-premises data sources, the common semantic model has benefits in all cases, for both business and IT.

  • Enhanced Business Experience

With Oracle data virtualization, business users tap into a single source of truth for their enterprise data. The information available out of the Presentation Layer is trusted and is reported on reliably, no matter what front-end reporting tool is used, such as self-service data visualization, dashboards, MS Excel, machine learning prediction models, generative AI, or MS Power BI.

Another value-add for the business is that they can access new data sources quicker and in real-time now that the semantic layer requires no data movement or replication. IT can leverage the semantic model to provide this access to the business quickly and cost-effectively.

  • Future Proof Investment

The three layers that constitute the Oracle semantic model provide an abstraction of source systems from the presentation layer accessible by data consumers. Consequently, as source systems undergo modernization initiatives, such as cloud migrations, upgrades and even replacement with totally new systems, data consuming artifacts, such as dashboards, alerts, and AI models remain unaffected. This is a great way for IT to ensure any analytics investment’s lifespan is prolonged beyond any source system.

  • Enterprise Level Standardization

The semantic model enables IT to enforce governance when it comes to enterprise data shared across several departments and entities within an organization. In addition, very fine-grained object and data levels security configurations are applied to cater for varying levels of access and different types of analytics personas.

Connect with us for consultation on your data intelligence and business analytics initiatives.

3 Key Takeaways from AWS re:Invent 2023
https://blogs.perficient.com/2023/12/11/three-key-takeaways-from-aws-reinvent-2023/ (Mon, 11 Dec 2023)

Now that the dust has settled, the team has had the chance to Re:flect on the events and announcements of AWS re:Invent 2023. Dominating the conversation was the advancement and capabilities of Generative AI across several AWS Services, while not losing sight on the importance of application modernization and cloud migration. Perficient walked away with 3 key takeaways: 1) Amazon Q 2) Serverless Innovation 3) The Zero ETL Future

1. Amazon Q

Generative AI was the talk of the conference, and no topic was discussed more than Amazon Q. The powerful, new generative AI assistant can be tailored to your business and can be used to generate content and solve problems, or if leveraged with Amazon Connect, now with generative AI capabilities that are powered through Amazon Bedrock, it can allow your agents to respond faster by assisting with suggesting actions or links to relevant articles. AI is here and it isn’t going anywhere, but what might be most important is to ensure it is being used responsibly. “What’s exciting here is that the path to responsibly enabling AI for enterprise is starting to light up…” Steve Holstad, Principal of Cloud said, “We know it’s going to be an ongoing journey for years to come, but the time for a private pilot leveraging your data, based on your unique use cases, is here.” At Perficient, we are at the forefront of the next generation of AI and ML. We’re excited about the progress we’ve made and are looking forward to creating innovative solutions with AWS Q.

Read Zachary Fischer’s, Senior Solutions Architect, blog about exploring the potential of Amazon Q and Perficient Handshake.

2. Serverless Innovation

Serverless computing isn’t new to AWS, as their wide variety of serverless data offerings have been helping customers take advantage of automated methods of setting up infrastructure, real time scaling, and dynamic pricing. Three new AWS serverless innovations for Amazon Aurora, Amazon Redshift, and Amazon ElastiCache build on the work AWS has already been doing for some time.

  1. Amazon Aurora Limitless Database: A new feature supporting automated horizontal scaling to process millions of transactions at a speed unlike any before and manage an excessive amount of data in a single Aurora database.
  2. Amazon Redshift Serverless: Gather insights in seconds without having to manage data warehouse infrastructure. Leverage its self-service analytics and autoscaling capabilities to better make sense of your data.
  3. Amazon ElastiCache Serverless: An innovative serverless solution enabling users to create a cache within a minute and dynamically adjust capacity in real-time according to application traffic trends.

Learn more by reading Senior Technical Consultant Shishir Meshram’s blog about how Perficient can help you achieve a serverless infrastructure.

3. The Zero ETL Future

Historically, to connect all your data sources and find new insights, you’d need to “extract, transform, and load” (ETL) information in a tedious, manual effort. AWS announced several new integrations as part of its continued commitment to a “zero-ETL future,” so users can access data when and where they need it. In his keynote presentation, Dr. Swami Sivasubramanian, Vice President of Data and AI at AWS, said, “In addition to having the right tool for the job, customers need to be able to integrate the data that is spread across their organizations to unlock more value for their business and innovate faster. That is why we are investing in a zero-ETL future, where data integration is no longer a tedious, manual effort, and customers can easily get their data where they need it.”

Learn more about these integrations, and find out how you, like AWS, can work your way toward a “zero-ETL future.”

This was just the tip of the iceberg of what was discussed at AWS re:Invent, and Perficient is excited to be in the thick of it! Join us on this journey of discovery. Let’s see what we can build together.

]]>
https://blogs.perficient.com/2023/12/11/three-key-takeaways-from-aws-reinvent-2023/feed/ 1 351249
SQL Server Space Monitoring https://blogs.perficient.com/2023/11/28/sql-server-space-monitoring/ https://blogs.perficient.com/2023/11/28/sql-server-space-monitoring/#respond Wed, 29 Nov 2023 05:47:24 +0000 https://blogs.perficient.com/?p=350339

On operational projects that involve heavy daily data volume loads, there is a need to monitor database disk space availability. Over time, the database grows and occupies more disk space. While there are best practices to manage the size, such as purging outdated data and adding buffer/temp/data/log space to address growing needs, it is still necessary to be aware of the disk usage and monitor it consistently so further action can be taken.

If admin access is not available to check the available space, building automations can help monitor the space and take the necessary steps before the database causes performance issues or failures.

For database space monitoring, an IICS Informatica job can be created with a Data Task that queries the SQL Server metadata tables to check the available space, and emails can be triggered once free space falls below a threshold percentage (e.g., 20%).

The IICS mapping design is shown below (scheduled to run once daily). The email alerts contain the metric percentage values.

 

[Screenshot: IICS mapping design]

Note: Email alerts are triggered only if the threshold limit is exceeded.

 

IICS ETL Code Details:

  1. A Data Task is used to get the used space of the SQL Server log and data files.

[Screenshots: Data Task configuration]

A query checks whether the used space exceeds the threshold limit (the user can set this to a specific value, such as 80%); if it does, an email alert is sent.
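For reference, a query along the following lines against SQL Server’s sys.database_files catalog view returns the used-space percentage for each data and log file. This is only a sketch; the actual query used in the Data Task (shown in the screenshot below) and its threshold may differ.

-- Used-space percentage per data/log file in the current database (illustrative sketch)
SELECT
    name AS file_name,
    type_desc AS file_type,                                -- ROWS (data) or LOG
    size / 128.0 AS allocated_mb,                          -- size is stored in 8 KB pages
    FILEPROPERTY(name, 'SpaceUsed') / 128.0 AS used_mb,
    CAST(FILEPROPERTY(name, 'SpaceUsed') * 100.0 / size AS DECIMAL(5, 2)) AS used_pct
FROM sys.database_files
WHERE FILEPROPERTY(name, 'SpaceUsed') * 100.0 / size > 80; -- alert threshold, e.g. 80%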

 

[Screenshot: space-check query used in the Data Task]

If D:\Out_file.dat has data (populated when the used space exceeds 80%), the Decision task is activated and an email alert is triggered.

 

 

]]>
https://blogs.perficient.com/2023/11/28/sql-server-space-monitoring/feed/ 0 350339
Windows Folder/Drive Space Monitoring https://blogs.perficient.com/2023/11/28/windows-folder-drive-space-monitoring/ https://blogs.perficient.com/2023/11/28/windows-folder-drive-space-monitoring/#respond Wed, 29 Nov 2023 05:38:23 +0000 https://blogs.perficient.com/?p=350216

Often there is a need to monitor OS disk drive space availability for the drive holding ETL operational files (log, cache, temp, bad files, etc.). Over time, the number of files grows and occupies disk space. While there are best practices to limit the number of operational files and clear them from the disk on a regular basis (via automations), it is still recommended to keep an eye on the available space.

If admin access is not available to check the available space, and the ETL server is on a remote machine, building automations can help monitor the space and take the necessary steps before the ETL causes performance issues or failures.

For OS folder/drive space monitoring, an IICS Informatica job can be created with a Command Task that executes Windows commands via batch scripts to check the available space, and emails can be triggered once free space falls below a threshold percentage (e.g., 20%).

The IICS taskflow design is shown below (it can be scheduled bi-weekly or monthly according to the requirements). The email alerts contain the free space percentage value.

 

[Screenshot: IICS taskflow design]

Note: Email alerts are triggered only if the threshold limit is exceeded.

 

IICS ETL Code Details:

  1. A Windows Command Task is used to get the free space of the OS drive, network drive, or folder on which the ETL is installed and where the log files are held.

 

[Screenshot: Windows Command Task configuration]

D:\space_file_TGT.dat Content: (Drive Name, Free space, Overall Space)

D:,11940427776,549736935424

 

D:\Out_file.dat Content: (Drive Name, Free Space [GB], Overall Space [GB], Flag [set to ALERT if free space < 25], Used Space Percent)

D:,11940427776,549736935424,ALERT,98%

  2. An IICS Data Task is used to populate D:\Out_file.dat.

If D:\Out_file.dat has data (populated when free space < 25 GB), the Decision task is activated and an email alert is triggered.

Email Alert:

[Screenshot: email alert]

]]>
https://blogs.perficient.com/2023/11/28/windows-folder-drive-space-monitoring/feed/ 0 350216
An Introduction to ETL Testing https://blogs.perficient.com/2023/08/23/an-introduction-to-etl-testing/ https://blogs.perficient.com/2023/08/23/an-introduction-to-etl-testing/#comments Wed, 23 Aug 2023 05:13:30 +0000 https://blogs.perficient.com/?p=341215

ETL testing is a testing technique, requiring human participation, that validates the extraction, transformation, and loading of data as it is transferred from source to target according to the given business requirements.

Take a look at the block diagram below, where an ETL tool is used to transfer data from source to target. Data accuracy and data completeness can be verified via ETL testing.

[Diagram: data flowing from Source to Target through an ETL tool]

What Is ETL? (Extract, Transform, Load)

Data is loaded from the source system to the data warehouse using the Extract-Transform-Load (ETL) process.

Extraction is the retrieval of data from the sources (the sources can be a legacy system, a database, or flat files).

Transformation is the step in which the data is cleaned, aggregated, or otherwise altered.

Loading is the step in which the transformed data is loaded into the target systems, called destinations (the destinations can again be a legacy system, a database, or a flat file).

[Diagram: Extract, Transform, Load process]

 

What is ETL testing?

Data is tested via ETL testing before being transferred to live data warehouse systems; this is also known as production reconciliation. ETL testing differs from database testing in terms of its scope and the procedures used to conduct the test. When data is loaded from a source to a destination after transformation, ETL testing is done to ensure the data is accurate. The data that moves between the source and the destination is verified at several points throughout the process.


In order to avoid duplicate records and data loss, ETL testing verifies, validates, and qualifies data. Throughout the ETL process, there are several points where data must be verified.

While testing, the tester confirms that the data has been extracted completely, transferred properly, and loaded into the new system in the correct format.

ETL testing helps to identify and prevent issues with data quality during the ETL process, such as duplicate data or data loss.

Test Scenarios of ETL Testing:

 1. Mapping Document Validation

Examine the mapping document for accuracy to make sure all the necessary information has been provided. The ETL mapping document, which contains the source, target, and business rule information, is the most crucial document the ETL tester uses to design and construct the ETL jobs.

Example: Consider the following real-world scenario: we receive a source file called “Employee_info” that contains employee information that needs to be loaded into the target EMP_DIM table.

The following table shows the information included in a mapping document and what a mapping document typically looks like.

Depending on your needs, you can add additional fields.

[Table: sample mapping document for Employee_info to EMP_DIM]

 2. DDL/Metadata Check

Validate the source and target table structure against the corresponding mapping document. The source and target data types should be identical, and the data type lengths in the source and target should be equal. Verify that the data field type and format are specified as expected. Also, validate the column names in the tables against the mapping document.

Example: Check the table below to verify the metadata check points mentioned above.

Source – company_dtls_1

Target – company_dtls_2

[Table: column definitions of source company_dtls_1 and target company_dtls_2]
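This check can also be scripted. Assuming both tables are reachable from the same SQL Server connection (the table names come from the example above; the query itself is only a sketch), a comparison against INFORMATION_SCHEMA flags any missing or mismatched columns:

-- Columns whose definition differs between source and target (an empty result means the structures match)
SELECT
    s.COLUMN_NAME,
    s.DATA_TYPE AS source_type,
    t.DATA_TYPE AS target_type,
    s.CHARACTER_MAXIMUM_LENGTH AS source_length,
    t.CHARACTER_MAXIMUM_LENGTH AS target_length
FROM INFORMATION_SCHEMA.COLUMNS AS s
LEFT JOIN INFORMATION_SCHEMA.COLUMNS AS t
       ON t.TABLE_NAME = 'company_dtls_2'
      AND t.COLUMN_NAME = s.COLUMN_NAME
WHERE s.TABLE_NAME = 'company_dtls_1'
  AND (   t.COLUMN_NAME IS NULL                                   -- column missing in target
       OR s.DATA_TYPE <> t.DATA_TYPE                              -- data type mismatch
       OR ISNULL(s.CHARACTER_MAXIMUM_LENGTH, -1)
          <> ISNULL(t.CHARACTER_MAXIMUM_LENGTH, -1));             -- length mismatch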

 3. Data Completeness Validation

Data completeness validation ensures that all expected data is loaded into the target table. Check for any rejected records and perform boundary value analysis. Compare record counts between the source and target, verify that data is not truncated in the target table columns, and compare the unique values of key fields between the source data and the data loaded into the warehouse.

Example:

You have a source table with five columns and five rows that contain company-related details, and a target table with the same five columns. After the successful completion of an ETL run, all 5 records of the source table (SQ_company_dtls_1) are loaded into the target table (TGT_company_dtls_2), as shown in the image below. If any error is encountered during ETL execution, its error code is displayed in the statistics.

[Screenshot: source-to-target load statistics]
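These completeness checks are typically written as simple SQL comparisons. A sketch using the example tables (company_id is an illustrative key column, not part of the original example):

-- Row counts should match between source and target
SELECT
    (SELECT COUNT(*) FROM company_dtls_1) AS source_count,
    (SELECT COUNT(*) FROM company_dtls_2) AS target_count;

-- Key values present in the source but missing from the target
SELECT s.company_id
FROM company_dtls_1 AS s
LEFT JOIN company_dtls_2 AS t
       ON t.company_id = s.company_id
WHERE t.company_id IS NULL;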

 4. Constraint Validation

Make sure the key constraints are defined for the specific tables as expected; example checks for some of these constraints are sketched after the list below.

    • Not Null & Null
    • Unique
    • Primary Key & Foreign Key
    • Default value check
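For example, checks along these lines flag constraint violations in the EMP_DIM target mentioned earlier (emp_id, dept_id, and the parent DEPT_DIM table are illustrative names, not part of the original mapping):

-- NOT NULL check: the key column should never be null
SELECT COUNT(*) AS null_keys
FROM EMP_DIM
WHERE emp_id IS NULL;

-- Uniqueness / primary key check: no key value should appear more than once
SELECT emp_id, COUNT(*) AS occurrences
FROM EMP_DIM
GROUP BY emp_id
HAVING COUNT(*) > 1;

-- Foreign key check: every dept_id should exist in the (hypothetical) parent DEPT_DIM table
SELECT e.emp_id, e.dept_id
FROM EMP_DIM AS e
LEFT JOIN DEPT_DIM AS d
       ON d.dept_id = e.dept_id
WHERE e.dept_id IS NOT NULL
  AND d.dept_id IS NULL;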

5. Data Consistency Check

    • The data type and length of a particular attribute may vary across files or tables even though the semantic definition is the same.
    • Validate that integrity constraints such as foreign keys are not misused.

6. Data Correctness

    • Data that is misspelled or inaccurately recorded.
    • Null, non-unique, or out-of-range data
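A couple of example correctness checks (again using EMP_DIM; the hire_date and email columns and the ranges are illustrative, and should be replaced with your own business rules):

-- Out-of-range check: hire dates should not be in the future or implausibly old
SELECT emp_id, hire_date
FROM EMP_DIM
WHERE hire_date > GETDATE()
   OR hire_date < '1990-01-01';

-- Format check: email addresses should contain an '@'
SELECT emp_id, email
FROM EMP_DIM
WHERE email IS NOT NULL
  AND email NOT LIKE '%@%';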

 

Why Perform ETL Testing?


Inaccurate data resulting from flaws in the ETL process can lead to data issues in reporting and poor strategic decision-making. According to analyst firm Gartner, bad data costs companies $14 million per year on average, with losses at some companies reaching as much as $100 million.

One example of the consequences of inaccurate data:

A large fast-food company depends on business intelligence reports to determine how much raw chicken to order every month, by sales region and time of year. If these data are inaccurate, the business may order too much or too little, which could result in millions of dollars in wasted inventory or lost sales.

When do we need ETL Testing?

Here are a few situations where it is essential to use ETL testing:

  • Following a data integration project.
  • Following a data migration project.
  • When the data has been loaded, during the initial setup of a data warehouse.
  • Following the addition of a new data source to your existing data warehouse.
  • When migrating data for any reason.
  • If there are suspected problems with how well the ETL operations work.
  • If any of the source systems or the target system has suspected data quality problems.

Required Skillset for ETL Tester:

  • Knowledge of BI, DW, DL, ETL, and data visualization process
  • Very good experience in analyzing the data and their SQL queries
  • Knowledge of Python, UNIX scripting
  • Knowledge of cloud technologies like AWS, Azure, Hadoop, Hive, Spark

Roles and responsibilities of ETL Tester:

To protect the data quality of the company, an ETL tester plays a crucial role.

ETL testing makes sure that all validity checks are met and that all transformation rules are strictly followed while transferring data from diverse sources to the central data warehouse. The main role of an ETL tester includes evaluating the data sources, data extraction, the application of transformation logic, and data loading into the destination tables. Note that ETL testing differs from data reconciliation, which is used in database testing to acquire pertinent data for analytics and business intelligence; ETL testing is used for data warehouse systems.

Responsibilities of an ETL tester:

  • Understand the SRS document.
  • Create, design, and execute test cases, test plans, and test harnesses.
  • Test components of ETL data warehouse.
  • Execute backend data-driven test.
  • Identify the problem and provide solutions for potential issues.
  • Approve requirements and design specifications.
  • Test data transfers and flat files.
  • Constructing SQL queries for various scenarios, such as count tests.
  • Inform development teams, stakeholders, and other decision-makers of the testing results.
  • To enhance the ETL testing procedure over time, incorporate new knowledge and best practices.

In general, an ETL tester is the organization’s data quality guardian and ought to participate in all significant discussions concerning the data used for business intelligence and other use cases.

Conclusion:

Here we learned what ETL is, what ETL testing is, why and when we perform ETL testing, what skills are required of an ETL tester, and the roles and responsibilities of an ETL tester.

Happy Reading!

]]>
https://blogs.perficient.com/2023/08/23/an-introduction-to-etl-testing/feed/ 3 341215
Basic Understanding of Full Load And Incremental Load In ETL (PART 2) https://blogs.perficient.com/2023/05/15/basic-understanding-of-full-load-and-incremental-load-in-etl-part-2/ https://blogs.perficient.com/2023/05/15/basic-understanding-of-full-load-and-incremental-load-in-etl-part-2/#comments Mon, 15 May 2023 12:19:31 +0000 https://blogs.perficient.com/?p=334752

In the last blog, PART 1, we discussed full load with the help of an example in SSIS (SQL Server Integration Services).

In this blog, we will discuss the concept of Incremental load with the help of the Talend Open Studio ETL Tool.

Incremental Load:

The ETL incremental loading technique is a partial loading method. It reduces the amount of data that you add or change and that may need to be corrected in the event of any irregularity. Because less data is loaded and reviewed, it also takes less time to validate the data and review the changes.

Let’s Elaborate this with an Example:

Suppose the file is very large, for example, 200 to 500 million records to load. It is not possible to load this amount of data in a feasible time, because we often do not have the required amount of time to load the data during the day. So we have to update the data at night, which is limited in terms of hours, and hence there is a strong possibility that the entire amount of data cannot be loaded.

In scenarios where the number of actually updated records is small but the overall data size is huge, we go with the incremental load, in other words the differential load.

In the incremental load, we figure out how many records in the destination table need to be updated and how many records in the source file or source table are new and can be inserted into the destination table. Once this is determined, we simply update or insert into the destination table.

How to Perform Incremental Load in Talend ETL?

Incremental loading with Talend can be done as in any other ETL tool. In your job, you capture the necessary timestamps or sequence values, keep the highest value for the next run, and use this value in a query whose condition reads only the rows above that value.

Incremental loading is a way to update a data set with new data. It can be done by replacing or adding records in a table or partition of a database.

There are different ways to perform an incremental load in Talend ETL:

1) Incremental Load on New File: This method updates the existing data set with new data from an external file. This is done by importing the new data from the external file and overwriting the existing records.

2) Incremental Load on Existing File: This method updates the existing data set with new data from another source, such as a database table. In this case, records from both sources are merged and updated in one go.

3) Incremental Load Using Source Timestamps: The source database may have datetime fields that help identify which source records were inserted or updated. Using the context variable and audit control table features, we can retrieve only the newly inserted or updated records from the source database.

 

Now that you know what incremental load in ETL is, let’s explore it using Talend Open Studio.

Source Table:

We have a source table Product_Details with created_on and modified_on columns. Also, we have  some existing data in the table.

[Screenshot: Product_Details source table data]

ETL Control Table:

Using the etl_control table, we capture the last time the job ran successfully. When we have 100 jobs and tables, we don’t want to keep this information in different places; it is good practice to keep a single etl_control table, in which we capture the job name, the table name, and the last success timestamp indicating when the table was last loaded.

[Screenshot: etl_control table data]
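A minimal etl_control table along these lines is enough for this pattern (a sketch in PostgreSQL, the database used in this example; the exact data types and the job name used for seeding are illustrative and may differ from the screenshot):

-- One row per job; last_success drives the incremental filter
CREATE TABLE etl_control (
    job_name     varchar(100),
    table_name   varchar(100),
    last_success timestamp
);

-- Seed the row with a timestamp older than the data in the source table,
-- so the first run picks up everything (values are illustrative)
INSERT INTO etl_control (job_name, table_name, last_success)
VALUES ('product_incremental_load', 'product_details', '2000-01-01 00:00:00');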

Target Table:

Product_inc is our target table. In the etl_control table we set a last_success date older than the data in the source table, and we apply conditions based on the created_on column to insert and update data in the target table Product_inc.

[Screenshot: Product_inc target table data]

Now let’s explore our Talend job.

[Screenshot: overall Talend job design]

First, we drag and drop a tDBConnection component for our PostgreSQL connection so we can reuse this connection multiple times in the job. Then we import all the tables.

Now we drag the etl_control table as input where we are saving the last success timestamp for a particular job.

Then we drag and drop the tJavaRow component. With the help of this component, we set the value for the last success timestamp. We write Java code as shown below.

[Screenshot: tJavaRow Java code setting the context variables]

To store those values, we create two context variables of type timestamp: last_success and current_run.

  1. Last_success will be used to retrieve the data from the source.
  2. Current_run will be used to update the etl_control table once the job completes successfully.

[Screenshot: context variables last_success and current_run]

Now we drag and drop the tPreJob component, which ensures that the steps below are always executed before the sub-job runs.

[Screenshot: tPreJob steps]

Next, we add the actual source and target tables to create the sub-job. We also add the etl_control table as a tDBRow component to update the etl_control table afterwards.

It is connected to the source table with an OnSubJobOk trigger. So, if the job fails for any reason, the etl_control table is not updated, and in the next run (or the next day’s run) the same records are processed again from the point where they were last processed successfully.

[Screenshot: main sub-job with source, target, and etl_control update]

Input Component:

We change the existing query, which selects all the columns’ data with no condition.

For the incremental load, we provide filter conditions so that the query selects only the rows newly inserted or updated since the last run of the job.

 

"select * from product_details
where created_on >= '" + context.last_success +
"' or modified_on >= '" + context.last_success + "'"
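At run time, once the context variable has been substituted, the query sent to the database looks something like this (the timestamp value is illustrative):

select *
from product_details
where created_on >= '2023-05-15 10:30:00'
   or modified_on >= '2023-05-15 10:30:00';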

[Screenshot: input component with the incremental filter query]

 

 

Output Component:

In the target table, we change the Action on Data setting to Insert or Update. Because this is based on a key value, in the edit schema we mark product_id as the key of the target table.

[Screenshot: output component settings with product_id as the key]

 

Control Component:

We will add an update command to update the etl_control table.

"Update etl_control set last_success = '"
+ context.current_run +
"' where job_name = '" + jobName + "'"

[Screenshot: control component with the etl_control update statement]
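After substitution, the statement executed against the etl_control table looks something like this (the timestamp and job name are illustrative):

update etl_control
set last_success = '2023-05-15 10:45:00'
where job_name = 'product_incremental_load';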

This update command dynamically updates last_success with the timestamp of the job run time. Since we may have multiple jobs, we also add a condition using the global variable jobName so that only the particular job’s last_success timestamp is updated.

RUN1:

Now save the job and run it. We can see that one record is read from the etl_control table and 5 rows are inserted into the target table.

[Screenshot: first run statistics]

In the etl_control table, based on the job name, the last_success timestamp is updated with the job run timestamp.

[Screenshots: etl_control and target tables after the first run]

If we rerun the job without any changes in the source table, the sub-job does not process any records.

RUN2:

Now we will update one of the values in the source table and then run the job again.

It captures only the one record that was updated since the last successful run time.

[Screenshot: updated record in the source table]

 

[Screenshots: second run statistics, etl_control, and target tables]

 

RUN3:

Now, we will insert one new record and update one of the existing values and then run the job again.

[Screenshot: newly inserted and updated records in the source table]

We can see two records: one newly inserted record and one updated record.

[Screenshots: third run statistics, etl_control, and target tables]

So, this is how incremental load works: based on the last successful run time at the start of the job, it picks up only the inserted or updated records.

Please share your thoughts and suggestions in the space below, and I’ll do my best to respond to all of them as time allows.

For more such blogs click here

Happy Reading!

]]>
https://blogs.perficient.com/2023/05/15/basic-understanding-of-full-load-and-incremental-load-in-etl-part-2/feed/ 1 334752