Running ADF pipelines in recurrence

Madhu raju
6 min read · Jun 9, 2019

What is ADF?

Azure Data Factory is a cloud-based data integration service that allows you to create data-driven workflows. It is, in a sense, a cloud version of SSIS, with built-in integrations with Azure Data Lake Analytics, Machine Learning, and HDInsight.

What can you do with ADF?

  • Access data sources: read data from on-prem SQL Server, Azure SQL, and Blob storage.
  • Load to the desired destination: on-prem SQL Server, Azure SQL, or Blob storage.
  • Monitor pipelines, and validate and track the execution of scheduled jobs.
  • Processing huge amounts of data becomes simpler with ADF.
  • ADF can fire up HDInsight clusters and run Pig and Hive scripts.
  • It supports a wide range of transformations through HDInsight.

Scenario:

In our org, we continuously monitor the resort availability information of each property. Say a room type (1 bedroom) is available from July 1st to Oct 30th at the XYZ resort in abc place. This kind of information is available for different room types at different locations. We have to read the data for all combinations and process it accordingly.

In our existing architecture, we read the room-type information from an Azure SQL table and execute multiple stored procedures at each stage so that the data is pumped to different targets.

For the above scenario I could have continued using an SSIS package, but I chose ADF over SSIS for two reasons:

1) As of now, we are heavy on SQL (Azure SQL/on-prem), and we would like to transform the data and load it into other database variants such as NoSQL (Cosmos DB).

2) Processing huge amounts of data with SSIS requires scaling and tuning the infrastructure to make it fast. With ADF, we don't need to worry about the infrastructure.

Let’s start creating the data factory from the portal.

  1. Log in to the portal, search for Data factories, click Add, and provide the necessary information to create the data factory.
  2. After creating the data factory, we need to start creating datasets, pipelines, etc. Click on “Author and Monitor”.
Figure 1: Author ADF.

3. The data factory dashboard will open in a new tab. There we need to click on “Author”.

Figure 2: Authoring the data factory.

Creating a Dataset

4. You can create datasets and pipelines in the Author tab. Read more about the basics of ADF and this terminology here.

5. Create a new dataset. Select the type of data source (SQL, Table storage, or even Amazon S3). In my case it is Azure SQL, so I specify the connection string of the source I want to read from as a “Linked Service”.

Figure 3: Dataset with the linked service and the table name from where we are reading data.
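To make the configuration concrete, here is a rough sketch of what the dataset definition might look like in the Code view of the authoring UI. The dataset name, linked-service name, and table (dbo.RoomAvailability) are made up for illustration; substitute your own.

```json
{
  "name": "AvailabilitySourceDataset",
  "properties": {
    "type": "AzureSqlTable",
    "linkedServiceName": {
      "referenceName": "AzureSqlAvailabilityLinkedService",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "tableName": "dbo.RoomAvailability"
    }
  }
}
```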

6. Create a Linked Service.

Figure 4: Specify details of your Azure SQL here in the linked service.
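And a sketch of the matching Azure SQL linked-service definition, again with a placeholder name and placeholder connection-string values:

```json
{
  "name": "AzureSqlAvailabilityLinkedService",
  "properties": {
    "type": "AzureSqlDatabase",
    "typeProperties": {
      "connectionString": {
        "type": "SecureString",
        "value": "Server=tcp:<server>.database.windows.net,1433;Database=<database>;User ID=<user>;Password=<password>;"
      }
    }
  }
}
```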

So I have my dataset (the entity from which I read the availability information) ready.

Creating a Pipeline

  1. Since I have to look up the new data in my dataset, I have selected the Lookup activity.
Figure 5: Lookup activity.

a) Specify the name of the Lookup activity in the General tab.

b) Select the source dataset, which we created as described above.

c) To pull the data from the source, we have three options:

1. Read the whole table

2. Query the data

3. Execute a stored procedure so that it returns the data.

In my case, I’m using the Query option, since I need to pick the top (latest) record.
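Putting the Lookup configuration together, the activity’s JSON in the pipeline’s Code view might look roughly like the sketch below. The activity name, table, and query are illustrative only; I’m assuming a hypothetical dbo.RoomAvailability table with a CreatedDate column.

```json
{
  "name": "LookupLatestAvailability",
  "type": "Lookup",
  "typeProperties": {
    "source": {
      "type": "AzureSqlSource",
      "sqlReaderQuery": "SELECT TOP 1 * FROM dbo.RoomAvailability ORDER BY CreatedDate DESC"
    },
    "dataset": {
      "referenceName": "AvailabilitySourceDataset",
      "type": "DatasetReference"
    },
    "firstRowOnly": false
  }
}
```

Note that firstRowOnly is set to false in this sketch so the result comes back as an array under output.value, which is what the ForEach activity in the next steps iterates over.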

2. The output of the Lookup activity can be used in another activity.

3. After the Lookup activity is executed, my next step is to execute a bunch of stored procedures based on some conditions. So I have chosen the ForEach activity, where each row is treated as an item.

4. For the ForEach activity, specify the name in the General tab and specify the Items collection as below.

5. Click the Items text box; a side nav will pop up where you add the Items content. Reference the previous activity from which you want to get the input.

Figure 6: Reading the output value from the Lookup activity.
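In the underlying JSON, the Items box becomes an Expression property on the ForEach activity, roughly like this (the activity names continue the illustrative ones used above; the activities array stays empty until the child activities are added in the next step):

```json
{
  "name": "ForEachAvailabilityRow",
  "type": "ForEach",
  "dependsOn": [
    {
      "activity": "LookupLatestAvailability",
      "dependencyConditions": [ "Succeeded" ]
    }
  ],
  "typeProperties": {
    "items": {
      "value": "@activity('LookupLatestAvailability').output.value",
      "type": "Expression"
    },
    "activities": [ ]
  }
}
```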

6. Add the child activities inside the ForEach, which means that for each item in the ForEach activity, all the child activities inside it are executed.

In my case I have added 4 child activities inside the ForEach, and each one of them has a specific responsibility.

Figure 7: Activities inside the ForEach.

7. By default you’ll see “Add Activities”; since I already have activities in it, I see “Edit activities”. Click on Edit activities.

Figure 8: Activities inside the ForEach.

8. Pass input parameters while executing these stored procedures; they will look like @item().PropertyYouWantToAccess.

Figure 9: Input parameter for the SP.
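As a sketch, one Stored Procedure child activity inside the ForEach could look like the following. The procedure name and parameters (RoomTypeId, PropertyCode) are hypothetical, and the nested Expression objects are roughly how the @item() references show up when entered as dynamic content.

```json
{
  "name": "ProcessAvailabilityRow",
  "type": "SqlServerStoredProcedure",
  "linkedServiceName": {
    "referenceName": "AzureSqlAvailabilityLinkedService",
    "type": "LinkedServiceReference"
  },
  "typeProperties": {
    "storedProcedureName": "dbo.usp_ProcessAvailability",
    "storedProcedureParameters": {
      "RoomTypeId": {
        "value": { "value": "@item().RoomTypeId", "type": "Expression" },
        "type": "Int32"
      },
      "PropertyCode": {
        "value": { "value": "@item().PropertyCode", "type": "Expression" },
        "type": "String"
      }
    }
  }
}
```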

In short, here’s an example:

a) I have a Students dataset, which has StudentId, StudentName, and StudentGrade as columns.

b) The Lookup activity returns the record that was most recently inserted into my Students dataset.

c) Now, Items in my ForEach activity is the complete row set from the Lookup activity. You can access the output of the Lookup activity like this:

@activity('lookup activity name').output.value

d) Now all the child activities in the ForEach activity can access the content as below:

@item().StudentId, @item().StudentName

So these values can be passed as input parameters to the child activities under the ForEach activity.
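To make that concrete, the Lookup activity’s output for the Students example would be shaped roughly like this (values invented for illustration); each element of the value array becomes one @item() inside the ForEach:

```json
{
  "count": 1,
  "value": [
    {
      "StudentId": 42,
      "StudentName": "Asha",
      "StudentGrade": "A"
    }
  ]
}
```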

With this, our pipeline is done as well.

Creating a trigger

We have created the datasets and the pipeline, but to execute the pipeline on a schedule we need to create a trigger.

Figure 10: Click on Triggers.
  1. Create a new trigger and define the type of trigger. If you want it to be recurring, specify the recurrence.
Figure 11: New trigger.

2. Associate the trigger with the pipeline so that the pipeline is executed on the recurring schedule.

Figure 12: Associate the trigger.

3. After clicking New/Edit, you’ll see the list of triggers that have been created. Select one of them and associate it with the pipeline.
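Under the covers, a recurring trigger definition boils down to a recurrence plus a pipeline reference. Here is a sketch that runs a hypothetical AvailabilityPipeline every hour (the names and start time are placeholders):

```json
{
  "name": "AvailabilityHourlyTrigger",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Hour",
        "interval": 1,
        "startTime": "2019-06-09T00:00:00Z",
        "timeZone": "UTC"
      }
    },
    "pipelines": [
      {
        "pipelineReference": {
          "referenceName": "AvailabilityPipeline",
          "type": "PipelineReference"
        }
      }
    ]
  }
}
```

Besides publishing, the trigger also has to be in the started (activated) state before it begins firing on the schedule.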

All these will work only after you publish them.

I hope this gives you an understanding of recurring ADF pipelines. Please do let me know if you have any suggestions.

References:

  1. Microsoft Documentation — https://docs.microsoft.com/en-us/azure/data-factory/introduction
  2. Differences with SSIS — https://www.jamesserra.com/archive/2017/03/azure-data-factory-and-ssis-compared/
