Posted in Microsoft, Microsoft, Microsoft Azure

How to Scale Queries Using Azure Stream Analytics


Learn how to scale queries using Streaming Units and partitions in this tutorial by Krishnaswamy Venkataraman, a big data and cloud architect.

In a traditional static data scenario, a query can be executed against a fixed data set and results will be available after a known interval. On the other hand, with a streaming data scenario involving constant changes to a dataset, the queries will run for a longer duration or might not even complete.

Additionally, a constant stream of data will increase the volume of the data, and the query might drain the working memory. One way to draw a data boundary is through the context of time. For example, with a streaming dataset, you can specify a data boundary that resides within the start and end time. This will restrict the query execution between known boundaries. There are two types of timing constraints you can use to set time boundaries for the streaming data, application and arrival time.

Application and Arrival Time

The time at the event origin is known as the Application Time; whereas, the time at the event landing is called the Arrival Time. In your queries, you can use TIMESTAMP BY for the application time and System.Timestamp for the arrival time. You can review the attached end-to-end flow as follows:

Stream Analytics with Microsoft Azure 1

Application Time and Arrival Time

The Application Time with the event payload is set by the user, and the Arrival Time is set by Azure depending on the Azure input source as shown in the following table. Depending on the Azure resource, the Arrival Time will be set accordingly:

Azure input resource  Arrival Time
Blob Blob last modified time (BlobLastModified)
Azure Event Hub Event Enqueued Time

Azure Stream Analytics is a PaaS service in which all the complexities associated with physical infrastructure or configurations are abstracted away. Building a streaming solution without a PaaS-based system will involve scores of the HW and SW orchestration. In particular, when you need to scale up resources to cater to a surge in demand, PaaS-based streaming is extremely advantageous.

In the following sections, you’ll get a glimpse into the key Azure Stream Analytics scaling techniques and the embarrassingly parallel and not embarrassingly parallel jobs.

Streaming Units

Streaming Units (SUs) are a blend of a resource (CPU, memory, and read and write rates) to execute a given job. Each SU corresponds to roughly 1MB/second of throughput and all queries are executed in the memory.

The costing is based on the volume of data processed and the number of SUs required per hour for the job to run. The costing calculation involves the volume of streaming data processed and the compute capacity of the SU. The key factors that will affect the costing of the SU are as follows:

  • Query complexity, query latency, and the volume of data processed. Complex queries and queries that need multiple steps will increase the cost of the SU. A query can have one or many steps. Each step is a sub-query defined using the WITH
  • Depending on the expected performance, more SUs will be required.
  • Choosing the number of SUs required for a particular job depends on the partition configuration for the input and the query defined for the job.
  • You can select your quota of SUs for a job using the Azure portal. Each Azure subscription by default has a quota of up to 48 SUs for all the analytics jobs in a specific region. To increase the SUs for your subscription, log a support ticket through the Microsoft support portal. Do note that each SU can process about 1 MB of input.
  • The number of SUs that a job can utilize depends on the partition configuration for the input and the query defined for the job. Note that a valid value for the SUs must be used. The valid values start at one, three, and six, and higher in increments of six.

Partitions

The scaling of a Stream Analytics job takes advantage of partitions in the input or output; partition key enables the data to be divided into subsets. A process can consume the subdivided data—such as an Azure Stream Analytics job—and write it into different partitions in parallel, which will increase the throughput significantly. For example, in the following illustration, you can use Event Hubs as the scalable event broker to scale out input:

Stream Analytics with Microsoft Azure 2

Event Hub ingress and egress with partition

The key advantages of using event brokers such as Event Hub are as follows:

  • When you use an event broker such as Event Hub, the query is partitioned; input events will be processed and aggregated into separate partition groups. To read from Event Hub, ensure that the number of partitions matches.
  • Output events are produced for each partition. group.Partitioningallows for parallel execution and scale-out. Do note that the queries should use the PARTITION BY (Partition Id) 

In order to build a scalable event broker model like the preceding one, you need to have a very clear understanding of the type of input and output paradigms that can be used.

Input source

The following is the full summary of the storage type and its support for parallelization:

Storage Type Supports Partition Id 
  • Blob storage
  • Azure Data Lake Store as a Service
  • Table storage
  • Event Hub
  • Cosmos DB
  • Azure Functions (Preview)
  • Service Bus Topics
  • Service Bus Queues
Support Partitioning

If you are using Blob storage, note the following points:

  • SQL database (Do note that an SQL server on IaaS is not supported)
  • Power BI
  • SQL Data warehouse.
Don’t support Partitioning

 Output source

The following is the full summary of the output storage type and its support for parallelization:

Storage Type Parallelizable Manual/Automatic 
Blobs, Tables, ADLS, Service Bus, and Azure Function Yes Automatic
 CosmosDB and Event Hub Yes Manual by specifying PARTITION BY field (usually Partition ID).

For Event Hub, verify the number of portions for inputs and output that match to avoid cross-over between partitions.

 SQL, SQL DW, and PowerBI No Not applicable

Using the right type of Input and Output with Partition ID can scale the query parallely. But not all jobs are parallelized. Here are the constructs of the queries and query parallelization.

Embarrassingly parallel jobs and not embarrassingly parallel jobs

Embarrassingly parallel jobs are tasks that can be executed independently without any dependency requirement between the tasks. This enables each task to have its process and computing space and in turn, scale parallely to complete the given job faster:

Stream Analytics with Microsoft Azure 3

Embarrassingly parallel job

Here is a simple example, where you’ll count the number of topics from the Twitter stream every 10 minutes. The key word to look in this example is PARTITION BY PartitionId. Stream Analytics job takes advantage of partitions in the input or output. PARTITION BY PartitionId allows division of the data into subsets based on a partition key. The query will consume and write different partitions in parallel, which increases throughput.

The following is an illustration of tumbling windows for reference purposes:

SELECT Count(*) AS Count, Topic

FROM TwitterStream PARTITION BY PartitionId

GROUPBY TumblingWindow(minute,10), Topic, PartitionId

Stream Analytics with Microsoft Azure 4

Tumbling windows are continues fixed-sized, non-overlapping and contiguous time intervals

Here is an image of the how the query will consume the data from different partitions from the Event Hub:

Stream Analytics with Microsoft Azure 5

A pictorial view of how the query will get executed in parallel

Here is the summary of the requirement for parallel execution over scaled-out resources:

  • The Input source must be partitioned
  • The query must-read from the partitioned input source
  • The query within the step must have the Partition By keyword

Not embarrassingly parallel jobs

With not embarrassingly parallel jobs, the tasks that get executed have dependency and cannot be executed independently of each other. Due to the dependency, executing the tasks in parallel becomes challenging. But you can run some of the tasks in parallel:

Stream Analytics with Microsoft Azure 6

In the following query sample, there are two parts of the query. The first part of the query can be parallelized with the keyword PartitionID. The second piece of the query, that’s outside the WITH construct will not be parallelized. In other words, this is a partial parallelization:

WITH Step1 AS (

SELECT COUNT(*) AS Count, TollBoothId, PartitionId

FROM Input1 Partition By PartitionId

GROUP BY TumblingWindow(minute, 3), TollBoothId, PartitionId

)

SELECT SUM(Count) AS Count, TollBoothId

FROM Step1

GROUP BY TumblingWindow(minute, 3), TollBoothId

Here is a sample deployment of the not embarrassingly parallel deployment using Event Hub and Power BI:

Stream Analytics with Microsoft Azure 7

Implementation of the Non-Embarrassingly parallel job using Event Hub, Azure Stream Analytics, and Power BI

The following are the key points on how to size your scaling units:

  • Sizing your Stream Units(SUs):
  1. Start your testing with No Partition By with six SUs and measure the required performance:
  • If six SUs provide the required performance, leave it as it is. You can always try with 3 and 1 SUs to gauge the performance
  • A single SU is not recommended for production workloads
  • SUs depend on a partition configuration for the inputs and on the query defined for the job
  • SUs start with 1, 3, 6 and increments of 6 thereafter
  • Non-partitioned steps can scale up to 6 SUs together
  • Each partitioned step can have 6 SUs

A sample use case

In this sample, you have three automated tollbooths and cars pass through every few minutes. The goal is to calculate how many cars have passed through in a given duration:

  • No Partition:

SELECT COUNT(*) AS Count, TollBoothId

FROM Input1 GROUP BY TumblingWindow(minute, 3), TollBoothId, PartitionId

  • Partitioned Query:

SELECT COUNT(*) AS Count, TollBoothId

FROM Input1 Partition By PartitionId

GROUP BY TumblingWindow(minute, 3), TollBoothId, PartitionId

  • Split query with Partition and Non-Partition:

WITH Step1 AS (

 SELECT COUNT(*) AS Count, TollBoothId, PartitionId

 FROM Input1 Partition By PartitionId

 GROUP BY TumblingWindow(minute, 3), TollBoothId, PartitionId

 )

 SELECT SUM(Count) AS Count, TollBoothId

 FROM Step1

 GROUP BY TumblingWindow(minute, 3), TollBoothId

The summary of the result with the total SU:

Stream Analytics with Microsoft Azure 7

If you found this article helpful, you can explore Stream Analytics with Microsoft Azure. This book is your guide to understanding the basics of how Azure Stream Analytics works and building your own analytics solution using its capabilities.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s