FAQs for Stream Operation

This topic lists advanced questions and answers about stream processing pipeline operation.

Q: Why must the system pipelines be started before the user pipelines?

A: System pipelines include Data Reader and Data Writer. The functions of these system pipelines are as follows:

Data Reader

Its function is to filter input data records for stream processing pipelines as data flows from the Kafka Origin Topic to the Internal Topic. For example, if there are 100 data records in total but the calculation logic of the stream processing pipeline requires only 10 of them, the Data Reader filters out the rest, avoiding unnecessary calculation and reducing resource costs.

Data Writer

Its function is to filter output data records for stream processing pipelines. In stream processing, both the input and output data records are in the Kafka Internal Topic (the output data records can be used as input for subsequent pipelines). The Data Writer filters the output data from the Internal Topic for stream processing pipelines.

Q: How to estimate the resources required by a stream processing pipeline that is configured with a template?

A: Taking the system Data Reader as an example, the Data Reader can process about 12,000 data records per second with a 1Core/2GB resource configuration. Based on the number of data records that the stream processing pipeline must process per second, you can estimate the computing resources that the pipeline requires.


Theoretically, the number of data records to be processed by a stream processing pipeline per second can be calculated as Devices * Points * Data Ingestion Frequency. However, this calculation is not easy in practice, because you need to count the measurement points used by each pipeline, and the data ingestion frequency of each measurement point is different.
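
As a rough illustration, the following sketch applies the formula; the device and point counts here are purely hypothetical, not taken from a real deployment:

```python
# Worked example of: records/s = Devices * Points * Data Ingestion Frequency
devices = 500                  # devices feeding the pipeline (hypothetical)
points_per_device = 20         # measurement points used by the pipeline
ingestion_frequency_hz = 0.2   # 0.2 Hz = one record per point every 5 seconds

records_per_second = devices * points_per_device * ingestion_frequency_hz
print(f"{records_per_second:.0f} records/s")   # 500 * 20 * 0.2 = 2000 records/s
```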


EnOS supports monitoring the input data of stream processing pipelines. If the data ingestion frequency is stable, you can monitor the average data input amount (for example, over the past 7 days). Using the average as a reference, you can adjust the resource configuration of the stream processing pipeline.


For the following special cases, adjust the resource configuration based on the actual situation:

  • In the solar domain, the input data during the daytime is much greater than that at night. If you estimate the data input by the average value, data might not be processed in time during the daytime. In this case, estimate the data input based on the daytime amount.
  • In case of network recovery after an interruption, the amount of input data increases greatly. To avoid impacting key stream processing pipelines, you can reserve a resource buffer for this situation. For example, if the data input might increase from 100 to 500 records, set 500 data records as the estimated data input when configuring the computing resources for the stream processing pipelines (see the sizing sketch after this list).
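
As a minimal sizing sketch, the following snippet uses the ~12,000 records/s per 1Core/2GB figure cited above and sizes for the expected peak rather than the average; the traffic numbers are hypothetical:

```python
import math

PER_UNIT_CAPACITY = 12000      # records/s per 1Core/2GB unit (Data Reader figure)

def resource_units(peak_records_per_second: float) -> int:
    """Number of 1Core/2GB units needed to keep up with the peak rate."""
    return max(1, math.ceil(peak_records_per_second / PER_UNIT_CAPACITY))

# Size for the expected peak (e.g. after a network recovery), not the average.
average_rate, peak_rate = 9000, 45000      # hypothetical records/s
units = resource_units(peak_rate)
print(f"{units}Core/{units * 2}GB")        # ceil(45000 / 12000) = 4 -> 4Core/8GB
```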

Q: How to estimate the resources required by a stream processing pipeline that is configured with operators?

A: The data processing capability of each operator is different. Take the following steps to test and estimate the resources required by a stream processing pipeline:

  1. Complete the configuration of the stream processing pipeline (including all the input points).
  2. Start the system pipeline Data Reader, which writes all the input points needed by the stream processing pipeline into the Kafka Internal Topic. After running the Data Reader, ensure that there is enough data to be processed by the stream processing pipeline (for example, the Offset reaches 100 million), so that the test results are accurate enough.
  3. Start the stream processing pipeline in Standalone mode (for example, with a 2Core/4GB resource configuration) and review the data consumption rate after the pipeline is running stably. If the data processing capacity of the pipeline is 2,000 records per second, you can conclude that the capacity of the pipeline with 1Core/2GB resources is about 1,000 records per second. If the actual data input rate is 4,000 records per second, you need to allocate at least 4Core/8GB resources to the stream processing pipeline (see the sketch after these steps).
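
The extrapolation in step 3 can be written out as follows, reusing the example numbers above:

```python
import math

measured_rate = 2000    # records/s observed in Standalone mode with 2Core/4GB
measured_cores = 2
input_rate = 4000       # actual records/s the pipeline receives

per_core_rate = measured_rate / measured_cores            # 1000 records/s per core
required_cores = math.ceil(input_rate / per_core_rate)    # 4
print(f"allocate at least {required_cores}Core/{required_cores * 2}GB")  # 4Core/8GB
```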

Q: How to estimate the resources required by system pipelines?

A: When you publish a stream processing pipeline and complete the running configuration, the system creates the required system pipelines by default for both the Standalone mode and the Cluster mode.


To estimate the resources required by system pipelines, refer to the method for estimating the resources required by a stream processing pipeline that is configured with a template.

Q: How to set the Lag alert threshold when configuring the alert service for a stream processing pipeline?

A: Lag measures the data records that have not been processed in time by a stream processing pipeline, and can be viewed with the Pipeline Monitoring feature. You can set an alert threshold to monitor the Lag of a pipeline. In normal cases, if enough computing resources are configured for a stream processing pipeline, the input data is processed in time, so the Lag value stays equal or close to 0.


Setting a Lag alert threshold for a stream processing pipeline allows you to detect and troubleshoot problems as early as possible, so as to minimize the impact on business. For example, when a runtime error occurs, the stream processing pipeline stops running and the input data piles up.


It is recommended to set the Lag threshold as Input Rate * T, where T is the acceptable real-time data processing latency for the business, in seconds.


For example, if input rate = 1000/s and T = 15min, then Lag = 1000 * 15 * 60 = 900000.
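
The same calculation, written out:

```python
# Lag threshold = Input Rate * T, reproducing the example above.
input_rate = 1000                # records/s
acceptable_latency_s = 15 * 60   # T = 15 minutes, in seconds

lag_threshold = input_rate * acceptable_latency_s
print(lag_threshold)             # 1000 * 15 * 60 = 900000
```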


The acceptable real-time data processing latency can be estimated from the SLA agreement or based on the actual business scenario. Generally, you need to consider the following factors:

  • Time needed to resolve problems
  • Time needed to restart the stream processing pipeline
  • Subsequent calculations that depend on the real-time processing results
  • Other business impacts


The smaller the latency, the sooner problems can be detected, but false alerts become more likely because real-time data traffic may fluctuate. It is recommended to set the latency to a value between 1 and 15 minutes, but the final decision is up to users from a business point of view.

Q: How to deal with a sudden increase of data traffic and data overstock?

A: The real-time data traffic of a stream processing pipeline is usually stable, because the connected IoT devices typically upload measurement point data at a fixed frequency. In normal cases, computing resources for stream processing pipelines are configured based on the actual data input traffic. If no resource buffer is reserved, a sudden increase of data traffic will lead to data overstock.


The following situations might cause sudden increase of data traffic:

  • More devices are connected: This situation can be predicted. It is recommended to make a traffic plan and scale up resources in advance to handle the increased data traffic.
  • Unexpected traffic increase: For example, network recovery after an interruption. In this case, some data piles up in the Edge. Upon network recovery, the data is uploaded to the cloud in batches, and the sudden increase of data traffic impacts the real-time data processing pipeline.


To deal with unexpected traffic increases, consider solutions from the following aspects based on the actual business scenario and requirements:


Upload piled data and real-time data separately

Upload and process the piled data separately through the offline channel, so that it does not impact the calculation of the real-time channel.


Control the data traffic from Edge

If the real-time calculation on the cloud has some resource buffer, you can control the traffic of data uploaded from the Edge. In this case, the data uploaded from the Edge can be processed before the real-time data under an overall traffic control. For example, if the real-time data processing capacity of the cloud is 2,000 records per second and the normal traffic is 1,000 records per second, the reserved capacity equals the normal traffic. If the piled data amounts to 100,000 records, it takes 100000/(2000-1000)=100s to process the piled data (see the sketch below).
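
The drain-time arithmetic, written out:

```python
# Spare capacity (capacity - normal input rate) is what absorbs the backlog.
capacity = 2000       # records/s the pipeline can process
normal_rate = 1000    # records/s of live traffic
backlog = 100000      # piled-up records uploaded from the Edge

drain_seconds = backlog / (capacity - normal_rate)
print(f"{drain_seconds:.0f} s")   # 100000 / (2000 - 1000) = 100 s
```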


Discard piled data to process real-time data only

Without data traffic control, the piled data and the real-time data are uploaded at the same time, and data piles up on the cloud. If no resource buffer is reserved, you can choose to discard the piled data and process the real-time data only: stop the stream processing pipeline, discard the unprocessed data by calling the Stream Operation APIs, and then restart the pipeline to consume the latest data directly (see the sketch below). This solution has the advantage of not impacting the real-time processing results, and the disadvantage of impacting the calculation accuracy, which you need to improve in other ways.
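
As a minimal sketch of this stop/discard/restart sequence; the functions below are hypothetical placeholders, not actual Stream Operation API calls, so consult the API reference for the real endpoints and parameters:

```python
# Hypothetical stand-ins for the EnOS Stream Operation APIs.
def stop_pipeline(pipeline_id: str) -> None:
    print(f"stop pipeline {pipeline_id}")                 # placeholder

def discard_unprocessed_data(pipeline_id: str) -> None:
    print(f"discard backlog of pipeline {pipeline_id}")   # placeholder

def start_pipeline(pipeline_id: str) -> None:
    print(f"restart pipeline {pipeline_id}")              # placeholder

def discard_piled_data(pipeline_id: str) -> None:
    stop_pipeline(pipeline_id)               # 1. stop the pipeline
    discard_unprocessed_data(pipeline_id)    # 2. drop the unprocessed records
    start_pipeline(pipeline_id)              # 3. consume the latest data directly

discard_piled_data("demo-pipeline")
```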


Emergency resource expansion

When there is enough resource quota and the resource costs are acceptable, you can choose to scale up the computing resources. Stop the stream processing pipeline first, scale up the resource allocation required by the pipeline, and then restart the pipeline to continue processing data. After the piled data is processed, you can scale the resources down again. This solution has the advantage of not impacting the result accuracy, and the disadvantage of higher resource costs when the amount of piled data is large.


Process piled data gradually with reserved resources

If a computing resource buffer is reserved and the piled data is on the cloud, you can process the piled data gradually with the reserved resources. This solution has the advantage of accurate results without extra operations, and the disadvantage of the extra time needed to process the piled data. You can decide according to the amount of piled data and the cost of the reserved resources.