Introduction
Azure App Service is a powerful platform that simplifies the deployment and management of web applications. However, maintaining application performance and availability is crucial. When performance issues arise, identifying the root cause can be challenging. This is where Auto-Heal in Azure App Service becomes a game-changer.
Auto-Heal is a diagnostic and recovery feature that allows you to proactively detect and mitigate issues affecting your application’s performance. It enables automatic corrective actions and helps capture vital diagnostic data to troubleshoot problems efficiently. In this blog, we’ll explore how Auto-Heal works, its configuration, and how it assists in diagnosing performance bottlenecks.
What is Auto-Heal in Azure App Service?
Auto-Heal is a self-healing mechanism that allows you to define custom rules to detect and respond to problematic conditions in your application. When an issue meets the defined conditions, Auto-Heal can take actions such as:
- Recycling the application process
- Collecting diagnostic dumps
- Logging additional telemetry for analysis
- Triggering a custom action
By leveraging Auto-Heal, you can minimize downtime, improve reliability, and reduce manual intervention for troubleshooting.
Configuring Auto-Heal in Azure App Service
To set up Auto-Heal, follow these steps:
Access Auto-Heal Settings
- Navigate to the Azure Portal.
- Go to your App Service.
- Select Diagnose and Solve Problems.
- Search for Auto-Heal, or open the Diagnostic Tools tile and select Auto-Heal.
Define Auto-Heal Rules
Auto-Heal allows you to define rules based on:
- Request Duration: If a request takes too long, trigger an action.
- Memory Usage: If memory consumption exceeds a certain threshold.
- HTTP Status Codes: If multiple requests return specific status codes (e.g., 500 errors).
- Request Count: If excessive requests occur within a defined time frame.
Configure Auto-Heal Actions
Once conditions are set, you can configure one or more of the following actions:
- Recycle Process: Restart the worker process to restore the application.
- Log Events: Capture logs for further analysis.
- Custom Action: Choose one of the following:
  - Run Diagnostics: Gather diagnostic data (Memory Dump, CLR Profiler, CLR Profiler with Thread Stacks, Java Memory Dump, Java Thread Dump) for troubleshooting.
  - Run any Executable: Run scripts to automate corrective measures.
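The conditions and actions above correspond to the `autoHealRules` section of the App Service site configuration. The sketch below builds such a payload as a plain Python dictionary. The property names (`autoHealEnabled`, `autoHealRules`, `triggers`, `actions`, `actionType`, `customAction`) follow the ARM site-config schema as understood here, so check them against the current API reference before relying on them. The helper name `build_auto_heal_config` and the script path are hypothetical.

```python
import json
from typing import Optional

# Action types as they appear in the site-config schema (assumed; verify before use).
VALID_ACTION_TYPES = {"Recycle", "LogEvent", "CustomAction"}


def build_auto_heal_config(triggers: dict, action_type: str,
                           custom_action: Optional[dict] = None) -> dict:
    """Return a site-config style payload holding one set of Auto-Heal rules."""
    if action_type not in VALID_ACTION_TYPES:
        raise ValueError(f"unknown action type: {action_type!r}")
    if action_type == "CustomAction" and not custom_action:
        raise ValueError("CustomAction requires a custom_action dictionary")

    actions = {"actionType": action_type}
    if action_type == "CustomAction":
        actions["customAction"] = custom_action

    return {
        "properties": {
            "autoHealEnabled": True,
            "autoHealRules": {"triggers": triggers, "actions": actions},
        }
    }


if __name__ == "__main__":
    config = build_auto_heal_config(
        triggers={"privateBytesInKB": 2000000},  # memory-based condition
        action_type="CustomAction",
        # Hypothetical script path, for illustration only.
        custom_action={"exe": "D:\\home\\site\\scripts\\collect.cmd", "parameters": ""},
    )
    print(json.dumps(config, indent=2))
```

Keeping the payload as data makes it easy to review a rule in source control before applying it through whichever deployment tool you use.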
Capturing Relevant Data During Performance Issues
One of the most powerful aspects of Auto-Heal is its ability to capture valuable diagnostic data when an issue occurs. Here’s how:
Collecting Memory Dumps
Memory dumps provide insights into application crashes, high CPU usage, or high memory usage. They can be analyzed using WinDbg or DebugDiag.
Enabling Logs for Deeper Insights
Auto-Heal logs detailed events in Kudu Console, Application Insights, and Azure Monitor Logs. This helps identify patterns and root causes.
Collecting CLR Profiler traces
CLR Profiler traces capture call stacks and exceptions, providing a user-friendly report for diagnosing slow responses and HTTP issues at the application code level.
In this article, we will cover the steps to configure an Auto-Heal rule for the following performance issues:
- Capturing a .NET Profiler/CLR Profiler trace for slow responses.
- Capturing a .NET Profiler/CLR Profiler trace for HTTP 5XX status codes.
- Capturing a memory dump for high memory usage.
Auto-Heal rule to capture .NET Profiler trace for Slow response:
1. Navigate to your App Service on Azure Portal, and click on Diagnose and Solve problems:
2. Search for Auto-Heal, or open the Diagnostic Tools tile and select Auto-Heal:
3. Click on 'On':
4. Select Request Duration and click on Add Slow Request rule:
5. Fill in the following fields according to the slowness you are seeing:
- After how many slow requests you want this condition to kick in? - The number of slow requests after which this Auto-Heal rule should start capturing data.
- What should be minimum duration (in seconds) for these slow requests? - How many seconds a request must take to be counted as slow.
- What is the time interval (in seconds) in which the above condition should be met? - The window, in seconds, within which that number of slow requests must occur.
- What is the request path (leave blank for all requests)? - If only a specific URL is slow, enter it here; otherwise leave this blank.
In this example, the rule is set so that 1 request taking 30 seconds within 5 minutes (300 seconds) triggers it.
Enter the values in the text boxes and click "Ok".
6. Select Custom Action and select CLR Profiler with Thread Stacks option:
7. The tool options provide three choices:
- CollectKillAnalyze: The tool collects the data, analyzes it, generates the report, and recycles the process.
- CollectLogs: The tool only collects the data. It does not analyze the data, generate a report, or recycle the process.
- Troubleshoot: The tool collects the data, analyzes it, and generates the report, but does not recycle the process.
Select the option that fits your scenario:
Click on "Save".
8. Review the new settings of the rule:
Clicking "Save" restarts the app, because this is a configuration-level change that takes effect only after a restart. It is therefore advisable to make such changes outside business hours.
9. Click on "Save". Once you click on Save, the app will get restarted and the rule will become active and monitor for Slow requests.
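The slow-request condition from step 5 can be read as a sliding-window count: the rule fires once enough slow requests land inside the configured interval. The sketch below illustrates that reading only. It is not how App Service evaluates the rule internally, the class name `SlowRequestRule` is hypothetical, and exact path matching is an assumption.

```python
from collections import deque


class SlowRequestRule:
    """Illustrative sliding-window check for a slow-request condition."""

    def __init__(self, count: int, min_duration_s: float,
                 interval_s: float, path: str = ""):
        self.count = count                    # slow requests needed to fire
        self.min_duration_s = min_duration_s  # what counts as "slow"
        self.interval_s = interval_s          # window length in seconds
        self.path = path                      # empty string means all requests
        self._slow_times = deque()            # completion times of slow requests

    def observe(self, timestamp_s: float, duration_s: float, path: str = "/") -> bool:
        """Record one finished request; return True if the rule would fire."""
        if self.path and path != self.path:   # assumption: exact path match
            return False
        if duration_s < self.min_duration_s:
            return False
        self._slow_times.append(timestamp_s)
        # Drop slow requests that have fallen out of the window.
        while timestamp_s - self._slow_times[0] > self.interval_s:
            self._slow_times.popleft()
        return len(self._slow_times) >= self.count


if __name__ == "__main__":
    # The example rule: 1 request taking 30 seconds within 300 seconds.
    rule = SlowRequestRule(count=1, min_duration_s=30, interval_s=300)
    print(rule.observe(timestamp_s=0, duration_s=5))    # fast request
    print(rule.observe(timestamp_s=10, duration_s=31))  # slow request
```

With a count of 1 the very first slow request triggers data collection, which suits a one-off investigation; a higher count reduces the chance of reacting to a single outlier.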
Auto-Heal rule to capture .NET Profiler trace for HTTP 5XX Status code:
For this scenario, Steps 1, 2, and 3 remain the same as in the Slow requests scenario above. The remaining steps change as follows:
1. Select Status code and click on Add Status Code rule
2. Fill in the following fields according to the status code, or range of status codes, that should trigger this rule:
- Do you want to set this rule for a specific status code or a range of status codes? - Whether the rule applies to a single status code or to a range of status codes.
- After how many requests you want this condition to kick in? - The number of requests returning the status code after which this Auto-Heal rule should start capturing data.
- What should be the status code for these requests? - Enter the status code here.
- What should be the sub-status code for these requests? - Enter the sub-status code here, if any; otherwise leave this blank.
- What should be the win32-status code for these requests? - Enter the win32-status code here, if any; otherwise leave this blank.
- What is the time interval (in seconds) in which the above condition should be met? - The window, in seconds, within which that number of requests must return the status code.
- What is the request path (leave blank for all requests)? - If only a specific URL returns that status code, enter it here; otherwise leave this blank.
Enter the values according to your scenario and click "Ok".
In this example, the rule is set so that 1 request returning an HTTP 500 status code within 60 seconds triggers it.
After adding the above information, follow Steps 6, 7, 8, and 9 from the first scenario (Slow Requests). The Auto-Heal rule for the status code then becomes active and monitors for this issue.
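For readers who keep rules as configuration, the status-code fields above can be sketched as one entry in a `statusCodes` trigger list. The field names (`status`, `subStatus`, `win32Status`, `count`, `timeInterval`, `path`) and the `hh:mm:ss` interval format reflect the site-config schema as understood here and should be verified against the current API reference. The helper names are hypothetical.

```python
from typing import Optional


def seconds_to_timespan(seconds: int) -> str:
    """Format a duration under 24 hours as hh:mm:ss."""
    if not 0 <= seconds < 86400:
        raise ValueError("this sketch only handles durations under 24 hours")
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def status_code_trigger(status: int, count: int, interval_seconds: int,
                        sub_status: Optional[int] = None,
                        win32_status: Optional[int] = None,
                        path: Optional[str] = None) -> dict:
    """Build a trigger section for a single status code."""
    entry = {
        "status": status,
        "count": count,
        "timeInterval": seconds_to_timespan(interval_seconds),
    }
    # Optional fields are left out when blank, mirroring the portal form.
    if sub_status is not None:
        entry["subStatus"] = sub_status
    if win32_status is not None:
        entry["win32Status"] = win32_status
    if path:
        entry["path"] = path
    return {"statusCodes": [entry]}


if __name__ == "__main__":
    # The example rule: 1 request returning HTTP 500 within 60 seconds.
    print(status_code_trigger(status=500, count=1, interval_seconds=60))
```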
Auto-Heal rule to capture Memory dump for High Memory usage:
For this scenario, Steps 1, 2, and 3 remain the same as in the Slow requests scenario above. The remaining steps change as follows:
1. Select Memory Limit and click on Configure Private Bytes rule:
2. According to your application's memory usage, add the Private bytes in KB at which this rule should be triggered:
In this example, the rule is set so that the application process using 2000000 KB (~2 GB) of private bytes triggers it.
Click on "Ok"
3. In Configure Actions, select Custom Action and click on Memory Dump:
4. The tool options provide three choices:
- CollectKillAnalyze: The tool collects the data, analyzes it, generates the report, and recycles the process.
- CollectLogs: The tool only collects the data. It does not analyze the data, generate a report, or recycle the process.
- Troubleshoot: The tool collects the data, analyzes it, and generates the report, but does not recycle the process.
Select the option that fits your scenario:
5. To save the memory dumps and reports, select an existing Storage Account or create a new one:
Click on Select:
Create a new one or choose an existing one:
6. Once the storage account is set, click "Save". Review the rule settings and click "Save" again. Clicking "Save" restarts the app, because this is a configuration-level change that takes effect only after a restart. It is therefore advisable to make such changes outside business hours.
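Because the Private Bytes threshold in step 2 is entered in KB, it helps to convert to and from a more familiar unit when choosing the value. The sketch below assumes 1 KB = 1024 bytes; under that assumption the example value of 2000000 KB comes to roughly 1.91 GiB, slightly under 2 GiB. The function names are hypothetical.

```python
KB_PER_GIB = 1024 * 1024  # assumption: 1 KB = 1024 bytes


def kb_to_gib(kilobytes: int) -> float:
    """Convert a Private Bytes threshold in KB to GiB."""
    return kilobytes / KB_PER_GIB


def gib_to_kb(gibibytes: float) -> int:
    """Convert a memory limit in GiB to the KB value the rule expects."""
    return round(gibibytes * KB_PER_GIB)


if __name__ == "__main__":
    print(f"{kb_to_gib(2_000_000):.2f} GiB")  # the example threshold
    print(gib_to_kb(2))                       # KB value for exactly 2 GiB
```

Pick a threshold below the memory available to the process on your App Service plan so that the dump is captured before the process fails on its own.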
Best Practices for Using Auto-Heal
- Start with Conservative Rules: Avoid overly aggressive auto-restarts to prevent unnecessary disruptions.
- Monitor Performance Trends: Use Azure Monitor to correlate Auto-Heal events with performance metrics.
- Regularly Review Logs: Periodically analyze collected logs and dumps to fine-tune your Auto-Heal strategy.
- Combine with Application Insights: Leverage Application Insights for end-to-end monitoring and deeper diagnostics.
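One way to keep rules conservative is to choose a tool option that does not recycle the process unless a restart is acceptable. The sketch below encodes the three tool options exactly as described in the configuration steps and picks one from two yes/no questions. The names `ToolMode` and `pick_tool_option` are hypothetical and not part of any Azure SDK.

```python
from typing import NamedTuple


class ToolMode(NamedTuple):
    collects: bool   # gathers the diagnostic data
    analyzes: bool   # analyzes the data and generates a report
    recycles: bool   # recycles the process afterwards


# Behavior of each option, as described in the configuration steps.
TOOL_MODES = {
    "CollectKillAnalyze": ToolMode(collects=True, analyzes=True, recycles=True),
    "CollectLogs": ToolMode(collects=True, analyzes=False, recycles=False),
    "Troubleshoot": ToolMode(collects=True, analyzes=True, recycles=False),
}


def pick_tool_option(need_report: bool, allow_recycle: bool) -> str:
    """Return the tool option matching the two requirements."""
    if not need_report:
        return "CollectLogs"         # data only, process keeps running
    if allow_recycle:
        return "CollectKillAnalyze"  # report plus a process recycle
    return "Troubleshoot"            # report without a recycle


if __name__ == "__main__":
    choice = pick_tool_option(need_report=True, allow_recycle=False)
    print(choice, TOOL_MODES[choice])
```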
Conclusion
Auto-Heal in Azure App Service is a powerful tool that not only helps maintain application stability but also provides critical diagnostic data when performance issues arise. By proactively setting up Auto-Heal rules and leveraging its diagnostic capabilities, you can minimize downtime and streamline troubleshooting efforts.
Have you used Auto-Heal in your application? Share your experiences and insights in the comments!
Stay tuned for more Azure tips and best practices!
Published Mar 06, 2025
Version 1.0
shagnihotri
Microsoft
Apps on Azure Blog