troubleshooting
SQL Server Unconstrained Delegation failing after 10 hours
After the Windows updates from November 2020, you might face issues running Bulk Inserts or working with linked servers if you keep a session open for more than 10 hours. Recent changes were made on the Windows side to the way S4U (unconstrained delegation) Kerberos tickets work.
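A quick way to confirm whether an affected session is even in the delegation scenario described here is to check its authentication scheme: double-hop work (bulk insert from a file share, linked server queries) requires Kerberos, since NTLM cannot be delegated. The following is a minimal sketch, not part of the original article, using the sys.dm_exec_connections and sys.dm_exec_sessions DMVs:

```sql
-- Shows the authentication scheme and login time for each user connection.
-- Long-lived sessions (login_time more than ~10 hours ago) using KERBEROS
-- are the ones exposed to the S4U ticket behavior described above.
SELECT
    s.session_id,
    s.login_name,
    s.host_name,
    c.auth_scheme,        -- KERBEROS, NTLM, or SQL
    s.login_time
FROM sys.dm_exec_sessions    AS s
JOIN sys.dm_exec_connections AS c
    ON s.session_id = c.session_id
WHERE s.is_user_process = 1
ORDER BY s.login_time;
```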
OS Hang or Out of Memory due to SQL Ser... No Wait, it's SQL Analysis Services (SSAS)
First published on MSDN on Jan 12, 2018. Recently, we have seen a number of cases where DBAs or application developers complain about out-of-memory errors, or even the machine not responding (hanging), despite the fact that there is plenty of available memory on the system.
Effectively troubleshoot latency in SQL Server Transactional replication: Part 1
Are you struggling with latency issues in SQL Server Transactional replication? This article provides clear, step-by-step instructions to effectively troubleshoot and resolve these challenges. Dive into proven techniques and best practices that will help you improve your SQL Server's performance and ensure seamless data replication. Don't let latency slow you down; master the art of SQL Server troubleshooting today! Thanks to Collin Benkler, Senior Escalation Engineer at Microsoft for SQL Server, for his valuable feedback.
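One concrete way to measure where transactional replication latency accumulates is with tracer tokens, which time the path from the publisher to the distributor and on to each subscriber. The sketch below is illustrative only and not taken from the article; the publication name is a placeholder. Both procedures are executed at the publisher on the publication database:

```sql
-- Post a tracer token into the publication's log stream.
-- N'MyPublication' is a placeholder publication name.
DECLARE @tokenId INT;
EXEC sys.sp_posttracertoken
    @publication     = N'MyPublication',
    @tracer_token_id = @tokenId OUTPUT;

-- After giving the token time to flow through, review how long it took
-- to reach the distributor and each subscriber.
EXEC sys.sp_helptracertokenhistory
    @publication = N'MyPublication',
    @tracer_id   = @tokenId;
```

The history output splits the latency into publisher-to-distributor and distributor-to-subscriber legs, which tells you whether to focus on the Log Reader Agent or the Distribution Agent first.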
Memory Grants: The mysterious SQL Server memory consumer with Many Names
First published on MSDN on Jan 01, 2013. The Memory Consumer with Many Names: have you ever wondered what memory grants are? What about QE Reservations? And Query Execution Memory? Workspace memory? How about Memory Reservations? As with most things in life, complex concepts often reduce to a simple one: all of these names refer to the same memory consumer in SQL Server, namely memory allocated during query execution for Sort and Hash operations (bulk copy and index creation fall into the same category, but are far less common).
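To see which running queries are currently holding or waiting on these grants (whichever name you prefer for them), the sys.dm_exec_query_memory_grants DMV is the usual starting point. A minimal sketch, not from the original article:

```sql
-- One row per query that holds a memory grant or is waiting for one.
-- A NULL granted_memory_kb means the query is still queued for its grant.
SELECT
    mg.session_id,
    mg.requested_memory_kb,
    mg.granted_memory_kb,
    mg.used_memory_kb,
    mg.max_used_memory_kb,
    mg.wait_time_ms,          -- how long the query has waited for its grant
    mg.query_cost,
    t.text AS query_text
FROM sys.dm_exec_query_memory_grants AS mg
CROSS APPLY sys.dm_exec_sql_text(mg.sql_handle) AS t
ORDER BY mg.requested_memory_kb DESC;
```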
Common Causes and Troubleshooting Solutions for SQL AG Data Synchronization Latency
This article summarizes the common causes, solutions, and troubleshooting mechanisms for SQL Availability Group (AG) data synchronization latency between the primary and secondary replicas, for both synchronous-commit and asynchronous-commit modes. The latency commonly happens at the log harden phase, the log redo phase, or sometimes both.

If it happens at the log harden phase, you will likely see the HADR_SYNC_COMMIT wait type on your primary while it waits for the synchronous-commit secondary replica's acknowledgement; because of this wait, your primary is also likely to see latency committing the transactions sent from the application. If the latency happens purely at the log redo phase, you are likely to see a relatively low redo rate on the affected secondary replica together with a high redo queue value.

The key point is to first narrow down which phase of the data synchronization workflow (attached at the bottom of this article, and also covered in a few other blogs thanks to the great work of Simon Su) is introducing the latency.

For experienced DBAs: if you have already narrowed the latency down to extensive HADR_SYNC_COMMIT waits on your primary for your synchronous-commit secondary replica, refer to this article for resolving that bottleneck: https://techcommunity.microsoft.com/t5/sql-server/troubleshooting-high-hadr-sync-commit-wait-type-with-always-on/ba-p/385369

If you wish to capture detailed traces to narrow down the bottleneck, then in addition to performance monitor counters (where you can check log send and redo efficiency) and the blocking chain and its head blocker on the primary replica, the SQL AG dedicated X-events on both the primary and the affected secondary are also required to see the latency in the data synchronization workflow. The "AGLatency Report Tool" can automate the analysis of the AG X-event data, thanks to the great work of Simon Su. You can find this tool with its user manual in this link, and the AG X-event script in this link.
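Before capturing any traces, a quick look at the log send and redo queues already tells you which phase is backed up. The following is a minimal sketch, not from the original article, using the sys.dm_hadr_database_replica_states, sys.availability_replicas, and sys.dm_hadr_availability_replica_states DMVs; exact column availability can vary slightly by SQL Server version:

```sql
-- Run on the primary replica: for each database on each replica, shows how
-- much log is still waiting to be sent (log harden side) and how much has
-- been received but not yet redone on the secondary (redo side).
SELECT
    ar.replica_server_name,
    DB_NAME(drs.database_id)      AS database_name,
    ars.role_desc,
    drs.synchronization_state_desc,
    drs.log_send_queue_size,      -- KB of log not yet sent to this replica
    drs.log_send_rate,            -- KB/sec currently being sent
    drs.redo_queue_size,          -- KB of log received but not yet redone
    drs.redo_rate,                -- KB/sec currently being redone
    drs.last_commit_time
FROM sys.dm_hadr_database_replica_states      AS drs
JOIN sys.availability_replicas                AS ar  ON drs.replica_id = ar.replica_id
JOIN sys.dm_hadr_availability_replica_states  AS ars ON drs.replica_id = ars.replica_id
ORDER BY ar.replica_server_name, database_name;
```

A growing log_send_queue_size points at the send/harden side (network, flow control, or the secondary's log disk), while a growing redo_queue_size with a low redo_rate points at the redo side on the secondary.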
Below are the common causes of SQL AG data synchronization latency, with their solutions and troubleshooting mechanisms:

Long-running active transactions on a read-only secondary replica ----> When there is a long-running transaction on a readable secondary replica, it can block the local redo thread on that secondary when they access the same data.

High network latency / low network throughput ----> When the network transmission between the primary and the affected secondary is slow or unstable, the log send rate will be low and will limit the amount of data transmitted to the secondary (refer to https://docs.microsoft.com/en-us/previous-versions/sql/sql-server-2012/dn135335(v=sql.110)?redirectedfrom=MSDN).

Flow control induced high log send queue size and low log send rate ----> When your SQL AG's performance triggers a flow control gate for specific AG databases or at the AG replica level, sending log data from the primary to the secondary is temporarily suspended or throttled. Flow control is commonly caused by server overloading or a slow network. If the flow control gate is being entered constantly, you need to resolve the conditions in your environment that cause it.

You can check the performance monitor counters below to see whether your SQL AG database or SQL AG replica has ever triggered a flow control gate:

AG replica level:
- Availability Replica: Flow Control Time
- Availability Replica: Flow Control/sec

AG database level:
- Database Replica: Database Flow Control Delay
- Database Replica: Database Flow Controls/Sec

(For details on the flow control gates discussed below, refer to Monitor Performance for AlwaysOn Availability Groups | Microsoft Docs.)

AlwaysOn Availability Groups is designed with flow control gates on the primary replica to avoid excessive resource consumption, such as network and memory resources, on all availability replicas. These flow control gates do not affect the synchronization health state of the availability replicas, but they can affect the overall performance of your availability databases, including RPO. After the logs have been captured on the primary replica, they are subject to two levels of flow control:

Flow control gates:
- Transport level: one gate per availability replica; message threshold 8192; useful metric: extended event database_transport_flow_control_action.
- Database level: one gate per availability database; message threshold 11200 (x64) or 1600 (x86); useful metrics: DBMIRROR_SEND wait type and extended event hadron_database_flow_control_action.

Once the message threshold of either gate is reached, log messages are no longer sent to that specific replica or for that specific database. Sending resumes once acknowledgement messages are received for the sent messages, bringing the number of outstanding messages back below the threshold.

In addition to the flow control gates, there is another factor that can prevent log messages from being sent. The synchronization of replicas ensures that messages are sent and applied in log sequence number (LSN) order. Before a log message is sent, its LSN is also checked against the lowest acknowledged LSN to make sure the gap is less than a threshold (which depends on the message type). If the gap between the two LSNs is larger than the threshold, the messages are not sent; once the gap drops below the threshold again, they are sent.
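To get a quick read on whether commit latency or flow control is currently in play, the wait types mentioned above can be checked directly. A minimal sketch, not from the original article, against sys.dm_os_wait_stats; these counters are cumulative since instance start (or since they were last cleared), so compare two samples taken a few minutes apart rather than reading the absolute numbers:

```sql
-- Cumulative wait statistics since instance start (or since cleared).
-- HADR_SYNC_COMMIT: time the primary spent waiting for synchronous-commit
--                   secondaries to harden the log.
-- DBMIRROR_SEND:    time spent waiting on the log transport, a useful
--                   signal that the flow control gates are engaged.
SELECT
    wait_type,
    waiting_tasks_count,
    wait_time_ms,
    wait_time_ms - signal_wait_time_ms AS resource_wait_time_ms,
    max_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type IN (N'HADR_SYNC_COMMIT', N'DBMIRROR_SEND')
ORDER BY wait_time_ms DESC;
```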
Blocked REDO thread ----> When there are other issues or performance bottlenecks on the secondary replica, the REDO thread can be blocked, causing data synchronization latency. Refer to https://techcommunity.microsoft.com/t5/sql-server/alwayson-minimizing-blocking-of-redo-thread-when-running/ba-p/383963

Shared REDO target ----> This issue only happens when you have multiple secondary replicas. When one secondary replica is very slow in finishing the redo process, it limits how far the other secondary replicas can advance the maximum allowable LSN of their own redo. Trace flag 9559 can be used to mitigate the issue (refer to https://docs.microsoft.com/en-us/archive/blogs/alwaysonpro/recovery-on-secondary-lagging-shared-redo-target).

Running out of parallel redo threads on the secondary replica ----> When the secondary replica runs out of parallel redo threads, you can use trace flag 3478 to allow the maximum number of parallel redo threads to grow with the total number of CPUs. By default, a SQL Server instance uses up to 100 threads for parallel redo for secondary replicas. Each database uses up to one-half of the total number of CPU cores, but not more than 16 threads per database. If the total number of required threads for a single instance exceeds 100, SQL Server uses a single redo thread for every remaining database.

Disk or I/O subsystem latency ----> When the disk or I/O subsystem on the secondary replica has a critical bottleneck, the redo rate on that secondary will be impacted and can become quite small. You are less likely to hit this on the secondary before hitting it on the primary if the secondary is not readable, has comparable hardware for the SQL database files' storage, and is dedicated to this AG's usage. However, if your secondary replica carries a high volume of read-only workload, that can add extra I/O overhead to the I/O subsystem; frequent database backups and VM snapshots can be another source of extra I/O overhead.
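If disk or I/O subsystem latency on the secondary is the suspect, file-level I/O statistics give a quick read on where the stalls are. The following is a minimal sketch, not from the original article; run it on the affected secondary replica, and keep in mind the numbers are cumulative since the database came online, so sample twice and compare if you need a current rate:

```sql
-- Average read/write stall per I/O for every database file on this instance.
-- High write stalls on the log files of AG databases on a secondary are a
-- typical signature of a redo bottleneck in the I/O subsystem.
SELECT
    DB_NAME(vfs.database_id)  AS database_name,
    mf.name                   AS logical_file_name,
    mf.type_desc,             -- ROWS (data) or LOG
    vfs.num_of_reads,
    vfs.num_of_writes,
    1.0 * vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)  AS avg_read_stall_ms,
    1.0 * vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_stall_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
    ON vfs.database_id = mf.database_id AND vfs.file_id = mf.file_id
ORDER BY avg_write_stall_ms DESC;
```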