This post has been republished via RSS; it originally appeared at: Microsoft Tech Community - Latest Blogs.
A customer was trying to apply SQL Server 2014 SP3 to a failover cluster instance. They applied the patch to the passive node and then failed over to that node. The failover failed, and the issue occurred every time. The only mitigation they found was to uninstall SQL Server 2014 SP3. The cluster log showed the following:
00005f04.00005968::2022/10/07-00:06:42.743 INFO [RES] SQL Server <SQL Server (CAxxxxxDB)>: [sqsrvres] Request to bring SQL Server online
00005f04.00005968::2022/10/07-00:06:42.743 INFO [RES] SQL Server <SQL Server (CAxxxxxDB)>: [sqsrvres] SQL Server resource state is changed from 'ClusterResourceFailed' to 'ClusterResourceOnlinePending'
00001554.00003fa4::2022/10/07-00:06:42.745 INFO [RCM] HandleMonitorReply: ONLINERESOURCE for 'SQL Server (CAxxxxxDB)', gen(11) result 997/0.
00001554.00003fa4::2022/10/07-00:06:42.745 INFO [RCM] Res SQL Server (CAxxxxxDB): OnlineCallIssued -> OnlinePending( StateUnknown )
00001554.00003fa4::2022/10/07-00:06:42.745 INFO [RCM] TransitionToState(SQL Server (CAxxxxxDB)) OnlineCallIssued-->OnlinePending.
00005f04.00001d14::2022/10/07-00:06:42.745 INFO [RES] SQL Server <SQL Server (CAxxxxxDB)>: [sqsrvres] Online worker is started
00001554.00001db8::2022/10/07-00:06:42.745 INFO [GEM] Node 2: Processing message as part of GemRepair message 2:35071 from node 2. Action: causal, Target: CAUS
00005f04.00001d14::2022/10/07-00:06:42.831 INFO [RES] SQL Server <SQL Server (CAxxxxxDB)>: [sqsrvres] XEvent session CAxxxxxDB is created with RolloverCount 10, MaxFileSizeInMBytes 100, and LogPath 'C:\ClusterStorage\VirtualDisk-CAxxxxxDB\Data\MSSQL13.CAxxxxxDB\MSSQL\LOG\'
00005f04.00001d14::2022/10/07-00:06:42.831 INFO [RES] SQL Server <SQL Server (CAxxxxxDB)>: [sqsrvres] Extended Event logging is started
00005f04.00001d14::2022/10/07-00:06:42.831 INFO [RES] SQL Server <SQL Server (CAxxxxxDB)>: [sqsrvres] The private property VerboseLogging is 0
00005f04.00001d14::2022/10/07-00:06:42.831 INFO [RES] SQL Server <SQL Server (CAxxxxxDB)>: [sqsrvres] The private property HealthCheckTimeout is 60000
00005f04.00001d14::2022/10/07-00:06:42.831 INFO [RES] SQL Server <SQL Server (CAxxxxxDB)>: [sqsrvres] The private property FailureConditionLevel is 3
00005f04.00001d14::2022/10/07-00:06:42.831 INFO [RES] SQL Server <SQL Server (CAxxxxxDB)>: [sqsrvres] The private property SqlDumperDumpFlags is 0x0
00005f04.00001d14::2022/10/07-00:06:42.831 INFO [RES] SQL Server <SQL Server (CAxxxxxDB)>: [sqsrvres] The private property SqlDumperDumpTimeOut is 0
00005f04.00001d14::2022/10/07-00:06:42.831 INFO [RES] SQL Server <SQL Server (CAxxxxxDB)>: [sqsrvres] The private property SqlDumperDumpPath is ''
00005f04.00001d14::2022/10/07-00:06:42.831 INFO [RES] SQL Server <SQL Server (CAxxxxxDB)>: [sqsrvres] The property LogIsEnabled is 1
00005f04.00001d14::2022/10/07-00:06:42.831 INFO [RES] SQL Server <SQL Server (CAxxxxxDB)>: [sqsrvres] The property LogFileRolloverCount is 10
00005f04.00001d14::2022/10/07-00:06:42.831 INFO [RES] SQL Server <SQL Server (CAxxxxxDB)>: [sqsrvres] The property LogMaxFileSizeInMBytes is 100
00005f04.00001d14::2022/10/07-00:06:42.831 INFO [RES] SQL Server <SQL Server (CAxxxxxDB)>: [sqsrvres] The property LogPath is ''
00005f04.00001d14::2022/10/07-00:06:42.833 INFO [RES] SQL Server <SQL Server (CAxxxxxDB)>: [sqsrvres] Server name is GOAAZRVDB226\CAxxxxxDB
00005f04.00001d14::2022/10/07-00:06:42.833 INFO [RES] SQL Server <SQL Server (CAxxxxxDB)>: [sqsrvres] Service name is MSSQL$CAxxxxxDB
00005f04.00001d14::2022/10/07-00:06:42.833 INFO [RES] SQL Server <SQL Server (CAxxxxxDB)>: [sqsrvres] Dependency expression for resource 'SQL Network Name (xxxxxx)' is '([5bxxxxf4-3e0d-4787-9e65-769xxxxx68])'
00005f04.00006a48::2022/10/07-00:06:42.835 ERR [RES] SQL Server <SQL Server (CAxxxxxDB)>: [sqsrvres] Worker Thread (43A1E9F0): Failed to retrieve the ftdata root registry value (hr = 2147942402, last error = 0). Full-text upgrade will be skipped.
00005f04.00006a48::2022/10/07-00:06:42.927 WARN [RES] SQL Server <SQL Server (CAxxxxxDB)>: [sqsrvres] Worker Thread (43A1E9F0): ReAclDirectory : Failed to apply security to C:\ClusterStorage\VirtualDisk-CAxxxxxDB\Data\MSSQL13.CAxxxxxDB\MSSQL\Data (50).
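When a cluster log is long, it can help to filter it down to the WARN and ERR entries and to decode the numeric result codes. The following is a minimal Python sketch, assuming the line layout seen in the excerpt above (`pid.tid::timestamp LEVEL [COMPONENT] message`). The function names `parse_line`, `problems`, and `decode_hresult` are illustrative and not part of any Microsoft tool. The bit layout used by `decode_hresult` is the standard HRESULT layout (severity bit, 11-bit facility, 16-bit code).

```python
import re

# Layout assumed from the excerpt above:
# <pid>.<tid>::<yyyy/mm/dd-hh:mm:ss.fff> <LEVEL> [<COMPONENT>] <message>
LINE_RE = re.compile(
    r"^(?P<pid>[0-9a-f]{8})\.(?P<tid>[0-9a-f]{8})::"
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>INFO|WARN|ERR)\s+\[(?P<component>\w+)\]\s+(?P<message>.*)$"
)


def parse_line(line):
    """Return a dict of fields for one cluster log line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


def problems(lines):
    """Return the parsed entries whose level is WARN or ERR."""
    parsed = (parse_line(line) for line in lines)
    return [entry for entry in parsed if entry and entry["level"] in ("WARN", "ERR")]


def decode_hresult(value):
    """Split a decimal HRESULT, as printed in the log, into its standard fields."""
    value &= 0xFFFFFFFF
    return {
        "hex": f"0x{value:08X}",
        "failure": bool(value >> 31),        # severity bit
        "facility": (value >> 16) & 0x7FF,   # 11-bit facility code
        "code": value & 0xFFFF,              # 16-bit error code
    }
```

For example, `decode_hresult(2147942402)` yields `0x80070002`, that is, facility 7 (Win32) with code 2 (ERROR_FILE_NOT_FOUND), which fits the message that the ftdata root registry value could not be retrieved.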
We checked the registry keys that might point to a wrong path and cause this issue, but we did not find any incorrect key.
We captured a dump of rhs.exe to analyze the issue. The call stack below shows the SQL Server resource DLL re-applying ACLs (ReAclDirectory), and the directory argument pointed to the SQL Server error log folder.
00 ntdll!ZwSetSecurityObject
01 KERNELBASE!SetKernelObjectSecurity
02 ntmarta!MartaSetFileRights
03 ntmarta!MartaUpdateTree
04 ntmarta!MartaManualPropagation
05 ntmarta!AccRewriteSetHandleRights
06 advapi32!SetSecurityInfo
07 SQSRVRES!SQLClusterSharedDataUpgradeWorker::ReAclDirectory
08 SQSRVRES!SQLClusterSharedDataUpgradeWorker::DoSQLDataRootApplyACL
09 SQSRVRES!SQLClusterSharedDataUpgradeWorker::Execute
0a SQSRVRES!SQLClusterResourceWorker::WorkerStartRoutine
0b resutils!ClusWorkerStart
0c kernel32!BaseThreadInitThunk
0d ntdll!RtlUserThreadStart
wchar_t * directory = 0x00000000`01bdf130 "D:\MSSQL11.MSSQLSERVER\MSSQL\Log"
We checked the SQL Server error log folder and found that it contained more than 48,000 files. There appears to be a timeout limit: if the folder holds too many files, applying security to it does not finish in time, we hit the timeout, and the failover fails.
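A quick way to check whether a folder is in this state is to count the files under it before patching. The sketch below is only an illustration: `count_files` is a hypothetical helper, and the post does not state what file count is safe, so any threshold you compare against is your own assumption.

```python
import os


def count_files(folder, recursive=True):
    """Count regular files under folder without building a full list in memory."""
    total = 0
    pending = [folder]
    while pending:
        current = pending.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_file(follow_symlinks=False):
                    total += 1
                elif recursive and entry.is_dir(follow_symlinks=False):
                    pending.append(entry.path)
    return total
```

Calling `count_files` on the instance's Log directory on each node before applying a service pack gives a number to compare against the 48,000+ files seen in this case.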
The customer cleaned up files that were no longer useful, such as maintenance plan logs, index rebuild logs, and old dump files, which reduced the number of files to 300. Failover was successful afterwards.
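A cleanup like this is easier to review if the candidates are listed first and deleted only after someone has checked them. The following sketch lists files that match a set of name patterns and are older than a given age. It deletes nothing. The helper name `find_stale_files`, the patterns, and the age limit are all assumptions for illustration; the post does not say which file names or retention period the customer used.

```python
import fnmatch
import os
import time


def find_stale_files(folder, patterns, older_than_days, now=None):
    """Return sorted paths of files in folder that match any pattern and are older than the cutoff."""
    now = time.time() if now is None else now
    cutoff = now - older_than_days * 86400
    stale = []
    with os.scandir(folder) as entries:
        for entry in entries:
            if not entry.is_file(follow_symlinks=False):
                continue
            # Compare in lower case so matching is case-insensitive on every platform.
            name = entry.name.lower()
            if not any(fnmatch.fnmatch(name, p.lower()) for p in patterns):
                continue
            if entry.stat().st_mtime < cutoff:
                stale.append(entry.path)
    return sorted(stale)
```

For instance, `find_stale_files(log_dir, ["*.mdmp", "*.txt"], older_than_days=30)` would return old dump files and text reports for review, where `log_dir` and both patterns are placeholders to adapt to the actual folder contents.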