Troubleshoot event ID 1020 warnings on a file server that is running Windows Server

https://ift.tt/3hM3zX0

Symptoms


On a Windows Server-based file server, the following Event ID 1020 warning from SMB-Server is logged in the Microsoft-Windows-SMBServer/Operational event log:

File system operation has taken longer than expected.

Client Name: <Client-IP/Name>
Client Address: <Client-IP>:<Client-Port>
User Name: <Username>
Session ID: <SMB-Session-ID>
Share Name: <SMB-Share-Name>
File Name: <File-Name>
Command: <SMB-Command-Code>
Duration (in milliseconds): <Duration>
Warning Threshold (in milliseconds): 15000

Guidance:
The underlying file system has taken too long to respond to an operation. This typically indicates a problem with the storage and not SMB.

You may also notice the following symptoms:

  • Performance issues occur on clients that connect to the file server.
  • Connectivity issues occur on clients that connect to the file server.
  • Performance issues occur in applications or other components that run locally on the file server.
  • The file server appears to stop responding.

Cause


This problem occurs because Windows takes more than 15 seconds (the default warning threshold) to complete an I/O operation against the file system.

In this situation, the SMB server has encountered a stalled I/O operation. This indicates that a severe problem is affecting the underlying file system rather than SMB itself.

On a file server that is performing well, you should see single-digit millisecond response times from its file system.

The event entry contains the exact duration of the delay and the code of the SMB command that was delayed. For a list of SMB2 command codes, see section 2.2.1.2, "SMB2 Packet Header – SYNC," of the [MS-SMB2] protocol specification.
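As an illustration, the following minimal Python sketch pulls the fields out of the message text of an event 1020 entry and translates the command code into its SMB2 command name. The code table follows [MS-SMB2] section 2.2.1.2. The function names are hypothetical, and the sketch assumes that the Command field holds either a decimal number or a 0x-prefixed hexadecimal number.

```python
# SMB2 command codes as listed in [MS-SMB2] section 2.2.1.2.
SMB2_COMMANDS = {
    0x0000: "NEGOTIATE",
    0x0001: "SESSION_SETUP",
    0x0002: "LOGOFF",
    0x0003: "TREE_CONNECT",
    0x0004: "TREE_DISCONNECT",
    0x0005: "CREATE",
    0x0006: "CLOSE",
    0x0007: "FLUSH",
    0x0008: "READ",
    0x0009: "WRITE",
    0x000A: "LOCK",
    0x000B: "IOCTL",
    0x000C: "CANCEL",
    0x000D: "ECHO",
    0x000E: "QUERY_DIRECTORY",
    0x000F: "CHANGE_NOTIFY",
    0x0010: "QUERY_INFO",
    0x0011: "SET_INFO",
    0x0012: "OPLOCK_BREAK",
}


def parse_event_1020(message):
    """Return the 'Name: value' lines of the event message as a dict."""
    fields = {}
    for line in message.splitlines():
        # Split at the first colon only, so values such as
        # "10.0.0.5:49731" are kept intact.
        name, sep, value = line.partition(":")
        if sep and value.strip():
            fields[name.strip()] = value.strip()
    return fields


def command_name(code_text):
    """Translate a command code (assumed decimal or 0x-prefixed hex)."""
    text = code_text.strip().lower()
    code = int(text, 16) if text.startswith("0x") else int(text)
    return SMB2_COMMANDS.get(code, "UNKNOWN({})".format(code))


def describe_delay(message):
    """Return (command name, delay in seconds) for one event message."""
    fields = parse_event_1020(message)
    duration_ms = int(fields["Duration (in milliseconds)"])
    return command_name(fields["Command"]), duration_ms / 1000.0
```

For example, an entry whose Command field is 5 and whose duration is 18230 milliseconds would be reported as a CREATE request that was stalled for about 18 seconds.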

File system delays on the order of several seconds can be caused by malfunctioning file system filter drivers.

Other possible causes include severe performance problems that affect the physical storage. These include the following:

  • Overload of the physical disks
  • Prolonged disk freeze operations, such as those run by VSS or other backup solutions
  • The network or storage stack of an underlying hypervisor
  • The network connections to the storage
  • The SAN or NAS storage appliance itself

File system delays below the 15-second threshold do not produce a Warning event but are still harmful to file server performance.

Troubleshooting


Collect trace logs

To further diagnose whether the problem originates from inside Windows (for example, filter drivers) or from outside the operating system (for example, hardware, hypervisor, network, or storage), take a Storport trace, and then check the disk response times by using tools such as StorPortPacman.

A Storport trace captures the lower end of the Windows storage stack, whereas the file server and other applications encounter the delays at the upper end of the stack. For more information about the StorPortPacman tool, see Deciphering Storport Traces 101 and StorPortPacman.

If you notice high maximum response times at the Storport level, this indicates that the cause of the performance problem is outside the operating system.

To determine the latencies that the system encounters in its logical disks at the application (file server) level, you can enable Perfmon or WPR tracing. Doing this also shows latencies below the 15-second warning threshold. For more information, see Measuring Disk Latency with Windows Performance Monitor (Perfmon).
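Once a Perfmon counter log is available, the latency samples can be summarized with a short script. The following Python sketch is one possible approach under several assumptions: the log has been exported to CSV, it contains LogicalDisk "Avg. Disk sec/..." counters (which report values in seconds), and the 25-millisecond threshold is an arbitrary illustrative value, not a documented limit. The function name is hypothetical.

```python
import csv
import io


def summarize_disk_latency(csv_text, threshold_ms=25.0):
    """Summarize 'Avg. Disk sec/...' columns of a Perfmon CSV export.

    threshold_ms is an illustrative value chosen for this sketch.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # Column 0 holds the timestamp; keep only the latency counters.
    columns = [i for i, name in enumerate(header) if "Avg. Disk sec/" in name]
    stats = {
        header[i]: {"samples": 0, "max_ms": 0.0, "over_threshold": 0}
        for i in columns
    }
    for row in reader:
        for i in columns:
            try:
                latency_ms = float(row[i]) * 1000.0  # seconds to milliseconds
            except (ValueError, IndexError):
                continue  # blank or missing sample
            entry = stats[header[i]]
            entry["samples"] += 1
            entry["max_ms"] = max(entry["max_ms"], latency_ms)
            if latency_ms > threshold_ms:
                entry["over_threshold"] += 1
    return stats
```

A counter whose maximum is far above single-digit milliseconds, or that exceeds the chosen threshold in many samples, points to the logical disk that deserves a closer look.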

Collect kernel dump file

For extreme delays (10 minutes or more), and if certain other conditions are met, the file server tries to create a live kernel dump file to help with troubleshooting.

The result of this attempt is logged as one of the following event entries in the Microsoft-Windows-SMBServer/Operational event log:

Event 1031: “The server detected a problem and has captured a live kernel dump to collect debug information.”

Event 1032: “The server detected a problem but was unable to capture a live kernel dump to collect debug information.”

If a dump file was successfully created, it can be found under %SystemRoot%\LiveKernelReports. This information can be immensely valuable for troubleshooting.
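To check quickly whether any dump files exist, a script can walk that folder. The Python sketch below is a minimal example. It assumes that the dump files use the .dmp extension and may sit in subfolders, and the function name is hypothetical.

```python
import os
from pathlib import Path


def find_live_dumps(root=None):
    """Return .dmp files under the live kernel report folder, newest first."""
    if root is None:
        # %SystemRoot% is expanded only when the script runs on Windows.
        root = os.path.expandvars(r"%SystemRoot%\LiveKernelReports")
    base = Path(root)
    if not base.is_dir():
        return []
    # Assumption: dump files end in .dmp and may be stored in subfolders.
    dumps = [p for p in base.rglob("*.dmp") if p.is_file()]
    return sorted(dumps, key=lambda p: p.stat().st_mtime, reverse=True)
```

Comparing the modification time of the newest file with the time of the event 1031 entry helps identify the dump file that belongs to the incident.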

More information


Because the Server Message Block (SMB) server is accessing the local file system on behalf of its SMB clients, performance problems that occur on the SMB server directly affect the clients.

Accumulated delays on individual operations can mean very large wait times for client applications if several operations are run sequentially.
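A rough calculation shows how quickly such delays add up. The Python sketch below uses purely illustrative numbers (a client task that issues 2,000 sequential requests) and a hypothetical function name. It ignores network time and any parallelism.

```python
def total_wait_seconds(operations, latency_ms):
    """Total wait for operations that run strictly one after another."""
    return operations * latency_ms / 1000.0


# Illustrative values: 2,000 sequential requests.
healthy = total_wait_seconds(2000, 5)     # 5 ms per request: 10 seconds
degraded = total_wait_seconds(2000, 500)  # 500 ms per request: 1,000 seconds
```

In this example, a per-request delay that is still far below the 15-second warning threshold turns a 10-second task into one that takes more than 16 minutes.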

Additionally, long I/O response times can cause problems when you try to access shares, and can make the file server appear to stop responding.

Other applications and components that are running on this server and that access the same file system may also be adversely affected by the long response times. These applications and components may not have their own logging or monitoring for high I/O response times.

Notes

  • Not all disks are necessarily affected.
  • Not all disks are necessarily affected at the same time. 
  • Not all disks are necessarily affected to the same degree.

Check the SMB-Server 1020 events for detailed information and patterns.
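One way to look for patterns is to group the events by share and by time of day. The Python sketch below assumes that the events have already been extracted into dictionaries that have the hypothetical keys "time" (an ISO 8601 timestamp string), "share", and "duration_ms". These keys are not part of any Windows API.

```python
from collections import Counter
from datetime import datetime


def summarize_patterns(events):
    """Count events per share and per hour, and track the worst delay."""
    by_share = Counter()
    by_hour = Counter()
    worst_ms = {}
    for event in events:
        share = event["share"]
        hour = datetime.fromisoformat(event["time"]).hour
        by_share[share] += 1
        by_hour[hour] += 1
        worst_ms[share] = max(worst_ms.get(share, 0), event["duration_ms"])
    return {"by_share": by_share, "by_hour": by_hour, "worst_ms": worst_ms}
```

Events that cluster on one share suggest a problem with the disk behind that share, and events that cluster at a particular hour may coincide with scheduled activity such as backups.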

via Microsoft Support – Windows Server 2012 R2 https://ift.tt/2LdKAZ9

July 28, 2020 at 02:15AM