Is enabling SMB Signing on your NetApp a non-disruptive change?

We received the following alert from our ActiveIQ Unified Management Appliance (and a similar one in ActiveIQ / AutoSupport): Alert from Active IQ Unified Manager: Advisory ID: NTAP-20160412-0001

You can find more details here: https://security.netapp.com/advisory/ntap-20160412-0001/

After reviewing it, fixing it seemed like a straightforward change, but I wanted to know: is enabling SMB signing on your NetApp a non-disruptive change?

Everything I’ve read says it has been supported since Windows 98, and if you’ve disabled SMBv1 (which you hopefully have) everyone should be using it anyway with SMBv2 and newer, which signs by default. On top of that, Domain Controllers use signing by default for things like SysVol, and I assume DFS if you have that on your Domain Controllers. Windows also negotiates whether or not to use SMB signing based on client/server settings, and by default it prefers the more secure use of signing unless someone is man-in-the-middling you and downgrading your connection or you’re using... Windows 95?
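To make the negotiation part concrete, here is a minimal sketch of how I understand the SMB2-and-newer rule: each side is either "required" or "not required", and the session is signed if either side requires it. The function name is mine, and this deliberately ignores SMBv1, which has a third "enabled" state.

```python
def smb2_session_is_signed(client_requires: bool, server_requires: bool) -> bool:
    """SMB2+ model: signing is used when either end requires it."""
    return client_requires or server_requires


if __name__ == "__main__":
    # Print the full four-row matrix of client/server settings.
    for client in (True, False):
        for server in (True, False):
            signed = smb2_session_is_signed(client, server)
            print(f"client requires={client!s:5} server requires={server!s:5} "
                  f"signed={signed}")
```

Under this model, turning on required signing on the filer only changes the outcome for clients that were not already requiring it themselves.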

Since I couldn’t find any kind of answer to my question I figured I’d post something to hopefully help the next person wondering the same thing and faced with this security alert.

So, is enabling SMB signing on your NetApp a non-disruptive change? He asked again, out loud, like a crazy person.

Short answer: No.

Long answer: Nope but it’s probably not that bad.

I enabled SMB signing on our NetApp (ONTAP 9.7P14) and about 95% of clients didn’t even notice, but 5% did.

The 5% of clients that had a problem with SMB signing immediately lost access to all shares hosted on the NetApp and would get a “You do not have permissions to access this” error message.

For remote workers it was easy: disconnect/reconnect your VPN and that solved it. On-premises workers had to log off/on or reboot. Servers, though, had to be rebooted.

The kicker? Clients that had problems ranged from Windows 7 (I KNOW) to Windows 10. Servers that had problems? Server 2008 R2 (I KNOW) up to 2012 R2. Surprisingly, none of our 2016 or 2019 servers had a problem, but we have significantly fewer of those, so plan accordingly if you’re doing this.

Here is an example: We had two identical 2012 R2 servers. One worked post change, one didn’t. We had to reboot the one with the issue and then everything was good again.

My advice if you are tasked with implementing this in your organization?

For desktops: Ask your clients to log off when they leave for the day and make the change in the evening.

For servers: Had I been smarter I could have enabled SMB signing on Patch Tuesday right before server reboots. That would have caused the least disruption and folded in nicely to our existing maintenance window. If that isn’t an option for you, have a quick test plan to check if each server can access a share and, if it can’t, reboot it.
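A quick test plan can be as small as a script you run on each server after the change. This is a rough sketch, not something I ran at the time: the UNC path is a made-up placeholder, and it only proves the account running the script can list the share.

```python
import os


def can_list_share(path: str) -> bool:
    """Return True if the current user can list the directory at `path`."""
    try:
        with os.scandir(path) as entries:
            next(entries, None)  # force at least one read from the share
        return True
    except OSError:
        # Covers permission errors, missing paths and network failures.
        return False


if __name__ == "__main__":
    # Hypothetical share name; replace with one hosted on your NetApp.
    share = r"\\netapp01\share"
    status = "OK" if can_list_share(share) else "FAILED, consider a reboot"
    print(f"{share}: {status}")
```

Run it under an account that normally has access, otherwise a genuine permissions problem looks the same as the signing problem.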

There is potentially another option I was exploring but abandoned. You could build a GPO that makes SMB signing required and apply it to your Desktops/Servers ahead of time. After the GPO has propagated, in theory, you should be able to enable SMB signing on the NetApp and since all systems are already required to use it, there should be no disruption.
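If you try the GPO route, you would want to confirm it actually landed before touching the NetApp. A minimal sketch, assuming the policy in question is "Microsoft network client: Digitally sign communications (always)", which is stored as RequireSecuritySignature under the LanmanWorkstation parameters key. The helper names are mine, and the registry read only works on Windows.

```python
import sys

KEY = r"SYSTEM\CurrentControlSet\Services\LanmanWorkstation\Parameters"
VALUE = "RequireSecuritySignature"


def describe(value):
    """Turn the raw DWORD (or None if absent) into a readable state."""
    if value is None:
        return "not set"
    return "required" if value == 1 else "not required"


def read_client_setting():
    """Read the SMB client signing requirement from the local registry."""
    import winreg  # Windows-only module, imported here on purpose

    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY) as key:
        try:
            value, _value_type = winreg.QueryValueEx(key, VALUE)
        except FileNotFoundError:
            return None
    return value


if __name__ == "__main__" and sys.platform == "win32":
    print("SMB client signing:", describe(read_client_setting()))
```

Since I abandoned this approach, treat it as untested; it only tells you what the client is configured to do, not whether existing sessions have renegotiated.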

There you go. My lessons learned from this experience. Good luck. Hopefully this helps someone.

NetApp provides documentation here on how to enable SMB signing: https://library.netapp.com/ecmdocs/ECMP1366834/html/GUID-9C1135BA-5DEB-4E0F-9F58-3AED83DA1DD3-copy.html

How to remove the NetApp Host Utilities hotfix check

 

Update 2018-10-18: Check out this comment. Looks like a much easier method than my MSI editing one.

 

Sick of the hotfix requirements for NetApp Host Utilities? Me too.

On a test/dev server I’ve been experimenting with Veeam, iSCSI, ReFS and Server 2016. I wanted to run the NetApp Host Utilities against my Server 2016 box and was told I was missing Q3197954.

Problem is this 2016 server is fully patched, and if I manually download Q3197954 (which appears to be a previous rollup) I can’t install it on my fully patched server.

On a hunch I figured I might be able to crack open the MSI and find a way around the hotfix check, and I was right. Here is how you do it:

  1. Download and install Orca MSI Editor (Alternate Download)
  2. Download the NetApp Host Utilities package from https://now.netapp.com/ (v7.1 as of this writing)
  3. Launch Orca and open the MSI file with it
  4. Select ‘InstallUISequence’ on the left and then click the ‘Sequence’ column on the top right to sort by lowest value first
  5. Locate ‘CheckMSHotfixes’, right click on it and choose ‘Drop row’
  6. Click ‘Save’
  7. Close Orca
  8. Copy the edited MSI to your server and run the installation as per normal

Digging around the MSI a bit I found references to “IGNORE_HOTFIX_CHECK”. I suspect it’s some kind of flag or environment variable that could be set to accomplish the same thing, but I was unable to find any documentation on how, and a few guesses using environment variables didn’t pan out, so I decided to stick with the above solution.

A note of caution. This is likely 100% unsupported by NetApp and if they found out you did this to get the Host Utilities on your server you might run into support issues.

Single Mailbox Recovery can’t connect to Exchange

Our AD structure contains three domains, one root domain (int.mydom.com) and two sub-domains, one for servers/services (it.int.mydom.com) and one for user accounts (users.int.mydom.com).

  • int.mydom.com
    • users.int.mydom.com (Users live here)
    • it.int.mydom.com (Privileged IT accounts and Exchange live here)

We are running into problems with NetApp Single Mailbox Recovery where it is unable to connect to the target mailbox for drag/drop restores. In the meantime we’re just exporting PST files and letting the user do the work.

We started with SMR 7.1, where you could specify connecting to all Exchange mailboxes instead of an individual user’s mailbox. We found we had to change the domain controller SMR was using to find the mailboxes to one of the users.int.mydom.com DCs. By default SMR would select a DC in it.int.mydom.com, where Exchange and our privileged IT accounts are.

We’re running Exchange 2016, so there were other issues when trying to do a drag/drop restore, and we had to upgrade to SMR 7.2, which added support for Exchange 2016.

SMR 7.2 removed the ability to open all mailboxes and specify a domain controller, so we’re stuck again.

After opening a case with NetApp they confirmed this is a known issue and SMR 7.2.1 should resolve it:

Please note that after checking with additional resources from SnapManager for Exchange we confirm there is a known issue with this function, there will be changes to this in the next version of SMBR, the Next SMBR release and version 7.2.1 should come soon but I am afraid at this point there is no a workaround we can follow.

So now we wait.

If you’re really desperate for drag/drop and have an environment like ours you might be able to create privileged accounts in users.int.mydom.com and use them for the restores. That might work.

NetApp SnapManager for Exchange failing with VSS_E_WRITERERROR_RETRYABLE

We ran into a problem with Exchange 2010, a FAS2240-4 and SnapManager for Exchange 7.1 where backups would randomly fail every now and then, and then started failing consistently.

Our DFM server would send us an e-mail when the failure occurred that looked like this:

CLIENT APP ERROR Backup: SME Version 7.1: (111) on dora02 at Sun Sep 04 22:09:31 PDT 2016

The backup failure would also knock the databases offline and require us to re-sync them the next day.

Digging into the SME logs we found the following:

[22:10:42.635]  *****BACKUP DETAIL SUMMARY*****

 [22:10:42.635]  Backup group set #1: 

 [22:10:42.635]  Backup SG/DB [FK] Error: SnapManager detected the following Exchange writer error. Please retry SnapManager operation.
 VSS_E_WRITERERROR_RETRYABLE: The writer failed due to an error that might not occur if another snapshot copy is created.


 [22:10:42.635]  Backup SG/DB [LQ] Error: SnapManager detected the following Exchange writer error. Please retry SnapManager operation.
 VSS_E_WRITERERROR_RETRYABLE: The writer failed due to an error that might not occur if another snapshot copy is created.


 [22:10:42.635]  Backup SG/DB [RZ] Error: SnapManager detected the following Exchange writer error. Please retry SnapManager operation.
 VSS_E_WRITERERROR_RETRYABLE: The writer failed due to an error that might not occur if another snapshot copy is created.


 [22:10:42.635]  Backup SG/DB [AE] Error: SnapManager detected the following Exchange writer error. Please retry SnapManager operation.
 VSS_E_WRITERERROR_RETRYABLE: The writer failed due to an error that might not occur if another snapshot copy is created.


 [22:10:42.635]  Backup SG/DB [Public Folders (<SERVER>)] Error: SnapManager detected the following Exchange writer error. Please retry SnapManager operation.
 VSS_E_WRITERERROR_RETRYABLE: The writer failed due to an error that might not occur if another snapshot copy is created.




 [22:10:42.635]  ***SNAPMANAGER BACKUP JOB ENDED AT: [09-04-2016_22.10.42]

 [22:10:42.635]  Failed to backup storage groups/databases.
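When you have a pile of these logs, pulling out which databases failed is quicker with a few lines of script. This is a minimal sketch that assumes the "Backup SG/DB [NAME] Error:" line format shown in the excerpt above; the function name is my own.

```python
import re

# Matches lines like: Backup SG/DB [FK] Error: SnapManager detected ...
FAILED_DB = re.compile(r"Backup SG/DB \[(?P<db>.+?)\] Error:")


def failed_databases(log_text: str) -> list:
    """Return the database names reported as failed, in log order."""
    return [match.group("db") for match in FAILED_DB.finditer(log_text)]


if __name__ == "__main__":
    sample = (
        "[22:10:42.635]  Backup SG/DB [FK] Error: SnapManager detected the "
        "following Exchange writer error.\n"
        "[22:10:42.635]  Backup SG/DB [Public Folders (<SERVER>)] Error: "
        "SnapManager detected the following Exchange writer error.\n"
    )
    print(failed_databases(sample))
```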

NetApp provides this page for what they call “Common VSS errors”: https://kb.netapp.com/support/index?page=content&id=1010785&locale=en_US

None of the suggestions there helped us.

In the end I found this forum post for a different product: https://community.emc.com/thread/168678?tstart=0 and applied the registry edits they suggested here: https://community.emc.com/message/705346#705346

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\TcpWindowSize=256000
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\GlobalMaxTcpWindowSize=16777216
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\KeepAliveInterval=1000
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\KeepAliveTime=600000

and then rebooted our Exchange server running SME.
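If you would rather import these than click through regedit, the four values can be rendered as a .reg file. This is a sketch under one assumption worth checking first: that the numbers in the forum post are decimal REG_DWORD values. The render_reg helper is mine, not part of any tool.

```python
KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"

# Assumed to be decimal REG_DWORD values, as listed in the forum post.
VALUES = {
    "TcpWindowSize": 256000,
    "GlobalMaxTcpWindowSize": 16777216,
    "KeepAliveInterval": 1000,
    "KeepAliveTime": 600000,
}


def render_reg(key: str, values: dict) -> str:
    """Render DWORD values as the text of a .reg import file."""
    lines = ["Windows Registry Editor Version 5.00", "", f"[{key}]"]
    for name, number in values.items():
        # .reg files store DWORDs as 8 hex digits.
        lines.append(f'"{name}"=dword:{number:08x}')
    return "\r\n".join(lines) + "\r\n"


if __name__ == "__main__":
    print(render_reg(KEY, VALUES))
```

Export the existing key before importing anything so you have a way back.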

Since making the change roughly 20 days ago we haven’t had a single failed backup.

NetApp SnapManager for Exchange – Error Code: 0xc004146f

We use SnapManager for Exchange and NetApp Single Mailbox Recovery to back up our Exchange 2010 environment.

We’ve noticed that randomly our backups will fail with a verification error. The backups are still good and the failure usually happens 2-3 times in a row and then doesn’t happen again for a month.

NetApp has a KB article (login required) for this which basically tells you to verify that a set of applicable Exchange DLL and EXE files are all the same version on all of our Exchange servers that SME runs on.

We did this and they are. In our case only a single server is involved when it comes to SME, so the versions on our other Exchange servers are irrelevant (I think).

Here is a snippet of what our log looks like:

[02:47:01.595]  ErrCheckLogs: failed with ERROR -1811

[02:47:01.595]  Operation terminated with error -1811 ChecUnknown, in 0.0 seconds.

[02:47:01.595]  WARNING: Database/log consistency checking returned with error code 0xFFFFF8ED.

[02:47:01.611]  Error Code: 0xc004146f
Transaction log verification failed.


[02:47:01.611]  Error Code: 0xc004146f
Transaction log verification failed.


[02:47:01.611]  Error Code: 0xc004146f
Transaction log verification failed.

[02:47:01.611]  Updating SnapInfo file at C:\Exchange\Logs\lun_Exchange_Logs_AE\sme_snapinfo\EXCH__MYSERVER\SG__AE\08-31-2015_22.00.12__Daily\SnapInfo__08-31-2015_22.00.12.sme with LogConsistentCheck=Failure

[02:47:02.126]  Verification failed. Error code: 0xc004146f, LUN: C:\Exchange\Databases\lun_Exchange_AE\
[02:47:02.126]  An autosupport message is sent on failure to the storage system of LUN [C:\Exchange\Databases\lun_Exchange_AE\].

[02:47:02.126]  ***LOG VERIFICATION STATUS: AE

[02:47:02.126]  Failed to verify physical integrity of the transaction logs.

[02:47:02.126]  ***LOG CONSISTENT VERIFICATION: Public Folders (MYSERVER)

[02:47:02.126]  Start to verify the logs in SnapInfo directory...
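One thing worth noticing in that log: -1811 and 0xFFFFF8ED are the same number, just printed as a signed integer and as an unsigned 32-bit hex value. A small sketch of the conversion (the helper name is mine):

```python
def to_signed32(value: int) -> int:
    """Interpret an unsigned 32-bit value as a signed two's-complement int."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


if __name__ == "__main__":
    print(to_signed32(0xFFFFF8ED))  # prints -1811
```

As far as I can tell, -1811 is listed in the ESE error codes as JET_errFileNotFound, which is worth keeping in mind when searching on either form of the code.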

I’ve opened a case with NetApp and it appears they have an internal bug for this issue which you can learn nothing about here: http://mysupport.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=132153

 

Update 2015-09-18

Heard back from NetApp and they recommend the following:

  1. Disable circular logging (this is enabled in our case)
  2. There are log files with bad generation numbers. You can fix the issue by running the SnapManager for Exchange backup without verification, which will commit and truncate the transaction logs and remove the log files with the bad generation numbers. Then you can re-run the job with verification enabled.

 

Update 2015-09-22

After finally digging into the circular logging configuration on our Exchange system it turns out we only had it enabled for 2 of our 5 mail stores. I then went back through all of our failure logs and the only databases that ever generated a verification error were the 2 configured for circular logging.
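Going back through the failure logs by hand works, but a tally per LUN makes the pattern obvious. A minimal sketch, assuming the "Verification failed. Error code: ..., LUN: ..." line format from the log excerpt earlier in this post; the function name is hypothetical.

```python
import re
from collections import Counter

# Matches: Verification failed. Error code: 0xc004146f, LUN: C:\path\to\lun\
VERIFY_FAILED = re.compile(
    r"Verification failed\. Error code: (?P<code>0x[0-9a-fA-F]+), "
    r"LUN: (?P<lun>[^\r\n]+)"
)


def failures_by_lun(log_texts) -> Counter:
    """Count verification failures per LUN path across many log files."""
    counts = Counter()
    for text in log_texts:
        for match in VERIFY_FAILED.finditer(text):
            counts[match.group("lun").strip()] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "[02:47:02.126]  Verification failed. Error code: 0xc004146f, "
        "LUN: C:\\Exchange\\Databases\\lun_Exchange_AE\\\n"
    )
    for lun, count in failures_by_lun([sample]).items():
        print(count, lun)
```

Feed it the contents of each SME report file and compare the LUNs that show up against the databases with circular logging enabled.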

I’ve disabled circular logging on both of those mail stores and we’ll see if that solves this problem.