NetApp SnapManager for Hyper-V won’t install because KB2263829 isn’t installed

I’m deploying a new Hyper-V cluster on Windows Server 2008 R2 with SP1 and trying to install NetApp SnapManager for Hyper-V 1.1.

As part of the installation pre-requisites NetApp wants you to install the following KBs:

  • KB2406705
  • KB2531907
  • KB2263829
  • KB2494016
  • KB2637197
  • KB2517329

If you’ve already run Windows Update on your system before getting to this list of pre-reqs, some of these KBs will have been superseded by newer ones. For the most part that seems fine, with one exception.

KB2263829 is now obsolete (or at least in my deployment it was) and would not install, telling me “The update is not applicable to your computer.”

An uninstall and re-install of the Hyper-V role still wouldn’t let me install KB2263829, and because it wasn’t installed, SnapManager for Hyper-V refused to install, complaining the hotfix was missing.

I called up NetApp and was told to call Microsoft to find out why the hotfix wouldn’t install. They didn’t seem to like my explanation that the hotfix now appeared to be obsolete. A bit of Googling turned up this discussion thread: https://communities.netapp.com/message/91200

The short answer is that the problem should be resolved in SnapManager for Hyper-V 1.2. In the meantime you can install KB2586470 on your Hyper-V server and then SnapManager for Hyper-V will install… or I guess you could use non-SP1 media for your Server 2008 R2 deployment and make sure you get KB2263829 installed before running Windows Updates.
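If you want to quickly confirm which of the listed hotfixes (and the KB2586470 workaround) are actually present before running the installer, a quick PowerShell check will do it. This is just a sketch using the built-in Get-HotFix cmdlet; note that a KB superseded by a newer update will still show as missing here.

# List of the SnapManager pre-req KBs plus the KB2586470 workaround
$kbs = 'KB2406705','KB2531907','KB2263829','KB2494016','KB2637197','KB2517329','KB2586470'

# HotFixID is the KB number as reported by Windows
$installed = Get-HotFix | Select-Object -ExpandProperty HotFixID

foreach ($kb in $kbs) {
    if ($installed -contains $kb) {
        "$kb is installed"
    } else {
        "$kb is missing (or has been superseded)"
    }
}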

Symantec Backup Exec NDMP backups fail the verify stage when backing up from a NetApp

We are using Symantec Backup Exec 2010 R3 to perform NDMP backups of our NetApp FAS2020 running Data OnTap 7.3.7.

All was working well until we upgraded from Data OnTap 7.3.6 to 7.3.7. Since then Backup Exec 2010 R3 has been reporting every NDMP backup as a failure, with the error below during the verify stage of the backup:

Job ended: Wednesday, September 12, 2012 at 3:52:55 AM
Completed status: Failed
Final error: 0xe000fe0d - A device-specific error occurred.
Final error category: Resource Errors

For additional information regarding this error refer to link V-79-57344-65037

When we contacted NetApp they told us Symantec was at fault.

Symantec did some digging into job logs and came across this error in the NDMP jobs log file, taken using Backup Exec’s built-in logging tools:

BENGINE: [07/09/12 10:58:57] [8864] [ndmp\ndmpcomm] - ERROR: 7 Error: I/O error
BENGINE: [07/09/12 10:58:57] [8864] [loops] - NDMP Log Message: Storing of nlist entries failed.
BENGINE: [07/09/12 10:58:57] [8864] [loops] - NDMP Notify Data Halted: Aborted
BENGINE: [07/09/12 10:58:57] [8864] [loops] - NDMP Log Message: Aborted by client

Symantec then reviewed their ticketing system, saw that other NetApp customers had hit this exact same problem and been told to contact NetApp, and recommended I contact NetApp again.

After getting hold of NetApp again with the above information, they have now told me this is a known issue and there is an internal bug report at NetApp for it. There is supposedly a known fix but it is not yet available for any shipped versions of Data OnTap. The internal bug report lists the following workarounds for NetBackup (which I assume will work for Backup Exec):

  1. Restore the directory to another location and extract the file after the restore completes.
  2. To perform a single file restore without using DAR, set the value of the environment variable EXTRACT to e or E. However, the single file restore reads the whole backup stream on the tape and this restore operation might be slow.
  3. Set the NDMP version on the storage system to version 3 and then perform the restore.

I’ve tested option 3 by running the following commands on our FAS2020:

filer> ndmpd off
filer> ndmpd version 3
filer> ndmpd on

and backups are now failing with a different error:

Job ended: Wednesday, September 12, 2012 at 2:49:51 PM
Completed status: Failed
Final error: 0xe000feb9 - The NDMP subsystem reports that a request cannot be processed because it is in the wrong state to service the request.
Final error category: Resource Errors

For additional information regarding this error refer to link V-79-57344-65209

I reverted the NDMP version back to 4 and will now wait for a conference call with NetApp and Symantec to get to the bottom of this.

Despite the reported failures, the backups themselves still appear to be good.

Update – September 14th, 2012

After a conference call with Symantec and NetApp, the final conclusion is that this is a bug that only exists in Data OnTap 7.3.7. It will be fixed in Data OnTap 7.3.7P1, which should be released sometime in the near future. No exact dates were provided.

The public bug report for this on NetApp’s site is 613414. You can subscribe to that bug with your NetApp account and when it’s resolved (the release of 7.3.7P1) you will receive an e-mail. The NetApp rep wasn’t certain if general e-mails go out to NetApp customers for ‘p’ releases of Data OnTap, so subscribing to the bug should guarantee notification when the new version is released.

That public bug report states the problem only affects Data OnTap 7.3.8. That is incorrect; it should read 7.3.7.

In the meantime the workarounds remain almost the same:

  1. Create CIFS shares for the volumes you back up via NDMP and change your backups to use the CIFS share instead of NDMP
  2. Disable backup verification in Backup Exec for your NDMP jobs
  3. Downgrade to Data OnTap 7.3.6
  4. Wait for Data OnTap 7.3.7p1

 

Update – October 4th, 2012

Sam in the comments got this update from NetApp:

This fix will be included in 7.3.7P1, which currently has a target release date of Oct 29th.

Here’s hoping.

 

Update – October 30th, 2012

Data OnTap 7.3.7P1 is out! We have one confirmation that this patch has fixed the verify problem.

Release Notes and Download: http://support.netapp.com/NOW/download/software/ontap/7.3.7P1/
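If you want to confirm which release a filer is actually running before and after the upgrade, the version command on the filer console will tell you (a quick sketch; the exact build string will vary):

filer> version
NetApp Release 7.3.7P1: <build date>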

 

Update – November 27th, 2012

I can confirm that the 7.3.7P1 patch has corrected this problem for us.

Got a NetApp and inheritance isn’t working properly?

I’m in the middle of a file server migration, attempting to consolidate two old file servers onto our FAS2040 running Data OnTap 7.3.6.

Using a tool called Beyond Compare I began by syncing the content from one file server over to the NetApp. Beyond Compare did the bulk of my sanity checking for me by verifying the contents now on the NetApp matched the contents on the old file server. I then manually began to verify the NTFS permissions were copied properly. To my surprise I found that random directories I had synced to the NetApp had inheritance enabled on them, while the original directories on the old file server have inheritance disabled.

Here’s the basic structure of our shares:

Old File Server
- Root Folder (Inheritance enabled)
-- [A-Z] Folders (Inheritance enabled)
--- User's Folder (Inheritance disabled)

Real-ish Example:
- User Drives Folder (Inheritance enabled)
-- T Folder (Inheritance enabled)
--- Test User 1 (Inheritance disabled)
--- Test User 2 (Inheritance disabled)
--- Test User 3 (Inheritance disabled)

 

Using Beyond Compare I synced the ‘T’ folder and here is what I ended up with:

File Server (Source):
- User Drives Folder (Inheritance enabled)
-- T Folder (Inheritance enabled)
--- Test User 1 (Inheritance disabled)
--- Test User 2 (Inheritance disabled)
--- Test User 3 (Inheritance disabled)
--- Test User 4 (Inheritance disabled)

NetApp (Destination): 
- User Drives Folder (Inheritance enabled)
-- T Folder (Inheritance enabled)
--- Test User 1 (Inheritance disabled)
--- Test User 2 (Inheritance enabled)
--- Test User 3 (Inheritance enabled)
--- Test User 4 (Inheritance disabled)

 

At first I thought this was a bug with Beyond Compare so I grabbed another file copying utility and tried syncing the ‘T’ folder again. Same results.

I checked NetApp’s support site and came across Bug 8209, titled “ACL Inheritance for New Directories in Error”. According to the bug text (which is very brief), “Prior to OnTap 6.1.1, new directories sometimes incorrectly inherited their parent directories permissions.”

This bug has been resolved in the following versions of Data OnTap:

  • Data ONTAP 6.1.1 (First Fixed)
  • Data ONTAP 6.1.3R2 (GA)
  • Data ONTAP 6.4.5 (GA)
  • Data ONTAP 6.5.7 (GA)
  • Data ONTAP 7.0.7 (GD)
  • Data ONTAP 7.1.3 (GD)
  • Data ONTAP 7.2.7 (GD)
  • Data ONTAP 7.3.3 (GD)
  • Data ONTAP 7.3.7 (GA)
  • Data ONTAP 8.0.3 (GA)
  • Data ONTAP 8.1 (GA)
  • Data ONTAP 8.1.1RC1 (RC)

So if you’re not running one of those versions you’re probably going to need to upgrade OR manually check every folder you migrate or create to verify inheritance is disabled.

 

Update – August 21st, 2012

So I’ve finally upgraded our NetApp to Data OnTap 7.3.7 and this problem is still occurring for me. It seems less frequent now, and so far it has only affected empty directories, though not all empty directories.

Update – August 22nd, 2012

The problem occurs when using either Beyond Compare or RichCopy 4.0.217 to migrate the files.

I did some experimenting and it looks like the problem is some kind of corruption on the source directory itself. Using Beyond Compare I tried copying one of the problem directories to a brand new volume created on our FAS2020 after its Data OnTap 7.3.7 upgrade, and inheritance was enabled. Copying the same directory to our FAS2040, onto a volume created before the Data OnTap 7.3.7 upgrade, also resulted in inheritance being enabled. I then tried copying the problem directory to a share created on a regular Windows 2003 Standard server and inheritance was once again enabled.

On the source file server I re-applied the permissions on one of the problem directories (right-click the directory, Properties, Security, Advanced, check “Replace permission entries on all child objects with entries shown here that apply to child objects”, Apply, OK) and then tried copying it again with Beyond Compare. No more inheritance problems for that directory.

It looks like at this point we may just have some corrupt/damaged NTFS permissions on some of our directories.

So what am I going to do about it? I managed to cobble together a PowerShell script from a few sources that can check for inheritance on directories and spit out a list. I ran it against the source file server and it didn’t find any problems, so I can’t fix the problem pre-migration. That means you’ll have to run this script post-migration on your destination and manually fix inheritance for the directories the script finds. I’m sure PowerShell could fix it for you automatically too, but I haven’t gotten there yet (there’s a rough sketch of what that might look like at the end of this post).

Here is the script:

# Change Z:\ to the directory you want to scan
# Lists the directories directly under that path that have any inherited ACEs
# (i.e. inheritance is enabled on them)

Get-ChildItem z:\ | ? {$_.PSIsContainer} | ? {
	Get-Acl $_.FullName | % {
		$_.GetAccessRules($true, $true, [System.Security.Principal.NTAccount]) |
		? {$_.IsInherited}
	}
}

You’ll want to change ‘Z:\’ after ‘Get-ChildItem’ to the path of whatever you want to scan. I’m using mapped drives to keep things simple.

Original code source: http://www.vistax64.com/powershell/269430-finding-folders-where-acl-inheritance-off.html#post1227572
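For what it’s worth, if you do want PowerShell to fix the flagged folders as well rather than clicking through the GUI, something along these lines should work. This is only a sketch, the path is a placeholder, and you’d want to test it on a single folder first:

# Disable inheritance on a folder, keeping its current inherited ACEs as explicit entries
$folder = 'Z:\Test User 2'    # placeholder path
$acl = Get-Acl $folder
# First $true = protect the ACL (disable inheritance), second $true = copy inherited ACEs as explicit ones
$acl.SetAccessRuleProtection($true, $true)
Set-Acl -Path $folder -AclObject $acl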

Access denied when changing NTFS permissions on a NetApp CIFS share from Windows 2008

I suspect this problem isn’t limited to just managing CIFS shares on NetApps. I bet if you’ve got a Windows file server and you’re trying to edit NTFS permissions on its shares via a Windows 2008 Computer Management MMC, you’ll get this error message.

In our case we’ve got a NetApp FAS2040 joined to our new AD forest with a few CIFS shares on it. In this new forest we’re using Windows 2008 R2 domain controllers and our forest is set to the 2008 functional level.

When we want to manage the NTFS permissions of CIFS shares exported from our old NetApp FAS2020 in our old forest (which is at the 2003 functional level with 2003 domain controllers) we’d typically log in to a Windows 2003 server, load up the Computer Management MMC and connect to our NetApp.

Today I created a new CIFS share on our FAS2040, logged into a Windows 2008 R2 server, fired up the Computer Management MMC, connected to the NetApp and tried to change the NTFS permissions on the share. What I got was an “Access Denied” error.

This shouldn’t be happening. The account I’m using is a Domain Administrator and the Domain Admins group has been added to the NetApp’s local Administrators group.

If you click ‘Cancel’ all the way out and then go back and view the NTFS permissions, it turns out the changes did take effect despite the “Access Denied” error message.

For some odd reason I thought to try using a Windows 2003 Server from our old forest to manage the NTFS permissions. It worked perfectly with no access denied error. What gives?

Turns out this does: http://support.microsoft.com/kb/972299

Microsoft doesn’t explicitly state this, but to “solve” the problem just create an empty folder, a blank text file or anything else in the share first and then edit the permissions… or you can fire up a Windows 2003 Server and just use its MMC.
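If you’d rather script that workaround, dropping a placeholder file into the share from PowerShell is a one-liner. The UNC path here is made up; adjust it for your filer and share:

# Create an empty placeholder file in the new share before editing permissions
New-Item -Path '\\netapp\newshare\placeholder.txt' -ItemType File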

How to securely erase your data on a NetApp

When drives in a NetApp are being obsoleted and replaced, we need to make sure we securely erase all the data that used to be on them. Unless, of course, you’re just going to crush your disks.

In this example we’ve got an aggregate of 14 disks (aggr0) that need to be wiped and removed from our NetApp so they can be replaced with new, much larger disks.

There are two methods you can use to wipe disks on your NetApp. The first is to simply delete the aggregate they are a member of, turning them into spares, and then run “disk zero spares” from the command line on your NetApp. This only does a single pass and only zeroes the disks. I’ve seen arguments that this is enough; I honestly don’t know, and we have a requirement to do a 7-pass wipe in our enterprise. You could run the zero command 7 times but I don’t imagine that would be as effective as option number two. The second option is to run the ‘disk sanitize’ command, which allows you to specify which disks you want to erase and how many passes to perform. This is what we’re going to use.

The first thing you’ll need to do is get a license for your NetApp to enable the ‘disk sanitize’ command. It’s a free license (so I’ve been told) and you can contact your sales rep to get one. We got ours for free and I’ve seen forum posts from other NetApp owners saying the same thing.

There is a downside to installing the disk sanitization license: once it’s installed on a NetApp it cannot be removed. It also restricts the use of three commands:

  • dd (to copy blocks of data)
  • dumpblock (to print dumps of disk blocks)
  • setflag wafl_metadata_visible (to allow access to internal WAFL files)

There are also a few limitations regarding disk sanitization you should know about:

  • It is not supported in takeover mode for systems in an HA configuration. (If a storage system is disabled, it remains disabled during the disk sanitization process.)
  • It cannot be carried out on disks that were failed due to readability or writability problems.
  • It does not perform its formatting phase on ATA drives.
  • If you are using the random pattern, it cannot be performed on more than 100 disks at one time.
  • It is not supported on array LUNs.
  • It is not supported on SSDs.
  • If you sanitize both SES disks in the same ESH shelf at the same time, you see errors on the console about access to that shelf, and shelf warnings are not reported for the duration of the sanitization. However, data access to that shelf is not interrupted.

I’ve also read that you shouldn’t sanitize more than 6 disks at once. I’m going to sanitize our disks in batches of 5, 5 and 4 (14 total). I’ve also read you do not want to sanitize disks across shelves at the same time.

 

Licensing disk sanitization

Once you’ve got your license you’ll need to install it. Log in to your NetApp via SSH and run the following:

netapp> license add <DISK SANITIZATION LICENSE>

You will not be able to remove this license, are you sure you
wish to continue? [no] yes
A disk_sanitization site license has been installed.
        Disk Sanitization enabled.

Thu Apr 19 10:00:28 PDT [rc:notice]: disk_sanitization licensed

 

Sanitizing your disks

1. Identify what disks you want to sanitize

netapp> sysconfig -r

Aggregate aggr0 (online, raid_dp) (block checksums)
  Plex /aggr0/plex0 (online, normal, active)
    RAID group /aggr0/plex0/rg0 (normal)

      RAID Disk Device  HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)
      --------- ------  ------------- ---- ---- ---- ----- --------------    --------------
      dparity   0a.16   0a    1   0   FC:A   -  ATA   7200 211377/432901760  211921/434014304
      parity    0a.17   0a    1   1   FC:A   -  ATA   7200 211377/432901760  211921/434014304
      data      0a.18   0a    1   2   FC:A   -  ATA   7200 211377/432901760  211921/434014304
      data      0a.19   0a    1   3   FC:A   -  ATA   7200 211377/432901760  211921/434014304
      data      0a.20   0a    1   4   FC:A   -  ATA   7200 211377/432901760  211921/434014304
      data      0a.21   0a    1   5   FC:A   -  ATA   7200 211377/432901760  211921/434014304
      data      0a.22   0a    1   6   FC:A   -  ATA   7200 211377/432901760  211921/434014304
      data      0a.23   0a    1   7   FC:A   -  ATA   7200 211377/432901760  211921/434014304
      data      0a.24   0a    1   8   FC:A   -  ATA   7200 211377/432901760  211921/434014304
      data      0a.25   0a    1   9   FC:A   -  ATA   7200 211377/432901760  211921/434014304
      data      0a.26   0a    1   10  FC:A   -  ATA   7200 211377/432901760  211921/434014304
      data      0a.29   0a    1   13  FC:A   -  ATA   7200 211377/432901760  211921/434014304
      data      0a.28   0a    1   12  FC:A   -  ATA   7200 211377/432901760  211921/434014304

Spare disks

RAID Disk       Device  HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)
---------       ------  ------------- ---- ---- ---- ----- --------------    --------------
Spare disks for block or zoned checksum traditional volumes or aggregates
spare           0a.27   0a    1   11  FC:A   -  ATA   7200 211377/432901760  211921/434014304 (not zeroed)

Here I’ve got 13 disks in aggr0 and the 14th acting as a spare. I need to delete aggr0 to free up the disks to be sanitized.

 

2. Delete the aggregate the disks are part of

netapp> aggr offline aggr0
Aggregate 'aggr0' is now offline.

netapp> aggr destroy aggr0
Are you sure you want to destroy this aggregate? yes
Aggregate 'aggr0' destroyed.

 

3. Verify all the disks you want to sanitize are now spares

netapp> sysconfig -r

Spare disks

RAID Disk       Device  HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)
---------       ------  ------------- ---- ---- ---- ----- --------------    --------------
Spare disks for block or zoned checksum traditional volumes or aggregates
spare           0a.16   0a    1   0   FC:A   -  ATA   7200 211377/432901760  211921/434014304 (not zeroed)
spare           0a.17   0a    1   1   FC:A   -  ATA   7200 211377/432901760  211921/434014304 (not zeroed)
spare           0a.18   0a    1   2   FC:A   -  ATA   7200 211377/432901760  211921/434014304 (not zeroed)
spare           0a.19   0a    1   3   FC:A   -  ATA   7200 211377/432901760  211921/434014304 (not zeroed)
spare           0a.20   0a    1   4   FC:A   -  ATA   7200 211377/432901760  211921/434014304 (not zeroed)
spare           0a.21   0a    1   5   FC:A   -  ATA   7200 211377/432901760  211921/434014304 (not zeroed)
spare           0a.22   0a    1   6   FC:A   -  ATA   7200 211377/432901760  211921/434014304 (not zeroed)
spare           0a.23   0a    1   7   FC:A   -  ATA   7200 211377/432901760  211921/434014304 (not zeroed)
spare           0a.24   0a    1   8   FC:A   -  ATA   7200 211377/432901760  211921/434014304 (not zeroed)
spare           0a.25   0a    1   9   FC:A   -  ATA   7200 211377/432901760  211921/434014304 (not zeroed)
spare           0a.26   0a    1   10  FC:A   -  ATA   7200 211377/432901760  211921/434014304 (not zeroed)
spare           0a.27   0a    1   11  FC:A   -  ATA   7200 211377/432901760  211921/434014304 (not zeroed)
spare           0a.28   0a    1   12  FC:A   -  ATA   7200 211377/432901760  211921/434014304 (not zeroed)
spare           0a.29   0a    1   13  FC:A   -  ATA   7200 211377/432901760  211921/434014304 (not zeroed)

 

4. Sanitize the first batch of disks (7 passes)

netapp> disk sanitize start -c 7 0a.16 0a.17 0a.18 0a.19 0a.20

WARNING:  The sanitization process may include a disk format.
If the system is power cycled or rebooted during a disk format
the disk may become unreadable. The process will attempt to
restart the format after 10 minutes.

The time required for the sanitization process may be quite long
depending on the size of the disk and the number of patterns and
cycles specified.
Do you want to continue (y/n)? y

The disk sanitization process has been initiated.  You will be notified via the system log when it is complete.
Thu Apr 19 11:10:41 PDT [disk.failmsg:error]: Disk 0a.20 (XXXXXXXX): message received.
Thu Apr 19 11:10:41 PDT [disk.failmsg:error]: Disk 0a.19 (XXXXXXXX): message received.
Thu Apr 19 11:10:41 PDT [disk.failmsg:error]: Disk 0a.18 (XXXXXXXX): message received.
Thu Apr 19 11:10:41 PDT [disk.failmsg:error]: Disk 0a.17 (XXXXXXXX): message received.
Thu Apr 19 11:10:41 PDT [disk.failmsg:error]: Disk 0a.16 (XXXXXXXX): message received.
Thu Apr 19 11:10:41 PDT [raid.disk.unload.done:info]: Unload of Disk 0a.20 Shelf 1 Bay 4 [NETAPP   X262_SGLXY250SSX AQNZ] S/N [XXXXXXXX] has completed successfully
Thu Apr 19 11:10:41 PDT [raid.disk.unload.done:info]: Unload of Disk 0a.19 Shelf 1 Bay 3 [NETAPP   X262_SGLXY250SSX AQNZ] S/N [XXXXXXXX] has completed successfully
Thu Apr 19 11:10:41 PDT [raid.disk.unload.done:info]: Unload of Disk 0a.18 Shelf 1 Bay 2 [NETAPP   X262_SGLXY250SSX AQNZ] S/N [XXXXXXXX] has completed successfully
Thu Apr 19 11:10:41 PDT [raid.disk.unload.done:info]: Unload of Disk 0a.17 Shelf 1 Bay 1 [NETAPP   X262_SGLXY250SSX AQNZ] S/N [XXXXXXXX] has completed successfully
Thu Apr 19 11:10:41 PDT [raid.disk.unload.done:info]: Unload of Disk 0a.16 Shelf 1 Bay 0 [NETAPP   X262_SGLXY250SSX AQNZ] S/N [XXXXXXXX] has completed successfully

 

You can periodically check the status of the sanitization by running:

netapp> disk sanitize status
sanitization for 0a.16 is 2 % complete
sanitization for 0a.18 is 2 % complete
sanitization for 0a.19 is 2 % complete
sanitization for 0a.17 is 2 % complete
sanitization for 0a.20 is 2 % complete

 

When the disks have been sanitized, if you want to re-use them instead of replacing them, run this command:

netapp> disk sanitize release disk_list

Example
netapp> disk sanitize release 0a.16 0a.17 0a.18 0a.19 0a.20

This will add the sanitized disks to the spare pool.
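If the released disks show up as spares that are not zeroed, you can zero them with the single-pass command mentioned earlier before re-using them:

netapp> disk zero spares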

 

There are a few options you can customize with the ‘disk sanitize’ command:

disk sanitize start [-p pattern1|-r [-p pattern2|-r [-p pattern3|-r]]] [-c cycle_count] disk_list

-p pattern1 -p pattern2 -p pattern3 specifies a cycle of one to three user-defined hex byte overwrite patterns that can be applied in succession to the disks being sanitized. The default pattern is three passes, using 0x55 for the first pass, 0xaa for the second pass, and 0x3c for the third pass.

-r replaces a patterned overwrite with a random overwrite for any or all of the passes.

-c cycle_count specifies the number of times the specified overwrite patterns will be applied. The default value is one cycle. The maximum value is seven cycles.

disk_list specifies a space-separated list of the IDs of the spare disks to be sanitized.
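As an example, based on the syntax above, sanitizing two disks using a random overwrite for each of the three passes and seven cycles would look something like this (illustrative only; the disk IDs are from the earlier output):

netapp> disk sanitize start -r -r -r -c 7 0a.21 0a.22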

 

References (NetApp login required)