
SharePoint 2010: Site Collection Scrubber

Situation

You’re looking for a way to semi-automate the back-loading of scrubbed data from your production environment to your development/test/staging/whatever environments. There are several reasons you’d want to do this that I don’t need to get into here. The problem is that your production environment is well used and full of data, both personal and proprietary. That data must therefore be scrubbed to shrink its size and to remove anything that shouldn’t live in a non-production environment.

Solution

You need a site collection scrubber that will rip out and dump the data you don’t need or don’t want to keep. Fortunately, this step is relatively simple; there’s just a ton of looping. You need to loop through each item in each list/library in each web in the site collection and match either the file or the list item against a set of heuristics that help identify what should stay and what should go. These heuristics could even rewrite data that you want to keep but that contains personally identifiable information (a sketch of such a field-level scrub follows the listing below).
To the code:

using System;
using System.Text.RegularExpressions;
using Microsoft.SharePoint;

public static class Program
{
    private static Regex exp;        // Matches files that should be scrubbed
    private static string siteName;  // URL of the site collection
    private static bool remove;      // Whether to actually delete matching files

    // Example usage: Scrubber.exe http://server/sites/target "\.(mp3|avi|iso)$" -remove
    public static void Main(string[] args)
    {
        if (args.Length < 2)
        {
            Console.WriteLine("Usage: Scrubber <siteUrl> <filePattern> [-remove]");
            return;
        }

        siteName = args[0];
        exp = new Regex(args[1], RegexOptions.IgnoreCase);
        remove = args.Length > 2 && args[2].Equals("-remove", StringComparison.OrdinalIgnoreCase);

        using (SPSite site = new SPSite(siteName))
        {
            try
            {
                // AllWebs includes the root web and every subweb, so a single
                // loop covers the whole site collection. Each SPWeb must be
                // disposed explicitly to avoid leaking memory.
                foreach (SPWeb web in site.AllWebs)
                {
                    using (web)
                    {
                        foreach (SPList list in web.Lists)
                        {
                            TryProcessList(list);
                        }
                    }
                }
            }
            catch (Exception e)
            {
                Console.WriteLine("Error processing web: {0}", e.Message);
            }

            try
            {
                // Deleted files land in the recycle bin; empty it so the
                // scrubbed content actually frees up space.
                site.RecycleBin.DeleteAll();
            }
            catch (Exception e)
            {
                Console.WriteLine("Error emptying recycle bin: {0}", e.Message);
            }
        }

        Console.WriteLine("Finished processing site: {0}", siteName);
        Console.Read();
    }

    private static void TryProcessList(SPList list)
    {
        // Only document libraries hold files; skip plain lists.
        SPDocumentLibrary library = list as SPDocumentLibrary;
        if (library == null)
        {
            return;
        }

        try
        {
            // Take over files checked out by other users so they can be processed.
            foreach (SPCheckedOutFile file in library.CheckedOutFiles)
            {
                TryTakeOverFile(file);
            }
        }
        catch (SPException e)
        {
            Console.WriteLine("Could not take over file: {0}", e.Message);
        }

        // Iterate backwards by index so that deleting an item doesn't
        // invalidate the enumeration.
        SPListItemCollection items = list.Items;
        for (int i = items.Count - 1; i >= 0; i--)
        {
            SPListItem item = items[i];

            if (item.File.CheckOutType != SPFile.SPCheckOutType.None)
            {
                TryUndoCheckOut(item);
            }

            if (remove && exp.IsMatch(item.File.Name))
            {
                TryDeleteFile(item);
            }
        }
    }

    private static void TryDeleteFile(SPListItem item)
    {
        try
        {
            item.File.Delete();
        }
        catch (SPException e)
        {
            Console.WriteLine("Could not delete file {0}: {1}", item.File.Name, e.Message);
        }
    }

    private static void TryTakeOverFile(SPCheckedOutFile file)
    {
        try
        {
            file.TakeOverCheckOut();
        }
        catch (SPException)
        {
            Console.WriteLine("Could not take over file: {0}", file.LeafName);
        }
    }

    private static void TryUndoCheckOut(SPListItem item)
    {
        try
        {
            item.File.UndoCheckOut();
        }
        catch (SPException)
        {
            // Fall back to checking the file in if the checkout can't be undone.
            item.File.CheckIn(string.Empty);
            Console.WriteLine("Could not undo checkout for file: {0}", item.Title);
        }
    }
}
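
If your heuristics call for modifying kept data rather than deleting it, as mentioned above, the same item loop can rewrite field values in place. Here’s a minimal sketch of what that might look like; the ScrubItem helper and the field names in it are hypothetical placeholders that would depend on your own list schema:

private static void ScrubItem(SPListItem item)
{
    // Hypothetical example: overwrite personally identifiable fields on
    // items you keep. The field names below are placeholders only.
    try
    {
        if (item.Fields.ContainsField("Author Email"))
        {
            item["Author Email"] = "user@example.com"; // replace the real address
        }
        if (item.Fields.ContainsField("Phone"))
        {
            item["Phone"] = "555-0100"; // replace the real phone number
        }
        item.SystemUpdate(false); // persist without bumping the item version
    }
    catch (SPException e)
    {
        Console.WriteLine("Could not scrub item {0}: {1}", item.Title, e.Message);
    }
}

You’d call ScrubItem from the item loop in TryProcessList, right alongside the checkout and delete checks.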

As you can see, although the code is relatively long, it’s not very difficult. Your heuristics can be as simple as a regular expression match against a file’s extension (as above) or as complex as you’d like to make them. The only downside of looping through a large site collection is that the scrubbing can take a significant amount of time to complete, especially if the recycle bin is enabled. I recommend running the scrubber with the recycle bin disabled, since the scrubber empties it anyway; after all, the goal is to shrink the amount of data you’re moving into what is likely a size-limited environment.
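For reference, here’s what those two pieces might look like: a simple extension-matching regular expression, and turning the recycle bin off before a run. The recycle bin is a web application setting, so disabling it requires farm-level rights. This is a sketch rather than a drop-in utility; the ScrubberSetup class name and the site URL are placeholders:

using System.Text.RegularExpressions;
using Microsoft.SharePoint;
using Microsoft.SharePoint.Administration;

public static class ScrubberSetup
{
    // Example extension-based heuristic: flag common media and backup files.
    public static readonly Regex FilePattern =
        new Regex(@"\.(mp3|avi|iso|bak|pst)$", RegexOptions.IgnoreCase);

    // Disable the recycle bin for the whole web application before scrubbing.
    public static void DisableRecycleBin(string siteUrl)
    {
        using (SPSite site = new SPSite(siteUrl))
        {
            SPWebApplication webApp = site.WebApplication;
            webApp.RecycleBinEnabled = false;
            webApp.Update();
        }
    }
}

Remember to re-enable the recycle bin afterward if the environment normally runs with it on.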
Keep in mind that the scrubber is not perfect and cannot completely replace a human looking through the data. Using the scrubber as a blunt instrument to whack a good deal of data out of the site collection, then cleaning up the last few bits by hand, is a great plan that can save you considerable time. The end result is a site collection that contains your site structure and branding but none of the files you don’t need. This makes for an ideal testing environment, and the process can be repeated much more often, making you, your environments, and your change management board happy.
