Situation
You’re looking for a way to semi-automate the back-loading of scrubbed data from your production environment to your development/test/staging/whatever environments. There are several reasons you’d want to do this that I don’t need to get into here. The problem is that your production environment is well-used and full of a bunch of data, both personal and proprietary. This data must therefore be scrubbed to shrink its size and to remove anything that shouldn’t be in a non-production environment.
Solution
You need a site collection scrubber that will rip out and dump the data you don’t need or don’t want to keep. Fortunately, this step is relatively simple; there’s just a ton of looping. You need to loop through each item in each list/library in each web in the site collection and match either the file or the list item against a set of heuristics that helps identify what should stay and what should go. These heuristics could even change data that you want to keep but that contains personally identifiable information.
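For example, a field-level heuristic for anonymizing items you keep might look something like the sketch below. The ScrubEmailColumn helper and the "Email" column name are hypothetical and only illustrate the shape of the idea:

// Sketch of a field-level heuristic: overwrite a hypothetical "Email" column
// so items can be kept without carrying real personal data.
private static void ScrubEmailColumn(SPList list)
{
    if (!list.Fields.ContainsField("Email"))
    {
        return; // This list doesn't carry the column we're anonymizing
    }

    foreach (SPListItem item in list.Items)
    {
        if (item["Email"] != null)
        {
            item["Email"] = "user@example.com"; // Dummy value replaces the real one
            item.Update();
        }
    }
}

The same pattern works for any column: test for the field, swap in a harmless value, and call Update.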
To the code:
using System;
using System.Text.RegularExpressions;
using Microsoft.SharePoint;

public static class Program
{
    private static Regex exp;           // Matches target elements
    private static string siteName;     // URL of the site collection
    private static bool remove = false; // Whether to remove elements or not

    public static void Main(string[] args)
    {
        // One way to supply the settings: site URL, match pattern, optional remove flag
        siteName = args[0];
        exp = new Regex(args[1]);
        remove = args.Length > 2 && bool.Parse(args[2]);

        using (SPSite site = new SPSite(siteName))
        {
            foreach (SPList list in site.RootWeb.Lists)
            {
                TryProcessList(list);
            }

            try
            {
                foreach (SPWeb web in site.AllWebs)
                {
                    foreach (SPWeb subWeb in web.Webs)
                    {
                        ProcessSubWeb(subWeb);
                    }

                    foreach (SPList list in web.Lists)
                    {
                        TryProcessList(list);
                    }

                    try
                    {
                        site.RecycleBin.DeleteAll();
                    }
                    catch (Exception e)
                    {
                        Console.WriteLine("Error emptying recycle bin: {0}", e.Message);
                    }
                }
            }
            catch (Exception e)
            {
                Console.WriteLine("Error processing subsite: {0}", e.Message);
            }

            try
            {
                site.RecycleBin.DeleteAll();
            }
            catch (Exception e)
            {
                Console.WriteLine("Error emptying recycle bin: {0}", e.Message);
            }
        }
        Console.WriteLine("Finished processing site: {0}", siteName);
        Console.Read();
    }

    private static void TryProcessList(SPList list)
    {
        try
        {
            SPDocumentLibrary library = (SPDocumentLibrary)list;

            try
            {
                foreach (SPCheckedOutFile file in library.CheckedOutFiles)
                {
                    TryTakeOverFile(file);
                }
            }
            catch (SPException e)
            {
                Console.WriteLine("Could not take over file: {0}", e.Message);
            }

            foreach (SPListItem item in list.Items)
            {
                if (item.File.CheckOutType != SPCheckOutType.None)
                {
                    TryUndoCheckOut(item);
                }
                if (remove && exp.IsMatch(item.File.Name))
                {
                    TryDeleteFile(item);
                }
            }
        }
        catch (InvalidCastException)
        {
            // Not a document library; nothing to scrub here
        }
    }

    private static void TryDeleteFile(SPListItem item)
    {
        try
        {
            item.File.Delete();
        }
        catch (SPException e)
        {
            Console.WriteLine("Could not delete file {0}: {1}", item.File.Name, e.Message);
        }
    }

    private static void ProcessSubWeb(SPWeb subWeb)
    {
        foreach (SPList list in subWeb.Lists)
        {
            TryProcessList(list);
        }
    }

    private static void TryTakeOverFile(SPCheckedOutFile file)
    {
        try
        {
            file.TakeOverCheckOut();
        }
        catch (SPException)
        {
            Console.WriteLine("Could not take over file: {0}", file.LeafName);
        }
    }

    private static void TryUndoCheckOut(SPListItem item)
    {
        try
        {
            item.File.UndoCheckOut();
            item.File.Update();
        }
        catch (SPException)
        {
            item.File.CheckIn(string.Empty);
            Console.WriteLine("Could not undo checkout for file: {0}", item.Title);
        }
    }
}
As you can see, although the code is relatively long, it’s not very difficult. Your heuristics can be as simple as a regular expression match against a file’s extension (as above) or as complex as you’d like to make them. The only downside of looping through a large site collection is that the scrubbing can take a significant amount of time to complete, especially if the recycle bin is in place. I recommend running the scrubber without the recycle bin in place, since the scrubber empties the recycle bin anyway. It does this because the goal is to shrink the amount of data you’re moving into (or keeping in) what is likely an environment with limited storage.
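If an extension match isn’t rich enough, the heuristic can be any predicate over the file. Here’s a minimal sketch of a composite check that also considers file size and age; the thresholds are made-up examples, and you would call it in place of the exp.IsMatch test inside TryProcessList:

// Sketch of a richer heuristic: flag a file for removal if its name matches the
// pattern, if it is larger than a size threshold, or if it hasn't changed in a year.
// The 10 MB and one-year thresholds are arbitrary examples, not recommendations.
private static bool ShouldRemove(SPFile file)
{
    const long maxBytes = 10L * 1024 * 1024;
    return exp.IsMatch(file.Name)
        || file.Length > maxBytes
        || file.TimeLastModified < DateTime.UtcNow.AddYears(-1);
}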
Keep in mind that the scrubber is not perfect and cannot completely replace a human looking through the data. Using the scrubber as a blunt instrument to whack a good deal of data out of the site collection, and then cleaning up the last few bits by hand, is a great plan and can save you a considerable amount of time. The end result is a site collection that contains your site structure and branding but none of the files you don’t need. This creates a great testing environment, and the refresh can be repeated much more often, making you, your environments, and your change management board happy.