You’re looking for a way to semi-automate the back-loading of scrubbed data from your production environment to your development, test, staging, or other non-production environments. There are several reasons you’d want to do this that I don’t need to get into here. The problem is that your production environment is well-used and full of data, both personal and proprietary. That data must therefore be scrubbed, both to shrink its size and to remove anything that shouldn’t be in a non-production environment.
You need a site collection scrubber that will rip out and dump the data you don’t need or don’t want to keep. Fortunately, this step is relatively simple; there’s just a ton of looping. You need to loop through each item in each list or library in each web in the site collection and match either the file or the list item against a set of heuristics that identify what should stay and what should go. These heuristics could even let you alter data that you want to keep but that contains personally identifiable information.
To the code:
As you can see, although the code is relatively long, it’s not very difficult. Your heuristics can be as simple as a regular expression match against a file’s extension (as above) or as complex as you’d like to make them. The only downside of looping through a large site collection is that the scrubbing can take a significant amount of time to complete, especially if the recycle bin is enabled. I recommend running the scrubber with the recycle bin disabled, since the scrubber empties the recycle bin anyway. It does so because the goal is to shrink the amount of data you’re moving into, or keeping in, what is likely a size-limited environment.
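To make the loop-and-match idea concrete, here is a minimal sketch in Python. It is not the SharePoint object model and it is not the scrubber described in this post: the `Web`, `SPList`, and `Item` classes are hypothetical in-memory stand-ins, and the extension list and the email pattern are assumed heuristics chosen purely for illustration. It shows the two moves mentioned earlier, deleting items that match a heuristic and redacting personal data in the items you keep.

```python
import re
from dataclasses import dataclass, field

# Hypothetical in-memory stand-ins for a site collection. A real scrubber
# would walk the SharePoint object model instead of these classes.
@dataclass
class Item:
    name: str
    fields: dict = field(default_factory=dict)

@dataclass
class SPList:
    title: str
    items: list = field(default_factory=list)

@dataclass
class Web:
    url: str
    lists: list = field(default_factory=list)
    webs: list = field(default_factory=list)  # child webs

# Assumed heuristics: drop bulky file types, mask anything that looks like
# an email address. Replace these with whatever fits your own data.
DELETE_BY_EXTENSION = re.compile(r"\.(zip|iso|bak|mp4|pst)$", re.IGNORECASE)
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def should_delete(item: Item) -> bool:
    """Return True when the item's file name matches a 'must go' extension."""
    return bool(DELETE_BY_EXTENSION.search(item.name))

def redact(item: Item) -> None:
    """Mask email-like strings in the text fields of an item we are keeping."""
    for key, value in item.fields.items():
        if isinstance(value, str):
            item.fields[key] = EMAIL.sub("redacted@example.com", value)

def scrub_web(web: Web) -> int:
    """Scrub every list in this web and its child webs; return items removed."""
    deleted = 0
    for lst in web.lists:
        kept = []
        for item in lst.items:
            if should_delete(item):
                deleted += 1
            else:
                redact(item)
                kept.append(item)
        lst.items = kept
    for child in web.webs:
        deleted += scrub_web(child)
    return deleted

# Usage: a tiny two-web "site collection".
root = Web(
    "/",
    lists=[SPList("Documents", [
        Item("backup.BAK"),
        Item("notes.docx", {"Author": "Contact jane.doe@contoso.com today"}),
    ])],
    webs=[Web("/team", lists=[SPList("Media", [Item("demo.mp4")])])],
)
removed = scrub_web(root)
```

The sketch builds a new list of kept items rather than deleting while iterating, which sidesteps the usual problem of modifying a collection mid-loop. It does not model the recycle bin at all; that step depends entirely on your platform.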
Keep in mind that the scrubber is not perfect and cannot completely replace a human looking through the data. Using the scrubber as a blunt instrument to whack a good deal of data from the site collection, and then cleaning up the last few remaining bits by hand, is a great plan and can save you a considerable amount of time. The end result is a site collection that contains your site structure and branding but none of the files you don’t need. This makes for a solid testing environment, and the process can be repeated much more often, making you, your environments, and your change management board happy.