Setting up a web crawler in the Google Search Appliance is a piece of cake — you enter a starting URL and some boundaries and let it rip. The GSA will spider its way around until it finds every reachable page in the site. For a well-structured site, this usually produces very good results, but not all sites are created equal. While the GSA does have features for detecting cyclical loops and excessively redundant pages, it will often find significantly more pages than you expect. This can reduce the quality of your search results: the extra pages dilute the relevancy of higher-quality pages, making it harder to find the desired results. In the worst-case scenario, the GSA index reaches its licensed limit, leaving only part of your site indexed and searchable. At that point, the GSA starts evicting pages to make room for presumably better ones, but in practice the eviction algorithm is not perfect and can end up removing essentially random pages. Either way, eviction or truncation is not a happy thing, and you will want to take action to fix the problem.
Houston, we have a problem
First, you need to detect that you have this problem. The GSA's Graph of Crawled and Found URLs on the Crawl Status page is a quick way to check. You should always watch this graph (and other data points) carefully when you add a new site to the GSA. Ideally, after adding a new site, the graph of Crawled and Found URLs will rise and then gradually taper off flat as the site is completely indexed. If you see the number of crawled and found URLs increasing like a hockey stick, you've got a problem. Another quick test is to compare the delta between the number of URLs crawled vs. the number of URLs found. If the gap keeps getting larger, you likely have a runaway situation. If the gap is constant or shrinking, that is a good sign. There will almost always be a gap between the number of URLs crawled vs. found for various reasons — we are only interested in the rate of change of the gap. [edit 7/17 – The gap between crawled and found URLs can indicate problems, such as crawl failures or unsupported file formats. It can also be innocuous. The GSA sometimes finds links, or what it erroneously thinks are links, to pages that do not exist. It adds them to the queue, tries to fetch them, and fails. You will see these as 404 errors in the Crawl Diagnostics, at least for a few weeks or so until the GSA gives up and removes them.]
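If you want something more concrete than eyeballing the graph, you can jot down the crawled and found counts from the Crawl Status page on a regular schedule and compute the change in the gap yourself. Here is a rough sketch; the crawl_stats.csv file and its date,crawled,found layout are my own convention, not something the GSA produces:

# crawl_stats.csv holds one line per check in the form date,crawled,found
# (a hypothetical file you update by hand from the Crawl Status page)
awk -F, '{
    gap = $3 - $2
    if (NR > 1) print $1 ": gap=" gap ", change since last check=" (gap - prev)
    prev = gap
}' crawl_stats.csv

A gap that grows by a larger amount every time you check is the same hockey stick the graph shows, just in numbers.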
Fixing the hockey stick problem is an art, not a science. It involves intuition and detective work. The problems will be buried among hundreds of thousands or millions of pages. I find that you have to squint your eyes, so to speak, and back away to see the problems. Looking at every URL, one by one, is not practical.
Crawl Diagnostics
The first place I go to find problems is the Crawl Diagnostics page. Crawl Diagnostics provides a hierarchical view of the index, showing the number of pages in each site, and then the number of pages in each folder, and so on. Just like folders on a hard drive, you can drill down one level at a time. The Crawl Diagnostics page can be sorted or scanned by count, allowing you to focus only on the sites or folders with the largest number of pages. Why spend time troubleshooting a site or folder with a few hundred pages when you have something at the top of the list with 50,000 more pages than its nearest neighbor? Sites or folders with the highest counts are where you should focus first — they offer potentially the best bang for your buck.
Once you find a site or folder with a suspiciously large number of pages, you need to do some research and then make a judgment call. Are the pages suspicious or legitimate? Use the Crawl Diagnostics page to dig around a bit and open a few of them. Are they all unique and valuable, or do you find a lot of redundant or low-value content?
I tend to use this first technique to broadly eliminate entire sites or folders. For example, I have found sites full of mirrors of publicly available UNIX software. Or folders full of old log files or data files. In those cases, I had to poke around and analyze what I was finding, and then make a decision about whether to keep it in the index or not. If you do not feel qualified to make the final decision, mark it down in a list for review by the appropriate people in your organization. Sometimes, though, it will be hard to find the owner, and you should make the decision yourself in the best interest of the search index — if something seems like it provides little or no value to the typical search user, it is probably best to remove it. You can always add it back later if someone comes asking for it.
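When you do decide to drop an entire site or folder, the exclusion goes into the Do Not Crawl patterns in the admin console, in the same format as the example later in this article. The paths below are hypothetical stand-ins for the kinds of content I described above; substitute your own:

regexpIgnoreCase:.*/unix-mirror/.*
regexpIgnoreCase:.*/logs/.*\.log
regexpIgnoreCase:.*/data-archive/.*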
Export All URLs
Once I have reviewed the most egregious sites and folders, I am usually left with a different kind of problem — death by a thousand cuts. At the site or folder level, there might not be any smoking guns left. Maybe there is a single large folder with a ton of different types of URLs in it — all with various URL parameters, but with no folder structure to organize them. This is quite common with dynamically generated web sites. Or maybe there are hundreds of folders, all with relatively equal amounts of content.
The Crawl Diagnostics view does not provide the best way to find and fix these kinds of problems. You can only view 100 pages at a time, and 10,000 total. If a folder has more than 10,000 pages, you won't even be able to view all of the URLs. And opening hundreds of different subfolders is impractical. I prefer to export the All URLs list for further analysis. If you have already hit the GSA's licensed limit, you can export a list of both indexed and un-indexed URLs, giving you a sneak peek at pages that haven't even been indexed yet.
I start by stripping out all but the first column (the URL) and sorting the list of URLs alphabetically. Here are a couple of UNIX commands that will do this quickly:
perl -pi -w -e 's/^(.*?\s).*$/$1/g' exported_urls
sort exported_urls > exported_urls_sorted
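Once the list is sorted, a quick way to spot heavy branches without scrolling at all is to count URLs by host and leading path segments. This is just a sketch against the exported_urls_sorted file created above; adjust the -f range to look deeper or shallower in the hierarchy:

# Count URLs by host plus the first two path levels, largest buckets first
cut -d/ -f1-5 exported_urls_sorted | sort | uniq -c | sort -rn | head -n 25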
This allows me to quickly scan all the URLs in hierarchical order. Here is where squinting your eyes really helps. As you scroll through the list of URLs page by page, patterns will emerge and problems will jump out. I am basically looking for large swaths of URLs that A) look similar and B) do not look like each one is unique and important. At a glance, if the URL lengths rise and fall randomly like the Rocky Mountains, I usually assume that is just normal behavior. You don't even need to read the individual URLs to see that those pages are probably OK.
On the other hand, if I see a large repetitive section, I pause and take a closer look. For example, I might see something like this:
.../article/12345
.../article/12345?nav=bydate
.../article/12345?nav=bymonth
.../article/12345?nav=byyear
.../article/23456
.../article/23456?nav=bydate
.../article/23456?nav=bymonth
Assuming the site has hundreds of articles, and each article is appearing multiple times, this pattern will be obvious when you scroll through the list of URLs. I can quickly tell that each article is being indexed multiple times with different parameters — in this case possibly referring to how the user navigated to the article. If I am not sure, I can open a couple of the URLs and confirm that they all point to the same content. I can mitigate this problem with a Do Not Crawl pattern that eliminates the URLs with the nav parameter while keeping the primary article:
regexpIgnoreCase:.*/article/[0-9]+\?nav=.*
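Before committing a pattern like that, I like to run it against the exported list to see roughly how many URLs it will knock out, and to spot-check a few of the matches. Using the sorted file from earlier and the hypothetical article URLs above:

# How many exported URLs would the nav-parameter pattern exclude?
grep -Ec '/article/[0-9]+\?nav=' exported_urls_sorted

# Eyeball a few of them before adding the pattern to the GSA
grep -E '/article/[0-9]+\?nav=' exported_urls_sorted | head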
Many other problems can jump out quickly with this technique once you know what to look for.
The Land Bridge problem
You do need to be careful when removing things from the index. You might see a bunch of pages in a series that are part of a table of contents or index. They might differ only by page number or tag or filter or sort order, which makes them look redundant. If you remove all of them, you could prevent the GSA from reaching all of the pages that they provide links to. And even worse, you might not realize the problem for a while. If the GSA has already indexed the child pages, it will not forget them, which is good — once crawled the child pages will remain in the index as long as they are valid. However, if you perform an upgrade or get a replacement GSA and you have to completely re-index the site, the GSA will not be able to find the child pages next time. You will have cut off the land bridge, and there might not be any other way for the GSA to find those pages.
As you remove pages from the index, give some thought to their purpose, which might be to help the GSA find other pages. If you need the pages to be indexed only to provide links to other pages, consider adding a ROBOTS metatag that instructs the GSA to FOLLOW, NOINDEX the page.
<META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">
The GSA will follow any links found on the pages, but the table of contents pages themselves will not count towards your licensed limit.
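It is worth spot-checking that the table of contents pages actually serve the tag before you count on this behavior. Something like the following works from any shell; the URL is made up for illustration:

# Confirm the ROBOTS metatag is present on a paging/TOC page (hypothetical URL)
curl -s 'http://www.example.com/articles/index?page=2' | grep -io '<meta name="robots"[^>]*>'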
Lather, rinse, and repeat
Because of the nature of the GSA crawl queue and the license limit, the GSA might stop crawling new pages until you remove some items from the index. Once you clear the logjam, the GSA will be free to index and discover even more content. One lonely page waiting in the queue might be the sole gateway to hundreds or thousands more URLs. You won't even know these new pages exist until the GSA is able to index that first page. So as you clear up space in the index, you might find it quickly fills up again. Repeat the techniques described above until the number of URLs being indexed has stopped increasing and remains below the license limit.
I hope this article has given you a few tricks for cleaning up your GSA index. There are many ways to approach the problem. If you need help, call the experts. The Google Search Appliance team at Perficient can help you with this type of analysis and cleanup work. Contact us at GooglePractice@perficient.com for more information.
Hey Chad,
Good article. We too ran into the 10,000 export limit. Do you know if there are external scripts of any kind that can export the full amount?
Cheers.
Barry
The only option I know of is to export the All URLs list and parse it into collections / a tree view manually.
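For example, something like this rough awk sketch will roll the exported list (the exported_urls_sorted file from the article) up into per-folder counts, which gets you most of the way to a tree view:

# Count how many exported URLs fall under each folder prefix
awk -F/ '{
    path = $1 "//" $3
    for (i = 4; i < NF; i++) {
        path = path "/" $i
        count[path]++
    }
}
END { for (p in count) print count[p], p }' exported_urls_sorted | sort -rn | head -n 50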