Summit 7 Team Blogs

SharePoint Search 2010 25M Item Crawl DB Limit Solution (Also for FAST Search 2010 for SharePoint)

The crawl database used by SharePoint 2010 Search crawlers has a limit of 25 million items. It is a performance limit rather than a hard limit, but it still creates issues for organizations that place large amounts of content under the same hostname.
Why? Because SharePoint distributes content crawl lists to crawl databases according to the hostname in the URL and does not provide a method of spreading content from site collections under the same hostname across multiple crawl databases.
If one could use multiple crawl databases, additional crawler components on different servers could be associated with different crawl databases, thereby spreading the workload of crawling the same URL across a larger set of application servers in the search farm.
Since FAST Search 2010 for SharePoint uses the same SharePoint crawling architecture, it suffers the same limitation.
The solution is actually quite simple: Lie!
In my test environment, I extended my (supposedly) large web application to a different hostname using Alternate Access Mapping (AAM). Once I had established that I could access the same content via both hostnames, each using NTLM authentication, the rest was easy but somewhat tedious.
FIRST, I created a second crawl database. This is important because, once placed in a crawl database, a host cannot easily be moved to another. I also created the crawl components associated with the new crawl database.
Next, I created a set of crawl rules excluding certain site collections under Hostname A. I also needed a crawl rule adding the site collections under Hostname B followed by a crawl rule excluding all of Hostname B. This set of rules prevented content from being crawled twice.
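To make the rule ordering concrete, here is a minimal Python sketch of the idea. It assumes rules are evaluated top to bottom and the first match wins, which is why the include rules for Hostname B must sit above the blanket exclusion. The hostnames, site collection paths, and function names are made up for illustration, and the wildcard matching is only an approximation of how SharePoint evaluates crawl rule paths.

```python
from fnmatch import fnmatchcase

# Hypothetical hostnames and site collections, for illustration only.
HOST_A = "http://hostname-a"
HOST_B = "http://hostname-b"
MOVED = ["/sites/hr", "/sites/finance"]  # site collections to crawl via Hostname B


def build_rules(moved):
    """Return an ordered list of (pattern, action) pairs."""
    rules = []
    # 1. Exclude the moved site collections under Hostname A.
    for path in moved:
        rules.append((f"{HOST_A}{path}/*", "exclude"))
    # 2. Include the same site collections under Hostname B...
    for path in moved:
        rules.append((f"{HOST_B}{path}/*", "include"))
    # 3. ...followed by a rule excluding everything else under Hostname B.
    rules.append((f"{HOST_B}/*", "exclude"))
    return rules


def decide(url, rules):
    """Apply the first matching rule; URLs matching no rule are crawled."""
    for pattern, action in rules:
        if fnmatchcase(url.lower(), pattern.lower()):
            return action
    return "include"
```

With these rules, a document in a moved site collection is crawled only under Hostname B, and everything else is crawled only under Hostname A, so nothing is crawled twice.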

[Figure: Crawl Rules]

Then, I created a content source for the new AAM which would be placed automatically in the new crawl database when the crawl was started. I entered the site collections excluded from Hostname A as start addresses but with Hostname B from the extended AAM. These must be entered as “Only crawl the Site Collection of each start address.”

[Figure: Content Source Start URL]

Since entering a long list of individual site collections (as well as creating a long list of exclusion rules) will be somewhat tedious if not painful, some PowerShell scripts will be needed here.
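As a rough illustration of the list-building part of that scripting, the Python sketch below takes the site collection URLs under Hostname A and produces, for each one, the matching exclusion pattern and the start address under Hostname B. It only prepares the lists; feeding them into SharePoint would still need PowerShell or the UI. The function names and example URLs are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def swap_host(url, new_host):
    """Return the same URL with its hostname replaced by new_host."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")  # normalize away a trailing slash
    return urlunsplit((parts.scheme, new_host, path, parts.query, parts.fragment))


def plan(site_collections, host_b):
    """Build (exclusion pattern on Hostname A, start address on Hostname B) pairs."""
    pairs = []
    for url in site_collections:
        exclusion = url.rstrip("/") + "/*"
        start_address = swap_host(url, host_b)
        pairs.append((exclusion, start_address))
    return pairs
```

For example, `plan(["http://hostname-a/sites/hr"], "hostname-b")` pairs the exclusion pattern `http://hostname-a/sites/hr/*` with the start address `http://hostname-b/sites/hr`.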
Finally, to complete my deception to the users, I created a Server Name mapping from Hostname B to Hostname A.

[Figure: Server Name Mapping Example]

For both SharePoint Search and FS4SP, I created this mapping on the same search service application that holds the content source, not on the FAST Query SSA for FS4SP. At least for FS4SP, the mapping needs to be created before the content is crawled in order to modify the URLs in the results. I tested creating it afterwards and had to perform a full crawl to get it to work.
Search results appear as if they were from Hostname A, although the crawl logs show that they were actually crawled as Hostname B by a different crawler than the items crawled in the remainder of Hostname A.
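Conceptually, the server name mapping behaves like a prefix rewrite on the crawled URL. The Python sketch below is only a mental model of that behavior, not how SharePoint implements it; the function name and example hostnames are hypothetical.

```python
def map_url(url, crawled_prefix, displayed_prefix):
    """Rewrite a URL crawled under crawled_prefix so it displays under displayed_prefix."""
    head = url[:len(crawled_prefix)]
    tail = url[len(crawled_prefix):]
    # Match only on a whole-hostname boundary, so "hostname-b2" is left alone.
    if head.lower() == crawled_prefix.lower() and (tail == "" or tail[0] in "/?#"):
        return displayed_prefix + tail
    return url
```

Called as `map_url("http://hostname-b/sites/hr/a.docx", "http://hostname-b", "http://hostname-a")`, it returns `http://hostname-a/sites/hr/a.docx`, while URLs that were already crawled under Hostname A pass through unchanged.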
Obviously, if you are using authentication other than NTLM, this whole process will be somewhat more cumbersome. I have not had a chance to test out the steps there but it should be possible.
One more test: I used FAST Search Site Promotion to promote the site collection using the mapped URL instead of the crawled URL. It worked perfectly, so even Site Collection Administrators do not have to know the URL under which the content is actually crawled.