Summit 7 Team Blogs

Beware of Duplicates in SharePoint Online's Content Search Web Part

Recently, I faced a really thorny issue with the Content Search Web Part in SharePoint Online. The customer is configured with a one-way, outbound hybrid topology with SharePoint Online/Office 365. We had Search Federation set up and working great on-premises. They had set up a Content Search Web Part on their SharePoint Online home page to display all the sites the user belongs to, and they wanted that same Content Search Web Part in their on-premises farm. What they found, however, was a puzzler: the exact same query, configured the exact same way, returned more results in SharePoint Online than it did on-premises.

We verified that the user had the proper permissions, the sites were being indexed, etc. A site that we found missing in the on-premises results could actually be returned if we narrowed the search just to its URL. We couldn’t explain the behavior, and understandably, it was quite a concern for the customer. However, I was finally able to crack the nut and figured others might be struggling with the same issue. The problem: duplicates and a probable bug in SharePoint Online.

Before moving into the details of the solution, let me give you some more background information.

Reproducing the Issue

As a test, I created an identical page in both SharePoint Online (SPO) and on-premises (OP). On it, I set up a variety of search web parts, both Content Search Web Parts (CSWP) and core Search Results web parts. The Search Results web parts were helpful because they give a results count at the bottom. They also served to validate the results of the CSWPs. When I ran the customer’s query OP against the SPO result source (in the Query Builder, since it gives a count), it returned 27 sites. The exact same query in SPO returned 254 sites. On SPO, the exact same queries in the CSWP and Search Results web parts were giving different results, on the same page!

Eh? As another test, I used the query “FileExtension:aspx”, since I wanted a query that would give me a lot of results. The OP query builder returned 23,405 results, but SPO returned 180,702 (this number fluctuated quite a lot, BTW). That’s no small difference. In a Search Results web part, the same query returned very similar counts in the two environments (23,405 vs. 23,332), and the team sites query returned near-identical counts as well (85 vs. 86). Ultimately, the Search Results web parts returned the same results: after paging through them, the record counts ended up identical (both 124). I believe the initial difference is due to security trimming becoming more accurate as further results are retrieved.


Duplicate Results in the Content Search Web Part

It turns out that the difference between the results/counts is due to the CSWP in SPO not removing duplicate results. The Search Results web parts were able to give matching results because their query builder offers a Remove Duplicates option on the Settings tab (on by default), and the option actually works in SPO. Unfortunately, the CSWP does not offer this option on its Settings tab. The setting is, however, available in the web part definition. If you export the web part, you’ll find near the bottom of the file (it’s just XML) a property named DataProviderJSON. It contains the configuration set in the query builder (including settings not visible in the UI). One of the elements in the string is “TrimDuplicates,” and it’s the equivalent of the Remove Duplicates option in the Search Results web part (read more about TrimDuplicates here). It is enabled by default (“TrimDuplicates”:true).
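If you want to check the flag without reading the export by eye, a small script can pull it out. The following is a minimal sketch: the sample XML is a trimmed-down, hypothetical export (a real file carries many more properties), and the function name and the query in the sample are my own inventions for illustration.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed web part export used only for illustration.
SAMPLE_WEBPART = """<webParts>
  <webPart xmlns="http://schemas.microsoft.com/WebPart/v3">
    <data>
      <properties>
        <property name="DataProviderJSON" type="string">{"QueryTemplate":"contentclass:STS_Site","TrimDuplicates":true}</property>
      </properties>
    </data>
  </webPart>
</webParts>"""


def read_data_provider_settings(webpart_xml: str) -> dict:
    """Return the parsed DataProviderJSON property from an exported web part."""
    root = ET.fromstring(webpart_xml)
    for element in root.iter():
        # Drop the XML namespace prefix so only the local tag name is compared.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "property" and element.get("name") == "DataProviderJSON":
            return json.loads(element.text or "{}")
    raise ValueError("No DataProviderJSON property found")


settings = read_data_provider_settings(SAMPLE_WEBPART)
print("TrimDuplicates:", settings.get("TrimDuplicates"))
```

To use it on a real export, read the file’s contents into a string and pass that in place of SAMPLE_WEBPART.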

So why are we still getting duplicates? Well, I believe we’re seeing a bug in SharePoint Online. The fact of the matter is that the CSWP trims out duplicates OP but doesn’t in SPO, even though TrimDuplicates is enabled. It appears that SPO is simply disregarding the TrimDuplicates flag in a CSWP query (but not in a Search Results query). Looks like a bug, smells like a bug – it’s probably a bug.
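One way to compare counts with and without duplicate trimming, independently of either web part, is the Search REST API, which accepts a trimduplicates parameter. The sketch below only builds the two URLs to request. The site URL is a placeholder and the helper name is hypothetical; authentication and the actual HTTP call are left out because they depend on your environment.

```python
from urllib.parse import quote


def build_search_url(site_url: str, query_text: str, trim_duplicates: bool) -> str:
    """Build a Search REST URL with duplicate trimming switched on or off."""
    # Double any single quotes (OData string escaping), then URL-encode the query.
    encoded = quote(query_text.replace("'", "''"), safe="")
    flag = "true" if trim_duplicates else "false"
    base = site_url.rstrip("/")
    return f"{base}/_api/search/query?querytext='{encoded}'&trimduplicates={flag}"


# Placeholder tenant URL; substitute your own site.
SITE = "https://contoso.sharepoint.com"
for trim in (True, False):
    print(build_search_url(SITE, "FileExtension:aspx", trim))
```

Requesting both URLs as the same user and comparing the total row counts in the responses should show whether duplicates account for the gap.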

So How Do I Fix It?

You can’t exactly fix it, but you can implement a workaround that does the next best thing.

1. In the CSWP query builder, go to the Refiners tab

2. Click the little “Show more” link at the bottom

3. In the “Group By” dropdown, select “—Show all properties—” and then select DocumentSignature

You should now see the correct number of results in the Search Result Preview pane.


The Reason for the Madness

While coming up with a workaround, I found this blog posted on TechNet. In it, the author explains that this whole duplicate thing is actually expected behavior based on how SharePoint stores content text (the Document stream). SharePoint breaks the file into same-size chunks, stores them with a hash, and produces a series of cumulative hashes. These hashes are stored in the search database for each document in the index. When a query is run, if Remove Duplicates is on, that table is searched to find any near duplicates (not exact duplicates) and filter them out. The document’s overall hash is the DocumentSignature managed property.
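To make the chunk-and-hash idea concrete, here is a toy sketch. It is not SharePoint’s actual algorithm: the chunk size, the use of SHA-256, and the function name are all assumptions chosen purely for illustration. It shows how a series of cumulative hashes lets two documents with the same opening content share their early signatures while still differing overall.

```python
import hashlib
from typing import List


def cumulative_signatures(text: str, chunk_size: int = 64) -> List[str]:
    """Hash a document in fixed-size chunks, recording a running hash after each chunk."""
    data = text.encode("utf-8")
    running = hashlib.sha256()
    signatures = []
    for start in range(0, len(data), chunk_size):
        running.update(data[start:start + chunk_size])
        # Copy before digesting so the running hash can keep accumulating.
        signatures.append(running.copy().hexdigest()[:16])
    return signatures


doc_a = "A" * 64 + "tail one"
doc_b = "A" * 64 + "tail two"
sigs_a = cumulative_signatures(doc_a)
sigs_b = cumulative_signatures(doc_b)
print("first chunk matches:", sigs_a[0] == sigs_b[0])
print("overall hash matches:", sigs_a[-1] == sigs_b[-1])
```

In this toy version, the last entry plays the role of the document’s overall hash, and the earlier entries are what a near-duplicate check could compare.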

So, normally, Remove Duplicates is turned on and only the unique items are returned. If you turn off the Remove Duplicates option in a regular Search Results web part, you can see that the number of results matches the buggy CSWP’s count. It looks like grouping by the DocumentSignature property provides what is effectively the same functionality, since it rolls up all instances of a DocumentSignature into a single result.
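The roll-up effect of grouping is easy to picture with a few lines of code. This is only a simulation of the idea, assuming each result is a dictionary carrying a DocumentSignature value; the sample rows, signature values, and function name are made up.

```python
def group_by_signature(results):
    """Keep the first result seen for each DocumentSignature value."""
    grouped = {}
    for item in results:
        # setdefault stores the item only if this signature has not been seen yet.
        grouped.setdefault(item["DocumentSignature"], item)
    return list(grouped.values())


# Made-up rows: two of the three share a signature.
sample_results = [
    {"Title": "Team Site A", "DocumentSignature": "sig-1"},
    {"Title": "Team Site A (copy)", "DocumentSignature": "sig-1"},
    {"Title": "Team Site B", "DocumentSignature": "sig-2"},
]

unique = group_by_signature(sample_results)
print(len(sample_results), "raw results,", len(unique), "after grouping")
```

Each distinct signature survives exactly once, which is why the grouped CSWP count lines up with a Search Results web part that has Remove Duplicates enabled.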


Final Thoughts

Search-driven solutions are becoming more and more important and prevalent in SharePoint. In short, search is a really big deal in SharePoint 2013. The crown jewel of this capability is the Content Search Web Part, since it gives us an easy way to present data from anywhere in whatever user experience we want. As such, it’s unfortunate to see this flaw in the gem. Although it is a pain, takes more time, is easily missed, and simply shouldn’t be needed, the workaround can cover it up. Hopefully, though, Microsoft will be able to repair the web part soon.

Thanks for reading!