Mining the long tail: extracting insights from search data

The first step in improving the user experience of a product, service or website is understanding its users. The best way to understand what users want and what keeps them busy is still to seek them out and listen to them. When you don’t know who your users are yet or can’t reach out to them, it’s time to get creative and look for other sources of information. I’ve recently been working on redesigning the website of a large car manufacturer and found the data from the current site’s search function to be particularly insightful. It’s hard to find a clearer statement of intent than the words a user typed into a search box. The fact that they are searching for something means that at the very least they are interested in it. Moreover, frequent searches for a particular topic could indicate a problem with the structure of the site or its contents. Of course not all queries will end up in the search box, and these are all just pieces of a bigger puzzle, but there are some good insights buried in this sort of data and I want to share how I get them to the surface.

The unhelpful “top search terms”

On this particular website, site searches were tracked using Google Analytics. GA shows the top search terms, which is useful for seeing what was typed into the search box most frequently, but doesn’t help much in understanding what people are really searching for. A search box allows free-text input, so the possibilities are endless. Users might use many different terms to search for the same thing, and none of those variations would necessarily make it into the top search terms.


From the list of top search terms, it would seem that most people on the site are searching for “finance”, “warranty” and a few car models. It’s tempting to conclude from this that these terms are representative of what people are really trying to find. However, that would be ignoring the fact that there may very well be some much bigger questions hidden in the long tail, for which users simply worded their search in many different ways. Surfacing those questions is key to understanding what users are really looking for, rather than just how they look for it.
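A quick way to gauge how much is hidden in the long tail is to check what share of all searches the top few terms actually account for. The following is a minimal Python sketch of that arithmetic. The function name `top_n_share` and the counts are made up for illustration and are not figures from this project.

```python
def top_n_share(counts, n):
    """Fraction of all searches accounted for by the n most frequent terms."""
    total = sum(counts)
    if total == 0:
        return 0.0
    top = sorted(counts, reverse=True)[:n]
    return sum(top) / total


# Hypothetical data: three popular terms plus a tail of 90 rare queries.
counts = [200, 150, 100] + [5] * 90
share = top_n_share(counts, 3)
print(f"Top 3 terms cover {share:.0%} of searches")
```

If the top terms cover only a modest share, most of what users are asking for sits in the tail and is worth grouping by hand.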

Extracting insights through grouping and clustering

To make sense of data like this, I tend to borrow some tricks from the Grounded Theory method – a systematic approach for developing theories through the analysis of data. Rather than starting with a set of predefined hypotheses, I start with the facts in the data and work my way up to an understanding of what they represent. In this case, the raw facts were search terms and frequencies, and I wanted to get to the underlying user needs they represent by systematically grouping them into concepts and themes of increasing levels of abstraction. This causes patterns to emerge from the data, which help reveal underlying themes and concepts. If you’ve ever used card sorting to find unexpected relationships in a mess of terms, this will sound familiar. For this project, my approach was roughly as follows:

To start, I went through the top 1000 search queries (that’s all you can get out of Google Analytics) and grouped all queries that were nearly identical. A clever search engine understands that “cx5”, “cx-5” and “CX 5” all mean the same thing. Unfortunately, Google Analytics doesn’t know how the site’s search function works, so it doesn’t make any assumptions and lists all these variations separately. If you have a bigger data set, it’s probably a good idea to automate this first step, but for smaller sets it’s a good warm-up exercise to work your way through the whole list once and get an overview of what’s in there. I like to work in OmniOutliner, because it has good keyboard shortcuts, can export to Excel and OmniGraffle and is great at sorting and summarising nested hierarchical data on the fly.
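If you do want to automate this first step, a minimal Python sketch might look like the following. It is not what I used on this project (I did it by hand), and the helper names `normalize_query` and `merge_variants` as well as the counts are hypothetical. It assumes queries are plain ASCII and that stripping case, spaces and punctuation is enough to make near-identical queries collide.

```python
import re
from collections import Counter


def normalize_query(query: str) -> str:
    """Lowercase the query and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]+", "", query.lower())


def merge_variants(rows):
    """Sum search counts over queries that share a normalized key.

    rows: iterable of (query, count) pairs, e.g. read from an exported report.
    Returns a Counter keyed by the normalized query.
    """
    totals = Counter()
    for query, count in rows:
        totals[normalize_query(query)] += count
    return totals


# Hypothetical counts, for illustration only.
rows = [("cx5", 120), ("cx-5", 45), ("CX 5", 30), ("warranty", 200)]
merged = merge_variants(rows)
for key, count in merged.most_common():
    print(key, count)
```

A normalisation this aggressive can also merge queries that should stay separate, and it discards accented characters, so it is worth skimming the merged list before building on it.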

Next, I grouped search queries that were worded differently but meant the same thing into a single concept. For example, “gps”, “sat nav” and “tomtom” all refer to navigation systems. In some cases, grouping similar terms already led me to higher-level questions: “careers”, “jobs” and “employment” all relate to “finding a job”. I kept repeating this step, each time trying to group existing concepts into higher-level categories and themes, until I reached a manageable number of groups. In this case, I was able to account for about 97% of searches with 30 main themes (with 20 themes accounting for 90%).
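The grouping itself is manual, interpretive work, but once a mapping from queries to concepts exists, the coverage figures are simple to compute. Here is a rough Python sketch, assuming a hand-built dictionary. The `CONCEPTS` mapping, the function names and the counts are illustrative assumptions, not the project’s actual data.

```python
from collections import Counter

# Hypothetical hand-built mapping from query to concept.
CONCEPTS = {
    "gps": "navigation systems",
    "sat nav": "navigation systems",
    "tomtom": "navigation systems",
    "careers": "finding a job",
    "jobs": "finding a job",
    "employment": "finding a job",
}


def group_by_concept(rows, concepts, fallback="(ungrouped)"):
    """Sum search counts per concept; unmapped queries go to the fallback."""
    totals = Counter()
    for query, count in rows:
        totals[concepts.get(query.strip().lower(), fallback)] += count
    return totals


def cumulative_coverage(totals):
    """Return (concept, count, cumulative share) tuples, largest concept first."""
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    running = 0
    result = []
    for concept, count in totals.most_common():
        running += count
        result.append((concept, count, running / grand_total))
    return result


# Hypothetical counts, for illustration only.
rows = [("gps", 40), ("sat nav", 25), ("TomTom", 15),
        ("careers", 10), ("jobs", 6), ("xyz", 4)]
themes = group_by_concept(rows, CONCEPTS)
for concept, count, share in cumulative_coverage(themes):
    print(f"{concept}: {count} searches, {share:.0%} cumulative")
```

The cumulative column makes it easy to see how many themes are needed to cover a given share of searches, and the fallback bucket shows how much is still unaccounted for.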

While categorising the data, it’s important to keep in mind that the goal of the analysis is not to reach some sort of ‘correct’ taxonomy of concepts, but to find what user needs are expressed in the data. The intended outcome is a list of activities or things the users were trying to accomplish while searching. Examples are “contacting a dealer”, “renewing warranty”, “buying a used car” etc. Of course, some searches are too ambiguous to interpret at this point, but “searching by location” is a better category name than “cities in Europe”.

It’s OK to revisit existing categories and break them up or reorganise their internal grouping based on existing or new understanding of the data and what it represents. For example, I started with a single group named “(looking for) cars”, which I then split into “(looking for) current models” and “(looking for) discontinued models”. Upon further examination, not all of the “current models” were available for sale in this particular market, so I regrouped that category into “available models” and “unavailable models”. Unsurprisingly, this revealed that people searching for unavailable models were much more likely to leave the site. I had discovered a user need that I was previously unaware of and that was currently not adequately met by the existing site. These sorts of insights are hard to get from a flat list of search terms, but naturally emerge as patterns begin to appear in the data.
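Comparing groups like this comes down to aggregating a metric per category. As a hedged sketch, assume each query comes with a number of searches and a number of exits directly after searching. The exact metric, the placeholder model names and the numbers below are assumptions for illustration, not the data from this project.

```python
from collections import defaultdict


def exit_rate_by_group(rows, group_of, fallback="(ungrouped)"):
    """Compute exits / searches per group.

    rows: iterable of (query, searches, exits) tuples.
    group_of: dict mapping a query to its group name.
    """
    searches = defaultdict(int)
    exits = defaultdict(int)
    for query, n_searches, n_exits in rows:
        group = group_of.get(query, fallback)
        searches[group] += n_searches
        exits[group] += n_exits
    # Skip groups without any searches to avoid dividing by zero.
    return {g: exits[g] / searches[g] for g in searches if searches[g] > 0}


# Hypothetical grouping and counts, for illustration only.
group_of = {
    "model a": "available models",
    "model b": "available models",
    "model c": "unavailable models",
    "model d": "unavailable models",
}
rows = [("model a", 100, 20), ("model b", 50, 10),
        ("model c", 40, 24), ("model d", 10, 6)]
rates = exit_rate_by_group(rows, group_of)
for group, rate in rates.items():
    print(f"{group}: {rate:.0%} exits")
```

Summing exits and searches before dividing weights each query by its volume, which is usually what you want when comparing whole categories.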

[Figure: taxonomy of car-related keywords]

Interpreting the results

Extracting themes from data through repeated grouping and clustering is a great first step towards a better understanding of users’ needs. The list of categories that emerges should be a decent indication of the most important things people were trying to accomplish using the site search. It’s important to note that although this analysis is grounded in data, it is still a qualitative method. It’s convenient that grouping similar terms allows for aggregation of related metrics, but it’s tempting to read too much into the numbers. A high number of searches for a particular topic could mean that topic is important, but it might just as well mean that it’s difficult to find information about it on the site through other means. The analysis of one source of data often raises questions that can only be answered using another.

On its own, this method serves as a good way to generate initial working theories or hypotheses to test against other sources of information. It can form the basis of a targeted analysis of site usage statistics or help find the right groups of users to interview. The purpose is not to prove anything, but to obtain an informed basis for design decisions that truly improve the user experience of a product, service or website.
