When people use search engines, they do so to meet some information need. It might seem obvious how one would establish what that need was: by looking at the query terms, exploring the pages users actually click on (that’s why Google directs clicks through a URL containing the query terms as well as the target page, for example), looking at dwell time, looking at the last page selected, and so on. We might also look at query reformulation, to see if users attempt a number of different queries to meet the same information need, and/or look for search sessions in which multiple information needs might be associated; for example, when we explore a holiday destination, each topic (hotels, attractions, weather) is an individual need but might appropriately be grouped within a session.
So, exploring query reformulation is interesting, but how do we do that? Well, really common and fairly basic textual methods include Levenshtein distance (the number of single-character edits needed to turn one string into the other) or simple term difference (e.g. of the terms present, how many are new versus reused from the earlier query). Looking for other topical features is another interesting method, but a challenging one. If we take an example discussed in this paper on query reformulation:
“Take the queries ‘hotels in New York City’ and ‘weather in New York City’ as an example. The two queries are very likely to have been issued by a user who is planning to travel to New York City. The two queries have 5 words each, 4 of them are shared by the two queries. Hence, most of the solutions proposed in previous work for this problem will incorrectly assume that the second query is a reformulation of the first due to the high word overlap ratio and the small edit distance”
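To see how those basic textual measures behave on the paper’s example, here’s a minimal sketch in Python (rather than R) using only the standard library — both function names are mine, not from the paper:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum single-character insertions,
    deletions, or substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def word_overlap(a: str, b: str) -> float:
    """Fraction of distinct words shared between two queries (Jaccard)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

q1 = "hotels in New York City"
q2 = "weather in New York City"
print(levenshtein(q1, q2))   # small edit distance (only the heads differ)
print(word_overlap(q1, q2))  # 4 of 6 distinct words shared: ~0.67
```

Both measures score the pair as highly similar, which is exactly the false positive the quoted passage describes: the queries share a location but express different needs.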
This of course has implications for spotting how users are engaging with information, and for things like “suggested search”. In that paper, the authors use a query segmentation method based on a head-modifier structure to separate out the elements of a query: in the example above, one head is about weather, the other about hotels, and they share a modifier (“New York City”). We can then look at various things like shared concepts, partial concept matches, etc. Reading the paper, it’s more complex than that, but I’m thinking about whether I could recreate some sort of dummy method in R…