Yext Data Science Presents Research at IEEE Big Data 2022 Conference

Yext presented research conducted on machine learning clustering algorithms at the Institute of Electrical and Electronics Engineers (IEEE) Big Data 2022 conference.

Ariana Martino

Mar 1, 2023

4 min

In December, Yext presented research conducted on machine learning clustering algorithms at the Institute of Electrical and Electronics Engineers (IEEE) Big Data 2022 conference. The paper, A Hybrid Score to Optimize Clustering Hyperparameters for Online Search Term Data, authored by Yext data scientists Allison Rossetto and Ariana Martino, represents the first published piece of peer-reviewed research from Yext's research and development function.

Read on to see key takeaways from the paper.

Clustering at Yext

Yext Search helps customers build AI-powered Search experiences with natural language understanding. This includes search analytics features like Search Term Clustering, which groups together search terms that are semantically similar (that is, terms with a similar meaning or intent) so that website administrators can focus on the big picture of trends in their users' searches.

At Yext, we cluster search terms as part of your search analytics. The clusters allow you to identify and monitor trends in the intents of users searching on your site, even when users phrase similar intents differently.

In Yext Search, search term clustering analysis is performed for hundreds of different websites' data weekly. The websites vary in traffic, such that a week's worth of search terms may be anywhere from hundreds to hundreds of thousands of data points.

To build these clustering reports, we turn each query into a point in space using a neural network model that transforms queries into embeddings. The more similar two queries are, the closer they'll be in space.

We can imagine this in two dimensions:

Terms like "outage map" and "outage near me" are more similar to one another, and thus closer to each other on a plane, than they are to terms unrelated to outages like "tv lineup." While it is easiest to visualize the concept of embeddings in two dimensions, in reality, Yext's embeddings include over 700 dimensions to represent each query.
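As a rough sketch of that idea, here is how one might embed a few queries with an off-the-shelf 768-dimensional sentence-embedding model and compare them with cosine similarity. The specific model and similarity measure are assumptions for illustration only; the blog does not name the model Yext uses.

```python
# Illustrative sketch: an off-the-shelf sentence-transformer stands in for
# Yext's embedding model (assumption), producing 768-dimensional vectors.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-mpnet-base-v2")

queries = ["outage map", "outage near me", "tv lineup"]
embeddings = model.encode(queries)  # shape: (3, 768)

# Semantically similar queries end up closer together (higher cosine similarity).
sims = cosine_similarity(embeddings)
print(sims[0, 1])  # "outage map" vs "outage near me" -> relatively high
print(sims[0, 2])  # "outage map" vs "tv lineup"      -> relatively low
```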

Based on these points represented by search term embeddings, queries are clustered together using density-based spatial clustering of applications with noise (DBSCAN), which has a hyperparameter called epsilon. Epsilon is the maximum distance between two points for them to be considered neighbors, and thus candidates for the same DBSCAN cluster. When epsilon is lower, a higher density of points is required for a cluster to form. On the other hand, increasing epsilon tends to increase the average size of clusters. You can see this in an illustrative example from scikit-learn:

In this example, increasing epsilon from 0.5 to 2.0 creates broader clusters where more distant points are grouped together. You can also tell by eye that, when epsilon=2.0, the clustering seems to fit the data more accurately, because there really are 4 bigger clusters. However, when we do search term clustering in high-dimensional space, it can be trickier to figure out which epsilon value is most appropriate.
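The effect is easy to reproduce with scikit-learn directly. The sketch below runs DBSCAN at both epsilon values on synthetic two-dimensional blobs (the dataset here is made up for illustration, not Yext's search data):

```python
# A minimal sketch of the effect described above, using scikit-learn's DBSCAN
# on synthetic 2-D blobs (illustrative data only).
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=400, centers=4, cluster_std=1.2, random_state=42)

for eps in (0.5, 2.0):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # -1 marks noise points
    n_noise = list(labels).count(-1)
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")

# With eps=0.5 the dense blobs fragment into many small clusters plus noise;
# with eps=2.0 more distant points join together, yielding a few broad clusters.
```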

The Research

In order to improve search term clustering, we developed a score called the Hybrid Cluster Score (HCS) that indicates how well a round of clustering fits the distribution of the data. With this, we are able to test out lots of different epsilon values and choose the best one.

The HCS is called a "hybrid" score because it combines three popular metrics of clustering quality:

  • Silhouette Coefficient (SC) – a measure of how well each point fits in its own cluster

  • Calinski-Harabasz Score (CH) – a measure of between-cluster vs. within-cluster dispersion

  • Davies-Bouldin Index (DBI) – a measure of how close clusters are to their nearest neighbor

These three scores are combined to form the HCS, which allows us to optimize for all three of these qualities of good clustering at once.
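The exact combining formula is defined in the paper; the sketch below only illustrates the optimization loop, sweeping epsilon values, computing the three component metrics with scikit-learn, and keeping the epsilon with the best combined score. The combine() function here is a placeholder assumption, not Yext's HCS.

```python
# Illustrative only: combine() is a placeholder hybrid score, not the HCS from the paper.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    silhouette_score,
    calinski_harabasz_score,
    davies_bouldin_score,
)

X, _ = make_blobs(n_samples=400, centers=4, cluster_std=1.2, random_state=42)

def combine(sc, ch, dbi):
    # Placeholder: reward SC and CH (higher is better), penalize DBI (lower is better).
    return sc + np.log1p(ch) - dbi

best_eps, best_score = None, -np.inf
for eps in np.arange(0.3, 3.0, 0.1):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    mask = labels != -1                      # score only non-noise points
    if len(set(labels[mask])) < 2:           # need at least two clusters to score
        continue
    sc = silhouette_score(X[mask], labels[mask])
    ch = calinski_harabasz_score(X[mask], labels[mask])
    dbi = davies_bouldin_score(X[mask], labels[mask])
    score = combine(sc, ch, dbi)
    if score > best_score:
        best_eps, best_score = eps, score

print(f"selected eps ~ {best_eps:.1f}")
```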

To test out the HCS, we used it to select what we hypothesized would be the optimal epsilon values for some test datasets and compared the clusters to those that were formed before the optimization exercise. Then, we asked a search domain expert to label the clusters based on whether they were better or worse than before.

In this experiment, we found that using the HCS to optimize hyperparameters improved the clusters nearly 80% of the time.

Presenting at IEEE

I was lucky enough to be able to present this research at the IEEE Big Data 2022 conference in Osaka, Japan this past December.

It was an honor to be able to speak to an audience of data scientists, engineers, and researchers from all over the world about how Yext is using machine learning and natural language processing to improve search experiences across the web.

We hope this will be the first of many Yext papers to be published and presented to the scientific community.

Learn More

For more details on our clustering optimization research, the full paper is available through IEEE. Plus, check out this video explaining the findings.
