Semantic keyword clustering can help you take your keyword research to the next level.
In this article, you’ll learn how to use the Google Colaboratory sheet shared exclusively with Search Engine Journal readers.
This article walks you through using the Google Colab sheet, gives a high-level view of how it works under the hood, and shows how to adjust it to suit your needs.
But first, why cluster keywords?
Common Use Cases for Keyword Clustering
Here are some use cases for clustering keywords.
Faster keyword research:
- Filter out branded keywords or keywords with no commercial value.
- Group related keywords together to create more in-depth articles.
- Group related questions and answers together to create FAQs.
Paid Search Campaigns:
- Create negative keyword lists for ads faster with large datasets, and stop wasting money on junk keywords!
- Group similar keywords into ad campaign ideas.
Here’s an example of the script clustering similar questions together, perfect for an in-depth article!
Problems with earlier versions of this tool
If you’ve been following my work on Twitter, you’ll know that I’ve been experimenting with keyword clustering for a while.
An earlier version of this script was based on the excellent PolyFuzz library, using TF-IDF matching.
While it got the job done, there were always some head-scratching clusters, and I felt the original results could be improved upon.
Words with similar letter patterns were clustered together even if they were not semantically related.
Conversely, it could not cluster related words with different spellings, such as “Bike” and “Bicycle”.
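To see why letter-pattern matching behaves this way, here is a minimal sketch that compares keywords by their overlapping character trigrams. It is a simplified stand-in for TF-IDF matching, not the PolyFuzz implementation, and the function names are made up for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in a lowercased keyword."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def ngram_similarity(a, b, n=3):
    """Cosine similarity between the n-gram count vectors of two keywords."""
    va, vb = char_ngrams(a, n), char_ngrams(b, n)
    dot = sum(va[gram] * vb[gram] for gram in va)
    norm_a = math.sqrt(sum(count * count for count in va.values()))
    norm_b = math.sqrt(sum(count * count for count in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# "bike" and "bicycle" mean the same thing but share no trigrams.
print(ngram_similarity("bike", "bicycle"))
# "bike" and "hike" are unrelated but share the trigram "ike".
print(ngram_similarity("bike", "hike"))
```

The unrelated pair scores higher than the synonym pair, which is the weakness that semantic embeddings are meant to address.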
Earlier versions of the script had other issues:
- It didn’t work well in languages other than English.
- It created a large number of groups that could not be clustered.
- There was not much control over how the clusters were created.
- Due to lack of resources, the script was limited to about 10,000 rows before timing out.
Semantic Keyword Clustering Using Deep Learning Natural Language Processing (NLP)
Fast forward four months to the latest version, which has been completely rewritten to take advantage of state-of-the-art deep learning sentence embeddings.
Check out some of these great semantic clusters!
Notice how keywords about heat and warmth are included in the same cluster?
Screenshot of Microsoft Excel, February 2022
Or how about wholesale and bulk?
Screenshot of Microsoft Excel, February 2022
Dogs and Dachshunds, Xmas and Christmas?
Screenshot of Microsoft Excel, February 2022
It can even cluster keywords in over a hundred different languages!
Screenshot of Microsoft Excel, February 2022
Features of the New Script vs. Earlier Iterations
In addition to semantic keyword grouping, the following improvements have been added in the latest version of this script.
- Supports clustering of more than 10,000 keywords at a time.
- Fewer no-cluster groups.
- Ability to choose a different pretrained model (although the default model works fine!).
- Ability to choose how closely related clusters are.
- Choose the minimum number of keywords to use per cluster.
- Automatic detection of character encoding and CSV delimiters.
- Multilingual clustering.
- Works with many common keyword exports out of the box (Search Console data, AdWords, or third-party keyword tools like Ahrefs and Semrush).
- Works with any CSV file that has a column named “Keywords”.
- Simple to use (the script works by inserting a new column called Cluster Name into any uploaded keyword list).
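As a rough illustration of the delimiter detection and the “Keywords” column requirement listed above, here is a minimal sketch using Python’s standard csv module. It is an assumption about how such detection could be done, not the script’s actual code. The helper name read_keywords and the sample data are hypothetical, and character encoding detection is not covered.

```python
import csv
import io


def read_keywords(raw_text, column="Keywords"):
    """Detect the delimiter, then return the values of the keyword column."""
    # csv.Sniffer guesses the dialect from the text itself.
    dialect = csv.Sniffer().sniff(raw_text, delimiters=",;\t|")
    reader = csv.DictReader(io.StringIO(raw_text), dialect=dialect)
    if column not in (reader.fieldnames or []):
        raise ValueError(f"No column named {column!r} found")
    return [row[column] for row in reader]


# Hypothetical semicolon-delimited export.
sample = "Keywords;Volume\nalpaca socks;1200\nalpaca wool socks;300\n"
print(read_keywords(sample))
```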
How to Use the Script in Five Steps (Quick Start)
To get started, click this link and then select the option Open in Colab, as shown below.
Screenshot of Google Colaboratory, February 2022
Change the runtime type to GPU by selecting Runtime > Change runtime type.
Screenshot of Google Colaboratory, February 2022
Select Runtime > Run all from Google Colaboratory’s top navigation (or just press Ctrl+F9).
Screenshot of Google Colaboratory, February 2022
Upload a .csv file with a column named “Keywords” when prompted.
Screenshot of Google Colaboratory, February 2022
Clustering should be fairly fast, but ultimately depends on the number of keywords and the model used.
In general, you should be fine with up to 50,000 keywords.
If you see a CUDA out of memory error, you are trying to cluster too many keywords at once!
(It’s worth noting that this script can easily be adapted to run on a local machine without the limitations of Google Colaboratory.)
Script Output
The script will run and append the clusters to your original file in a new column called Cluster Name.
Cluster names are assigned using the shortest-length keyword in the cluster.
For example, the cluster name for the following set of keywords has been set to “alpaca socks” because this is the shortest keyword in the cluster.
Screenshot of Microsoft Excel, February 2022
When the clustering is complete, a new file is automatically saved, with the clusters added to the original file in a new column.
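The naming rule described above can be sketched in a couple of lines. This is an illustration rather than the script’s own code. The helper name name_cluster is hypothetical, and apart from “alpaca socks” the keywords are invented for the example.

```python
def name_cluster(keywords):
    """Return the shortest keyword; ties go to the one listed first."""
    return min(keywords, key=len)


# Hypothetical cluster members.
cluster = ["alpaca socks for women", "alpaca socks", "best alpaca wool socks"]

# Each keyword row gets the same value in a new "Cluster Name" column.
rows = [{"Keywords": kw, "Cluster Name": name_cluster(cluster)} for kw in cluster]
for row in rows:
    print(row)
```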
How the Keyword Clustering Tool Works
The script is based on the Fast Clustering algorithm and uses models that have been pretrained at scale on large amounts of data.
This makes it easy to compute semantic relationships between keywords using off-the-shelf models.
(You don’t have to be a data scientist to use it!)
In fact, while I’ve customized it for those who like to tinker and experiment, I’ve chosen some balanced defaults that should be reasonable for most people’s use cases.
Different models can be swapped in and out of the script as needed (faster clustering, better multilingual support, better semantic performance, etc.).
After extensive testing, I found that the all-MiniLM-L6-v2 transformer provided a good balance between speed and accuracy.
If you prefer to experiment with your own, you can replace the existing pretrained model with any of the models listed here or on the Hugging Face Model Hub.
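To give a feel for what threshold-based clustering over embeddings does, here is a simplified pure-Python sketch. It is not the script’s code: the real script gets its vectors from a pretrained model, whereas the three-dimensional vectors below are hand-made toy values, and the function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def detect_communities(vectors, threshold=0.85, min_cluster_size=2):
    """Group indices whose vectors are at least `threshold` similar."""
    n = len(vectors)
    candidates = []
    for i in range(n):
        # Every item (including i itself) close enough to item i.
        members = [j for j in range(n) if cosine(vectors[i], vectors[j]) >= threshold]
        if len(members) >= min_cluster_size:
            candidates.append(members)
    # Largest candidate groups claim their members first.
    candidates.sort(key=len, reverse=True)
    assigned, clusters = set(), []
    for members in candidates:
        fresh = [j for j in members if j not in assigned]
        if len(fresh) >= min_cluster_size:
            clusters.append(fresh)
            assigned.update(fresh)
    return clusters


# Toy "embeddings" (made up; real ones have hundreds of dimensions).
keywords = ["heated gloves", "thermal gloves", "warm gloves",
            "wholesale socks", "bulk socks", "dog bed"]
vectors = [[0.90, 0.10, 0.00], [0.85, 0.15, 0.05], [0.88, 0.12, 0.02],
           [0.10, 0.90, 0.10], [0.12, 0.88, 0.08], [0.00, 0.10, 0.95]]

clusters = detect_communities(vectors)
for members in clusters:
    print([keywords[j] for j in members])
```

With these toy values, the three glove keywords form one cluster, the two sock keywords form another, and “dog bed” is left unclustered because nothing else is close enough to it.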
Swapping Pretrained Models
Swapping models is as simple as replacing the variable with the name of your preferred transformer.
For example, you can change the default model all-MiniLM-L6-v2 to all-mpnet-base-v2 by editing:
transformer = 'all-MiniLM-L6-v2'
to
transformer = 'all-mpnet-base-v2'
You can edit it in the Google Colaboratory sheet here.
Screenshot of Google Colaboratory, February 2022
The Trade-Off Between Cluster Accuracy and No-Cluster Groups
A common complaint with previous iterations of the script was that it resulted in a large number of non-clustered results.
Unfortunately, it’s always a balancing act between cluster accuracy and number of clusters.
A higher cluster accuracy setting will result in more non-clustered results.
There are two variables that can directly affect the size and accuracy of all clusters:
min_cluster_size
and
cluster accuracy
I set a default value of 85 (/100) for cluster accuracy and a minimum cluster size of 2.
In testing, I found this to be the sweet spot, but feel free to experiment!
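The balancing act between the two settings can be illustrated with a small sketch. The similarity scores below are made up, the function name is hypothetical, and dividing the accuracy by 100 to get a similarity threshold is an assumption about how an 85 (/100) setting would be applied. The per-keyword neighbour count is also a simplification of real clustering.

```python
# Made-up similarity scores between keyword pairs (illustrative only).
PAIR_SIMILARITY = {
    ("alpaca socks", "alpaca wool socks"): 0.91,
    ("alpaca socks", "warm alpaca socks"): 0.87,
    ("alpaca wool socks", "warm alpaca socks"): 0.83,
    ("alpaca socks", "alpaca sweater"): 0.62,
}
KEYWORDS = ["alpaca socks", "alpaca wool socks", "warm alpaca socks", "alpaca sweater"]


def unclustered(cluster_accuracy, min_cluster_size=2):
    """Return keywords with too few close neighbours to form a cluster."""
    threshold = cluster_accuracy / 100  # assumption: 85 (/100) becomes 0.85
    leftovers = []
    for kw in KEYWORDS:
        neighbours = sum(
            1 for pair, score in PAIR_SIMILARITY.items()
            if kw in pair and score >= threshold
        )
        # A keyword plus its close neighbours must reach the minimum size.
        if neighbours + 1 < min_cluster_size:
            leftovers.append(kw)
    return leftovers


for accuracy in (80, 85, 90, 95):
    print(accuracy, unclustered(accuracy))
```

Raising the accuracy from 85 to 95 leaves every keyword in this toy set unclustered, and raising the minimum cluster size has a similar effect.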
This is where these variables are set in the script.
Screenshot of Google Colaboratory, February 2022
That’s it! I hope this keyword clustering script is useful for your work.
Featured Image: Graphic Grid/Shutterstock



