Semantic keyword clustering for over 10,000 keywords [With Script]

Semantic keyword clustering can help you take your keyword research to the next level.

In this article, you’ll learn how to use a Google Colaboratory sheet shared exclusively with Search Engine Journal readers.

This article walks you through using the Google Colab sheet, gives a high-level view of how it works under the hood, and shows how to adjust it to suit your needs.

But first, why cluster keywords?

Common Use Cases for Keyword Clustering

Here are some use cases for clustering keywords.

Faster keyword research:

Filter out branded keywords or keywords with no commercial value.
Group related keywords together to create more in-depth articles.
Group related questions and answers together to create FAQs.

Paid Search Campaigns:

Create negative keyword lists for ads faster using large datasets, and stop wasting money on junk keywords!
Group similar keywords into ideas for ad campaigns.

Here’s an example of the script grouping similar questions together, perfect for an in-depth article!

Screenshot of Microsoft Excel, February 2022

Problems with earlier versions of this tool

If you’ve been following my work on Twitter, you’ll know that I’ve been experimenting with keyword clustering for a while.

An earlier version of this script was based on the excellent PolyFuzz library, using TF-IDF matching.

While it got the job done, there were always some head-scratching clusters, and I felt the original results could be improved upon.

Words with similar letter patterns were clustered together even if they were not semantically related.

Conversely, it could not cluster semantically related words such as “Bike” and “Bicycle”, because they share few letter patterns.
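To see why matching on letter patterns misses synonyms, here is a minimal sketch that compares keywords by the character trigrams they share. It is a simplified stand-in for character-based TF-IDF matching, not PolyFuzz's actual implementation, and the helper names are made up for illustration.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams in a lowercased keyword."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard overlap of character n-grams: shared n-grams / all n-grams."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# "bike" and "hike" share letters but not meaning;
# "bike" and "bicycle" share meaning but almost no letter patterns.
print(ngram_similarity("bike", "hike"))
print(ngram_similarity("bike", "bicycle"))
```

In this toy comparison the unrelated pair scores higher than the synonyms, which is the kind of failure a letter-based approach runs into.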

Earlier versions of the script had other issues:

They didn’t work well in languages other than English.
They created a large number of keywords that could not be clustered.
There was not much control over how the clusters were created.
Due to limited resources, the script was restricted to about 10,000 rows before timing out.

Semantic Keyword Clustering Using Deep Learning Natural Language Processing (NLP)

Fast forward four months to the latest version, which has been completely rewritten to take advantage of state-of-the-art deep learning sentence embeddings.

Check out some of these great semantic clusters!

Notice how related terms such as heat and warmth end up in the same cluster of keywords?

Excel table showing examples of semantic keyword clustering

Screenshot of Microsoft Excel, February 2022

Or how about wholesale and bulk?

Excel sheet showing another example of semantic keyword clustering

Screenshot of Microsoft Excel, February 2022

Dogs and dachshunds, Xmas and Christmas?

Excel sheet showing another example of semantic keyword clustering. Shows that dachshunds and dogs have been grouped together.

Screenshot of Microsoft Excel, February 2022

It can even cluster keywords in over a hundred different languages!

Excel table showing another example of French semantic keyword clustering

Screenshot of Microsoft Excel, February 2022

Features of the New Script Compared With Earlier Iterations

In addition to semantic keyword grouping, the following improvements have been added in the latest version of this script.

Supports clustering of more than 10,000 keywords at a time.
Fewer no-cluster groups.
Ability to choose a different pretrained model (although the default model works fine!).
Ability to choose how closely related clusters are.
Choose the minimum number of keywords to use per cluster.
Automatic detection of character encoding and CSV delimiters.
Multilingual clustering.
Works with many common keyword exports out of the box (Search Console data, AdWords, or third-party keyword tools like Ahrefs and Semrush).
Works with any CSV file that has a column named “Keywords”.
Simple to use (the script works by inserting a new column called Cluster Name into any uploaded keyword list).
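As a rough illustration of the delimiter detection and “Keywords” column lookup listed above, here is a minimal sketch using only Python's standard library. It is not the script's own code: the function name and the sample data are hypothetical, and encoding detection is left out.

```python
import csv
import io


def read_keywords(raw_text, column="Keywords"):
    """Guess the delimiter, then return the values of the keyword column."""
    # csv.Sniffer inspects a sample of the text and guesses the CSV dialect.
    dialect = csv.Sniffer().sniff(raw_text[:2048], delimiters=",;\t|")
    reader = csv.DictReader(io.StringIO(raw_text), dialect=dialect)
    if column not in (reader.fieldnames or []):
        raise ValueError(f"No column named {column!r} in {reader.fieldnames}")
    return [row[column] for row in reader]


# Hypothetical export that uses semicolons instead of commas.
sample = "Keywords;Volume\nalpaca socks;100\nwool socks;50\n"
print(read_keywords(sample))
```

Restricting the sniffer to a handful of likely delimiters keeps it from mistaking a common letter or space for a separator.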

How to Use the Script in Five Steps (Quick Start)

To get started, click this link, then select the option Open in Colab, as shown below.

Screenshot of Google Colaboratory, February 2022

Change the runtime type to GPU by selecting Runtime > Change runtime type.

Google Colab, how to change settings to use GPU

Screenshot of Google Colaboratory, February 2022

Select Runtime > Run all from Google Colaboratory’s top navigation (or just press Ctrl+F9).

Screenshot of Google Colaboratory, February 2022

Upload a .csv file with a column named “Keywords” when prompted.

Screenshot of Google Colaboratory, February 2022

Clustering should be fairly fast, but ultimately depends on the number of keywords and the model used.

In general, you should be fine with up to 50,000 keywords.

If you see a CUDA out of memory error, you are trying to cluster too many keywords at once!

(It’s worth noting that this script can easily be adapted to run on a local machine without the limitations of Google Colaboratory.)

Script Output

The script will run and append the clusters to your original file in a new column named Cluster Name.

Cluster names are assigned using the shortest-length keyword in the cluster.

For example, the cluster name for the following set of keywords has been set to “alpaca socks” because this is the shortest keyword in the cluster.

Example output of the script showing keywords clustered under alpaca socks

Screenshot of Microsoft Excel, February 2022
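The naming rule is simple enough to sketch in a couple of lines. This is an illustration of the shortest-keyword rule described above, not the script's own code, and the keyword list is made up.

```python
def cluster_name(keywords):
    """Pick the shortest keyword as the cluster label (first one wins on ties)."""
    return min(keywords, key=len)


# Hypothetical cluster of related keywords.
cluster = ["alpaca wool socks", "alpaca socks", "alpaca socks for men"]
print(cluster_name(cluster))
```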

When the clustering is complete, a new file is automatically saved and the clusters are added to the original file in a new column.

How the Keyword Clustering Tool Works

The script is based on the fast clustering algorithm and uses models that have been pretrained at scale on large amounts of data.

This makes it easy to compute semantic relationships between keywords using off-the-shelf models.

(You don’t have to be a data scientist to use it!)

In fact, while I’ve made everything customizable for those who like to tinker and experiment, I’ve chosen some balanced defaults that should be reasonable for most people’s use cases.
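For the curious, the general idea behind threshold-based clustering of embeddings can be sketched in plain Python. This is a simplified illustration under my own assumptions, not the script's code: the 2-D vectors are made-up stand-ins for real sentence embeddings, and the function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def threshold_clusters(vectors, threshold=0.85, min_cluster_size=2):
    """Greedy clustering: the largest groups of similar items claim members first."""
    n = len(vectors)
    candidates = []
    for i in range(n):
        # Every item (including i itself) at least `threshold` similar to item i.
        members = [j for j in range(n) if cosine(vectors[i], vectors[j]) >= threshold]
        if len(members) >= min_cluster_size:
            candidates.append(members)
    candidates.sort(key=len, reverse=True)

    assigned, clusters = set(), []
    for members in candidates:
        free = [j for j in members if j not in assigned]
        if len(free) >= min_cluster_size:
            clusters.append(free)
            assigned.update(free)
    return clusters


# Toy "embeddings": three similar items, two similar items, one outlier.
vectors = [(1.0, 0.1), (0.9, 0.2), (0.95, 0.15), (0.1, 1.0), (0.2, 0.9), (-1.0, -1.0)]
print(threshold_clusters(vectors))
```

The outlier never reaches the similarity threshold with anything else, so it is left out of every cluster, which is what a no-cluster keyword looks like in this simplified picture.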

Different models can be swapped in and out of the script depending on your requirements (faster clustering, better multilingual support, better semantic performance, and so on).

After extensive testing, I found that the all-MiniLM-L6-v2 transformer gave the best balance between speed and accuracy.

If you prefer to experiment with your own choice, you can replace the existing pretrained model with any of the models listed here or on the Hugging Face Model Hub.

Swap pretrained models

Swapping models is as simple as replacing the variable with the name of your preferred transformer.

For example, you can change the default model all-MiniLM-L6-v2 to all-mpnet-base-v2 by editing:

transformer = 'all-MiniLM-L6-v2'

to

transformer = 'all-mpnet-base-v2'

You can edit it in the Google Colaboratory sheet here.

How to choose a sentence transformer for keyword clustering

Screenshot of Google Colaboratory, February 2022

The Trade-Off Between Cluster Accuracy and No-Cluster Groups

A common complaint with previous iterations of the script was that it resulted in a large number of non-clustered results.

Unfortunately, it’s always a balancing act between cluster accuracy and number of clusters.

A higher cluster accuracy setting will result in more non-clustered results.

There are two variables that can directly affect the size and accuracy of all clusters:

min_cluster_size

and

cluster accuracy

I set a default value of 85 (out of 100) for cluster accuracy and a minimum cluster size of 2.

In testing, I found this to be the sweet spot, but feel free to experiment!

This is where these variables are set in the script.

How to set minimum cluster size and keyword clustering accuracy

Screenshot of Google Colaboratory, February 2022
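To make the trade-off concrete, here is a small sketch that counts how many keywords have no sufficiently similar partner at different accuracy settings. The similarity scores are invented for illustration, the function name is hypothetical, and I am assuming the 0 to 100 accuracy value maps onto a 0 to 1 similarity threshold and that the minimum cluster size is 2.

```python
def isolated_keywords(similarity, cluster_accuracy):
    """Return indices of keywords with no partner at or above the threshold."""
    threshold = cluster_accuracy / 100  # assumed mapping from 0-100 to 0-1
    isolated = []
    for i, row in enumerate(similarity):
        partners = [j for j, score in enumerate(row) if j != i and score >= threshold]
        if not partners:
            isolated.append(i)
    return isolated


# Made-up pairwise similarity scores for four keywords.
similarity = [
    [1.00, 0.92, 0.87, 0.30],
    [0.92, 1.00, 0.80, 0.25],
    [0.87, 0.80, 1.00, 0.20],
    [0.30, 0.25, 0.20, 1.00],
]

for accuracy in (85, 90, 95):
    print(accuracy, isolated_keywords(similarity, accuracy))
```

As the setting rises, more keywords lose every partner and can no longer join any cluster.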

That’s it! I hope this keyword clustering script is useful for your work.


Featured Image: Graphic Grid/Shutterstock
