Competitor Backlink Analysis Using Python [Complete Script]

In my last post, we analyzed our backlinks using data from Ahrefs.

This time, we use the same Ahrefs data source and include competitor backlinks in the analysis for comparison.

Like last time, we define the value of website backlinks as a product of quality and quantity.

Quality is the domain authority (or Ahrefs’ equivalent domain rating) and quantity is the number of referring domains.

Again, we will use available data to evaluate link quality before evaluating quantity.

Time to code.

import os
import re
import time
import random
import pandas as pd
import numpy as np
import datetime as dt
from datetime import timedelta
from plotnine import *
import matplotlib.pyplot as plt
from pandas.api.types import is_string_dtype
from pandas.api.types import is_numeric_dtype
import uritools  

pd.set_option('display.max_colwidth', None)
%matplotlib inline

root_domain = 'johnsankey.co.uk'
hostdomain = 'www.johnsankey.co.uk'
hostname="johnsankey"
full_domain = 'https://www.johnsankey.co.uk'
target_name="John Sankey"

Data import and cleaning

We set up the file directory to read multiple Ahrefs exported data files in one folder, which is much faster, less boring, and more efficient than reading each file individually.

Especially when you have more than 10!

ahrefs_path="data/"

The listdir() function of the os module allows us to list all files in a subdirectory.

# Keep only the CSV exports; this also skips macOS's hidden .DS_Store file if it is present.
ahrefs_filenames = [f for f in os.listdir(ahrefs_path) if f.endswith('.csv')]
ahrefs_filenames

The file names are listed below:

['www.davidsonlondon.com--refdomains-subdomain__2022-03-13_23-37-29.csv',
 'www.stephenclasper.co.uk--refdomains-subdoma__2022-03-13_23-47-28.csv',
 'www.touchedinteriors.co.uk--refdomains-subdo__2022-03-13_23-42-05.csv',
 'www.lushinteriors.co--refdomains-subdomains__2022-03-13_23-44-34.csv',
 'www.kassavello.com--refdomains-subdomains__2022-03-13_23-43-19.csv',
 'www.tulipinterior.co.uk--refdomains-subdomai__2022-03-13_23-41-04.csv',
 'www.tgosling.com--refdomains-subdomains__2022-03-13_23-38-44.csv',
 'www.onlybespoke.com--refdomains-subdomains__2022-03-13_23-45-28.csv',
 'www.williamgarvey.co.uk--refdomains-subdomai__2022-03-13_23-43-45.csv',
 'www.hadleyrose.co.uk--refdomains-subdomains__2022-03-13_23-39-31.csv',
 'www.davidlinley.com--refdomains-subdomains__2022-03-13_23-40-25.csv',
 'johnsankey.co.uk-refdomains-subdomains__2022-03-18_15-15-47.csv']

After listing the files, we use a for loop to read each file individually, collect the results in a list, and concatenate them into a single data frame.

When reading the file, we’ll use some string manipulation to create a new column with the site name of the data we’re importing.

ahrefs_df_lst = list()
ahrefs_colnames = list()

for filename in ahrefs_filenames:
    df = pd.read_csv(ahrefs_path + filename)
    df['site'] = filename
    df['site'] = df['site'].str.replace('www.', '', regex = False)    
    df['site'] = df['site'].str.replace('.csv', '', regex = False)
    df['site'] = df['site'].str.replace('-.+', '', regex = True)
    ahrefs_colnames.append(df.columns)
    ahrefs_df_lst.append(df)

ahrefs_df_raw = pd.concat(ahrefs_df_lst)
ahrefs_df_raw
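To see what the three string replacements in the loop do, here is a minimal sketch that applies them to two of the file names listed above. The variable names filenames and site are only for illustration.

```python
import pandas as pd

# Two file names copied from the listing above.
filenames = pd.Series([
    'www.davidsonlondon.com--refdomains-subdomain__2022-03-13_23-37-29.csv',
    'johnsankey.co.uk-refdomains-subdomains__2022-03-18_15-15-47.csv',
])

site = (
    filenames
    .str.replace('www.', '', regex=False)   # drop the literal "www." prefix
    .str.replace('.csv', '', regex=False)   # drop the file extension
    .str.replace('-.+', '', regex=True)     # drop everything from the first hyphen onwards
)

print(site.tolist())
# ['davidsonlondon.com', 'johnsankey.co.uk']
```

Because the last pattern cuts at the first hyphen, a domain that itself contains a hyphen would be truncated, so it is worth checking the resulting site values.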

Image via Ahrefs, May 2022

Now we have the raw data from each site in a single data frame. The next step is to clean up the column names and make them easier to use.

While the repetition below could be eliminated with a custom function or a single list comprehension, it is good practice for beginner SEO Pythonistas to see what is happening step by step. As they say, “repetition is the mother of mastery”, so get practicing!

competitor_ahrefs_cleancols = ahrefs_df_raw
competitor_ahrefs_cleancols.columns = [col.lower() for col in competitor_ahrefs_cleancols.columns]
competitor_ahrefs_cleancols.columns = [col.replace(' ','_') for col in competitor_ahrefs_cleancols.columns]
competitor_ahrefs_cleancols.columns = [col.replace('.','_') for col in competitor_ahrefs_cleancols.columns]
competitor_ahrefs_cleancols.columns = [col.replace('__','_') for col in competitor_ahrefs_cleancols.columns]
competitor_ahrefs_cleancols.columns = [col.replace('(','') for col in competitor_ahrefs_cleancols.columns]
competitor_ahrefs_cleancols.columns = [col.replace(')','') for col in competitor_ahrefs_cleancols.columns]
competitor_ahrefs_cleancols.columns = [col.replace('%','') for col in competitor_ahrefs_cleancols.columns]
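For readers who prefer the compact route, the same replacements could be wrapped in one helper. This is only a sketch: clean_col is a hypothetical name, and the demo headers are made up to exercise the rules rather than taken from an Ahrefs export.

```python
import pandas as pd

def clean_col(col):
    """Apply the same replacements as the step-by-step version, in the same order."""
    col = col.lower()
    for old, new in [(' ', '_'), ('.', '_'), ('__', '_'), ('(', ''), (')', ''), ('%', '')]:
        col = col.replace(old, new)
    return col

# Hypothetical header names, chosen only to exercise the replacements.
demo = pd.DataFrame(columns=['Domain Rating', 'Dofollow ref. domains', 'Links (all)'])
demo.columns = [clean_col(c) for c in demo.columns]

print(list(demo.columns))
# ['domain_rating', 'dofollow_ref_domains', 'links_all']
```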

A count column and a single-value column (‘project’) are useful for groupby and aggregation operations.

competitor_ahrefs_cleancols['rd_count'] = 1
competitor_ahrefs_cleancols['project'] = target_name

competitor_ahrefs_cleancols

Image via Ahrefs, May 2022

The columns are cleaned up, so now we’ll clean up the row data.

competitor_ahrefs_clean_dtypes = competitor_ahrefs_cleancols

For referring domains, we replace hyphens with zeros and set the data type to int (i.e., a whole number).

The same is repeated for linked domains.

competitor_ahrefs_clean_dtypes['dofollow_ref_domains'] = np.where(competitor_ahrefs_clean_dtypes['dofollow_ref_domains'] == '-',
                                                           0, competitor_ahrefs_clean_dtypes['dofollow_ref_domains'])
competitor_ahrefs_clean_dtypes['dofollow_ref_domains'] = competitor_ahrefs_clean_dtypes['dofollow_ref_domains'].astype(int)



# linked_domains

competitor_ahrefs_clean_dtypes['dofollow_linked_domains'] = np.where(competitor_ahrefs_clean_dtypes['dofollow_linked_domains'] == '-',
                                                           0, competitor_ahrefs_clean_dtypes['dofollow_linked_domains'])
competitor_ahrefs_clean_dtypes['dofollow_linked_domains'] = competitor_ahrefs_clean_dtypes['dofollow_linked_domains'].astype(int)
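Since the same hyphen handling is applied to two columns, it could also be expressed once and reused. A minimal sketch on made-up values, where hyphen_to_int is a hypothetical helper name:

```python
import numpy as np
import pandas as pd

def hyphen_to_int(series):
    """Replace the '-' placeholder with 0, then cast the column to int."""
    replaced = np.where(series == '-', 0, series)
    return pd.Series(replaced, index=series.index).astype(int)

# Made-up values standing in for the Ahrefs export columns.
demo = pd.DataFrame({
    'dofollow_ref_domains': ['12', '-', '3'],
    'dofollow_linked_domains': ['-', '7', '40'],
})

for col in ['dofollow_ref_domains', 'dofollow_linked_domains']:
    demo[col] = hyphen_to_int(demo[col])

print(demo)
```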

First seen gives us the date on which the link was found, which we can use for time series plotting and for deriving the link age.

We will use the to_datetime function to convert to date format.

# first_seen
competitor_ahrefs_clean_dtypes['first_seen'] = pd.to_datetime(competitor_ahrefs_clean_dtypes['first_seen'], 
                                                              format="%d/%m/%Y %H:%M")
competitor_ahrefs_clean_dtypes['first_seen'] = competitor_ahrefs_clean_dtypes['first_seen'].dt.normalize()
competitor_ahrefs_clean_dtypes['month_year'] = competitor_ahrefs_clean_dtypes['first_seen'].dt.to_period('M')
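The three date steps can be tried on a couple of made-up first_seen strings in the same day-first format. The sample values are illustrative only.

```python
import pandas as pd

# Made-up sample values in the day/month/year hour:minute format used above.
first_seen = pd.Series(['13/03/2022 23:37', '01/11/2018 08:05'])

parsed = pd.to_datetime(first_seen, format='%d/%m/%Y %H:%M')
dates = parsed.dt.normalize()            # set the time part to midnight
month_year = dates.dt.to_period('M')     # monthly period, used later for grouping

print(dates.dt.strftime('%Y-%m-%d').tolist())   # ['2022-03-13', '2018-11-01']
print(month_year.astype(str).tolist())          # ['2022-03', '2018-11']
```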

To calculate link_age, we just subtract the first seen date from today’s date and convert the difference to a number.

# link age
# Difference between now and first_seen, expressed in whole days.
competitor_ahrefs_clean_dtypes['link_age'] = pd.Timestamp.now() - competitor_ahrefs_clean_dtypes['first_seen']
competitor_ahrefs_clean_dtypes['link_age'] = (competitor_ahrefs_clean_dtypes['link_age'].dt.total_seconds() / (3600 * 24)).round(0)
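Because the link age depends on the day the script is run, it can help to test the arithmetic against a fixed reference date. A small sketch, where as_of is a hypothetical stand-in for “today” and the dates are made up:

```python
import pandas as pd

first_seen = pd.to_datetime(pd.Series(['2022-03-13', '2021-03-13']))

# Hypothetical fixed "today" so the result is reproducible.
as_of = pd.Timestamp('2022-05-01')

# Subtracting gives a timedelta column; .dt.days extracts whole days.
link_age_days = (as_of - first_seen).dt.days

print(link_age_days.tolist())
# [49, 414]
```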

The target column helps us distinguish the “client” site from its competitors, which we will use for visualization later.

competitor_ahrefs_clean_dtypes['target'] = np.where(competitor_ahrefs_clean_dtypes['site'].str.contains('johns'),
                                                                                            1, 0)
competitor_ahrefs_clean_dtypes['target'] = competitor_ahrefs_clean_dtypes['target'].astype('category')
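Matching on the substring 'johns' works for this set of sites, but it would also flag any competitor whose domain happens to contain it. A stricter sketch, assuming the site column holds exactly the root domain, compares against root_domain instead:

```python
import numpy as np
import pandas as pd

root_domain = 'johnsankey.co.uk'

# Three of the site names from the file listing above.
sites = pd.DataFrame({'site': ['johnsankey.co.uk', 'davidlinley.com', 'tgosling.com']})

# 1 for the client site, 0 for every competitor.
sites['target'] = np.where(sites['site'] == root_domain, 1, 0)
sites['target'] = sites['target'].astype('category')

print(sites)
```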

competitor_ahrefs_clean_dtypes

Image via Ahrefs, May 2022

Now that the data is cleaned in terms of both column headers and row values, we are ready to start analyzing.

Link quality

We start with link quality, which we accept as measured by Domain Rating (DR).

Let’s first examine the distributional properties of DR by plotting the distribution with the geom_boxplot function.

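# Assumption: the analysis frame is the cleaned frame built above; it is not defined anywhere else.
competitor_ahrefs_analysis = competitor_ahrefs_clean_dtypes

comp_dr_dist_box_plt = (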
    ggplot(competitor_ahrefs_analysis.loc[competitor_ahrefs_analysis['dr'] > 0], 
           aes(x = 'reorder(site, dr)', y = 'dr', colour="target")) + 
    geom_boxplot(alpha = 0.6) +
    scale_y_continuous() +   
    theme(legend_position = 'none', 
          axis_text_x=element_text(rotation=90, hjust=1)
         ))

comp_dr_dist_box_plt.save(filename="images/4_comp_dr_dist_box_plt.png", 
                           height=5, width=10, units="in", dpi=1000)
comp_dr_dist_box_plt

Image via Ahrefs, May 2022

The plot compares the sites’ statistical properties side by side and, most notably, the interquartile range, which shows where most referring domains fall in terms of domain rating.

We also see that John Sankey has the fourth highest median domain rating, which compares well to the link quality of other sites.

William Garvey has the most diverse range of DR compared to the other sites, suggesting slightly more relaxed criteria for link acquisition. Who knows.
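The numbers behind a box plot can also be read off directly. The sketch below computes the quartiles and interquartile range per site on made-up DR values. The site names and figures are purely illustrative and are not taken from the Ahrefs data.

```python
import pandas as pd

# Made-up DR values for two hypothetical sites.
links = pd.DataFrame({
    'site': ['a.com'] * 5 + ['b.com'] * 5,
    'dr': [10, 20, 30, 40, 50, 5, 5, 45, 85, 95],
})

# One row per site, one column per quantile.
q = links.groupby('site')['dr'].quantile([0.25, 0.5, 0.75]).unstack()
q.columns = ['q1', 'median', 'q3']
q['iqr'] = q['q3'] - q['q1']   # width of the box in the box plot

print(q)
```

A wider iqr corresponds to a taller box, which is what “more diverse DR” looks like in the plot.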

Link volume

That’s quality. What about the volume of links from referring domains?

To tackle that, we’ll use the groupby function to total the referring domains per site and month, and then compute a running sum.

competitor_count_cumsum_df = competitor_ahrefs_analysis

competitor_count_cumsum_df = competitor_count_cumsum_df.groupby(['site', 'month_year'])['rd_count'].sum().reset_index()

The expanding function allows the calculation window to grow with the number of rows, which is how we achieve the running sum.

competitor_count_cumsum_df['count_runsum'] = competitor_count_cumsum_df['rd_count'].expanding().sum()

competitor_count_cumsum_df
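One thing worth checking here: expanding() applied to the whole column accumulates in row order across all sites, so each site’s running sum starts from the total of the sites listed before it. If a separate running total per site is what you want, a sketch on made-up data (hypothetical sites and counts) would group by site before accumulating:

```python
import pandas as pd

# Made-up monthly counts for two hypothetical sites.
monthly = pd.DataFrame({
    'site': ['a.com', 'a.com', 'b.com', 'b.com'],
    'month_year': pd.PeriodIndex(['2021-01', '2021-02', '2021-01', '2021-02'], freq='M'),
    'rd_count': [2, 3, 10, 1],
})

# Running sum over all rows, regardless of site.
monthly['runsum_all_rows'] = monthly['rd_count'].expanding().sum()

# Running sum restarted for each site.
monthly['runsum_per_site'] = monthly.groupby('site')['rd_count'].cumsum()

print(monthly)
```

In this toy frame, b.com starts at 15 in the first column but at 10 in the second, which is the difference between the two approaches.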

Image via Ahrefs, May 2022

The result is a data frame with site, month_year, and count_runsum (the running sum), which is in the right shape to feed the graph.

competitor_count_cumsum_plt = (
    ggplot(competitor_count_cumsum_df, aes(x = 'month_year', y = 'count_runsum', 
                                           group = 'site', colour="site")) + 
    geom_line(alpha = 0.6, size = 2) +
    labs(y = 'Running Sum of Referring Domains', x = 'Month Year') + 
    scale_y_continuous() + 
    scale_x_date() +
    theme(legend_position = 'right', 
          axis_text_x=element_text(rotation=90, hjust=1)
         ))

competitor_count_cumsum_plt.save(filename="images/5_count_cumsum_smooth_plt.png", 
                           height=5, width=10, units="in", dpi=1000)

competitor_count_cumsum_plt

Image via Ahrefs, May 2022

The graph shows the number of referring domains per site since 2014.

I find the different starting positions for each site, at the point they began acquiring links, quite interesting.

For example, William Garvey started out with over 5,000 domains. I would love to know who their PR firm is!

We can also see the growth rate. For example, while Hadley Rose started getting links in 2018, things didn’t really take off until around mid-2021.

More, more, more

You can always do a more scientific analysis.

For example, an immediate and natural extension of the above is to combine quality (DR) and quantity (volume) to get a more complete picture of how sites compare in terms of off-site SEO.
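As a rough illustration of that idea, and following the earlier definition of link value as a product of quality and quantity, one possible sketch multiplies each site’s median DR by its number of referring domains. The choice of the median, the link_value name, and the sample data are all assumptions made for this example.

```python
import pandas as pd

# Made-up referring domains for two hypothetical sites.
links = pd.DataFrame({
    'site': ['a.com', 'a.com', 'a.com', 'b.com', 'b.com'],
    'dr': [10, 30, 50, 70, 90],
    'rd_count': [1, 1, 1, 1, 1],
})

summary = (
    links.groupby('site')
    .agg(median_dr=('dr', 'median'), referring_domains=('rd_count', 'sum'))
    .reset_index()
)

# Quality (median DR) multiplied by quantity (number of referring domains).
summary['link_value'] = summary['median_dr'] * summary['referring_domains']

print(summary)
```

Here b.com scores higher with fewer links because its referring domains carry a higher DR.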

Another extension would be to model the qualities of those referring domains for both your own and your competitors’ sites, to see which link features (such as the word count or relevance of the linking content) could explain the difference in visibility between you and your competitors.

This modeling extension would be a good application of machine learning techniques.


Featured Image: F8 Studio/Shutterstock
