In my last post, we analyzed our backlinks using data from Ahrefs.
This time, we're including competitor backlinks in the analysis, using the same Ahrefs data source for comparison.
Like last time, we define the value of website backlinks as a product of quality and quantity.
Quality is the domain authority (or Ahrefs’ equivalent domain rating) and quantity is the number of referring domains.
Again, we will use available data to evaluate link quality before evaluating quantity.
Time to code.
import os
import re
import time
import random
import pandas as pd
import numpy as np
import datetime as dt
from datetime import timedelta
from plotnine import *
import matplotlib.pyplot as plt
from pandas.api.types import is_string_dtype
from pandas.api.types import is_numeric_dtype
import uritools
pd.set_option('display.max_colwidth', None)
%matplotlib inline
root_domain = 'johnsankey.co.uk'
hostdomain = 'www.johnsankey.co.uk'
hostname = 'johnsankey'
full_domain = 'https://www.johnsankey.co.uk'
target_name = 'John Sankey'
Data import and cleaning
We set up the file directory to read multiple Ahrefs exported data files in one folder, which is much faster, less boring, and more efficient than reading each file individually.
Especially when you have more than 10!
ahrefs_path="data/"
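The listdir() function from the os module lets us list all the files in a subdirectory.
ahrefs_filenames = os.listdir(ahrefs_path)
# keep only the CSV exports, which also skips hidden files such as .DS_Store
ahrefs_filenames = [f for f in ahrefs_filenames if f.endswith('.csv')]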
ahrefs_filenames
The file names are listed below:
['www.davidsonlondon.com--refdomains-subdomain__2022-03-13_23-37-29.csv',
'www.stephenclasper.co.uk--refdomains-subdoma__2022-03-13_23-47-28.csv',
'www.touchedinteriors.co.uk--refdomains-subdo__2022-03-13_23-42-05.csv',
'www.lushinteriors.co--refdomains-subdomains__2022-03-13_23-44-34.csv',
'www.kassavello.com--refdomains-subdomains__2022-03-13_23-43-19.csv',
'www.tulipinterior.co.uk--refdomains-subdomai__2022-03-13_23-41-04.csv',
'www.tgosling.com--refdomains-subdomains__2022-03-13_23-38-44.csv',
'www.onlybespoke.com--refdomains-subdomains__2022-03-13_23-45-28.csv',
'www.williamgarvey.co.uk--refdomains-subdomai__2022-03-13_23-43-45.csv',
'www.hadleyrose.co.uk--refdomains-subdomains__2022-03-13_23-39-31.csv',
'www.davidlinley.com--refdomains-subdomains__2022-03-13_23-40-25.csv',
'johnsankey.co.uk-refdomains-subdomains__2022-03-18_15-15-47.csv']
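A slightly more defensive way to build the file list is to match only the CSV exports in the first place, so hidden files such as .DS_Store never need removing by hand. The sketch below runs against a temporary folder with made-up file names; list_csv_exports is a hypothetical helper, not part of the original notebook.

```python
import tempfile
from pathlib import Path


def list_csv_exports(folder):
    """Return the sorted names of all .csv files directly inside folder."""
    return sorted(p.name for p in Path(folder).glob('*.csv'))


# Demo on a throwaway folder with made-up file names
with tempfile.TemporaryDirectory() as tmp:
    for name in ['www.example-a.com--refdomains.csv',
                 'example-b.co.uk-refdomains.csv',
                 '.DS_Store']:
        (Path(tmp) / name).touch()
    demo_filenames = list_csv_exports(tmp)

print(demo_filenames)
```

In the notebook, the equivalent call would be list_csv_exports(ahrefs_path).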
After listing the files, we'll use a for loop to read each file individually and collect them all into a single data frame.
When reading the file, we’ll use some string manipulation to create a new column with the site name of the data we’re importing.
ahrefs_df_lst = list()
ahrefs_colnames = list()
for filename in ahrefs_filenames:
    df = pd.read_csv(ahrefs_path + filename)
    df['site'] = filename
    df['site'] = df['site'].str.replace('www.', '', regex = False)
    df['site'] = df['site'].str.replace('.csv', '', regex = False)
    df['site'] = df['site'].str.replace('-.+', '', regex = True)
    ahrefs_colnames.append(df.columns)
    ahrefs_df_lst.append(df)
ahrefs_df_raw = pd.concat(ahrefs_df_lst)
ahrefs_df_raw
Image via Ahrefs, May 2022
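To see what the three replacements do, here is the same chain applied to two of the file names above using plain Python strings. site_from_filename is a hypothetical helper for illustration only. Note that the last pattern cuts at the first hyphen, so a domain that itself contains a hyphen would be truncated.

```python
import re


def site_from_filename(filename):
    """Mirror the three str.replace steps used in the loop."""
    site = filename.replace('www.', '')
    site = site.replace('.csv', '')
    site = re.sub('-.+', '', site)  # drop everything from the first hyphen onward
    return site


examples = [
    'www.davidsonlondon.com--refdomains-subdomain__2022-03-13_23-37-29.csv',
    'johnsankey.co.uk-refdomains-subdomains__2022-03-18_15-15-47.csv',
]
sites = [site_from_filename(name) for name in examples]
print(sites)
```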
Now we have the raw data from each site in a single data frame. The next step is to clean up the column names and make them easier to use.
While a custom function or a list comprehension could eliminate the repetition, spelling out each step is good practice for beginner SEO Pythonistas and makes it easier to see what's going on. As they say, “repetition is the mother of mastery”, so get practicing!
competitor_ahrefs_cleancols = ahrefs_df_raw
competitor_ahrefs_cleancols.columns = [col.lower() for col in competitor_ahrefs_cleancols.columns]
competitor_ahrefs_cleancols.columns = [col.replace(' ','_') for col in competitor_ahrefs_cleancols.columns]
competitor_ahrefs_cleancols.columns = [col.replace('.','_') for col in competitor_ahrefs_cleancols.columns]
competitor_ahrefs_cleancols.columns = [col.replace('__','_') for col in competitor_ahrefs_cleancols.columns]
competitor_ahrefs_cleancols.columns = [col.replace('(','') for col in competitor_ahrefs_cleancols.columns]
competitor_ahrefs_cleancols.columns = [col.replace(')','') for col in competitor_ahrefs_cleancols.columns]
competitor_ahrefs_cleancols.columns = [col.replace('%','') for col in competitor_ahrefs_cleancols.columns]
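Once the step-by-step version feels familiar, the same clean-up can be folded into one function. This is only a minimal sketch: clean_colname is a hypothetical helper, and the sample headers are made up for illustration rather than copied from an Ahrefs export.

```python
def clean_colname(col):
    """Apply the same replacements as the step-by-step version, in the same order."""
    col = col.lower()
    for old, new in [(' ', '_'), ('.', '_'), ('__', '_'),
                     ('(', ''), (')', ''), ('%', '')]:
        col = col.replace(old, new)
    return col


sample_headers = ['Domain Rating', 'Dofollow ref. domains', 'Traffic (%)']
cleaned = [clean_colname(col) for col in sample_headers]
print(cleaned)
```

Applied to the data frame, this would read: competitor_ahrefs_cleancols.columns = [clean_colname(col) for col in competitor_ahrefs_cleancols.columns].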
A count column and a single-value column (‘project’) are useful for groupby and aggregation operations.
competitor_ahrefs_cleancols['rd_count'] = 1
competitor_ahrefs_cleancols['project'] = target_name
competitor_ahrefs_cleancols
Image via Ahrefs, May 2022
The columns are cleaned up, so now we'll clean up the row data.
competitor_ahrefs_clean_dtypes = competitor_ahrefs_cleancols
For referring domains, we replace hyphens with zeros and set the data type to int (i.e., a whole number).
The same is then repeated for linked domains.
# referring domains
competitor_ahrefs_clean_dtypes['dofollow_ref_domains'] = np.where(competitor_ahrefs_clean_dtypes['dofollow_ref_domains'] == '-',
                                                                  0, competitor_ahrefs_clean_dtypes['dofollow_ref_domains'])
competitor_ahrefs_clean_dtypes['dofollow_ref_domains'] = competitor_ahrefs_clean_dtypes['dofollow_ref_domains'].astype(int)
# linked_domains
competitor_ahrefs_clean_dtypes['dofollow_linked_domains'] = np.where(competitor_ahrefs_clean_dtypes['dofollow_linked_domains'] == '-',
                                                                     0, competitor_ahrefs_clean_dtypes['dofollow_linked_domains'])
competitor_ahrefs_clean_dtypes['dofollow_linked_domains'] = competitor_ahrefs_clean_dtypes['dofollow_linked_domains'].astype(int)
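Here is the same pattern on a toy column, to show what np.where is doing. The values are made up; only the '-' placeholder comes from the export described above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'dofollow_ref_domains': ['12', '-', '3']})

# Where the cell is a hyphen use 0, otherwise keep the original value
toy['dofollow_ref_domains'] = np.where(toy['dofollow_ref_domains'] == '-',
                                       0, toy['dofollow_ref_domains'])
# The column still holds text at this point, so cast it to whole numbers
toy['dofollow_ref_domains'] = toy['dofollow_ref_domains'].astype(int)

print(toy['dofollow_ref_domains'].tolist())
```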
First seen gives us the date on which the link was found, which we can use for time series plotting and for deriving the link age.
We will use the to_datetime function to convert to date format.
# first_seen
competitor_ahrefs_clean_dtypes['first_seen'] = pd.to_datetime(competitor_ahrefs_clean_dtypes['first_seen'],
format="%d/%m/%Y %H:%M")
competitor_ahrefs_clean_dtypes['first_seen'] = competitor_ahrefs_clean_dtypes['first_seen'].dt.normalize()
competitor_ahrefs_clean_dtypes['month_year'] = competitor_ahrefs_clean_dtypes['first_seen'].dt.to_period('M')
To calculate link_age, we just subtract the first seen date from today’s date and convert the difference to a number.
# link age
competitor_ahrefs_clean_dtypes['link_age'] = pd.Timestamp.now() - competitor_ahrefs_clean_dtypes['first_seen']
# convert the time difference to a number of days
competitor_ahrefs_clean_dtypes['link_age'] = (competitor_ahrefs_clean_dtypes['link_age'] / pd.Timedelta(days = 1)).round(0)
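To make the date handling concrete, here is a small sketch on made-up first_seen values in the same day/month/year format. It uses a fixed reference date instead of today's date purely so the result is reproducible; that substitution is an assumption of the example, not what the analysis does.

```python
import pandas as pd

toy = pd.DataFrame({'first_seen': ['13/03/2022 23:37', '01/01/2022 08:00']})

# Parse the text, then drop the time of day so only the date remains
toy['first_seen'] = pd.to_datetime(toy['first_seen'], format='%d/%m/%Y %H:%M').dt.normalize()
toy['month_year'] = toy['first_seen'].dt.to_period('M')

# Fixed reference date (assumption for reproducibility)
reference_date = pd.Timestamp('2022-03-18')
toy['link_age'] = (reference_date - toy['first_seen']).dt.days

print(toy)
```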
The target column helps us distinguish the client's site from its competitors, which is useful for visualization later.
competitor_ahrefs_clean_dtypes['target'] = np.where(competitor_ahrefs_clean_dtypes['site'].str.contains('johns'),
1, 0)
competitor_ahrefs_clean_dtypes['target'] = competitor_ahrefs_clean_dtypes['target'].astype('category')
competitor_ahrefs_clean_dtypes
Image via Ahrefs, May 2022
Now that the data is cleaned up in terms of both column headers and row values, we're ready to start analyzing.
Link quality
We start with link quality, for which we'll accept Domain Rating (DR) as the measure.
Let's first examine the distributional properties of DR by plotting its distribution with the geom_boxplot function.
# assumption: the analysis data frame is the cleaned data frame from the previous step
competitor_ahrefs_analysis = competitor_ahrefs_clean_dtypes
comp_dr_dist_box_plt = (
    ggplot(competitor_ahrefs_analysis.loc[competitor_ahrefs_analysis['dr'] > 0],
           aes(x = 'reorder(site, dr)', y = 'dr', colour = 'target')) +
    geom_boxplot(alpha = 0.6) +
    scale_y_continuous() +
    theme(legend_position = 'none',
          axis_text_x = element_text(rotation = 90, hjust = 1))
)
comp_dr_dist_box_plt.save(filename = 'images/4_comp_dr_dist_box_plt.png',
                          height = 5, width = 10, units = 'in', dpi = 1000)
comp_dr_dist_box_plt
Image via Ahrefs, May 2022
The plot compares the statistical properties of the sites side by side and, most notably, shows the interquartile range, which is where most referring domains fall in terms of domain rating.
We also see that John Sankey has the fourth highest median domain rating, which compares well to the link quality of other sites.
William Garvey has the most diverse range of DR compared with the other sites, which suggests slightly looser criteria for link acquisition. Who knows.
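The numbers behind a box plot can also be tabulated directly. The sketch below computes the quartiles, median, and interquartile range per site on made-up DR values; the column names q1, median, q3, and iqr are chosen for this example, and the plotting library may compute its quartiles slightly differently.

```python
import pandas as pd

toy = pd.DataFrame({
    'site': ['a.com'] * 4 + ['b.com'] * 4,
    'dr': [10, 20, 30, 40, 5, 5, 50, 80],
})

# One row per site, one column per quantile
dr_summary = toy.groupby('site')['dr'].quantile([0.25, 0.5, 0.75]).unstack()
dr_summary.columns = ['q1', 'median', 'q3']
# The interquartile range is the spread of the middle half of the values
dr_summary['iqr'] = dr_summary['q3'] - dr_summary['q1']

print(dr_summary)
```

A wider iqr for one site than another is what shows up as a taller box in the plot.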
Link volume
That's quality. What about the volume of links from referring domains?
To answer that, we'll use the groupby function to sum the referring domains per site and month, and then compute a running sum.
competitor_count_cumsum_df = competitor_ahrefs_analysis
competitor_count_cumsum_df = competitor_count_cumsum_df.groupby(['site', 'month_year'])['rd_count'].sum().reset_index()
The expanding function allows the calculation window to grow with the number of rows, which is how we implement the running sum.
competitor_count_cumsum_df['count_runsum'] = competitor_count_cumsum_df['rd_count'].expanding().sum()
competitor_count_cumsum_df
Image via Ahrefs, May 2022
The result is a data frame with site, month_year, and count_runsum (the running sum), which is in the right format to feed the graph.
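One thing to keep in mind: called on the whole column, expanding().sum() runs down the data frame in row order, so the total carries over from one site to the next rather than restarting for each site. If a running sum per site is what you're after, a minimal sketch on made-up numbers is to take the cumulative sum within each site group. The column names count_runsum_all and count_runsum_site exist only in this example.

```python
import pandas as pd

toy = pd.DataFrame({
    'site': ['a.com', 'a.com', 'b.com', 'b.com'],
    'month_year': ['2021-01', '2021-02', '2021-01', '2021-02'],
    'rd_count': [2, 3, 10, 1],
})
toy = toy.sort_values(['site', 'month_year'])

# Running sum over the whole frame (carries over between sites)
toy['count_runsum_all'] = toy['rd_count'].expanding().sum()
# Running sum that restarts for each site
toy['count_runsum_site'] = toy.groupby('site')['rd_count'].cumsum()

print(toy)
```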
competitor_count_cumsum_plt = (
    ggplot(competitor_count_cumsum_df,
           aes(x = 'month_year', y = 'count_runsum', group = 'site', colour = 'site')) +
    geom_line(alpha = 0.6, size = 2) +
    labs(y = 'Running Sum of Referring Domains', x = 'Month Year') +
    scale_y_continuous() +
    scale_x_date() +
    theme(legend_position = 'right',
          axis_text_x = element_text(rotation = 90, hjust = 1))
)
competitor_count_cumsum_plt.save(filename = 'images/5_count_cumsum_smooth_plt.png',
                                 height = 5, width = 10, units = 'in', dpi = 1000)
competitor_count_cumsum_plt
Image via Ahrefs, May 2022
The plot shows the running total of referring domains for each site since 2014.
I find the different starting positions of each site quite interesting as they begin acquiring links.
For example, William Garvey started out with over 5,000 domains. I would love to know who their PR firm is!
We can also see the growth rate. For example, while Hadley Rose started getting links in 2018, things didn’t really take off until around mid-2021.
More, more, more
You can always do a more scientific analysis.
For example, an immediate and natural extension of the above is to combine quality (DR) and quantity (volume) to get a more complete picture of how sites compare in terms of off-site SEO.
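As a rough sketch of that combination, one could multiply each site's median DR by its number of referring domains, following the quality-times-quantity definition from the start of the post. Using the median as the quality measure, the link_value column name, and the toy numbers are all assumptions made for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'site': ['a.com', 'a.com', 'a.com', 'b.com', 'b.com'],
    'dr': [10, 20, 60, 40, 50],
    'rd_count': 1,  # one row per referring domain, as in the cleaned data
})

# Quality: median DR per site. Quantity: number of referring domains per site.
site_value = (
    toy.groupby('site')
       .agg(median_dr=('dr', 'median'), ref_domains=('rd_count', 'sum'))
       .reset_index()
)
# Value as the product of quality and quantity
site_value['link_value'] = site_value['median_dr'] * site_value['ref_domains']

print(site_value)
```

Other summaries of quality, such as the mean DR or a DR-weighted count, would fit the same structure.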
Another extension is to model the qualities of those referring domains, for both your own and your competitors' sites, to see which link features (such as the word count or relevance of the linking content) could explain the difference in visibility between you and your competitors.
This modeling extension would be a good application of machine learning techniques.
Featured Image: F8 Studio/Shutterstock