News sitemaps use an extension of the standard sitemap protocol to give news search engines more information.
A news sitemap contains the news published in the last 48 hours.
News sitemap tags include the news publication’s title, language, name, genre, publication date, keywords, and even a stock symbol.
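To make these tags concrete, here is a minimal sketch that parses a single-entry news sitemap with Python's standard library. The sample XML, its URL, and the `parse_news_sitemap` helper are hypothetical and exist only for illustration; real sitemaps may contain more or fewer tags.

```python
import xml.etree.ElementTree as ET

# Hypothetical single-entry news sitemap; the URL and values are made up.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
  <url>
    <loc>https://www.example.com/world/sample-article</loc>
    <news:news>
      <news:publication>
        <news:name>Example News</news:name>
        <news:language>en</news:language>
      </news:publication>
      <news:publication_date>2022-03-01T10:00:00+00:00</news:publication_date>
      <news:title>Sample headline about a news event</news:title>
      <news:keywords>World news, Europe</news:keywords>
    </news:news>
  </url>
</urlset>"""

# Namespace prefixes used in the lookups below.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "news": "http://www.google.com/schemas/sitemap-news/0.9",
}


def parse_news_sitemap(xml_text):
    """Return one dict per <url> entry with the main news tags."""
    root = ET.fromstring(xml_text)
    rows = []
    for url in root.findall("sm:url", NS):
        keywords = url.findtext("news:news/news:keywords", default="", namespaces=NS)
        rows.append({
            "loc": url.findtext("sm:loc", namespaces=NS),
            "name": url.findtext("news:news/news:publication/news:name", namespaces=NS),
            "language": url.findtext("news:news/news:publication/news:language", namespaces=NS),
            "publication_date": url.findtext("news:news/news:publication_date", namespaces=NS),
            "title": url.findtext("news:news/news:title", namespaces=NS),
            # Keywords are comma separated inside a single tag.
            "keywords": [k.strip() for k in keywords.split(",") if k.strip()],
        })
    return rows


rows = parse_news_sitemap(SAMPLE_SITEMAP)
print(rows[0]["title"], rows[0]["language"], rows[0]["keywords"])
```

Each tag becomes one field per article, which is the same shape the dataframe columns take later in this tutorial.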
How do you leverage these sitemaps for content research and competitive analysis?
In this Python tutorial, you’ll learn a 10-step process for analyzing a news sitemap and visualizing the top trends found in it.
Housekeeping Considerations to Get Us Started
This tutorial was written during the Russian invasion of Ukraine.
Using machine learning, we can even label news sources and articles based on which news sources are “objective” and which ones are “sarcastic.”
But for simplicity, we will focus on topic frequency analysis.
We will use 10+ global news sources from the US and UK.
Note: We would like to include Russian news sources, but they don’t have a proper news sitemap. Even if they did, they block external requests.
Comparing occurrences of the words “invasion” and “liberation” in news sources from both the West and the East shows the benefits of distributional frequency text analysis methods.
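As a rough illustration of that kind of comparison, the sketch below counts two terms per source. The headlines and source names are invented for the example and do not come from any real sitemap; `count_terms` is a hypothetical helper.

```python
import re
from collections import Counter

# Hypothetical headlines, invented for illustration only.
headlines_by_source = {
    "source_a": [
        "Invasion enters second week",
        "Leaders condemn the invasion",
    ],
    "source_b": [
        "Officials describe a liberation effort",
        "Talks continue amid invasion fears",
    ],
}


def count_terms(headlines, terms):
    """Count how often each term of interest appears across headlines."""
    counts = Counter()
    for headline in headlines:
        # Lowercase and tokenize on word characters so "Invasion" matches "invasion".
        for token in re.findall(r"\w+", headline.lower()):
            if token in terms:
                counts[token] += 1
    return counts


terms = {"invasion", "liberation"}
term_counts = {
    source: count_terms(titles, terms)
    for source, titles in headlines_by_source.items()
}

for source, counts in term_counts.items():
    print(source, dict(counts))
```

With real data, the headlines would come from the “news_title” column of each sitemap dataframe.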
What does it take to analyze news content with Python?
Listed below are relevant Python libraries for reviewing news sitemaps for newsfeed content strategy:
- Advertools.
- Pandas.
- Plotly Express, Subplots, and Graph Objects.
- re (regex).
- String.
- NLTK (Corpus, Stopwords, Ngrams).
- Unicodedata.
- Matplotlib.
- Basic Python syntax understanding.
10 Steps to News Sitemap Analysis Using Python
All set up? Let’s start.
1. Get news URL from news sitemap
We selected The Guardian, New York Times, Washington Post, Daily Mail, Sky News, BBC, and CNN to examine news URLs in news sitemaps.
import advertools as adv
df_guardian = adv.sitemap_to_df("http://www.theguardian.com/sitemaps/news.xml")
df_nyt = adv.sitemap_to_df("https://www.nytimes.com/sitemaps/new/news.xml.gz")
df_wp = adv.sitemap_to_df("https://www.washingtonpost.com/arcio/news-sitemap/")
df_bbc = adv.sitemap_to_df("https://www.bbc.com/sitemaps/https-index-com-news.xml")
df_dailymail = adv.sitemap_to_df("https://www.dailymail.co.uk/google-news-sitemap.xml")
df_skynews = adv.sitemap_to_df("https://news.sky.com/sitemap-index.xml")
df_cnn = adv.sitemap_to_df("https://edition.cnn.com/sitemaps/cnn/news.xml")
2. Inspect a sample news sitemap using Python
I’ll use the BBC as an example to show what we’ve just pulled from these news sitemaps.
df_bbc
The BBC sitemap has the following columns.
df_bbc.columns
News sitemap labels as dataframe columns
The general data structure of these columns is as follows.
df_bbc.info()
News Sitemap Columns and Data Types
The BBC does not use the “news_publication” column, among others.
3. Find the most common words from the URL of a news publication
To see the most common words in news site URLs, we need to use the “str”, “explode” and “split” methods.
df_dailymail["loc"].str.split("/").str[5].str.split("-").explode().value_counts().to_frame()
| | loc |
|---|---|
| article | 176 |
| Russian | 50 |
| Ukraine | 50 |
| says | 38 |
| reveals | 38 |
| ... | ... |
| readers | 1 |
| Red | 1 |
| Cross | 1 |
| provide | 1 |
| weekend.html | 1 |

5445 rows × 1 column
We see that for the Daily Mail, “Russia and Ukraine” is the main topic.
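To show what the chained “split”, “explode”, and “value_counts” calls do, here is a plain-Python sketch of the same idea. The URLs are hypothetical, and the segment index of 5 is an assumption that matches this made-up URL layout; other sites will need a different index.

```python
from collections import Counter

# Hypothetical article URLs; the path layout is an assumption for this sketch.
urls = [
    "https://www.example.com/news/article-101/Russian-troops-advance.html",
    "https://www.example.com/news/article-102/Ukraine-talks-resume.html",
    "https://www.example.com/sport/article-103/Ukraine-match-postponed.html",
]


def slug_word_counts(urls, segment_index):
    """Split each URL on "/", take one segment, split it on "-", and count words."""
    counts = Counter()
    for url in urls:
        parts = url.split("/")
        if len(parts) > segment_index:
            # Equivalent to exploding the list of slug words into one row per word.
            counts.update(parts[segment_index].split("-"))
    return counts


counts = slug_word_counts(urls, segment_index=5)
print(counts.most_common(3))
```

Note that the file extension stays attached to the last word, which is why values such as “weekend.html” appear in the table above.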
4. Find the most commonly used languages in news publications
The URL structure or “languages” section of a news publication can be used to view the most commonly used languages in a news publication.
In this example, we use “BBC” to see their language priority.
df_bbc["publication_language"].value_counts().head(20).to_frame()
| | publication_language |
|---|---|
| en | 698 |
| fa | 52 |
| sr | 52 |
| ar | 47 |
| mr | 43 |
| hi | 43 |
| gu | 41 |
| ur | 35 |
| pt | 33 |
| te | 31 |
| ta | 31 |
| cy | 30 |
| ha | 29 |
| tr | 28 |
| es | 25 |
| sw | 22 |
| cpe | 22 |
| ne | 21 |
| pa | 21 |
| yo | 20 |

20 rows × 1 column
To reach Russian readers through Google News, Western news sources would need to publish in Russian.
Some international news organizations have started to adopt this approach.
If you are a news SEO, it can be helpful to watch your competitors’ Russian-language publications, both to see how objective news reaches Russia and to stay competitive in the news industry.
5. Review the word frequency of news headlines
We use the BBC to see “news headlines” and which words are more frequent.
df_bbc["news_title"].str.split(" ").explode().value_counts().to_frame()
| | news_title |
|---|---|
| to | 232 |
| in | 181 |
| - | 141 |
| of | 140 |
| for | 138 |
| ... | ... |
| ፊልም | 1 |
| ብላክ | 1 |
| ባንኪ | 1 |
| ጕሒላ | 1 |
| niile | 1 |

11916 rows × 1 columns
The problem here is that the headlines contain every type of word, including context-free stop words.
We need to remove these uninformative terms to better understand what the headlines focus on.
from nltk.corpus import stopwords
stop = stopwords.words('english')
df_bbc_news_title_most_used_words = df_bbc["news_title"].str.split(" ").explode().value_counts().to_frame()
# Word-boundary pattern that matches any English stop word
pat = r'\b(?:{})\b'.format('|'.join(stop))
# Copy the words (currently the index) into a column so string methods can be applied
df_bbc_news_title_most_used_words["words"] = df_bbc_news_title_most_used_words.index
df_bbc_news_title_most_used_words["without_stop_words"] = df_bbc_news_title_most_used_words["words"].str.replace(pat, "", regex=True)
# Drop the rows that are empty after stop word removal
df_bbc_news_title_most_used_words.drop(df_bbc_news_title_most_used_words.loc[df_bbc_news_title_most_used_words["without_stop_words"]==""].index, inplace=True)
df_bbc_news_title_most_used_words
The “without_stop_words” column refers to sanitized text values.
With the help of Pandas’ “regex” and “replace” methods, we removed most of the stop words.
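The role of the word-boundary pattern is easier to see on a handful of tokens. The sketch below uses a tiny hypothetical stop word list instead of the full NLTK list, and `strip_stop_words` is a helper invented for this example.

```python
import re

# Small hypothetical subset of stop words; NLTK's English list is much longer.
stop_words = ["to", "in", "of", "for", "the"]

# \b anchors the match to whole words, so "for" does not match inside "forward".
pattern = re.compile(r"\b(?:{})\b".format("|".join(map(re.escape, stop_words))))


def strip_stop_words(token):
    """Return the token with stop words removed (empty string if nothing is left)."""
    return pattern.sub("", token).strip()


tokens = ["to", "Ukraine", "in", "forward", "of", "talks"]
kept = [token for token in tokens if strip_stop_words(token)]
print(kept)
```

The pattern is case sensitive, so a capitalized “To” would survive unless the text is lowercased first. That is one reason only most of the stop words are removed.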
The second problem is removing “punctuation”.
For this, we’ll use Python’s “string” module.
import string
# regex=True is needed so the bracketed punctuation is treated as a character class
df_bbc_news_title_most_used_words["without_stop_word_and_punctation"] = df_bbc_news_title_most_used_words['without_stop_words'].str.replace('[{}]'.format(string.punctuation), '', regex=True)
df_bbc_news_title_most_used_words.drop(df_bbc_news_title_most_used_words.loc[df_bbc_news_title_most_used_words["without_stop_word_and_punctation"]==""].index, inplace=True)
df_bbc_news_title_most_used_words.drop(["without_stop_words", "words"], axis=1, inplace=True)
df_bbc_news_title_most_used_words
| | news_title | without_stop_word_and_punctation |
|---|---|---|
| Ukraine | 110 | Ukraine |
| v | 83 | v |
| de | 61 | de |
| Ukraine: | 60 | Ukraine |
| da | 51 | da |
| ... | ... | ... |
| ፊልም | 1 | ፊልም |
| ብላክ | 1 | ብላክ |
| ባንኪ | 1 | ባንኪ |
| ጕሒላ | 1 | ጕሒላ |
| niile | 1 | niile |

11767 rows × 2 columns
Alternatively, use “df_bbc_news_title_most_used_words[“news_title”].to_frame()” to get a clearer picture of the data.
| | news_title |
|---|---|
| Ukraine | 110 |
| v | 83 |
| de | 61 |
| Ukraine: | 60 |
| da | 51 |
| ... | ... |
| ፊልም | 1 |
| ብላክ | 1 |
| ባንኪ | 1 |
| ጕሒላ | 1 |
| niile | 1 |

11767 rows × 1 columns
We see 11,767 unique words in the BBC’s news titles, with “Ukraine” being the most popular, appearing 110 times.
There are different Ukraine-related phrases in the dataframe, such as “Ukraine:”.
“NLTK Tokenize” can be used to unify different variants of these types.
The next section will combine them using a different approach.
Note: If you want to make things easier, use Advertools as shown below.
adv.word_frequency(df_bbc["news_title"], phrase_len=2, rm_words=adv.stopwords["english"])
The results are as follows.
Text Analysis with Advertools
“adv.word_frequency” has the parameters “phrase_len” and “rm_words” to set the length of the counted phrases and to remove stop words.
You might ask why I didn’t use it in the first place.
I wanted to show you an educational example with regular expressions, NLTK, and the string module so you can see what’s going on behind the scenes.
6. Visualize the most common words in news headlines
To visualize the most common words in news headlines, you can use the code block below.
df_bbc_news_title_most_used_words["news_title"] = df_bbc_news_title_most_used_words["news_title"].astype(int)
df_bbc_news_title_most_used_words["without_stop_word_and_punctation"] = df_bbc_news_title_most_used_words["without_stop_word_and_punctation"].astype(str)
df_bbc_news_title_most_used_words.index = df_bbc_news_title_most_used_words["without_stop_word_and_punctation"]
df_bbc_news_title_most_used_words["news_title"].head(20).plot(title="The Most Used Words in BBC News Titles")
News Ngrams Visualization
You will notice that the line is “broken”.
Do you remember “Ukraine” and “Ukraine:” in the data frame?
When we remove punctuation, these two values become the same.
That’s why the line chart shows Ukraine twice, once with 110 occurrences and once with 60.
To prevent such data discrepancies, use the code block below.
df_bbc_news_title_most_used_words_1 = df_bbc_news_title_most_used_words.drop_duplicates().groupby('without_stop_word_and_punctation', sort=False, as_index=True).sum()
df_bbc_news_title_most_used_words_1
| without_stop_word_and_punctation | news_title |
|---|---|
| Ukraine | 175 |
| v | 83 |
| de | 61 |
| da | 51 |
| и | 41 |
| ... | ... |
| ፊልም | 1 |
| ብላክ | 1 |
| ባንኪ | 1 |
| ጕሒላ | 1 |
| niile | 1 |

11109 rows × 1 columns
Duplicate rows are removed and their values are added together.
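The same merge can be sketched without Pandas. The example below uses only three of the counts shown earlier (“Ukraine”, “Ukraine:”, and “v”), so its total for Ukraine is smaller than the 175 in the full table, which also includes other variants.

```python
import string
from collections import Counter

# A toy subset of the word counts shown earlier in this step.
raw_counts = {"Ukraine": 110, "Ukraine:": 60, "v": 83}

# Translation table that deletes every ASCII punctuation character.
strip_table = str.maketrans("", "", string.punctuation)

merged = Counter()
for word, count in raw_counts.items():
    cleaned = word.translate(strip_table)
    if cleaned:  # skip tokens that were punctuation only
        # Variants that clean to the same word have their counts added together.
        merged[cleaned] += count

print(merged.most_common())
```

Grouping on the cleaned value and summing is what the “groupby” and “sum” calls do on the dataframe.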
Now, let’s visualize it again.
7. Extract the most popular N-Gram from news headlines
Extracting n-grams from news headlines, or normalizing URL words and forming n-grams from them, is useful for understanding overall topicality and which news publications are close to which topic.
import nltk
import unicodedata
import re

def text_clean(content):
    lemmetizer = nltk.stem.WordNetLemmatizer()
    stopwords = nltk.corpus.stopwords.words('english')
    # Normalize Unicode, drop non-ASCII characters, and lowercase the text
    content = (unicodedata.normalize('NFKD', content)
        .encode('ascii', 'ignore')
        .decode('utf-8', 'ignore')
        .lower())
    # Remove everything that is not a word character or whitespace
    words = re.sub(r'[^\w\s]', '', content).split()
    return [lemmetizer.lemmatize(word) for word in words if word not in stopwords]

raw_words = text_clean(''.join(str(df_bbc['news_title'].tolist())))
raw_words[:10]
OUTPUT>>> ['oneminute', 'world', 'news', 'best', 'generation', 'make', 'agyarkos', 'dream', 'fight', 'card']
The output shows that we have “lemmatized” all the words in the news headlines and put them in a list.
List comprehensions provide a quick shortcut to easily filter each stopword.
Use “nltk.corpus.stopwords.words(“english”)” to provide all stopwords in English.
But you can add additional stop words to the list to widen the exclusion of words.
“unicodedata” is for normalizing characters.
The characters we see are actually Unicode code points, e.g., “U+2160 ROMAN NUMERAL ONE” and the Latin character “U+0049 LATIN CAPITAL LETTER I” look the same but are different code points.
“unicodedata.normalize” removes such character differences so that the lemmatizer does not treat words with look-alike characters as different words.
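A short sketch makes the normalization step visible. The `to_ascii` helper is a hypothetical name that mirrors the normalize, encode, and decode chain used in “text_clean”.

```python
import unicodedata

roman_one = "\u2160"  # ROMAN NUMERAL ONE
latin_i = "\u0049"    # LATIN CAPITAL LETTER I

# The two characters look alike but compare as different strings.
print(roman_one == latin_i)

# NFKD applies compatibility decomposition, which maps the numeral to the letter.
normalized = unicodedata.normalize("NFKD", roman_one)
print(normalized == latin_i)


def to_ascii(text):
    """Normalize, then drop anything that cannot be represented in ASCII."""
    return (unicodedata.normalize("NFKD", text)
            .encode("ascii", "ignore")
            .decode("ascii"))


ascii_text = to_ascii("Café \u2160")
print(ascii_text)
```

One side effect worth keeping in mind: words written in non-Latin scripts are removed entirely by the ASCII step, so this cleaning suits English-language headlines best.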
import pandas as pd
from nltk import ngrams
pd.set_option("display.max_colwidth",90)
bbc_bigrams = (pd.Series(ngrams(raw_words, n = 2)).value_counts())[:15].sort_values(ascending=False).to_frame()
bbc_trigrams = (pd.Series(ngrams(raw_words, n = 3)).value_counts())[:15].sort_values(ascending=False).to_frame()
Below, you’ll see the most popular “n-grams” from BBC News.
NGrams data frame from BBC
To simply visualize the most popular n-grams in a newsfeed, use the code block below.
bbc_bigrams.plot.barh(color="red", width=.8,figsize=(10 , 7))
“Ukraine, war” is the top news.
You can also filter n-grams for “Ukraine” and create “entity-attribute” pairs.
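As a minimal sketch of that filtering idea, the code below builds bigrams in plain Python and keeps only the pairs that start with “ukraine”. The token list is invented for the example, and `make_ngrams` is a hypothetical stand-in for NLTK's “ngrams”.

```python
from collections import Counter


def make_ngrams(tokens, n):
    """Return all consecutive n-token tuples from a list of tokens."""
    return list(zip(*(tokens[i:] for i in range(n))))


# Hypothetical cleaned tokens, standing in for the output of text_clean().
tokens = [
    "ukraine", "war", "live", "updates",
    "ukraine", "war", "refugees",
    "ukraine", "talks",
]

bigram_counts = Counter(make_ngrams(tokens, 2))

# Keep the pairs whose first word is the entity of interest.
ukraine_pairs = {
    bigram: count
    for bigram, count in bigram_counts.items()
    if bigram[0] == "ukraine"
}
print(ukraine_pairs)
```

Because the headlines are flattened into a single token list, some n-grams span the boundary between two headlines. Counting per headline would avoid that, at the cost of a little more code.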
News sitemap Ngram from the BBC
Crawl these URLs and identify “person-type entities” to get an idea of how the BBC handles newsworthy situations.
But that goes beyond news sitemaps, so it is a topic for another day.
To visualize popular n-grams from a newsfeed’s sitemap, you can create a custom python function as shown below.
def ngram_visualize(dataframe:pd.DataFrame, color:str="blue") -> pd.DataFrame.plot:
    return dataframe.plot.barh(color=color, width=.8, figsize=(10, 7))

ngram_visualize(ngram_extractor(df_dailymail))
The results are as follows.
News Sitemap Trigram Visualization
To make it interactive, add an extra parameter as shown below.
def ngram_visualize(dataframe:pd.DataFrame, backend:str, color:str="blue") -> pd.DataFrame.plot:
    if backend=="plotly":
        pd.options.plotting.backend=backend
        return dataframe.plot.bar()
    else:
        return dataframe.plot.barh(color=color, width=.8, figsize=(10, 7))

ngram_visualize(ngram_extractor(df_dailymail), backend="plotly")
As a simple example, check below.
8. Create your own custom function to analyze newsfeed sitemaps
A small set of reusable Python functions is helpful when you audit news sitemaps repeatedly.
Below you can find four short Python functions that build on one another.
To clean up text content items, use the following function.
def text_clean(content):
    lemmetizer = nltk.stem.WordNetLemmatizer()
    stopwords = nltk.corpus.stopwords.words('english')
    # Normalize Unicode, drop non-ASCII characters, and lowercase the text
    content = (unicodedata.normalize('NFKD', content)
        .encode('ascii', 'ignore')
        .decode('utf-8', 'ignore')
        .lower())
    # Remove everything that is not a word character or whitespace
    words = re.sub(r'[^\w\s]', '', content).split()
    return [lemmetizer.lemmatize(word) for word in words if word not in stopwords]
To extract n-grams from the news headlines of a specific news website’s sitemap, use the following function.
def ngram_extractor(dataframe:pd.DataFrame|pd.Series):
    if "news_title" in dataframe.columns:
        return dataframe_ngram_extractor(dataframe, ngram=3, first=10)
Convert the extracted n-grams to a dataframe using the function below.
from nltk import ngrams

def dataframe_ngram_extractor(dataframe:pd.DataFrame|pd.Series, ngram:int, first:int):
    raw_words = text_clean(''.join(str(dataframe['news_title'].tolist())))
    return (pd.Series(ngrams(raw_words, n = ngram)).value_counts())[:first].sort_values(ascending=False).to_frame()
To compare the n-grams of multiple news sites side by side, use the function below.
def ngram_df_constructor(df_1:pd.DataFrame, df_2:pd.DataFrame):
    df_1_bigrams = dataframe_ngram_extractor(df_1, ngram=2, first=500)
    df_1_trigrams = dataframe_ngram_extractor(df_1, ngram=3, first=500)
    df_2_bigrams = dataframe_ngram_extractor(df_2, ngram=2, first=500)
    df_2_trigrams = dataframe_ngram_extractor(df_2, ngram=3, first=500)
    ngrams_df = {
        "df_1_bigrams":df_1_bigrams.index,
        "df_1_trigrams": df_1_trigrams.index,
        "df_2_bigrams":df_2_bigrams.index,
        "df_2_trigrams": df_2_trigrams.index,
    }
    # Column names are derived from each site's domain, which is expected to start with "www."
    dict_df = (pd.DataFrame({ key:pd.Series(value) for key, value in ngrams_df.items() }).reset_index(drop=True)
        .rename(columns={"df_1_bigrams":adv.url_to_df(df_1["loc"])["netloc"][1].split("www.")[1].split(".")[0] + "_bigrams",
            "df_1_trigrams":adv.url_to_df(df_1["loc"])["netloc"][1].split("www.")[1].split(".")[0] + "_trigrams",
            "df_2_bigrams": adv.url_to_df(df_2["loc"])["netloc"][1].split("www.")[1].split(".")[0] + "_bigrams",
            "df_2_trigrams": adv.url_to_df(df_2["loc"])["netloc"][1].split("www.")[1].split(".")[0] + "_trigrams"}))
    return dict_df
Below, you can see an example use case.
ngram_df_constructor(df_bbc, df_guardian)
Comparison of popular Ngrams to see the focus of news sites.
With just these four nested custom Python functions, you can do the following.
- You can easily visualize these n-grams and the number of news sites to check.
- You can see which news sites cover the same topic and which cover different topics.
- You can compare their wording or vocabulary on the same subject.
- You can compare how many different subtopics of the same topic or entity each site covers.
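The comparisons listed above can be sketched with plain set operations. The n-gram lists below are invented for illustration, and `compare_ngrams` is a hypothetical helper that is not part of the functions defined in this step.

```python
def compare_ngrams(ngrams_a, ngrams_b):
    """Split two n-gram lists into shared and source-specific n-grams."""
    set_a, set_b = set(ngrams_a), set(ngrams_b)
    return {
        "shared": sorted(set_a & set_b),
        "only_a": sorted(set_a - set_b),
        "only_b": sorted(set_b - set_a),
    }


# Hypothetical top bigrams for two news sources.
source_a = [("ukraine", "war"), ("peace", "talks"), ("energy", "prices")]
source_b = [("ukraine", "war"), ("energy", "prices"), ("premier", "league")]

comparison = compare_ngrams(source_a, source_b)
for group, items in comparison.items():
    print(group, items)
```

With real data, the inputs could be the index values of the bigram or trigram dataframes for each source.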
I didn’t include the n-gram frequencies in the table.
However, the first row holds the most popular n-gram for that particular news source.
To check the next 500 lines, click here.
9. Extract the most frequently used news keywords from the news sitemap
Speaking of news keywords, surprisingly, they are still in active use on Google.
For example, Microsoft Bing and Google no longer consider “meta keywords” a useful signal, unlike Yandex.
However, news keywords from news sitemaps are still used.
Of all these news sources, only The Guardian used news keywords.
It’s useful to know how they use news keywords to provide relevance.
df_guardian["news_keywords"].str.split().explode().value_counts().to_frame().rename(columns={"news_keywords":"news_keyword_occurence"})
You can see the most used words in The Guardian’s news keywords.
| | news_keyword_occurence |
|---|---|
| news, | 250 |
| World | 142 |
| and | 142 |
| Ukraine, | 127 |
| UK | 116 |
| ... | ... |
| Cumberbatch, | 1 |
| Dune | 1 |
| Saracens | 1 |
| Pearson, | 1 |
| Thailand | 1 |

1409 rows × 1 column
The visualization is as follows.
(df_guardian["news_keywords"].str.split().explode().value_counts()
.to_frame().rename(columns={"news_keywords":"news_keyword_occurence"})
.head(25).plot.barh(figsize=(10,8),
title="The Guardian Most Used Words in News Keywords", xlabel="News Keywords",
legend=False, ylabel="Count of News Keyword"))
Most Popular Words in News Keywords
The “,” at the end of a word marks the end of a news keyword; a word without it is part of a longer, multi-word keyword.
I recommend that you don’t remove “punctuation” or “stop words” from your news keywords to better understand their news keyword usage style.
For different parsing, you can use “,” as delimiter.
df_guardian["news_keywords"].str.split(",").explode().value_counts().to_frame().rename(columns={"news_keywords":"news_keyword_occurence"})
The results differ as follows.
| | news_keyword_occurence |
|---|---|
| World news | 134 |
| Europe | 116 |
| UK news | 111 |
| Sport | 109 |
| Russia | 90 |
| ... | ... |
| Women's shoes | 1 |
| Men's shoes | 1 |
| Body image | 1 |
| Kae Tempest | 1 |
| Thailand | 1 |

1080 rows × 1 column
Note the “split(",")” call in the code block below.
(df_guardian["news_keywords"].str.split(",").explode().value_counts()
.to_frame().rename(columns={"news_keywords":"news_keyword_occurence"})
.head(25).plot.barh(figsize=(10,8),
title="The Guardian Most Used Words in News Keywords", xlabel="News Keywords",
legend=False, ylabel="Count of News Keyword"))
You can see the difference in results visualized below.
Most Popular Keywords in News Sitemaps
From “Chelsea” to “Vladimir Putin” to “Ukraine War” and “Roman Abramovich”, most of these phrases coincide with the early days of Russia’s invasion of Ukraine.
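The difference between the two delimiters is easy to reproduce on a few values. The keyword strings below are hypothetical, and stripping whitespace after the comma split is an extra step that the one-liners in this step do not perform.

```python
from collections import Counter

# Hypothetical values of a "news_keywords" field, one string per article.
keyword_fields = [
    "World news, Europe, Ukraine",
    "UK news, Sport",
    "World news, Russia",
]

# Splitting on whitespace counts single words and keeps trailing commas.
by_whitespace = Counter(
    token for field in keyword_fields for token in field.split()
)

# Splitting on commas keeps multi-word keywords together.
by_comma = Counter(
    keyword.strip() for field in keyword_fields for keyword in field.split(",")
)

print(by_whitespace.most_common(2))
print(by_comma.most_common(2))
```

Whitespace splitting shows word-level vocabulary, while comma splitting shows the keywords as the publisher entered them.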
Use the code block below to interactively visualize news keywords for two different news website sitemaps.
from plotly.subplots import make_subplots
import plotly.graph_objects as go

df_1 = df_guardian["news_keywords"].str.split(",").explode().value_counts().to_frame().rename(columns={"news_keywords":"news_keyword_occurence"})
df_2 = df_nyt["news_keywords"].str.split(",").explode().value_counts().to_frame().rename(columns={"news_keywords":"news_keyword_occurence"})
fig = make_subplots(rows = 1, cols = 2)
# Slice both axes to the same top six keywords so labels and values line up
fig.add_trace(
    go.Bar(y = df_1["news_keyword_occurence"][:6].index, x = df_1["news_keyword_occurence"][:6], orientation="h", name="The Guardian News Keywords"), row=1, col=2
)
fig.add_trace(
    go.Bar(y = df_2["news_keyword_occurence"][:6].index, x = df_2["news_keyword_occurence"][:6], orientation="h", name="New York Times News Keywords"), row=1, col=1
)
fig.update_layout(height = 800, width = 1200, title_text="Side by Side Popular News Keywords")
fig.show()
fig.write_html("news_keywords.html")
You can see the results below.
To interact with the live graph, click here.
In the next section, you will find two different subplots to compare the n-grams of news sites.
10. Create a subplot for comparing news sources
Use the code block below to place the most popular headline n-grams from each news source into subplots.
import matplotlib.pyplot as plt
import pandas as pd

df1 = ngram_extractor(df_bbc)
df2 = ngram_extractor(df_skynews)
df3 = ngram_extractor(df_dailymail)
df4 = ngram_extractor(df_guardian)
df5 = ngram_extractor(df_nyt)
df6 = ngram_extractor(df_cnn)
nrow=3
ncol=2
df_list = [df1, df2, df3, df4, df5, df6]
titles = ["BBC News Trigrams", "Skynews Trigrams", "Dailymail Trigrams", "The Guardian Trigrams", "New York Times Trigrams", "CNN News Ngrams"]
fig, axes = plt.subplots(nrow, ncol, figsize=(25,32))
# One counter indexes both the dataframes and their titles
count=0
for r in range(nrow):
    for c in range(ncol):
        (df_list[count].plot.barh(ax = axes[r,c],
            figsize = (40, 28),
            title = titles[count],
            fontsize = 10,
            legend = False,
            xlabel = "Trigrams",
            ylabel = "Count"))
        count+=1
You can see the results below.
Most Popular Ngrams from News Sources
The example data visualization above is completely static and does not provide any interactivity.
Recently, Elias Dabbas, the creator of Advertools, shared a new script for getting article counts, n-grams, and their counts from news sources.
Check here for better, more detailed and interactive data dashboards.
The above example is from Elias Dabbas, who demonstrates how to interactively get the total number of articles, top words and n-grams from a news website.
Final Thoughts on News Sitemap Analysis with Python
This tutorial aims to provide an educational Python coding course to get keywords, n-grams, phrase patterns, languages, and other types of SEO related information from news websites.
News SEO relies heavily on quick response and always-on article creation.
Tracking your competitor’s perspective and method of covering topics can show how quickly your competitors are reacting to search trends.
It would be better to create a Google Trends Dashboard and News Feed Ngram Tracker for comparative and complementary news SEO analysis.
From time to time in this article, I include custom functions or advanced for loops, and sometimes, I keep it simple.
Beginners to advanced Python practitioners can benefit from this to improve their approach to tracking, reporting, and analytics for news SEO and beyond.
Featured Image: BestForBest/Shutterstock



