Wednesday, June 24, 2026

Visualizing Trending Topics Using Python to Analyze News Sitemaps


News sitemaps use a distinct sitemap protocol to give news search engines more information about each article.

The News Sitemap contains news published in the past 48 hours.

News sitemap tags include the news publication’s title, language, name, genre, publication date, keywords, and even a stock symbol.
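To make these tags concrete, here is a minimal sketch that reads one news sitemap entry with Python's standard library. The sample XML, the example.com URL, and the `parse_news_sitemap` helper are hypothetical and only for illustration; real sitemaps hold many `<url>` entries and may use additional tags.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample entry (an assumption for illustration only)
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
  <url>
    <loc>https://www.example.com/news/sample-article</loc>
    <news:news>
      <news:publication>
        <news:name>Example Times</news:name>
        <news:language>en</news:language>
      </news:publication>
      <news:publication_date>2022-03-01T08:00:00+00:00</news:publication_date>
      <news:title>Sample headline about a topic</news:title>
      <news:keywords>World news, Europe</news:keywords>
    </news:news>
  </url>
</urlset>"""

# Namespace prefixes used in the lookups below
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "news": "http://www.google.com/schemas/sitemap-news/0.9",
}

def parse_news_sitemap(xml_text):
    """Return one dict per <url> entry with a few common news tags."""
    root = ET.fromstring(xml_text)
    rows = []
    for url in root.findall("sm:url", NS):
        rows.append({
            "loc": url.findtext("sm:loc", namespaces=NS),
            "publication_name": url.findtext(
                "news:news/news:publication/news:name", namespaces=NS),
            "publication_language": url.findtext(
                "news:news/news:publication/news:language", namespaces=NS),
            "news_title": url.findtext("news:news/news:title", namespaces=NS),
            "news_keywords": url.findtext("news:news/news:keywords", namespaces=NS),
        })
    return rows

rows = parse_news_sitemap(SAMPLE)
print(rows[0]["news_title"])
```

The dictionary keys mirror the column names that appear later in this tutorial, such as "news_title" and "news_keywords".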

How do you leverage these sitemaps for content research and competitive analysis?

In this Python tutorial, you’ll learn a 10-step process for analyzing a news sitemap and visualizing the top trends found in it.

Housekeeping Considerations: Let’s Get Started

This tutorial was written during the Russian invasion of Ukraine.

Using machine learning, we can even label news sources and articles based on which news sources are “objective” and which ones are “sarcastic.”

But for simplicity, we will focus on topics with frequency analysis.

We will use 10+ global news sources from the US and the UK.

Note: We would like to include Russian news sources, but they don’t have a proper news sitemap. Even if they did, they would block external requests.

Comparing occurrences of the words “invasion” and “liberation” across news sources from both the West and the East shows the benefit of distributional frequency text analysis methods.
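As a rough illustration of that comparison, the sketch below counts whole-word occurrences of two terms per source. The headlines, the source names, and the `term_counts` helper are hypothetical; they are not data from this tutorial.

```python
import re
from collections import Counter

def term_counts(titles, terms):
    """Count case-insensitive, whole-word occurrences of each term."""
    counts = Counter()
    for term in terms:
        pattern = re.compile(r"\b{}\b".format(re.escape(term)), re.IGNORECASE)
        counts[term] = sum(len(pattern.findall(title)) for title in titles)
    return counts

# Hypothetical headlines grouped by a hypothetical source name
titles_by_source = {
    "source_a": ["Invasion enters second week", "Talks stall as invasion continues"],
    "source_b": ["Officials describe liberation of region", "Invasion claims disputed"],
}

comparison = {
    source: term_counts(titles, ["invasion", "liberation"])
    for source, titles in titles_by_source.items()
}
print(comparison)
```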

What does it take to analyze news content with Python?

Listed below are the Python libraries relevant to auditing a news sitemap for a news content strategy:

  • Advertools.
  • Pandas.
  • Plotly Express, subplots, and graph objects.
  • re (regex).
  • String.
  • NLTK (corpus, stopwords, ngrams).
  • Unicodedata.
  • Matplotlib.
  • Basic Python syntax understanding.

10 Steps to News Sitemap Analysis Using Python

All set up? Let’s start.

1. Get news URL from news sitemap

We selected The Guardian, New York Times, Washington Post, Daily Mail, Sky News, BBC, and CNN to examine news URLs in news sitemaps.

import advertools as adv  # provides sitemap_to_df
import pandas as pd

# Each call downloads one news sitemap and returns it as a pandas DataFrame
df_guardian = adv.sitemap_to_df("http://www.theguardian.com/sitemaps/news.xml")
df_nyt = adv.sitemap_to_df("https://www.nytimes.com/sitemaps/new/news.xml.gz")
df_wp = adv.sitemap_to_df("https://www.washingtonpost.com/arcio/news-sitemap/")
df_bbc = adv.sitemap_to_df("https://www.bbc.com/sitemaps/https-index-com-news.xml")
df_dailymail = adv.sitemap_to_df("https://www.dailymail.co.uk/google-news-sitemap.xml")
df_skynews = adv.sitemap_to_df("https://news.sky.com/sitemap-index.xml")
df_cnn = adv.sitemap_to_df("https://edition.cnn.com/sitemaps/cnn/news.xml")

2. Inspect a sample news sitemap using Python

I’ll use the BBC as an example to show what we’ve just pulled from these news sitemaps.

df_bbc
News sitemap data frame view

The BBC sitemap has the following columns.

df_bbc.columns
News sitemap tags as data frame columns

The general data structure of these columns is as follows.

df_bbc.info()
News sitemap columns and data types

The BBC does not use the “news_publication” column, among others.

3. Find the most common words from the URL of a news publication

To see the most common words in news site URLs, we need to use the “str”, “explode” and “split” methods.

df_dailymail["loc"].str.split("/").str[5].str.split("-").explode().value_counts().to_frame()
                loc
article         176
Russian          50
Ukraine          50
says             38
reveals          38
...             ...
readers           1
Red               1
Cross             1
provide           1
weekend.html      1

5445 rows × 1 column

We see that for the Daily Mail, “Russia and Ukraine” is the main topic.
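If you want to see the same idea without Pandas, here is a minimal sketch that pulls the words out of a URL slug and counts them. The URLs and the `slug_words` helper are hypothetical. Unlike the one-liner above, this version takes the last path segment instead of a fixed position and strips the file extension, which is my own assumption about what is usually wanted.

```python
from collections import Counter
from urllib.parse import urlparse

def slug_words(url):
    """Split the last path segment of a URL into its hyphen-separated words."""
    path = urlparse(url).path
    slug = path.rstrip("/").rsplit("/", 1)[-1]
    # Drop a trailing extension such as ".html" if there is one
    if "." in slug:
        slug = slug.rsplit(".", 1)[0]
    return [word for word in slug.split("-") if word]

# Hypothetical article URLs
urls = [
    "https://www.example.com/news/article-1/Ukraine-talks-resume.html",
    "https://www.example.com/news/article-2/Russian-forces-near-Ukraine-border.html",
]

counts = Counter(word for url in urls for word in slug_words(url))
print(counts.most_common(3))
```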

4. Find the most commonly used languages in news publications

The URL structure or the “language” section of a news publication can be used to see the most used languages in it.

In this example, we use the BBC to see its language prioritization.

df_bbc["publication_language"].value_counts().head(20).to_frame()
      publication_language
en                     698
fa                      52
sr                      52
ar                      47
mr                      43
hi                      43
gu                      41
ur                      35
pt                      33
te                      31
ta                      31
cy                      30
ha                      29
tr                      28
es                      25
sw                      22
cpe                     22
ne                      21
pa                      21
yo                      20

20 rows × 1 column

To reach the Russian population via Google News, every Western news source should publish in Russian.

Some international news institutions have started to adopt this perspective.

If you are a news SEO, it is helpful to watch your competitors’ Russian-language publications, both to distribute objective news to Russia and to compete within the news industry.

5. Review the word frequency of news headlines

We use the BBC to see “news headlines” and which words are more frequent.

df_bbc["news_title"].str.split(" ").explode().value_counts().to_frame()
        news_title
to             232
in             181
-              141
of             140
for            138
...            ...
ፊልም              1
ብላክ              1
ባንኪ              1
ጕሒላ              1
niile            1

11916 rows × 1 column

The problem here is that the news titles contain every type of word, including contextless stop words.

We need to clean out these non-categorical terms to better understand what the headlines focus on.

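from nltk.corpus import stopwords  # run nltk.download("stopwords") once beforehand
stop = stopwords.words('english')
# Name the count column explicitly so it is "news_title" on every pandas version
df_bbc_news_title_most_used_words = df_bbc["news_title"].str.split(" ").explode().value_counts().to_frame(name="news_title")
# \b anchors the match to whole words, so "to" is removed but "tomorrow" is not
pat = r'\b(?:{})\b'.format('|'.join(stop))
# Copy the words from the index into a regular column so they can be cleaned
df_bbc_news_title_most_used_words["words"] = df_bbc_news_title_most_used_words.index
df_bbc_news_title_most_used_words["without_stop_words"] = df_bbc_news_title_most_used_words["words"].str.replace(pat, "", regex=True)
df_bbc_news_title_most_used_words.drop(df_bbc_news_title_most_used_words.loc[df_bbc_news_title_most_used_words["without_stop_words"]==""].index, inplace=True)
df_bbc_news_title_most_used_words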
The “without_stop_words” column contains the cleaned text values.

With the help of “regex” and the Pandas “replace” method, we removed most of the stop words.
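To see why the word boundaries in the pattern matter, here is a small sketch using only the standard library. The short stop word list and the sample tokens are hypothetical stand-ins for the NLTK list and the BBC headlines.

```python
import re

# Hypothetical, shortened stop word list (the tutorial uses NLTK's English list)
stop = ["to", "in", "of", "for", "the"]

# \b ... \b only matches a stop word when it stands alone as a whole word
pat = re.compile(r"\b(?:{})\b".format("|".join(map(re.escape, stop))))

tokens = ["to", "Ukraine", "in", "tomorrow", "of", "forecast"]
cleaned = [pat.sub("", token) for token in tokens]
kept = [token for token in cleaned if token != ""]
print(kept)

# Without the word boundaries, stop words are also cut out of longer words
naive = re.sub("|".join(stop), "", "tomorrow")
print(naive)
```

With the boundaries in place, “tomorrow” and “forecast” survive intact; without them, “tomorrow” loses its leading “to”.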

The second problem is removing “punctuation”.

For this, we’ll use Python’s “string” module.

import string
# regex=True is required on recent pandas versions for pattern-based replacement
df_bbc_news_title_most_used_words["without_stop_word_and_punctation"] = df_bbc_news_title_most_used_words['without_stop_words'].str.replace('[{}]'.format(string.punctuation), '', regex=True)
df_bbc_news_title_most_used_words.drop(df_bbc_news_title_most_used_words.loc[df_bbc_news_title_most_used_words["without_stop_word_and_punctation"]==""].index, inplace=True)
df_bbc_news_title_most_used_words.drop(["without_stop_words", "words"], axis=1, inplace=True)
df_bbc_news_title_most_used_words
          news_title  without_stop_word_and_punctation
Ukraine          110  Ukraine
v                 83  v
de                61  de
Ukraine:          60  Ukraine
da                51  da
...              ...  ...
ፊልም                1  ፊልም
ብላክ                1  ብላክ
ባንኪ                1  ባንኪ
ጕሒላ                1  ጕሒላ
niile              1  niile

11767 rows × 2 columns

Alternatively, use df_bbc_news_title_most_used_words["news_title"].to_frame() to get a clearer picture of the data.

          news_title
Ukraine          110
v                 83
de                61
Ukraine:          60
da                51
...              ...
ፊልም                1
ብላክ                1
ባንኪ                1
ጕሒላ                1
niile              1

11767 rows × 1 column

We see 11,767 unique words in the BBC’s news titles, with “Ukraine” being the most popular, appearing 110 times.

There are different Ukraine-related phrases in the dataframe, such as “Ukraine:”.

“NLTK Tokenize” can be used to unify different variants of these types.

The next section will combine them using a different approach.

Note: If you want to make things easier, use Advertools as shown below.

# adv.stopwords is a dictionary keyed by language name, so select the English set
adv.word_frequency(df_bbc["news_title"], phrase_len=2, rm_words=adv.stopwords["english"])

The results are as follows.

Text analysis with Advertools

“adv.word_frequency” has the parameters “phrase_len” and “rm_words” to set the length of the counted phrases and to remove stop words.

You might ask why I didn’t use it in the first place.

I want to show you an educational example on “Regular Expressions, NLTK, and Strings” so you can see what’s going on behind the scenes.

6. Visualize the most common words in news headlines

To visualize the most common words in news headlines, you can use the code block below.

df_bbc_news_title_most_used_words["news_title"] = df_bbc_news_title_most_used_words["news_title"].astype(int)
df_bbc_news_title_most_used_words["without_stop_word_and_punctation"] = df_bbc_news_title_most_used_words["without_stop_word_and_punctation"].astype(str)
df_bbc_news_title_most_used_words.index = df_bbc_news_title_most_used_words["without_stop_word_and_punctation"]
df_bbc_news_title_most_used_words["news_title"].head(20).plot(title="The Most Used Words in BBC News Titles")
News n-grams visualization

You will notice that there is a “broken line”.

Do you remember “Ukraine” and “Ukraine:” in the data frame?

When we remove the “punctuation”, the second and first values become the same.

That’s why the line chart shows that Ukraine appears 60 times and 110 times, respectively.

To prevent such data discrepancies, use the code block below.

df_bbc_news_title_most_used_words_1 = df_bbc_news_title_most_used_words.drop_duplicates().groupby('without_stop_word_and_punctation', sort=False, as_index=True).sum()
df_bbc_news_title_most_used_words_1
                                  news_title
without_stop_word_and_punctation
Ukraine                                  175
v                                         83
de                                        61
da                                        51
и                                         41
...                                      ...
ፊልም                                        1
ብላክ                                        1
ባንኪ                                        1
ጕሒላ                                        1
niile                                      1

11109 rows × 1 column

Duplicate rows are removed and their values are added together.
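The same merge can be sketched in plain Python to show what happens to the variants. The counts below are only the five top rows from the output shown earlier, so the merged total here differs from the total in the full data frame; the variable names are my own.

```python
import string
from collections import Counter

# Top rows from the earlier output: word -> count
raw_counts = {"Ukraine": 110, "v": 83, "de": 61, "Ukraine:": 60, "da": 51}

# Translation table that deletes every ASCII punctuation character
strip_punct = str.maketrans("", "", string.punctuation)

merged = Counter()
for word, count in raw_counts.items():
    cleaned = word.translate(strip_punct)
    if cleaned:  # skip tokens that were punctuation only
        merged[cleaned] += count

print(merged.most_common(2))
```

Because “Ukraine” and “Ukraine:” now share one key, their counts are summed instead of showing up as two separate points in the chart.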

Now, let’s visualize it again.

7. Extract the most popular N-Grams from news headlines

Extracting n-grams from news headlines, or normalizing the URL words and forming n-grams from them, helps you understand overall topicality and see which news publication approaches which topic. Here’s how.

import nltk
import unicodedata
import re
def text_clean(content):
  lemmetizer = nltk.stem.WordNetLemmatizer()
  stopwords = nltk.corpus.stopwords.words('english')
  # Decompose Unicode characters, then drop everything that is not ASCII
  content = (unicodedata.normalize('NFKD', content)
    .encode('ascii', 'ignore')
    .decode('utf-8', 'ignore')
    .lower())
  # Remove every character that is neither a word character nor whitespace
  words = re.sub(r'[^\w\s]', '', content).split()
  return [lemmetizer.lemmatize(word) for word in words if word not in stopwords]
raw_words = text_clean(''.join(str(df_bbc['news_title'].tolist())))
raw_words[:10]
OUTPUT>>>
['oneminute', 'world', 'news', 'best', 'generation', 'make', 'agyarkos', 'dream', 'fight', 'card']

The output shows that we have “lemmatized” all the words in the news headlines and put them in a list.

List comprehensions provide a quick shortcut to easily filter each stopword.

Use nltk.corpus.stopwords.words("english") to get all the stop words in English.

But you can add additional stop words to the list to widen the exclusion of words.

“unicodedata” is for normalizing characters.

The characters that we see are actually Unicode code points; e.g., “U+2160 ROMAN NUMERAL ONE” and the Latin letter “U+0049 LATIN CAPITAL LETTER I” look the same but are different characters.

“unicodedata.normalize” resolves these character differences so that the lemmatizer treats words written with look-alike characters consistently.
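Here is a small sketch of that normalization step in isolation, using the same NFKD, encode, and decode chain as “text_clean”. The `to_ascii` helper name and the sample strings are my own.

```python
import unicodedata

def to_ascii(text):
    """Apply the same normalization chain that text_clean uses."""
    return (unicodedata.normalize("NFKD", text)
            .encode("ascii", "ignore")
            .decode("utf-8", "ignore")
            .lower())

roman_one = "\u2160"   # ROMAN NUMERAL ONE
latin_i = "\u0049"     # LATIN CAPITAL LETTER I

# NFKD maps the Roman numeral to the plain Latin letter
print(unicodedata.normalize("NFKD", roman_one) == latin_i)

# Accents are split off and then dropped by the ASCII encoding step
print(to_ascii("Caf\u00e9 \u2160"))

# Characters with no ASCII equivalent disappear completely
print(repr(to_ascii("\u134a\u120d\u121d")))
```

Note the side effect in the last line: headlines written in non-Latin scripts, such as the Ethiopic words in the BBC output above, are removed entirely by this cleaning step.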

import pandas as pd
from nltk.util import ngrams

pd.set_option("display.max_colwidth",90)
# raw_words is the cleaned word list returned by text_clean above
bbc_bigrams = (pd.Series(ngrams(raw_words, n = 2)).value_counts())[:15].sort_values(ascending=False).to_frame()
bbc_trigrams = (pd.Series(ngrams(raw_words, n = 3)).value_counts())[:15].sort_values(ascending=False).to_frame()

Below, you’ll see the most popular “n-grams” from BBC News.

N-grams data frame from the BBC
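If you are curious what an n-gram extractor does behind the scenes, the sketch below builds the same kind of sliding word windows with plain Python and counts them. The `simple_ngrams` helper and the word list are hypothetical; the tutorial itself relies on NLTK’s “ngrams”.

```python
from collections import Counter

def simple_ngrams(words, n):
    """Return every run of n consecutive words as a tuple."""
    return list(zip(*(words[i:] for i in range(n))))

# Hypothetical cleaned words, in headline order
words = ["ukraine", "war", "latest", "ukraine", "war", "refugee"]

bigrams = Counter(simple_ngrams(words, 2))
trigrams = Counter(simple_ngrams(words, 3))

print(bigrams.most_common(1))
print(len(trigrams))
```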

To simply visualize the most popular n-grams in a newsfeed, use the code block below.

bbc_bigrams.plot.barh(color="red", width=.8,figsize=(10 , 7))

“Ukraine, war” is the top news.

You can also filter n-grams for “Ukraine” and create “entity-attribute” pairs.

News sitemap n-grams from the BBC
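Filtering n-grams for an entity such as “Ukraine” could look like the sketch below. The bigram counts and the `attributes_for` helper are hypothetical; the idea is simply to keep the n-grams that contain the entity and collect the words that appear next to it.

```python
from collections import Counter

# Hypothetical bigram counts
bigram_counts = Counter({
    ("ukraine", "war"): 5,
    ("war", "crime"): 2,
    ("ukraine", "refugee"): 3,
    ("premier", "league"): 4,
})

def attributes_for(entity, counts):
    """Sum the counts of words that share an n-gram with the entity."""
    pairs = Counter()
    for gram, count in counts.items():
        if entity in gram:
            for word in gram:
                if word != entity:
                    pairs[word] += count
    return pairs

ukraine_attributes = attributes_for("ukraine", bigram_counts)
print(ukraine_attributes.most_common())
```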

Crawl these URLs and identify “person-type entities” to give you an idea of how the BBC handles newsworthy situations.

But that goes beyond “news sitemaps”, so it’s a topic for another day.

To visualize popular n-grams from a news source’s sitemap, you can create a custom Python function as shown below.

def ngram_visualize(dataframe:pd.DataFrame, color:str="blue") -> pd.DataFrame.plot:
     # Return the plot so the caller can style it further
     return dataframe.plot.barh(color=color, width=.8,figsize=(10 ,7))
# ngram_extractor is defined in step 8 below
ngram_visualize(ngram_extractor(df_dailymail))

The results are as follows.

News sitemap trigram visualization

To make it interactive, add an extra parameter as shown below.

def ngram_visualize(dataframe:pd.DataFrame, backend:str, color:str="blue", ) -> pd.DataFrame.plot:
     if backend=="plotly":
          # Switches the pandas plotting backend globally
          pd.options.plotting.backend=backend
          return dataframe.plot.bar()
     else:
          return dataframe.plot.barh(color=color, width=.8,figsize=(10 ,7))
ngram_visualize(ngram_extractor(df_dailymail), backend="plotly")

As a simple example, check below.

8. Create your own custom function to analyze newsfeed sitemaps

When you audit news sitemaps repeatedly, you will need a small Python package.

Below, you can find four quick Python functions chained together, each using the previous ones as callbacks.

To clean up text content items, use the following function.

def text_clean(content):
  lemmetizer = nltk.stem.WordNetLemmatizer()
  stopwords = nltk.corpus.stopwords.words('english')
  # Decompose Unicode characters, then drop everything that is not ASCII
  content = (unicodedata.normalize('NFKD', content)
    .encode('ascii', 'ignore')
    .decode('utf-8', 'ignore')
    .lower())
  # Remove every character that is neither a word character nor whitespace
  words = re.sub(r'[^\w\s]', '', content).split()
  return [lemmetizer.lemmatize(word) for word in words if word not in stopwords]

To extract n-grams from the news headlines of a specific news website’s sitemap, use the following function.

def ngram_extractor(dataframe:pd.DataFrame|pd.Series):
     # Only data frames that contain a "news_title" column are processed
     if "news_title" in dataframe.columns:
          return dataframe_ngram_extractor(dataframe,  ngram=3, first=10)

Convert the extracted n-grams to a dataframe using the function below.

def dataframe_ngram_extractor(dataframe:pd.DataFrame|pd.Series, ngram:int, first:int):
     raw_words = text_clean(''.join(str(dataframe['news_title'].tolist())))
     return (pd.Series(ngrams(raw_words, n = ngram)).value_counts())[:first].sort_values(ascending=False).to_frame()

To compare the n-grams extracted from the sitemaps of multiple news websites, use the function below.

def ngram_df_constructor(df_1:pd.DataFrame, df_2:pd.DataFrame):
  df_1_bigrams = dataframe_ngram_extractor(df_1, ngram=2, first=500)
  df_1_trigrams = dataframe_ngram_extractor(df_1, ngram=3, first=500)
  df_2_bigrams = dataframe_ngram_extractor(df_2, ngram=2, first=500)
  df_2_trigrams = dataframe_ngram_extractor(df_2, ngram=3, first=500)
  ngrams_df = {
  "df_1_bigrams":df_1_bigrams.index,
  "df_1_trigrams": df_1_trigrams.index,
  "df_2_bigrams":df_2_bigrams.index,
  "df_2_trigrams": df_2_trigrams.index,
  }
  # Column names are derived from each site's domain, e.g. "bbc_bigrams"
  dict_df = (pd.DataFrame({ key:pd.Series(value) for key, value in ngrams_df.items() }).reset_index(drop=True)
  .rename(columns={"df_1_bigrams":adv.url_to_df(df_1["loc"])["netloc"][1].split("www.")[1].split(".")[0] + "_bigrams",
                    "df_1_trigrams":adv.url_to_df(df_1["loc"])["netloc"][1].split("www.")[1].split(".")[0] + "_trigrams",
                    "df_2_bigrams": adv.url_to_df(df_2["loc"])["netloc"][1].split("www.")[1].split(".")[0] + "_bigrams",
                    "df_2_trigrams": adv.url_to_df(df_2["loc"])["netloc"][1].split("www.")[1].split(".")[0] + "_trigrams"}))
  return dict_df

Below, you can see an example use case.

ngram_df_constructor(df_bbc, df_guardian)
Comparison of popular n-grams to see the focus of news sites.

With only these four nested custom Python functions, you can do the following.

  • Easily visualize these n-grams and the number of news sites to check.
  • See whether news sites focus on the same topic or on different topics.
  • Compare their wording or vocabulary on the same subject.
  • See how many different subtopics of the same topic or entity each site covers, side by side.

I didn’t include the numbers for the n-gram frequencies.

However, the first-ranked n-grams are the most popular ones for that particular news source.
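Once you have the top n-grams of two sources, a quick way to see shared and distinct topics is a set comparison. This is a minimal sketch with hypothetical trigram lists and a hypothetical `compare_top_ngrams` helper; it is not part of the four functions above.

```python
def compare_top_ngrams(ngrams_a, ngrams_b):
    """Split two n-gram lists into shared and source-specific n-grams."""
    set_a, set_b = set(ngrams_a), set(ngrams_b)
    return {
        "shared": sorted(set_a & set_b),
        "only_a": sorted(set_a - set_b),
        "only_b": sorted(set_b - set_a),
    }

# Hypothetical top trigrams for two news sources
source_a = [("russia", "ukraine", "war"), ("cost", "living", "crisis")]
source_b = [("russia", "ukraine", "war"), ("premier", "league", "title")]

result = compare_top_ngrams(source_a, source_b)
print(result["shared"])
```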

To check the next 500 lines, click here.

9. Extract the most frequently used news keywords from the news sitemap

Speaking of news keywords, surprisingly, they are still very active on Google.

For example, Microsoft Bing and Google no longer consider “meta keywords” a useful signal, unlike Yandex.

However, news keywords from the news sitemap are still used.

Of all these news sources, only The Guardian used news keywords.

It’s useful to know how they use news keywords to provide relevance.

df_guardian["news_keywords"].str.split().explode().value_counts().to_frame().rename(columns={"news_keywords":"news_keyword_occurence"})

You can see the most used words in The Guardian’s news keywords.

              news_keyword_occurence
news,                            250
World                            142
and                              142
Ukraine,                         127
UK                               116
...                              ...
Cumberbatch,                       1
Dune                               1
Saracens                           1
Pearson,                           1
Thailand                           1

1409 rows × 1 column

The visualization is as follows.

(df_guardian["news_keywords"].str.split().explode().value_counts()
.to_frame().rename(columns={"news_keywords":"news_keyword_occurence"})
.head(25).plot.barh(figsize=(10,8),
title="The Guardian Most Used Words in News Keywords", xlabel="News Keywords",
legend=False, ylabel="Count of News Keyword"))

Most Popular Words in News Keywords

The “,” at the end of a news keyword indicates whether it is a separate value or part of another value.
I recommend not removing “punctuation” or “stop words” from news keywords, so that you can better see the publisher’s news keyword usage style.

For a different analysis, you can use “,” as the separator.

df_guardian["news_keywords"].str.split(",").explode().value_counts().to_frame().rename(columns={"news_keywords":"news_keyword_occurence"})

The results differ as follows.

               news_keyword_occurence
World news                        134
Europe                            116
UK news                           111
Sport                             109
Russia                             90
...                               ...
Women's shoes                       1
Men's shoes                         1
Body image                          1
Kae Tempest                         1
Thailand                            1

1080 rows × 1 column

Focus on the split(",") difference.
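The sketch below shows that difference on two hypothetical keyword strings, using only the standard library. Note that I strip the whitespace around each comma-separated value, which the Pandas one-liner above does not do; that is my own addition.

```python
from collections import Counter

# Hypothetical news_keywords values
keywords = ["World news, Ukraine, Europe", "UK news, Ukraine"]

# Whitespace split: multi-word keywords break apart and commas stay attached
by_space = Counter(word for value in keywords for word in value.split())

# Comma split: each keyword phrase stays intact
by_comma = Counter(
    part.strip() for value in keywords for part in value.split(",") if part.strip()
)

print(by_space)
print(by_comma)
```

With the whitespace split, “Ukraine,” and “Ukraine” are counted as two different tokens; with the comma split, both occurrences land on the same “Ukraine” key.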

(df_guardian["news_keywords"].str.split(",").explode().value_counts()
.to_frame().rename(columns={"news_keywords":"news_keyword_occurence"})
.head(25).plot.barh(figsize=(10,8),
title="The Guardian Most Used Words in News Keywords", xlabel="News Keywords",
legend=False, ylabel="Count of News Keyword"))

You can see the difference in results visualized below.

Most Popular Keywords in News Sitemaps

From “Chelsea” to “Vladimir Putin” to “Ukraine War” and “Roman Abramovich”, most of these phrases coincide with the early days of Russia’s invasion of Ukraine.

Use the code block below to interactively visualize news keywords for two different news website sitemaps.

import plotly.graph_objects as go
from plotly.subplots import make_subplots

df_1 = df_guardian["news_keywords"].str.split(",").explode().value_counts().to_frame().rename(columns={"news_keywords":"news_keyword_occurence"})
df_2 = df_nyt["news_keywords"].str.split(",").explode().value_counts().to_frame().rename(columns={"news_keywords":"news_keyword_occurence"})
fig = make_subplots(rows = 1, cols = 2)
# Slice x and y to the same top six rows so labels and values line up
fig.add_trace(
     go.Bar(y = df_1["news_keyword_occurence"][:6].index, x = df_1["news_keyword_occurence"][:6], orientation="h", name="The Guardian News Keywords"), row=1, col=2
)
fig.add_trace(
     go.Bar(y = df_2["news_keyword_occurence"][:6].index, x = df_2["news_keyword_occurence"][:6], orientation="h", name="New York Times News Keywords"), row=1, col=1
)
fig.update_layout(height = 800, width = 1200, title_text="Side by Side Popular News Keywords")
fig.show()
fig.write_html("news_keywords.html")

You can see the results below.

To interact with the live graph, click here.

In the next section, you will find two different subplots to compare the n-grams of news sites.

10. Create a subplot for comparing news sources

Use the code block below to put the news sources’ most popular n-grams from the news titles into subplots.

import matplotlib.pyplot as plt
import pandas as pd

df1 = ngram_extractor(df_bbc)
df2 = ngram_extractor(df_skynews)
df3 = ngram_extractor(df_dailymail)
df4 = ngram_extractor(df_guardian)
df5 = ngram_extractor(df_nyt)
df6 = ngram_extractor(df_cnn)
nrow=3
ncol=2
df_list = [df1 ,df2, df3, df4, df5, df6]
titles = ["BBC News Trigrams", "Skynews Trigrams", "Dailymail Trigrams", "The Guardian Trigrams", "New York Times Trigrams", "CNN News Ngrams"]
fig, axes = plt.subplots(nrow, ncol, figsize=(25,32))
# count walks through df_list and titles together, one subplot per source
count=0
for r in range(nrow):
    for c in range(ncol):
        (df_list[count].plot.barh(ax = axes[r,c],
        figsize = (40, 28),
        title = titles[count],
        fontsize = 10,
        legend = False,
        xlabel = "Trigrams",
        ylabel = "Count"))
        count+=1

You can see the results below.

Most popular n-grams from news sources

The example data visualization above is completely static and does not provide any interactivity.

Recently, Elias Dabbas, the creator of Advertools, shared a new script for getting the article counts, n-grams, and their counts from news sources.

Check here for a better, more detailed, and interactive data dashboard.

The above example is from Elias Dabbas, who demonstrates how to interactively get the total number of articles, top words and n-grams from a news website.

Final Thoughts on News Sitemap Analysis with Python

This tutorial is designed as an educational Python coding session for extracting keywords, n-grams, phrase patterns, languages, and other kinds of SEO-related information from news websites.

News SEO relies heavily on quick reflexes and always-on article creation.

Tracking your competitors’ angles and methods for covering a topic shows how quickly they react to search trends.

Creating a Google Trends dashboard and a news source n-gram tracker for a comparative and complementary news SEO analysis would be even better.

From time to time in this article, I include custom functions or advanced for loops, and sometimes, I keep it simple.

Beginners to advanced Python practitioners can benefit from this to improve their approach to tracking, reporting, and analytics for news SEO and beyond.


Featured Image: BestForBest/Shutterstock




