Visualizing Trending Topics Using Python to Analyze News Sitemaps

News sitemaps use a dedicated sitemap protocol to give news search engines more information.

The News Sitemap contains news published in the past 48 hours.

News sitemap tags include the news publication’s title, language, name, genre, publication date, keywords, and even a stock symbol.

How do you leverage these sitemaps for content research and competitive analysis?

In this Python tutorial, you’ll learn a 10-step process for analyzing a news sitemap and visualizing the top trends found in it.

Housekeeping Notes Before We Get Started

This tutorial was written during the Russian invasion of Ukraine.

Using machine learning, we can even label news sources and articles based on which news sources are “objective” and which ones are “sarcastic.”

But to keep things simple, we will focus on topics using frequency analysis.

We will use 10+ global news sources from the US and UK.

Note: We would have liked to include Russian news sources, but they don’t have a proper news sitemap. Even if they did, they block external requests.

Comparing occurrences of the words “invasion” and “liberation” in news sources from both the West and the East shows the benefit of distributional frequency text analysis methods.
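As a rough illustration of that comparison, the sketch below counts how often two terms appear in headlines grouped by source. The source names and headlines are made-up placeholders, not data from any real sitemap, and the helper name term_counts is only an assumption for this example.

```python
import re
from collections import Counter

# Hypothetical headlines grouped by a placeholder source name (not real data).
headlines = {
    "source_a": ["Invasion enters second week", "Talks stall as invasion continues"],
    "source_b": ["Liberation effort announced", "Officials discuss invasion claims"],
}

def term_counts(titles, terms):
    """Count case-insensitive whole-word occurrences of each term."""
    words = Counter(re.findall(r"\w+", " ".join(titles).lower()))
    return {term: words[term] for term in terms}

comparison = {name: term_counts(titles, ["invasion", "liberation"])
              for name, titles in headlines.items()}
print(comparison)
```

With real data, the headline lists would come from the “news_title” column of each sitemap data frame.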

What does it take to analyze news content with Python

Listed below are the Python libraries and skills needed to review news sitemaps for a news content strategy:

Advertools.
Pandas.
Plotly Express, subplots and graph objects.
re (regex).
string.
NLTK (corpus, stopwords, ngrams).
unicodedata.
Matplotlib.
Basic Python syntax understanding.

10 Steps to News Sitemap Analysis Using Python

All set up? Let’s start.

1. Get news URL from news sitemap

We selected The Guardian, New York Times, Washington Post, Daily Mail, Sky News, BBC, and CNN to examine news URLs in news sitemaps.

import advertools as adv

# Each call downloads a news sitemap and returns it as a pandas DataFrame
df_guardian = adv.sitemap_to_df("http://www.theguardian.com/sitemaps/news.xml")
df_nyt = adv.sitemap_to_df("https://www.nytimes.com/sitemaps/new/news.xml.gz")
df_wp = adv.sitemap_to_df("https://www.washingtonpost.com/arcio/news-sitemap/")
df_bbc = adv.sitemap_to_df("https://www.bbc.com/sitemaps/https-index-com-news.xml")
df_dailymail = adv.sitemap_to_df("https://www.dailymail.co.uk/google-news-sitemap.xml")
df_skynews = adv.sitemap_to_df("https://news.sky.com/sitemap-index.xml")
df_cnn = adv.sitemap_to_df("https://edition.cnn.com/sitemaps/cnn/news.xml")

2. Inspect a sample news sitemap using Python

I’ll use the BBC as an example to show what we’ve just pulled from these news sitemaps.

df_bbc

News sitemap data frame view

The BBC sitemap has the following columns.

df_bbc.columns

News sitemap labels as dataframe columns

The general data structure of these columns is as follows.

df_bbc.info()

News Sitemap Columns and Data Types

The BBC does not fill in the “news_publication” column, among others.
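To see where such columns come from, here is a minimal standard-library sketch that parses a tiny hand-written news sitemap. The sample URL and values are placeholders, and the dictionary keys only mimic the column style seen above; this is not how Advertools is implemented.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written sitemap; the URL and values are placeholders.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
  <url>
    <loc>https://www.example.com/news/sample-story</loc>
    <news:news>
      <news:publication>
        <news:name>Example News</news:name>
        <news:language>en</news:language>
      </news:publication>
      <news:publication_date>2022-03-01T10:00:00Z</news:publication_date>
      <news:title>Sample story title</news:title>
    </news:news>
  </url>
</urlset>"""

# Namespace prefixes used in the search paths below.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "news": "http://www.google.com/schemas/sitemap-news/0.9",
}

def sitemap_rows(xml_text):
    """Return one dict per <url>, flattening the news tags into keys."""
    rows = []
    for url in ET.fromstring(xml_text).findall("sm:url", NS):
        rows.append({
            "loc": url.findtext("sm:loc", namespaces=NS),
            "publication_name": url.findtext(".//news:name", namespaces=NS),
            "publication_language": url.findtext(".//news:language", namespaces=NS),
            "news_publication_date": url.findtext(".//news:publication_date", namespaces=NS),
            "news_title": url.findtext(".//news:title", namespaces=NS),
        })
    return rows

rows = sitemap_rows(SAMPLE)
print(rows[0])
```

A tag that a publisher leaves out simply comes back as None, which is why some columns can be empty for a given site.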

3. Find the most common words from the URL of a news publication

To see the most common words in news site URLs, we need to use the “str”, “explode” and “split” methods.

df_dailymail["loc"].str.split("/").str[5].str.split("-").explode().value_counts().to_frame()

loc
article	176
Russian	50
Ukraine	50
says	38
reveals	38
...	...
readers	1
Red	1
Cross	1
provide	1
weekend.html	1

5445 rows × 1 column

We see that for the Daily Mail, “Russia and Ukraine” is the main topic.
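The same split-and-count idea can be sketched without Pandas. The URLs below are placeholders that assume a “/section/article-id/slug.html” layout, and the function name slug_word_counts is hypothetical.

```python
from collections import Counter

# Placeholder URLs that mimic a "/section/article-id/slug.html" layout (an assumption).
urls = [
    "https://www.example.com/news/article-1/Ukraine-talks-resume.html",
    "https://www.example.com/news/article-2/Ukraine-aid-arrives.html",
]

def slug_word_counts(url_list, segment=5):
    """Split each URL on "/", take one segment, then split it on "-" and count the words."""
    counts = Counter()
    for url in url_list:
        parts = url.split("/")
        if len(parts) > segment:
            counts.update(parts[segment].split("-"))
    return counts

counts = slug_word_counts(urls)
print(counts.most_common(3))
```

Because the last word of each slug keeps its file extension, entries such as “weekend.html” show up in the table above.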

4. Find the most commonly used languages in news publications

The URL structure or the “language” section of a news sitemap can be used to view the most commonly used languages in a news publication.

In this example, we use “BBC” to see their language priority.

df_bbc["publication_language"].value_counts().head(20).to_frame()

publication_language
en	698
fa	52
sr	52
ar	47
mr	43
hi	43
gu	41
ur	35
pt	33
te	31
ta	31
cy	30
ha	29
tr	28
es	25
sw	22
cpe	22
ne	21
pa	21
yo	20

20 rows × 1 column

To reach the Russian people through Google News, every Western news source should publish in Russian.

Some international news organizations have started to adopt this approach.

If you are a news SEO, it can be helpful to watch competitors’ Russian-language publications to get objective news to Russia and compete in the news industry.

5. Review the word frequency of news headlines

We use the BBC’s news headlines to see which words are most frequent.

df_bbc["news_title"].str.split(" ").explode().value_counts().to_frame()

news_title
to	232
in	181
-	141
of	140
for	138
...	...
ፊልም	1
ብላክ	1
ባንኪ	1
ጕሒላ	1
niile	1

11916 rows × 1 columns

The problem here is that the news headlines contain all types of words, including context-free stop words.

We need to clean out these non-categorical terms to better understand what the headlines focus on.

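from nltk.corpus import stopwords  # run nltk.download("stopwords") once beforehand

stop = stopwords.words('english')
# Count every space-separated word; name the count column explicitly so it is "news_title" on any pandas version
df_bbc_news_title_most_used_words = df_bbc["news_title"].str.split(" ").explode().value_counts().to_frame(name="news_title")
# \b matches word boundaries, so only whole stop words are removed
pat = r'\b(?:{})\b'.format('|'.join(stop))
# Copy the words from the index into a column so that they can be cleaned
df_bbc_news_title_most_used_words["words"] = df_bbc_news_title_most_used_words.index
df_bbc_news_title_most_used_words["without_stop_words"] = df_bbc_news_title_most_used_words["words"].str.replace(pat, "", regex=True)
# Drop the rows that became empty, i.e. the stop words themselves
df_bbc_news_title_most_used_words.drop(df_bbc_news_title_most_used_words.loc[df_bbc_news_title_most_used_words["without_stop_words"]==""].index, inplace=True)
df_bbc_news_title_most_used_words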

The “without_stop_words” column holds the cleaned text values.

With the help of a regular expression and Pandas’ “str.replace” method, we removed most of the stop words.
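The word-boundary pattern is the key part of that step. Here is a small standard-library sketch of it, using a hand-picked stop word list as a stand-in for the much longer NLTK list.

```python
import re

# A tiny hand-picked stop word list; NLTK's English list is much longer.
stop_words = ["to", "in", "of", "for", "the"]

# \b anchors the match to word boundaries so "in" does not match inside "invasion".
pattern = re.compile(r"\b(?:{})\b".format("|".join(map(re.escape, stop_words))))

tokens = ["to", "in", "invasion", "Ukraine", "of"]
cleaned = [pattern.sub("", token) for token in tokens]

# Tokens that were stop words are now empty strings and can be dropped.
kept = [token for token in cleaned if token]
print(kept)
```

Without the backslashes in front of “b”, the pattern would look for a literal letter “b” around each stop word and match almost nothing.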

The second problem is removing the punctuation.

For this, we’ll use Python’s “string” module.

import re
import string

# Escape the punctuation characters so they are safe inside a regex character class
punctuation_pattern = '[{}]'.format(re.escape(string.punctuation))
df_bbc_news_title_most_used_words["without_stop_word_and_punctation"] = df_bbc_news_title_most_used_words['without_stop_words'].str.replace(punctuation_pattern, '', regex=True)
# Drop the rows that consisted of punctuation only
df_bbc_news_title_most_used_words.drop(df_bbc_news_title_most_used_words.loc[df_bbc_news_title_most_used_words["without_stop_word_and_punctation"]==""].index, inplace=True)
df_bbc_news_title_most_used_words.drop(["without_stop_words", "words"], axis=1, inplace=True)
df_bbc_news_title_most_used_words

news_title	without_stop_word_and_punctation
Ukraine	110	Ukraine
v	83	v
de	61	de
Ukraine:	60	Ukraine
da	51	da
...	...	...
ፊልም	1	ፊልም
ብላክ	1	ብላክ
ባንኪ	1	ባንኪ
ጕሒላ	1	ጕሒላ
niile	1	niile

11767 rows × 2 columns

Alternatively, use “df_bbc_news_title_most_used_words["news_title"].to_frame()” to get a clearer picture of the data.

	news_title
Ukraine	110
v	83
de	61
Ukraine:	60
da	51
...	...
ፊልም	1
ብላክ	1
ባንኪ	1
ጕሒላ	1
niile	1

11767 rows × 1 columns

We saw 11,767 unique words in the BBC’s news titles, with Ukraine being the most popular, appearing 110 times.

There are different Ukraine-related phrases in the dataframe, such as “Ukraine:”.

“NLTK Tokenize” can be used to unify different variants of these types.

The next section will combine them using a different approach.

Note: If you want to make things easier, use Advertools as shown below.

adv.word_frequency(df_bbc["news_title"], phrase_len=2, rm_words=adv.stopwords["english"])

The results are as follows.

Text Analysis with Advertools

“adv.word_frequency” has the parameters “phrase_len” and “rm_words”, which set the length of the phrases to count and the stop words to remove.

You might ask why I didn’t use it in the first place.

I want to show you an educational example on “Regular Expressions, NLTK, and Strings” so you can see what’s going on behind the scenes.

6. Visualize the most common words in news headlines

To visualize the most common words in news headlines, you can use the code block below.

df_bbc_news_title_most_used_words["news_title"] = df_bbc_news_title_most_used_words["news_title"].astype(int)
df_bbc_news_title_most_used_words["without_stop_word_and_punctation"] = df_bbc_news_title_most_used_words["without_stop_word_and_punctation"].astype(str)
df_bbc_news_title_most_used_words.index = df_bbc_news_title_most_used_words["without_stop_word_and_punctation"]
df_bbc_news_title_most_used_words["news_title"].head(20).plot(title="The Most Used Words in BBC News Titles")

News Ngrams Visualization

You will notice a “broken line” in the chart.

Do you remember “Ukraine” and “Ukraine:” in the data frame?

When we remove the punctuation, the two values become identical.

That is why the line chart shows Ukraine twice, once with 110 occurrences and once with 60.

To prevent such data discrepancies, use the code block below.

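# Reset the index first so the grouping key is not both an index level and a column.
# Rows are not de-duplicated before summing, because that would discard counts of variants with equal frequency.
df_bbc_news_title_most_used_words_1 = df_bbc_news_title_most_used_words.reset_index(drop=True).groupby('without_stop_word_and_punctation', sort=False, as_index=True).sum()
df_bbc_news_title_most_used_words_1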

news_title
without_stop_word_and_punctation
Ukraine	175
v	83
de	61
da	51
и	41
...	...
ፊልም	1
ብላክ	1
ባንኪ	1
ጕሒላ	1
niile	1

11109 rows × 1 columns

Rows that share the same cleaned word are merged, and their counts are added together.
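The merge itself can be sketched with the standard library alone. The counts for “Ukraine” and “Ukraine:” are taken from the table in step 5; everything else is an illustrative assumption.

```python
import string
from collections import Counter

# Counts for the two "Ukraine" variants and "v" come from the table in step 5.
raw_counts = {"Ukraine": 110, "Ukraine:": 60, "v": 83}

# Map every punctuation character to None so str.translate deletes it.
strip_punct = str.maketrans("", "", string.punctuation)

merged = Counter()
for word, count in raw_counts.items():
    cleaned = word.translate(strip_punct)
    if cleaned:  # skip tokens that were punctuation only
        merged[cleaned] += count

print(merged.most_common())
```

These two variants alone add up to 170; the 175 in the table above suggests that further variants of the word were merged as well.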

Now, let’s visualize it again.

7. Extract the most popular N-Gram from news headlines

Extracting n-grams from news headlines, or normalizing the URL words and forming n-grams from them, is useful for understanding overall topicality and seeing which news publications are close to which topic. Here is how.

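import re
import unicodedata

import nltk  # requires the "stopwords" and "wordnet" NLTK data packages

def text_clean(content):
  lemmatizer = nltk.stem.WordNetLemmatizer()
  stopwords = nltk.corpus.stopwords.words('english')
  # Fold accented and compatibility characters to ASCII, then lowercase
  content = (unicodedata.normalize('NFKD', content)
    .encode('ascii', 'ignore')
    .decode('utf-8', 'ignore')
    .lower())
  # Remove everything that is not a word character or whitespace
  words = re.sub(r'[^\w\s]', '', content).split()
  return [lemmatizer.lemmatize(word) for word in words if word not in stopwords]

# Join the headlines with spaces so that words from neighboring titles do not merge
raw_words = text_clean(' '.join(df_bbc['news_title'].dropna().astype(str)))
raw_words[:10]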

OUTPUT>>>
['oneminute', 'world', 'news', 'best', 'generation', 'make', 'agyarkos', 'dream', 'fight', 'card']

The output shows that we have “lemmatized” all the words in the news headlines and put them in a list.

List comprehensions provide a quick shortcut to easily filter each stopword.

Use “nltk.corpus.stopwords.words(“english”)” to provide all stopwords in English.

But you can add additional stop words to the list to widen the exclusion of words.

“unicodedata” is for normalizing characters.

The characters we see are actually Unicode code points. For example, “U+2160 ROMAN NUMERAL ONE” and the Latin character “U+0049 LATIN CAPITAL LETTER I” look the same but are different characters.

“unicodedata.normalize” removes such differences so that the lemmatizer treats words written with look-alike characters as the same word.
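A quick standard-library check shows what NFKD normalization does to these two characters and to accented letters. The helper name to_ascii is hypothetical; it repeats the normalize, encode and decode chain used in text_clean.

```python
import unicodedata

roman_one = "\u2160"   # ROMAN NUMERAL ONE
latin_i = "\u0049"     # LATIN CAPITAL LETTER I

# Different code points, so a plain comparison is False.
print(roman_one == latin_i)

# NFKD replaces compatibility characters with their plain equivalents, so this is True.
print(unicodedata.normalize("NFKD", roman_one) == latin_i)

def to_ascii(text):
    """Decompose accented characters, then drop anything outside ASCII."""
    return (unicodedata.normalize("NFKD", text)
            .encode("ascii", "ignore")
            .decode("utf-8", "ignore"))

print(to_ascii("Café"))
```

One side effect worth knowing: the ASCII step removes headlines written in non-Latin scripts entirely, so the resulting n-grams only reflect Latin-script titles.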

import pandas as pd
from nltk import ngrams

pd.set_option("display.max_colwidth",90)
# raw_words is the cleaned word list returned by text_clean
bbc_bigrams = (pd.Series(list(ngrams(raw_words, n = 2))).value_counts())[:15].sort_values(ascending=False).to_frame()
bbc_trigrams = (pd.Series(list(ngrams(raw_words, n = 3))).value_counts())[:15].sort_values(ascending=False).to_frame()

Below, you’ll see the most popular “n-grams” from BBC News.

NGrams data frame from BBC
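To make clear what an n-gram extraction produces, here is a minimal sketch that builds n-grams with plain Python. The function make_ngrams is a hypothetical stand-in written for this example, not NLTK’s implementation, and the word list is made up.

```python
from collections import Counter

def make_ngrams(words, n):
    """Return consecutive n-word tuples from a list of words."""
    return list(zip(*(words[i:] for i in range(n))))

# A made-up cleaned word list, in the style of the text_clean output.
words = ["ukraine", "war", "news", "ukraine", "war", "live"]

bigrams = Counter(make_ngrams(words, 2))
print(bigrams.most_common(1))
```

Each tuple is a window of neighboring words, so a list of six words yields five bigrams and four trigrams.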

To simply visualize the most popular n-grams in a newsfeed, use the code block below.

bbc_bigrams.plot.barh(color="red", width=.8,figsize=(10 , 7))

“Ukraine, war” is the top bigram.

You can also filter n-grams for “Ukraine” and create “entity-attribute” pairs.

News sitemap Ngram from the BBC
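Filtering n-grams for a single entity can be sketched as follows. The bigram counts are placeholders rather than BBC data, and pairs_for is a hypothetical helper name.

```python
from collections import Counter

# Placeholder bigram counts; real values would come from the n-gram step.
bigram_counts = Counter({
    ("ukraine", "war"): 12,
    ("ukraine", "refugee"): 5,
    ("premier", "league"): 7,
    ("russia", "ukraine"): 4,
})

def pairs_for(entity, counts):
    """Keep n-grams containing the entity and report the remaining words as attributes."""
    result = {}
    for gram, count in counts.items():
        if entity in gram:
            attribute = " ".join(word for word in gram if word != entity)
            result[attribute] = result.get(attribute, 0) + count
    # Sort the attributes from most to least frequent.
    return dict(sorted(result.items(), key=lambda item: item[1], reverse=True))

print(pairs_for("ukraine", bigram_counts))
```

The result pairs the entity with the words that most often appear next to it, which is a simple form of the “entity-attribute” view.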

Crawl these URLs and identify “person-type entities” to give you an idea of how the BBC handles newsworthy situations.

But that goes beyond news sitemaps, so it is a topic for another day.

To visualize popular n-grams from a newsfeed’s sitemap, you can create a custom python function as shown below.

def ngram_visualize(dataframe:pd.DataFrame, color:str="blue"):
     # Draw the n-gram counts as a horizontal bar chart and return the axes
     return dataframe.plot.barh(color=color, width=.8,figsize=(10 ,7))

# ngram_extractor is defined in step 8
ngram_visualize(ngram_extractor(df_dailymail))

The results are as follows.

News Sitemap Trigram Visualization

To make it interactive, add an extra parameter as shown below.

def ngram_visualize(dataframe:pd.DataFrame, backend:str, color:str="blue"):
     if backend=="plotly":
          # Note: this switches the backend for every later pandas plot (plotly must be installed)
          pd.options.plotting.backend=backend
          return dataframe.plot.bar()
     else:
          return dataframe.plot.barh(color=color, width=.8,figsize=(10 ,7))

ngram_visualize(ngram_extractor(df_dailymail), backend="plotly")

As a simple example, check below.

8. Create your own custom function to analyze newsfeed sitemaps

When you review news sitemaps repeatedly, a small Python package becomes useful.

Below are four short Python functions chained together, each building on the previous one.

To clean up text content items, use the following function.

def text_clean(content):
  lemmatizer = nltk.stem.WordNetLemmatizer()
  stopwords = nltk.corpus.stopwords.words('english')
  # Fold accented and compatibility characters to ASCII, then lowercase
  content = (unicodedata.normalize('NFKD', content)
    .encode('ascii', 'ignore')
    .decode('utf-8', 'ignore')
    .lower())
  # Remove everything that is not a word character or whitespace
  words = re.sub(r'[^\w\s]', '', content).split()
  return [lemmatizer.lemmatize(word) for word in words if word not in stopwords]

To extract n-grams from the news headlines of a specific news website’s sitemap, use the following function.

def ngram_extractor(dataframe:pd.DataFrame):
     # Returns the top 10 trigrams, or None when the sitemap has no "news_title" column
     if "news_title" in dataframe.columns:
          return dataframe_ngram_extractor(dataframe,  ngram=3, first=10)

Convert the extracted n-grams to a dataframe using the function below.

def dataframe_ngram_extractor(dataframe:pd.DataFrame, ngram:int, first:int):
     # Join the headlines with spaces, clean them, and count the n-grams
     raw_words = text_clean(' '.join(dataframe['news_title'].dropna().astype(str)))
     return (pd.Series(list(ngrams(raw_words, n = ngram))).value_counts())[:first].sort_values(ascending=False).to_frame()

To compare the n-grams of two news sites side by side, use the function below.

def ngram_df_constructor(df_1:pd.DataFrame, df_2:pd.DataFrame):
  df_1_bigrams = dataframe_ngram_extractor(df_1, ngram=2, first=500)
  df_1_trigrams = dataframe_ngram_extractor(df_1, ngram=3, first=500)
  df_2_bigrams = dataframe_ngram_extractor(df_2, ngram=2, first=500)
  df_2_trigrams = dataframe_ngram_extractor(df_2, ngram=3, first=500)
  ngrams_df = {
  "df_1_bigrams":df_1_bigrams.index,
  "df_1_trigrams": df_1_trigrams.index,
  "df_2_bigrams":df_2_bigrams.index,
  "df_2_trigrams": df_2_trigrams.index,
  }
  # Name the columns after each site, e.g. "www.bbc.com" becomes "bbc_bigrams" (this assumes the host starts with "www.")
  dict_df = (pd.DataFrame({ key:pd.Series(value) for key, value in ngrams_df.items() }).reset_index(drop=True)
  .rename(columns={"df_1_bigrams":adv.url_to_df(df_1["loc"])["netloc"][1].split("www.")[1].split(".")[0] + "_bigrams",
                    "df_1_trigrams":adv.url_to_df(df_1["loc"])["netloc"][1].split("www.")[1].split(".")[0] + "_trigrams",
                    "df_2_bigrams": adv.url_to_df(df_2["loc"])["netloc"][1].split("www.")[1].split(".")[0] + "_bigrams",
                    "df_2_trigrams": adv.url_to_df(df_2["loc"])["netloc"][1].split("www.")[1].split(".")[0] + "_trigrams"}))
  return dict_df

Below, you can see an example use case.

ngram_df_constructor(df_bbc, df_guardian)

Comparison of popular Ngrams to see the focus of news sites.

With just these four chained custom Python functions, you can do the following.

It’s easy to visualize these n-grams and the number of news sites to check.
You can see which news sites follow the same topic and which follow different ones.
You can compare their wording or vocabulary on the same subject.
You can see how many different subtopics of the same topic or entity each site covers.
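The comparison in the list above boils down to set operations on the n-gram columns. The sketch below assumes two short placeholder lists of top bigrams (not real sitemap output) and a hypothetical helper named compare_ngrams.

```python
# Placeholder top n-grams for two sources (not real sitemap output).
source_a = [("ukraine", "war"), ("premier", "league"), ("cost", "living")]
source_b = [("ukraine", "war"), ("white", "house"), ("cost", "living")]

def compare_ngrams(first, second):
    """Split two n-gram lists into shared and source-specific sets."""
    first_set, second_set = set(first), set(second)
    return {
        "shared": first_set & second_set,
        "only_first": first_set - second_set,
        "only_second": second_set - first_set,
    }

result = compare_ngrams(source_a, source_b)
print(sorted(result["shared"]))
```

With real data, the two lists would be the bigram or trigram columns of the comparison data frame.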

I didn’t include the n-gram frequency counts.

However, the first row is the most popular n-gram for that particular news source.

To check the next 500 lines, click here.

9. Extract the most frequently used news keywords from the news sitemap

Speaking of news keywords, surprisingly, they are still in active use on Google.

For example, Microsoft Bing and Google no longer consider “meta keywords” a useful signal, unlike Yandex.

However, news keywords from the news sitemap are still used.

Of all these news sources, only The Guardian used news keywords.

It’s useful to know how they use news keywords to provide relevance.

df_guardian["news_keywords"].str.split().explode().value_counts().to_frame(name="news_keyword_occurence")

You can see the most used words in The Guardian’s news keywords.

news_keyword_occurence
news,	250
World	142
and	142
Ukraine,	127
UK	116
...	...
Cumberbatch,	1
Dune	1
Saracens	1
Pearson,	1
Thailand	1

1409 rows × 1 column

The visualization is as follows.

(df_guardian["news_keywords"].str.split().explode().value_counts()
.to_frame(name="news_keyword_occurence")
.head(25).plot.barh(figsize=(10,8),
title="The Guardian Most Used Words in News Keywords", xlabel="News Keywords",
legend=False, ylabel="Count of News Keyword"))

Most Popular Words in News Keywords

The “,” at the end of a news keyword indicates whether it is a separate value or part of another value.
I recommend that you don’t remove “punctuation” or “stop words” from your news keywords to better understand their news keyword usage style.

To parse the keywords differently, use “,” as the delimiter.

df_guardian["news_keywords"].str.split(",").explode().value_counts().to_frame(name="news_keyword_occurence")

The results differ as follows.

news_keyword_occurence
World news	134
Europe	116
UK news	111
Sport	109
Russia	90
...	...
Women's shoes	1
Men's shoes	1
Body image	1
Kae Tempest	1
Thailand	1

1080 rows × 1 column
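The difference between the two parsing styles is easy to see on a single keyword string. The string below is hand-written in the comma-separated style of the tables above, and stripping the whitespace around each keyword is an extra step that the one-liners do not perform.

```python
from collections import Counter

# A hand-written keyword string in the comma-separated style shown above.
keywords = "World news, Ukraine, UK news, Europe"

# Splitting on whitespace breaks multi-word keywords apart and keeps the commas.
by_whitespace = Counter(keywords.split())

# Splitting on "," keeps each keyword phrase intact.
by_comma = Counter(part.strip() for part in keywords.split(","))

print(by_whitespace.most_common(2))
print(by_comma.most_common(2))
```

Whitespace splitting counts the fragment “news,” twice, while comma splitting keeps “World news” and “UK news” as separate keywords.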

Note the “split(",")” call.

(df_guardian["news_keywords"].str.split(",").explode().value_counts()
.to_frame(name="news_keyword_occurence")
.head(25).plot.barh(figsize=(10,8),
title="The Guardian Most Used Words in News Keywords", xlabel="News Keywords",
legend=False, ylabel="Count of News Keyword"))

You can see the difference in results visualized below.

Most Popular Keywords in News Sitemaps

From “Chelsea” to “Vladimir Putin” to “Ukraine War” and “Roman Abramovich”, most of these phrases coincide with the early days of Russia’s invasion of Ukraine.

Use the code block below to interactively visualize news keywords for two different news website sitemaps.

import plotly.graph_objects as go
from plotly.subplots import make_subplots

df_1 = df_guardian["news_keywords"].str.split(",").explode().value_counts().to_frame(name="news_keyword_occurence")
df_2 = df_nyt["news_keywords"].str.split(",").explode().value_counts().to_frame(name="news_keyword_occurence")

# Keep the six most frequent keywords of each source so that x and y have the same length
top_1 = df_1["news_keyword_occurence"][:6]
top_2 = df_2["news_keyword_occurence"][:6]

fig = make_subplots(rows = 1, cols = 2)
fig.add_trace(
     go.Bar(y = top_1.index, x = top_1, orientation="h", name="The Guardian News Keywords"), row=1, col=2
)
fig.add_trace(
     go.Bar(y = top_2.index, x = top_2, orientation="h", name="New York Times News Keywords"), row=1, col=1
)
fig.update_layout(height = 800, width = 1200, title_text="Side by Side Popular News Keywords")
fig.show()
# Save the interactive chart as a standalone HTML file
fig.write_html("news_keywords.html")

You can see the results below.

To interact with the live graph, click here.

In the next section, you will find two different subplots to compare the n-grams of news sites.

10. Create a subplot for comparing news sources

Use the code block below to place the most popular n-grams from each news source’s headlines into subplots.

import matplotlib.pyplot as plt
import pandas as pd

df1 = ngram_extractor(df_bbc)
df2 = ngram_extractor(df_skynews)
df3 = ngram_extractor(df_dailymail)
df4 = ngram_extractor(df_guardian)
df5 = ngram_extractor(df_nyt)
df6 = ngram_extractor(df_cnn)

nrow=3
ncol=2
df_list = [df1 ,df2, df3, df4, df5, df6]
titles = ["BBC News Trigrams", "Skynews Trigrams", "Dailymail Trigrams", "The Guardian Trigrams", "New York Times Trigrams", "CNN News Ngrams"]

fig, axes = plt.subplots(nrow, ncol, figsize=(40, 28))

# Walk through the 3 x 2 grid row by row and draw one horizontal bar chart per news source
for ax, df, title in zip(axes.flat, df_list, titles):
    df.plot.barh(ax = ax,
        title = title,
        fontsize = 10,
        legend = False,
        xlabel = "Trigrams",
        ylabel = "Count")

You can see the results below.

Most Popular Ngrams from News Sources

The example data visualization above is completely static and does not provide any interactivity.

Recently, Elias Dabbas, the creator of Advertools, shared a new script for getting article counts, n-grams, and their counts from news sources.

Check here for better, more detailed and interactive data dashboards.

The above example is from Elias Dabbas, who demonstrates how to interactively get the total number of articles, top words and n-grams from a news website.

Final Thoughts on News Sitemap Analysis with Python

This tutorial aims to provide an educational Python walkthrough for extracting keywords, n-grams, phrase patterns, languages, and other types of SEO-related information from news websites.

News SEO relies heavily on quick response and always-on article creation.

Tracking your competitor’s perspective and method of covering topics can show how quickly your competitors are reacting to search trends.

It would be better to create a Google Trends Dashboard and News Feed Ngram Tracker for comparative and complementary news SEO analysis.

From time to time in this article, I include custom functions or advanced for loops, and sometimes, I keep it simple.

Beginners to advanced Python practitioners can benefit from this to improve their approach to tracking, reporting, and analytics for news SEO and beyond.


Featured Image: BestForBest/Shutterstock
