Gathering & Deduplication: Simplifying News Collection

When creating a system to gather news and remove duplicates, a great starting point is understanding RSS (Really Simple Syndication).

What is RSS?
RSS is an XML-based data format that allows users to quickly and conveniently access the latest content from websites. It’s widely used by blogs, news sites, podcasts, and other platforms to share updates efficiently.

By leveraging RSS feeds, you can automate the process of gathering content, ensuring you stay updated with minimal effort. The next step is implementing deduplication techniques to clean and streamline the data, resulting in more manageable and meaningful reports.

Stay tuned as we explore the methods and tools to make this process seamless!

RSS key features

  • Automatic updates: Subscribers are notified automatically whenever new posts are published.
  • Efficient: You can get all the latest information in one place, rather than having to visit multiple websites.
  • User-friendly: Manage multiple feeds at once through an RSS reader (Feedly, Inoreader, etc.).
  • Ad-free: Only the original content is delivered, keeping your feed clean.

Gathering

I originally planned to crawl the sites directly, but once I realized they exposed RSS feeds, I switched to the feedparser library instead.

Versatility of RSS Parsing
RSS parsing is highly flexible, as the source can be a remote RSS URL, a local file, or even a raw string. This adaptability makes RSS a valuable tool for a variety of use cases.
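
To illustrate, here is a minimal sketch of that flexibility using feedparser; the saved file name and the raw XML snippet are made-up placeholders, not part of the original setup.

import feedparser

# 1. Parse directly from a remote RSS URL
feed_from_url = feedparser.parse("https://rss.nytimes.com/services/xml/rss/nyt/World.xml")

# 2. Parse a local copy of a feed saved earlier (hypothetical file name)
feed_from_file = feedparser.parse("saved_feed.xml")

# 3. Parse a raw XML string already held in memory
raw_xml = """<rss version="2.0"><channel>
<item><title>Example headline</title></item>
</channel></rss>"""
feed_from_string = feedparser.parse(raw_xml)

print(len(feed_from_url.entries), len(feed_from_file.entries), len(feed_from_string.entries))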

When collecting data through RSS, you’ll typically extract four key pieces of information:

  1. Title: The headline or name of the content.
  2. Description: A brief summary or snippet of the content.
  3. Link: The URL to access the full content.
  4. Published Date: The timestamp indicating when the content was published.

This structured data format simplifies the gathering process and sets the stage for efficient deduplication and organization of information.
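
As a quick sketch of pulling out those four fields with feedparser (the feed URL is one of the sources listed later; .get() is used so an entry missing a field prints an empty string instead of raising an error):

import feedparser

feed = feedparser.parse("https://rss.nytimes.com/services/xml/rss/nyt/World.xml")
for entry in feed.entries:
    # Each entry behaves like a dictionary, so missing fields can be defaulted
    print(entry.get("title", ""))
    print(entry.get("description", ""))
    print(entry.get("link", ""))
    print(entry.get("published", ""))
    print("-" * 40)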

Deduplication

Following up on the earlier difflib post, I gave it a try here: I set the similarity threshold to 0.7 and simply remove entries from the list when their titles are that similar, as sketched below.
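
As a quick reminder of how that ratio behaves, here is a small sketch with two made-up headlines; the 0.7 threshold matches the one used in the code below.

import difflib

title_a = "Storm batters coastal towns, thousands evacuated"
title_b = "Storm batters coastal towns as thousands evacuate"

similarity = difflib.SequenceMatcher(None, title_a, title_b).ratio()
print(similarity)  # roughly 0.9 for these two titles

if 0.7 < similarity:
    print("Treated as duplicates")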

RSS Source

https://rss.nytimes.com/services/xml/rss/nyt/World.xml
https://rss.nytimes.com/services/xml/rss/nyt/US.xml
http://rss.cnn.com/rss/edition.rss
http://rss.cnn.com/rss/edition_world.rss

Code

import feedparser
import difflib

def dup_delete(news_lists):
    # Keep the first occurrence of each story and drop later entries whose
    # title is more than 70% similar to a title that was already kept.
    deduped = []
    for news in news_lists:
        is_duplicate = False
        for kept in deduped:
            # title similarity
            similarity = difflib.SequenceMatcher(None, news[0], kept[0]).ratio()
            if 0.7 < similarity:
                is_duplicate = True
                break
        if not is_duplicate:
            deduped.append(news)
    return deduped

def get_rss_news(url):
    # Parse the feed and collect [title, description] pairs for every entry.
    feed = feedparser.parse(url)
    ret_lists = []
    for item in feed.entries:
        try:
            # item.link and item.published are also available if needed
            ret_lists.append([item.title, item.description])
        except AttributeError:
            # skip entries that are missing a title or a description
            continue
    return ret_lists
"""
The New York Times : https://rss.nytimes.com/
CNN : http://edition.cnn.com/services/rss/
"""

if __name__ == '__main__':
    news_rss_site = [
        "https://rss.nytimes.com/services/xml/rss/nyt/World.xml",
        "https://rss.nytimes.com/services/xml/rss/nyt/US.xml",
        "http://rss.cnn.com/rss/edition.rss",
        "http://rss.cnn.com/rss/edition_world.rss"
        ]
    news_lists = []
    for url in news_rss_site:
        news = get_rss_news(url)
        news_lists.extend(news)
    print(len(news_lists))
    news_lists = dup_delete(news_lists)
    print(len(news_lists))
    print(news_lists)

Output

64
32
[['Video shows moment of deadl .....]...,[],[]]

While working on Gathering & Deduplication, this run cut the collected items roughly in half (from 64 down to 32).

I prioritized the remaining data by importance, but the code turned out to be a bit too complex to share at this point. I’ll try to simplify it and share it later using other techniques.

That said, a straightforward approach could be to simply mark how many duplicates exist in the list and then sort the data accordingly. This method can provide clarity without adding unnecessary complexity.
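
Here is a rough sketch of that idea (not the prioritization code I actually used): instead of deleting duplicates, count how many near-duplicate titles each story has and sort by that count, on the assumption that a story repeated across feeds matters more.

import difflib

def rank_by_duplicates(news_lists, threshold=0.7):
    # For each story, count how many other titles are near-duplicates,
    # then sort so the most-repeated stories come first.
    ranked = []
    for i, news in enumerate(news_lists):
        dup_count = 0
        for j, other in enumerate(news_lists):
            if i == j:
                continue
            if difflib.SequenceMatcher(None, news[0], other[0]).ratio() > threshold:
                dup_count += 1
        ranked.append((dup_count, news))
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return ranked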

By Mark

-_-
