Gathering & Deduplication: Simplifying News Collection
When creating a system to gather news and remove duplicates, a great starting point is understanding RSS (Really Simple Syndication).
What is RSS?
RSS is an XML-based data format that allows users to quickly and conveniently access the latest content from websites. It’s widely used by blogs, news sites, podcasts, and other platforms to share updates efficiently.
By leveraging RSS feeds, you can automate the process of gathering content, ensuring you stay updated with minimal effort. The next step is implementing deduplication techniques to clean and streamline the data, resulting in more manageable and meaningful reports.
Stay tuned as we explore the methods and tools to make this process seamless!
RSS key features
- Automatic updates: Subscribers are notified automatically when new posts are published.
- Efficient: You can get all the latest information in one place, rather than having to visit multiple websites.
- User-friendly: Manage multiple feeds at once through an RSS reader (Feedly, Inoreader, etc.).
- Ad-free: Only original content is served, keeping your information clean.
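Under the hood, an RSS feed is just an XML document. A minimal RSS 2.0 feed (the feed and article here are made up for illustration) looks like this:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>https://example.com/</link>
    <description>Latest headlines from a hypothetical site.</description>
    <item>
      <title>Sample headline</title>
      <description>A brief summary of the article.</description>
      <link>https://example.com/article</link>
      <pubDate>Mon, 06 Jan 2025 09:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>
```

Each `<item>` carries exactly the fields we care about when gathering news: title, description, link, and published date.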
Gathering
I started out with crawling, then realized the sites already expose RSS feeds, so I switched to the feedparser library.
Versatility of RSS Parsing
RSS parsing is highly flexible, as the source can be a remote RSS URL, a local file, or even a raw string. This adaptability makes RSS a valuable tool for a variety of use cases.
When collecting data through RSS, you’ll typically extract four key pieces of information:
- Title: The headline or name of the content.
- Description: A brief summary or snippet of the content.
- Link: The URL to access the full content.
- Published Date: The timestamp indicating when the content was published.
This structured data format simplifies the gathering process and sets the stage for efficient deduplication and organization of information.
Deduplication
Following the previous post on difflib, I gave deduplication a try: I set the similarity threshold to 0.7 and simply remove an entry from the list when its title is that similar to another.
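As a quick sanity check of that threshold, two headlines that differ only in minor wording (both made up here) score well above 0.7:

```python
import difflib

a = "Video shows moment of deadly strike in city center"
b = "Video shows the moment of a deadly strike in city centre"

# ratio() returns a similarity score between 0.0 and 1.0
similarity = difflib.SequenceMatcher(None, a, b).ratio()
print(similarity > 0.7)  # True -- these would be treated as duplicates
```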
RSS Source
https://rss.nytimes.com/services/xml/rss/nyt/World.xml
https://rss.nytimes.com/services/xml/rss/nyt/US.xml
http://rss.cnn.com/rss/edition.rss
http://rss.cnn.com/rss/edition_world.rss
Code
import difflib
import feedparser

def dup_delete(news_lists):
    """Drop entries whose titles are near-duplicates of an earlier entry."""
    deduped = []
    for news in news_lists:
        for kept in deduped:
            # title similarity
            similarity = difflib.SequenceMatcher(None, news[0], kept[0]).ratio()
            if similarity > 0.7:
                break  # near-duplicate of a story we already kept
        else:
            deduped.append(news)
    return deduped
def get_rss_news(url):
    """Return [title, description] pairs for every entry in the feed."""
    feed = feedparser.parse(url)
    ret_lists = []
    for item in feed.entries:
        try:
            #print(f'Item link: {item.link}')
            #print(f'Item published: {item.published}')
            ret_lists.append([item.title, item.description])
        except AttributeError:
            # some entries lack a title or description -- skip them
            continue
    return ret_lists
"""
The New York Times : https://rss.nytimes.com/
CNN : http://edition.cnn.com/services/rss/
"""
if __name__ == '__main__':
    news_rss_site = [
        "https://rss.nytimes.com/services/xml/rss/nyt/World.xml",
        "https://rss.nytimes.com/services/xml/rss/nyt/US.xml",
        "http://rss.cnn.com/rss/edition.rss",
        "http://rss.cnn.com/rss/edition_world.rss",
    ]

    news_lists = []
    for url in news_rss_site:
        news_lists.extend(get_rss_news(url))

    print(len(news_lists))
    news_lists = dup_delete(news_lists)
    print(len(news_lists))
    print(news_lists)
Output
64
32
[['Video shows moment of deadl .....]...,[],[]]
While working on Gathering & Deduplication, I’ve managed to reduce the data by about 70%.
I prioritized the remaining data by importance, but the code turned out to be a bit too complex to share at this point. I’ll try to simplify it and share it later using other techniques.
That said, a straightforward approach could be to simply mark how many duplicates exist in the list and then sort the data accordingly. This method can provide clarity without adding unnecessary complexity.
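That duplicate-count idea could be sketched like this (a hypothetical helper, reusing the 0.7 threshold; the sample headlines are made up):

```python
import difflib

def rank_by_duplicates(news_lists, threshold=0.7):
    """Count near-duplicate titles for each story, keep one copy,
    and sort so the most widely reported stories come first."""
    ranked = []  # entries of the form [count, [title, description]]
    for news in news_lists:
        for entry in ranked:
            if difflib.SequenceMatcher(None, news[0], entry[1][0]).ratio() > threshold:
                entry[0] += 1  # another outlet covered the same story
                break
        else:
            ranked.append([1, news])
    ranked.sort(key=lambda e: e[0], reverse=True)
    return ranked

stories = [
    ["Storm hits the coast", "desc"],
    ["Storm hits coast", "desc"],
    ["Markets rally", "desc"],
]
print(rank_by_duplicates(stories)[0])  # the storm story, counted twice
```

The assumption is that a story repeated across feeds is more important, so the duplicate count doubles as a simple priority score.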