BBC web crawling – Python (requests, BeautifulSoup)

The main goal of this blog, “I want everyone to code easily,” reflects the desire to create something universally useful and accessible. However, since everyone’s needs and goals differ, finding a common ground can be challenging. To start with something practical, let’s focus on a specific case: working with News data, such as from the BBC.

First, let’s gather the top news headlines using a Python script to fetch and organize the data. Once we collect it, we can summarize or process it as needed. Later, we can refine these summaries and put them to use effectively.

First, open the website “https://www.bbc.com”, right-click and select “Inspect”. You will see a screen like the one below, where you can check where the data is located in the page.

If you right-click the element in the Elements panel and select ‘Copy selector’, the selector is copied to the clipboard. Paste it (Ctrl+V or Command+V) and you will see something like this.

“main-content > article > section:nth-child(1) > div > div.sc-e70150c3-0.fbvxoY > div.sc-93223220-0.bOZIBp > div:nth-child(3) > div > div > div > div > a > div > div.sc-6781995d-5.dWflPh > div.sc-8ea7699c-1.hxRodh > div > h2”

Here’s the structure of the HTML. If you strip the selector down to bare tag names, dropping the IDs and the generated class attributes, it adapts easily when the site’s class names change. This approach keeps the focus on the relevant content within the main section.

“div > div > div > a > div > div > div > div > div > div > h2”
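To see why the tag-only version is more robust, here is a tiny offline sketch. The sample HTML and its class name are made up; only the tag nesting mirrors the structure discussed above.

```python
from bs4 import BeautifulSoup

# Made-up sample: the class name is arbitrary, but the nesting
# (a > div > ... > h2) mirrors the structure shown above.
sample = """
<main>
  <a href="/news/article-1">
    <div class="sc-e70150c3-0">
      <div><h2>Example headline</h2></div>
    </div>
  </a>
</main>
"""

soup = BeautifulSoup(sample, "html.parser")
# A selector built from tag names only keeps matching
# even if the site renames its generated classes.
h2 = soup.select_one("a h2")
print(h2.text)  # Example headline
```

Because the selector never mentions a class, renaming `sc-e70150c3-0` to anything else leaves the result unchanged.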

If you write the code as shown below, it will work. However, since the site’s HTML structure can change with updates or vary by audience, you should keep your code adaptable so you can still extract the content you need when the markup changes.

Code

import requests
from bs4 import BeautifulSoup as HtmlParser

if __name__ == '__main__':
    # Pretend to be a regular browser; some sites reject the
    # default python-requests User-Agent.
    my_header = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) '
                               'AppleWebKit/537.36 (KHTML, like Gecko) '
                               'Chrome/125.0.0.0 Safari/537.36'}
    response = requests.get('https://www.bbc.com/', headers=my_header)
    soup = HtmlParser(response.text, 'html.parser')
    # Tag-only selector: grabs the first headline regardless of class names
    html = soup.select_one('a > div > div > div > div > h2')  # see image below
    if html is not None:  # select_one() returns None when nothing matches
        print(html)       # the full <h2> element
        print(html.text)  # just the headline text

Running the code this way lets you extract the desired text, such as titles. Later, you can extend it to follow the links and extract full articles, but for now, let’s focus on extracting the titles. Keep in mind that a headline alone may exaggerate or misrepresent the story, so reviewing the full content is essential.
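One way to extend the script (a sketch of my own, not from the original post): switch select_one() to select() to collect every headline in one pass. The extract_titles and fetch_bbc_titles helpers and the descendant selector 'a h2' are assumptions about the page structure, not guarantees about BBC’s markup.

```python
import requests
from bs4 import BeautifulSoup

def extract_titles(html):
    """Return the text of every <h2> that sits inside a link."""
    soup = BeautifulSoup(html, "html.parser")
    # select() returns all matches, while select_one() stops at the first
    return [h2.get_text(strip=True) for h2 in soup.select("a h2")]

def fetch_bbc_titles():
    """Download the BBC front page and extract its headlines.

    Hypothetical helper: the minimal User-Agent and the 'a h2'
    selector are assumptions and may need adjusting.
    """
    my_header = {"User-Agent": "Mozilla/5.0"}  # minimal browser-like header
    response = requests.get("https://www.bbc.com/", headers=my_header, timeout=10)
    response.raise_for_status()  # fail loudly instead of parsing an error page
    return extract_titles(response.text)
```

Separating the parsing from the download also makes the parsing testable offline: you can feed extract_titles() any saved HTML without touching the network.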

By Mark

-_-
