How Scrapy and Selenium are used in Analyzing and Scraping News Articles?

Selenium got its start as a web testing tool. Someone, who has never done web testing previously, will find it entertaining to play with — as you will sit there watching your browser being possessed — no, programmatically commanded — to do all sorts of things while sipping coffee with both hands.

Here is the script to get started:

The web driver must be located on the first level of the project folder, which is the same level as the “scrapy.cfg” file, which must be taken care of.

Without JavaScript, the search word would not even appear on CNN, and we would be presented with a blank page -

This, on the other hand, demonstrates the pleasure (and problems) of JavaScript

So, we’ll need to replicate the process of transferring search requests (simply using the “search?q=” string in the URL would serve, but the following will show a more full method of running Selenium from the home page). After that, we’ll look at pagination -

On a side note, the “Date” button will just allow you to rank by date or relevance, it will not allow you to search for a particular date range. The code for scraping CNN is below, along with an explanation in the comments.

Scraping Fox News would be comparable, just like we’re dealing with the Show More button instead of standard pagination -

Only the significant deviations from the CNN spider are discussed this time.

To execute these spiders, simply type the following into Terminal:

Scrapy does not process data in order, thus the data we collected would be in a bizarre sequence. To expedite the procedure, multiple requests are sent at the same time.

For this section, we’ll need the following packages:

Here is a demo of CNN and the Fox News data:

There seem to be a few typical cleaning procedures to consider, which will ultimately depend on our goals. If we only want to look at the content, for example, we can disregard the chaos in other columns entirely.

1. Discard articles in unusual formats, such as slideshows (which result in NAs).

2. Format Dates

In addition, we also included a new month-year column for future aggregate reports. It also aids in the removal of unneeded items released in July (previously scraped with rough page counts).

We now have 644 CNN stories and 738 Fox News articles after cutting. Both media organizations appear to be increasing the number of immigration-related pieces published each month, with Fox showing a noticeable surge in interest in March.

3. Clean Articles

Because the scraping stage had indiscriminately included all the extraneous stuff, such as ad banners, media sources, and markups like “width” or “video closed,” we could do a far finer job cleaning the body of a post. Some of those, on the other hand, would scarcely compromise our textual analysis.

We could perform a far better job cleaning the content of a post because the scraping stage has randomly included those unnecessary stuff, such as ad banners, media sources, and markups like “width” or “video closed.” But at the other side, several of these would hardly impair our text analysis.

Here we will initiate, with headlines to make sense of the variation between two publications.

Here is the keyword in the headline for CNN every month.

All of the words are capitalized, so “us” means “The US” and “ice” means “ICE” (Immigration and Customs Enforcement), and so on.

Another thing we will look at is Bigrams.

There are a few unusual bigrams among the anticipated popular ones, such as “Biden administration” and “Trump administration.”

With the bigram list, we could conduct a conditional relative frequency search for certain keyword pairings. For instance,

It would be interesting to see how word frequency changed over the course of eight months, and hence created a new dataset with monthly word counts:

Though such a dataset is useful when comparing month to month, this would be more convenient to visualize and show the change in Tableau in a lengthy format — therefore the transformation:

Here’s a link to a tutorial on how to animate the Tableau visualisation.

Beginning in the election month, we observe references of Biden rise quickly, while “Trump” falls off the list totally in March, and attention to migrant children rises with “border.”

Since the election, “Biden” has taken the lead, but the attention didn’t build up until the start of 2021 when “crisis” and “surge” began to dominate the media.

To see which words in the articles might have greater meaning, we have used TF-IDF again, which evaluates both the value of a term in the document (in this example, a specific news story) and its relevance in the whole corpus, with the all-too-common words weighted less. We also threw in some stops to the mix.

There are various ways to achieve this, but in this case, we tried to pool the top ten terms (ordered by their TF-IDF weights) across articles to analyze the differences in each publication’s total vocabulary.

We might notice the similarities in keywords this way:

We may utilize — to see which words were adopted by one but not the other.

As shown by the above analysis, Fox News’ keywords include arrest, caravan, legally questionable, wall, Latino, and various states such as Arizona and Texas, whereas CNN’s keywords included American, Black, China, White, Latino, women, campaign, protest, and worker, which did not appear as made a significant impact for Fox News.

Sentiment analysis, topic detection, or more specific content analysis, such as comparing organizations nouns, modals, quotations, or lexical diversity, could be used as a further step.

For any Queries, contact 3i Data scraping!!

Originally published at https://www.3idatascraping.com.

--

--

3i Data Scraping is an Experienced Web Scraping Service Provider in the USA. We offering a Complete Range of Data Extraction from Websites and Online Outsource.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Data Scraping Services and Data Extraction

3i Data Scraping is an Experienced Web Scraping Service Provider in the USA. We offering a Complete Range of Data Extraction from Websites and Online Outsource.