This repository has been archived by the owner on Jul 3, 2024. It is now read-only.
Smruthi Raj Mohan edited this page Apr 10, 2019 · 4 revisions

Short title

Get insights on Raw Data from the Web

Long title

Scrape, Analyze and Visualize insights on Raw Data From the Web using Watson Studio

Author

URLs

Github repo

Other URLs

Summary

The Web holds an abundance of information and is now one of the most trusted sources of information. However, much of that information is raw or unstructured. This pattern demonstrates a methodology to extract data on any given topic from the Web, derive insights from it, and visualise them, using analytics on startups as the example. For instance, if we identify a startup that provides ML services in the healthcare domain, we want to see whether it has generated some buzz and appeared in articles on a popular tech and business portal such as Economic Times.

Our application aims to provide a tool that extracts live unstructured data about companies and their impact in the industry. The data is processed with Watson Natural Language Understanding, fed into IBM SPSS Predictive Analytics to get meaningful insights and predictions, and finally fed into the Embedded Dashboard, which visualises the results.

Technologies

  • Analytics
  • Artificial Intelligence

Description

The Internet is home to an enormous number of web pages, most of which contain information in a raw or unstructured form. Is there a way to ingest such raw data on a given topic and provide insights and visualisations for it?

Our application aims to provide a methodology that extracts this real-time unstructured data, taking startups and their impact in the industry as the example. The data is processed with Watson Natural Language Understanding, fed into IBM SPSS Predictive Analytics to get meaningful insights and predictions, and finally fed into the Embedded Dashboard, which visualises the results.

As an example, the following views are demonstrated to get insights on the popularity of startups:

  • Company's Score based on Relevance: A view in which more popular companies are drawn larger than less popular ones.
  • Total number of articles that appeared on the web for a Company: A view showing the factors affecting the popularity of a startup on the web (among News Articles, Tech Blogs, Social Media and so on).
  • News Concept Relevance: Gives a broad overview of the main topics of the articles across the companies, by percentage relevance.
  • News Sentiment Analysis by Company: Gives an overall analysis of the tone in which the articles were written, to understand the impact (positive, negative or neutral) a given company has in the industry.

Flow

  1. The user creates and runs a Python Notebook on Watson Studio.
  2. The Notebook scrapes the latest news on Startups.
  3. The scraped information is sent to Watson Natural Language Understanding to extract Keywords, Entities and Sentiments, along with their respective confidence scores.
  4. The NLU results are compiled into a CSV file, which is then converted to a table in DB2 Warehouse.
  5. The table is ingested into SPSS, which runs analytics and returns a score for each company. The updated table is then saved back to DB2 Warehouse.
  6. Finally, Cognos ingests the final table from DB2 Warehouse and provides insightful visualisations.
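
Step 4 of the flow, compiling NLU results into a CSV file, could look roughly like the sketch below. It is only an illustration, not the notebook's actual code: the function name `nlu_results_to_csv`, the column names in `FIELDS`, and the input shape (a list of dicts, each holding a company name, an article URL and an NLU-style result with a `keywords` list) are all assumptions made for this example, and the sample data is invented. Only the Python standard library is used.

```python
import csv
import io

# Column order for the compiled table (hypothetical; chosen for this sketch).
FIELDS = ["company", "url", "keyword", "relevance", "sentiment_score"]


def nlu_results_to_csv(articles):
    """Flatten per-article NLU-style results into CSV text, one row per keyword."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for article in articles:
        for keyword in article["nlu"].get("keywords", []):
            writer.writerow({
                "company": article["company"],
                "url": article["url"],
                "keyword": keyword["text"],
                "relevance": keyword["relevance"],
                # Sentiment may be missing for a keyword; leave the cell empty.
                "sentiment_score": keyword.get("sentiment", {}).get("score", ""),
            })
    return buffer.getvalue()


# Invented sample data, shaped like the assumed input.
sample = [
    {
        "company": "ExampleCo",
        "url": "https://example.com/article-1",
        "nlu": {
            "keywords": [
                {"text": "machine learning", "relevance": 0.95,
                 "sentiment": {"score": 0.4}},
                {"text": "healthcare", "relevance": 0.8},
            ]
        },
    }
]

csv_text = nlu_results_to_csv(sample)
print(csv_text)
```

Writing one row per keyword keeps the table flat, which suits loading it into a relational table; entities could be flattened the same way if they are needed downstream.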

Instructions

  1. Set up the Notebook on your Watson Studio Project
  2. Set up the SPSS Modeler on your Watson Studio Project
  3. Set up the Embedded Dashboard on your Watson Studio Project

Components and services

  • IBM SPSS Modeler
  • IBM Cognos Dashboard
  • IBM Watson Studio
  • IBM DB2 Warehouse
  • Cloud Object Storage

Related IBM Developer content

Announcement

The World Wide Web, or the "Web", is the universe of network-accessible information. Most of this information is present on the Web in a raw format. What if you want a way to ingest raw information on the Web for any given topic and provide insights and visualisations for it? This code pattern does just that, taking analytics on startups as an example.

We are in the age of startups, and there is a rapid increase in the number of companies providing skilled services. We can scrape information about such companies and evaluate their success based on the number of articles or live use cases that have appeared on news portals.

Suppose we want to understand the current startups in a particular technology, say Machine Learning. This code pattern evaluates each startup's impact in the industry on the basis of:

  • How many times it has appeared in the news
  • Whether it has a Wikipedia page
  • Whether it has tech blogs
  • Whether it is active on social media (Twitter, Medium, etc.)

This unstructured data, once scraped (that is, extracted from the web), is processed through Watson NLU and converted to structured data. It is then fed to SPSS, which is used to understand the data and perform analytics that determine whether all the factors mentioned above are present for a company, thereby computing a popularity score. Once the analytics has been performed, this code pattern also provides a user-friendly, interactive dashboard visualisation of the data. In this way, the pattern provides a methodology that gives insights into once-raw data from the web and makes that data easy to leverage.
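
To make the idea of a popularity score concrete, here is a minimal sketch that checks the four factors listed above and returns the fraction that are present. This is an assumption for illustration only: the pattern computes its score in SPSS and the exact formula is not given here, so the function name `popularity_score`, the equal weighting of factors, and the sample companies are all hypothetical.

```python
def popularity_score(news_count, has_wikipedia, has_tech_blog, active_on_social):
    """Return the fraction (0.0 to 1.0) of the four factors present for a company.

    Equal weighting is an assumption of this sketch, not the SPSS model's logic.
    """
    factors = [
        news_count > 0,          # has appeared in the news at least once
        bool(has_wikipedia),     # has a Wikipedia page
        bool(has_tech_blog),     # has tech blogs
        bool(active_on_social),  # is active on social media
    ]
    return sum(factors) / len(factors)


# Invented example companies.
companies = {
    "ExampleCo": popularity_score(12, True, True, False),
    "SampleLabs": popularity_score(0, False, True, False),
}

# Rank companies from most to least popular.
ranked = sorted(companies, key=companies.get, reverse=True)
print(ranked)
```

A real scoring model would likely weight the factors differently, for example by letting the number of news appearances count for more than a simple yes or no.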