Scraping patent data and analyzing data for microbiome patents

mrtoronto/patent_scraper_analysis

patent_scraper_analysis

To gain insight into technological trends, I started investigating patents. It seemed like "smart" people understood the trends in patents being filed, but I couldn't figure out how they knew what they knew. To solve this, I built a scraper that pulls data from the USPTO and wrote several notebooks analyzing different aspects of the patents.

No-Code Scraping Workflow

The initial search page can be found here. The program queries the patent search page and reads the number of patents returned for the search query passed in the function's parameters.

The URL of the page returned from a search will look like this:

http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=0&f=S&l=50&TERM1=probiotic&FIELD1=&co1=AND&TERM2=&FIELD2=&d=PTXT
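As an illustration, a search URL of this shape can be assembled from a query term with the standard library. This is a minimal sketch, not the repository's code: `build_search_url` is a hypothetical helper, and the parameter values are simply copied from the example URL above. The sketch only builds the string and makes no request, since the endpoint may no longer be reachable.

```python
from urllib.parse import urlencode

# Endpoint copied from the example URL above.
BASE = "http://patft.uspto.gov/netacgi/nph-Parser"


def build_search_url(term: str) -> str:
    """Return a results-list URL for a single-term search (hypothetical helper)."""
    # Parameter names and values mirror the example URL; their exact
    # meaning on the server side is not documented here.
    params = {
        "Sect1": "PTO2",
        "Sect2": "HITOFF",
        "p": 1,
        "u": "/netahtml/PTO/search-bool.html",
        "r": 0,
        "f": "S",
        "l": 50,
        "TERM1": term,
        "FIELD1": "",
        "co1": "AND",
        "TERM2": "",
        "FIELD2": "",
        "d": "PTXT",
    }
    # urlencode percent-encodes "/" as %2F, matching the example URL.
    return f"{BASE}?{urlencode(params)}"


print(build_search_url("probiotic"))
```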

Using that count, the program loops through the links to the individual patents. One of those links looks like this:

http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=1&f=G&l=50&co1=AND&d=PTXT&s1=probiotic&OS=probiotic&RS=probiotic

The URL parameter &r=1 is incremented to return each successive patent for the search query given in the &s1=probiotic parameter.
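The loop over the &r parameter can be sketched as follows. `patent_urls` is a hypothetical helper, and `count` stands in for the number of hits read from the search page. The other parameters are copied from the example link above. No requests are made here.

```python
from urllib.parse import urlencode

# Endpoint copied from the example link above.
BASE = "http://patft.uspto.gov/netacgi/nph-Parser"


def patent_urls(term: str, count: int):
    """Yield one patent URL per search hit, with r running from 1 to count."""
    for r in range(1, count + 1):
        params = {
            "Sect1": "PTO2",
            "Sect2": "HITOFF",
            "p": 1,
            "u": "/netahtml/PTO/search-bool.html",
            "r": r,  # index of the patent within the search results
            "f": "G",
            "l": 50,
            "co1": "AND",
            "d": "PTXT",
            "s1": term,
            "OS": term,
            "RS": term,
        }
        yield f"{BASE}?{urlencode(params)}"


# Assumed hit count of 3, purely for illustration.
urls = list(patent_urls("probiotic", 3))
for url in urls:
    print(url)
```

A real scraper would fetch each URL in turn, ideally with a delay between requests.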

Analysis Notebooks

Meta-Data

In this notebook, I built a pipeline to analyze metadata features such as inventors, publication dates and primary examiners.

Publication Date Example Plot
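A minimal sketch of the kind of counting behind a publication-date plot, using only the standard library. The record layout (`inventors`, `publication_date` as ISO strings) and the placeholder values are assumptions for illustration. The actual input data and notebook code may look different.

```python
from collections import Counter
from datetime import date

# Placeholder records; field names and values are assumptions, not real data.
patents = [
    {"inventors": ["Inventor A", "Inventor B"], "publication_date": "2018-05-01"},
    {"inventors": ["Inventor A"], "publication_date": "2019-02-12"},
    {"inventors": ["Inventor A", "Inventor C"], "publication_date": "2019-11-26"},
]


def patents_per_year(records):
    """Count patents by publication year, returned in ascending year order."""
    years = (date.fromisoformat(r["publication_date"]).year for r in records)
    return dict(sorted(Counter(years).items()))


def top_inventors(records, n=3):
    """Return the n inventors that appear on the most patents."""
    counts = Counter(name for r in records for name in r["inventors"])
    return counts.most_common(n)


print(patents_per_year(patents))
print(top_inventors(patents, n=1))
```

The per-year counts can then be passed to any plotting library as a bar or line chart.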

Abstracts

In this notebook, I created a pipeline to analyze the content of the abstracts of the patents in the input data.

Example Plot of Common 3-Grams
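Counting common 3-grams can be sketched with the standard library alone. This is an illustrative version under simple assumptions (lowercasing, alphanumeric tokens, no stop-word removal), not the notebook's actual pipeline. The two abstracts are made-up placeholders.

```python
import re
from collections import Counter


def ngrams(text, n=3):
    """Split text into lowercase alphanumeric tokens and return its n-grams."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def common_ngrams(abstracts, n=3, top=5):
    """Return the most frequent n-grams across all abstracts."""
    counts = Counter()
    for abstract in abstracts:
        counts.update(ngrams(abstract, n))
    return counts.most_common(top)


# Made-up abstracts, for illustration only.
abstracts = [
    "A probiotic composition for gut health.",
    "The probiotic composition for oral use.",
]

print(common_ngrams(abstracts, n=3, top=1))
```

N-grams are counted within each abstract separately, so no 3-gram spans the boundary between two patents.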

uBiome Analysis

I specifically dove into uBiome because I knew a bit about their business from before they filed for bankruptcy and was interested to see what their patents looked like.

Inventors

Publication Dates

Claims Topic 3-Gram Cloud

To Do

  • Standardize plots
    • Covers everything from high-level style choices (grids, tick marks, font sizes) to axis labels, titles and legends.
  • Do more in-depth text analysis
    • Extended the analysis to claims and applied n-gram analysis and LDA topic modeling to both claims and abstracts.
    • Next step is to find a more in-depth method of analyzing text. LDA seems sufficient for unsupervised clustering, but there may be more complex methods worth trying.

Example Input Data

https://drive.google.com/file/d/1FtqAcsA-xKhNqVqFMK0rzjQTmsxWaIz3/view?usp=sharing
