Patent Data #9

Closed
blester125 opened this issue Nov 7, 2023 · 11 comments · Fixed by baberabb/licensed-pile#1 or #71
Labels: ready to go (This is ready for someone to start working on.)

Comments

@blester125
Collaborator

Domain: Patents

Can we use the Google Patents data for this?

It might also be possible to use C4/Common Crawl data for this, since patents.google.com is one of the most represented domains in C4.
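
A minimal sketch of the C4 route, assuming each C4 record is a dict with `url` and `text` fields (as in the published C4 JSON dumps). The helper names and the sample records, including the patent numbers, are made up for illustration:

```python
from urllib.parse import urlparse

# Host to keep; this is the domain mentioned above.
PATENT_HOST = "patents.google.com"


def is_google_patents_url(url: str) -> bool:
    """Return True only when the URL's hostname is exactly patents.google.com."""
    host = (urlparse(url).hostname or "").lower()
    return host == PATENT_HOST


def filter_patent_records(records):
    """Yield only the records whose `url` field points at patents.google.com."""
    for record in records:
        if is_google_patents_url(record.get("url", "")):
            yield record


# Illustrative records only; real C4 shards would be streamed from disk.
sample = [
    {"url": "https://patents.google.com/patent/US1234567A/en", "text": "A widget..."},
    {"url": "https://example.com/patents.google.com", "text": "Not a patent page."},
    {"url": "http://PATENTS.GOOGLE.COM/patent/US7654321B2/en", "text": "Another widget..."},
]
patent_records = list(filter_patent_records(sample))
```

Matching on the parsed hostname rather than on a substring avoids picking up pages that merely mention the domain in their path.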

@craffel
Collaborator

craffel commented Nov 7, 2023

For the record, from https://www.dol.gov/general/aboutdol/copyright

> As part of the terms of granting the patent to the inventor, patents are published into the public domain.

@sunnydigital sunnydigital self-assigned this Dec 11, 2023
@chris-ha458

Would this be relevant in this context?

https://www.uspto.gov/learning-and-resources/bulk-data-products

@chris-ha458

https://bulkdata.uspto.gov/
I'll take a look into some of those sets.

@chris-ha458

If this line of inquiry is fruitful, the following might be useful as well.

It ostensibly combines multiple countries' datasets with several other patent datasets.
However, I do not have a proper GCP account (which is required to run queries, and the queries themselves cost money), so I'd appreciate input from somebody familiar with GCP and its datasets:
https://console.cloud.google.com/marketplace/product/google_patents_public_datasets/google-patents-public-data?pli=1&project=api-project-904060009868

@StellaAthena
Collaborator

@sunnydigital @chris-ha458 any updates on this?

@sunnydigital

> @sunnydigital @chris-ha458 any updates on this?

Hi Stella, I'm no longer working on this project. Let me unassign myself.

@StellaAthena StellaAthena added the ready to go This is ready for someone to start working on. label Jan 8, 2024
@baberabb
Contributor

baberabb commented Jan 9, 2024

> If this line of inquiry is fruitful, the following might be useful as well.
>
> It ostensibly combines multiple countries' datasets with several other patent datasets. However, I do not have a proper GCP account (which is required to run queries, and the queries themselves cost money), so I'd appreciate input from somebody familiar with GCP and its datasets: https://console.cloud.google.com/marketplace/product/google_patents_public_datasets/google-patents-public-data?pli=1&project=api-project-904060009868

I had a look, and they only have text available for US publications. Other countries just have (very short) abstracts, from what I could tell. I can take a look at the sets available from the USPTO if no one else is working on this.

@chris-ha458

@baberabb can you share how you accessed it?
Did it require GCP credits?

@StellaAthena I do think this is a plausible pathway, but I am not able to spearhead it at the moment. I will try to assist any efforts though.

@baberabb
Contributor

baberabb commented Jan 11, 2024

> @baberabb can you share how you accessed it? Did it require GCP credits?

It's available through BigQuery, Google's SQL-based data warehouse. And yes, it charged me $20 even though I only made a few requests. I think if you still have free GCP credits, you can use those.
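
For anyone who wants to check the cost before running anything, a BigQuery dry run reports how many bytes a query would scan without billing for it. This is only a sketch under assumptions: the table name `patents-public-data.patents.publications` and the `*_localized` column layout are taken from my reading of the public dataset's schema and should be verified, the helper names are hypothetical, and the price per TiB is deliberately left as a parameter because it depends on your billing setup.

```python
def build_us_fulltext_query(limit: int = 10) -> str:
    """Build a query for the English text of US publications.

    Table and column names are assumptions about the public dataset's schema.
    """
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    return f"""
SELECT
  publication_number,
  (SELECT text FROM UNNEST(title_localized) WHERE language = 'en' LIMIT 1) AS title,
  (SELECT text FROM UNNEST(abstract_localized) WHERE language = 'en' LIMIT 1) AS abstract,
  (SELECT text FROM UNNEST(description_localized) WHERE language = 'en' LIMIT 1) AS description
FROM `patents-public-data.patents.publications`
WHERE country_code = 'US'
LIMIT {limit}
"""


def estimate_scanned_bytes(sql: str) -> int:
    """Ask BigQuery how many bytes the query would scan, without running it.

    Needs the third-party google-cloud-bigquery package and GCP credentials,
    so the import is kept local to this function.
    """
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=job_config)
    return job.total_bytes_processed


def estimated_cost_usd(scanned_bytes: int, usd_per_tib: float) -> float:
    """Convert scanned bytes to dollars at a caller-supplied on-demand rate."""
    return scanned_bytes / 2**40 * usd_per_tib


sql = build_us_fulltext_query(limit=5)
# estimate_scanned_bytes(sql) would contact BigQuery; it is not called here.
```

Note that on-demand billing is based on the columns scanned, so a `LIMIT` clause generally does not make a query over the full-text columns cheaper, which may be why a few requests added up quickly.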

@baberabb
Contributor

OK, I got trial access and did some more experimenting, and IMO we can just use the Google dataset. They provide full text for all US patent publications (not applications) and titles/abstracts for all others. Everything is in plain text, so it will be easy to format. There are 150M rows in total, and it seems to have the full US record through Oct 27, 2023.

sample extract here.
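
As a rough illustration of the formatting step, here is a sketch that turns query rows into JSON Lines records. The input keys (`publication_number`, `title`, `abstract`, `description`) and the output layout (`id`, `text`, `source`) are assumptions for the example, not a format this project has settled on, and the sample rows are invented.

```python
import json


def row_to_record(row: dict) -> dict:
    """Join the available text fields of one patent row into a single document."""
    parts = [row.get("title"), row.get("abstract"), row.get("description")]
    # Skip missing or empty fields; non-US rows may only have a title and abstract.
    text = "\n\n".join(p.strip() for p in parts if p and p.strip())
    return {
        "id": row["publication_number"],
        "text": text,
        "source": "google-patents-public-data",  # hypothetical source tag
    }


def to_jsonl(rows) -> str:
    """Serialize rows as one JSON object per line."""
    return "\n".join(json.dumps(row_to_record(r), ensure_ascii=False) for r in rows)


# Invented rows standing in for query results.
rows = [
    {
        "publication_number": "US-0000001-A",
        "title": "Example widget",
        "abstract": "A widget is described.",
        "description": "Detailed description of the widget.",
    },
    {"publication_number": "XX-0000002-A", "title": "Title only", "abstract": None},
]
jsonl = to_jsonl(rows)
records = [json.loads(line) for line in jsonl.splitlines()]
```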

@StellaAthena
Collaborator

> OK, I got trial access and did some more experimenting, and IMO we can just use the Google dataset. They provide full text for all US patent publications (not applications) and titles/abstracts for all others. Everything is in plain text, so it will be easy to format. There are 150M rows in total, and it seems to have the full US record through Oct 27, 2023.
>
> sample extract here.

Amazing!

@baberabb baberabb mentioned this issue Mar 1, 2024
3 tasks
This was referenced May 13, 2024

6 participants