pqai-db
This service intends to store the original copies of the documents, which may be retrieved in the search pipeline.
These documents need not just be patents. They may also be research papers, invention disclosures, etc. Storage of images (patent drawings) is also supported.
Most patent retrieval systems need to store document data (e.g. patent records) in some form. Depending on the use case, the system may store this data in machine-readable form (e.g. JSON or XML) or in human-readable form (e.g. PDF files). This service provides a way to store and retrieve arbitrary documents in any of these forms, depending on the use case.
The actual storage medium can take many forms. In the current implementation, three types are supported:
- Local storage (on a hard disk)
- Cloud storage (on S3)
- Database (on MongoDB)
The service can be configured to use one of or combine the above storage mediums in a flexible manner. For instance, it's possible to store bibliographic data of documents in a database, their full text as plain JSON files on local disk, and their PDFs on the cloud.
The users of this service don't see the storage details, although latency may differ from one storage medium to another (local storage would be fastest, cloud storage slowest).
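Because all storage backends share a common interface (described below), such mixing can be sketched as a thin routing layer. The `CompositeStorage` and `DictBackend` classes here are purely illustrative and are not part of pqai-db:

```python
# Illustrative sketch only: CompositeStorage and DictBackend are NOT part
# of pqai-db; they show how one service can mix storage backends.

class DictBackend:
    """In-memory stand-in for a real backend (local disk, S3, MongoDB)."""

    def __init__(self):
        self.data = {}

    def get(self, key):
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value


class CompositeStorage:
    """Dispatch each call to a backend chosen by key prefix."""

    def __init__(self, routes):
        self._routes = routes  # list of (prefix, backend) pairs

    def _backend(self, key):
        for prefix, backend in self._routes:
            if key.startswith(prefix):
                return backend
        raise KeyError(f"no backend configured for {key!r}")

    def get(self, key):
        return self._backend(key).get(key)

    def put(self, key, value):
        self._backend(key).put(key, value)


# Route bibliographic records and PDFs to different backends
bib, pdfs = DictBackend(), DictBackend()
storage = CompositeStorage([("bibliography/", bib), ("pdfs/", pdfs)])
storage.put("bibliography/US7654321B2.json", b'{"title": "..."}')
storage.put("pdfs/US7654321B2.pdf", b"%PDF-1.4 ...")
```

In a real deployment the two `DictBackend` instances would be replaced by the wrappers described in the core modules section below.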
```
root
|-- core
|   |-- storage.py        // defines wrappers over storage types
|-- tests
|   |-- test_server.py    // tests for the REST API
|   |-- test_storage.py   // tests for the storage module
|-- main.py               // defines the REST API
|-- requirements.txt      // list of Python dependencies
|-- Dockerfile            // Docker image definition
|-- docker-compose.yml
|-- env                   // .env file template
|-- deploy.sh             // script for setting up on a local system
```
The storage module defines wrappers over different types of physical storage media (local, cloud, and database). These wrappers inherit their interface from an abstract class named `Storage`, which provides the following methods:
- `get`: fetch an item by its identifier
- `ls`: list all items with a given prefix (which can be a path)
- `exists`: check whether an item with the given identifier exists in the storage
- `remove`: delete an item
- `put`: add a new item
Note that not all of the above functionality is exposed by the service through its public-facing REST API in its current implementation. Specifically, adding and removing items is not exposed.
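As a sketch, this interface can be expressed as a Python abstract base class. The actual class lives in `core/storage.py` and may differ in details:

```python
from abc import ABC, abstractmethod

# Sketch of the Storage interface described above; the actual abstract
# class in core/storage.py may differ in details.

class Storage(ABC):

    @abstractmethod
    def get(self, key):
        """Fetch an item by its identifier."""

    @abstractmethod
    def ls(self, prefix=""):
        """List all items with the given prefix (which can be a path)."""

    @abstractmethod
    def exists(self, key):
        """Check whether an item with the given identifier exists."""

    @abstractmethod
    def remove(self, key):
        """Delete an item."""

    @abstractmethod
    def put(self, key, value):
        """Add a new item."""
```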
The actual wrappers are as follows:
LocalStorage
is a wrapper over the local file system. It can be initialized with the absolute path to a local directory, which is given by the `root` argument. The `root` folder can contain further subfolders. Files are identified by their path relative to the root folder.
Typical usage:
```python
import json
from core.storage import LocalStorage

root = "/home/ubuntu/documents"
storage = LocalStorage(root)

contents = storage.get("US7654321B2.json")  # bytes
patent_data = json.loads(contents)          # dict
```
S3Bucket
is a wrapper over an AWS S3 bucket. It can be initialized with a `boto3` S3 client and the name of the S3 bucket where the data is stored.
To use this wrapper, you need to have the appropriate environment variables listed in the `.env` file. The following three variables are required:

- `AWS_ACCESS_KEY_ID`
- `AWS_SECRET_ACCESS_KEY`
- `AWS_S3_BUCKET_NAME`
Typical usage:
```python
import json
import os

import boto3
from botocore.config import Config

from core.storage import S3Bucket

config = Config(
    read_timeout=400,
    connect_timeout=400,
    retries={"max_attempts": 0},
)
credentials = {
    "aws_access_key_id": os.environ["AWS_ACCESS_KEY_ID"],
    "aws_secret_access_key": os.environ["AWS_SECRET_ACCESS_KEY"],
}
botoclient = boto3.client("s3", **credentials, config=config)
bucket_name = os.environ["AWS_S3_BUCKET_NAME"]
storage = S3Bucket(botoclient, bucket_name)

obj = storage.get("patents/US7654321B2.json")  # bytes
patent_data = json.loads(obj)                  # dict
```
MongoDB
is a wrapper over a MongoDB database. It can be initialized with an instance of `pymongo.MongoClient`, a database name, a collection name, and the name of the attribute that stores the identifier (e.g. `publicationNumber`).
To use this wrapper, you need to have the appropriate variables listed in the `.env` file. The following variables are required:

- `MONGO_HOST`
- `MONGO_PORT`
- `MONGO_USER`
- `MONGO_PASS`
- `MONGO_DB`
- `MONGO_COLL`
If you are using a MongoDB installation on your local system, `MONGO_HOST` can be set to `localhost`. Unless you have changed it, MongoDB is accessible on port `27017`, which is the default value for `MONGO_PORT`. Unless your MongoDB is protected with a username and password, you can leave the `MONGO_USER` and `MONGO_PASS` fields blank in the `.env` file. If you are using the database dump supplied by PQAI, `MONGO_DB` should be set to `pqai` and `MONGO_COLL` should be set to `bibliography`.
Typical usage:
```python
import json
import os

from pymongo import MongoClient

from core.storage import MongoDB

host = os.environ["MONGO_HOST"]
port = int(os.environ["MONGO_PORT"])
mongo_client = MongoClient(host, port)

db = os.environ["MONGO_DB"]
coll = os.environ["MONGO_COLL"]
field = "publicationNumber"
storage = MongoDB(mongo_client, db, coll, field)

obj = storage.get("US7654321B2")  # bytes
patent_data = json.loads(obj)     # dict
```
The service does not require any assets as such.
For any realistic testing or experimentation, however, you will need to connect it to a data source. A good starting point is the US patent bibliography data, which you can download free of cost from the PQAI S3 bucket:
https://s3.amazonaws.com/pqai.s3/public/pqai-mongo-dump.tar.gz
The above data is in the form of a MongoDB database dump, which can be restored on your local system.
Make sure you have at least 30 GB of free space on your system before downloading and restoring this database dump.
It contains bibliographic details (title, abstract, CPC classes, inventor and assignee names, citations, etc.) of about 13 million US patents and published applications. Once restored into the database, individual datapoints can be retrieved as JSON documents.
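The download-and-restore steps can be sketched as follows. The directory name inside the archive is an assumption; inspect the archive with `tar -tzf` first, and adjust `mongorestore` flags if your MongoDB requires authentication:

```shell
# Download the dump (large file; ensure ~30 GB of free space first)
wget https://s3.amazonaws.com/pqai.s3/public/pqai-mongo-dump.tar.gz

# Extract it; the `dump/` directory name is assumed here, confirm with:
#   tar -tzf pqai-mongo-dump.tar.gz
tar -xzf pqai-mongo-dump.tar.gz

# Restore into a running local MongoDB instance
mongorestore dump/
```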
Prerequisites
The following instructions assume that you are running a Linux distribution and have Git and MongoDB installed on your system.
Setup
The easiest way to get this service up and running on your local system is to follow these steps:
1. Clone the repository:

   ```
   git clone https://github.com/pqaidevteam/pqai-db.git
   ```

2. Using the `env` template in the repository, create a `.env` file and set the environment variables:

   ```
   cd pqai-db
   cp env .env
   nano .env
   ```

3. Run the `deploy.sh` script:

   ```
   chmod +x deploy.sh
   bash ./deploy.sh
   ```
This will create a Docker image and run it as a Docker container on the port number you specified in the `.env` file.
Alternatively, after following steps (1) and (2) above, you can use the command `python main.py` to run the service in a terminal.
This service is not dependent on any other PQAI service for its operation.
The following services depend on this service:
- pqai-gateway