Mahesh Maan edited this page Oct 11, 2021 · 4 revisions

This page lists some technical challenges specific to AI-based prior-art searching:

Query understanding

  • How to measure the ‘quality’ of a query? How to distinguish well-formed queries from ill-formed ones?

    Natural language queries formulated for prior-art searching describe technical inventions. Precision in that description is a sought-after quality. In the real world, however, not all users formulate queries well.

    Some users may create queries that are too short to describe the invention. For example, the query “adjustable mechanical keyboard” does not say enough to be very useful for running a prior-art search.

    Other users may create overly verbose queries that focus so much on the non-technical parts of a technical idea that the signal-to-noise ratio of the query drops. Think of descriptions with very elaborate background art.

    Other queries may be too broad and open to interpretation. Typically this happens when claim language itself is used as a query. In claims, a pen, for instance, can be described as a “marking device” - which is hardly a good term for a prior-art search system.

    Most of these issues can be resolved by providing the user feedback about whether the query is good or not, and asking the user to re-articulate it if required.
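
    As a rough illustration of such feedback, the sketch below flags queries by word count alone. The function name and both thresholds are assumptions chosen for the example, not values from any real system; a practical check would also need to look at content, not just length.

```python
def query_feedback(query, min_words=8, max_words=150):
    """Return a feedback message, or None if the query length looks acceptable.

    min_words and max_words are illustrative thresholds (assumptions).
    """
    n_words = len(query.split())
    if n_words < min_words:
        return "Query may be too short; describe the key technical features."
    if n_words > max_words:
        return "Query may be too verbose; focus on the technical features."
    return None


print(query_feedback("adjustable mechanical keyboard"))
```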

  • How to model and separate the ‘topic’ and ‘details’ in a prior-art search query?

    Prior-art queries typically have a hierarchical structure, in which a high-level feature is further characterised by lower-level features. Take this simple query: “a pen with an LED”. The “pen” is the high-level feature here and the “LED” is the low-level feature. This distinction is important because unless the high-level feature (pen) is matched in a document, matching a lower-level feature (LED) makes little sense.

    Such matching requirements can be addressed by conditional matching. The query can be modelled as a graph in which matching lower-level nodes contribute to the score only if the higher-level nodes are matched in the document.

    Processing the query to segregate the features into higher and lower levels is another problem. It can be solved either by developing query parsing techniques or by UIs that let users add technical features in a hierarchical manner without making query building too complex.
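
    A minimal sketch of conditional matching, assuming the query has already been parsed into a feature tree and each document is reduced to a set of terms. The dictionary layout and the function name are hypothetical; exact term lookup stands in for whatever matcher a real system would use.

```python
def conditional_score(node, doc_terms):
    """Score a feature tree against a document's set of terms.

    A node adds 1 to the score only if it is matched, and its children
    are evaluated only when the node itself is matched.
    """
    if node["term"] not in doc_terms:
        return 0
    children = node.get("children", [])
    return 1 + sum(conditional_score(child, doc_terms) for child in children)


# "a pen with an LED": pen is the high-level feature, LED the low-level one.
query = {"term": "pen", "children": [{"term": "led"}]}

print(conditional_score(query, {"pen", "led"}))   # 2
print(conditional_score(query, {"lamp", "led"}))  # 0: LED is ignored without pen
```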

  • How to segment a prior-art search query into self-contained feature sub-queries that can be searched independently?

    Most new inventions filed with patent offices are not exact replicas of prior inventions. Instead, most of them are “combinations” of prior inventions and are rejected by patent examiners on that ground.

    Taking a simple example: “a pen with an LED and an eraser” may be novel in the strict sense, but if there exist pens with LEDs and pens with erasers in the prior art, such an invention is unlikely to get a patent.

    Current prior-art search technology cannot bring up multiple documents that complement each other’s matching parts. This can be done by parsing queries to create independently searchable sub-queries, such as “a pen with an LED” and “a pen with an eraser”. Mechanically breaking up the query (pen | eraser | LED) does not help because the sub-parts do not make up meaningful queries on their own.
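
    The sketch below shows only the final recombination step, assuming the parser has already separated the query into a topic and a list of details (which is the hard part and is not shown). The function name and the fixed “with” connector are assumptions for illustration.

```python
def feature_subqueries(topic, details):
    """Pair the topic with each detail so every sub-query stays meaningful."""
    return [f"{topic} with {detail}" for detail in details]


print(feature_subqueries("a pen", ["an LED", "an eraser"]))
# ['a pen with an LED', 'a pen with an eraser']
```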

  • How to model the logical structure of a prior-art search query (e.g. technical entities and processes described therein and their relationships)?

    The query “mobile phone sends location information to a cellphone tower” is very different from “a cellphone tower sends location information to the mobile phone”. An ideal prior-art search technology needs to take the entities and the relationships between them into account.
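
    To make the difference concrete, the sketch below represents each query as a hypothetical (subject, action, object) triple, written by hand rather than extracted automatically. A bag-of-words view cannot tell the two queries apart, while the structured view can.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    subject: str
    action: str
    obj: str


q1 = Relation("mobile phone", "sends location information to", "cellphone tower")
q2 = Relation("cellphone tower", "sends location information to", "mobile phone")


def bag_of_words(relation):
    """Flatten a relation into an unordered set of words."""
    return set(" ".join(relation).split())


print(bag_of_words(q1) == bag_of_words(q2))  # True: word sets are identical
print(q1 == q2)                              # False: roles are swapped
```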

  • How to transform a patent claim into an effective search query?

    When it is known that a query is actually a claim of a published patent, the information contained in the claim can be significantly augmented by additional information available in dependent claims and the patent description. This can be done, for example, by replacing vague terminology with simpler terms (“pen” in place of “marking instrument”).

    This requires parsing the patent specification to locate definitions of the query (claim) terms and re-articulating the query behind the scenes.
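
    A sketch of the re-articulation step only, assuming a mapping from vague claim terms to simpler terms has already been obtained from the specification. The mapping, the function name, and the example claim text are all hypothetical.

```python
import re


def rewrite_claim(claim, definitions):
    """Replace vague claim terms with the simpler terms they map to."""
    for vague, plain in definitions.items():
        pattern = rf"\b{re.escape(vague)}\b"
        claim = re.sub(pattern, plain, claim, flags=re.IGNORECASE)
    return claim


definitions = {"marking instrument": "pen"}  # assumed to come from the description
print(rewrite_claim("A marking instrument comprising a light source", definitions))
# A pen comprising a light source
```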

Search quality and performance

  • How to segment a patent corpus so as to minimise the number of segments that need searching for a typical query?

    A prior-art search query typically relates to a well-defined technology area, which can be considered as the highest-level feature of the query. It could, for example, be drones for one query or cancer-treatment for another query.

    To reduce computation and search time in production, it is generally helpful to segment the entire prior-art corpus into such areas and look into only some of them for any given query. However, many patents relate to multiple areas. A smart watch that measures the blood glucose level of the wearer, for example, relates to the medical domain as well as to wrist watches.

    Associating these areas to various patents and mapping these areas to a particular query are interesting research problems.
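
    A minimal sketch of area-based segmentation, assuming the area labels for each patent and for the query are already known (obtaining them is the research problem described above). Patent IDs, area names, and function names are made up for the example. A patent with several areas is simply placed in several segments.

```python
from collections import defaultdict


def build_segments(patent_areas):
    """Map each area to the set of patents assigned to it."""
    segments = defaultdict(set)
    for patent_id, areas in patent_areas.items():
        for area in areas:
            segments[area].add(patent_id)
    return segments


def candidates(segments, query_areas):
    """Collect patents from only the segments relevant to the query."""
    result = set()
    for area in query_areas:
        result |= segments.get(area, set())
    return result


segments = build_segments({
    "P1": ["medical", "wrist watches"],  # glucose-measuring smart watch
    "P2": ["drones"],
    "P3": ["medical"],
})
print(sorted(candidates(segments, ["wrist watches"])))  # ['P1']
```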

  • Which representations for patent documents maximize search accuracy and performance?

    In any production-grade search technology, pre-computed representations of documents are used at search time. Conventional representations use term indexes and newer representations use patent embeddings. A number of other representations are also known. For each of these representations, there also exist various algorithms to compute them. Which representations and algorithms work best is still being explored.
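
    The two families can be contrasted with a toy sketch: an inverted term index built from whitespace tokens, and cosine similarity over embedding vectors. The documents and function names are invented for illustration, and how the embedding vectors would be produced is left open.

```python
import math
from collections import defaultdict


def build_term_index(docs):
    """Term index: map each lowercase token to the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def cosine(a, b):
    """Embedding comparison: cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


index = build_term_index({"D1": "pen with LED", "D2": "pen with eraser"})
print(sorted(index["pen"]))  # ['D1', 'D2']
```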

  • How can patent metadata (e.g. citations and classifications) be leveraged to improve prior-art search performance?

    The CPC patent classification contains about 250,000 technology classes which have hierarchical relationships with each other. There are hundreds of millions of citation relationships among patents. Each patent also contains technical diagrams.

    All this metadata hasn’t yet been fully leveraged along with the patent’s text to make prior-art searches more effective. Part of the reason is that these datasets are so dissimilar in nature that using them all for a single task is a huge challenge. Newer techniques that can encode any of these datapoints into uniform representations, however, may make it possible.

  • How to gather, represent, and use ontological knowledge in prior-art searching?

    Domain specific ontological knowledge such as ‘is-a’ relationships can make prior-art searching more robust to different articulations of the same concept by different users. However, gathering such information through manual or automatic means, storing and using it in a search pipeline are open problems.
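
    One simple way to use ‘is-a’ knowledge is to expand a query term with its broader terms. The sketch below assumes a tiny hand-written ontology stored as a child-to-parent dictionary; the entries and the function name are illustrative only.

```python
# Hypothetical 'is-a' relationships: child term -> parent term.
IS_A = {
    "pen": "writing instrument",
    "pencil": "writing instrument",
    "writing instrument": "stationery",
}


def ancestors(term, is_a):
    """Walk up the 'is-a' chain and return the broader terms in order."""
    chain = []
    seen = {term}  # guards against accidental cycles in the ontology
    while term in is_a and is_a[term] not in seen:
        term = is_a[term]
        chain.append(term)
        seen.add(term)
    return chain


print(ancestors("pen", IS_A))  # ['writing instrument', 'stationery']
```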

Intuitive and steerable search

  • How to formulate the task of relevance feedback for prior-art searching? What type of datasets can be used for training an ML model for it?

    Prior-art searching is a recursive process. Professional prior-art searchers who do boolean searching refine their queries iteratively as they assess the results (and the noise that comes up) during the search. This refinement is mostly done by adding or removing terms from their queries.

    In natural language searching, where paragraph-long queries are involved, it is difficult for users to refine their queries in this manner. However, it may be possible for the search system itself to observe user behaviour and refine the query or its interpretation behind the scenes.

    For example, for a query relating to keyboards, if the user spends more time on results about computer keyboards and ignores results about mobile phone keyboards, that information can be used to refine the result set.

    The system can even do “mini-experiments”, where it creates a hypothesis based on one likely interpretation of the query and predicts whether the user will engage with or ignore a result. Depending on the actual behaviour of the user, the system can then keep or revise its interpretation.

    Such a feedback loop can be explicit (based on explicit feedback given by the user) or implicit (transparent to the user / based on engagement with the search results).
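
    One classical way to fold such feedback into a query is a Rocchio-style update, sketched below as one possible formulation rather than the one this page prescribes. Queries and results are assumed to be sparse term-weight dictionaries, and the alpha, beta and gamma weights are illustrative defaults.

```python
def rocchio_update(query_vec, engaged, ignored, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query towards engaged results and away from ignored ones."""
    terms = set(query_vec)
    for doc in engaged + ignored:
        terms |= set(doc)

    def mean_weight(docs, term):
        if not docs:
            return 0.0
        return sum(doc.get(term, 0.0) for doc in docs) / len(docs)

    return {
        term: alpha * query_vec.get(term, 0.0)
        + beta * mean_weight(engaged, term)
        - gamma * mean_weight(ignored, term)
        for term in terms
    }


updated = rocchio_update(
    {"keyboard": 1.0},
    engaged=[{"keyboard": 1.0, "computer": 1.0}],  # results the user spent time on
    ignored=[{"keyboard": 1.0, "mobile": 1.0}],    # results the user skipped
)
print(updated["computer"] > 0 > updated["mobile"])  # True
```

    After the update, “computer” carries positive weight and “mobile” negative weight, which mirrors the keyboard example above.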

  • How to impart interpretability to search results (explain/justify why a result came up)?

    One upside of conventional boolean searching is that the users “understand” why any result has come up on the screen (typically it’s because it contains the same keywords that the user typed into the search box). With searching that is based on deep learning, however, the results come up due to a number of “weak signals” adding up together and it is not readily evident to the user why an irrelevant result has come up on the screen.

    This makes it difficult for users to grasp what they need to do next. Should they continue further down the list and see more results? Or should they change the query, and if so, what should they change? This gives users the feeling that they are ‘not in control’.

    Users might even be willing to trade some accuracy in the results for a feeling of control over the search. That’s because when users can control the search, they may be able to reach the result they want within a few iterations.

Ease of assessing results

  • How to extract informative passages from the patent’s specification which are relevant for a prior-art search query?

    Most of a prior-art searcher’s time goes not into formulating a query but into going through the search results. Patents are typically long documents (9-10 pages of text), and when a long list of patent numbers pops up on the screen, each of which has to be clicked and opened in a new browser tab, most users feel overwhelmed by information overload.

    Even if a patent contains all the relevant information the user needs, finding it within a long specification is a boring and effort intensive task.

    It helps if the search technology provides the best information snippets from within the specification, so that users can judge right from the search screen whether a result deserves to be read in more detail or not.
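
    As a baseline sketch, passages can be ranked by how many query words they share. The function name is hypothetical, and plain word overlap (which also counts stop words such as “an”) is only a stand-in for a real relevance model.

```python
import re


def best_passages(query, passages, top_k=2):
    """Return the top_k passages sharing the most distinct words with the query."""
    query_terms = set(re.findall(r"\w+", query.lower()))

    def overlap(passage):
        return len(query_terms & set(re.findall(r"\w+", passage.lower())))

    return sorted(passages, key=overlap, reverse=True)[:top_k]


passages = [
    "The housing is made of plastic.",
    "The pen includes an LED near the tip.",
    "An eraser is attached to the pen cap.",
]
print(best_passages("a pen with an LED", passages, top_k=1))
# ['The pen includes an LED near the tip.']
```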

  • How to condense (typically large) passages matching a search query?

    A query usually matches multiple passages, and each of them is usually a long string of text (patents contain very lengthy sentences). Dumping a lot of text for each patent on the search screen does not help much. What is required is a way to extract understandable snippets of text and present them in a sequence that makes overall sense.

  • How to identify context-independent passages from a patent’s specification?

    The snippets shown on a search screen are read by the user outside the context in which they were originally written. A snippet of the following form: “In addition to the aforementioned aspects, the base portion 203 and the side wall 204 are joined by …” is not very useful because it refers to information (drawings and paragraphs) not shown on the screen.

    By developing classifiers which can rank passages according to the extent to which their meaning is dependent on other content in the document, for example, more useful snippets can be extracted from the patent texts.
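
    Before training a classifier, a rule-based stand-in can illustrate the idea: count surface cues that point outside the passage, such as back-references, figure mentions and reference numerals. The cue list and the function name below are assumptions for the example, and such rules would miss many cases a trained model could catch.

```python
import re

# Illustrative cues that a passage leans on text or drawings elsewhere.
CONTEXT_CUES = [
    r"\baforementioned\b",
    r"\bin addition to\b",
    r"\bas (?:shown|described|mentioned) (?:above|below|earlier)\b",
    r"\b(?:fig|figure)s?\.? ?\d+",
    r"\b[a-z]+ \d{2,4}\b",  # reference numerals such as "portion 203"
]


def context_dependence(passage):
    """Count context cues; a higher count suggests a less self-contained passage."""
    text = passage.lower()
    return sum(len(re.findall(pattern, text)) for pattern in CONTEXT_CUES)


dependent = ("In addition to the aforementioned aspects, the base portion 203 "
             "and the side wall 204 are joined by")
standalone = "A pen includes an LED mounted near its tip."
print(context_dependence(dependent) > context_dependence(standalone))  # True
```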
