Full text crawlers #101

Merged
merged 12 commits into from
Oct 9, 2015

Conversation

stefan-kolb
Member

This PR enables automatic PDF fulltext downloads.

Current catalogs:

  • ACS
  • arXiv
  • Springer
  • Sciencedirect (Elsevier)
  • Google Scholar

Questions:

  • Do we want to include more Crawlers? If yes, which?
  • The action was in the Tools menu. IMHO the full text should either be downloaded automatically or be offered via the download or auto button in the detailed entry view.

TODO:

  • Make it nonblocking and keep the progress bar!
  • Add an option to delete the local file if a wrong file was downloaded
  • Tests on Travis and CircleCI are blocked by Google with 403 Forbidden (bot detection)

@stefan-kolb stefan-kolb changed the title Science catalog full text crawlers [WIP] Science catalog full text crawlers Aug 14, 2015
@stefan-kolb stefan-kolb changed the title [WIP] Science catalog full text crawlers [WIP] Full text crawlers Aug 14, 2015
@mlep
Contributor

mlep commented Aug 17, 2015

About "Do we want to include more Crawlers? If yes, which?":
You have a gold mine in the section "Search (and import) from my specific database, please." in the sorted list of feature requests (this morning's email).

@stefan-kolb
Member Author

@mlep Thanks 👍

@simonharrer
Contributor

Your feature works pretty well. But there are cases in which a file can be downloaded that is not the intended one. Thus, there must be a way to preview the PDF before adding it.

@stefan-kolb
Member Author

We need to evaluate how often this will happen. Can you give me an example?

I think with a reasonable number of crawlers this will work with very high accuracy.
I would rather have the user open and check the file, then delete it and download the right one manually if it is wrong, since this will happen very rarely. With your approach we would need to preview every PDF, even though it will be the correct one more than 90% of the time.

@simonharrer
Contributor

OK, deletion after download is fine as long as it is easy to do. At the moment, I cannot easily delete the file link from the BibTeX entry AND also delete the file from the hard drive. Once we fix this, we do not need the check feature in the first place. Good idea!
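The "remove the link and delete the file in one step" idea could look roughly like the sketch below. `LinkedFiles` and `removeAndDelete` are hypothetical names used only for illustration; they are not JabRef's actual model classes.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an entry's list of linked files (not JabRef's real model).
final class LinkedFiles {
    private final List<Path> links = new ArrayList<>();

    void add(Path file) {
        links.add(file);
    }

    List<Path> links() {
        return List.copyOf(links);
    }

    // Removes the link from the entry and deletes the file from disk in one step.
    // Returns true only if a file was actually deleted.
    boolean removeAndDelete(Path file) {
        links.remove(file);
        try {
            return Files.deleteIfExists(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Using `Files.deleteIfExists` means a file that is already gone is not treated as an error, so the link is still cleaned up.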

My example was: I wanted to download the conference article Barros, Service Interaction Patterns, but got the techrep instead. Another issue was that I got a PDF of slides in a language I do not understand, probably Czech, when attempting to download a very old version of the BPMN standard.

@stefan-kolb stefan-kolb force-pushed the catalog-crawler branch 2 times, most recently from 43d0c7d to 3384c4c Compare August 18, 2015 12:49
@koppor
Member

koppor commented Aug 18, 2015

Regarding "clean all used fields from Latex stuff, e.g. {~ etc.": this should be part of the "Edit -> Cleanup entries" functionality, shouldn't it? If not, this functionality should be added there, too.

Regarding "check for duplicates", JabRef offers "Search -> find duplicates". This functionality should be reused here.

@koppor
Member

koppor commented Aug 18, 2015

A similar approach might have been taken by Christoph Lehner: https://sourceforge.net/p/jabref/discussion/318824/thread/6e5fea64/ Is it possible to synchronize somehow?
I think this makes https://github.com/wbrenna/LocalCopy obsolete, doesn't it?

@stefan-kolb
Member Author

The approach of C. Lehner is the base repository https://github.com/lehner/LocalCopy.
This PR is a native replacement for this and all forks.
Also, we don't support plugins anymore.

@lenhard
Member

lenhard commented Oct 9, 2015

What is missing to get this PR functional and integrated into master?

@lenhard lenhard assigned lenhard and simonharrer and unassigned lenhard Oct 9, 2015
@simonharrer
Contributor

This is just too messy to understand right now. Fixing this would require fixing the separation of the GUI event thread from the other parts of the code, which is a major effort.

The issue is how to implement Swing actions that require multiple user interactions while their task is running. Normally, this would require nesting SwingWorker classes, one for each step, and starting the next step within the EDT update method.
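A minimal sketch of that nesting, assuming a two-step action (look up a URL, then download it). `LookupWorker`, `DownloadWorker`, and the `example.org` URL are hypothetical placeholders, and the background work is faked with string operations. Each step does its work in `doInBackground()` and starts the next step from `done()`, which Swing invokes on the EDT:

```java
import java.util.concurrent.CompletableFuture;
import javax.swing.SwingWorker;

// Step 1 (hypothetical): look up the full-text URL off the EDT.
final class LookupWorker extends SwingWorker<String, Void> {
    private final String entryKey;
    private final CompletableFuture<String> outcome;

    LookupWorker(String entryKey, CompletableFuture<String> outcome) {
        this.entryKey = entryKey;
        this.outcome = outcome;
    }

    @Override
    protected String doInBackground() {
        // Stands in for a slow network lookup.
        return "https://example.org/" + entryKey + ".pdf";
    }

    @Override
    protected void done() {
        // Runs on the EDT: a real action could ask the user to confirm here
        // before the next step is started.
        try {
            new DownloadWorker(get(), outcome).execute();
        } catch (Exception e) {
            outcome.completeExceptionally(e);
        }
    }
}

// Step 2 (hypothetical): download the file off the EDT.
final class DownloadWorker extends SwingWorker<String, Void> {
    private final String url;
    private final CompletableFuture<String> outcome;

    DownloadWorker(String url, CompletableFuture<String> outcome) {
        this.url = url;
        this.outcome = outcome;
    }

    @Override
    protected String doInBackground() {
        // Stands in for the actual download.
        return "downloaded " + url;
    }

    @Override
    protected void done() {
        // Runs on the EDT: report the final result of the whole chain.
        try {
            outcome.complete(get());
        } catch (Exception e) {
            outcome.completeExceptionally(e);
        }
    }
}
```

Every additional user interaction adds another worker class and another hand-off in `done()`, which is why this pattern gets hard to follow quickly.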

What is more, sometimes SwingWorker is used and sometimes the Spin framework. All of this makes the issue even more complicated.

Someone else may take a look if they see this issue more clearly than me.

@simonharrer simonharrer removed their assignment Oct 9, 2015
@lenhard
Member

lenhard commented Oct 9, 2015

So you are saying that you will not complete the PR? I cannot see anybody else who will. There is no point in leaving this in limbo until it goes stale. So we can close this PR without merging and close the related open issues as won't fix.

@simonharrer: Please confirm!

@simonharrer
Contributor

Merge as good enough.

@stefan-kolb stefan-kolb changed the title [WIP] Full text crawlers Full text crawlers Oct 9, 2015
stefan-kolb added a commit that referenced this pull request Oct 9, 2015
@stefan-kolb stefan-kolb merged commit 18a6a0e into master Oct 9, 2015
@stefan-kolb stefan-kolb deleted the catalog-crawler branch October 9, 2015 13:34
@koppor koppor mentioned this pull request Oct 10, 2015
@lecc0r

lecc0r commented Dec 1, 2015

Hi, I've just tested the newly implemented "Full text article download" function. For several articles it downloaded a version from ResearchGate, but I want the one from Elsevier. Depending on who you work for, you have special access privileges on certain publisher sites, so it would be great to be able to specify in the preferences which crawler the download function should prefer. Perhaps even define a priority list.

@stefan-kolb
Member Author

There is a hard-coded priority list right now which prefers the official publishers over Google Scholar, for example. If you want to file a new feature request or enhancement, please create a separate issue.
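As an illustration only, a priority list of this kind could be sketched as follows. The names `FullTextFinder` and `PrioritizedFullTextFinder` are assumptions made for this example, not the classes merged in this PR. Finders are tried in order, and the first one that returns a result wins:

```java
import java.util.List;
import java.util.Optional;

// Hypothetical finder interface: returns a PDF URL for a DOI, or empty if none is found.
interface FullTextFinder {
    Optional<String> findFullText(String doi);
}

// Tries the given finders in order; the list order is the priority order.
final class PrioritizedFullTextFinder implements FullTextFinder {
    private final List<FullTextFinder> finders; // highest priority first

    PrioritizedFullTextFinder(List<FullTextFinder> finders) {
        this.finders = List.copyOf(finders);
    }

    @Override
    public Optional<String> findFullText(String doi) {
        for (FullTextFinder finder : finders) {
            Optional<String> result = finder.findFullText(doi);
            if (result.isPresent()) {
                return result; // first hit wins, lower priorities are not asked
            }
        }
        return Optional.empty();
    }
}
```

With this shape, a user-defined preference would only need to reorder the list passed to the constructor, for example putting a publisher finder ahead of a Google Scholar finder.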

@koppor
Member

koppor commented Dec 1, 2015

Done at #435.

InAnYan added a commit that referenced this pull request Aug 3, 2024
github-merge-queue bot pushed a commit that referenced this pull request Aug 14, 2024
* Fix the code from code review

* Fix from code review and create new AiChatTabWorking

* Improve chat history storage code

* More fix from code review

* Remove obsolete parameter

* Add JavaDoc comment

* Fix checkstyle

* Fix JavaDoc

* Fix more checkstyle

* More checkstyle fixes

* Fix code changes

* Improve the PR

* Rework ADR-0031 to enable to use another option

* Add many LOGGER.trace statements

* Change "message window" to "context window"

* Fix compiler errors

* Fix issue list index issue of langchain4j

* Fix lint issue

* Update 0031-store-chats-alongside-database.md

* More tracing

* Refine logging

* Remove closing of AiChatLanguageModel (because it's not closable)

* Use external package for OpenAI API connection

* Provide a custom executor for RetrievalAugmentor

* Fix shutdown issue (I hope)

* Refactor classes

* Change BibDatabaseChatHistoryFile

* Revert BibDatabaseChatHistoryFile to old version because of langchain4j

* Make round corners for chat messages

* Refactor embeddings generation

* Refactor embeddings generation

* Refactor embeddings generation

* Fix CHANGELOG.md

* Remove jpro-mdfx

* Add comment

* Fix localizations

* Fix checkstyle and remove OpenAI from PRIVACY.md

* Remove unnecessary comments

* Fix privacy notice UI

* Introduce new ApiKeyMissingComponent

* Thanks Tobias Diez for writing such a good EntryEditorTab class

* Fix InAnYan/jabref issues

* Merge `build.gradle` and `settings.gradle` from main branch

* Update ADRs

* Implement rethought ADR for chat history

* Use OpenAI embedding model

* Use Deep Java embedding model

* Remove old langchain4j embedding models

* Fix checkstyle errors

* Fix checkstyle and remove old dependencies

* Fixes from code review

* Restructure

* Fix checkstyle errors

* Add API base URL parameter

* Fix localization

* Fix from code review + ADR

* Something broken

* Now MistralAI and Hugging Face work

* Fix base URL for other LLM providers

* Fix base URL for other LLM providers

* Refactor MVStore usage

* Load embedding model in background

* Bump langchain4j version

* Fix bug

* Fix checkstyle and localization

* Implement summarization

* Fix checkstyle and localization

* Improve PrivacyNoticeComponent

* Fix from code review

* Update localization

* Wrap text

* Add padding

* Fix markdown

* Use stuff algorithm

* Add GPT-4o-mini

* Make chat model editable

* Update context window size and summarization

* Fix checkstyle

* Update PrivacyNoticeComponent.fxml

* Update AI summary tab

* Fix localization

* Change order so that there is no diff

* Reorder dependencies

* Add missing CHANGELOG.md entry

* Refine ADR-0033

* Refine ADR-0034

* Fix typos

* Refine ADR-0036

* Fix ADR-0037

* Fix title case

* Fix changes in module-info.java

* Readd removed requires org.apache.httpcomponents.core5.httpcore5

* Revert change in JabRefGUI to avoid conflicts

* Remove empty lines

* Reorder entries in JabRef_en.properties

* Simplify SummariesStorage (and add test)

* Use region/endregion

* Fix position of comment

* Add comment why the event bus is needed

* Do not show exception to the user - just that an error occurred (saves %0 in localization)

* Use "URL %0" without colon (consistency)

* Fix typos

* History has to be kept

* Remove empty lines

* Fix language (hopefully)

* Compilefix

* Simplify BibDatabaseChatHistoryManager

* Fix from code review

* Fix issue #103

* Rework embeddings cache clearing

* Fix #99 and partially #101

* Partially fixing shutdown issues and UI progress monitor issue

* Add "requires scala.library" and add "region:" / "endregion"

* More grouping (move de.saxsys.mvvmfx.validation up)

* Add alphabetical hint

* Fix InAnYan#101 and InAnYan#106

* Discard changes to settings.gradle

* Fix InAnYan#105

* Follow-up fix for InAnYan#103

* Follow-up fix for InAnYan#103

* Remove obsolete class

* Partially fix InAnYan#98

* We do need dependencies to the AI providers, don't we?

* Fix InAnYan#93

* Simplify code

* Partially fix InAnYan#92

* Fix checkstyle and localization

* Fix hyperlinks and text in ApiKeyMissingComponent

* Fixes from code review

* Fix InAnYan#120

* Remove "X% work done" messages

* Fix InAnYan#114

* Partially fix InAnYan#113

* Partially fix InAnYan#110

* Fix InAnYan#110

* Fix InAnYan#111

* Improve embedding model downloading notifications

* Fix InAnYan#124

* Fix InAnYan#122

* Fix wrong context window size when expert settings customization is turned off

* Attempt to fix InAnYan#95

* Finally fix InAnYan#105

* Fix InAnYan#108

* Attempt to fix InAnYan#98

* Fix for InAnYan#104

* Fix for InAnYan#98

* Fix for InAnYan#95 (comment)

* Fix for InAnYan#98 (comment)

* Fix for InAnYan#126

* Fix for InAnYan#115

* Fix for InAnYan#113

* Fix for InAnYan#91

* Fix for InAnYan#121

* Fix for InAnYan#112 and InAnYan#116

* Fix for InAnYan#125

* Fixes from commit comments

* Fix for InAnYan#115

* Fix for InAnYan#120

* Fix for InAnYan#132

* Fix for InAnYan#132

* Fix for InAnYan#104

* Fix for InAnYan#118

* Fix for InAnYan#114

* Fix for InAnYan#104

* Store error messages in chat history

* Make error be a ChatMessageComponent

* Implement delete messages InAnYan#136

* Fix for InAnYan#118

* Fix for InAnYan#92

* Fix checkstyle and localization. And refactoring

* Fix for InAnYan#92

* Fix for InAnYan#139

* Show "Delete message" button only when necessary

* Fix for InAnYan#83

* Update src/main/java/org/jabref/logic/ai/AiService.java

Co-authored-by: Oliver Kopp <kopp.dev@gmail.com>

* Update src/main/java/org/jabref/logic/ai/chathistory/BibDatabaseChatHistoryManager.java

Co-authored-by: Oliver Kopp <kopp.dev@gmail.com>

* Update src/main/java/org/jabref/logic/ai/AiService.java

Co-authored-by: Oliver Kopp <kopp.dev@gmail.com>

* Update src/main/java/org/jabref/gui/Base.css

Co-authored-by: Oliver Kopp <kopp.dev@gmail.com>

* Update src/main/java/org/jabref/gui/Base.css

Co-authored-by: Oliver Kopp <kopp.dev@gmail.com>

* Fix from code review

* Partial fix for InAnYan#125

* Update colors for error message

* Fix for InAnYan#145 and InAnYan#142

* Make progress for embedding model download

* Fix checkstyle and localization

* Add workaround to get FileHistoryMenuTest running again

* Small fixes

* Revert "Small fixes"

This reverts commit 85382a1.

* Introduce AiApiKeyProvider

* Fix IDE setup instructions

* Do not load API keys on startup

* Rely on keystore encryption

* Prevent multiple rebuilds when multiple preferences are updated

* Fix localization to be more provider independent

* Fix method names

* Add poor man's solution to notify of API key changes

* Reduce calls to key store (and fix key saving)

* Fix for InAnYan#148 and partially InAnYan#146

* Revert "Fix for InAnYan#148 and partially InAnYan#146"

This reverts commit 5fa3bb5.

* Fix for scrolling down when deleting a message

* Sort EmbeddingModel enum variants

* Fix GenerateSummaryTask progress indication

* Fix dark mode

* Add notice for embedding models size

---------

Co-authored-by: Oliver Kopp <kopp.dev@gmail.com>
6 participants