Unable to automatically obtain labels #327

Closed
hongruilin opened this issue Jul 26, 2024 · 16 comments
Labels
question Further information is requested

Comments

@hongruilin

This is my .env configuration file. Not only can't it retrieve tags, but some English websites also can't retrieve images. Even after switching servers in several countries, it still doesn't work. If you see my post, could you send me the .env contents of the demo website?

@djl0

djl0 commented Jul 26, 2024

I just installed this last night, and I'm having the same issue. When I add a link directly, it is unable to fetch anything (no metadata other than the URL). When I try to add a bookmark via the CLI, it says: Error: Failed to add a link bookmark for url "<url>". Reason: fetch failed. Looking at the docker logs (not quite sure what I should be looking for), there are quite a few errors. The workers container shows an error that it can't connect to the chrome container. Chrome's log is:

[0726/150514.678798:ERROR:bus.cc(407)] Failed to connect to the bus: Failed to connect to socket /var/run/dbus/system_bus_socket: No such file or directory
[0726/150514.686022:ERROR:bus.cc(407)] Failed to connect to the bus: Failed to connect to socket /var/run/dbus/system_bus_socket: No such file or directory
[0726/150514.686100:ERROR:bus.cc(407)] Failed to connect to the bus: Failed to connect to socket /var/run/dbus/system_bus_socket: No such file or directory
[0726/150514.690042:WARNING:dns_config_service_linux.cc(427)] Failed to read DnsConfig.
[0726/150514.841258:INFO:policy_logger.cc(145)] :components/policy/core/common/config_dir_policy_loader.cc(118) Skipping mandatory platform policies because no policy file was found at: /etc/chromium/policies/managed
[0726/150514.841296:INFO:policy_logger.cc(145)] :components/policy/core/common/config_dir_policy_loader.cc(118) Skipping recommended platform policies because no policy file was found at: /etc/chromium/policies/recommended

DevTools listening on ws://0.0.0.0:9222/devtools/browser/77a167b4-93da-436e-aa5f-3286b81ca42b
[0726/150514.858411:WARNING:bluez_dbus_manager.cc(248)] Floss manager not present, cannot set Floss enable/disable.
[0726/150514.924572:WARNING:sandbox_linux.cc(418)] InitializeSandbox() called with multiple threads in process gpu-process.
[0726/150514.981362:WARNING:dns_config_service_linux.cc(427)] Failed to read DnsConfig.

The web container shows a timeout, but not sure if it's downstream of chrome or worker:

Error: connect ETIMEDOUT
    at Socket.<anonymous> (/app/apps/web/.next/server/chunks/673.js:4662:17325)
    at Object.onceWrapper (node:events:633:28)
    at Socket.emit (node:events:519:28)
    at Socket._onTimeout (node:net:589:8)
    at listOnTimeout (node:internal/timers:573:17)
    at process.processTimers (node:internal/timers:514:7) {
  errorno: 'ETIMEDOUT',
  code: 'ETIMEDOUT',
  syscall: 'connect'
}

Hopefully some of this helps.

@MohamedBassem
Collaborator

@hongruilin to be able to help, we'll need to see the logs of your worker container. Also, if you're planning to just use OpenAI, you don't need to set the base URL.

@MohamedBassem
Collaborator

@djl0 we need the logs from the worker container as well

@MohamedBassem
Collaborator

@djl0 if your worker container is not able to talk to the chrome container, then that is your problem. The worker container is the one that schedules the crawling requests on the chrome container; if they can't talk, no crawling will happen. Your issue seems to be different from that of @hongruilin, so you might want to open a separate issue.
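
A minimal sketch of how to check that connectivity, assuming the compose services are named workers and chrome as in the default docker-compose file and that the workers image ships Node 18+ (so global fetch is available):

# Hit Chrome's DevTools HTTP endpoint from inside the workers container.
# Any HTTP response (even Chrome's "Host header" rejection message) means the
# two containers can reach each other; ETIMEDOUT / ENOTFOUND means they can't.
docker compose exec workers node -e "fetch('http://chrome:9222/json/version').then(r => r.text()).then(console.log).catch(console.error)"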

@djl0

djl0 commented Jul 26, 2024

@MohamedBassem Thanks for taking a look so quickly. Here is a snippet of the worker log. If you confirm that it can't talk to the chrome container, I will open a separate issue.

Note that there were many timeout errors like the one at the top of this snippet.

Error: connect ETIMEDOUT
    at Socket.<anonymous> (/app/apps/workers/node_modules/.pnpm/ioredis@5.3.2/node_modules/ioredis/built/Redis.js:170:41)
    at Object.onceWrapper (node:events:633:28)
    at Socket.emit (node:events:519:28)
    at Socket.emit (node:domain:488:12)
    at Socket._onTimeout (node:net:589:8)
    at listOnTimeout (node:internal/timers:573:17)
    at process.processTimers (node:internal/timers:514:7) {
  errorno: 'ETIMEDOUT',
  code: 'ETIMEDOUT',
  syscall: 'connect'
}
2024-07-26T15:17:56.971Z error: [Crawler] Failed to connect to the browser instance, will retry in 5 secs
2024-07-26T15:18:01.971Z info: [Crawler] Connecting to existing browser instance: http://chrome:9222
2024-07-26T15:18:01.972Z info: [Crawler] Successfully resolved IP address, new address: http://172.18.0.3:9222/
2024-07-26T15:18:22.197Z info: [Crawler][3] Will crawl "https://docs.llamaindex.ai/en/stable/" for link with id "h6f7dp87hh368y64h34dj93e"
2024-07-26T15:18:22.198Z info: [Crawler][3] Attempting to determine the content-type for the url https://docs.llamaindex.ai/en/stable/
2024-07-26T15:18:22.300Z info: [search][6] Attempting to index bookmark with id kv9eg56xd46pvqka6sv1badv ...
2024-07-26T15:18:22.600Z info: [search][6] Completed successfully
2024-07-26T15:18:27.201Z error: [Crawler][3] Failed to determine the content-type for the url https://docs.llamaindex.ai/en/stable/: TimeoutError: The operation was aborted due to timeout
2024-07-26T15:18:37.957Z error: [Crawler][3] Crawling job failed: Error: net::ERR_NAME_NOT_RESOLVED at https://docs.llamaindex.ai/en/stable/
2024-07-26T15:18:38.017Z info: [Crawler][4] Will crawl "https://docs.hoarder.app/command-line/" for link with id "kv9eg56xd46pvqka6sv1badv"
2024-07-26T15:18:38.018Z info: [Crawler][4] Attempting to determine the content-type for the url https://docs.hoarder.app/command-line/
2024-07-26T15:18:43.021Z error: [Crawler][4] Failed to determine the content-type for the url https://docs.hoarder.app/command-line/: TimeoutError: The operation was aborted due to timeout
2024-07-26T15:18:53.255Z error: [Crawler][4] Crawling job failed: Error: net::ERR_NAME_NOT_RESOLVED at https://docs.hoarder.app/command-line/
2024-07-26T15:18:53.337Z info: [Crawler][5] Will crawl "https://docs.hoarder.app/command-line/" for link with id "kv9eg56xd46pvqka6sv1badv"
2024-07-26T15:18:53.338Z info: [Crawler][5] Attempting to determine the content-type for the url https://docs.hoarder.app/command-line/
2024-07-26T15:18:58.340Z error: [Crawler][5] Failed to determine the content-type for the url https://docs.hoarder.app/command-line/: TimeoutError: The operation was aborted due to timeout
2024-07-26T15:19:08.528Z error: [Crawler][5] Crawling job failed: Error: net::ERR_NAME_NOT_RESOLVED at https://docs.hoarder.app/command-line/

@MohamedBassem
Collaborator

@djl0 hmmm, no, this doesn't seem like a chrome problem. It seems like a DNS/connectivity problem. This container doesn't seem to be able to talk to the internet for some reason: it's failing to resolve DNS, and sometimes times out.
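
A hedged way to confirm that, assuming the compose service is named workers and Node is on PATH inside it (the same resolution would also need to work inside the chrome container, since that's where the page actually loads):

# Try to resolve a hostname from inside the workers container;
# an ENOTFOUND / EAI_AGAIN error here points at Docker's DNS setup.
docker compose exec workers node -e "require('dns').lookup('docs.hoarder.app', (err, addr) => console.log(err ?? addr))"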

@djl0

djl0 commented Jul 26, 2024

@MohamedBassem I've been playing around with it (e.g. trying a newer chrome image, adding an explicit DNS server to docker).

Not sure how much of that did anything, but these are my errors currently, and they don't appear to me to be DNS-related anymore. In fact, the dbus item is something I see in troubleshooting discussions for other projects using that image (see here). I tried to docker exec -it into the container and apt install dbus, but apt wasn't available.

workers (many of these errors):

Error: connect ETIMEDOUT
    at Socket.<anonymous> (/app/apps/workers/node_modules/.pnpm/ioredis@5.3.2/node_modules/ioredis/built/Redis.js:170:41)
    at Object.onceWrapper (node:events:633:28)
    at Socket.emit (node:events:519:28)
    at Socket.emit (node:domain:488:12)
    at Socket._onTimeout (node:net:589:8)
    at listOnTimeout (node:internal/timers:573:17)
    at process.processTimers (node:internal/timers:514:7) {
  errorno: 'ETIMEDOUT',
  code: 'ETIMEDOUT',
  syscall: 'connect'
}

chrome (entire contents of log):

[0726/162246.138018:ERROR:bus.cc(407)] Failed to connect to the bus: Failed to connect to socket /var/run/dbus/system_bus_socket: No such file or directory
[0726/162246.144835:ERROR:bus.cc(407)] Failed to connect to the bus: Failed to connect to socket /var/run/dbus/system_bus_socket: No such file or directory
[0726/162246.144928:ERROR:bus.cc(407)] Failed to connect to the bus: Failed to connect to socket /var/run/dbus/system_bus_socket: No such file or directory
[0726/162246.214875:WARNING:sandbox_linux.cc(420)] InitializeSandbox() called with multiple threads in process gpu-process.
[0726/162246.221963:INFO:config_dir_policy_loader.cc(118)] Skipping mandatory platform policies because no policy file was found at: /etc/chromium/policies/managed
[0726/162246.222007:INFO:config_dir_policy_loader.cc(118)] Skipping recommended platform policies because no policy file was found at: /etc/chromium/policies/recommended
[0726/162246.239523:WARNING:bluez_dbus_manager.cc(248)] Floss manager not present, cannot set Floss enable/disable.

DevTools listening on ws://0.0.0.0:9222/devtools/browser/89386373-466d-473d-b260-280450b06004
[0726/163746.348304:INFO:config_dir_policy_loader.cc(118)] Skipping mandatory platform policies because no policy file was found at: /etc/chromium/policies/managed
[0726/163746.350295:INFO:config_dir_policy_loader.cc(118)] Skipping recommended platform policies because no policy file was found at: /etc/chromium/policies/recommended

web (many of these errors):

Error: connect ETIMEDOUT
    at Socket.<anonymous> (/app/apps/web/.next/server/chunks/673.js:4662:17325)
    at Object.onceWrapper (node:events:633:28)
    at Socket.emit (node:events:519:28)
    at Socket._onTimeout (node:net:589:8)
    at listOnTimeout (node:internal/timers:573:17)
    at process.processTimers (node:internal/timers:514:7) {
  errorno: 'ETIMEDOUT',
  code: 'ETIMEDOUT',
  syscall: 'connect'
}

I appreciate any insight.

@MohamedBassem
Collaborator

The timeouts you're getting are Redis timeouts (Redis is used as the job queue). Is your redis container healthy?
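
A quick sketch for checking that, assuming the service names redis and workers from the default docker-compose file and the default Redis port 6379 (as shown in the log below):

# Does Redis itself answer?
docker compose exec redis redis-cli ping    # expect: PONG
# Can the workers container open a TCP connection to it?
docker compose exec workers node -e "require('net').createConnection(6379, 'redis').on('connect', () => { console.log('redis reachable'); process.exit(0); }).on('error', console.error)"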

@djl0

djl0 commented Jul 26, 2024

This is the redis log (which seemed healthy to me?), and docker ps didn't show any errors (though I wouldn't necessarily expect it to).

$ docker logs hoarder-redis-1
1:C 26 Jul 2024 16:22:45.921 # WARNING Memory overcommit must be enabled! Without it, a background save or replication may fail under low memory condition. Being disabled, it can also cause failures without low memory condition, see https://github.com/jemalloc/jemalloc/issues/1328. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
1:C 26 Jul 2024 16:22:45.921 * oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:C 26 Jul 2024 16:22:45.921 * Redis version=7.2.5, bits=64, commit=00000000, modified=0, pid=1, just started
1:C 26 Jul 2024 16:22:45.921 # Warning: no config file specified, using the default config. In order to specify a config file use redis-server /path/to/redis.conf
1:M 26 Jul 2024 16:22:45.922 * monotonic clock: POSIX clock_gettime
1:M 26 Jul 2024 16:22:45.926 * Running mode=standalone, port=6379.
1:M 26 Jul 2024 16:22:45.927 * Server initialized
1:M 26 Jul 2024 16:22:45.927 * Loading RDB produced by version 7.2.5
1:M 26 Jul 2024 16:22:45.927 * RDB age 134 seconds
1:M 26 Jul 2024 16:22:45.927 * RDB memory usage when created 0.86 Mb
1:M 26 Jul 2024 16:22:45.929 * Done loading RDB, keys loaded: 26, keys expired: 0.
1:M 26 Jul 2024 16:22:45.929 * DB loaded from disk: 0.002 seconds
1:M 26 Jul 2024 16:22:45.929 * Ready to accept connections tcp

@djl0

djl0 commented Jul 26, 2024

@hongruilin sorry to take over your Issue. Curious to know if your logs show something similar to mine.

@hongruilin
Author

@hongruilin to be able to help, we'll need to see the logs of your worker container. Also, if you're planning to just use OpenAI, you don't need to set the base URL.

Hello, I tried removing OPENAI_BASE_URL=https://api.openai.com from the .env file, and now tags are retrieved automatically and everything works fine. Could you please tell me whether setting OPENAI_BASE_URL=https://api.openai.com is correct? If my region requires using Azure OpenAI or another intermediary that is compatible with the OpenAI interface, what should I do? Thank you very much for taking the time to respond to my question.

@MohamedBassem
Collaborator

@hongruilin your problem is that OpenAI's base URL is https://api.openai.com/v1 and not just https://api.openai.com/.
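
For reference, a hedged sketch of the relevant .env lines (the key value is a placeholder; OPENAI_BASE_URL can be left out entirely when talking to OpenAI directly, and only needs to be set when going through an OpenAI-compatible intermediary, using whatever path that provider expects):

OPENAI_API_KEY=sk-...
# Optional; for OpenAI itself the value must include the /v1 path
OPENAI_BASE_URL=https://api.openai.com/v1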

@MohamedBassem
Collaborator

@djl0 can you try running the latest version? We're removing the dependency on Redis entirely, so you can check whether it helps. If not, please open a new issue to discuss it there.
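
A rough sketch of doing that, assuming the image tags in docker-compose.yml are switched to latest:

docker compose pull
docker compose up -d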

MohamedBassem added the question (Further information is requested) label on Jul 27, 2024
@hongruilin
Author

@hongruilin your problem is that OpenAI's base URL is https://api.openai.com/v1 and not just https://api.openai.com/.

When using OpenAI, I only need to set OPENAI_API_KEY to get labels. My guess was that if I also set OPENAI_BASE_URL to OpenAI's official API endpoint, labels should still be retrieved automatically, but it failed. I would still like to put the official OpenAI API URL in the OPENAI_BASE_URL variable. So how should OPENAI_BASE_URL be set: https://api.openai.com/v1 or https://api.openai.com/?

@MohamedBassem
Collaborator

I think it should be https://api.openai.com/v1

@hongruilin
Author

I think it should be https://api.openai.com/v1

Thank you very much. https://api.openai.com/v1 is correct: labels are now obtained and everything works normally.
