Hardcoded REJECT_REQUEST_PATTERN prevents browserless from loading certain web pages #7240

Open
wxxsfxyzm opened this issue Mar 30, 2025 · 2 comments
Labels
🐛 Bug (Something isn't working), unconfirm (not yet confirmed by maintainers)

Comments

@wxxsfxyzm

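📦 Deployment environment

Docker

📦 Deployment mode

Server mode (lobe-chat-database image)

📌 Software version

v1.77.3

💻 System environment

Windows

🌐 Browser

Edge

🐛 Problem description

Using the browserless method (self-hosted) in crawlSinglePage to answer the following input:

Summarize this web page: https://bugzilla.proxmox.com/show_bug.cgi?id=3482

The error is shown in the screenshot:

Image

The browserless log shows the following error: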

  browserless.io:ChromiumContentPostRoute:info 192.168.48.2 Starting new ChromiumCDP instance +0ms
  browserless.io:ChromiumContentPostRoute:info 192.168.48.2 ChromiumCDP got open port 37507 +1ms
  browserless.io:ChromiumContentPostRoute:info 192.168.48.2 {
  args: [
    '--remote-debugging-port=37507',
    '--no-sandbox',
    '--user-data-dir=/tmp/browserless-data-dirs/browserless-data-dir-d2e52fb2-7a87-4b41-962c-2263065aee51'
  ],
  executablePath: '/usr/local/bin/playwright-browsers/chromium-1161/chrome-linux/chrome'
} Launching ChromiumCDP Handler +0ms
  browserless.io:ChromiumContentPostRoute:info 192.168.48.2 ChromiumCDP is running on ws://127.0.0.1:37507/devtools/browser/2d91d024-5fe3-41d4-b7e9-5011eed46970 +172ms
  browserless.io:router:trace  Running found HTTP handler. +21s
  browserless.io:ChromiumContentPostRoute:info 192.168.48.2 Content API invoked with body: {
  gotoOptions: { waitUntil: 'networkidle2' },
  rejectRequestPattern: [
    '.*\\.(?!(html|css|js|json|xml|webmanifest|txt|md)(\\?|#|$))[\\w-]+(?:[\\?#].*)?$'
  ],
  url: 'https://bugzilla.proxmox.com/show_bug.cgi?id=3482'
} +1ms
  browserless.io:ChromiumContentPostRoute:trace 192.168.48.2 Setting up file:// protocol request rejection +0ms
  browserless.io:ChromiumContentPostRoute:trace 192.168.48.2 GET: https://bugzilla.proxmox.com/show_bug.cgi?id=3482 +26ms
  browserless.io:ChromiumContentPostRoute:debug 192.168.48.2 Aborting request GET: https://bugzilla.proxmox.com/show_bug.cgi?id=3482 +0ms
  browserless.io:ChromiumContentPostRoute:warn 192.168.48.2 "net::ERR_FAILED": https://bugzilla.proxmox.com/show_bug.cgi?id=3482 +0ms
  browserless.io:router:trace  HTTP Request handler has finished. +48ms
  browserless.io:browser-manager:info  0 Client(s) are currently connected, Keep-until: 0, force: false +0ms
  browserless.io:browser-manager:info  Closing browser session +0ms
  browserless.io:browser-manager:info  Deleting "/tmp/browserless-data-dirs/browserless-data-dir-d2e52fb2-7a87-4b41-962c-2263065aee51" user-data-dir and session from memory +0ms
  browserless.io:ChromiumContentPostRoute:info 192.168.48.2 Closing ChromiumCDP process and all listeners +47ms
  browserless.io:server:error  Error handling request: Error: net::ERR_FAILED at https://bugzilla.proxmox.com/show_bug.cgi?id=3482
Error: net::ERR_FAILED at https://bugzilla.proxmox.com/show_bug.cgi?id=3482
    at navigate (file:///usr/src/app/node_modules/puppeteer-core/lib/esm/puppeteer/cdp/Frame.js:180:27)
    at async Deferred.race (file:///usr/src/app/node_modules/puppeteer-core/lib/esm/puppeteer/util/Deferred.js:33:20)
    at async CdpFrame.goto (file:///usr/src/app/node_modules/puppeteer-core/lib/esm/puppeteer/cdp/Frame.js:146:25)
    at async CdpPage.goto (file:///usr/src/app/node_modules/puppeteer-core/lib/esm/puppeteer/api/Page.js:570:20)
    at async ChromiumContentPostRoute.handler (file:///usr/src/app/build/shared/content.http.js:67:30)
    at async file:///usr/src/app/build/router.js:60:28
    at async bound (file:///usr/src/app/build/limiter.js:136:36) +0ms
  browserless.io:limiter:error  Error: net::ERR_FAILED at https://bugzilla.proxmox.com/show_bug.cgi?id=3482
    at navigate (file:///usr/src/app/node_modules/puppeteer-core/lib/esm/puppeteer/cdp/Frame.js:180:27)
    at async Deferred.race (file:///usr/src/app/node_modules/puppeteer-core/lib/esm/puppeteer/util/Deferred.js:33:20)
    at async CdpFrame.goto (file:///usr/src/app/node_modules/puppeteer-core/lib/esm/puppeteer/cdp/Frame.js:146:25)
    at async CdpPage.goto (file:///usr/src/app/node_modules/puppeteer-core/lib/esm/puppeteer/api/Page.js:570:20)
    at async ChromiumContentPostRoute.handler (file:///usr/src/app/build/shared/content.http.js:67:30)
    at async file:///usr/src/app/build/router.js:60:28
    at async bound (file:///usr/src/app/build/limiter.js:136:36) +0ms
  browserless.io:limiter:info  Recording failed stat, cleaning up: "Error: net::ERR_FAILED at https://bugzilla.proxmox.com/show_bug.cgi?id=3482" +227ms
Issue calling error hook: "TypeError: Invalid URL". Did you set a working ERROR_ALERT_URL env variable?
  browserless.io:limiter:info  (Running: 0, Pending: 0) All jobs complete.  +0ms
  browserless.io:router:trace  Response has been written, resolving +5ms
  browserless.io:browser-manager:info  Deleting data directory "/tmp/browserless-data-dirs/browserless-data-dir-d2e52fb2-7a87-4b41-962c-2263065aee51" +5ms

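Analysis suggests that rejectRequestPattern is blocking the required .cgi page request. Looking at the source code, this parameter is hardcoded. Could it be changed so that it can be passed in through an environment variable?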
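The diagnosis can be checked locally. The sketch below assumes that browserless applies each rejectRequestPattern entry as an ordinary JavaScript regular expression against the full request URL, and runs the pattern from the log against the failing URL. The example.com URLs and the isRejected helper are illustrative and not part of lobe-chat or browserless.

```typescript
// Pattern copied from the rejectRequestPattern entry in the log.
// The log prints it as a JS string literal, so "\\." there is "\." here.
const rejectPattern =
  /.*\.(?!(html|css|js|json|xml|webmanifest|txt|md)(\?|#|$))[\w-]+(?:[\?#].*)?$/;

// Returns true when the URL matches the pattern, i.e. the request would be aborted.
function isRejected(url: string): boolean {
  return rejectPattern.test(url);
}

const sampleUrls: string[] = [
  "https://bugzilla.proxmox.com/show_bug.cgi?id=3482", // the failing page from this issue
  "https://example.com/index.html", // hypothetical URL with an exempted extension
  "https://example.com/logo.png", // hypothetical URL with a non-exempted extension
];

for (const url of sampleUrls) {
  console.log(`${isRejected(url) ? "rejected" : "allowed"}: ${url}`);
}
```

Under that assumption the .cgi URL matches because cgi is not among the exempted extensions, which is consistent with the "Aborting request" line in the log.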

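📷 Steps to reproduce

Select deepseek-r1 as the model and deepseek-v3 as the search assistant model, then ask:

Summarize this web page: https://bugzilla.proxmox.com/show_bug.cgi?id=3482

🚦 Expected result

The web page content is fetched correctly.

📝 Additional information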

No response

@wxxsfxyzm wxxsfxyzm added the unconfirm (not yet confirmed by maintainers) label Mar 30, 2025
@lobehubbot
Member

👀 @wxxsfxyzm

Thank you for raising an issue. We will look into the matter and get back to you as soon as possible.
Please make sure you have given us as much context as possible.

dosubot bot commented Mar 30, 2025

The issue you're encountering is caused by the rejectRequestPattern parameter, which is currently hardcoded. The pattern rejects any request whose URL ends in a file extension outside a short allowlist (html, css, js, json, xml, webmanifest, txt, md), so .cgi pages are rejected as well. As a result, the browserless service aborts requests to URLs like the one you're trying to access.

To resolve this, you can modify the code to allow the rejectRequestPattern to be set via an environment variable. This would provide flexibility to adjust the pattern as needed for different use cases.

Here's a general approach to implement this change:

  1. Modify the Code: Update the part of the code where rejectRequestPattern is defined to check for an environment variable. If the environment variable is set, use its value; otherwise, fall back to the default pattern.

  2. Set Environment Variable: In your deployment environment (e.g., Docker), set the environment variable with the desired pattern that includes .cgi files.
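A minimal sketch of those two steps, assuming a hypothetical environment variable named CRAWLER_REJECT_REQUEST_PATTERN. lobe-chat does not define this variable; the name and the resolveRejectRequestPattern helper are made up for illustration. The sketch falls back to the pattern seen in the log when the variable is unset, blank, or not a valid regular expression.

```typescript
// Default copied from the rejectRequestPattern entry in the log of this issue.
const DEFAULT_REJECT_REQUEST_PATTERN =
  ".*\\.(?!(html|css|js|json|xml|webmanifest|txt|md)(\\?|#|$))[\\w-]+(?:[\\?#].*)?$";

// Hypothetical variable name, used only for this sketch.
const REJECT_PATTERN_ENV_KEY = "CRAWLER_REJECT_REQUEST_PATTERN";

// The environment is passed in as a plain object so the function stays easy to test.
type Env = Record<string, string | undefined>;

// Returns the pattern list to send to browserless as rejectRequestPattern.
function resolveRejectRequestPattern(env: Env): string[] {
  const raw = env[REJECT_PATTERN_ENV_KEY]?.trim();
  if (!raw) {
    // Unset or blank: keep the current behavior.
    return [DEFAULT_REJECT_REQUEST_PATTERN];
  }
  try {
    // Validate early so a typo in the deployment config does not break every crawl.
    new RegExp(raw);
    return [raw];
  } catch {
    return [DEFAULT_REJECT_REQUEST_PATTERN];
  }
}

// Example: an override that only rejects common image files.
console.log(
  resolveRejectRequestPattern({
    CRAWLER_REJECT_REQUEST_PATTERN: "\\.(png|jpe?g|gif)$",
  }),
);
```

In a Node.js process this would be called as resolveRejectRequestPattern(process.env), with the variable set in the Docker environment.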

This change would involve editing the code where rejectRequestPattern is used, such as in the browserless function or related configuration files. Unfortunately, I don't have the exact location of the crawlSinglePage function in the codebase, but you can look for where rejectRequestPattern is defined and used, such as in the test file packages/web-crawler/src/crawImpl/__tests__/browserless.test.ts [1].

By making this change, you should be able to customize the request pattern to allow .cgi pages and other necessary file types.
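As an illustration of such a customized value, the sketch below rebuilds a pattern with the same shape as the one in the log from a list of exempted extensions and adds cgi to it. buildRejectPattern is a hypothetical helper, and the extension names are assumed to be plain alphanumeric strings that need no regex escaping.

```typescript
// Extensions exempted by the pattern shown in the log of this issue.
const loggedExtensions: string[] = [
  "html", "css", "js", "json", "xml", "webmanifest", "txt", "md",
];

// Builds a reject pattern of the same shape as the logged one.
// Any URL ending in an extension outside `allowedExtensions` matches it.
function buildRejectPattern(allowedExtensions: string[]): string {
  const allowed = allowedExtensions.join("|");
  return `.*\\.(?!(${allowed})(\\?|#|$))[\\w-]+(?:[\\?#].*)?$`;
}

// Same list plus cgi, so that CGI pages are no longer rejected.
const patternWithCgi = buildRejectPattern([...loggedExtensions, "cgi"]);

const issueUrl = "https://bugzilla.proxmox.com/show_bug.cgi?id=3482";
const urlIsRejected = new RegExp(patternWithCgi).test(issueUrl);

console.log(patternWithCgi);
console.log(urlIsRejected ? "still rejected" : "no longer rejected");
```

The resulting string is what would be supplied as the custom pattern. Other non-exempted extensions, such as image files, still match it.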

@dosubot dosubot bot added the 🐛 Bug (Something isn't working) label Mar 30, 2025
@github-project-automation github-project-automation bot moved this to Roadmap - Chat 1.x in Lobe Chat Routine Mar 30, 2025
Projects
Status: Roadmap - Chat 1.x
Development

No branches or pull requests

2 participants