Releases · binux/pyspider
v0.3.10
v0.3.9
New features:
- Support for Python 3.6.
- Auto Pause: the project will be paused for `scheduler.PAUSE_TIME` (default: 5min) when the last `scheduler.FAIL_PAUSE_NUM` (default: 10) tasks have failed, then dispatch `scheduler.UNPAUSE_CHECK_NUM` (default: 3) tasks after `scheduler.PAUSE_TIME`. The project resumes if any one of the last `scheduler.UNPAUSE_CHECK_NUM` tasks succeeds.
- Each callback now has a default 30s process time limit. (Platform support required) @beader
- New JavaScript render engine - Splash support: enabled by the fetch argument `--splash-endpoint=http://splash:8050/execute`
- Python 3 webdav support.
- Python 3 `from projects import project` support.
- A link to the corresponding task is added to the webui debug page when debugging an existing task in webui.
- New `user_agent` parameter in `self.crawl` (you could already set the user-agent via headers); a sketch follows below.
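For illustration, a minimal handler sketch using the new parameter; the URLs and UA string are placeholders, not from the release notes:

```python
from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    def on_start(self):
        # the new dedicated parameter...
        self.crawl('http://example.com/', callback=self.index_page,
                   user_agent='Mozilla/5.0 (compatible; mybot/1.0)')
        # ...equivalent to setting the header by hand, as before
        self.crawl('http://example.com/about', callback=self.index_page,
                   headers={'User-Agent': 'Mozilla/5.0 (compatible; mybot/1.0)'})

    def index_page(self, response):
        return {'title': response.doc('title').text()}
```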
Fix several bugs:
- New webui dashboard frontend framework - vue.js, improving performance with a large number of tasks (e.g. http://demo.pyspider.org/)
- Fix crawl_config not working in webui while debugging a script.
- Fix CSS Selector Helper not working. @ackalker
- Fix `connection_timeout` not working.
- Fix `need_auth` option not applied on webdav.
- Fix "can't dump counter to file: scheduler.all" error.
- Some other fixes
v0.3.8
New features:
- You can now use `cancel` to stop an active task that has `auto_recrawl` enabled; see the sketch below.
- `Handler.crawl_config` is now applied to the task when fetching. (Previously it was applied when the task was created; with this change, proxy/headers can be changed afterward.) See http://docs.pyspider.org/en/latest/apis/self.crawl/#handlercrawl_config
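A minimal sketch of the `cancel` usage, assuming (as the self.crawl docs describe) that `cancel` is combined with `force_update` to affect an active task; the URL is a placeholder:

```python
# Stop a previously scheduled auto_recrawl task: turn auto_recrawl off
# and cancel it, forcing an update of the existing task.
self.crawl('http://example.com/feed', callback=self.index_page,
           auto_recrawl=False, cancel=True, force_update=True)
```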
Fix several bugs:
- Fixed a global config object thread interference issue, which could cause a `connect to scheduler rpc error: error(10061, '')` error when running `all --run-in=thread` (the default on Windows)
- Fix `response.save` lost when fetch failed
- Fix potential scheduler failure caused by old versions of six
- Fix result dump returning nothing when using the mongodb backend
v0.3.7
- ThreadBaseScheduler added to improve scheduler performance
- robots.txt supported!
- elasticsearch database backend supported!
- New script callback `on_finished`: http://docs.pyspider.org/en/latest/About-Projects/#on_finished-callback
- You can now set the delay time between retries: `retry_delay` is a dict specifying retry intervals. The items in the dict are {retried: seconds}, and a special key '' (empty string) specifies the default retry delay when not listed. See the sketch after this list.
- dict parameters in crawl_config and @config are now merged (e.g. headers), thanks to @ihipop
- New parameter `max_redirects` in `self.crawl` to control the maximum number of redirects when fetching, thanks to @AtaLuZiK
- New parameter `validate_cert` in `self.crawl` to ignore server certificate errors.
- New property `etree` for Response; `etree` is a cached lxml.html.HtmlElement object, thanks to @waveyeung
- You can now pass arguments to phantomjs from the command line or config file.
- Support for pymongo 3.0
- local.projectdb now accepts a glob path (e.g. script/*.py) to load multiple projects from the local filesystem.
- Fix queue size in the dashboard not working on OS X, thanks to @xyb
- Counters in the dashboard are now shown for stopped projects
- Other bug fixes
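To make the new handler-level options concrete, a hedged sketch combining several of them; the URL, delays, and return values are illustrative, not taken from the release notes:

```python
from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    # retry after 30s on the first retry, 1h on the second,
    # and 6h for every later retry ('' is the default key)
    retry_delay = {0: 30, 1: 60 * 60, '': 6 * 60 * 60}

    def on_start(self):
        self.crawl('https://example.com/', callback=self.index_page,
                   max_redirects=5,      # follow at most 5 redirects
                   validate_cert=False)  # ignore certificate errors

    def index_page(self, response):
        # response.etree is a cached lxml.html.HtmlElement
        return {'links': [a.get('href') for a in response.etree.xpath('//a')]}

    def on_finished(self, response, task):
        # called when all tasks of the project are done
        pass
```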
v0.3.6
- NEW: webdav mode. You can now use webdav to mount the project folder on your local filesystem and edit scripts with your favorite editor! (Not supported on Python 3; requires wsgidav, which is not included in setup.py.)
- Bug fixes for Python 3 compatibility, PostgreSQL, flask-Login>=0.3.0, typos and more, thanks to the help of @lushl9301 @hitjackma @exoticknight @d0ugal @qiang.luo @twinmegami @jttoday @machinewu @littlezz @yaokaige
- Fix Queue.qsize NotImplementedError on Mac OS X, thanks @xyb
v0.3.5
- New parameter: `auto_recrawl` - automatically restart the task every `age`; see the sketch after this list.
- New parameters: `js_viewport_width`/`js_viewport_height` to set the viewport size for the phantomjs engine.
- New command line option to set different message queue backends with a URI scheme.
- New task-level storage mechanism: `self.save`
- New redis taskdb
- New redis message queue.
- New high-level message queue interface kombu.
- Fix bugs related to mongodb (keyword missing if not set).
- Fix phantomjs not working in all mode.
- Fix a potential deadlock in processor send_message.
- Default log level of the scheduler changed to INFO
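A short sketch of the new crawl parameters; URLs and sizes are placeholders:

```python
# Re-fetch the page every 5 minutes: with auto_recrawl=True the task is
# restarted whenever its age expires.
self.crawl('http://example.com/feed', callback=self.index_page,
           age=5 * 60, auto_recrawl=True)

# Render with the phantomjs engine using an explicit viewport size.
self.crawl('http://example.com/app', callback=self.index_page,
           fetch_type='js',
           js_viewport_width=1024, js_viewport_height=768)
```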
v0.3.4
Global
- New message queue support: beanstalkd, by @tiancheng91
- New global argument `--logging-config` to specify a custom logging config (to disable werkzeug logs, for instance). You can get a sample config from pyspider/logging.conf.
- Project `group` info is now added to the task package.
- Changed the docker base image to cmfatih/phantomjs; you can use phantomjs with the same docker image now.
- Auto restart phantomjs if it crashes; enabled only in all mode by default.
WebUI
- Show the next `exetime` of a task on the task page.
- Show fetch time and process time on the tasks page.
- Show average fetch time and process time over the last 5 minutes on the dashboard page.
- Show message queue status on the dashboard page.
- `limit` and `offset` parameter support in result dump.
- Fix frontend bug when crawling pages with dataurl.
Other
- Fix support for phantomjs 2.0.
- Fix scheduler project update notification not working; use md5sum of the script as an additional signal.
- Scheduler: periodic counter report in log.
- Fetcher: fix for legacy version of pycurl
v0.3.3
API
- self.crawl will raise TypeError when it gets unexpected arguments
- self.crawl now accepts a cURL command as the first argument, see http://docs.pyspider.org/en/latest/apis/self.crawl/#curl-command. A sketch follows below.
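A hedged sketch of the cURL form; the command and URL are placeholders (per the linked docs, the command is translated into the corresponding fetch arguments):

```python
# Paste a cURL command (e.g. copied from browser dev tools) directly:
self.crawl("curl 'http://example.com/api' "
           "-H 'Accept: application/json' "
           "--data 'q=pyspider'",
           callback=self.json_page)
```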
WEBUI
- A new CSS selector toolbar has been added; the pre-generated CSS selector pattern can be modified and added/copied to the script.
Benchmarking
- The database table for the bench test is cleared before and after the bench test.
- insert/update/get bench tests for databases and put/get tests for message queues added.
Other
- The default message queue is switched to amqp.
- Documentation fixes.
v0.3.2
Scheduler
- The size of the task queue is more accurate now; you can use it to determine the all-done status of the scheduler.
Fetcher
- Fix tornado losing cookies while doing 30x redirects
- You can now use cookies and the Cookie header at the same time
- Fix proxy not working bug.
- Enable proxy by default.
- Proxy now supports username and password authorization. @soloradish
- Etag and Last-Modified headers are disabled when the last crawl failed.
Databases
- MySQL default engine changed to InnoDB @laapsaap
- MySQL: larger result column size, changed to MEDIUMBLOB (up to 16M) @laapsaap
WebUI
- WebUI now uses the same arguments as the fetcher, fixing the proxy-not-working-for-webui bug.
- Results are sorted in order of updatetime.
One Mode
- Script exception logs are now printed to the screen
New Command send_message
You can use the command `pyspider send_message [project] [message]` to send a message to a project from the command line.
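On the receiving side, the message arrives at the project's `on_message` callback; a minimal hedged sketch (the return value is illustrative):

```python
from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    def on_message(self, project, message):
        # handle messages sent via `pyspider send_message [project] [message]`
        return {'from': project, 'message': message}
```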
Other
- Use locally hosted test web pages
- Removed the lxml version pin; you can use apt-get to install any version of lxml
v0.3.1
One Mode
One mode not only means all-in-one: it runs everything in one process over tornado.ioloop. One mode is designed for debugging. You can test scripts written in local files and use `--interactive` to choose a task to test.
With one mode you can use `pyspider.libs.utils.python_console()` to open an interactive shell in your script context to test your code.
Full documentation: http://docs.pyspider.org/en/latest/Command-Line/#one
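For illustration, a hedged sketch of dropping into the interactive shell from a callback while running under one mode (the callback and return value are placeholders):

```python
from pyspider.libs.utils import python_console

def detail_page(self, response):
    # Opens an interactive shell here, so locals such as `response`
    # can be inspected while the script runs in one mode.
    python_console()
    return {'url': response.url}
```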
- Bug fixes