Crawler

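Crawler collection

Crawlers for internet job-board sites:

Crawlers for job postings at well-known internet companies:

Crawlers for content providers:

Zhihu (知乎)

Crawler scaffolding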

pipeline

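There are currently only two pipelines: one stores scraped data in MongoDB, and the other uses a set to deduplicate items. See the source code for details.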
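As a rough illustration of the two pipelines described above, here is a minimal sketch, assuming a Scrapy-style `process_item(item, spider)` interface. The class names, the `url` key, and the injected `collection` argument are hypothetical and not taken from this repository; the fallback `DropItem` class exists only so the sketch runs without Scrapy installed.

```python
try:
    from scrapy.exceptions import DropItem  # the real Scrapy exception, if available
except ImportError:
    class DropItem(Exception):
        """Stand-in so the sketch runs without Scrapy installed."""


class DuplicatesPipeline:
    """Drop any item whose key was already seen during this run."""

    def __init__(self, key="url"):
        self.key = key      # hypothetical field that identifies an item
        self.seen = set()   # in-memory only, so it is reset on every run

    def process_item(self, item, spider=None):
        value = item[self.key]
        if value in self.seen:
            raise DropItem("duplicate item: %r" % (value,))
        self.seen.add(value)
        return item


class MongoPipeline:
    """Store each item as one document in a MongoDB collection."""

    def __init__(self, collection):
        # `collection` should behave like a pymongo Collection, for example
        # pymongo.MongoClient()["crawler"]["jobs"] (both names are assumptions).
        self.collection = collection

    def process_item(self, item, spider=None):
        self.collection.insert_one(dict(item))
        return item
```

Because the set lives in memory, this kind of deduplication only covers a single run; duplicates across runs would need a persistent check, such as a unique index in MongoDB.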

middleware

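There are currently only two middlewares: one uses fake_useragent to generate a random User-Agent, and the other sends requests through a list of HTTP proxies. See the source code for details.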
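The two middlewares might look roughly like the sketch below, assuming a Scrapy-style downloader middleware where `process_request` returns `None` to continue processing and a proxy is selected through `request.meta["proxy"]`. The class names and the `ua_source` parameter are hypothetical; the repository's actual code may differ.

```python
import random


class RandomUserAgentMiddleware:
    """Set a random User-Agent header on every outgoing request."""

    def __init__(self, ua_source=None):
        if ua_source is None:
            # Imported lazily so the class can be exercised without the library.
            from fake_useragent import UserAgent
            ua = UserAgent()
            ua_source = lambda: ua.random
        self.ua_source = ua_source  # callable returning a User-Agent string

    def process_request(self, request, spider=None):
        request.headers["User-Agent"] = self.ua_source()
        return None  # None tells Scrapy to keep processing the request


class ProxyListMiddleware:
    """Route each request through a proxy picked at random from a list."""

    def __init__(self, proxies):
        self.proxies = list(proxies)  # e.g. ["http://1.2.3.4:8080", ...]

    def process_request(self, request, spider=None):
        if self.proxies:
            request.meta["proxy"] = random.choice(self.proxies)
        return None
```

Passing the User-Agent source in as a callable keeps the middleware easy to test with a fixed string.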

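Utilities

Fetching free proxies

Scrapes the free proxies published on proxy-listing sites and runs a preliminary check on them. See the source code for details. The proxy sites currently scraped are: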
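One possible way to pull `ip:port` pairs out of a listing page is sketched below. It is a minimal illustration, not the repository's parser: the function name `extract_proxies` is hypothetical, and the "preliminary check" here is only a syntactic one (valid octets and port range), which is an assumption about what such a check could cover.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")
# An IPv4-looking address followed by a port, separated by ':' or whitespace.
PROXY_RE = re.compile(r"(\d{1,3}(?:\.\d{1,3}){3})[\s:]+(\d{2,5})\b")


def extract_proxies(html):
    """Return unique 'ip:port' strings found in an HTML page, in page order."""
    text = TAG_RE.sub(" ", html)  # turn table cells into whitespace-separated text
    found = []
    for ip, port in PROXY_RE.findall(text):
        octets_ok = all(0 <= int(part) <= 255 for part in ip.split("."))
        port_ok = 0 < int(port) < 65536
        proxy = "%s:%s" % (ip, port)
        if octets_ok and port_ok and proxy not in found:
            found.append(proxy)
    return found
```

A regex like this is tolerant of different page layouts, but it can also pick up unrelated numbers, so the results still need a real connectivity test.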

Proxy validation

Uses httpbin to test whether a proxy still works and what type of proxy it is.
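A check of this kind could be sketched as follows, assuming the `requests` library and httpbin's `/get` endpoint, which echoes back the request headers and the caller's `origin` address. The function names and the three-way transparent/anonymous/elite heuristic are assumptions for illustration, not the repository's actual logic.

```python
def classify_proxy(httpbin_json, real_ip):
    """Guess the proxy type from an httpbin /get response body."""
    origin = httpbin_json.get("origin", "")
    headers = {k.lower(): v for k, v in httpbin_json.get("headers", {}).items()}
    forwarded = headers.get("x-forwarded-for", "")
    if real_ip in origin or real_ip in forwarded:
        return "transparent"  # the target can still see our real address
    if "via" in headers or forwarded:
        return "anonymous"    # address hidden, but proxy use is visible
    return "elite"            # no obvious trace of a proxy


def check_proxy(proxy, real_ip, timeout=5):
    """Return the proxy type, or None if the proxy did not answer in time."""
    import requests  # third-party; imported lazily so classify_proxy works alone

    proxies = {"http": "http://" + proxy, "https": "http://" + proxy}
    try:
        resp = requests.get("http://httpbin.org/get", proxies=proxies, timeout=timeout)
        resp.raise_for_status()
        return classify_proxy(resp.json(), real_ip)
    except (requests.RequestException, ValueError):
        return None  # dead, too slow, or returned something that is not JSON
```

Here `proxy` is an `ip:port` string and `real_ip` is the machine's own public address, which can be found by calling httpbin once without a proxy.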

IP information lookup

Uses geoiplookup to query information about an IP address.

Example:

from utils.ip_info import get_ip_info

print(get_ip_info('8.8.8.8'))

{u'countrycode': u'US', u'ip': u'8.8.8.8', u'isp': u'Google', u'longitude': u'-97.822', u'countryname': u'United States', u'host': u'8.8.8.8', u'latitude': u'37.751'}
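In the sample output above, every value, including the coordinates, comes back as a string. A small helper for turning them into numbers might look like the sketch below; `location_of` is a hypothetical name, and the field names are taken from the output shown.

```python
def location_of(info):
    """Return (latitude, longitude) as floats, or None if missing or malformed."""
    try:
        return float(info["latitude"]), float(info["longitude"])
    except (KeyError, TypeError, ValueError):
        return None


sample = {"countrycode": "US", "ip": "8.8.8.8", "latitude": "37.751", "longitude": "-97.822"}
print(location_of(sample))
```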

Translation function

For now this is only a thin wrapper. The supported backends are:

Youdao Dictionary

from utils.translate import translate

print(translate(u'努力工作', dict_name='youdao')['translateResult'][0][0]['tgt'])
print(translate(u'hard work', dict_name='youdao', lfrom='en', lto='zh-CHS')['translateResult'][0][0]['tgt'])

To work hard
努力工作

Baidu Translate

from utils.translate import translate

print(translate(u'努力工作', dict_name='baidu')[0]['dst'])
print(translate(u'hard work', dict_name='baidu', lfrom='en', lto='zh-CHS')[0]['dst'])

Work hard
艰苦的工作
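As the two examples show, the wrapper returns each backend's raw response, so the translated text sits at a different path for each one. A small helper could hide that difference. The sketch below is based only on the response shapes used in the examples above; `extract_translation` is a hypothetical name and is not part of `utils.translate`.

```python
def extract_translation(result, dict_name):
    """Pull the translated text out of a raw backend response."""
    if dict_name == "youdao":
        return result["translateResult"][0][0]["tgt"]
    if dict_name == "baidu":
        return result[0]["dst"]
    raise ValueError("unsupported dict_name: %r" % (dict_name,))


# Hand-written stand-ins for real responses, shaped like the examples above.
youdao_like = {"translateResult": [[{"tgt": "To work hard"}]]}
baidu_like = [{"dst": "Work hard"}]
print(extract_translation(youdao_like, "youdao"))
print(extract_translation(baidu_like, "baidu"))
```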
