Update babel monorepo (major) - autoclosed #19

Closed
wants to merge 23 commits into from
33 changes: 23 additions & 10 deletions .gitignore
@@ -1,10 +1,23 @@
/.idea/
/eggs/
/build/
/logs/
/project.egg-info/
/dbs/
/autonews/lib/hanlp-1.3.2/data/
/autonews/lib/THUCTC_java_v1/news_model/
/autonews/lib/THUCTC_java_v1/dbs/
/autonews/lib/jdk-6u45-linux-x64.bin
# See https://help.github.com/articles/ignoring-files/ for more about ignoring files.

# dependencies
node_modules
.pnp
.pnp.js

# testing
coverage

# production
build

# misc
.DS_Store
.env.local
.env.development.local
.env.test.local
.env.production.local

npm-debug.log*
yarn-debug.log*
yarn-error.log*
1 change: 0 additions & 1 deletion .python-version

This file was deleted.

31 changes: 0 additions & 31 deletions Dockerfile

This file was deleted.

61 changes: 41 additions & 20 deletions README.md
@@ -1,25 +1,46 @@
# Auto News: News Monitoring System Crawler
# Autonews

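## Installation
- install pyenv: `brew install pyenv`
- install virtualenv:
- install python: `$ pyenv install 3.6.1`
- install requirements: `$ pip install -r requirements.txt`
- create new virtualenv: `$ pyenv virtualenv 3.6.1 playground`
- activate a virtualenv: `$ pyenv activate playground`
Autonews (news source monitoring) is a tool that monitors and records news updates in real time. Its main features are:

## Running
- Run the scheduler: `python scrapy_scheduler.py`
- Run all spiders: `python run_all_spiders.py`
- Monitors news updates in near real time and aggregates them in the interface, so nobody has to stand watch and keep refreshing pages
- Gathers scattered news into a single view of the day's coverage for online news editors and news followers to browse, filter, and process
- Supports searching past content, providing reference data for news features, roundup features, and lookups of old stories
- (Features under development...)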

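## Common commands
- Create a new spider: `scrapy genspider <new_spider> <url>`
Currently monitored sources:
- Tencent Dachu Net (腾讯·大楚网) news, including: Top News / Yichang / Xiangyang / Huangshi / Shiyan / Xiaogan / Jingmen / Jingzhou / Huanggang / Enshi / Suizhou / Qianjiang / Xiantao
- Three Gorges Evening News (三峡晚报)
- Chutian Metropolis Daily (楚天都市报)
- Hubei Daily (湖北日报)
- Chutian Golden News (楚天金报)
- Chutian Express (楚天快报)
- Chutian Times (楚天时报)
- Changjiang Daily (长江日报)
- Wuhan Evening News (武汉晚报)
- Wuhan Morning Post (武汉晨报)
- People's Daily Online, Hubei channel (人民网-湖北频道)
- Huangshi Daily (黄石日报) (to be added; in progress)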

## Develop

## Using scrapyd
- Run the `scrapyd` command to start scrapyd; by default it serves its monitoring interface at [localhost:6800](http://localhost:6800/)
## Running
- run mongoDB server: `mongod --config /usr/local/etc/mongod.conf`
- init db: `node tools/dbInit.js`
- Run the HTTP server: `node server/index.js`
- Run the crawlers: see the autonews-scrapy project
- Build the client: `npm run deploy:prod`
- Client dev environment: run `npm start`, then open [localhost:3091/autonews/](http://localhost:3091/autonews/)
- Open the client: [localhost:3090/autonews/](http://localhost:3090/autonews/)

## Release Note
See [About](http://www.berlinchan.com/autonews/about)

## Build docker image
- Place [jdk-6u45-linux-x64.bin](http://www.oracle.com/technetwork/java/javase/downloads/java-archive-downloads-javase6-419409.html#jre-6u45-oth-JPR) in the lib directory
- Place the [hanlp model]() in the lib directory
- Place the [THUCTC model]() in the lib directory
- `docker build -t autonews-scrapy .`
## Common commands
- Back up MongoDB: `mongodump -h 127.0.0.1:27017 -d auto-news -o C:\data\backup\`
- Restore MongoDB: `mongorestore -h 127.0.0.1:27017 -d auto-news C:\data\backup\auto-news`
- Build the Docker image: `docker build -t autonews-api .`
- Run the API Docker container: `docker run -d --name autonews-api-container --restart always -p 3090:3090 autonews-api`
- Run the scrapy Docker container (crawler project): `docker run -d --name autonews-scrapy-container --restart always --link autonews-api-container:autonews-api autonews-scrapy`
- Stop all Docker containers: `docker stop $(docker ps -a -q)`
- Remove all Docker containers: `docker rm $(docker ps -a -q)`
- Expose the local server through [qydev.com](http://qydev.com) tunneling: `ngrok -config=ngrok.cfg -subdomain autonews 3090`
10 changes: 10 additions & 0 deletions autonews-scrapy/.gitignore
@@ -0,0 +1,10 @@
/.idea/
/eggs/
/build/
/logs/
/project.egg-info/
/dbs/
/autonews/lib/hanlp-1.3.2/data/
/autonews/lib/THUCTC_java_v1/news_model/
/autonews/lib/THUCTC_java_v1/dbs/
/autonews/lib/jdk-6u45-linux-x64.bin
19 changes: 19 additions & 0 deletions autonews-scrapy/Dockerfile
@@ -0,0 +1,19 @@
# Use an official Python runtime as a base image
FROM python:3.9.4

# Set the working directory to /autonews
WORKDIR /autonews

# Copy the crawler package, requirements, Scrapy config, and scheduler into the container
ADD ./autonews /autonews/autonews
ADD ./requirements.txt /autonews
ADD ./scrapy.cfg /autonews
ADD ./scrapy_scheduler.py /autonews

# Install the Python dependencies listed in requirements.txt
RUN pip install -r requirements.txt

# No port is exposed; the EXPOSE instruction is left commented out
#EXPOSE 80

# Run scrapy_scheduler.py when the container launches
CMD ["python", "./scrapy_scheduler.py"]
25 changes: 25 additions & 0 deletions autonews-scrapy/README.md
@@ -0,0 +1,25 @@
# Auto News: News Monitoring System Crawler

## Installation
- install pyenv: `brew install pyenv`
- install virtualenv:
- install python: `$ pyenv install 3.6.1`
- install requirements: `$ pip install -r requirements.txt`
- create new virtualenv: `$ pyenv virtualenv 3.6.1 playground`
- activate a virtualenv: `$ pyenv activate playground`

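## Running
- Run the scheduler: `python scrapy_scheduler.py`
- Run all spiders: `python run_all_spiders.py`

## Common commands
- Create a new spider: `scrapy genspider <new_spider> <url>`

## Using scrapyd
- Run the `scrapyd` command to start scrapyd; by default it serves its monitoring interface at [localhost:6800](http://localhost:6800/)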

## Build docker image
- Place [jdk-6u45-linux-x64.bin](http://www.oracle.com/technetwork/java/javase/downloads/java-archive-downloads-javase6-419409.html#jre-6u45-oth-JPR) in the lib directory
- Place the [hanlp model]() in the lib directory
- Place the [THUCTC model]() in the lib directory
- `docker build -t autonews-scrapy .`
File renamed without changes.
8 changes: 8 additions & 0 deletions package.json
@@ -0,0 +1,8 @@
{
  "private": true,
  "workspaces": [
    "packages/*"
  ],
  "scripts": {
  }
}
22 changes: 22 additions & 0 deletions packages/autonews-api/Dockerfile
@@ -0,0 +1,22 @@
# Use an official Node.js runtime as a base image
FROM node:7.10

# Set the working directory to /autonews
WORKDIR /autonews

# Copy the server code, utilities, and package.json into the container
ADD ./server /autonews/server
ADD ./utils /autonews/utils
ADD ./package.json /autonews

# Install the packages listed in package.json
RUN npm i --registry=https://registry.npm.taobao.org

# Make port 3090 available to the world outside this container
EXPOSE 3090

# Define environment variable
ENV NAME autonews-api

# Run server/index.js when the container launches
CMD ["node", "./server/index.js"]
1 change: 1 addition & 0 deletions packages/autonews-api/README.md
@@ -0,0 +1 @@
# Autonews API
9 changes: 9 additions & 0 deletions packages/autonews-api/config.js
@@ -0,0 +1,9 @@
/**
 * Created by Berlin Chan on 2017/3/7.
 * Global configuration
 */

module.exports = {
  HTTP_PORT: 3090, // HTTP server port
  DB_SERVER: 'mongodb://localhost:27017/autonews', // docker: Mac<host's ip> Win<vEthernet IPv4>
};
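The `DB_SERVER` comment in this file notes that under Docker the address has to point at the host's IP rather than `localhost`. One way to avoid editing the file per environment is to read overrides from environment variables. The following is only a sketch: the `resolveConfig` helper and the variable names `HTTP_PORT` and `DB_SERVER` are assumptions for illustration and are not part of this PR.

```javascript
// Hypothetical helper (not in this PR): resolve settings from environment
// variables, falling back to the defaults hard-coded in config.js.
function resolveConfig(env = process.env) {
  const port = Number.parseInt(env.HTTP_PORT, 10);
  return {
    // Fall back to 3090 when HTTP_PORT is unset or not a number
    HTTP_PORT: Number.isNaN(port) ? 3090 : port,
    // Fall back to the local MongoDB instance when DB_SERVER is unset
    DB_SERVER: env.DB_SERVER || 'mongodb://localhost:27017/autonews',
  };
}
```

With something like this in place, a container could be pointed at the host database with `docker run -e DB_SERVER=...` instead of a rebuilt image.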
27 changes: 27 additions & 0 deletions packages/autonews-api/package.json
@@ -0,0 +1,27 @@
{
  "name": "autonews-api",
  "version": "0.3.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "author": "",
  "license": "MIT",
  "dependencies": {
    "kcors": "^1.3.2",
    "koa": "^2.0.1",
    "koa-compress": "^2.0.0",
    "koa-conditional-get": "^2.0.0",
    "koa-etag": "^3.0.0",
    "koa-route": "^3.2.0",
    "koa-socket": "^4.4.0",
    "koa-static": "^3.0.0",
    "moment": "^2.18.1",
    "monk": "^4.0.0",
    "prop-types": "^15.5.10",
    "raw-body": "^2.2.0",
    "colors": "^1.1.2",
    "koa-convert": "^1.2.0"
  }
}
107 changes: 107 additions & 0 deletions packages/autonews-api/src/DAO.js
@@ -0,0 +1,107 @@
/**
 * Created by Berlin on 2017/3/17.
 */

const config = require('../config');
const monk = require('monk');
const db = monk(config.DB_SERVER);
const moment = require('moment');

// Query the list of news origins (sources)
function getOrigin() {
  return db.get('origin').find({});
}

// Get today's news list
async function getTodayList(origin_key) {
  const todayDate = new Date(moment().format('YYYY-MM-DD'));
  const tomorrowDate = new Date(moment().add({days: 1}).format('YYYY-MM-DD'));
  let origin_key_array = [];
  if (origin_key) {
    origin_key_array = origin_key.split(',');
  } else {
    // No origin given: fall back to every known origin key
    const allOrigin = await db.get('origin').find({}, {fields: 'key'});
    allOrigin.forEach(item => origin_key_array.push(item.key));
  }

  return db.get('detail').find(
    {
      "date": {
        $gte: new Date(Date.parse(todayDate) - 28800000),
        $lt: new Date(Date.parse(tomorrowDate) - 28800000)
      }, // subtract 8 hours (28800000 ms)?
      "origin_key": {$in: origin_key_array}
    },
    {sort: {'date': -1}, fields: '_id title subTitle url date nlpSentiment'}
  );
}

/*
 * Query past news
 *
 * Parameters:
 *   origin: origin keys; separate multiple keys with ","
 *   beginDate: start time
 *   endDate: end time
 *   keyword: title keyword
 *   current: page number to fetch
 *   pageSize: number of items per page
 */
async function pastInquiry(origin = '', beginDate, endDate, keyword = '', current = 1, pageSize = 20) {
  let origin_key_array = origin.split(',');
  let query = {
    "date": {
      $gte: new Date(Date.parse(beginDate) - 28800000),
      $lt: new Date(Date.parse(endDate) - 28800000)
    }, // subtract 8 hours (28800000 ms)?
    "origin_key": {$in: origin_key_array},
  };
  if (keyword) {
    // Build the case-insensitive pattern with RegExp rather than eval,
    // which would execute arbitrary code embedded in the keyword
    query['title'] = new RegExp(keyword, 'i');
  }

  let detailList = await db.get('detail').find(
    query,
    {
      sort: {'date': -1},
      fields: '-content',
      limit: parseInt(pageSize, 10),
      skip: (parseInt(current, 10) - 1) * parseInt(pageSize, 10),
    }
  );
  let totalList = await db.get('detail').count(query);

  return {
    list: detailList,
    pagination: {current: parseInt(current, 10), pageSize: parseInt(pageSize, 10), total: parseInt(totalList, 10)},
  };
}

/*
 * Query a news detail by id
 *
 * Parameters:
 *   id: the _id field in the detail collection
 */
function getNewsDetailById(id) {
  return db.get('detail').findOne({"_id": id});
}

/*
 * Query the filtered list by id
 * Parameters:
 *   id: a string of ids separated by ","
 */
function getFilteredList(id) {
  let idList = id.split(',');
  return db.get('detail').find({_id: {$in: idList}}, {fields: '-content'});
}


module.exports = {
  getOrigin,
  getTodayList,
  pastInquiry,
  getNewsDetailById,
  getFilteredList,
};
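The filter and paging logic inside `pastInquiry` can be exercised without a database if it is pulled into pure functions. The sketch below is one possible shape, not the author's implementation: `buildPastQuery`, `pageOptions`, and `escapeRegExp` are hypothetical names, and it assumes the keyword should match literally, whereas the original treats it as a regular-expression pattern.

```javascript
// 8 hours in milliseconds (28800000), the offset subtracted in DAO.js
const EIGHT_HOURS_MS = 8 * 60 * 60 * 1000;

// Escape regex metacharacters so the keyword is matched as literal text
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: build the MongoDB filter for a past-news query
function buildPastQuery(origin, beginDate, endDate, keyword = '') {
  const query = {
    date: {
      $gte: new Date(Date.parse(beginDate) - EIGHT_HOURS_MS),
      $lt: new Date(Date.parse(endDate) - EIGHT_HOURS_MS),
    },
    origin_key: {$in: origin.split(',')},
  };
  if (keyword) {
    // Case-insensitive literal match on the title
    query.title = new RegExp(escapeRegExp(keyword), 'i');
  }
  return query;
}

// Hypothetical helper: turn a 1-based page number into skip/limit options
function pageOptions(current = 1, pageSize = 20) {
  const size = Number.parseInt(pageSize, 10);
  const page = Number.parseInt(current, 10);
  return {limit: size, skip: (page - 1) * size};
}
```

A date-only string such as `'2017-03-17'` parses as UTC midnight, so subtracting eight hours yields 16:00 UTC on the previous day, which is local midnight in a UTC+8 time zone. That appears to be what the questioned "subtract 8 hours" comment is compensating for, though the PR does not say so.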