A Nodejs Crawler for crawling user's article from csdn.com.

Only for Node.js Application, not work on browser.

Offer options.username will return you the user's article list(default length is 5);
Upload the Article's image to your own Qiniu Cloud Server when you offer the config: options.qiniu<object>
Offer options.page, options.size can limit the page and size config for api

为什么写这个？ / Why would I code this?

I want some data to fill my database for big-data's test, but it seems hard to me to write it myself(because I am so lazy).
May be so many coder face the same things like me. So, let me make this job become easier.
WARN: This repo is only for test and study, do not use this to run Pressure-Test on csdn.com. And CSDN is Sucks!

实现原理 / How to fuck this site

# dependencies
 cheerio
 html-to-md
 pinyin
 request-promise
 
 # 使用request-primose获取目标文档
 # 通过cheerio解析HTML文档，获取文章内容
 # 使用html-to-md 解析HTML内容， 转为md
 # 使用pinyin生成文章alias

使用指南 / Usages

1、Fill you own config

// Example:
const options = {
    username: 'weixin_45534242', // target username
    page: 1, // the page index you are crawling
    size: 5, // page size
    link: '', // the user center article list api, you can find it on csdn.com using: F12
    businessType: 'blog', // crawl article type. only support 'blog' now.
    sleepTime: null, // Unit is: ms. sleep time when you crawling the data, it may save your ip from blocking.
    supportImageType: ['jpg', 'png', 'jpeg', 'webp', 'gif', 'mp4', 'bmp', 'svg'], // support uplaod image
    imagePrefixName: 'crawl-', // upload image name prefix
    contentNodeIdentify: '#article_content', // the html id name in article node
    qiniu: {
        zone: '', // Your qiniu cloud zone
        scope: '', // Your qiniu scope name. Storage name.
        useHttpsDomain: true, // like what you see. this is https setting
        useCdnDomain: true, // config your cdn domain, it use on Article List Image
        baseQiNiuCdnApi: '', // you CDN domain name
        remoteFilePath: '/openStatic', // the folder path where you want to save img
        isNeedWaterMark: false, // if `true`, you will need to offer qiniu image style name, write it below:
        imageStyleSplitQuote: '&', // the quote you use in image src link like: https://qiniu.com/asd.png&scale-my-img
        imageStyleName: '', // your qiniu style name
        accessKey: '', // Qiniu cloud accessKey
        secretKey: '', // Qiniu secretKey
        imageBaseAlt: '' // image base alt message prefix
    }
}

2、开始使用csdnCrawler / Fly your code now.

// You can find this code on `./demo.js`
const csdnCrawler = require('./index')
const exampleOptions = {
    username: 'weixin_45534242',
    page: 1,
    size: 5,
    link: '',
    businessType: 'blog',
    sleepTime: null, // Unit is: ms
    supportImageType: ['jpg', 'png', 'jpeg', 'webp', 'gif', 'mp4', 'bmp', 'svg'],
    imagePrefixName: 'crawl-',
    contentNodeIdentify: '#article_content',
    qiniu: {}
}

csdnCrawler(exampleOptions, data => {
    console.log(data)
    console.log(`==============================`)
    console.log(`===  Demo Crawl Succeed !!!===`)
    console.log(`==============================`)
    console.log(`Total Data length : ${data.length}`)
})

再次警告 / FBI WARN AGAIN( to save me from trouble)

Don't use this for bad purpose.
It may cause something bad result in CN(Maybe break the law...) and will drive you crazy.
Plz only use this for testing and study purpose.

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
lib		lib
.babelrc		.babelrc
.eslintignore		.eslintignore
.eslintrc.js		.eslintrc.js
.gitignore		.gitignore
.npmignore		.npmignore
LICENSE		LICENSE
README.md		README.md
demo.js		demo.js
index.d.ts		index.d.ts
index.js		index.js
package-lock.json		package-lock.json
package.json		package.json
webpack.config.js		webpack.config.js
webpack.demo.config.js		webpack.demo.config.js

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

A Nodejs Crawler for crawling user's article from csdn.com.

为什么写这个？ / Why would I code this?

实现原理 / How to fuck this site

使用指南 / Usages

1、Fill you own config

2、开始使用csdnCrawler / Fly your code now.

再次警告 / FBI WARN AGAIN( to save me from trouble)

About

Releases

Packages

Languages

License

zeeklog/csdn-crawler

Folders and files

Latest commit

History

Repository files navigation

A Nodejs Crawler for crawling user's article from csdn.com.

为什么写这个？ / Why would I code this?

实现原理 / How to fuck this site

使用指南 / Usages

1、Fill you own config

2、开始使用csdnCrawler / Fly your code now.

再次警告 / FBI WARN AGAIN( to save me from trouble)

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages