
A Node.js crawler for csdn.com that fetches all of a CSDN user's articles. The code is for testing and learning only; please do not use it for improper purposes.



zeeklog/csdn-crawler


A Node.js crawler for fetching a user's articles from csdn.com.

Node.js only; it does not work in the browser.

  • Provide options.username to get that user's article list (default length is 5); see the minimal sketch after this list.
  • Uploads each article's images to your own Qiniu Cloud storage when you provide the options.qiniu config (an object).
  • Provide options.page and options.size to control the page index and page size passed to the API.
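
A minimal call might look like the sketch below. It assumes that options not passed here fall back to the defaults shown in the Usage section; the full option list and demo live in `./demo.js`.

// Minimal sketch; see `./demo.js` for the full example
const csdnCrawler = require('./index')

csdnCrawler({
    username: 'weixin_45534242', // target username
    page: 1,
    size: 5,
    qiniu: {} // leave empty if you do not want images uploaded to Qiniu
}, articles => {
    // `articles` is the crawled article list (default length is 5)
    console.log(articles.length)
})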

Why would I code this?

  • I wanted some data to fill my database for big-data testing, but writing it all by hand seemed too tedious (because I am lazy).

  • Many other coders probably face the same problem, so this tool makes the job easier.

  • WARNING: This repo is only for testing and study; do not use it to run pressure tests against csdn.com. And CSDN sucks!

How it works

# dependencies
 cheerio
 html-to-md
 pinyin
 request-promise

 # Fetch the target document with request-promise
 # Parse the HTML document with cheerio and extract the article content
 # Convert the HTML content to Markdown with html-to-md
 # Generate the article alias with pinyin

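Putting those four steps together, a minimal sketch of the pipeline could look like the following. The title selector and the pinyin v3 API are illustrative assumptions, not this repo's actual internals; see `./index.js` for the real implementation.

// Sketch of the crawl pipeline; the selectors and the pinyin call are assumptions
const rp = require('request-promise')
const cheerio = require('cheerio')
const html2md = require('html-to-md')
const { pinyin } = require('pinyin') // pinyin v3 named export; v2 exports the function directly

async function crawlArticle(url) {
    // 1. Fetch the target document with request-promise
    const html = await rp(url)

    // 2. Parse the HTML with cheerio and extract the article content
    const $ = cheerio.load(html)
    const title = $('.title-article').text().trim() // assumed title selector
    const contentHtml = $('#article_content').html() || '' // matches contentNodeIdentify below

    // 3. Convert the article HTML to Markdown with html-to-md
    const markdown = html2md(contentHtml)

    // 4. Generate the article alias from the title with pinyin
    const alias = pinyin(title, { style: 'normal' }).flat().join('-')

    return { title, alias, markdown }
}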

Usage

1. Fill in your own config

// Example:
const options = {
    username: 'weixin_45534242', // target username
    page: 1, // the page index you are crawling
    size: 5, // page size
    link: '', // the user-center article-list API; you can find it on csdn.com via the browser devtools (F12)
    businessType: 'blog', // article type to crawl; only 'blog' is supported for now
    sleepTime: null, // in ms; sleep time between requests, which may keep your IP from being blocked
    supportImageType: ['jpg', 'png', 'jpeg', 'webp', 'gif', 'mp4', 'bmp', 'svg'], // file types allowed for upload
    imagePrefixName: 'crawl-', // prefix for uploaded image names
    contentNodeIdentify: '#article_content', // the id selector of the article content node
    qiniu: {
        zone: '', // your Qiniu Cloud zone
        scope: '', // your Qiniu scope (storage/bucket) name
        useHttpsDomain: true, // whether to use the HTTPS domain
        useCdnDomain: true, // whether to use the CDN domain; used for article list images
        baseQiNiuCdnApi: '', // your CDN domain name
        remoteFilePath: '/openStatic', // the folder path where uploaded images are saved
        isNeedWaterMark: false, // if `true`, you need to provide a Qiniu image style name; set it below
        imageStyleSplitQuote: '&', // the separator used in the image src link, e.g. https://qiniu.com/asd.png&scale-my-img
        imageStyleName: '', // your Qiniu image style name
        accessKey: '', // Qiniu Cloud accessKey
        secretKey: '', // Qiniu secretKey
        imageBaseAlt: '' // prefix for the image alt text
    }
}

2. Start csdnCrawler / Fly your code now.

// You can find this code in `./demo.js`
const csdnCrawler = require('./index')
const exampleOptions = {
    username: 'weixin_45534242',
    page: 1,
    size: 5,
    link: '',
    businessType: 'blog',
    sleepTime: null, // Unit is: ms
    supportImageType: ['jpg', 'png', 'jpeg', 'webp', 'gif', 'mp4', 'bmp', 'svg'],
    imagePrefixName: 'crawl-',
    contentNodeIdentify: '#article_content',
    qiniu: {}
}

csdnCrawler(exampleOptions, data => {
    console.log(data)
    console.log(`==============================`)
    console.log(`===  Demo Crawl Succeed !!!===`)
    console.log(`==============================`)
    console.log(`Total Data length : ${data.length}`)
})

Warning, again (to save me from trouble)

  • Don't use this for bad purposes.
  • Misusing it may have bad consequences in China (it may even break the law) and will drive you crazy.
  • Please use this only for testing and study purposes.
