GitHub - hectorleiva/simple-xml-scraper: Simple XML Scraper to save remote XMLs into local XML files (other formats in progress)

Simple Node XML Scraper

A very simple XML scraper that searches for all the <loc></loc> tags within an XML sitemap index. It then performs an HTTP GET for each of those links, one by one. The response for each link is crawled and eventually saved into a separate local file (XML by default).
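The first step described above, collecting the <loc> links from a sitemap index, can be sketched roughly as follows. This is an illustration only and not the code in crawler.js: the function name extractLocUrls and the sample XML are hypothetical, and a regular expression is used in place of a real XML parser, so CDATA sections and XML entities such as &amp; are not handled.

```javascript
// Hypothetical helper (not taken from the repository): collect the text
// content of every <loc>...</loc> element in a sitemap index document.
function extractLocUrls(xml) {
  const urls = [];
  // Capture everything between <loc> and </loc>, ignoring surrounding whitespace.
  const pattern = /<loc>\s*([^<]+?)\s*<\/loc>/gi;
  let match;
  while ((match = pattern.exec(xml)) !== null) {
    urls.push(match[1]);
  }
  return urls;
}

// Made-up sitemap index used only for this example.
const sampleIndex = `<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>http://example.com/sitemap-1.xml</loc></sitemap>
  <sitemap><loc> http://example.com/sitemap-2.xml </loc></sitemap>
</sitemapindex>`;

console.log(extractLocUrls(sampleIndex));
```

Each URL returned this way would then be fetched with an HTTP GET and its response written to its own file.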

Set-up

Run npm install to install all the dependencies, then start the scraper with the URL of the sitemap index:

node app.js --sitemap_index_url=http://www.nytimes-se.com/nytse/sitemap.xml
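For readers curious how flags of the form --name=value can be read in Node, here is a minimal sketch. It is an assumption about the general approach, not the parsing code in app.js, which may well rely on an argument-parsing library instead. The function name parseFlags is hypothetical.

```javascript
// Hypothetical helper (not taken from the repository): turn arguments of the
// form --name=value into a plain object. Arguments in any other form are ignored.
function parseFlags(argv) {
  const flags = {};
  for (const arg of argv) {
    // Split on the first "=" only, so values such as URLs may contain "=".
    const match = /^--([^=]+)=(.*)$/.exec(arg);
    if (match) {
      flags[match[1]] = match[2];
    }
  }
  return flags;
}

// process.argv holds the node binary and the script path first, so skip two entries.
const flags = parseFlags(process.argv.slice(2));
console.log(flags);
```

With the command above, the resulting object would have a sitemap_index_url property holding the sitemap URL.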

Cron

This Node application includes an internal cron job, which is scheduled by passing a regular cron expression to the cron_schedule= flag in the CLI command. The following command scrapes the specified sitemap at minute 30 of every hour.

node app.js --sitemap_index_url=http://www.nytimes-se.com/nytse/sitemap.xml --cron_schedule="30 * * * *"
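To make the meaning of "30 * * * *" concrete, the sketch below checks whether a given time matches a five-field cron expression (minute, hour, day of month, month, day of week). It is a simplified illustration, not the scheduler the application uses: the function name matchesCron is hypothetical, only * and plain integers are supported (no ranges, lists, or steps), and all five fields must match, whereas standard cron treats restricted day-of-month and day-of-week fields as alternatives.

```javascript
// Hypothetical helper (not taken from the repository): test a Date against a
// five-field cron expression that uses only "*" or plain integers.
function matchesCron(expression, date) {
  const fields = expression.trim().split(/\s+/);
  if (fields.length !== 5) {
    throw new Error('Expected 5 cron fields: minute hour day-of-month month day-of-week');
  }
  // Values of the date in the same order as the cron fields (local time).
  const actual = [
    date.getMinutes(),
    date.getHours(),
    date.getDate(),
    date.getMonth() + 1, // JavaScript months are 0-based, cron months are 1-based
    date.getDay(),       // 0 = Sunday
  ];
  return fields.every((field, i) => field === '*' || Number(field) === actual[i]);
}

// 09:30 matches "30 * * * *"; 09:31 does not.
console.log(matchesCron('30 * * * *', new Date(2024, 0, 15, 9, 30)));
console.log(matchesCron('30 * * * *', new Date(2024, 0, 15, 9, 31)));
```

Under this reading, the schedule in the command above fires once per hour, at half past.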

Saving

By default, files are saved in XML format. CSV output is planned.

Repository files (44 commits in history)

spec
.gitignore
Makefile
README.md
app.js
crawler.js
filesystem.js
messages.js
package.json
