DacqPipe

DacqPipe (Data Acquisition Pipeline) is a large-scale RSS data acquisition tool. It consists of a series of components that interoperate to acquire and prepare Web documents for further analysis.

Clone and Build

Clone DacqPipe from the Git repository into, for example, C:\Work\DacqPipe:

git clone https://github.com/SowaLabs/DacqPipe.git C:\Work\DacqPipe

Clone the dependencies:

Clone LATINO into C:\Work\LATINO (see the LATINO readme file for more details):

git clone https://github.com/LatinoLib/LATINO.git C:\Work\LATINO

Clone LATINO Workflows into C:\Work\LatinoWorkflows:

git clone https://github.com/SowaLabs/LATINO-Workflows.git C:\Work\LatinoWorkflows

Clone SemWeb into C:\Work\SemWeb:

git clone https://github.com/SowaLabs/SemWeb.git C:\Work\SemWeb

Clone SharpNLP into C:\Work\SharpNLP:

git clone https://github.com/SowaLabs/SharpNLP.git C:\Work\SharpNLP

Open the solution file (C:\Work\DacqPipe\DacqPipe.sln) in Visual Studio.
Build the solution.

Configure and Run

Copy the contents of C:\Work\DacqPipe\DacqPipe\bin\Release to a deployment folder (for example, C:\DacqPipe).
To configure DacqPipe, edit the file DacqPipe.exe.config (located in the deployment folder) with a text editor. The configuration file contains a set of key-value pairs in the form <add key="..." value="..."/>. The following table lists and explains the supported configuration keys:

Key	Description	Default value
logFileName	The name of the log file to which DacqPipe writes events (important mainly for debugging).	Not set
xmlDataRoot	The location where the acquired (accepted) documents are stored in XML format.	Data
xmlDataDumpRoot	The location where the rejected documents (mostly duplicates) are stored in XML format.	Not set
htmlDataRoot	The location where the acquired (accepted) documents are stored in their original HTML form.	DataHtml
htmlDataDumpRoot	The location where the rejected documents are stored in their original HTML form.	Not set
htmlViewRoot	The location where the previews of the acquired (accepted) documents are stored. A preview is an HTML page displaying content, annotations, and metadata of the corresponding Web document.	Not set
dataSourcesFileName	The name of the file containing RSS sources to be polled for content.	RssSources.txt
dbConnectionString	The string containing information required to connect to the DacqPipe database.	Server=127.0.0.1; Port=5432; Database=DacqPipe; Integrated Security=true;
language	The language in which the acquired (accepted) documents are written. Note that setting this to a language other than English turns off the NLP part of the pipeline.	English
numPipes	The number of parallel pipelines between which load balancing is performed. You should increase this if you see the RAM consumption constantly increasing (the queues are filling up). If this does not work, your system most likely does not have enough processing resources.	2
sleepBetweenPolls	The amount of time an RSS reader waits before polling its RSS feeds again from the start.	00:15:00
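
To illustrate the key-value layout, the following Python sketch reads the settings from a configuration file. It assumes the usual .NET layout in which the <add> elements sit inside <configuration>/<appSettings>. The sample values and the read_settings helper are hypothetical and are not part of DacqPipe.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; assumes the usual .NET <configuration>/<appSettings> layout.
SAMPLE_CONFIG = """<configuration>
  <appSettings>
    <add key="xmlDataRoot" value="Data"/>
    <add key="dataSourcesFileName" value="RssSources.txt"/>
    <add key="numPipes" value="4"/>
    <add key="sleepBetweenPolls" value="00:15:00"/>
  </appSettings>
</configuration>
"""


def read_settings(config_xml: str) -> dict:
    """Return the <add key="..." value="..."/> pairs as a dictionary."""
    root = ET.fromstring(config_xml)
    return {
        node.get("key"): node.get("value", "")
        for node in root.iter("add")
        if node.get("key") is not None
    }


settings = read_settings(SAMPLE_CONFIG)
print(settings["numPipes"])
```

All values come back as strings, so a caller would still need to convert numeric or time-span settings such as numPipes and sleepBetweenPolls.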

Create the file with RSS sources. The name of this file is specified with the dataSourcesFileName configuration parameter. The file contains one list of RSS sources for each Web site. Each list starts with a site identifier (e.g., "Site: cnn"), followed by the URLs of the RSS sources, each on its own line. A list ends at the next site identifier or at the end of the file. Lines starting with "#" are comments and are ignored by DacqPipe. The following is an example of such a file:

Site: cnn
# Site: http://edition.cnn.com/
# RSS list: http://edition.cnn.com/services/rss/
http://rss.cnn.com/rss/edition.rss
http://rss.cnn.com/rss/edition_asia.rss
http://rss.cnn.com/rss/edition_europe.rss
http://rss.cnn.com/rss/edition_us.rss
http://rss.cnn.com/rss/edition_world.rss
http://rss.cnn.com/rss/edition_africa.rss
http://rss.cnn.com/rss/edition_americas.rss
http://rss.cnn.com/rss/edition_meast.rss
http://rss.cnn.com/rss/edition_business.rss
http://rss.cnn.com/rss/edition_technology.rss
http://rss.cnn.com/rss/edition_space.rss
http://rss.cnn.com/rss/edition_entertainment.rss
http://rss.cnn.com/rss/edition_sport.rss
http://rss.cnn.com/rss/edition_football.rss
http://rss.cnn.com/rss/edition_travel.rss
http://rss.cnn.com/rss/cnn_freevideo.rss
http://rss.cnn.com/rss/cnn_latest.rss
http://rss.cnn.com/rss/edition_business360.rss
http://rss.cnn.com/rss/edition_connecttheworld.rss
http://rss.cnn.com/rss/edition_questmeansbusiness.rss
http://rss.cnn.com/rss/edition_worldsportblog.rss
http://rss.cnn.com/rss/edition_golf.rss
http://rss.cnn.com/rss/edition_motorsport.rss
http://rss.cnn.com/rss/edition_tennis.rss
http://afghanistan.blogs.cnn.com/feed/
http://news.blogs.cnn.com/feed/

Site: mirror
# Site: http://www.mirror.co.uk/
http://www.mirror.co.uk/rss.xml

Site: spiegel
# Site: http://www.spiegel.de/international/
# RSS list: http://www.spiegel.de/international/0,1518,643192,00.html
http://www.spiegel.de/schlagzeilen/index.rss
http://www.spiegel.de/international/index.rss
http://www.spiegel.de/international/germany/index.rss
http://www.spiegel.de/international/europe/index.rss
http://www.spiegel.de/international/world/index.rss
http://www.spiegel.de/international/business/index.rss
http://www.spiegel.de/international/zeitgeist/index.rss
http://www.spiegel.de/schlagzeilen/tops/index.rss
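
The file format described above can be read with a few lines of code. The following Python sketch is only an illustration of the format, not DacqPipe's own parser. The parse_rss_sources helper is hypothetical, and ignoring any URL that appears before the first site identifier is an assumption.

```python
def parse_rss_sources(text: str) -> dict:
    """Map each site identifier to the list of RSS URLs that follow it."""
    sites = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments are skipped
        if line.startswith("Site:"):
            current = line[len("Site:"):].strip()
            sites.setdefault(current, [])
        elif current is not None:
            sites[current].append(line)
        # Assumption: URLs before the first site identifier are ignored.
    return sites


SAMPLE = """\
Site: cnn
# Site: http://edition.cnn.com/
http://rss.cnn.com/rss/edition.rss
http://rss.cnn.com/rss/edition_asia.rss

Site: mirror
# Site: http://www.mirror.co.uk/
http://www.mirror.co.uk/rss.xml
"""

sources = parse_rss_sources(SAMPLE)
for site, urls in sources.items():
    print(site, len(urls))
```

Note that a line such as "# Site: http://edition.cnn.com/" is treated as a comment because it starts with "#", so it does not open a new list.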

Create the database:
1. Start pgAdmin.
2. Create a new database.
3. Run the script PgCreateTables.sql (contained in C:\Work\DacqPipe\DacqPipe\DB) on the newly created database.
4. Make sure that the database connection string is set correctly in DacqPipe.exe.config.
Execute DacqPipe.exe. DacqPipe starts as a console-mode application. The console displays activity and error messages. The same messages are written into a log file if logging is enabled.

DacqPipe is shut down by pressing Ctrl-C. The message "Ctrl-C command received." appears in the console. Note that DacqPipe needs some time to shut down gracefully as it needs to finalize the processing of document queues.

Acquired Data

Documents acquired with DacqPipe are internally stored as annotated document objects. An annotated document is described with features and contains annotations. An annotation gives a special meaning to a text segment (e.g., boilerplate, token, sentence) and can further be described with features.

DacqPipe stores acquired documents into files. The corresponding metadata is stored into the database. The database structure is very simple, containing practically only one table called Documents. Each record corresponds to one acquired (accepted) document. Apart from the metadata, a record contains the reference to the corresponding data files.

Each acquired document can be stored as a compressed XML file (.xml.gz), a compressed HTML file (.html.gz), and/or an HTML preview. The HTML files are the original documents acquired from the Web, whereas the XML files contain the extracted and annotated content with additional metadata (features). A preview is an HTML page displaying content, annotations, and metadata of the corresponding Web document.

Each of these three datasets is stored into a separate root folder in which DacqPipe creates a separate folder for each day (e.g., <xmlDataRoot>\2011\09\08\ would be created on September 8, 2011) and assigns unique names to data files. The name of a file consists of a time stamp and the document identifier (e.g., <xmlDataRoot>\2011\09\08\14_29_33_c9bef21a1d4f4e4db0c82624d5b741bb.xml.gz). Note that the time stamp (the first 8 characters in the file name, i.e., hh_mm_ss) represents the acquisition time and not the publication time.
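
The naming scheme makes it possible to recover the acquisition time and the document identifier from a path alone. The following Python sketch shows one way to do this for the example above. The parse_data_file_path helper is hypothetical, and the resulting time carries no time zone because the scheme does not encode one.

```python
from datetime import datetime
from pathlib import PureWindowsPath


def parse_data_file_path(path: str):
    """Split a data file path into acquisition time and document identifier."""
    p = PureWindowsPath(path)
    year, month, day = p.parts[-4:-1]       # ...\yyyy\MM\dd\<file>
    stem = p.name.split(".", 1)[0]          # drop ".xml.gz" or ".html.gz"
    time_part, doc_id = stem[:8], stem[9:]  # "hh_mm_ss", then "_", then identifier
    acquired = datetime.strptime(
        f"{year}-{month}-{day} {time_part}", "%Y-%m-%d %H_%M_%S"
    )
    return acquired, doc_id


acquired, doc_id = parse_data_file_path(
    r"Data\2011\09\08\14_29_33_c9bef21a1d4f4e4db0c82624d5b741bb.xml.gz"
)
print(acquired.isoformat(), doc_id)
```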

Advanced Config

You can configure the following advanced settings in DacqPipe.exe.config:

Key	Description	Default value
maxDocsPerCorpus	Specifies how many acquired documents are bundled in a document corpus that is passed between the pipeline components. Smaller document corpora are more suitable for effective load balancing. On the other hand, larger document corpora are better for solving the cold start problem in the boilerplate removal process.	50
randomDelayAtStart	Specifies whether each RSS reader component should sleep for some amount of time before making the first request. Set this to "yes" if you experience problems with network traffic or simultaneous requests at startup.	no
rssReaderDefaultRssXmlEncoding	Specifies the default encoding of retrieved RSS XML documents (used when encoding is not specified in the header or HTTP response).	ISO-8859-1
rssReaderDefaultHtmlEncoding	Specifies the default encoding of retrieved HTML documents (used when encoding is not specified in the header or HTTP response).	ISO-8859-1
urlRulesFileName	Points to the file containing URL normalization rules required for boilerplate removal (see below*).	Not set
urlBlacklistFileName	Points to the file specifying URLs from which the content should be rejected (see below**).	Not set

* Example of a rule-set file (the first part of each line is a regex against which the URL is matched; the second part lists the URL query parameter or parameters that should be retained in order to correctly form a unique URL key):

http://www\.cbsnews\.com:80.*?/watch    id
http://abcnews\.go\.com:80  id
http://www\.boston\.com:80.*?/video bctid
http://www\.marketwatch\.com:80.*?/story    Guid
http://home\.nzcity\.co\.nz:80.*?/article\.aspx id
http://www\.nzherald\.co\.nz:80.*?/article\.cfm objectid
http://www\.politicsweb\.co\.za:80  oid
http://espn\.go\.com:80 id
http://members\.morningstar\.com:80.*?/Default\.aspx    vurl
http://www\.jpost\.com:80.*?/Article\.aspx  id
http://www\.sfgate\.com:80.*?/article\.cgi  f
http://www\.dailytimes\.com\.pk:80/default\.asp page
http://mlb\.mlb\.com:80.*?/article\.jsp content_id
http://www\.fitchratings\.com:80.*?/detail\.cfm pr_id
http://market-ticker\.org:80/akcs-www   post
http://www\.skynews\.com\.au:80.*?/article\.aspx    id
http://www\.eyewitnessnews\.co\.za:80/Story\.aspx   Id
http://www\.rotoworld\.com:80.*?/playerbreakingnews\.asp    id  sport
http://celebs\.gather\.com:80/viewArticle\.action   articleId
http://www\.9and10news\.com:80.*?/Story id
http://sports\.yahoo\.com:80.*?/news    slug
http://bbs\.chinadaily\.com\.cn:80/viewthread\.php  tid
http://news\.businessweek\.com:80/article\.asp  documentKey
http://www\.businessday\.co\.za:80.*?/Content\.aspx id
http://www\.daijiworld\.com:80.*?/news_disp\.asp    n_id
http://www\.taiwannews\.com\.tw:80.*?/news_content\.php id
http://bostonherald\.com:80.*?/view\.bg articleid
http://www\.newstalkzb\.co\.nz:80/newsdetail1\.asp  storyid
http://www\.newstalkzb\.co\.nz:80/newsdetail1\.asp  storyID
http://pakobserver\.net:80/detailnews\.asp  id
http://news\.morningstar\.com:80.*?/article\.aspx   id
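
The following Python sketch shows one plausible reading of how such rules could be applied. It is not DacqPipe's actual implementation. It assumes that the regex is matched from the start of a URL that carries an explicit port, that parameter names are case-sensitive (the list above contains both storyid and storyID), and that a URL without a matching rule is left unchanged. The parse_url_rules and url_key helpers are hypothetical.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def parse_url_rules(text: str):
    """Parse lines of the form '<regex> <param> [<param> ...]'."""
    rules = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            rules.append((re.compile(fields[0]), set(fields[1:])))
    return rules


def url_key(url: str, rules) -> str:
    """Keep only the query parameters named by the first matching rule."""
    for pattern, keep in rules:
        if pattern.match(url):
            parts = urlsplit(url)
            kept = [(k, v) for k, v in parse_qsl(parts.query) if k in keep]
            return urlunsplit(parts._replace(query=urlencode(kept)))
    return url  # assumption: URLs without a matching rule are left unchanged


RULES = parse_url_rules(r"""
http://abcnews\.go\.com:80  id
http://www\.rotoworld\.com:80.*?/playerbreakingnews\.asp    id  sport
""")

print(url_key("http://abcnews.go.com:80/story?id=123&page=2", RULES))
```

With these rules, two URLs that differ only in parameters that are not retained (such as page) produce the same key.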

** Example of a blacklist file:

http://www.hulu.jp:80
http://www.clubmed-jp.com:80
http://www.u-tokai.ac.jp:80
http://ads.pheedo.com:80
http://consultant.en-japan.com:80
http://japan.cnet.com:80
http://jp.fujitsu.com:80
http://membership.ft.com:80
http://special.nikkeibp.co.jp:80
http://www.lit.nagoya-u.ac.jp:80
http://www.luther.ac.jp:80
http://www.meijo-u.ac.jp:80
http://www.nhc.noaa.gov:80
http://www.nvlu.ac.jp:80
https://home.modernhealthcare.com:443
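
The following Python sketch shows how a URL could be checked against such a list. It assumes that each entry rejects every URL with the same scheme, host, and port, and that a missing port defaults to 80 for http and 443 for https. DacqPipe's actual matching logic may differ. The origin_with_port and is_blacklisted helpers are hypothetical.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}  # only these two schemes are handled


def origin_with_port(url: str) -> str:
    """Return 'scheme://host:port', filling in the default port if omitted."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS[parts.scheme]
    return f"{parts.scheme}://{parts.hostname}:{port}"


def is_blacklisted(url: str, blacklist) -> bool:
    """Assumption: an entry rejects every URL with the same scheme, host, and port."""
    return origin_with_port(url) in blacklist


BLACKLIST = {
    "http://ads.pheedo.com:80",
    "https://home.modernhealthcare.com:443",
}

print(is_blacklisted("http://ads.pheedo.com/click?id=1", BLACKLIST))
print(is_blacklisted("http://rss.cnn.com/rss/edition.rss", BLACKLIST))
```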

License

Most of DacqPipe is under the MIT license. However, certain parts and/or dependencies fall under other licenses. See LICENSE.txt for more details.
