
How to build and test a search engine parser

g4jc edited this page Jul 10, 2014 · 1 revision

The Seeks websearch plugin comes with support for a few existing search engines, and it also has generic code for building new search engine parsers.

If you'd like to construct a parser for an additional search engine, this is very easy (hum).


Preparing the field

Go to seeks/src/plugins/websearch/. Copy an existing parser (e.g. se_parser_bing.cpp and se_parser_bing.h) to create the files for your parser. Then edit Makefile.am and add your parser to it, the same way as the other parsers. Finally, edit websearch_configuration.h and add your parser to the list of defines; the number on each new line should be double the number on the previous line.
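The doubling rule suggests that the defines are used as bit flags, so that several engines can be combined in a single value. Here is a minimal sketch of that idea. The macro names, the values and the engine_enabled helper are made up for illustration; check websearch_configuration.h for the real list and continue from its last value.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names and values, NOT the real Seeks defines.
#define SE_EXAMPLE_A  1U  // existing engine
#define SE_EXAMPLE_B  2U  // existing engine
#define SE_EXAMPLE_C  4U  // existing engine
#define SE_MY_ENGINE  8U  // your new engine: double the previous value

// Each value must occupy its own bit, otherwise two engines would collide.
static_assert((SE_MY_ENGINE & (SE_EXAMPLE_A | SE_EXAMPLE_B | SE_EXAMPLE_C)) == 0,
              "the new engine must not reuse an existing bit");

// Returns true when `engine` is set in the `enabled` bitmask.
inline bool engine_enabled(std::uint32_t enabled, std::uint32_t engine)
{
  return (enabled & engine) != 0;
}
```

If the values were not powers of two, enabling one engine could silently enable another, which is presumably why each new line doubles the number.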

Not sure, but you may also need to hack se_handler.h to add your parser there too.

Preparing the test

Go to seeks/proxy/src/proxy/tests/. Use test_curl_mget to download a sample of the page you are going to parse.

Example: ./test_curl_mget 1 http://search.blah.net/search?q=blabla&... The numeric argument (1 here) is the number of times you want to download the page. If the URL contains &, put it in quotes so that the shell passes it through intact.

Save the output to a file, then move this file to seeks/src/plugins/websearch/tests/.

There, copy test-bing-parser.cpp to a new file for your parser. Edit this file and replace bing with the name of your parser.

Edit Makefile.am and add the test for your parser to it.

Preparing the parser

Edit the files (.cpp and .h) of your parser in seeks/src/plugins/websearch/. Replace the bing keyword (or the equivalent for the parser you copied) with the keyword of your parser.

Comment out the code inside the methods.

Launch make. Pray. If this doesn't compile, find out why and edit this page (or come and talk with us on IRC). If everything goes fine, breathe, you are now ready to hack.

Hack

Okay, now the game begins. Open up se_parser_yourwebsite.cpp and the page you downloaded with test_curl_mget. The objective is to find the pattern of the result snippet in this HTML file and to hack se_parser_yourwebsite so that it finds it.

This is an event-based parser: each time it encounters an opening tag, a closing tag or the content of a tag, it calls the corresponding method. You should play with some boolean attributes to indicate where you are in the snippet (declare them in se_parser_yourwebsite.h and initialise them in the constructor). The important part is the creation of your snippet; you should already have the code for it from the copied parser (a big if with a lot of ||).
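To make the boolean trick concrete, here is a small self-contained sketch. It is not the real Seeks parser interface: the class toy_result_parser, its method signatures and the li/h3 page layout are all assumptions chosen for illustration. The idea is only that each callback sees one event at a time, so the flags are what remember where you are in the page.

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-in for a parser class, NOT the real Seeks base class.
// Assumed page layout: each result is an <li> whose title sits in an <h3>.
class toy_result_parser
{
  public:
    toy_result_parser() : _in_result(false), _in_title(false) {}

    // Called for every opening tag.
    void start_element(const char *tag)
    {
      if (std::strcmp(tag, "li") == 0)
        _in_result = true;
      else if (_in_result && std::strcmp(tag, "h3") == 0)
        _in_title = true;  // only titles inside a result count
    }

    // Called for text content; may fire several times for one tag.
    void characters(const char *text)
    {
      if (_in_title)
        _title += text;
    }

    // Called for every closing tag.
    void end_element(const char *tag)
    {
      if (std::strcmp(tag, "h3") == 0)
        _in_title = false;
      else if (_in_result && std::strcmp(tag, "li") == 0)
        {
          if (!_title.empty())
            _titles.push_back(_title);  // the result is complete
          _title.clear();
          _in_result = false;
        }
    }

    std::vector<std::string> _titles;  // titles collected so far

  private:
    bool _in_result;  // true between <li> and </li>
    bool _in_title;   // true between <h3> and </h3> inside a result
    std::string _title;
};
```

Feeding it the events for an h3 outside any li collects nothing, while li, h3, text, /h3, /li yields one title. Text is accumulated with += because an event-based parser may deliver the content of one tag in several pieces.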

To test your parser, run the test you created in seeks/src/plugins/websearch/tests/ on the file downloaded with test_curl_mget. You should get the expected number of snippets and no errors.

Happy hacking.

Here is a very simple example for Twitter: se_parser_twitter.cpp (not included in Seeks for the moment).

It looks for an <entry> tag, then a <title> tag, and grabs what is inside <title></title>. It then looks for <link>, grabs its href="" attribute, and pushes the snippet when it encounters </entry>.
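The walk-through above can be sketched as follows. This is not the actual se_parser_twitter.cpp: the types toy_snippet and toy_atom_parser are hypothetical, and attributes are assumed to arrive as a NULL-terminated array of name/value pairs, in the style of libxml2's SAX startElement callback.

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical snippet type, NOT the real Seeks snippet class.
struct toy_snippet
{
  std::string title;
  std::string url;
};

// Hypothetical parser for <entry><title>..</title><link href=".."/></entry>.
class toy_atom_parser
{
  public:
    toy_atom_parser() : _in_entry(false), _in_title(false) {}

    // attrs: name, value, name, value, ..., NULL (or NULL when there are none).
    void start_element(const char *tag, const char **attrs)
    {
      if (std::strcmp(tag, "entry") == 0)
        {
          _in_entry = true;
          _current = toy_snippet();  // start a fresh snippet
        }
      else if (_in_entry && std::strcmp(tag, "title") == 0)
        _in_title = true;
      else if (_in_entry && std::strcmp(tag, "link") == 0 && attrs)
        {
          for (int i = 0; attrs[i] && attrs[i + 1]; i += 2)
            if (std::strcmp(attrs[i], "href") == 0)
              _current.url = attrs[i + 1];
        }
    }

    // Text content: only kept while inside <title>.
    void characters(const char *text)
    {
      if (_in_title)
        _current.title += text;
    }

    void end_element(const char *tag)
    {
      if (std::strcmp(tag, "title") == 0)
        _in_title = false;
      else if (_in_entry && std::strcmp(tag, "entry") == 0)
        {
          _in_entry = false;
          // Push only complete snippets; incomplete ones are dropped.
          if (!_current.title.empty() && !_current.url.empty())
            _snippets.push_back(_current);
        }
    }

    std::vector<toy_snippet> _snippets;

  private:
    bool _in_entry;
    bool _in_title;
    toy_snippet _current;
};
```

Dropping entries that lack a title or a URL is a design choice of this sketch; it keeps half-parsed results out of the list when the page layout does not match what the parser expects.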