# Nutch Change Log
Nutch 1.17 Development
Nutch 1.16 Release 02/10/2019 (dd/mm/yyyy)
Release Report: https://s.apache.org/l2j94
Comments
- schema.xml has been moved to indexer-solr plugin directory. This file is provided as a
reference/guide for Solr users (NUTCH-2654)
Breaking Changes
- The value of crawl.gen.delay is now read in milliseconds as stated in the description
in nutch-default.xml. Previously, the value was read in days; see NUTCH-1842 for
further information.
- HostDB entries have been moved from Integer to Long in order to accommodate very large
hosts. Remove your existing HostDB and recreate it with bin/nutch updatehostdb; see
NUTCH-2694 for additional information.
- The signature class TextProfileSignature has been improved to be stable over
consecutive runs by sorting tokens by frequency first and secondarily in lexicographic
order. If an existing CrawlDb contains signatures generated by TextProfileSignature,
these are likely to change when upgrading to Nutch 1.16. The previous behavior, which relied
on a semi-stable pseudo-random hash sorting, can be restored by setting the property
`db.signature.text_profile.sec_sort_lex` to `false`. See also NUTCH-2381.
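The effect of the secondary lexicographic sort described for TextProfileSignature can be illustrated with a small sketch. This is not the Nutch implementation: the tokenizer, the function names `sorted_profile` and `profile_signature`, and the use of MD5 over a "token count" listing are assumptions made for illustration only. It shows why breaking frequency ties by token text gives the same ordering, and so the same hash, on every run.

```python
import hashlib
import re
from collections import Counter


def sorted_profile(text):
    """Return (token, count) pairs, most frequent first, ties in lexicographic order."""
    # Hypothetical tokenizer: lowercase runs of letters and digits.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(tokens)
    # Primary key: descending frequency. Secondary key: the token itself,
    # so tokens with equal frequency always come out in the same order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def profile_signature(text):
    """Hash the sorted profile; MD5 is used here only for illustration."""
    profile = "\n".join(f"{token} {count}" for token, count in sorted_profile(text))
    return hashlib.md5(profile.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    a = "crawl fetch parse crawl index fetch crawl"
    b = "index parse fetch crawl fetch crawl crawl"  # same tokens, different order
    print(sorted_profile(a))
    print(profile_signature(a) == profile_signature(b))
```

Without the secondary key, the relative order of tokens that share a frequency would depend on incidental ordering such as hash iteration order, which is the kind of instability NUTCH-2381 reports.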
Bug
[NUTCH-1063] - OutlinkExtractor test generates an exception but does not fail
[NUTCH-1842] - crawl.gen.delay has a wrong default value in nutch-default.xml or is being parsed incorrectly
[NUTCH-2279] - LinkRank fails when using Hadoop MR output compression
[NUTCH-2381] - In some situations the class TextProfileSignature gives different signatures for the same text "profile" page.
[NUTCH-2387] - Nutch should not index document with "noindex" meta
[NUTCH-2457] - Embedded documents likely not correctly parsed by Tika
[NUTCH-2475] - If and else-if branches have the same condition
[NUTCH-2482] - index-geoip not to add null values to document fields
[NUTCH-2585] - NPE in TrieStringMatcher
[NUTCH-2598] - URLNormalizerChecker fails on invalid URLs in input
[NUTCH-2606] - MIME detection is wrong for plain-text documents sent as Content-Type "application/msword"
[NUTCH-2635] - Generator writes unneeded temporary output
[NUTCH-2639] - bin/nutch fails to set native library path on Cygwin causing jobs to fail with UnsatisfiedLinkError
[NUTCH-2641] - ClassCastException in webui
[NUTCH-2642] - MoreIndexingFilter parses ISO 8601 UTC dates in local time zone
[NUTCH-2643] - ant target "resolve-default" to depend on "init"
[NUTCH-2644] - CrawlDbReader -dump ignores filter options
[NUTCH-2645] - Webgraph tools ignore command-line options
[NUTCH-2650] - -addBinaryContent -base64 flags are causing "String length must be a multiple of four" error in IndexingJob
[NUTCH-2652] - Fetcher launches more fetch tasks than fetch lists
[NUTCH-2655] - Update Solr schema.xml for Solr 7.x
[NUTCH-2656] - Update description to configure Solr 7.x in tutorial
[NUTCH-2673] - EOFException protocol-http
[NUTCH-2674] - HostDb: dump shows wrong column headers
[NUTCH-2680] - Documentation: https supported by multiple protocol plugins not only httpclient
[NUTCH-2687] - Regex for reading title from Content-Disposition is wrong
[NUTCH-2694] - HostDB to aggregate by long instead of integer
[NUTCH-2696] - Nutch SegmentReader does not dump non-ASCII characters with Hadoop 3.x
[NUTCH-2699] - Protocol-okhttp: needless loops to increment requested bytes counter when more content is already buffered
[NUTCH-2703] - parse-tika: Boilerpipe should not run for non-(X)HTML pages
[NUTCH-2706] - -addBinaryContent flag can cause "String length must be a multiple of four" error in IndexingJob
[NUTCH-2715] - WARCExporter fails on large records
[NUTCH-2716] - protocol-http: Response headers are not stored for a compressed response
[NUTCH-2717] - Generator cannot open hostDB
[NUTCH-2722] - Fetch dependencies via https
[NUTCH-2723] - Indexer Solr not to decode URLs before deletion
[NUTCH-2724] - Metadata indexer not to emit empty values
[NUTCH-2729] - protocol-okhttp: fix marking of truncated content
[NUTCH-2731] - Solr Cleanup Step Fails when Authentication is Required
[NUTCH-2738] - Generator: document property generate.restrict.status
[NUTCH-2740] - Generator: generate.max.count overflow not logged
New Feature
[NUTCH-2676] - Update to the latest selenium and add code to use chrome and firefox headless mode with the remote web driver
Improvement
[NUTCH-1014] - Migrate from Apache ORO to java.util.regex
[NUTCH-1021] - Migrate OutlinkExtractor from Apache ORO to java.util.regex
[NUTCH-1982] - Make Git ignore IDE project files and add note about IDE setup
[NUTCH-2460] - use the headless option of firefox and chrome in protocol-selenium
[NUTCH-2602] - Configuration values in the description of index writers
[NUTCH-2612] - Support for sitemap processing by hostname
[NUTCH-2623] - Fetcher to guarantee delay for same host/domain/ip independent of http/https protocol
[NUTCH-2625] - ProtocolFactory.getProtocol(url) may create multiple plugin instances
[NUTCH-2626] - bin/crawl: remove option -noParsing from fetch command
[NUTCH-2627] - Fetcher to optionally filter URLs
[NUTCH-2628] - Fetcher: optionally generate signature of unparsed content
[NUTCH-2629] - Documentation for CSV Index Writer
[NUTCH-2630] - Fetcher to log skipped records by robots.txt
[NUTCH-2631] - KafkaIndexWriter
[NUTCH-2632] - protocol-okhttp doesn't accept proxy authentication
[NUTCH-2633] - Fix deprecation warnings when building Nutch master branch under JDK 10.0.2+13
[NUTCH-2647] - Skip TLS certificate checks in protocol-http plugin
[NUTCH-2648] - Make configurable whether TLS/SSL certificates are checked by protocol plugins
[NUTCH-2651] - Upgrade to Tika 1.19.1 (from 1.18)
[NUTCH-2653] - ProtocolFactory.getProtocol(url) creates separate plugin instances for http/https
[NUTCH-2654] - Remove obsolete index-writer configuration in conf/
[NUTCH-2657] - Protocol-http to store HTTP response header with "\r\n"
[NUTCH-2658] - Add README file to all plugins in src/plugin
[NUTCH-2659] - Add missing Apache license headers
[NUTCH-2660] - Unit tests of plugins parse-js, headings, index-jexl-filter to be executed during build
[NUTCH-2661] - Move TestOutlinks to the proper path
[NUTCH-2663] - Improve index-jexl-filter syntax for scripts
[NUTCH-2666] - Increase default value for http.content.limit / ftp.content.limit / file.content.limit
[NUTCH-2668] - Integrate OWASP dependency checks as ant target
[NUTCH-2678] - Allow for per-host configurable protocol plugin
[NUTCH-2682] - Upgrade to Tika 1.20
[NUTCH-2683] - DeduplicationJob: add option to prefer https:// over http://
[NUTCH-2686] - Separate field for mime types mapped by index-more plugin
[NUTCH-2688] - Unify the licence headers
[NUTCH-2689] - Speed up urlfilter-regex and urlfilter-automaton
[NUTCH-2690] - Configurable and fast URL filter
[NUTCH-2691] - Improve logging from scoring-depth plugin
[NUTCH-2692] - Subcollection to support case-insensitive white and black lists
[NUTCH-2693] - Misspelled configuration property names in documentation
[NUTCH-2695] - Fix some alerts raised by LGTM
[NUTCH-2700] - Indexchecker: improve command-line help
[NUTCH-2701] - Fetcher: log dates and times also in human-readable form
[NUTCH-2702] - Fetcher: suppress stack for frequent exceptions
[NUTCH-2704] - Upgrade crawler-commons dependency to 1.0
[NUTCH-2708] - urlfilter-automaton: update library dependency (dk.brics.automaton)
[NUTCH-2709] - Remove unused properties and code related to HTTP protocol
[NUTCH-2718] - Names of index writers and exchanges configuration files to be configurable
[NUTCH-2719] - NPE if exchanges.xml uses index writer not available
[NUTCH-2725] - Plugin lib-http to support per-host configurable cookies
[NUTCH-2726] - Upgrade to Tika 1.22
[NUTCH-2727] - Upgrade Hadoop dependencies to 2.9.2
[NUTCH-2728] - protocol-okhttp: upgrade okhttp dependency to 3.14.2
[NUTCH-2732] - Ignored and tracked configuration files by git
[NUTCH-2736] - Upgrade Dockerfile to be based on recent Ubuntu LTS version
[NUTCH-2737] - Generator: count and log reason of rejections during selection
Task
[NUTCH-2192] - Get rid of oro
[NUTCH-2613] - Documentation for exchange component
[NUTCH-2698] - Remove sonar build task from build.xml
Sub-task
[NUTCH-1121] - JUnit test for parse-js
[NUTCH-2621] - Generate report of third-party licenses
[NUTCH-2684] - Add README.md file to all indexer writers plugins
[NUTCH-2685] - Add README.md file to all exchange plugins
Nutch 1.15 Release 25/07/2018 (dd/mm/yyyy)
Release Report: https://s.apache.org/nczS
Breaking Changes
- indexer plugins are now configured in a single XML file (conf/index-writers.xml),
see https://wiki.apache.org/nutch/IndexWriters - setting or overwriting configuration
parameters via Nutch properties is not possible anymore.
Bug
[NUTCH-1993] - Nutch does not use backup parsers
[NUTCH-2071] - A parser failure on a single document may fail crawling job if parser.timeout=-1
[NUTCH-2145] - parse/index checker fail to fetch valid percent-encoded URLs
[NUTCH-2161] - Interrupted failed and/or killed tasks fail to clean up temp directories in HDFS
[NUTCH-2273] - Selenium and InteractiveSelenium Do Not Support HTTPS
[NUTCH-2310] - Protocol-Selenium does not support HTTPS protocol
[NUTCH-2321] - Indexing filter checker leaks threads
[NUTCH-2324] - Issue in setting default linkdb path
[NUTCH-2447] - Work-around SSLProtocolException: handshake alert: unrecognized_name
[NUTCH-2454] - REST API fix for usage of hostdb in generator
[NUTCH-2461] - Generate passes the data to when maxCount == 0
[NUTCH-2466] - Sitemap processor to follow redirects
[NUTCH-2467] - Sitemap type field can be null
[NUTCH-2485] - ParserFactory swallows exception
[NUTCH-2486] - Compiler Warning: Unchecked / unsafe operations in MimeTypeIndexingFilter
[NUTCH-2489] - Dependency collision with lucene-analyzers-common in scoring-similarity plugin
[NUTCH-2490] - Sitemap processing: Sitemap index files not working
[NUTCH-2494] - Fetcher: java.lang.IllegalArgumentException: Wrong FS: s3
[NUTCH-2499] - Elastic REST Indexer: Duplicate values
[NUTCH-2505] - nutch does not delete the .locked file, when the generator partition got an exception
[NUTCH-2508] - Misleading documentation about http.proxy.exception.list
[NUTCH-2509] - Inconsistent behavior in SitemapProcessor
[NUTCH-2513] - ant eclipse target fails with "protocol switch unsafe"
[NUTCH-2517] - mergesegs corrupts segment data
[NUTCH-2518] - Must check return value of job.waitForCompletion()
[NUTCH-2520] - Wrong Accept-Charset sent when http.accept.charset is not defined
[NUTCH-2521] - SitemapProcessor to use property sitemap.redir.max
[NUTCH-2523] - UpdateHostDB blocks usage of plugins unintentionally
[NUTCH-2524] - bin/crawl: fix check for HostDb in distributed mode
[NUTCH-2533] - Injector: NullPointerException if seed URL dir contains non-file entries
[NUTCH-2535] - CrawlDbReader -stats: ClassCastException
[NUTCH-2544] - Nutch 1.15 no longer compatible with AWS EMR and S3
[NUTCH-2547] - urlnormalizer-basic fails on special characters in path/query
[NUTCH-2549] - protocol-http does not behave the same as browsers
[NUTCH-2550] - Fetcher fails to follow redirects
[NUTCH-2551] - NullPointerException in generator
[NUTCH-2552] - CrawlDbReader -topN fails
[NUTCH-2553] - Fetcher not to modify URLs to be fetched
[NUTCH-2554] - parserchecker can't fetch some URLs
[NUTCH-2565] - MergeDB incorrectly handles unfetched CrawlDatums
[NUTCH-2568] - Caught exception is immediately rethrown
[NUTCH-2569] - ClassNotFoundException when running in (pseudo-)distributed mode
[NUTCH-2570] - Deduplication job fails to install deduplicated CrawlDb
[NUTCH-2571] - SegmentReader -list fails to read segment
[NUTCH-2572] - HostDb: updatehostdb does not set values
[NUTCH-2574] - Generator: hostCount >= maxCount comparison wrong
[NUTCH-2581] - Caching of redirected robots.txt may overwrite correct robots.txt rules
[NUTCH-2589] - HTML redirections are not followed when using parse-tika
[NUTCH-2590] - SegmentReader -get fails
[NUTCH-2592] - Fetcher to log reason of failed fetches
[NUTCH-2593] - Single mode doesn't work in RabbitMQ indexer
[NUTCH-2597] - NPE in updatehostdb
[NUTCH-2601] - Elasticsearch Rest and Amazon CloudSearch have the same implementation class in indexer-writers.xml
[NUTCH-2607] - ParserChecker should call ScoringFilters.passScoreAfterParsing() on all parses
[NUTCH-2609] - urlnormalizer-basic to normalize path of file: URLs
[NUTCH-2614] - NPE in CrawlDbReader -stats on empty CrawlDb
[NUTCH-2616] - Review routing of deletions by Exchange component
[NUTCH-2618] - protocol-okhttp not to use http.timeout for max duration to fetch document
[NUTCH-2620] - urlfilter-validator incorrectly assumes that top-level domains are not longer than 4 characters
[NUTCH-2624] - protocol-okhttp resource leak
New Feature
[NUTCH-1129] - Any23 Nutch plugin
[NUTCH-1541] - Indexer plugin to write CSV
[NUTCH-2412] - Exchange component for indexing job
[NUTCH-2492] - Add more configuration parameters to crawl script
Improvement
[NUTCH-1106] - Options to skip url's based on length
[NUTCH-1480] - SolrIndexer to write to multiple servers.
[NUTCH-2012] - Merge parsechecker and indexchecker
[NUTCH-2375] - Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
[NUTCH-2390] - No documentation on pluggable indexing
[NUTCH-2411] - Index-metadata to support indexing multiple values for a field
[NUTCH-2416] - Fetcher to log thread ID
[NUTCH-2432] - Protocol httpclient to disable cookies if http.enable.cookie.header is false
[NUTCH-2441] - ARG_SEGMENT usage
[NUTCH-2491] - Integrate sitemap processing and HostDB into crawl script
[NUTCH-2493] - Add configuration parameter for sitemap processing to crawler script
[NUTCH-2497] - Elastic REST Indexer: Allow multiple hosts
[NUTCH-2502] - Any23 Plugin: Add Content-Type filtering
[NUTCH-2503] - Add option to run tests for a single plugin
[NUTCH-2510] - Crawl script modification. HostDb : generate, optional usage and description
[NUTCH-2516] - Hadoop imports use wildcards
[NUTCH-2519] - Log mapreduce job counters in local mode
[NUTCH-2526] - NPE in scoring-opic when indexing document without CrawlDb datum
[NUTCH-2527] - URL filter: provide rules to exclude localhost and private address spaces
[NUTCH-2530] - Rename property db.max.anchor.length > linkdb.max.anchor.length
[NUTCH-2534] - CrawlDbReader -stats: make score quantiles configurable
[NUTCH-2539] - Not correct naming of db.url.filters and db.url.normalizers in nutch-default.xml
[NUTCH-2543] - readdb & readlinkdb to implement AbstractChecker
[NUTCH-2545] - Upgrade to Any23 2.2
[NUTCH-2566] - Fix exception log messages
[NUTCH-2576] - HTTP protocol plugin based on okhttp
[NUTCH-2577] - protocol-selenium can't handle https
[NUTCH-2578] - Avoid lock by MimeUtil in constructor of protocol.Content
[NUTCH-2579] - Fetcher to use parsed URL to call ProtocolFactory.getProtocol(url)
[NUTCH-2580] - Improvements for Rabbitmq support
[NUTCH-2583] - Upgrading Nutch's dependencies
[NUTCH-2584] - Upgrade parse-tika to use Tika 1.18
[NUTCH-2594] - Documentation for indexer plugins
[NUTCH-2595] - Upgrade crawler-commons dependency to 0.10
[NUTCH-2600] - Refactoring indexer-solr
[NUTCH-2611] - Add line-breaks when parsing HTML block-level elements
[NUTCH-2617] - Disable Exchange component by default
[NUTCH-2619] - protocol-okhttp: allow to keep partially fetched docs as truncated
Task
[NUTCH-1219] - Upgrade all jobs to new MapReduce API
[NUTCH-1228] - Change mapred.task.timeout to mapreduce.task.timeout in fetcher
Sub-task
[NUTCH-1223] - Migrate WebGraph to MapReduce API
[NUTCH-1224] - Migrate FreeGenerator to MapReduce API
[NUTCH-1226] - Migrate CrawlDbReader to MapReduce API
[NUTCH-2152] - CommonCrawl dump via Service endpoint
[NUTCH-2555] - URL normalization problem: path not starting with a '/'
[NUTCH-2556] - protocol-http makes invalid HTTP/1.0 requests
[NUTCH-2557] - protocol-http fails to follow redirections when an HTTP response body is invalid
[NUTCH-2558] - protocol-http cannot handle a missing HTTP status line
[NUTCH-2559] - protocol-http cannot handle colons after the HTTP status code
[NUTCH-2560] - protocol-http throws an error when an http header spans over multiple lines
[NUTCH-2561] - protocol-http can be made to read arbitrarily large HTTP responses
[NUTCH-2562] - protocol-http fails to read large chunked HTTP responses
[NUTCH-2563] - HTTP header spellchecking issues
[NUTCH-2575] - protocol-http does not respect the maximum content-size for chunked responses
[NUTCH-2622] - Unbundle LGPL-licensed jars from binary release
Nutch 1.14 Release 18/12/2017 (dd/mm/yyyy)
- the bin/crawl script now expects the path to the seed to be preceded by -s (NUTCH-2046)
Bug
[NUTCH-2071] - A parser failure on a single document may fail crawling job
[NUTCH-2235] - Classpath discrepancy with protocol-selenium in deploy mode
[NUTCH-2269] - Clean not working after crawl
[NUTCH-2295] - Nutch master docker container broken
[NUTCH-2297] - CrawlDbReader -stats wrong values for earliest fetch time and shortest interval
[NUTCH-2316] - Library conflict with Parser-Tika Plugin and Lib Folder
[NUTCH-2317] - Plugin jars don't get added to classpath while running in local
[NUTCH-2322] - URL not available for Jexl operations
[NUTCH-2354] - Upgrade Hadoop dependencies to 2.7.4
[NUTCH-2365] - HTTP Redirects to SubDomains don't get crawled if db.ignore.external.links.mode == byDomain
[NUTCH-2371] - Injector to support noFilter and noNormalize
[NUTCH-2372] - Javadocs build failing.
[NUTCH-2386] - BasicURLNormalizer does not encode curly braces
[NUTCH-2391] - Spurious Duplications for MD5
[NUTCH-2394] - Possible bugs in the source code
[NUTCH-2398] - Fetcher saving redirected robots.txt under redirect target URL
[NUTCH-2399] - indexer-elastic does not index multi-value fields (only the first value is indexed)
[NUTCH-2401] - headings plugin does not trim values
[NUTCH-2403] - Nutch Selenium: Wrong documentation about PhantomJS
[NUTCH-2413] - Parsing fetcher to respect property "parse.filter.urls"
[NUTCH-2420] - Bug in variable generate.max.count and fetcher.server.delay
[NUTCH-2436] - Remove empty comment, and redundant semicolon from CommandRunner
[NUTCH-2442] - Injector to stop if job fails to avoid loss of CrawlDb
[NUTCH-2444] - HostDB CSV dumper to emit field header by default
[NUTCH-2446] - URLFiltersCheck fix
[NUTCH-2448] - Allow Sending an empty http.agent.version
[NUTCH-2451] - protocol-ftp to resolve relative URL when following redirects
[NUTCH-2452] - Problem retrieving encoded URLs via FTP?
[NUTCH-2456] - Allow to index pages/URLs not contained in CrawlDb
[NUTCH-2458] - TikaParser doesn't work with tika-config.xml set
[NUTCH-2464] - Plugin headings: Headers That Contain HTML Elements Are Not Parsed
[NUTCH-2465] - Broken Eclipse project. Classpaths and interactiveselenium should be fixed.
[NUTCH-2472] - Sitemap processor does not honour db.ignore.external.links
[NUTCH-2473] - Elasticsearch REST Indexer broken due to wrong dependency
[NUTCH-2474] - CrawlDbReader -stats fails with ClassCastException
[NUTCH-2478] - // is not a valid base URL
[NUTCH-2483] - Remove/replace indirect dependencies to org.json
Improvement
[NUTCH-1763] - Improving comments on the Injector Class
[NUTCH-2034] - CrawlDB filtered documents counter.
[NUTCH-2035] - Regex filter using case sensitive rules.
[NUTCH-2046] - The crawl script should be able to skip an initial injection.
[NUTCH-2135] - Ant Eclipse build does not include protocol-interactiveselenium
[NUTCH-2193] - Upgrade feed parser plugin to use rome 1.5
[NUTCH-2216] - db.ignore.*.links to optionally follow internal redirects
[NUTCH-2281] - Support non-default FileSystem
[NUTCH-2296] - Elasticsearch Indexing Over Rest
[NUTCH-2320] - URLFilterChecker to run as TCP Telnet service
[NUTCH-2335] - Injector not to filter and normalize existing URLs in CrawlDb
[NUTCH-2362] - Upgrade MaxMind GeoIP version in index-geoip
[NUTCH-2368] - Variable generate.max.count and fetcher.server.delay
[NUTCH-2370] - FileDumper: save JSON mapping file -> URL
[NUTCH-2376] - Improve configurability of HTTP Accept* header fields
[NUTCH-2378] - ChildFirst plugin classloader
[NUTCH-2380] - indexer-elastic version upgrade to 5.3.0
[NUTCH-2397] - Parser to add paragraph line breaks
[NUTCH-2400] - Solr 6.6.0 compatibility
[NUTCH-2406] - Sum up constants, make minor changes
[NUTCH-2408] - CrawlDb: allow update from unparsed segments
[NUTCH-2409] - Injector: complete command-line help and counters
[NUTCH-2414] - Allow LanguageIndexingFilter to actually filter documents by language.
[NUTCH-2430] - Complete plugin build configuration
[NUTCH-2431] - URLFilterchecker to implement Tool-interface
[NUTCH-2439] - Upgrade to Apache Tika 1.17
[NUTCH-2443] - Extract links from the video tag with the parse-html plugin
[NUTCH-2445] - Fetcher following outlinks to keep track of already fetched items
[NUTCH-2463] - Enable sampling CrawlDB
[NUTCH-2468] - should filter out invalid URLs by default
[NUTCH-2470] - CrawlDbReader -stats to show quantiles of score
[NUTCH-2477] - Refactor *Checker classes to use base class for common code
[NUTCH-2480] - Upgrade crawler-commons dependency to 0.9
New Feature
[NUTCH-1465] - Support sitemaps in Nutch
[NUTCH-1932] - Automatically remove orphaned pages
[NUTCH-2333] - Indexer for RabbitMQ
[NUTCH-2338] - URLNormalizerChecker to run as TCP Telnet service
[NUTCH-2415] - Create a JEXL based IndexingFilter
[NUTCH-2433] - Html Parser: keep htmltag where the outlinks are found
[NUTCH-2435] - New configuration allowing to choose whether to store 'parse_text' directory or not.
[NUTCH-2484] - Extend indexer-elastic-rest to support languages
Task
[NUTCH-2181] - Add Webpage for 3rd Party Connectors/Libraries to Apache Nutch
Nutch 1.13 Release 28/03/2017 (dd/mm/yyyy)
Release Report: https://s.apache.org/wq3x
Sub-task
[NUTCH-2246] - Refactor /seed endpoint for backward compatibility
Bug
[NUTCH-1553] - Property 'indexer.delete.robots.noindex' not working when using parser-html.
[NUTCH-2242] - lastModified not always set
[NUTCH-2291] - Fix mrunit dependencies
[NUTCH-2337] - urlnormalizer-basic to strip empty port
[NUTCH-2345] - FetchItemQueue logs are logged with wrong class name
[NUTCH-2349] - urlnormalizer-basic NPE for ill-formed URL "http:/"
[NUTCH-2357] - Index metadata throw Exception because writable object cannot be cast to Text
[NUTCH-2359] - Parsefilter-regex raises IndexOutOfBoundsException when rules are ill-formed
[NUTCH-2364] - http.agent.rotate: IllegalArgumentException / last element of agent names ignored
[NUTCH-2366] - Deprecated Job constructor in hostdb/ReadHostDb.java
Improvement
[NUTCH-1308] - Add main() to ZipParser
[NUTCH-2164] - Inconsistent 'Modified Time' in crawl db
[NUTCH-2234] - Upgrade to elasticsearch 2.3.3
[NUTCH-2236] - Upgrade to Hadoop 2.7.2
[NUTCH-2262] - Utilize parameterized logging notation across Fetcher
[NUTCH-2272] - Index checker server to optionally keep client connection open
[NUTCH-2286] - CrawlDbReader -stats to show fetch time and interval
[NUTCH-2287] - Indexer-elastic plugin should use Elasticsearch BulkProcessor and BackoffPolicy
[NUTCH-2299] - Remove obsolete properties protocol.plugin.check.*
[NUTCH-2300] - Fetcher to optionally save robots.txt
[NUTCH-2327] - Seeds injected in REST workflow must be ingested into HDFS
[NUTCH-2329] - Update Slf4j logging for Java 8 and upgrade miredot plugin version
[NUTCH-2336] - SegmentReader to implement Tool
[NUTCH-2352] - Log with Generic Class Name at Nutch 1.x
[NUTCH-2355] - Protocol plugins to set cookie if Cookie metadata field is present
[NUTCH-2367] - Get single record from HostDB
New Feature
[NUTCH-2132] - Publisher/Subscriber model for Nutch to emit events
Task
[NUTCH-2171] - Upgrade Nutch Trunk to Java 1.8
Nutch 1.12 Release 28/05/2016 (dd/mm/yyyy)
Release Report: https://s.apache.org/nutch1.12
Comments
Fellow committers, Nutch 1.12 contains a breaking change NUTCH-2220. Please use the note below and
in the release announcement and keep it on top in this CHANGES.txt for the Nutch 1.12 release.
* replace your old conf/nutch-default.xml with the conf/nutch-default.xml from Nutch 1.12 release
* if you use LinkDB (e.g. invertlinks) and modified parameters db.max.inlinks and/or db.max.anchor.length
and/or db.ignore.internal.links, rename those parameters to linkdb.max.inlinks and
linkdb.max.anchor.length and linkdb.ignore.internal.links
* db.ignore.internal.links and db.ignore.external.links now operate on the CrawlDB only
* linkdb.ignore.internal.links and linkdb.ignore.external.links now operate on the LinkDB only
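The property renames listed above could be applied to a site configuration with a short script. The following is a minimal sketch, assuming the overrides live in a Hadoop-style configuration file with `<property><name>…</name><value>…</value></property>` entries; the function name `rename_linkdb_properties` is hypothetical and not part of Nutch.

```python
import xml.etree.ElementTree as ET

# Renames taken from the Nutch 1.12 release note above.
RENAMES = {
    "db.max.inlinks": "linkdb.max.inlinks",
    "db.max.anchor.length": "linkdb.max.anchor.length",
    "db.ignore.internal.links": "linkdb.ignore.internal.links",
}


def rename_linkdb_properties(xml_text):
    """Return (new_xml_text, list of old property names that were renamed)."""
    root = ET.fromstring(xml_text)
    renamed = []
    for name in root.iter("name"):
        old = (name.text or "").strip()
        if old in RENAMES:
            name.text = RENAMES[old]
            renamed.append(old)
    return ET.tostring(root, encoding="unicode"), renamed


if __name__ == "__main__":
    sample = (
        "<configuration>"
        "<property><name>db.max.inlinks</name><value>5000</value></property>"
        "</configuration>"
    )
    new_xml, renamed = rename_linkdb_properties(sample)
    print(renamed)
    print(new_xml)
```

Note that ElementTree drops comments and the XML declaration when re-serializing, so the output is best reviewed by hand. Since db.ignore.internal.links continues to exist for the CrawlDB, whether to keep an entry under the old name as well is a separate decision.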
Sub-task
[NUTCH-2250] - CommonCrawlDumper : Invalid format + skipped parts
Bug
[NUTCH-2042] - parse-html increase chunk size used to detect charset
[NUTCH-2180] - FileDumper dumps data, but breaks midway on corrupt segments
[NUTCH-2189] - Domain filter must deactivate if no rules are present
[NUTCH-2203] - Suffix URL filter can't handle trailing/leading whitespaces
[NUTCH-2206] - Provide example scoring.similarity.stopword.file
[NUTCH-2213] - CommonCrawlDataDumper saves gzipped body in extracted form
[NUTCH-2223] - Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimetype detection
[NUTCH-2224] - Average bytes/second calculated incorrectly in fetcher
[NUTCH-2225] - Parsed time calculated incorrectly
[NUTCH-2228] - Plugin index-replace unit test broken on Java 8
[NUTCH-2232] - DeduplicationJob should decode URL's before length is compared
[NUTCH-2241] - Unstable Selenium plugin in Nutch. Fixed bugs and enhanced configuration
[NUTCH-2256] - Inconsistent log level practice
Improvement
[NUTCH-1233] - Rely on Tika for outlink extraction
[NUTCH-1712] - Use MultipleInputs in Injector to make it a single mapreduce job
[NUTCH-2172] - index-more: document format of contenttype-mapping.txt
[NUTCH-2178] - DeduplicationJob to optionally group on host or domain
[NUTCH-2182] - Make reverseUrlDirs file dumper option hash the URL for consistency
[NUTCH-2183] - Improvement to SegmentChecker for skipping non-segments present in segments directory
[NUTCH-2187] - Change FileDumper SHAs to all uppercase
[NUTCH-2195] - IndexingFilterChecker to optionally follow N redirects
[NUTCH-2196] - IndexingFilterChecker to optionally normalize
[NUTCH-2197] - Add solr5 solrcloud indexer support
[NUTCH-2204] - Remove junit lib from runtime
[NUTCH-2218] - Switch CrawlCompletion arg parsing to Commons CLI
[NUTCH-2221] - Introduce db.ignore.internal.links to FetcherThread
[NUTCH-2229] - Allow Jexl expressions on CrawlDatum's fixed attributes
[NUTCH-2231] - Jexl support in generator job
[NUTCH-2252] - Allow phantomjs as a browser for selenium options
[NUTCH-2263] - Support for mingram and maxgram at Unigram Cosine Similarity Model
New Feature
[NUTCH-961] - Expose Tika's boilerpipe support
[NUTCH-1325] - HostDB for Nutch
[NUTCH-2144] - Plugin to override db.ignore.external to exempt interesting external domain URLs
[NUTCH-2190] - Protocol normalizer
[NUTCH-2191] - Add protocol-htmlunit
[NUTCH-2194] - Run IndexingFilterChecker as simple Telnet server
[NUTCH-2219] - Criteria order to be configurable in DeduplicationJob
[NUTCH-2227] - RegexParseFilter
[NUTCH-2245] - Developed the NGram Model on the existing Unigram Cosine Similarity Model
Task
[NUTCH-2201] - Remove loops program from webgraph package
[NUTCH-2211] - Filter and normalizer checkers missing in bin/nutch
[NUTCH-2220] - Rename db.* options used only by the linkdb to linkdb.*
Nutch 1.11 Release 03/12/2015 (dd/mm/yyyy)
Release Report: http://s.apache.org/nutch11
* NUTCH-2176 Clean up of log4j.properties (markus)
* NUTCH-2107 plugin.xml to validate against plugin.dtd (snagel)
* NUTCH-2177 Generator produces only one partition even in distributed mode (jnioche, snagel)
* NUTCH-2158 Upgrade to Tika 1.11 (jnioche, snagel)
* NUTCH-2175 Typos in property descriptions in nutch-default.xml (Roannel Fernández Hernández via snagel)
* NUTCH-2069 Ignore external links based on domain (jnioche)
* NUTCH-2173 String.join in FileDumper breaks the build (joyce)
* NUTCH-2166 Add reverse URL format to dump tool (joyce)
* NUTCH-2157 Addressing Miredot REST API Warnings (Sujen Shah)
* NUTCH-2165 FileDumper Util hard codes part-# folder name (joyce)
* NUTCH-2167 Backport TableUtil from 2.x for URL reversing (joyce)
* NUTCH-2160 Upgrade Selenium Java to 2.48.2 (lewismc, kwhitehall)
* NUTCH-2120 Remove MapWritable from trunk codebase (lewismc)
* NUTCH-1911 Improve DomainStatistics tool command line parsing (joyce)
* NUTCH-2064 URLNormalizer basic to encode reserved chars and decode non-reserved chars (markus, snagel)
* NUTCH-2159 Ensure that all WebApp files are copied into generated artifacts for 1.X Webapp (lewismc)
* NUTCH-2154 Nutch REST API (DB) suffering NullPointerException (Aron Ahmadia, Sujen Shah via mattmann)
* NUTCH-2150 Add protocolstats utility (Michael Joyce via mattmann)
* NUTCH-2146 hashCode on the Outlink class (jorgelbg via mattmann)
* NUTCH-2155 Create a "crawl completeness" utility (Michael Joyce via mattmann)
* NUTCH-1988 Make nested output directory dump optional... again (Michael Joyce via lewismc)
* NUTCH-1800 Documentation for Nutch 1.X and 2.X REST APIs (lewismc)
* NUTCH-2149 REST endpoint to read Nutch sequence files (Sujen Shah)
* NUTCH-2139 Basic plugin to index inlinks and outlinks (jorgelbg)
* NUTCH-2128 Review and update mapred --> mapreduce config params in crawl script (lewismc)
* NUTCH-2141 Change the InteractiveSelenium plugin handler Interface to return page content
(Balaji Gurumurthy via mattmann)
* NUTCH-2129 Add protocol status tracking to crawl datum (Michael Joyce via mattmann)
* NUTCH-2142 Nutch File Dump - FileNotFoundException (Invalid Argument) Error (Karanjeet Singh via mattmann)
* NUTCH-2136 Implement a different version of Naive Bayes Parse Filter (Asitang Mishra)
* NUTCH-2109 Create a brute force click-all-ajax-links utility function for selenium interactive plugin (Asitang Mishra)
* NUTCH-2108 Add a function to the selenium interactive plugin interface to do multiple manipulation of driver and then return the data (Asitang Mishra)
* NUTCH-2124 Fetcher following same redirect again and again (Yogendra Kumar Soni via snagel)
* NUTCH-2123 Seed List REST API returns Text but headers indicate/require JSON
(Aron Ahmadia, Sujen Shah via mattmann)
* NUTCH-2086 Nutch 1.X Webui (Sujen Shah, mattmann via lewismc)
* NUTCH-2121 Update javadoc link for Hadoop 2.4.0 in default.properties (Sujen Shah)
* NUTCH-2119 Eclipse shows build path errors on building Nutch (Sujen Shah)
* NUTCH-2117 NutchServer CLI Option for CMD_PORT is incorrect and should be CMD_HOST (zhangmianhongni via lewismc)
* NUTCH-2115 - Add total counts to mimetype stats (Jimmy Joyce via lewismc)
* NUTCH-2111 Delete temporary files location for selenium tmp files after driver quits (Kim Whitehall via lewismc)
* NUTCH-2095 WARC exporter for the CommonCrawlDataDumper (jorgelbg)
* NUTCH-2102 WARC Exporter (jnioche)
* NUTCH-2106 Runtime to contain Selenium and dependencies only once (snagel)
* NUTCH-2104 Add documentation to the protocol-selenium plugin Readme file
re: selenium grid implementation (Kim Whitehall via mattmann)
* NUTCH-2099 Refactoring the REST endpoints for integration with
webui (Sujen Shah via mattmann)
* NUTCH-2098 Add null SeedUrl constructor (Aron Ahmadia via mattmann)
* NUTCH-2093 Indexing filters to use current signatures (markus)
* NUTCH-2092: Unit Test for NutchServer (Sujen Shah via mattmann)
* NUTCH-2096 Explicitly indicate browser binary to use when selecting
selenium remote option in config (Kim Whitehall via mattmann)
* NUTCH-2090 Refactor Seed Resource in REST API (Sujen Shah
via mattmann)
* NUTCH-2088 Add URL Processing Check to Interactive Selenium
Handlers (Michael Joyce via mattmann)
* NUTCH-2077 Upgrade to Tika 1.10 (Michael Joyce via lewismc)
* NUTCH-1517 CloudSearch indexer (jnioche)
* NUTCH-2085 Upgrade Guava (markus)
* NUTCH-2084 SegmentMerger to report missing input dirs (markus)
* NUTCH-2083 Implement functionality to shadow nutch-selenium-grid-plugin from Mo Omer (lewismc)
* NUTCH-2049 Upgrade to Hadoop 2.4 (lewismc)
* NUTCH-1486 Upgrade to Solr 4.10.2 (lewismc, markus)
* NUTCH-2048 parse-tika: fix dependencies in plugin.xml (Michael Joyce via snagel)
* NUTCH-2066 Parameterize Generate REST endpoint (Sujen Shah via mattmann)
* NUTCH-2072 Deflate encoding support is broken when http.content.limit is set to -1 (Tanguy Moal via mattmann)
* NUTCH-2062 Add Plugin for interacting with Selenium WebDriver (Michael Joyce, mattmann)
* NUTCH-1785 Ability to index raw content (markus, lewismc)
* NUTCH-2063 Add -mimeStats flag to FileDumper tool (Mike Joyce via lewismc)
* NUTCH-2021 Use protocol-selenium to Capture Screenshots of the Page as it is Fetched (lewismc)
* NUTCH-2058 Indexer plugin that allows RegEx replacements on the NutchDocument
field values (Peter Ciuffetti via mattmann)
* NUTCH-2059 protocol-httpclient, protocol-http unit test errors on Jenkins (Peter Ciuffetti via mattmann)
* NUTCH-1980 Jexl expressions for CrawlDbReader (markus)
* NUTCH-1692 SegmentReader was broken in distributed mode (markus, tejasp)
* NUTCH-1684 ParseMeta to be added before fetch schedulers are run (markus)
* NUTCH-2038 fix for NUTCH-2038: Naive Bayes classifier based html Parse filter (for filtering outlinks)
(Asitang Mishra, snagel via mattmann)
* NUTCH-2041 indexer fails if linkdb is missing (snagel)
* NUTCH-2016 Remove unused class OldFetcher (snagel)
* NUTCH-2000 Link inversion fails with .locked already exists (jnioche, snagel)
* NUTCH-2036 Adding some continuous crawl goodies to the crawl script (jorge, snagel)
* NUTCH-2039 Relevance based scoring filter (Sujen Shah, lewismc via mattmann)
* NUTCH-2037 Job endpoint to support Indexing from the REST API (Sujen Shah via mattmann)
* NUTCH-2017 Remove debug log from MimeUtil (snagel)
* NUTCH-2027 seed list REST endpoint for Nutch 1.10 (Asitang Mishra via mattmann)
* NUTCH-2031 Create Admin End point for Nutch 1.x REST service (Sujen Shah via mattmann)
* NUTCH-2015 Make FetchNodeDb optional (off by default) if NutchServer is not used (Sujen Shah via mattmann)
* NUTCH-208 http: proxy exception list: (Matthias Günter, siren, markus, lewismc)
* NUTCH-2007 add test libs to classpath of bin/nutch junit (snagel)
* NUTCH-1995 Add support for wildcard to http.robot.rules.whitelist (totaro)
* NUTCH-2013 Fetcher: missing logs "fetching ..." on stdout (snagel)
* NUTCH-2014 Fetcher hang-up on completion (snagel)
* NUTCH-2011 Endpoint to support realtime JSON output from the fetcher (Sujen Shah via mattmann)
* NUTCH-2006 IndexingFiltersChecker to take custom metadata as input (jnioche)
* NUTCH-2008 IndexerMapReduce to use single instance of NutchIndexAction for deletions (snagel)
* NUTCH-1998 Add support for user-defined file extension to CommonCrawlDataDumper (totaro via mattmann)
* NUTCH-1873 Solr IndexWriter/Job to report number of docs indexed. (snagel via lewismc)
* NUTCH-1934 Refactor Fetcher in trunk (lewismc)
* NUTCH-2004 ParseChecker does not handle redirects (mjoyce via lewismc)
Nutch 1.10 Release - 29/04/2015 (dd/mm/yyyy)
Release Report: http://s.apache.org/nutch10
* NUTCH-1969 URL Normalizer properly handling slashes (markus via mattmann)
* NUTCH-2001 Sub Collection Field Name incorrect in nutch-default.xml
(Jeff Cocking via mattmann)
* NUTCH-1997 Add CBOR "magic header" to CommonCrawlDataDumper
output (Giuseppe Totaro, Luke Sh via mattmann)
* NUTCH-1991 Tika mime detection not using Nutch supplied tika-mimetypes.xml for content based
detection (Iain Lopata, snagel via mattmann)
* NUTCH-1994 Upgrade to Apache Tika 1.8 (lewismc)
* NUTCH-1996 Make protocol-selenium README part of plugin (lewismc)
* NUTCH-1990 Use URI.normalize() in BasicURLNormalizer (snagel, jnioche)
* NUTCH-1973 Job Administration end point for the REST service (Sujen Shah via mattmann)
* NUTCH-1697 SegmentMerger to implement Tool (markus, snagel)
* NUTCH-1987 - Make bin/crawl indexer agnostic (Michael Joyce, snagel via mattmann)
* NUTCH-1989 Handling invalid URLs in CommonCrawlDataDumper (Giuseppe Totaro via mattmann)
* NUTCH-1988 Make nested output directory dump optional (Michael Joyce via mattmann)
* NUTCH-1927 Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing (mattmann, snagel)
* NUTCH-1986 Clarify Elastic Search Indexer Plugin Settings (Michael Joyce via mattmann)
* NUTCH-1906 Typo in CrawlDbReader command line help (Michael Joyce via mattmann)
* NUTCH-1911 Improve DomainStatistics tool command line parsing (Michael Joyce via mattmann)
* NUTCH-1854 bin/crawl fails with a parsing fetcher (Asitang Mishra via snagel)
* NUTCH-1981 Upgrade to icu4j 55.1 (Marko Asplund via snagel)
* NUTCH-1960 JUnit test for dump method of CommonCrawlDataDumper (Giuseppe Totaro via mattmann)
* NUTCH-1983 CommonCrawlDumper and FileDumper don't dump correct JSON (mattmann)
* NUTCH-1972 Dockerfile for Nutch 1.x (Michael Joyce via mattmann)
* NUTCH-1771 Indexer fails if a segment is corrupted or incomplete (Diaa, Chong Li via snagel)
* NUTCH-1975 New configuration for CommonCrawlDataDumper tool (Giuseppe Totaro via mattmann)
* NUTCH-1979 CrawlDbReader to implement Tool (markus)
* NUTCH-1970 Pretty print JSON output in config resource (Tyler Palsulich, mattmann)
* NUTCH-1976 Allow Users to Set Hostname for Server (Tyler Palsulich via mattmann)
* NUTCH-1941 Optional rolling http.agent.name's (Asitang Mishra, lewismc via snagel)
* NUTCH-1959 Improving CommonCrawlFormat implementations (Giuseppe Totaro via mattmann)
* NUTCH-1974 keyPrefix option for CommonCrawlDataDumper tool (Giuseppe Totaro via mattmann)
* NUTCH-1968 File Name too long issue of DumpFileUtil.java file (Xin Zhang, Renxia Wang via mattmann)
* NUTCH-1966 Configuration endpoint for 1x REST API (Sujen Shah via mattmann)
* NUTCH-1967 Possible SIooBE in MimeAdaptiveFetchSchedule (markus)
* NUTCH-1957 FileDumper output file name collisions (Renxia Wang via mattmann)
* NUTCH-1955 ByteWritable missing in NutchWritable (markus)
* NUTCH-1956 Members to be public in URLCrawlDatum (markus)
* NUTCH-1954 FilenameTooLong error appears in CommonCrawlDumper (mattmann)
* NUTCH-1949 Dump out the Nutch data into the Common Crawl format (Giuseppe Totaro via lewismc)
* NUTCH-1950 File name too long (Jiaheng Zhang, Chong Li via mattmann)
* NUTCH-1921 Optionally disable HTTP if-modified-since header (markus)
* NUTCH-1933 nutch-selenium plugin (Mo Omer, Mohammad Al-Moshin, lewismc)
* NUTCH-827 HTTP POST Authentication (Jasper van Veghel, yuanyun.cn, snagel, lewismc)
* NUTCH-1724 LinkDBReader to support regex output filtering (markus)
* NUTCH-1939 Fetcher fails to follow redirects (Leo Ye via snagel)
* NUTCH-1913 LinkDB to implement db.ignore.external.links (markus, snagel)
* NUTCH-1925 Upgrade to Apache Tika 1.7 (Tyler Palsulich via markus)
* NUTCH-1323 AjaxNormalizer (markus)
* NUTCH-1918 TikaParser specifies a default namespace when generating DOM (jnioche)
* NUTCH-1889 Store all values from Tika metadata in Nutch metadata (jnioche)
* NUTCH-865 Format source code in unique style (lewismc)
* NUTCH-1893 Parse-tika fails to parse feed files (Mengying Wang via snagel)
* NUTCH-1920 Upgrade Nutch to use Java 1.7 (lewismc)
* NUTCH-1919 Getting timeout when server returns Content-Length: 0 (jnioche)
* NUTCH-1912 Dump tool -mimetype parameter needs to be optional to prevent NPE (Tyler Palsulich via lewismc)
* NUTCH-1881 ant target resolve-default to keep test libs (snagel)
* NUTCH-1660 Index filter for Page's latitude and longitude (Yasin Kılınç, lewismc)
* NUTCH-1140 index-more plugin, resetTitle creates multiple values in title field (Joe Liedtke, kaveh minooie via snagel)
* NUTCH-1904 Schema for Solr4 doesn't include _version_ field (mattmann)
* NUTCH-1897 Easier debugging of plugin XML errors (markus)
* NUTCH-1823 Upgrade to elasticsearch 1.4.1 (Phu Kieu, markus via lewismc)
* NUTCH-1592 TikaParser can uppercase the element names while generating the DOM (jnioche)
* NUTCH-1877 Suffix URL filter to ignore query string by default (markus via snagel)
* NUTCH-1890 Major Typo in Documentation for Integrating Nutch and Solr (Boadu Akoto Charles Jnr, mattmann)
* NUTCH-1887 Specify HTMLMapper to use in TikaParser (jnioche)
* NUTCH-1884 NullPointerException in parsechecker and indexchecker with symlinks in file URL (Mengying Wang, snagel)
* NUTCH-1825 protocol-http may hang for certain web pages (Phu Kieu via snagel)
* NUTCH-1483 Can't crawl filesystem with protocol-file plugin (Rogério Pereira Araújo, Mengying Wang, snagel)
* NUTCH-1885 Protocol-file should treat symbolic links as redirects (Mengying Wang, snagel)
* NUTCH-1880 URLUtil should not add additional slashes for file URLs (snagel)
* NUTCH-1879 Regex URL normalizer should remove multiple slashes after file: protocol (snagel)
* NUTCH-1883 bin/crawl: use function to run bin/nutch and check exit value (snagel)
* NUTCH-1865 Enable use of SNAPSHOT's with Nutch Ivy dependency management (lewismc)
* NUTCH-1882 ant eclipse target to add output path to src/test (snagel)
* NUTCH-1876 Upgrade to Crawler Commons 0.5 (jnioche)
* NUTCH-1874 FileDumper comment typos (Arthur Cinader via lewismc)
* NUTCH-1164 Write JUnit tests for protocol-http (nimafl via snagel)
* NUTCH-1868 Document and improve CLI for FileDumper tool (lewismc)
* NUTCH-1869 Add a flag to -mimeType flag to FileDumper (lewismc)
* NUTCH-1867 CrawlDbReader: use setFloat to pass min score (lewismc, snagel)
* NUTCH-1826, NUTCH-1864 indexchecker fails if solr.server.url not configured (lewismc, snagel)
* NUTCH-1866 ant eclipse target should not delete runtime (nimafl via lewismc)
* NUTCH-1857 readdb -dump -format csv should use comma (lewismc)
* NUTCH-1853 Add commented out WebGraph executions to ./bin/crawl (lewismc)
* NUTCH-1844 testresources/testcrawl not referenced anywhere in code (mattmann)
* NUTCH-1839 Improve WebGraph CLI parsing (lewismc)
* NUTCH-1526 Create SegmentContentDumperTool for easily extracting out file contents from SegmentDirs (mattmann, lewismc, Julien Le Dem)
* NUTCH-1840 the describe function in SolrIndexWriter is not correct (kaveh minooie via jnioche)
* NUTCH-1837 Upgrade to Tika 1.6 (jnioche)
* NUTCH-1829 Generator : unable to distinguish real errors (Mathieu Bouchard via jnioche)
* NUTCH-1835 Nutch's Solr schema doesn't work with Solr 4.9 because of the RealTimeGet handler (mattmann)
* NUTCH-1833 Include version number within nutch binary usage statement (Rishi Verma via mattmann)
* NUTCH-1832 Make Nutch work without an indexer (mattmann)
* NUTCH-1828 bin/crawl : incorrect handling of nutch errors (Mathieu Bouchard via jnioche)
* NUTCH-1775 IndexingFilter: document origin of passed CrawlDatum (snagel)
* NUTCH-1693 TextMD5Signature computed on textual content (Tien Nguyen Manh, markus via snagel)
* NUTCH-1409 remove deprecated properties db.{default,max}.fetch.interval, generate.max.per.host.by.ip (Matthias Agethle via snagel)
Nutch 1.9 Release Change Log - 12/08/2014 (dd/mm/yyyy)
Release Report - http://s.apache.org/1.9-release
* NUTCH-1561 improve usability of parse-metatags and index-metadata (snagel)
* NUTCH-1708 use same id when indexing and deleting redirects (snagel)
* NUTCH-1818 Add deps-test-compile task for building plugins (jnioche)
* NUTCH-1817 Remove pom.xml from source (jnioche)
* NUTCH-926 Redirections from META tag don't get filtered (snagel)
* NUTCH-1422 Bypass signature comparison when a document is redirected (snagel)
* NUTCH-1502 Test for CrawlDatum state transitions (snagel)
* NUTCH-1804 Move JUnit dependency to test scope (jnioche)
* NUTCH-1811 bin/nutch junit to use junit 4 test runner (snagel)
* NUTCH-1799 ANT Eclipse task discovers all plugin jars automatically (jnioche)
* NUTCH-578 URL fetched with 403 is generated over and over again (snagel)
* NUTCH-1776 Log incorrect plugin.folder file path (Diaa via snagel)
* NUTCH-1566 bin/nutch to allow whitespace in paths (tejasp, snagel)
* NUTCH-1605 MIME type detector recognizes xlsx as zip file (snagel)
* NUTCH-1802 Move TestbedProxy to test environment (jnioche)
* NUTCH-1803 Put test dependencies in a separate lib dir (jnioche)
* NUTCH-385 Improve description of thread related configuration for Fetcher (jnioche,lufeng)
* NUTCH-1633 slf4j is provided by hadoop and should not be included in the job file (kaveh minooie via jnioche)
* NUTCH-1787 update and complete API doc overview page (snagel)
* NUTCH-1767 remove special treatment of "params" in relative links (snagel)
* NUTCH-1718 redefine http.robots.agent as "additional agent names" (snagel, Tejas Patil, Daniel Kugel)
* NUTCH-1794 IndexingFilterChecker to optionally dumpText (markus)
* NUTCH-1590 [SECURITY] Frame injection vulnerability in published Javadoc (jnioche)
* NUTCH-1793 HttpRobotRulesParser not configured properly (jnioche)
* NUTCH-1647 protocol-http throws 'unzipBestEffort returned null' for redirected pages (jnioche)
* NUTCH-1736 Can't fetch page if http response header contains Transfer-Encoding:chunked (ysc via jnioche)
* NUTCH-1782 NodeWalker to return current node (markus)
* NUTCH-1758 IndexChecker to send document to IndexWriters (jnioche)
* NUTCH-1786 CrawlDb should follow db.url.normalizers and db.url.filters (Diaa via markus)
* NUTCH-1757 ParserChecker to take custom metadata as input (jnioche)
* NUTCH-1676 Add rudimentary SSL support to protocol-http (jnioche, markus)
* NUTCH-1772 Injector does not need merging if no pre-existing crawldb (jnioche)
* NUTCH-1752 Cache robots.txt rules per protocol:host:port (snagel)
* NUTCH-1613 Timeouts in protocol-httpclient when crawling same host with >2 threads (brian44 via jnioche)
* NUTCH-1766 Generator to unlock crawldb and remove tempdir if generate job fails (Diaa via jnioche)
* NUTCH-207 Bandwidth target for fetcher rather than a thread count (jnioche)
* NUTCH-1182 fetcher to log hung threads (snagel)
* NUTCH-1759 Upgrade to Crawler Commons 0.4 (jnioche)
* NUTCH-1764 readdb to show command-line help if no action (-stats, -dump, etc.) given (Diaa via snagel)
* NUTCH-1700 Remove deprecated code from creativecommons plugin (lewismc)
* NUTCH-1761 Crawl script fails to find job file if not started from inside bin dir (David Hosking, jnioche)
* NUTCH-1603 ZIP parser complains about truncated PDF file (snagel)
* NUTCH-1720 Duplicate lines in HttpBase.java (Walter Tietze via jnioche)
* NUTCH-1750 Improvement of Fetcher's reportStatus (jnioche)
* NUTCH-1747 Use AtomicInteger as semaphore in Fetcher (jnioche)
* NUTCH-1735 code dedup fetcher queue redirects (snagel)
* NUTCH-1745 Upgrade to ElasticSearch 1.1.0 (jnioche)