forked from crawler-commons/crawler-commons
-
Notifications
You must be signed in to change notification settings - Fork 1
/
CHANGES.txt
140 lines (129 loc) · 8.28 KB
/
CHANGES.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
Crawler-Commons Change Log
Current Development 0.11-SNAPSHOT (yyyy-mm-dd)
- [Sitemaps] Add support for sitemap extensions (tuxnco, sebastian-nagel) #35, #36, #149, #162
- [Sitemaps] Use the Java 8 date and time API (java.time.*) to parse dates in sitemaps (sebastian-nagel) #217
- [Robots] Fix for handling URLs with query parameters but no path (kkrugler) #215
Release 0.10 (2018-06-05)
- Add JAX-B dependencies to POM (jnioche) #207
- [Sitemaps] Add method to parse and iterate sitemap SiteMapParser#walkSiteMap(URL,Consumer) (Luc Boruta) #190
- [Sitemaps] Sitemap file location to ignore query part of URL (sebastian-nagel) #202
- [RSS sitemaps] Link extraction from RSS feeds fails on XML entities (sebastian-nagel) #204
- [RSS sitemaps] Resolve relative links in RSS feeds (sebastian-nagel) #203
- [RSS sitemaps] Extract links from <guid> elements (sebastian-nagel) #201
- [Sitemaps] Limit on "bad url" log messages (sebastian-nagel) #145
- EffectiveTldFinder to parse Internationalized Domain Names (sebastian-nagel) #179
- Add main() to EffectiveTldFinder (sebastian-nagel) #187
- Handle new suffixes in PaidLevelDomain (kkrugler) #183
- Remove Tika dependency (kkrugler) #199
- Improve MIME detection for sitemaps (sebastian-nagel) #200
- Make RobotRules accessible (aecio via kkrugler) #134
- SimpleRobotRulesParser: Expose MAX_WARNINGS and MAX_CRAWL_DELAY (aecio via kkrugler) #194
- Added main to SimpleRobotRulesParser for testing (sebastian-nagel) #193
- Allow for legacy URIs when checking sitemap namespaces (sebastian-nagel) #211
Release 0.9 (2017-10-27)
- [Sitemaps] Removed DOM-based sitemap parser (jnioche) #177
- Incorrect domains returned by EffectiveTldFinder (sebastian-nagel) #172
- [Sitemaps] Add namespace aware DOM/SAX parsing for XML Sitemaps (Marko Milicevic, jnioche, sebastian-nagel) #176
- Upgraded Tika 1.16 (jnioche) #175
- [Sitemaps] Sitemap SAX parsing mangles target URLs (jnioche, sebastian-nagel) #169
- [Sitemaps] RSS parser ignores pubDate of link (MichealKum via kkrugler) #166
Release 0.8 (2017-06-09)
- Upgraded Tika 1.15 (jnioche) #163
- [Sitemaps] Disable XML resolvers (sebastian-nagel) #151
- Update forbiddenapis to v2.3 (jnioche) #99
- [Sitemaps] gzipped text files fail to parse (sebastian-nagel) #143
- [Sitemaps] Optionally use SAX parser (matt-deboer, jnioche, sebastian-nagel) #116
- [Sitemaps] Properly log XML parsing errors (sebastian-nagel) #146
- Use StandardCharsets where applicable (sebastian-nagel) #141
- Increase sitemap size limit to 50MB (Chaiavi) #132
- Remove dependencies to system-specific locale (sebastian-nagel) #137
- BasicURLNormalizer: NPE for URLs without authority (sebastian-nagel) #136
- BasicURLNormalizer to strip empty port (sebastian-nagel) #133
- Remove deprecated HTTP fetcher (kkrugler) #96
Release 0.7 (2016-11-24)
- Upgrade to JDK 1.8 (lewismc) #126
- [Sitemaps] SitemapParser methods now protected (michaellavelle) #124
- [Sitemaps] Faster parsing of dates (jnioche) #117
- Upgraded Tika 1.13 (jnioche) #113
- Fix license headers (jnioche) #108
- Rename package crawlercommons.url (jnioche) #107
- Sitemap url is not extracted if user agent matches earlier in file (srwilson, kkrugler) #112
- Deprecate HTTP fetcher support (kkrugler) #92
- Added URLFilter interface + BasicURLNormalizer (jnioche) #106
- Updated tld names from publicsuffix.org (jnioche) #100
- Upgraded http-client to version 4.5.1 (aecio via kkrugler) #84
- Upgraded Tika 1.10 (jnioche) #89
- [Sitemaps] Upgrade Valid / Legal / Strict SitemapUrls (Avi Hayun) #82
- [Sitemaps] Upgrade Valid / Legal / Strict SitemapUrls (Avi Hayun) #60
- Simplify pom file (jnioche, lewismc) #77
- Upgrade javac.src.version and javac.target.version to 1.7 or 1.8 (lewismc) #93
- [Sitemaps] Not able to detect RSS feeds (yogendrasoni via kkrugler) #87
- [Robots] Added javadoc comments to the SimpleRobotRulesParser class (MichaelRoeder, kkrugler) #95
Release 0.6 (2015-05-27)
- Issue 75: [Sitemaps] more robust parsing of XML elements (jnioche, kkrugler)
- Issue 76: maven-java-formatter-plugin (jnioche)
- Issue 73: Switch groupID in pom from com.google.code.crawler-commons to crawler-commons (jnioche)
- Issue 71: Upgrade to Tika 1.8 (jnioche)
- Issue 68: [Robots] Path matching should be case-sensitive (kkrugler)
- Issue 67: [Sitemaps] Parsing of lastMod date should use time portion (kkrugler)
- Issue 59: [Robots] Let SimpleRobotRules and its members implements the Serializable interface (kkrugler)
- Issue 65: [Sitemaps] Make SiteMapTool simpler by removing the Recursive flag (Avi Hayun)
- Issue 64: Upgraded to Tika 1.7 (jnioche)
- Issue 32: [Robots] Resolve relative URL for sitemaps (jnioche)
- Issue 62: [Sitemaps] Add new parseSiteMap method (jnioche)
- Issue 57: [Sitemaps] SiteMap should contain a list of SitemapUrls instead of a table of them (Avi Hayun)
- Issue 51: Upgrade httpclient to the latest version (Avi Hayun)
- Issue 61: [Sitemaps] Sitemap Parser changes the processed flag unnecessarily (Avi Hayun)
- Issue 56: [Sitemaps] SiteMap.setBaseUrl(...) causes the domain name to be lowered case which shouldn't happen (Avi Hayun)
- Issue 50: Add Fetch Report to FetchedResult (lewismc, avraham2)
- Issue 55: [Sitemaps] SitemapUrl "setPriority(String str)" should check for proper value (Avi Hayun)
Release 0.5 (2014-10-15)
- Issue 53: Spaces in a comma separated list of names in a User-agent: line cause rules to be applicable to all agents (kkrugler)
- Issue 45: [Sitemaps] Upgrade code after release of Tika v1.6 (Avi Hayun)
- Issue 48: Upgraded to Tika 1.6 (jnioche)
- Issue 47: [Sitemaps] SiteMapParser Tika detection doesn't work well on some cases (Avi Hayun)
- Issue 40: [Sitemaps] Add Tika MediaType Support (Avi Hayun)
- Issue 39: [Sitemaps] Add the Parser a convenience method with only a URL argument (Avi Hayun via lewismc)
- Issue 42: [Sitemaps] Add more JUnit tests (Avi Hayun via lewismc)
- Issue 37: Upgrade the Slf4j logging Library to v1.7.7 (Avi Hayun via kkrugler)
- Issue 41: Upgrade to JUnit v4 conventions in SiteMapParser (Avi Hayun via lewismc)
- Issue 34: Upgrade the Slf4j logging in SiteMaps (Avi Hayun via lewismc)
Release 0.4 (2014-04-11)
- Issue 13: Fix deprecation in Crawler Commons Code (lewismc via kkrugler)
- Issue 8 : Upgrade of httpclient to v4.2.6 (Fuad Efendi, lewismc via kkrugler)
- Issue 18: Support matching against query parameters in robots.txt rules (alparslanavci, kkrugler)
- Issue 21: Follow Google example of giving Allow directives higher match weight than Disallow directives (y.vladimirov, via kkrugler)
- Issue 22: Use longest-match-wins approach to matching URLs in robots.txt (kkrugler)
- Issue 17: Support Googlebot-compatible regular expressions in URL specifications (alparslanavci. kkrugler)
- Issue 31: Missing top level domains (jnioche, kkrugler)
- Issue 23: Trivial improvements to UserAgent (lewismc)
- Issue 30: SitemapIndex should allow to skip sitemaps (Sebastian Nagel, kkrugler)
- cleanup of ANT build remnants [lib and lib-ext] (jnioche)
Release 0.3 (2013-10-11)
- Upgraded to Tika 1.4 (jnioche)
- [SiteMap] added utility class for testing sitemaps (jnioche)
- Issue 16: remove ant scripts and configuration (lewismc)
- Issue 27: [SiteMap] Unnecessary String concatenations when logging + in SiteMapURL.toString() (jnioche)
- Issue 26: [SiteMap] Set correct default priority for URL in a sitemap file (jnioche)
- Issue 25: [Robots] Robots parser should not lowercase sitemap URLs (jnioche)
- Issue 29: [SiteMap] try urls when <loc> element is missing (jnioche)
Release 0.2 (2013-02-02)
- Move to pure Maven for CC build lifecycle (lewismc)
- Move Javadoc out of core code (lewismc)
- Substantiate Javadoc (lewismc)
- Review default.properties (lewismc)
- add HTTP status code & reason to FetchedResult (Fuad Efendi via kkrugler)
- support for multiple user agent names (Tejas Patil via kkrugler)
- added javadoc generation, publish in /doc/javadoc (kkrugler)
- switch to using eclipse-formatter.properties (kkrugler)
- support robots.txt files that have UTF-16LE and UTF-16BE BOMs (kkrugler)
- support for user agent names that contain spaces (kkrugler)
- fixed handling of BOM in sitemaps (Vivek Magotra via kkrugler)
- refactoring of SiteMap objects (Hannes Schwarz via jnioche)
- added simple support for the file: protocol (kkrugler)
- cleaned up packaging and added "install" target (kkrugler)
Release 0.1
- parsing robots.txt
- parsing sitemaps
- URL analyzer which returns Top Level Domains
- a simple HttpFetcher