[NUTCH-2856] Implement a protocol-smb plugin based on hierynomus/smbj #826
base: master
Draft version of a protocol-smb plugin. Lots of todo comments still, but it seems to work.
Moving this to DRAFT status and acknowledging the PR. Thank you, @HiranChaudhuri.
Hi @HiranChaudhuri, I added quite a few comments for your consideration. Thanks for submitting this PR 👍
Please ping me once you're ready and we can go for round 2 of peer review.
Further out, I think we could implement some testing for this protocol plugin. We could use testcontainers
and essentially spin up a local Samba server using @nddipiazza's smbj-inttest image. We can come back to this once the PR has evolved a bit.
@@ -26,3 +26,7 @@ lib/spotbugs-*
ivy/dependency-check-ant/*
.gradle*
ivy/apache-rat-*
.vscode
crawl
Please remove crawl, urls, and solr_datadir.
This prevents me from accidentally committing these directories.
Maybe the tutorial should point out where to create such data; in my case it ended up at this level.
So I'll follow your recommendation at some later point in time.
@@ -25,7 +25,8 @@
<Appenders>
<RollingFile name="RollingFile" fileName="${hadoop.log.dir}/${hadoop.log.file}"
filePattern="${hadoop.log.dir}/$${date:yyyy-MM}/nutch-%d{yyyy-MM-dd}.log.gz">
<PatternLayout pattern="%d %p %c{1.} [%t] %m%n" />
<!--<PatternLayout pattern="%d %p %c{1.} [%t] %m%n" />-->
<PatternLayout pattern="%d %p %c [%t] %m%n" />
What's the reason for this change? Does this print the logger name in full?
Yes, it prints the logger name in full. With the abbreviated name it is not entirely clear to me where a message is coming from.
We can revert the change before merge; for the time being I need it on my side.
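For readers comparing the two patterns: in Log4j 2, `%c{1.}` shortens each package segment of the logger name to one character and keeps the last segment, while plain `%c` prints the name unchanged. The sketch below only emulates that abbreviation to show what is lost; it is not Log4j's own code, and both `LoggerNameAbbreviator` and the logger name `org.apache.nutch.protocol.smb.Smb` are hypothetical names used for illustration.

```java
final class LoggerNameAbbreviator {

    /**
     * Emulates the effect of the "%c{1.}" conversion pattern: every segment
     * except the last is reduced to its first character.
     */
    static String abbreviate(String loggerName) {
        String[] parts = loggerName.split("\\.");
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < parts.length - 1; i++) {
            if (!parts[i].isEmpty()) {
                sb.append(parts[i].charAt(0));
            }
            sb.append('.');
        }
        // The last segment (usually the class name) is kept in full.
        sb.append(parts[parts.length - 1]);
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical logger name, chosen only for this example.
        System.out.println(abbreviate("org.apache.nutch.protocol.smb.Smb"));
    }
}
```

With the abbreviated form, two loggers whose packages differ only after the first letter of a segment become indistinguishable, which is presumably why the full name helps while debugging.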
@@ -0,0 +1,32 @@
#/bin/bash
This file can be removed.
True again. For the time being I'll keep it but we can remove it before merging.
@@ -78,6 +78,7 @@
<ant dir="protocol-httpclient" target="deploy"/>
<ant dir="protocol-interactiveselenium" target="deploy" />
<ant dir="protocol-okhttp" target="deploy"/>
<ant dir="protocol-smb" target="deploy"/>
I suggest that we also add the clean and test targets.
Please advise how. I was happy to have found this...
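A possible way to do this, sketched under the assumption that the same build file which lists the plugins under its deploy target also has test and clean targets listing them in the same style: add one `<ant>` line for protocol-smb to each. The surrounding lines are elided and the exact placement is an assumption, so please check it against the existing entries for the other protocol plugins.

```xml
<!-- inside the existing "test" target, next to the other protocol plugins -->
<ant dir="protocol-smb" target="test"/>

<!-- inside the existing "clean" target, next to the other protocol plugins -->
<ant dir="protocol-smb" target="clean"/>
```

If the clean target does not reference the plugin directory, a top-level clean will likely leave the plugin's build output in place.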
<dependencies>
<dependency org="com.hierynomus" name="smbj" rev="0.13.0"/>
<dependency org="net.engio" name="mbassador" rev="1.3.0"/>
We do not need to define these dependencies as they are transitive and will therefore be fetched when we depend on org="com.hierynomus" name="smbj" rev="0.13.0".
Please remove all of the dependencies apart from smbj. The others can be added to plugin.xml.
This may be true. However, having learned that 'ant clean' did not clean my plugin, I am not sure. Once I see it working I'll remove them.
sb.append(" ").append(HEX_ARRAY[b>>>4]).append(HEX_ARRAY[b & 0xF]);
}
LOG.warn("retrieved {} bytes starting with {}", bytes.length, sb.toString());
LOG.warn("metadata={}", metadata);
Cleanup logging
Done
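A side note in case the hex preview is ever reinstated at debug level: if `b` in the snippet above is a Java `byte`, it is signed, so `b>>>4` on a negative value is promoted to a large `int` and indexing `HEX_ARRAY` with it would throw `ArrayIndexOutOfBoundsException`. Masking with `0xFF` first avoids that. A minimal sketch follows; the class `HexPreview`, the method `preview`, and the lowercase digit table are hypothetical and not taken from the PR.

```java
final class HexPreview {

    // Hypothetical lookup table; the PR's HEX_ARRAY may differ (e.g. uppercase).
    private static final char[] HEX_ARRAY = "0123456789abcdef".toCharArray();

    /** Returns up to maxBytes leading bytes as space-separated hex pairs. */
    static String preview(byte[] bytes, int maxBytes) {
        int n = Math.max(0, Math.min(bytes.length, maxBytes));
        StringBuilder sb = new StringBuilder(n * 3);
        for (int i = 0; i < n; i++) {
            // Mask to 0..255 because bytes are signed in Java.
            int v = bytes[i] & 0xFF;
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(HEX_ARRAY[v >>> 4]).append(HEX_ARRAY[v & 0x0F]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] sample = {(byte) 0xFF, 0x01, 0x7F};
        System.out.println(preview(sample, 2));
    }
}
```

Guarding the call with a debug-level check would also keep the string building off the hot path when the preview is not logged.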
} else {
// communicate error
String message = "File not found: " + urlstr;
LOG.warn(message);
Cleanup logging
Done
throw new UnsupportedOperationException("neither directory nor file: " + urlstr);
} catch(Exception e) {
LOG.error("Could not get protocol output for " + urlstr, e);
Use parameterized logging
Done
@Override
public BaseRobotRules getRobotRules(Text url, CrawlDatum datum,
List<Content> robotsTxtContent) {
LOG.debug("getRobotRules({}, {}, {})", url, datum, robotsTxtContent);
Is this necessary?
I added some functionality for robots.txt.
@@ -0,0 +1,57 @@
package org.apache.nutch.protocol.smb;
Please add ALv2 license header
Done
Improve error handling
Rename class as requested
Added license header
Improve url parsing
added robots.txt