[NUTCH-2856] Implement a protocol-smb plugin based on hierynomus/smbj #826

Draft: wants to merge 3 commits into base: master

Conversation

HiranChaudhuri
Contributor

[NUTCH-2856] Implement a protocol-smb plugin based on hierynomus/smbj

Draft version of a protocol-smb plugin. Lots of todo comments still,
but it seems to work.

@lewismc lewismc marked this pull request as draft October 3, 2024 04:10
Member

lewismc commented Oct 3, 2024

Moving this to DRAFT status and acknowledging the PR @HiranChaudhuri thank you.
I will try to perform a full review this coming week... thank you.

Member

@lewismc lewismc left a comment

Hi @HiranChaudhuri, I added quite a few comments for your consideration. Thanks for submitting this PR 👍
Please ping me once you're ready and we can go for round 2 of peer review.

Further out, I think we could implement some testing for this protocol plugin. We could use testcontainers and essentially spin up a local Samba server using @nddipiazza's smbj-inttest image. We can come back to this once the PR has evolved a bit.

@@ -26,3 +26,7 @@ lib/spotbugs-*
ivy/dependency-check-ant/*
.gradle*
ivy/apache-rat-*
.vscode
crawl
Member

Please remove crawl, urls, and solr_datadir.

Contributor Author

This prevents me from accidentally committing these directories.
Maybe the tutorial should point out where to create such data; in my case it ended up at this level.
So I'll follow your recommendation at some later point in time.

@@ -25,7 +25,8 @@
<Appenders>
<RollingFile name="RollingFile" fileName="${hadoop.log.dir}/${hadoop.log.file}"
filePattern="${hadoop.log.dir}/$${date:yyyy-MM}/nutch-%d{yyyy-MM-dd}.log.gz">
-<PatternLayout pattern="%d %p %c{1.} [%t] %m%n" />
+<!--<PatternLayout pattern="%d %p %c{1.} [%t] %m%n" />-->
+<PatternLayout pattern="%d %p %c [%t] %m%n" />
Member

What's the reason for this change? Does this print the logger name in full?

Contributor Author

Yes, it prints the logger name in full; otherwise it is not entirely clear to me where a message is coming from.
We can revert the change before merging; for the time being I need it on my side.
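
For context, Log4j 2's `%c{1.}` conversion shortens each package segment of the logger name to its first character and keeps the final segment whole, while plain `%c` prints the full name. The sketch below only mimics that abbreviation with plain Java to show what is lost in the log line; it does not call Log4j, and the class name `ExampleSmbClass` is a hypothetical placeholder, not a class from this PR.

```java
public final class LoggerNameDemo {
    private LoggerNameDemo() {}

    /**
     * Mimics the effect of %c{1.}: every package segment is cut to its
     * first character, the last segment (the class name) stays whole.
     */
    public static String abbreviate(String loggerName) {
        String[] parts = loggerName.split("\\.");
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            boolean last = (i == parts.length - 1);
            if (last || parts[i].isEmpty()) {
                sb.append(parts[i]);
            } else {
                sb.append(parts[i].charAt(0));
            }
            if (!last) {
                sb.append('.');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical logger name inside the plugin's package.
        String name = "org.apache.nutch.protocol.smb.ExampleSmbClass";
        System.out.println("%c{1.} -> " + abbreviate(name));
        System.out.println("%c     -> " + name);
    }
}
```

With the abbreviated form, two loggers in packages that share the same initials become hard to tell apart, which is presumably why the full name helps while debugging.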

@@ -0,0 +1,32 @@
#/bin/bash
Member

This file can be removed.

Contributor Author

True again. For the time being I'll keep it but we can remove it before merging.

@@ -78,6 +78,7 @@
<ant dir="protocol-httpclient" target="deploy"/>
<ant dir="protocol-interactiveselenium" target="deploy" />
<ant dir="protocol-okhttp" target="deploy"/>
<ant dir="protocol-smb" target="deploy"/>
Member

I suggest that we also add the clean and test targets.

Contributor Author

Please advise how. I was happy to have found this...


<dependencies>
<dependency org="com.hierynomus" name="smbj" rev="0.13.0"/>
<dependency org="net.engio" name="mbassador" rev="1.3.0"/>
Member

We do not need to define these dependencies as they are transitive and will therefore be fetched when we depend on org="com.hierynomus" name="smbj" rev="0.13.0".

Please remove all of the dependencies apart from smbj. The others can be added to plugin.xml.

Contributor Author

This may be true. However, having learned that 'ant clean' did not clean my plugin, I am not sure. Once I see it working I'll remove them.

sb.append(" ").append(HEX_ARRAY[b>>>4]).append(HEX_ARRAY[b & 0xF]);
}
LOG.warn("retrieved {} bytes starting with {}", bytes.length, sb.toString());
LOG.warn("metadata={}", metadata);
Member

Cleanup logging

Contributor Author

Done
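
One detail worth checking if a hex preview like the one above is kept at debug level: in Java, if `b` is a `byte`, `b >>> 4` first widens it to a sign-extended `int`, so a negative byte such as `(byte) 0xFF` yields an index far outside a 16-element table. The sketch below masks each byte to 0..255 first. It assumes `HEX_ARRAY` is the usual 16-character lookup table; the class and method names (`HexPreview`, `preview`) are hypothetical and not taken from the PR.

```java
public final class HexPreview {
    // Assumed to match the HEX_ARRAY used in the plugin: 16 hex digits.
    private static final char[] HEX_ARRAY = "0123456789ABCDEF".toCharArray();

    private HexPreview() {}

    /** Returns the first `limit` bytes as space-separated two-digit hex values. */
    public static String preview(byte[] bytes, int limit) {
        int n = Math.min(bytes.length, Math.max(limit, 0));
        StringBuilder sb = new StringBuilder(n * 3);
        for (int i = 0; i < n; i++) {
            // Mask to 0..255 so negative bytes do not sign-extend.
            int v = bytes[i] & 0xFF;
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(HEX_ARRAY[v >>> 4]).append(HEX_ARRAY[v & 0x0F]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] sample = {(byte) 0xFF, 0x53, 0x4D, 0x42};
        System.out.println(preview(sample, 16)); // prints "FF 53 4D 42"
    }
}
```

Capping the preview with `limit` keeps the log line short regardless of file size.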

} else {
// communicate error
String message = "File not found: " + urlstr;
LOG.warn(message);
Member

Cleanup logging

Contributor Author

Done


throw new UnsupportedOperationException("neither directory nor file: " + urlstr);
} catch(Exception e) {
LOG.error("Could not get protocol output for " + urlstr, e);
Member

Use parameterized logging

Contributor Author

Done

@Override
public BaseRobotRules getRobotRules(Text url, CrawlDatum datum,
List<Content> robotsTxtContent) {
LOG.debug("getRobotRules({}, {}, {})", url, datum, robotsTxtContent);
Member

Is this necessary?

Contributor Author

I added some functionality for robots.txt.

@@ -0,0 +1,57 @@
package org.apache.nutch.protocol.smb;
Member

Please add ALv2 license header

Contributor Author

Done

Improve error handling
Rename class as requested
Added license header
Improve url parsing
added robots.txt

2 participants