
[feature][fn] Introduce connectors catalogue #58

Draft · wants to merge 6 commits into master

Conversation

nicoloboschi (Owner)

Motivation

STILL DRAFT

To offer the builtin connectors, the functions worker must currently bundle all the connector NAR files in the Docker image.
This is not a scalable solution and it leads to heavyweight Docker images.
At functions worker startup, all the NARs are looked up in the `connectors` directory and unpacked.

The idea is to download the actual NAR on demand, only when strictly needed.
Instead of bundling the NARs inside the Docker image, the NAR files will be placed on a remote machine, which will be called the catalogue.
In the functions worker it will be possible to configure the URL for the catalogue index (only basic auth is supported for now).
The catalogue will typically be a static webserver which serves a YAML file with this structure:

entries:
  - url: https://www.apache.org/dyn/mirrors/mirrors.cgi?action=download&filename=pulsar/pulsar-2.11.0/connectors/pulsar-io-elastic-search-2.11.0.nar
    definition:
      name: elastic_search
      description: Writes data into Elastic Search
      sinkClass: org.apache.pulsar.io.elasticsearch.ElasticSearchSink
      sinkConfigClass: org.apache.pulsar.io.elasticsearch.ElasticSearchConfig
  - url: https://www.apache.org/dyn/mirrors/mirrors.cgi?action=download&filename=pulsar/pulsar-2.11.0/connectors/pulsar-io-file-2.11.0.nar
    definition:
      name: file
      description: Reads data from local filesystem
      sourceClass: org.apache.pulsar.io.file.FileSource
      sourceConfigClass: org.apache.pulsar.io.file.FileSourceConfig 
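A minimal sketch of how an index entry could be modeled in code, mirroring the YAML structure above (the class and method names here are illustrative, not the actual PR code):

```java
import java.util.List;

// Hypothetical model of one catalogue index entry; field names follow the
// YAML keys shown above ("url", "definition", "sinkClass", "sourceClass").
public class Catalogue {

    public record Definition(String name, String description,
                             String sinkClass, String sourceClass) {}

    public record Entry(String url, Definition definition) {}

    // An entry describes a sink when sinkClass is set, a source otherwise.
    public static String componentType(Entry e) {
        return e.definition().sinkClass() != null ? "SINK" : "SOURCE";
    }

    // Builtin connector metadata can be answered from the index alone,
    // without ever touching the remote NAR artifacts.
    public static List<String> connectorNames(List<Entry> entries) {
        return entries.stream().map(e -> e.definition().name()).toList();
    }
}
```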

All these entries are builtin connectors. When users ask for the builtin connectors, the metadata is retrieved from the index file.
Once the actual NAR is needed, it is downloaded on demand and cached in the container.
The NAR is downloaded in the functions worker when:

  • the sink/source is installed/updated (needed for validation)
  • the sink/source is downloaded from a pod (k8s runtime)
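The download-and-cache step described above could look roughly like this (a sketch only, assuming `java.net.http` and hypothetical names; the actual PR code may differ):

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

// Hypothetical sketch: fetch a connector NAR only when it is not already
// in the local cache under the NAR extraction directory.
public class NarDownloader {

    // Where a given archive would be cached.
    public static Path cachePath(Path narExtractionDirectory, String archiveName) {
        return narExtractionDirectory.resolve("downloads").resolve(archiveName);
    }

    public static Path fetchIfMissing(Path narExtractionDirectory, String archiveName,
                                      String url, String user, String password)
            throws IOException, InterruptedException {
        Path local = cachePath(narExtractionDirectory, archiveName);
        if (Files.exists(local)) {
            return local; // cache hit: no network access
        }
        Files.createDirectories(local.getParent());
        HttpRequest.Builder req = HttpRequest.newBuilder(URI.create(url));
        if (user != null) { // only basic auth is supported for now
            String token = Base64.getEncoder().encodeToString(
                    (user + ":" + password).getBytes(StandardCharsets.UTF_8));
            req.header("Authorization", "Basic " + token);
        }
        HttpResponse<InputStream> resp = HttpClient.newHttpClient()
                .send(req.build(), HttpResponse.BodyHandlers.ofInputStream());
        try (InputStream in = resp.body()) {
            Files.copy(in, local);
        }
        return local;
    }
}
```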

The catalogue may or may not also serve the connector artifacts. Basic auth is supported for remote artifacts as well.

In a Kubernetes environment, it would be possible to create a separate deployment for the connector catalogue, backed by nginx or a similar service.

With the catalogue decoupled from the functions worker, each connector can be upgraded independently of the functions worker image. However, a functions worker restart is needed to download a newer catalogue index.

Modifications

  • Added lazy loading of sinks/sources when needed
  • Added the catalogue index download, configured via connectorsCatalogueUrl

@cdbartholomew

On this:

However, a functions worker restart is needed to download the newer catalogue index.

Would it be possible to have the newer catalogue index downloaded on the reload command, such as:

pulsar-admin sinks reload

I believe the current implementation scans for new connector NARs on reload, which is equivalent to downloading a newer index.


@eolivelli eolivelli left a comment


Great work,
I have left a couple of suggestions

(and also we need tests)

@@ -797,7 +803,7 @@ private void setupInput(ContextImpl contextImpl) throws Exception {

pulsarSourceConfig.setSubscriptionType(contextImpl.getSubscriptionType());

-pulsarSourceConfig.setTypeClassName(sourceSpec.getTypeClassName());
+// pulsarSourceConfig.setTypeClassName(sourceSpec.getTypeClassName());


why?

final ConnectorsManager connectorsManager = new ConnectorsManager(connectorsDir, narExtractionDirectory, connectorsCatalogueUrl);
connectors.forEach((name, connector) -> {
try {
connectorsManager.loadConnector(name, connector.getConnectorDefinition().getSinkClass() == null ? ComponentType.SOURCE : ComponentType.SINK);


do we really need to do it here ? (in localrun mode eagerly download everything?)

final File localCache = new File(narExtractionDirectory,
"downloads");
localCache.mkdirs();
localArchive = new File(localCache, archiveName);


we should properly set permissions (only user can read/write)
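A minimal sketch of that suggestion, assuming a POSIX filesystem (the class and method names here are illustrative, not the PR's actual code):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Set;

// Hypothetical sketch: create the download cache directory with
// owner-only permissions instead of a bare mkdirs().
public class CacheDirs {

    public static Path createOwnerOnly(Path narExtractionDirectory) throws IOException {
        Path cache = narExtractionDirectory.resolve("downloads");
        Files.createDirectories(cache);
        try {
            // rwx for the owner only
            Set<PosixFilePermission> perms = PosixFilePermissions.fromString("rwx------");
            Files.setPosixFilePermissions(cache, perms);
        } catch (UnsupportedOperationException ignored) {
            // non-POSIX filesystems (e.g. Windows) do not support these permissions
        }
        return cache;
    }
}
```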

throws IOException {
try {
File localArchive;
if (connector.getDownloadPath().startsWith(Utils.HTTP)) {


we should log something here to make it clear that we are downloading the connector from the Internet, and also log the target directory

@@ -279,18 +279,22 @@ public static File createPkgTempFile() throws IOException {
}

public static File extractFileFromPkgURL(String destPkgUrl) throws IOException, URISyntaxException {
final File pkgTempFile = createPkgTempFile();
pkgTempFile.deleteOnExit();


this is not guaranteed to happen.
but it is already like this in legacy code, we can keep it

@@ -437,3 +437,4 @@ zooKeeperSessionTimeoutMillis: -1
# ZooKeeper operation timeout in seconds
# Deprecated: use metadataStoreOperationTimeoutSeconds
zooKeeperOperationTimeoutSeconds: -1
connectorsCatalogueUrl: file:///Users/nicolo.boschi/catalogue.yaml


this catalogue will include URLs that will probably refer to third party websites.

It may be better to also configure a list of allowed DNS names for the final connector URLs,
this way a forged catalogue cannot force the worker to download anything from the Internet.

In other words we must have an explicit list of allowed sources for the .nar files. It can be a comma separated list of DNS names or "base urls"

connectorsAllowedDownloadArchives: https://server1/xxx,https://server2/xxx,https://server2/yyyyy

(the comma character is not common in URLs, so we can use it without problems)
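The suggested check could be sketched as a simple prefix match against the comma-separated allow list (hypothetical names, not the actual PR code):

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the suggested connectorsAllowedDownloadArchives check:
// a connector URL from the catalogue is accepted only if it starts with one of
// the configured base URLs, so a forged catalogue cannot point the worker at
// arbitrary hosts.
public class DownloadAllowList {

    public static boolean isAllowed(String url, String allowedBaseUrlsCsv) {
        List<String> bases = Arrays.asList(allowedBaseUrlsCsv.split(","));
        return bases.stream().anyMatch(base -> url.startsWith(base.trim()));
    }
}
```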

futures.add(future);
}

FutureUtil.waitForAll(futures).join();


please set a timeout
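One way to bound that wait, using plain `CompletableFuture` (a sketch of the reviewer's request, not the PR's actual `FutureUtil` code):

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: wait for all download futures, but fail with a
// TimeoutException instead of hanging forever on an unbounded join().
public class BoundedWait {

    public static void waitAllWithTimeout(List<CompletableFuture<Void>> futures,
                                          long timeoutSeconds) {
        CompletableFuture.allOf(futures.toArray(new CompletableFuture[0]))
                .orTimeout(timeoutSeconds, TimeUnit.SECONDS)
                .join();
    }
}
```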

* Make the catalogue connector not downloading the artifact and configure the docker image
* Remove the need for class loader for connectors from catalogue
* Make the catalogue connector not downloading the artifact and configure the docker image
@github-actions

The pr had no activity for 30 days, mark with Stale label.

@github-actions github-actions bot added the Stale label May 19, 2023