Implement http data exchange #477
Replies: 17 comments
-
Hello, if I understood correctly, EDC is set to enforce data usage policies when exchanging data. From this diagram https://raw.githubusercontent.com/eclipse-dataspaceconnector/DataSpaceConnector/main/docs/diagrams/mvp.png, is it correct that the data transfer goes through the EDC instances and is not done directly between the 2 data storage endpoints (fileshares)? If correct, how is the system planning to scale? To avoid becoming a bottleneck, does the infrastructure deployed for the EDC itself need to match the specifications (CPU, bandwidth, latency) of the data storage endpoints (fileshares)? Best regards,
-
Hi @ticapix, the MVP scenario you just linked follows the requirements of initiatives that currently use the EDC to support messages by the International Data Spaces (IDS), which currently follow the approach of a data transfer through the Connector component. The important aspect here: while this is possible with the EDC and used for the MVP along the user's requirements, it is not the intended way the EDC handles data transfer and data flow. So you read the MVP diagram correctly, but the EDC is already able to deal differently with the shortcomings of this approach, which btw (due to its limits) is currently also being discussed to be changed in the IDS initiative.
-
Hi @mspiekermann, thank you. Is the out-of-band communication scenario ensured by a sidecar (Envoy) that is deployed by the data storage provider[1] with administrative rights[2] on the storage controllers[3]?
Pierre
[1]: Data storage provider: the organisation managing the storage. It could be the same organisation as the data owner if this is a self-managed tenant, or a CSP if this is an off-the-shelf managed subscription.
[2]: administrative rights:
[3]: storage controllers: entity with logical and direct access to the data (CEPH/S3/SWIFT storage nodes, HDFS datanodes, FTP server, Kafka brokers, ...)
-
I would say that is not the only way this could be enabled. Adding control and observability with the sidecar pattern is something that definitely should be considered. Envoy would just be one example of a proxy capability that is often used with the sidecar pattern to add control to microservices, so there is no direct link here.
-
Hi @ticapix, As @mspiekermann mentioned, sidecars are one deployment scenario if someone is using K8S or similar infrastructure. However, this could also be generalized. For example, the data transfer plane could be a separate set of services operated by one of the parties to the transfer or a third party. Access to the data plane could be via an API that each connector uses. This would also enable the scenarios you outlined above with minimal infrastructure. One concrete example of this could be a healthcare dataspace where the data must travel across a secure, private network layer. The EDC is essentially a transfer process coordinator. It can run separately from the data plane itself as long as there is a control link between the EDC and the latter. Does this answer your question?
-
Hi @jimmarino Not really. I understand the data/control plane architecture. However, it requires the control plane to have control over the data plane, which I don't see yet for EDC. The example of K8s was given, but K8s is a container orchestrator, not a data storage. Unless EDC plans to deploy storage services (MongoDB, Kafka, HDFS, S3/SWIFT, ...) at runtime inside a K8s cluster? We have described 4 scenarios so far:
Have I captured all the scenarios?
-
@ticapix sorry for the delay, we were tied up with the EDC Hackathon last week. I suspect we may be talking about different things. With the EDC, data transfer generally happens out-of-band, that is, the flow of bits is carried out by other infrastructure. This could be a Kafka cluster, cloud events, blob storage transfer, etc. Two EDC participants will coordinate with this infrastructure on their respective ends. For example, in a transfer between AWS S3 and Azure Object Storage, the EDC on the Azure side will set up a receiving container and temporary access credentials for the EDC on the S3 end to signal its infrastructure to stream out of a bucket to the Azure destination. I'm simplifying the flow a bit (for example, instead of pushing data, "pull" scenarios can also be enabled) but this captures the overall architecture. Usage control can be further applied, for example, the destination Azure container could be confined to a geographic region. There are also hooks in the EDC to conduct asynchronous provisioning of resources prior to data transfer. This could potentially be used to deploy storage resources, prep data or other tasks. Does this answer your question?
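To make the push flow above concrete, here is a minimal Python sketch. It is not EDC code: `ObjectStore`, `Destination`, `ConsumerConnector`, `ProviderConnector` and `stream_out` are hypothetical names, the storage is an in-memory stand-in rather than a real S3 or Azure SDK, and the region check is only one assumed form of usage control. The point it illustrates is that the two connectors exchange a destination and a temporary credential, while a separate data-plane function moves the bytes.

```python
import secrets
from dataclasses import dataclass, field


@dataclass
class Destination:
    """Receiving container, its region, and a temporary write token (all hypothetical)."""
    container: str
    region: str
    token: str


@dataclass
class ObjectStore:
    """In-memory stand-in for a storage service; not a real cloud SDK."""
    objects: dict = field(default_factory=dict)
    write_tokens: dict = field(default_factory=dict)  # token -> container it may write to

    def write(self, container, key, data, token):
        if self.write_tokens.get(token) != container:
            raise PermissionError("token does not grant access to this container")
        self.objects[(container, key)] = data


def stream_out(source_bucket, key, dest, dest_store):
    """Data plane: copies the bytes using the temporary token."""
    dest_store.write(dest.container, key, source_bucket[key], dest.token)


class ConsumerConnector:
    """Control-plane role on the receiving side."""

    def __init__(self, store, region):
        self.store, self.region = store, region

    def prepare_destination(self, transfer_id):
        # Set up a receiving container and a temporary credential for it.
        container = f"incoming-{transfer_id}"
        token = secrets.token_hex(8)
        self.store.write_tokens[token] = container
        return Destination(container, self.region, token)


class ProviderConnector:
    """Control-plane role on the sending side."""

    def __init__(self, bucket, permitted_regions):
        self.bucket, self.permitted_regions = bucket, permitted_regions

    def start_transfer(self, key, dest, dest_store):
        # Assumed usage-control rule: the destination must be in a permitted region.
        if dest.region not in self.permitted_regions:
            raise PermissionError(f"region {dest.region} is not permitted by policy")
        # Signal the data plane; the connector itself does not carry the payload.
        stream_out(self.bucket, key, dest, dest_store)


azure = ObjectStore()
consumer = ConsumerConnector(azure, region="westeurope")
provider = ProviderConnector({"report.csv": b"a,b\n1,2\n"}, {"westeurope"})
dest = consumer.prepare_destination("t1")
provider.start_transfer("report.csv", dest, azure)
print(sorted(azure.objects))
```

In a real deployment `stream_out` would be whatever the infrastructure offers (a storage copy job, a Kafka producer, and so on); the connectors would only hand it the destination and credential.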
-
Hi @jimmarino We're getting closer :)
and
That's the coordinator mode as described above, provisioning managed services. Is that correct?
-
My mental model is that the EDC is always a "coordinator" in that it manages the contract and data flow process (which are asynchronous state machines). How much it "manages" depends on the deployment. For example, an EDC may provision cloud resources (or send a request to other infrastructure to do so), or not provision anything, in which case it would use "static" or "pre-deployed" infrastructure. A simple example would be S3 buckets. An EDC could be set up to dynamically create them or use pre-existing buckets.
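As a rough illustration of that mental model, the sketch below steps a transfer through a small state machine and shows the two bucket cases: dynamic provisioning versus a pre-existing ("static") bucket. The state names, the `TransferProcess` class and the `run` helper are assumptions made for this example and are not taken from the EDC codebase; the set of bucket names stands in for real storage.

```python
from enum import Enum, auto


class State(Enum):
    INITIAL = auto()
    PROVISIONING = auto()
    PROVISIONED = auto()
    IN_PROGRESS = auto()
    COMPLETED = auto()


class TransferProcess:
    """Hypothetical transfer process; advances one state per tick()."""

    def __init__(self, bucket, existing_buckets, provision_dynamically):
        self.bucket = bucket
        self.buckets = existing_buckets  # set of bucket names, stands in for real storage
        self.provision_dynamically = provision_dynamically
        self.state = State.INITIAL

    def tick(self):
        if self.state is State.INITIAL:
            if self.bucket in self.buckets:
                # Static / pre-deployed infrastructure: nothing to provision.
                self.state = State.PROVISIONED
            elif self.provision_dynamically:
                self.state = State.PROVISIONING
            else:
                raise LookupError(f"bucket {self.bucket!r} missing and provisioning is disabled")
        elif self.state is State.PROVISIONING:
            # A real coordinator would wait here for the infrastructure to report back.
            self.buckets.add(self.bucket)
            self.state = State.PROVISIONED
        elif self.state is State.PROVISIONED:
            self.state = State.IN_PROGRESS
        elif self.state is State.IN_PROGRESS:
            self.state = State.COMPLETED
        return self.state


def run(process):
    """Drive the process to completion and return the states it visited."""
    visited = [process.state]
    while process.state is not State.COMPLETED:
        visited.append(process.tick())
    return visited


dynamic = run(TransferProcess("new-bucket", set(), provision_dynamically=True))
static = run(TransferProcess("shared", {"shared"}, provision_dynamically=False))
print([s.name for s in dynamic])
print([s.name for s in static])
```

The synchronous loop in `run` is only for demonstration; in an asynchronous setup each `tick` would be triggered by a timer or by an event from the infrastructure.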
-
Ok.
-
@ticapix Sorry if my explanation was unclear, but the EDC can enforce access control. There are a number of ways to do this. In the case where storage endpoints are used, the EDC can be configured to create either temporary or long-term access credentials. It can also revoke those credentials at any time for long-running/non-finite processes (e.g., streams). With some infrastructures, it is even possible for the EDC to push down policy to the network layer for tighter control. One large-scale project building on EDC in the travel industry is using EDC to control access to a Data Service (API). Access to that service can be revoked, updated, or limited throughout the data access lifecycle. EDC can use OPA to establish controls at different infrastructure layers. Many users provision EDC setups using Terraform. However, EDC, OPA, Terraform, and Ansible are completely different technologies that solve different problems. Terraform, Ansible, and OPA do not solve the problem of policy-based data sharing. In turn, the EDC does not solve the problem of provisioning deployments or configuring policy in a generic runtime environment. Basically, EDC solves the problem of how different entities can:
We plan on releasing Milestone 1 on Friday, December 3. Of course, there is still a lot of work to be done, but this release will provide many of the key pieces of our vision for dataspaces as well as a strong foundation for GAIA-X use cases that center on data sharing. I hope this clarifies things.
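The credential handling described above (temporary or long-term credentials, revocable at any time) could look roughly like the following sketch. `CredentialIssuer` and its methods are hypothetical names for illustration and do not correspond to an EDC API; the token store is an in-memory dictionary, whereas a real setup would delegate to the storage provider's own credential mechanism.

```python
import secrets
import time


class CredentialIssuer:
    """Hypothetical issuer of temporary or long-term access tokens."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._grants = {}  # token -> expiry timestamp, or None for long-term

    def issue(self, ttl_seconds=None):
        """Issue a token; ttl_seconds=None means it stays valid until revoked."""
        token = secrets.token_urlsafe(16)
        expiry = None if ttl_seconds is None else self._clock() + ttl_seconds
        self._grants[token] = expiry
        return token

    def revoke(self, token):
        """Revoke at any time, e.g. for a long-running stream."""
        self._grants.pop(token, None)

    def is_valid(self, token):
        if token not in self._grants:
            return False
        expiry = self._grants[token]
        return expiry is None or self._clock() < expiry


issuer = CredentialIssuer()
stream_token = issuer.issue()              # long-term, for a stream
batch_token = issuer.issue(ttl_seconds=300)  # temporary, for a one-off transfer
issuer.revoke(stream_token)
print(issuer.is_valid(stream_token), issuer.is_valid(batch_token))
```

The storage endpoint (or a proxy in front of it) would call something like `is_valid` on each request, which is what lets a revocation take effect during an ongoing transfer.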
-
Hi @jimmarino The sequence of actions is still unclear to me. I agree that, if EDC was granted control over the user's tenant:
However, once the first token has been shared - or access granted - I don't get how EDC knows when to update its state machine. After step 8, how is EDC aware of what actions must be taken to update the access rights?
Pierre
sequenceDiagram
autonumber
actor bob as Bob
participant data_bob as Bob's storage endpoint<br>(object storage, HDFS, FTP, ...)
participant edc_bob as Bob's EDC instance
participant edc_alice as Alice's EDC instance
participant process_alice as Alice's processing workload<br>(Spark, Flink, Beam, notebook, ...)
actor alice as Alice
Note over bob, alice: scenario: Alice wants to get data from Bob
bob ->> edc_bob: delegates rights to
alice ->> edc_alice: delegates rights to
edc_bob --> edc_alice: find each other and negotiate access control
edc_bob ->> data_bob: configures access
edc_bob ->> edc_alice: shares access token
edc_alice ->> process_alice: shares access token
process_alice ->> data_bob: requests data
data_bob ->> process_alice: sends data
-
A few questions in order to answer that:
-
If the goal is to have EDC act as a one-time access control, then my question is not relevant.
BTW, none of the above answers one of my other questions: how EDC controls which data is being accessed. But we can finish this thread first.
-
Part of the issue we are having is that our discussion is too abstract, and we don't have a common set of terminology to settle on. My suggestion would be to back up and define a specific use case we want to address. Once we have that defined, we can map that to an EDC architecture. Otherwise, I'm afraid we will go in circles, and I won't do a good job answering your questions. That said, Catena-X, EONA-X, the Mobility DataSpace and others (there will be public announcements soon) are committed to using the EDC as the foundation of their dataspace platforms. Shall we pick a specific use case from one of those projects and walk through a detailed analysis of how the EDC is being used? As I mentioned, our Milestone 1 release is due on Friday, so what we walk through will be actual code, not plans or diagrams. This should allow us to drill down to a very concrete level quickly. If you want, I can pick a use case and involve relevant people from one of those projects in the discussion. Let me know.
-
The sequence diagram above is pretty clear to me, and so was the table with the different modes, but ok. Considering that none of the use cases I have heard of or have in mind seem to work with EDC, could you provide one? Thank you,
-
@ticapix Please don't jump to conclusions. We never said the use cases you have in mind won't work. We are having difficulty arriving at a common set of terminology. We will outline a very specific use case based on an actual project that is currently being developed. I believe this will allow us to make progress as well as keep the discussion grounded in concrete terms.