Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add deduplication ratio to 'ipfs dag stat' #9787

Merged
merged 27 commits into from
Jun 6, 2023

Conversation

arthurgavazza
Copy link
Contributor

@arthurgavazza arthurgavazza commented Apr 1, 2023

This adds deduplication information on a list of Cids for the ipfs dag stat command,as requested in #9762. ex:
running ipfs dag stat QmcQs9AFNh4mHKqFLJ2Zo7WPSottWi9YDdEoqdk3jv5RyE QmSg4hRo947Lkp9qKn3KJhjjCFGNmtw8VLoXhUCC9mLRgU QmSg4hRo947Lkp9qKn3KJhjjCFGNmtw8VLoXhUCC9mLRgU writes the following output to stdout:

CID                                           	Blocks         	Size
QmcQs9AFNh4mHKqFLJ2Zo7WPSottWi9YDdEoqdk3jv5RyE	2500000        	27
QmSg4hRo947Lkp9qKn3KJhjjCFGNmtw8VLoXhUCC9mLRgU	2500000        	27
QmSg4hRo947Lkp9qKn3KJhjjCFGNmtw8VLoXhUCC9mLRgU	2500000        	27

Summary

Total Size: 54
Unique Blocks: 2
Shared Size: 27
Ratio: 1.500000

and with the --enc=json option:

{
    "UniqueBlocks": 2,
    "TotalSize": 54,
    "SharedSize": 27,
    "Ratio": 1.5,
    "DagStats": [
        {
            "Cid": "QmcQs9AFNh4mHKqFLJ2Zo7WPSottWi9YDdEoqdk3jv5RyE",
            "Size": 27,
            "NumBlocks": 1
        },
        {
            "Cid": "QmSg4hRo947Lkp9qKn3KJhjjCFGNmtw8VLoXhUCC9mLRgU",
            "Size": 27,
            "NumBlocks": 1
        },
        {
            "Cid": "QmSg4hRo947Lkp9qKn3KJhjjCFGNmtw8VLoXhUCC9mLRgU",
            "Size": 27,
            "NumBlocks": 1
        }
    ]
}


Fixes #9762

@arthurgavazza
Copy link
Contributor Author

@Jorropo this is just a draft, I'm still working on it, I'll add the tests and improve some things before starting with the bonus points. Can you take a look at this, just to see if I'm in the right direction? I don't know if that's what you wanted for the json and text encoders.

@Jorropo Jorropo self-requested a review April 4, 2023 02:40
@arthurgavazza
Copy link
Contributor Author

@Jorropo can you review this draft? I'm kind of blocked on this feature, I just need to know if I'm on the right direction and if this satisfies the issue basic requirements so I can continue working on it and make the improvements you mentioned.

@BigLep
Copy link
Contributor

BigLep commented Apr 19, 2023

Just wanted to say thanks @arthurgavazza for your contribution. The maintainers have been busy with https://2023.ipfs-thing.io/ but will be back into more normal flow next week.

@arthurgavazza
Copy link
Contributor Author

@BigLep thanks for letting me know!

Copy link
Contributor

@Jorropo Jorropo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx a lot <3 this is good.
I have a few comments.

core/commands/dag/dag.go Outdated Show resolved Hide resolved
core/commands/dag/dag.go Outdated Show resolved Hide resolved
core/commands/dag/dag.go Outdated Show resolved Hide resolved
Comment on lines 342 to 344
if s.TotalSize > 0 {
s.Ratio = float32(s.redundantSize) / float32(s.TotalSize)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would remove the if check and let s.Ratio being NaN and implement custom marshal which omits Ratio if it is NaN. But that fine, probably.

core/commands/dag/dag.go Outdated Show resolved Hide resolved
core/commands/dag/dag.go Outdated Show resolved Hide resolved
core/commands/dag/stat.go Show resolved Hide resolved
core/commands/dag/stat.go Outdated Show resolved Hide resolved
core/commands/dag/dag.go Outdated Show resolved Hide resolved
core/commands/dag/dag.go Outdated Show resolved Hide resolved
core/commands/dag/dag.go Outdated Show resolved Hide resolved
@arthurgavazza
Copy link
Contributor Author

@Jorropo thx for the feedback! I'll work on this ASAP

arthurgavazza and others added 5 commits May 2, 2023 21:06
Co-authored-by: Jorropo <jorropo.pgm@gmail.com>
Co-authored-by: Jorropo <jorropo.pgm@gmail.com>
Co-authored-by: Jorropo <jorropo.pgm@gmail.com>
Co-authored-by: Jorropo <jorropo.pgm@gmail.com>
Co-authored-by: Jorropo <jorropo.pgm@gmail.com>
@Jorropo
Copy link
Contributor

Jorropo commented May 3, 2023

For ci fix please run:

find -type f -name "go.mod" | while read f; do cd $(dirname "${f}"); go mod tidy; cd -; done
go fmt ./...

@BigLep
Copy link
Contributor

BigLep commented May 11, 2023

@arthurgavazza: are you to fix CI? It would be great to include in the next release.

@BigLep BigLep marked this pull request as ready for review May 11, 2023 06:04
@BigLep BigLep requested a review from a team as a code owner May 11, 2023 06:04
@Jorropo
Copy link
Contributor

Jorropo commented May 11, 2023

I think some fixes after the review havn't been pushed yet to. 🙂

@arthurgavazza
Copy link
Contributor Author

@BigLep @Jorropo I've been busy with work this week. I still have to finish some fixes, I'm gonna try to do this until tomorrow night if that's ok

@Jorropo
Copy link
Contributor

Jorropo commented May 11, 2023

@arthurgavazza you don't have any deadline, feel free to work on this when you have time. It will get in the release cycle whenever that happens.
Thx for your time this is appreciated.

assert.Nil(t, err)
stat := node.RunIPFS("dag", "stat", "--progress=false", node1Cid, node2Cid)
str := stat.Stdout.String()
expected := "\nCID \tBlocks \tSize\nbafyreibmdfd7c5db4kls4ty57zljfhqv36gi43l6txl44pi423wwmeskwy\t2 \t53\nbafyreie3njilzdi4ixumru4nzgecsnjtu7fzfcwhg7e6s4s5i7cnbslvn4\t2 \t53\n\nSummary\nTotal Size: 145\nUnique Blocks: 3\nShared Size: 53\nRatio: 1.365517\n\n\n"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use the ` syntax, this is very hard to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry for that, I'm gonna make this more readable

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Jorropo I'm having a hard time comparing the text output with a string literal in this test, I added a test using the json encoded output so that I can validate the deduplication behaviour. Is there any other way that I could test the text output?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arthurgavazza json is good.

For the text one: the point of golden tests is that the code of the test is very dumb.
You create a dataset ahead of time that show all the behaviour you want (here it would be a .car fixture) and then you == or diff the output with some hardcoded values you create at the same time you make your fixture.

test/cli/dag_test.go Outdated Show resolved Hide resolved
test/cli/dag_test.go Show resolved Hide resolved
Co-authored-by: Jorropo <jorropo.pgm@gmail.com>
@arthurgavazza
Copy link
Contributor Author

@Jorropo sorry for taking too long to make the changes, I was very busy this last few weeks. I added a new test, comparing the text output with a fixture containing the expected results. Can you please take a look at the latest changes?

@hacdias hacdias requested a review from Jorropo June 5, 2023 09:42
core/commands/dag/dag.go Outdated Show resolved Hide resolved
Copy link
Member

@hacdias hacdias left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very happy that you also moved the old relevant tests from .sh to .go 😃

@hacdias hacdias changed the title Feat/add deduplication ratio stat feat: add deduplication ratio to 'ipfs dag stat' Jun 5, 2023
Co-authored-by: Henrique Dias <hacdias@gmail.com>
Copy link
Contributor

@Jorropo Jorropo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx <3

test/cli/dag_test.go Outdated Show resolved Hide resolved
@hacdias hacdias requested a review from Jorropo June 6, 2023 13:16
@Jorropo Jorropo merged commit 726eabe into ipfs:master Jun 6, 2023
@Jorropo
Copy link
Contributor

Jorropo commented Jun 6, 2023

Thx ❤️🎉

@arthurgavazza
Copy link
Contributor Author

@Jorropo thx for all the support! ❤️ I'm really glad I could help. About the improvements you mentioned on the issue. Can I open a new issue with them and start working on it?

@arthurgavazza arthurgavazza deleted the feat/add-deduplication-ratio-stat branch June 6, 2023 13:21
@Jorropo
Copy link
Contributor

Jorropo commented Jun 6, 2023

@arthurgavazza which improvements ?

@arthurgavazza
Copy link
Contributor Author

arthurgavazza commented Jun 6, 2023

@Jorropo
Bonus points (you can go for it if you want to have fun but I don't care about it for now):

  1. You execute the multiple traversals in parallel of one an other. That means all iterations of the loop starts a goroutine and you wg.Wait() the final result before printing it.
  2. You add computation of internal compression ratios too (that means CIDs that share blocks with themselves multiple times).
  3. You optimize the code by caching results of the traversal at any point to only traverse blocks once, even if they are duplicated across the different CIDs (hard).
  4. You compute a result in a single pass (medium if done alone, very hard if combined with 1 and 3).
  5. You output rich progress information similar to the final output (instead of just some total amount blocks).

@Jorropo
Copy link
Contributor

Jorropo commented Jun 6, 2023

@arthurgavazza I don't care that much about thoses, 1, 3 and 4 are just performance related (and what you have now is ok). 2 and 5 are very minor quality of life.

Feel free to do it if that is fun for you, but if you want to make something that would change Kubo's usability I think there will be better use of your time than over optimizing an underused endpoint. 🙂

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Archived in project
Development

Successfully merging this pull request may close these issues.

ipfs dag stat should accept multiple CIDs and show deduplication ratio
4 participants