Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add deduplication ratio to 'ipfs dag stat' #9787

Merged
merged 27 commits into from
Jun 6, 2023
Merged
Show file tree
Hide file tree
Changes from 16 commits
Commits
Show all changes
27 commits
Select commit Hold shift + click to select a range
f1604da
checkpoint1
arthurgavazza Mar 31, 2023
ac3e6b0
feat: achieve correct minimal output
arthurgavazza Mar 31, 2023
ce81fd0
feat: start adding json encoding
arthurgavazza Mar 31, 2023
6dffb58
fix: fmt
arthurgavazza Mar 31, 2023
946fdd6
feat: add custom json encoder and format text encoder
arthurgavazza Apr 1, 2023
63d271b
fix: remove test json files
arthurgavazza Apr 1, 2023
fbc1c9c
Update core/commands/dag/dag.go
arthurgavazza May 3, 2023
eac10df
fix: check error on cdv write
arthurgavazza May 3, 2023
27547d3
fix: tsv title format
arthurgavazza May 3, 2023
7f38e7e
fix: print line with Fprintln
arthurgavazza May 3, 2023
76e5f20
fix: use boxo traverse
arthurgavazza May 3, 2023
b48dcb0
refactor: calculate summary with only one method
arthurgavazza May 3, 2023
9bbef7f
refactor: make simplifications
arthurgavazza May 3, 2023
620f73c
fix: add support to progressive information
arthurgavazza May 12, 2023
9146d95
fix: fix ci
arthurgavazza May 12, 2023
ee82e97
test: add a winner test using a fixture
arthurgavazza May 16, 2023
6641fb3
Merge branch 'master' into feat/add-deduplication-ratio-stat
arthurgavazza May 16, 2023
8d79bcb
Update test/cli/dag_test.go
arthurgavazza May 16, 2023
f1840ca
fix: deduplication calculations and tests
arthurgavazza May 21, 2023
864e91b
fix: lint
arthurgavazza May 21, 2023
b689c10
fix: CI
arthurgavazza May 21, 2023
220856f
Merge branch 'master' into feat/add-deduplication-ratio-stat
arthurgavazza May 21, 2023
f2051db
test: add text output tests
arthurgavazza Jun 4, 2023
0df029e
Merge branch 'master' into feat/add-deduplication-ratio-stat
arthurgavazza Jun 4, 2023
08f837c
Update core/commands/dag/dag.go
Jorropo Jun 5, 2023
538f5bb
Update test/cli/dag_test.go
Jorropo Jun 5, 2023
336a98c
FIXUP
Jorropo Jun 5, 2023
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
109 changes: 103 additions & 6 deletions core/commands/dag/dag.go
Original file line number Diff line number Diff line change
@@ -1,6 +1,8 @@
package dagcmd

import (
"encoding/csv"
"encoding/json"
"fmt"
"io"

Expand Down Expand Up @@ -275,13 +277,83 @@ CAR file follows the CARv1 format: https://ipld.io/specs/transport/car/carv1/
}

// DagStat is a dag stat command response

Jorropo marked this conversation as resolved.
Show resolved Hide resolved
type DagStat struct {
Size uint64
NumBlocks int64
Cid cid.Cid `json:",omitempty"`
Size uint64 `json:",omitempty"`
NumBlocks int64 `json:",omitempty"`
}

func (s *DagStat) String() string {
return fmt.Sprintf("Size: %d, NumBlocks: %d", s.Size, s.NumBlocks)
return fmt.Sprintf("%s %d %d", s.Cid.String()[:20], s.Size, s.NumBlocks)
}

func (s *DagStat) MarshalJSON() ([]byte, error) {
type Alias DagStat
/*
We can't rely on cid.Cid.MarshalJSON since it uses the {"/": "..."}
format. To make the output consistent and follow the Kubo API patterns
we use the Cid.String method
*/
return json.Marshal(struct {
Cid string `json:"Cid"`
*Alias
}{
Cid: s.Cid.String(),
Alias: (*Alias)(s),
})
}

func (s *DagStat) UnmarshalJSON(data []byte) error {
/*
We can't rely on cid.Cid.UnmarshalJSON since it uses the {"/": "..."}
format. To make the output consistent and follow the Kubo API patterns
we use the Cid.Parse method
*/
type Alias DagStat
aux := struct {
Cid string `json:"Cid"`
*Alias
}{
Alias: (*Alias)(s),
}
if err := json.Unmarshal(data, &aux); err != nil {
return err
}
Cid, err := cid.Parse(aux.Cid)
if err != nil {
return err
}
s.Cid = Cid
return nil
}

type DagStatSummary struct {
redundantSize uint64 `json:"-"`
UniqueBlocks int `json:",omitempty"`
TotalSize uint64 `json:",omitempty"`
SharedSize uint64 `json:",omitempty"`
Ratio float32 `json:",omitempty"`
DagStatsArray []*DagStat `json:"DagStats,omitempty"`
}

func (s *DagStatSummary) String() string {
return fmt.Sprintf("Total Size: %d\nUnique Blocks: %d\nShared Size: %d\nRatio: %f", s.TotalSize, s.UniqueBlocks, s.SharedSize, s.Ratio)
}

func (s *DagStatSummary) incrementTotalSize(size uint64) {
s.TotalSize += size
}
func (s *DagStatSummary) incrementRedundantSize(size uint64) {
s.redundantSize += size
}
func (s *DagStatSummary) appendStats(stats *DagStat) {
s.DagStatsArray = append(s.DagStatsArray, stats)
}

func (s *DagStatSummary) calculateSummary() {
s.Ratio = float32(s.redundantSize) / float32(s.TotalSize)
s.SharedSize = s.redundantSize - s.TotalSize
}

// DagStatCmd is a command for getting size information about an ipfs-stored dag
Expand All @@ -296,24 +368,49 @@ Note: This command skips duplicate blocks in reporting both size and the number
`,
},
Arguments: []cmds.Argument{
cmds.StringArg("root", true, false, "CID of a DAG root to get statistics for").EnableStdin(),
cmds.StringArg("root", true, true, "CID of a DAG root to get statistics for").EnableStdin(),
},
Options: []cmds.Option{
cmds.BoolOption(progressOptionName, "p", "Return progressive data while reading through the DAG").WithDefault(true),
},
Run: dagStat,
Type: DagStat{},
Type: DagStatSummary{},
PostRun: cmds.PostRunMap{
cmds.CLI: finishCLIStat,
},
Encoders: cmds.EncoderMap{
cmds.Text: cmds.MakeTypedEncoder(func(req *cmds.Request, w io.Writer, event *DagStat) error {
cmds.Text: cmds.MakeTypedEncoder(func(req *cmds.Request, w io.Writer, event *DagStatSummary) error {
fmt.Fprintln(w)
csvWriter := csv.NewWriter(w)
csvWriter.Comma = '\t'
header := []string{fmt.Sprintf("%-*s", 46, "CID"), fmt.Sprintf("%-15s", "Blocks"), "Size"}
if err := csvWriter.Write(header); err != nil {
return err
}
for _, dagStat := range event.DagStatsArray {
numBlocksStr := fmt.Sprint(dagStat.NumBlocks)
err := csvWriter.Write([]string{
dagStat.Cid.String(),
fmt.Sprintf("%-15s", numBlocksStr),
fmt.Sprint(dagStat.Size),
})
if err != nil {
return err
}
}
csvWriter.Flush()
fmt.Fprint(w, "\nSummary\n")
_, err := fmt.Fprintf(
w,
"%v\n",
event,
)
fmt.Fprint(w, "\n\n")
return err
}),
cmds.JSON: cmds.MakeTypedEncoder(func(req *cmds.Request, w io.Writer, event *DagStatSummary) error {
return json.NewEncoder(w).Encode(event)
},
),
},
}
111 changes: 64 additions & 47 deletions core/commands/dag/stat.go
Original file line number Diff line number Diff line change
Expand Up @@ -6,70 +6,80 @@ import (
"os"

"github.com/ipfs/boxo/coreiface/path"
mdag "github.com/ipfs/boxo/ipld/merkledag"
"github.com/ipfs/boxo/ipld/merkledag/traverse"
cid "github.com/ipfs/go-cid"
cmds "github.com/ipfs/go-ipfs-cmds"
"github.com/ipfs/kubo/core/commands/cmdenv"
"github.com/ipfs/kubo/core/commands/e"

mdag "github.com/ipfs/boxo/ipld/merkledag"
cmds "github.com/ipfs/go-ipfs-cmds"
)

// TODO cache every cid traversal in a dp cache
// if the cid exists in the cache, don't traverse it, and use the cached result
// to compute the new state

func dagStat(req *cmds.Request, res cmds.ResponseEmitter, env cmds.Environment) error {
progressive := req.Options[progressOptionName].(bool)

api, err := cmdenv.GetApi(env, req)
if err != nil {
return err
}

rp, err := api.ResolvePath(req.Context, path.New(req.Arguments[0]))
if err != nil {
return err
}

if len(rp.Remainder()) > 0 {
return fmt.Errorf("cannot return size for anything other than a DAG with a root CID")
}

nodeGetter := mdag.NewSession(req.Context, api.Dag())
obj, err := nodeGetter.Get(req.Context, rp.Cid())
if err != nil {
return err
}

dagstats := &DagStat{}
err = traverse.Traverse(obj, traverse.Options{
DAG: nodeGetter,
Order: traverse.DFSPre,
Func: func(current traverse.State) error {
dagstats.Size += uint64(len(current.Node.RawData()))
dagstats.NumBlocks++

if progressive {
if err := res.Emit(dagstats); err != nil {
return err
}
}
return nil
},
ErrFunc: nil,
SkipDuplicates: true,
})
if err != nil {
return fmt.Errorf("error traversing DAG: %w", err)
}
cidSet := cid.NewSet()
dagStatSummary := &DagStatSummary{DagStatsArray: []*DagStat{}}
for _, a := range req.Arguments {
rp, err := api.ResolvePath(req.Context, path.New(a))
if err != nil {
return err
}
if len(rp.Remainder()) > 0 {
return fmt.Errorf("cannot return size for anything other than a DAG with a root CID")
}

if !progressive {
if err := res.Emit(dagstats); err != nil {
arthurgavazza marked this conversation as resolved.
Show resolved Hide resolved
obj, err := nodeGetter.Get(req.Context, rp.Cid())
if err != nil {
return err
}
dagstats := &DagStat{Cid: rp.Cid()}
dagStatSummary.appendStats(dagstats)
err = traverse.Traverse(obj, traverse.Options{
DAG: nodeGetter,
Order: traverse.DFSPre,
Func: func(current traverse.State) error {
dagstats.Size += uint64(len(current.Node.RawData()))
dagstats.NumBlocks++
if !cidSet.Has(current.Node.Cid()) {
dagStatSummary.incrementTotalSize(dagstats.Size)
}
dagStatSummary.incrementRedundantSize(dagstats.Size)
cidSet.Add(current.Node.Cid())
if progressive {
if err := res.Emit(dagStatSummary); err != nil {
return err
}
}
return nil
},
ErrFunc: nil,
SkipDuplicates: true,
})
if err != nil {
return fmt.Errorf("error traversing DAG: %w", err)
}
}

dagStatSummary.UniqueBlocks = cidSet.Len()
dagStatSummary.calculateSummary()

if err := res.Emit(dagStatSummary); err != nil {
return err
}
return nil
}

func finishCLIStat(res cmds.Response, re cmds.ResponseEmitter) error {
var dagStats *DagStat
var dagStats *DagStatSummary
for {
v, err := res.Next()
if err != nil {
Expand All @@ -78,13 +88,20 @@ func finishCLIStat(res cmds.Response, re cmds.ResponseEmitter) error {
}
return err
}

out, ok := v.(*DagStat)
if !ok {
switch out := v.(type) {
case *DagStatSummary:
dagStats = out
if dagStats.Ratio == 0 {
length := len(dagStats.DagStatsArray)
if length > 0 {
currentStat := dagStats.DagStatsArray[length-1]
fmt.Fprintf(os.Stderr, "CID: %s, Size: %d, NumBlocks: %d\n", currentStat.Cid, currentStat.Size, currentStat.NumBlocks)
}
}
default:
return e.TypeErr(out, v)

}
dagStats = out
fmt.Fprintf(os.Stderr, "%v\r", out)
}
return re.Emit(dagStats)
}
2 changes: 2 additions & 0 deletions go.mod
Original file line number Diff line number Diff line change
Expand Up @@ -93,6 +93,8 @@ require (
golang.org/x/sys v0.6.0
)

require github.com/ipfs/go-merkledag v0.10.0

require (
github.com/AndreasBriese/bbloom v0.0.0-20190825152654-46b345b51c96 // indirect
github.com/Kubuxu/go-os-helper v0.0.1 // indirect
Expand Down
1 change: 1 addition & 0 deletions go.sum
Original file line number Diff line number Diff line change
Expand Up @@ -444,6 +444,7 @@ github.com/ipfs/go-log/v2 v2.3.0/go.mod h1:QqGoj30OTpnKaG/LKTGTxoP2mmQtjVMEnK72g
github.com/ipfs/go-log/v2 v2.5.1 h1:1XdUzF7048prq4aBjDQQ4SL5RxftpRGdXhNRwKSAlcY=
github.com/ipfs/go-log/v2 v2.5.1/go.mod h1:prSpmC1Gpllc9UYWxDiZDreBYw7zp4Iqp1kOLU9U5UI=
github.com/ipfs/go-merkledag v0.10.0 h1:IUQhj/kzTZfam4e+LnaEpoiZ9vZF6ldimVlby+6OXL4=
github.com/ipfs/go-merkledag v0.10.0/go.mod h1:zkVav8KiYlmbzUzNM6kENzkdP5+qR7+2mCwxkQ6GIj8=
github.com/ipfs/go-metrics-interface v0.0.1 h1:j+cpbjYvu4R8zbleSs36gvB7jR+wsL2fGD6n0jO4kdg=
github.com/ipfs/go-metrics-interface v0.0.1/go.mod h1:6s6euYU4zowdslK0GKHmqaIZ3j/b/tL7HTWtJ4VPgWY=
github.com/ipfs/go-metrics-prometheus v0.0.2 h1:9i2iljLg12S78OhC6UAiXi176xvQGiZaGVF1CUVdE+s=
Expand Down
39 changes: 39 additions & 0 deletions test/cli/dag_test.go
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Jorropo I finished some of the fixes and I'm gonna start coding the tests. I think that for testing this functionality I'm gonna need a few test files, so that I can work with objects that have multiple blocks. Should I add a test files folder in test/cli ?

Copy link
Contributor

@Jorropo Jorropo May 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now we keep tests in the test/cli folder, idk if we would want to change it, seems a bit overkill, you can put helpers elsewhere.
I'm surprised you would need more than one file, A golden test is good enough

  1. manually create a dag that expose the deduping behaviour we want and store it as a .car file in the repo
  2. in the test ipfs dag import it.
  3. Compare the result of ipfs dag stat ... vs a set of known correct values from the car file above (called a golden file).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the tips!

Jorropo marked this conversation as resolved.
Show resolved Hide resolved
Original file line number Diff line number Diff line change
@@ -0,0 +1,39 @@
package cli

import (
"os"
"testing"

"github.com/ipfs/kubo/test/cli/harness"
"github.com/stretchr/testify/assert"
)

const (
fixtureFile = "./fixtures/TestDagStat.car"
node1Cid = "bafyreibmdfd7c5db4kls4ty57zljfhqv36gi43l6txl44pi423wwmeskwy"
node2Cid = "bafyreie3njilzdi4ixumru4nzgecsnjtu7fzfcwhg7e6s4s5i7cnbslvn4"
fixtureCid = "bafyreifrm6uf5o4dsaacuszf35zhibyojlqclabzrms7iak67pf62jygaq"
)

// The Fixture file represents a dag where 2 nodes of size = 46B each, have a common child of 53B
// when traversing the DAG from the root's children (node1 and node2) we count (46 + 53)x2 bytes (counting redundant bytes) = 198
// since both nodes share a common child of 53 bytes we actually had to read (46)x2 + 53 = 145 bytes
// we should get a dedup ratio of 198/145 that results in approximatelly 1.3655173

func TestDag(t *testing.T) {
t.Parallel()
t.Run("ipfs dag stat", func(t *testing.T) {
t.Parallel()
node := harness.NewT(t).NewNode().Init().StartDaemon()
// Import fixture
r, err := os.Open(fixtureFile)
assert.Nil(t, err)
defer r.Close()
err = node.IPFSDagImport(r, fixtureCid)
assert.Nil(t, err)
arthurgavazza marked this conversation as resolved.
Show resolved Hide resolved
stat := node.RunIPFS("dag", "stat", "--progress=false", node1Cid, node2Cid)
str := stat.Stdout.String()
expected := "\nCID \tBlocks \tSize\nbafyreibmdfd7c5db4kls4ty57zljfhqv36gi43l6txl44pi423wwmeskwy\t2 \t53\nbafyreie3njilzdi4ixumru4nzgecsnjtu7fzfcwhg7e6s4s5i7cnbslvn4\t2 \t53\n\nSummary\nTotal Size: 145\nUnique Blocks: 3\nShared Size: 53\nRatio: 1.365517\n\n\n"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use the ` syntax, this is very hard to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry for that, I'm gonna make this more readable

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Jorropo I'm having a hard time comparing the text output with a string literal in this test, I added a test using the json encoded output so that I can validate the deduplication behaviour. Is there any other way that I could test the text output?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arthurgavazza json is good.

For the text one: the point of golden tests is that the code of the test is very dumb.
You create a dataset ahead of time that show all the behaviour you want (here it would be a .car fixture) and then you == or diff the output with some hardcoded values you create at the same time you make your fixture.

assert.Equal(t, expected, str)
})
}
Binary file added test/cli/fixtures/TestDagStat.car
Binary file not shown.