Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add deduplication ratio to 'ipfs dag stat' #9787

Merged
merged 27 commits into from
Jun 6, 2023
Merged
Show file tree
Hide file tree
Changes from 6 commits
Commits
Show all changes
27 commits
Select commit Hold shift + click to select a range
f1604da
checkpoint1
arthurgavazza Mar 31, 2023
ac3e6b0
feat: achieve correct minimal output
arthurgavazza Mar 31, 2023
ce81fd0
feat: start adding json encoding
arthurgavazza Mar 31, 2023
6dffb58
fix: fmt
arthurgavazza Mar 31, 2023
946fdd6
feat: add custom json encoder and format text encoder
arthurgavazza Apr 1, 2023
63d271b
fix: remove test json files
arthurgavazza Apr 1, 2023
fbc1c9c
Update core/commands/dag/dag.go
arthurgavazza May 3, 2023
eac10df
fix: check error on cdv write
arthurgavazza May 3, 2023
27547d3
fix: tsv title format
arthurgavazza May 3, 2023
7f38e7e
fix: print line with Fprintln
arthurgavazza May 3, 2023
76e5f20
fix: use boxo traverse
arthurgavazza May 3, 2023
b48dcb0
refactor: calculate summary with only one method
arthurgavazza May 3, 2023
9bbef7f
refactor: make simplifications
arthurgavazza May 3, 2023
620f73c
fix: add support to progressive information
arthurgavazza May 12, 2023
9146d95
fix: fix ci
arthurgavazza May 12, 2023
ee82e97
test: add a winner test using a fixture
arthurgavazza May 16, 2023
6641fb3
Merge branch 'master' into feat/add-deduplication-ratio-stat
arthurgavazza May 16, 2023
8d79bcb
Update test/cli/dag_test.go
arthurgavazza May 16, 2023
f1840ca
fix: deduplication calculations and tests
arthurgavazza May 21, 2023
864e91b
fix: lint
arthurgavazza May 21, 2023
b689c10
fix: CI
arthurgavazza May 21, 2023
220856f
Merge branch 'master' into feat/add-deduplication-ratio-stat
arthurgavazza May 21, 2023
f2051db
test: add text output tests
arthurgavazza Jun 4, 2023
0df029e
Merge branch 'master' into feat/add-deduplication-ratio-stat
arthurgavazza Jun 4, 2023
08f837c
Update core/commands/dag/dag.go
Jorropo Jun 5, 2023
538f5bb
Update test/cli/dag_test.go
Jorropo Jun 5, 2023
336a98c
FIXUP
Jorropo Jun 5, 2023
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
99 changes: 93 additions & 6 deletions core/commands/dag/dag.go
Original file line number Diff line number Diff line change
@@ -1,6 +1,8 @@
package dagcmd

import (
"encoding/csv"
"encoding/json"
"fmt"
"io"

Expand Down Expand Up @@ -275,13 +277,75 @@ CAR file follows the CARv1 format: https://ipld.io/specs/transport/car/carv1/
}

// DagStat is a dag stat command response

Jorropo marked this conversation as resolved.
Show resolved Hide resolved
type DagStat struct {
Size uint64
NumBlocks int64
Cid cid.Cid `json:",omitempty"`
Size uint64 `json:",omitempty"`
NumBlocks int64 `json:",omitempty"`
}

func (s *DagStat) String() string {
return fmt.Sprintf("Size: %d, NumBlocks: %d", s.Size, s.NumBlocks)
return fmt.Sprintf("%s %d %d", s.Cid.String()[:20], s.Size, s.NumBlocks)
}

func (s *DagStat) MarshalJSON() ([]byte, error) {
type Alias DagStat
return json.Marshal(&struct {
arthurgavazza marked this conversation as resolved.
Show resolved Hide resolved
Cid string `json:"Cid"`
*Alias
}{
Cid: s.Cid.String(),
Alias: (*Alias)(s),
})
}

func (s *DagStat) UnmarshalJSON(data []byte) error {
type Alias DagStat
aux := &struct {
arthurgavazza marked this conversation as resolved.
Show resolved Hide resolved
Cid string `json:"Cid"`
*Alias
}{
Alias: (*Alias)(s),
}
if err := json.Unmarshal(data, &aux); err != nil {
return err
}
Cid, err := cid.Parse(aux.Cid)
if err != nil {
return err
}
s.Cid = Cid
return nil
}
arthurgavazza marked this conversation as resolved.
Show resolved Hide resolved

type DagStatSummary struct {
redundantSize uint64 `json:"-"`
UniqueBlocks int `json:",omitempty"`
TotalSize uint64 `json:",omitempty"`
SharedSize uint64 `json:",omitempty"`
Ratio float32 `json:",omitempty"`
DagStatsArray []*DagStat `json:"DagStats,omitempty"`
}

func (s *DagStatSummary) String() string {
return fmt.Sprintf("Total Size: %d\nUnique Blocks: %d\nShared Size: %d\nRatio: %f", s.TotalSize, s.UniqueBlocks, s.SharedSize, s.Ratio)
}

func (s *DagStatSummary) incrementTotalSize(size uint64) {
s.TotalSize += size
}
func (s *DagStatSummary) incrementRedundantSize(size uint64) {
s.redundantSize += size
}

func (s *DagStatSummary) calculateRatio() {
if s.TotalSize > 0 {
s.Ratio = float32(s.redundantSize) / float32(s.TotalSize)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would remove the if check and let s.Ratio being NaN and implement custom marshal which omits Ratio if it is NaN. But that fine, probably.

}

func (s *DagStatSummary) calculateSharedSize() {
arthurgavazza marked this conversation as resolved.
Show resolved Hide resolved
s.SharedSize = s.redundantSize - s.TotalSize
}

// DagStatCmd is a command for getting size information about an ipfs-stored dag
Expand All @@ -296,24 +360,47 @@ Note: This command skips duplicate blocks in reporting both size and the number
`,
},
Arguments: []cmds.Argument{
cmds.StringArg("root", true, false, "CID of a DAG root to get statistics for").EnableStdin(),
cmds.StringArg("root", true, true, "CID of a DAG root to get statistics for").EnableStdin(),
},
Options: []cmds.Option{
cmds.BoolOption(progressOptionName, "p", "Return progressive data while reading through the DAG").WithDefault(true),
},
Run: dagStat,
Type: DagStat{},
Type: DagStatSummary{},
PostRun: cmds.PostRunMap{
cmds.CLI: finishCLIStat,
},
Encoders: cmds.EncoderMap{
cmds.Text: cmds.MakeTypedEncoder(func(req *cmds.Request, w io.Writer, event *DagStat) error {
cmds.Text: cmds.MakeTypedEncoder(func(req *cmds.Request, w io.Writer, event *DagStatSummary) error {
fmt.Println()
arthurgavazza marked this conversation as resolved.
Show resolved Hide resolved
csvWriter := csv.NewWriter(w)
csvWriter.Comma = '\t'
header := []string{fmt.Sprintf("%-*s", 46, "CID"), fmt.Sprintf("%-15s", "Blocks"), "Size"}
csvWriter.Write(header)
arthurgavazza marked this conversation as resolved.
Show resolved Hide resolved
for _, dagStat := range event.DagStatsArray {
numBlocksStr := fmt.Sprint(dagStat.NumBlocks)
err := csvWriter.Write([]string{
dagStat.Cid.String(),
fmt.Sprintf("%-15s", numBlocksStr),
fmt.Sprint(dagStat.Size),
})
if err != nil {
return err
}
}
csvWriter.Flush()
fmt.Print("\nSummary\n\n")
arthurgavazza marked this conversation as resolved.
Show resolved Hide resolved
_, err := fmt.Fprintf(
w,
"%v\n",
event,
)
fmt.Print("\n\n")
arthurgavazza marked this conversation as resolved.
Show resolved Hide resolved
return err
}),
cmds.JSON: cmds.MakeTypedEncoder(func(req *cmds.Request, w io.Writer, event *DagStatSummary) error {
return json.NewEncoder(w).Encode(event)
},
),
},
}
99 changes: 49 additions & 50 deletions core/commands/dag/stat.go
Original file line number Diff line number Diff line change
Expand Up @@ -3,73 +3,73 @@ package dagcmd
import (
"fmt"
"io"
"os"

"github.com/ipfs/boxo/coreiface/path"
"github.com/ipfs/boxo/ipld/merkledag/traverse"
"github.com/ipfs/kubo/core/commands/cmdenv"
"github.com/ipfs/kubo/core/commands/e"

mdag "github.com/ipfs/boxo/ipld/merkledag"
cid "github.com/ipfs/go-cid"
cmds "github.com/ipfs/go-ipfs-cmds"
"github.com/ipfs/go-merkledag/traverse"
arthurgavazza marked this conversation as resolved.
Show resolved Hide resolved
"github.com/ipfs/kubo/core/commands/cmdenv"
"github.com/ipfs/kubo/core/commands/e"
)

func dagStat(req *cmds.Request, res cmds.ResponseEmitter, env cmds.Environment) error {
progressive := req.Options[progressOptionName].(bool)

api, err := cmdenv.GetApi(env, req)
if err != nil {
return err
}

rp, err := api.ResolvePath(req.Context, path.New(req.Arguments[0]))
if err != nil {
return err
}

if len(rp.Remainder()) > 0 {
return fmt.Errorf("cannot return size for anything other than a DAG with a root CID")
}

nodeGetter := mdag.NewSession(req.Context, api.Dag())
obj, err := nodeGetter.Get(req.Context, rp.Cid())
if err != nil {
return err
}

dagstats := &DagStat{}
err = traverse.Traverse(obj, traverse.Options{
DAG: nodeGetter,
Order: traverse.DFSPre,
Func: func(current traverse.State) error {
dagstats.Size += uint64(len(current.Node.RawData()))
dagstats.NumBlocks++

if progressive {
if err := res.Emit(dagstats); err != nil {
return err
}
}
return nil
},
ErrFunc: nil,
SkipDuplicates: true,
})
if err != nil {
return fmt.Errorf("error traversing DAG: %w", err)
}
cidSet := cid.NewSet()
dagStatSummary := &DagStatSummary{DagStatsArray: []*DagStat{}}
for _, a := range req.Arguments {
rp, err := api.ResolvePath(req.Context, path.New(a))
if err != nil {
return err
}
if len(rp.Remainder()) > 0 {
return fmt.Errorf("cannot return size for anything other than a DAG with a root CID")
}

if !progressive {
if err := res.Emit(dagstats); err != nil {
arthurgavazza marked this conversation as resolved.
Show resolved Hide resolved
obj, err := nodeGetter.Get(req.Context, rp.Cid())
if err != nil {
return err
}
dagstats := &DagStat{Cid: rp.Cid()}
err = traverse.Traverse(obj, traverse.Options{
DAG: nodeGetter,
Order: traverse.DFSPre,
Func: func(current traverse.State) error {
dagstats.Size += uint64(len(current.Node.RawData()))
dagstats.NumBlocks++
if !cidSet.Has(current.Node.Cid()) {
dagStatSummary.incrementTotalSize(dagstats.Size)
}
dagStatSummary.incrementRedundantSize(dagstats.Size)
cidSet.Add(current.Node.Cid())
return nil
},
ErrFunc: nil,
SkipDuplicates: true,
})
if err != nil {
return fmt.Errorf("error traversing DAG: %w", err)
}
dagStatSummary.DagStatsArray = append(dagStatSummary.DagStatsArray, dagstats)
}

dagStatSummary.UniqueBlocks = cidSet.Len()
dagStatSummary.calculateSharedSize()
dagStatSummary.calculateRatio()
if err := res.Emit(dagStatSummary); err != nil {
return err
}
return nil
}

func finishCLIStat(res cmds.Response, re cmds.ResponseEmitter) error {
var dagStats *DagStat

var dagStats *DagStatSummary
for {
v, err := res.Next()
if err != nil {
Expand All @@ -78,13 +78,12 @@ func finishCLIStat(res cmds.Response, re cmds.ResponseEmitter) error {
}
return err
}

out, ok := v.(*DagStat)
if !ok {
switch out := v.(type) {
case *DagStatSummary:
dagStats = out
default:
return e.TypeErr(out, v)
}
dagStats = out
fmt.Fprintf(os.Stderr, "%v\r", out)
}
return re.Emit(dagStats)
}
2 changes: 2 additions & 0 deletions go.mod
Original file line number Diff line number Diff line change
Expand Up @@ -93,6 +93,8 @@ require (
golang.org/x/sys v0.6.0
)

require github.com/ipfs/go-merkledag v0.10.0

require (
github.com/AndreasBriese/bbloom v0.0.0-20190825152654-46b345b51c96 // indirect
github.com/Kubuxu/go-os-helper v0.0.1 // indirect
Expand Down
1 change: 1 addition & 0 deletions go.sum
Original file line number Diff line number Diff line change
Expand Up @@ -444,6 +444,7 @@ github.com/ipfs/go-log/v2 v2.3.0/go.mod h1:QqGoj30OTpnKaG/LKTGTxoP2mmQtjVMEnK72g
github.com/ipfs/go-log/v2 v2.5.1 h1:1XdUzF7048prq4aBjDQQ4SL5RxftpRGdXhNRwKSAlcY=
github.com/ipfs/go-log/v2 v2.5.1/go.mod h1:prSpmC1Gpllc9UYWxDiZDreBYw7zp4Iqp1kOLU9U5UI=
github.com/ipfs/go-merkledag v0.10.0 h1:IUQhj/kzTZfam4e+LnaEpoiZ9vZF6ldimVlby+6OXL4=
github.com/ipfs/go-merkledag v0.10.0/go.mod h1:zkVav8KiYlmbzUzNM6kENzkdP5+qR7+2mCwxkQ6GIj8=
github.com/ipfs/go-metrics-interface v0.0.1 h1:j+cpbjYvu4R8zbleSs36gvB7jR+wsL2fGD6n0jO4kdg=
github.com/ipfs/go-metrics-interface v0.0.1/go.mod h1:6s6euYU4zowdslK0GKHmqaIZ3j/b/tL7HTWtJ4VPgWY=
github.com/ipfs/go-metrics-prometheus v0.0.2 h1:9i2iljLg12S78OhC6UAiXi176xvQGiZaGVF1CUVdE+s=
Expand Down
28 changes: 28 additions & 0 deletions test/cli/dag_test.go
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Jorropo I finished some of the fixes and I'm gonna start coding the tests. I think that for testing this functionality I'm gonna need a few test files, so that I can work with objects that have multiple blocks. Should I add a test files folder in test/cli ?

Copy link
Contributor

@Jorropo Jorropo May 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now we keep tests in the test/cli folder, idk if we would want to change it, seems a bit overkill, you can put helpers elsewhere.
I'm surprised you would need more than one file, A golden test is good enough

  1. manually create a dag that expose the deduping behaviour we want and store it as a .car file in the repo
  2. in the test ipfs dag import it.
  3. Compare the result of ipfs dag stat ... vs a set of known correct values from the car file above (called a golden file).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the tips!

Jorropo marked this conversation as resolved.
Show resolved Hide resolved
Original file line number Diff line number Diff line change
@@ -0,0 +1,28 @@
package cli

import (
"testing"

"github.com/ipfs/kubo/test/cli/harness"
"github.com/stretchr/testify/assert"
)

func TestDag(t *testing.T) {
t.Parallel()
t.Run("ipfs add, adds file", func(t *testing.T) {
t.Parallel()
node := harness.NewT(t).NewNode().Init().StartDaemon()

output := node.IPFSAddStr("hello world")
output2 := node.IPFSAddStr("hello world 2")
output3 := node.IPFSAddStr("hello world 3")
assert.NotEqual(t, "", output)

stat := node.RunIPFS("dag", "stat", output, output2, output3)
str := stat.Stdout.String()
err := stat.Stderr.String()
assert.NotEqual(t, "", str)
assert.Nil(t, err)

})
}