duplihere

What

Copy & Paste finder for source files or any structured UTF-8 text files.

Why

A number of different copy and paste detectors exist. Some examples include:

So why write another? I wanted a simple tool, one that works like simian but is open source and free for everyone. Thus this project was born. In general, I don't think writing a lexer and tokenizing the source is needed: a great deal of code is copied and pasted verbatim. Developers are lazy; they don't change things :-)

How

duplihere - 0.9.0 - find duplicate text

usage: duplihere [-pj -l <number> -i <file name> -t <thread number>] -f <pattern or specific file>

Find duplicate lines of text in one or more text files.

The duplicated text can be at different levels of indention,
but otherwise needs to be identical.

More information: https://github.com/tasleson/duplihere

argument:                                        description
    -p, --print                                  print duplicate text [default: false]
    -j, --json                                   output JSON [default: false]
    -l, --lines <number>                         minimum number of duplicate lines [default: 6]
    -f, --file <pattern or specific file>        pattern or file eg. "**/*.[h|c]" recursive, "*.py", "file.ext", can repeat [required]
    -i, --ignore <file name>                     file containing hash values to ignore, one per line
    -t, --threads <thread number>                number of threads to utilize. Set to 0 to match #cpu cores [default: 4]
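The help text above describes the matching rule: runs of lines that are identical apart from their indentation. As a rough illustration of that idea only (not how duplihere itself is implemented), here is a minimal Python sketch that hashes every window of `min_lines` consecutive lines after stripping leading whitespace. The function name `find_duplicates`, the use of SHA-1, and the in-memory `files` dictionary are all assumptions made for this example.

```python
import hashlib
from collections import defaultdict


def find_duplicates(files, min_lines=6):
    """Return {window_hash: [(file_name, start_line), ...]} for repeated windows.

    `files` maps a file name to its text. This is a hypothetical helper for
    illustration; it is not part of duplihere.
    """
    seen = defaultdict(list)
    for name, text in files.items():
        # Drop leading whitespace so different indentation levels still match.
        lines = [line.lstrip() for line in text.splitlines()]
        for start in range(len(lines) - min_lines + 1):
            window = "\n".join(lines[start:start + min_lines])
            digest = hashlib.sha1(window.encode("utf-8")).hexdigest()
            # Record 1-based line numbers, as editors display them.
            seen[digest].append((name, start + 1))
    # Keep only windows that occur more than once.
    return {h: locs for h, locs in seen.items() if len(locs) > 1}
```

A naive sliding window like this reports a long duplicated region as several overlapping hits, so a real tool would presumably merge adjacent matches into one result.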

An example that recurses into one directory looking for Python files and also checks a second directory that contains Python files:

$ duplihere -l 10 -p -f '/home/user/somewhere/**/*.py' -f '/tmp/*.py'

An example showing JSON output (the format is not finalized):

$ duplihere -f /home/tasleson/projects/linux/init/main.c -l 5 -j

{
  "num_lines": 5,
  "num_ignored": 0,
  "duplicates": [
    {
      "key": 11558319874972720381,
      "num_lines": 5,
      "files": [
        [
          "/home/tasleson/projects/linux/init/main.c",
          830
        ],
        [
          "/home/tasleson/projects/linux/init/main.c",
          864
        ]
      ]
    }
  ]
}
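Since the JSON output is meant for other tools to consume, here is a small Python sketch that turns a report shaped like the one above into one line per duplicate location. It assumes the second element of each `files` entry is the starting line number, and that the field names stay as shown (the format is not finalized, so they may change). `summarize` is a hypothetical helper, not part of duplihere.

```python
import json


def summarize(report_text):
    """Return one 'key: N lines at path:line' string per duplicate location."""
    report = json.loads(report_text)
    summary = []
    for dup in report["duplicates"]:
        # Assumption: each entry in "files" is a [path, starting_line] pair.
        for path, line_no in dup["files"]:
            summary.append(
                f"{dup['key']}: {dup['num_lines']} lines at {path}:{line_no}"
            )
    return summary
```

Python integers are arbitrary precision, so a large `key` such as 11558319874972720381 survives parsing intact. A JSON parser that stores numbers as 64-bit floats would round it.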

Status

The tool has enough features and functionality to produce meaningful results. With the latest multi-threading support it is quite fast on big source trees. Graph of memory and CPU consumption while examining the Linux kernel source tree for duplicates: run against the Linux 6.5 branch (~24M lines) using all available CPU cores. Chart generated with psrecord.
