merge_records
[merge_records] merges records in the stream based on two specified keys with values that are used as identifiers. Merging is done by splitting the stream and saving all records with identifier A to one file and all records with identifier B to another file. These files are then sorted based on the A and B values and merged according to the chosen merging scheme, of which there are five:
- AandB - only emit merged records (Default).
- AorB - emit A records or merged records (i.e. all A records and merged records).
- BorA - emit B records or merged records (i.e. all B records and merged records).
- AnotB - emit A records that could not be merged with B.
- BnotA - emit B records that could not be merged with A.
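The split, sort and merge procedure and the five schemes can be modelled in a few lines of Python. This is only a sketch of the semantics described above, not the Biopieces implementation: the function name `merge_sketch` and the sample records are hypothetical, identifiers are compared as plain strings, and two details are assumptions, namely that the B identifier is dropped from a merged record and that B values win if a key exists on both sides.

```python
def merge_sketch(records_a, records_b, key_a="A", key_b="B", scheme="AandB"):
    """Sort-merge two lists of records (dicts) under one of the five schemes."""
    if scheme not in ("AandB", "AorB", "BorA", "AnotB", "BnotA"):
        raise ValueError(f"unknown merge scheme: {scheme}")

    # Sort each side on its identifier (string comparison in this sketch).
    a_sorted = sorted(records_a, key=lambda r: r[key_a])
    b_sorted = sorted(records_b, key=lambda r: r[key_b])

    out = []
    i = j = 0
    while i < len(a_sorted) and j < len(b_sorted):
        a, b = a_sorted[i], b_sorted[j]
        if a[key_a] == b[key_b]:
            # Identifiers match: every scheme except AnotB/BnotA emits a merged record.
            if scheme in ("AandB", "AorB", "BorA"):
                merged = dict(a)
                # Assumption: the B identifier is dropped and B values overwrite A values.
                merged.update({k: v for k, v in b.items() if k != key_b})
                out.append(merged)
            i += 1
            j += 1
        elif a[key_a] < b[key_b]:
            # This A record has no partner in B.
            if scheme in ("AorB", "AnotB"):
                out.append(dict(a))
            i += 1
        else:
            # This B record has no partner in A.
            if scheme in ("BorA", "BnotA"):
                out.append(dict(b))
            j += 1

    # Whatever remains on either side is unmatched.
    if scheme in ("AorB", "AnotB"):
        out.extend(dict(a) for a in a_sorted[i:])
    if scheme in ("BorA", "BnotA"):
        out.extend(dict(b) for b in b_sorted[j:])
    return out


# Hypothetical sample records: only identifier "y" occurs on both sides.
left = [{"A": "y", "V1": "left-y"}, {"A": "x", "V1": "left-x"}]
right = [{"B": "z", "V0": "right-z"}, {"B": "y", "V0": "right-y"}]

for name in ("AandB", "AorB", "BorA", "AnotB", "BnotA"):
    print(name, merge_sketch(left, right, scheme=name))
```

Because both sides are sorted first, a single pass with two cursors is enough, which is also why duplicate identifiers make the outcome hard to predict.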
It is important that there are no duplicate identifier values - in that case the behaviour is not guaranteed and your computer will probably explode.
It is important that there are no common keys in the records that are to be merged, because the values will be overwritten.
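Both conditions can be checked before merging. The helper below is a hypothetical pre-flight check written in Python, not part of Biopieces; it assumes the records have already been separated into an A list and a B list and only reports what it finds.

```python
from collections import Counter


def check_mergeable(records_a, records_b, key_a="A", key_b="B"):
    """Return a list of problems that would make a merge unreliable."""
    problems = []

    # 1. Duplicate identifier values on either side.
    for name, recs, key in (("A", records_a, key_a), ("B", records_b, key_b)):
        counts = Counter(r[key] for r in recs)
        dups = sorted(value for value, n in counts.items() if n > 1)
        if dups:
            problems.append(f"duplicate {name} identifiers: {dups}")

    # 2. Non-identifier keys that occur on both sides and would be overwritten.
    keys_a = set().union(*(r.keys() for r in records_a)) - {key_a}
    keys_b = set().union(*(r.keys() for r in records_b)) - {key_b}
    shared = sorted(keys_a & keys_b)
    if shared:
        problems.append(f"keys present on both sides: {shared}")

    return problems


# Hypothetical records that violate both conditions.
bad_a = [{"A": "1", "NAME": "x"}, {"A": "1", "NAME": "y"}]
bad_b = [{"B": "1", "NAME": "z"}]
for problem in check_mergeable(bad_a, bad_b):
    print(problem)
```

An empty list means neither problem was found in the given records.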
... | merge_records [options]
[-? | --help] # Print full usage description.
[-k <list> | --keys=<list>] # Keys (A and B) which values are used for merging. Append n for numeric values.
[-m <string> | --merge=<string>] # Merge AandB, AorB, BorA, AnotB, or BnotA - Default=AandB
[-I <file!> | --stream_in=<file!>] # Read input from stream file - Default=STDIN
[-O <file> | --stream_out=<file>] # Write output to stream file - Default=STDOUT
[-v | --verbose] # Verbose output.
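The note about numeric values on the -k option matters because the records are sorted on their identifiers before merging, and text and numbers sort differently. The Python sketch below only illustrates that general difference; `sort_ids` is a hypothetical helper, and how merge_records itself compares values internally is not shown on this page.

```python
def sort_ids(ids, numeric=False):
    """Sort identifier strings lexically, or by numeric value when numeric=True."""
    return sorted(ids, key=float) if numeric else sorted(ids)


ids = ["10", "9", "2"]
print(sort_ids(ids))                # lexical order: ['10', '2', '9']
print(sort_ids(ids, numeric=True))  # numeric order: ['2', '9', '10']

# The same number written two ways is unequal as text but equal as a number.
print("2" == "2.0", float("2") == float("2.0"))  # False True
```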
Consider the following two tables in the following files:
cat test1.tab
2 test1:2
3 test1:3
4 test1:4
cat test2.tab
test2:1 1
test2:2 2
test2:3 3
We read in the first table from test1.tab using [read_tab] with the -k switch to name the first column A and the second column V1, remembering that it is important that there are no collisions between any column keys:
read_tab -i test1.tab -k A,V1
A: 2
V1: test1:2
---
A: 3
V1: test1:3
---
A: 4
V1: test1:4
---
The resulting stream shows a number of table records with a key A and a key V1.
Now we read in the next table with another round of [read_tab] using the -k switch to name the first column V0 and the second column B like this:
read_tab -i test1.tab -k A,V1 | read_tab -i test2.tab -k V0,B
A: 2
V1: test1:2
---
A: 3
V1: test1:3
---
A: 4
V1: test1:4
---
V0: test2:1
B: 1
---
V0: test2:2
B: 2
---
V0: test2:3
B: 3
---
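The stream now holds both kinds of records, and the first step of the merge is to separate them by identifier key. A minimal Python sketch of that split on the records shown above follows; `split_stream` is a hypothetical name, and sending a record that carries both keys to the A side is an assumption, since the page does not say how such records are treated.

```python
def split_stream(records, key_a="A", key_b="B"):
    """Separate a mixed stream into records holding key_a and records holding key_b."""
    with_a = [r for r in records if key_a in r]
    # Assumption: a record with both keys counts as an A record only.
    with_b = [r for r in records if key_b in r and key_a not in r]
    return with_a, with_b


# The six records from the stream above.
stream = [
    {"A": "2", "V1": "test1:2"},
    {"A": "3", "V1": "test1:3"},
    {"A": "4", "V1": "test1:4"},
    {"V0": "test2:1", "B": "1"},
    {"V0": "test2:2", "B": "2"},
    {"V0": "test2:3", "B": "3"},
]

a_records, b_records = split_stream(stream)
print(len(a_records), "A records,", len(b_records), "B records")
```

Records with neither key end up in neither list in this sketch.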
Now we can use [merge_records] to merge the records on key A and key B using the default merge scheme AandB that outputs only merged records:
read_tab -i test1.tab -k A,V1 | read_tab -i test2.tab -k V0,B | merge_records -k A,B
A: 2
V0: test2:2
V1: test1:2
---
A: 3
V0: test2:3
V1: test1:3
---
If we change the merging scheme from AandB to AorB using the -m switch then all A records and all merged records will be output:
read_tab -i test1.tab -k A,V1 | read_tab -i test2.tab -k V0,B | merge_records -k A,B -m AorB
A: 2
V0: test2:2
V1: test1:2
---
A: 3
V0: test2:3
V1: test1:3
---
A: 4
V1: test1:4
---
Similarly, if we change to BorA we get this:
read_tab -i test1.tab -k A,V1 | read_tab -i test2.tab -k V0,B | merge_records -k A,B -m BorA
V0: test2:1
B: 1
---
A: 2
V0: test2:2
V1: test1:2
---
A: 3
V0: test2:3
V1: test1:3
---
Finally, we can get all records that are in test1.tab but not in test2.tab by using AnotB with the -m switch:
read_tab -i test1.tab -k A,V1 | read_tab -i test2.tab -k V0,B | merge_records -k A,B -m AnotB
A: 4
V1: test1:4
---
Or using BnotA:
read_tab -i test1.tab -k A,V1 | read_tab -i test2.tab -k V0,B | merge_records -k A,B -m BnotA
V0: test2:1
B: 1
---
[read_tab]
[rename_keys]
[add_ident]
Martin Asser Hansen - Copyright (C) - All rights reserved.
July 2008
GNU General Public License version 2
http://www.gnu.org/copyleft/gpl.html
[merge_records] is part of the Biopieces framework.