I have the following test data:
a b
a c
b a
b c
b d
c a
c b
c d
d b
d c
and I want to remove each line "v u" when the line "u v" already exists, using a Unix command. For example, here I want to obtain:
a b
a c
b c
b d
c d
I tried an awk script, but on a long file it takes too much time:
{
    if (NR == 1) {
        n1 = $1
        n2 = $2
        test = 0
        k = 0
        i = 0
        column1[i] = $1
        column2[i] = $2
        printf "%s %s\n", column1[i], column2[i]
    }
    else {
        for (k = 0; k <= i; k++) {
            if (column1[k] == $2) {
                test = 1
                tmp = i
                break
            }
        }
        if (test == 1) {
            if (column2[tmp] == $1) {
                n1 = $1
                n2 = $2
            }
        }
        else if (n1 != $1 || n2 != $2) {
            n1 = $1
            n2 = $2
            i++
            column1[i] = $1
            column2[i] = $2
            printf "%s %s\n", column1[i], column2[i]
        }
        test = 0
    }
}
Does anyone have an idea?
I think this can be achieved pretty simply:
awk '!seen[$1,$2]++ && !seen[$2,$1]' file
This prints a line (the default action) only when its first and second columns have not yet been seen in either order.
The array seen keeps track of every pair of fields by setting a key containing the first and second field. The expression !seen[key]++ is only true the first time that a specific key is tested because the value in the array is incremented each time.
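To make the logic easier to follow, here is a rough sketch of the same idea written out long-hand and run on the sample data from the question. It is a variant, not the exact one-liner: it tests membership with "in" instead of incrementing a counter, and the variable names input and result are only there for illustration.

```shell
# Sample data from the question: each pair may appear in both orders.
input='a b
a c
b a
b c
b d
c a
c b
c d
d b
d c'

result=$(printf '%s\n' "$input" | awk '
{
    # Print the line only if neither "u v" nor "v u" has been recorded yet.
    # The "in" test checks for a key without creating it.
    if (!(($1, $2) in seen) && !(($2, $1) in seen)) print

    # Record the pair in the order it appeared on this line.
    seen[$1, $2]
}')

printf '%s\n' "$result"
```

Each line costs a couple of array lookups, so the whole file is handled in a single pass instead of rescanning everything stored so far. One difference worth knowing: as far as I can tell, the one-liner never prints a line whose two fields are identical (such as "a a"), because the increment makes its own second test fail, whereas this variant prints the first occurrence of such a line.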