print lines that have similar columns with multiple delimiters


I have two files.

file1.txt:

dn_id101_400_CT_TC    string1
dn_id111_60_TT_AA    string2

file2.txt:

dn_id101_400_XX_XX    diffstring1
dn_id400_40_XY_YX    diffstring2
dn_id111_60_GG_CC    diffstring3

I want to print each line of file2.txt whose first three _-separated elements also appear as the first three elements of a line in file1.txt. Here is my desired output:

dn_id101_400_XX_XX    diffstring1
dn_id111_60_GG_CC    diffstring3

Is there a way to do this? Maybe by changing the delimiter in awk? I'm not sure how to handle multiple delimiters in an awk command. Here's an example of the kind of command I'd like to use:

awk -F"\t" 'FNR==NR {a[$1]; next}; $1 in a' file1.txt file2.txt

You can do this (f1 and f2 stand in for file1.txt and file2.txt):

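$ awk -F"\t" '
            # key = first column with the trailing _XX_XX removed
            {s=$1; sub(/_[[:upper:]]+_[[:upper:]]+$/, "", s)}
    # first file: remember each key
    FNR==NR { arr[s]++}
    # second file: print lines whose key was seen in the first file
    FNR<NR && (s in arr)' f1 f2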
dn_id101_400_XX_XX  diffstring1
dn_id111_60_GG_CC   diffstring3

That assumes that /_[[:upper:]]+_[[:upper:]]+$/ correctly describes the part you need to remove to make the data keys overlap between the two files.
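One way to check that assumption before running the full command is to print just the key that the sub() call leaves behind. This is only a quick sketch on the two sample lines from the question; the variable name keys is mine, and it assumes an awk that supports POSIX character classes such as [[:upper:]].

```shell
# Feed the sample lines from file1.txt (tab-separated) through the same sub()
# and print only the resulting key, one per line.
keys=$(printf 'dn_id101_400_CT_TC\tstring1\ndn_id111_60_TT_AA\tstring2\n' |
    awk -F"\t" '{ s = $1; sub(/_[[:upper:]]+_[[:upper:]]+$/, "", s); print s }')
printf '%s\n' "$keys"
```

If a key still carries its suffix, the pattern does not describe your data and needs adjusting.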

If you want to go left to right (irrespective of the number of _ after the first three) use split instead:

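$ awk -F"\t" '
            # key = first three _-separated pieces of the first column
            { split($1, a, /_/); s=a[1]"_"a[2]"_"a[3]}
    # first file: remember each key
    FNR==NR { arr[s]++}
    # second file: print lines whose key was seen in the first file
    FNR<NR && (s in arr)' f1 f2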
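As for the multiple-delimiter part of the question: awk treats a field separator longer than one character as a regular expression, so a bracket expression can split on _ and tab at once. Here is a minimal sketch of that variant, assuming the key is always the first three _-separated pieces. The temporary directory and the names key, seen and result are only for illustration.

```shell
# Recreate the sample data from the question in a scratch directory.
tmp=$(mktemp -d)
printf 'dn_id101_400_CT_TC\tstring1\ndn_id111_60_TT_AA\tstring2\n' > "$tmp/f1"
printf 'dn_id101_400_XX_XX\tdiffstring1\ndn_id400_40_XY_YX\tdiffstring2\ndn_id111_60_GG_CC\tdiffstring3\n' > "$tmp/f2"

# Split on either "_" or a tab, so $1, $2 and $3 together form the key.
# $0 is never modified, so matching lines print exactly as they appear in f2.
result=$(awk -F'[_\t]' '
    { key = $1 "_" $2 "_" $3 }
    FNR == NR { seen[key]; next }   # first file: record the key only
    key in seen                     # second file: print if the key is known
' "$tmp/f1" "$tmp/f2")
printf '%s\n' "$result"
rm -r "$tmp"
```

Because every _ now separates fields, the later columns are renumbered, which is fine here since only $1 to $3 and the untouched $0 are used.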

