How to connect two variables by an overlapping name in excel or linux

user4311883

I first tried to solve it myself with excel and later in Linux, but I cannot seem to solve it. So hope someone can help me with solving this! And not really good in explaining things, so if anything is unclear, please let me know.

I have a list of around 30,000 genes starting with Bol and their orthologous genes starting with At, meaning the Bol genes are of the same function as At. I know the GO terms (function terms of genes) for the At genes, but not for the Bol genes. But as I also have a list connecting the At genes to the Bol, I can connect the GO terms to the Bol genes with the help of the overlapping At genes. Once that list is finished I want to count how many of each GO term I have in the total Bol gene set.

The files I have look like these:

Small example of sheet 1

Bol040727   AT1G01010
Bol040726   AT1G01020
Bol040729   AT1G01020
Bol040725   AT1G01030
Bol040724   AT1G01040
Bol040723   AT1G01050
Bol018442   AT1G01050
Bol040721   AT1G01070
Bol018443   AT1G01070
Bol040718   AT1G01090
Bol018445   AT1G01090
Bol040717   AT1G01100
Bol018446   AT1G01100
Bol040716   AT1G01110
Bol040715   AT1G01120
Bol018447   AT1G01120
Bol000520   AT1G01120
Bol000521   AT1G01120
Bol040714   AT1G01140
Bol040713   AT1G01160
Bol018449   AT1G01160
Bol040712   AT1G01170
Bol040711   AT1G01180
Bol040710   AT1G01190
Bol018450   AT1G01190
Bol040709   AT1G01200
Bol018452   AT1G01210
Bol018453   AT1G01220
Bol018454   AT1G01225
Bol040708   AT1G01230
Bol018455   AT1G01230
Bol040707   AT1G01240
Bol018456   AT1G01240
Bol040706   AT1G01250
Bol018457   AT1G01250

Small example of sheet 2

AT1G01010   GO:0045449  GO:0005575  GO:0003700  GO:0007275          
AT1G01020   GO:0005783  GO:0016020  GO:0003674  GO:0006665  GO:0016125      
AT1G01030   GO:0005575  GO:0003700  GO:0009908  GO:0045449  GO:0048366      
AT1G01040   GO:0005634  GO:0008026  GO:0003725  GO:0005515  GO:0004525  GO:0006396
AT1G01046   GO:0005575  GO:0003674  GO:0008150              
AT1G01050   GO:0006796  GO:0005737  GO:0016020  GO:0005634  GO:0004427  GO:0008152  
AT1G01060   GO:0003677  GO:0003700  GO:0009651  GO:0009737  GO:0009733  GO:0009723
AT1G01070   GO:0016020                      
AT1G01073   GO:0012505                      
AT1G01073   GO:0003674  GO:0008150                  
AT1G01080   GO:0009507  GO:0009570  GO:0005634  GO:0003676  GO:0000166  GO:0009941
AT1G01090   GO:0009507  GO:0006096  GO:0008152  GO:0009941  GO:0055114  GO:0009536
AT1G01100   GO:0005886  GO:0003735  GO:0006414  GO:0022626  GO:0005829  GO:0005634
AT1G01110   GO:0005739  GO:0003674                  
AT1G01115   GO:0012505                      
AT1G01115   GO:0003674  GO:0008150                  
AT1G01120   GO:0022626  GO:0005783  GO:0016020  GO:0008415  GO:0009409  GO:0009416
AT1G01130   GO:0005575  GO:0003674  GO:0008150

So AT1G01073 and AT1G01115 are in duplicate, It should be:

AT1G01073 GO:0012505 GO:0003674 GO:0008150
AT1G01115 GO:0012505 GO:0003674 GO:0008150 

And as output I would like for instance to get this

Bol040727 AT1G01010 GO:0045449  GO:0005575  GO:0003700  GO:0007275
Bol040726 AT1G01020 GO:0005783  GO:0016020  GO:0003674  GO:0006665  GO:0016125

etc.

I tried to connect sheet 1 with sheet 2 with VLOOKUP in excel, but it seems that some genes in sheet 2 are in duplicate but with different GO terms. So I thought about finding the duplicated genes and paste the GO terms together to create a list of only unique At genes with GO terms. Finding the duplicated At genes is no problem, but how to paste the GO terms together in an easy and quick way (there are a lot of duplicated At genes). And counting the GO terms I can easily do if everything is in one column, but here it is not.

So is there anyone who knows how to solve this or who knows another maybe easier solution? C++ code in Linux or a simple perl/ python script is also fine by me. The columns in the files are tab delimited.

Thanks a lot!

Hynek -Pichi- Vychodil

Let's assume your original data are in files sheet1 and sheet2. So one way is to read sheet2 first and collect all GO functions for ATs. Then read sheet1 and make a new version of sheet1 with GOs.

use strict;
use warnings;
use autodie;

use constant FN_GENES     => 'sheet1';
use constant FN_FUNCTIONS => 'sheet2';

# Make lookup of GOs for ATs
my %functions;    # translation lookup for accumulating functions
{
    open my $fh, '<', FN_FUNCTIONS;
    while (<$fh>) {
        chomp;    # chomp out newline
        my ( $at, @gos ) = split;  # split line by whitespace (including '\t')

        # treat $functions{$at} as hash and set keys by hash splice to undef
        # its more efficient way to write
        # $functions{$at}{$_} = undef for @gos;
        @{ $functions{$at} }{@gos} = ();
    }
}

# test print of %functions content (it's lookup)
# print "$_: @{[sort keys %{$functions{$_}}]}\n" for sort keys %functions;

# sort GOs (change hash keys to sorted array)
for my $gos ( values %functions ) {
    $gos = [ sort keys %$gos ];
}

# read genes and look up for GOs
{
    open my $fh, '<', FN_GENES;
    while (<$fh>) {
        chomp;
        my ( $bol, $at ) = split;

        # GOs for current Bol
        # Use empty array if missing (array ref is always true)
        my $gos = $functions{$at} || [];

        # output tab delimited row
        print join( "\t", $bol, $at, @$gos ), "\n";
    }
}

Example output:

Bol040727       AT1G01010       GO:0003700      GO:0005575      GO:0007275      GO:0045449
Bol040726       AT1G01020       GO:0003674      GO:0005783      GO:0006665      GO:0016020      GO:0016125
Bol040729       AT1G01020       GO:0003674      GO:0005783      GO:0006665      GO:0016020      GO:0016125
Bol040725       AT1G01030       GO:0003700      GO:0005575      GO:0009908      GO:0045449      GO:0048366
Bol040724       AT1G01040       GO:0003725      GO:0004525      GO:0005515      GO:0005634      GO:0006396      GO:0008026
Bol040723       AT1G01050       GO:0004427      GO:0005634      GO:0005737      GO:0006796      GO:0008152      GO:0016020
Bol018442       AT1G01050       GO:0004427      GO:0005634      GO:0005737      GO:0006796      GO:0008152      GO:0016020
Bol040721       AT1G01070       GO:0016020
Bol018443       AT1G01070       GO:0016020
Bol040718       AT1G01090       GO:0006096      GO:0008152      GO:0009507      GO:0009536      GO:0009941      GO:0055114
Bol018445       AT1G01090       GO:0006096      GO:0008152      GO:0009507      GO:0009536      GO:0009941      GO:0055114
Bol040717       AT1G01100       GO:0003735      GO:0005634      GO:0005829      GO:0005886      GO:0006414      GO:0022626
Bol018446       AT1G01100       GO:0003735      GO:0005634      GO:0005829      GO:0005886      GO:0006414      GO:0022626
Bol040716       AT1G01110       GO:0003674      GO:0005739
Bol040715       AT1G01120       GO:0005783      GO:0008415      GO:0009409      GO:0009416      GO:0016020      GO:0022626
Bol018447       AT1G01120       GO:0005783      GO:0008415      GO:0009409      GO:0009416      GO:0016020      GO:0022626
Bol000520       AT1G01120       GO:0005783      GO:0008415      GO:0009409      GO:0009416      GO:0016020      GO:0022626
Bol000521       AT1G01120       GO:0005783      GO:0008415      GO:0009409      GO:0009416      GO:0016020      GO:0022626
Bol040714       AT1G01140
Bol040713       AT1G01160
Bol018449       AT1G01160
Bol040712       AT1G01170
Bol040711       AT1G01180
Bol040710       AT1G01190
Bol018450       AT1G01190
Bol040709       AT1G01200
Bol018452       AT1G01210
Bol018453       AT1G01220
Bol018454       AT1G01225
Bol040708       AT1G01230
Bol018455       AT1G01230
Bol040707       AT1G01240
Bol018456       AT1G01240
Bol040706       AT1G01250
Bol018457       AT1G01250

Collected from the Internet

Please contact [email protected] to delete if infringement.

edited at
0

Comments

0 comments
Login to comment

Related

How to connect two variables when storing it to localStorage?

SQL ,How to connect two IDs with name in SQL

How to connect two VCAN ports in Linux?

How to "Connect" two variables between two React Components?

How to merge two overlapping dataframes

How to stop overlapping two div?

How to use domain name to connect via ssh into your linux server

How can I Connect two Linux virtual machine on VMware

How to print counts based on two variables found in two columns including the name of the variables from a big datafile?

Is there an easy way to connect the tangents of two overlapping Bezier curves?

how to find non unique combinations of two variables in Excel?

How to create a count table of two variables in excel pivot

How can I pass variables betweet two Subs in Excel VBA?

How do I seperate two overlapping legends for a dual axis ggplot, one legend representing variables by shape and another representing groups by color?

How to merge two cells in excel with same field name

How to prevent overlapping the contents of two column

RxJs how to merge two overlapping observable into one

Two column grid how to prevent overlapping

How to capture onTouch events on two overlapping views?

How to detect overlapping between two datetime in SQL?

How to synchronise two overlapping JavaScript onclick events

How to remove overlapping sequences from two datatables?

How to check if two round divs are overlapping

Comparing two numeric variables in Linux

How to Join two tables on two separate variables with one variable having the same name

How to name class variables

How to name variables with units?

Connect two observables to find name by id

Pandas: How to merge two dataframes on two different overlapping timeseries