I first tried to solve it myself with excel and later in Linux, but I cannot seem to solve it. So hope someone can help me with solving this! And not really good in explaining things, so if anything is unclear, please let me know.
I have a list of around 30,000 genes starting with Bol
and their orthologous genes starting with At
, meaning the Bol
genes are of the same function as At
. I know the GO
terms (function terms of genes) for the At
genes, but not for the Bol genes. But as I also have a list connecting the At
genes to the Bol
, I can connect the GO
terms to the Bol
genes with the help of the overlapping At
genes. Once that list is finished I want to count how many of each GO
term I have in the total Bol
gene set.
The files I have look like these:
Small example of sheet 1
Bol040727 AT1G01010
Bol040726 AT1G01020
Bol040729 AT1G01020
Bol040725 AT1G01030
Bol040724 AT1G01040
Bol040723 AT1G01050
Bol018442 AT1G01050
Bol040721 AT1G01070
Bol018443 AT1G01070
Bol040718 AT1G01090
Bol018445 AT1G01090
Bol040717 AT1G01100
Bol018446 AT1G01100
Bol040716 AT1G01110
Bol040715 AT1G01120
Bol018447 AT1G01120
Bol000520 AT1G01120
Bol000521 AT1G01120
Bol040714 AT1G01140
Bol040713 AT1G01160
Bol018449 AT1G01160
Bol040712 AT1G01170
Bol040711 AT1G01180
Bol040710 AT1G01190
Bol018450 AT1G01190
Bol040709 AT1G01200
Bol018452 AT1G01210
Bol018453 AT1G01220
Bol018454 AT1G01225
Bol040708 AT1G01230
Bol018455 AT1G01230
Bol040707 AT1G01240
Bol018456 AT1G01240
Bol040706 AT1G01250
Bol018457 AT1G01250
Small example of sheet 2
AT1G01010 GO:0045449 GO:0005575 GO:0003700 GO:0007275
AT1G01020 GO:0005783 GO:0016020 GO:0003674 GO:0006665 GO:0016125
AT1G01030 GO:0005575 GO:0003700 GO:0009908 GO:0045449 GO:0048366
AT1G01040 GO:0005634 GO:0008026 GO:0003725 GO:0005515 GO:0004525 GO:0006396
AT1G01046 GO:0005575 GO:0003674 GO:0008150
AT1G01050 GO:0006796 GO:0005737 GO:0016020 GO:0005634 GO:0004427 GO:0008152
AT1G01060 GO:0003677 GO:0003700 GO:0009651 GO:0009737 GO:0009733 GO:0009723
AT1G01070 GO:0016020
AT1G01073 GO:0012505
AT1G01073 GO:0003674 GO:0008150
AT1G01080 GO:0009507 GO:0009570 GO:0005634 GO:0003676 GO:0000166 GO:0009941
AT1G01090 GO:0009507 GO:0006096 GO:0008152 GO:0009941 GO:0055114 GO:0009536
AT1G01100 GO:0005886 GO:0003735 GO:0006414 GO:0022626 GO:0005829 GO:0005634
AT1G01110 GO:0005739 GO:0003674
AT1G01115 GO:0012505
AT1G01115 GO:0003674 GO:0008150
AT1G01120 GO:0022626 GO:0005783 GO:0016020 GO:0008415 GO:0009409 GO:0009416
AT1G01130 GO:0005575 GO:0003674 GO:0008150
So AT1G01073
and AT1G01115
are in duplicate, It should be:
AT1G01073 GO:0012505 GO:0003674 GO:0008150
AT1G01115 GO:0012505 GO:0003674 GO:0008150
And as output I would like for instance to get this
Bol040727 AT1G01010 GO:0045449 GO:0005575 GO:0003700 GO:0007275
Bol040726 AT1G01020 GO:0005783 GO:0016020 GO:0003674 GO:0006665 GO:0016125
etc.
I tried to connect sheet 1 with sheet 2 with VLOOKUP
in excel, but it seems that some genes in sheet 2 are in duplicate but with different GO
terms. So I thought about finding the duplicated genes and paste the GO
terms together to create a list of only unique At genes with GO
terms. Finding the duplicated At genes is no problem, but how to paste the GO
terms together in an easy and quick way (there are a lot of duplicated At genes). And counting the GO
terms I can easily do if everything is in one column, but here it is not.
So is there anyone who knows how to solve this or who knows another maybe easier solution? C++ code in Linux or a simple perl/ python script is also fine by me. The columns in the files are tab delimited.
Thanks a lot!
Let's assume your original data are in files sheet1
and sheet2
. So one way is to read sheet2
first and collect all GO
functions for AT
s. Then read sheet1
and make a new version of sheet1
with GO
s.
use strict;
use warnings;
use autodie;
use constant FN_GENES => 'sheet1';
use constant FN_FUNCTIONS => 'sheet2';
# Make lookup of GOs for ATs
my %functions; # translation lookup for accumulating functions
{
open my $fh, '<', FN_FUNCTIONS;
while (<$fh>) {
chomp; # chomp out newline
my ( $at, @gos ) = split; # split line by whitespace (including '\t')
# treat $functions{$at} as hash and set keys by hash splice to undef
# its more efficient way to write
# $functions{$at}{$_} = undef for @gos;
@{ $functions{$at} }{@gos} = ();
}
}
# test print of %functions content (it's lookup)
# print "$_: @{[sort keys %{$functions{$_}}]}\n" for sort keys %functions;
# sort GOs (change hash keys to sorted array)
for my $gos ( values %functions ) {
$gos = [ sort keys %$gos ];
}
# read genes and look up for GOs
{
open my $fh, '<', FN_GENES;
while (<$fh>) {
chomp;
my ( $bol, $at ) = split;
# GOs for current Bol
# Use empty array if missing (array ref is always true)
my $gos = $functions{$at} || [];
# output tab delimited row
print join( "\t", $bol, $at, @$gos ), "\n";
}
}
Example output:
Bol040727 AT1G01010 GO:0003700 GO:0005575 GO:0007275 GO:0045449
Bol040726 AT1G01020 GO:0003674 GO:0005783 GO:0006665 GO:0016020 GO:0016125
Bol040729 AT1G01020 GO:0003674 GO:0005783 GO:0006665 GO:0016020 GO:0016125
Bol040725 AT1G01030 GO:0003700 GO:0005575 GO:0009908 GO:0045449 GO:0048366
Bol040724 AT1G01040 GO:0003725 GO:0004525 GO:0005515 GO:0005634 GO:0006396 GO:0008026
Bol040723 AT1G01050 GO:0004427 GO:0005634 GO:0005737 GO:0006796 GO:0008152 GO:0016020
Bol018442 AT1G01050 GO:0004427 GO:0005634 GO:0005737 GO:0006796 GO:0008152 GO:0016020
Bol040721 AT1G01070 GO:0016020
Bol018443 AT1G01070 GO:0016020
Bol040718 AT1G01090 GO:0006096 GO:0008152 GO:0009507 GO:0009536 GO:0009941 GO:0055114
Bol018445 AT1G01090 GO:0006096 GO:0008152 GO:0009507 GO:0009536 GO:0009941 GO:0055114
Bol040717 AT1G01100 GO:0003735 GO:0005634 GO:0005829 GO:0005886 GO:0006414 GO:0022626
Bol018446 AT1G01100 GO:0003735 GO:0005634 GO:0005829 GO:0005886 GO:0006414 GO:0022626
Bol040716 AT1G01110 GO:0003674 GO:0005739
Bol040715 AT1G01120 GO:0005783 GO:0008415 GO:0009409 GO:0009416 GO:0016020 GO:0022626
Bol018447 AT1G01120 GO:0005783 GO:0008415 GO:0009409 GO:0009416 GO:0016020 GO:0022626
Bol000520 AT1G01120 GO:0005783 GO:0008415 GO:0009409 GO:0009416 GO:0016020 GO:0022626
Bol000521 AT1G01120 GO:0005783 GO:0008415 GO:0009409 GO:0009416 GO:0016020 GO:0022626
Bol040714 AT1G01140
Bol040713 AT1G01160
Bol018449 AT1G01160
Bol040712 AT1G01170
Bol040711 AT1G01180
Bol040710 AT1G01190
Bol018450 AT1G01190
Bol040709 AT1G01200
Bol018452 AT1G01210
Bol018453 AT1G01220
Bol018454 AT1G01225
Bol040708 AT1G01230
Bol018455 AT1G01230
Bol040707 AT1G01240
Bol018456 AT1G01240
Bol040706 AT1G01250
Bol018457 AT1G01250
Collected from the Internet
Please contact [email protected] to delete if infringement.
Comments