This is my first time ever using R so please forgive me if this question is not worded properly. I have a .csv file that I imported into R and I am trying to summarize some data. each row of data if for a given year, study site, and area and each column has the number of species present. There are 4 columns for each species as there were 4 surveys where the species could have been seen.
I am trying to get the sum of each species by year and study site. Columns 5:8 are one species, 9:12 another, 13:16 another and so on. Here is the code that I thought would summarize columns 5:8 by year (YYYY) and study area (SAR):
aggregate(test[,5:8],by = list("SAR","YYYY"), FUN = sum, na.rm = TRUE)
This gives me the error message that "argument must have the same length". Can anyone help me through this initial step?
Here is some of the data:
SAR YYYY GRID_ID WID col1 col2 col3 col4
BCPALP 2005 1 1189 NA NA 0 0
BCPALP 2005 1 1190 0 NA 0 0
BCPALP 2005 1 1191 0 0 NA NA
BCPALP 2005 1 1192 0 NA NA NA
BCPALP 2005 1 1194 NA NA 1 NA
BCPALP 2005 1 1195 NA NA 1 NA
BCPALP 2005 1 1196 0 NA 0 NA
BCPALP 2005 1 1198 0 NA 0 NA
BCPALP 2005 1 1199 0 NA 0 0
I'm hoping to get an output that is something like this:
SAR YYYY total of columns 1:4
BCPALP 2005 2
This is the code I just tried.
aggregate(cbind("col1", "col2", "col3", "col4")~SAR+YYYY, test, FUN=sum, na.rm=TRUE, na.action=NULL)
It gives me an error message that states "variable lengths differ (found for 'SAR')".
I went back and checked the data and all the variable lengths are the same.
We can use either aggregate
or data.table
or dplyr
. If we use the formula method for aggregate
, we need to set the na.action=NULL
when there are NA
values in different columns. By default, the na.action=na.omit
, so if there is a single NA in one of the columns, that row will be removed from the calculation.
aggregate(cbind(col1, col2, col3, col4)~SAR+YYYY, test,
FUN=sum, na.rm=TRUE, na.action=NULL)
# SAR YYYY col1 col2 col3 col4
#1 BCPALP 2005 0 0 2 0
Using dplyr
, we group by 'SAR', 'YYYY', and use summarise_each
to get the sum
of each of the 'col'.
library(dplyr)
test %>%
group_by(SAR, YYYY) %>%
summarise_each(funs(sum=sum(., na.rm=TRUE)), 5:ncol(test))
# SAR YYYY col1 col2 col3 col4
# (chr) (int) (int) (int) (int) (int)
#1 BCPALP 2005 0 0 2 0
Or with data.table
. We convert the 'data.frame' to 'data.table' (setDT(test)
), grouped by 'SAR', 'YYYY', we loop though the Subset of Data.table (.SD
) and get the sum
. The columns to be looped are specified in the .SDcols
.
library(data.table)
setDT(test)[, lapply(.SD, sum, na.rm=TRUE), by = .(SAR, YYYY),
.SDcols= 5:ncol(test)]
# SAR YYYY col1 col2 col3 col4
#1: BCPALP 2005 0 0 2 0
Suppose after aggregating we need to get the row wise sum for columns 'col1:col4', then 'col5:col8' etc.
DT <- setDT(test1)[, lapply(.SD, sum, na.rm=TRUE),
by = .(SAR, YYYY), .SDcols= 5:ncol(test1)]
DT1 <- melt(DT, id.var=c('SAR', 'YYYY'))[, i1 := as.numeric(gl(.N, 4, .N)),
.(SAR, YYYY)]
dcast(DT1, SAR+YYYY~i1, value.var='value', sum)
test <- structure(list(SAR = c("BCPALP", "BCPALP",
"BCPALP", "BCPALP",
"BCPALP", "BCPALP", "BCPALP", "BCPALP", "BCPALP"), YYYY = c(2005L,
2005L, 2005L, 2005L, 2005L, 2005L, 2005L, 2005L, 2005L),
GRID_ID = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), WID = c(1189L, 1190L, 1191L,
1192L, 1194L, 1195L, 1196L, 1198L, 1199L), col1 = c(NA, 0L, 0L,
0L, NA, NA, 0L, 0L, 0L), col2 = c(NA, NA, 0L, NA, NA, NA, NA,
NA, NA), col3 = c(0L, 0L, NA, NA, 1L, 1L, 0L, 0L, 0L), col4 = c(0L,
0L, NA, NA, NA, NA, NA, NA, 0L)), .Names = c("SAR", "YYYY",
"GRID_ID",
"WID", "col1", "col2", "col3", "col4"), class = "data.frame",
row.names = c(NA, -9L))
set.seed(24)
m1 <- matrix(sample(c(NA,0:5), 9*4, replace=TRUE),ncol=4,
dimnames=list(NULL, paste0('col', 5:8)))
test1 <- cbind(test, m1)
Collected from the Internet
Please contact [email protected] to delete if infringement.
Comments