I have an Excel file (Available at Google Drive) with data which was saved with a very strange format in order to get printed easily:
And every table repeats daily for over 5 years. I need to analyze this data and tried to get a relational format in order to load it in R/Python-like tools and get only 5 columns:
DATE | CLIENT NAME | TEST | MEASUREMENT | TESTER |
---|---|---|---|---|
01-01-2023 | JOHN SMITH | METABOLYTE A | 0.01 | PhD. IMA GU |
01-01-2023 | JOHN SMITH | METABOLYTE B | 10 | PhD. IMA GU |
01-01-2023 | JOHN SMITH | PCR | NEGATIVE | PhD. IMA GU |
01-01-2023 | JOHN SMITH | MUTATION | +++ | PhD. IMA GU |
01-01-2023 | ALBUS DUMBLE | PREGNANT | NEGATIVE | TECH. GUIVER |
01-01-2023 | ALBUS DUMBLE | GLUCOSE | 121 | TECH. GUIVER |
02-01-2023 | MAYDAY JUNE | METABOLYTE A | 0.01 | PhD. IMA GU |
02-01-2023 | JOHN SMITH | METABOLYTE A | 0.01 | TECH. GUIVER |
02-01-2023 | JOHN SMITH | METABOLYTE B | 10 | TECH. GUIVER |
02-01-2023 | JOHN SMITH | PCR | NEGATIVE | TECH. GUIVER |
02-01-2023 | JOHN SMITH | MUTATION | +++ | TECH. GUIVER |
So, in order to get a conversion from non-relational data to relational table I have applied text-mining techniques available at this GitHub repo. But, basically, have converted everything into one column with tidyr::pivot_longer()
. Is there any optimal function or method to detect and tabulate this kind of data, or should i try to do it with a loop (+843 files)?
My attempt is based on fact, that the entries are formatted identically, so we can use kind of 'moving window'.
a <- openxlsx::read.xlsx(xlsxFile = "/home/sapi/Downloads/ENERO_2023_prueba.xlsx",
colNames = FALSE
)
Now we have to define data frame for data storage. Comments like # [2,2] +0, +1
corresponds to row and column of a
(loaded excel).
entry <- data.frame(
NOMBRE = character(), # [2,2] +0, +1
FECHA = character(), # [2,6] +0, +5
MUESTRA = character(), # [3,3] +1, +2
place = character(), # [3,5] +1, +4
COLOR = character(), # [6,3] +4, +2
ASPECTO = character(), # [7,3] +5< +2
DENSIDAD = double(), # [8,3] +6, +2
PH = character(), # [9,3] +7, +2
# ...
LEUCOCITOS = character(),#[19,3] +17, +2
BACTERIAS = character(), # [6,7] +4, +6
PIOCITOS = character() # [7,7] +5, +6
# ...
)
Now we have to find all rows with NOMBRE
nombre_rows <- which(a[,"X1"] == "NOMBRE")
and use it in loop like:
for (i in 1:length(nombre_rows)) {
x <- nombre_rows[i]
nombre_cols <- which(a[x,] == "NOMBRE") # the same for columns
for (j in 1:length(nombre_cols)) {
y <- nombre_cols[j]
entry <- data.frame(
NOMBRE = a[x, y+1],
FECHA = a[x, y+5],
MUESTRA = a[x+1, y+2],
place = a[x+1, y+4],
COLOR = a[x+4, y+2],
ASPECTO = a[x+5, y+2],
DENSIDAD = a[x+6, y+2],
PH = a[x+7, y+2],
# ...
LEUCOCITOS = a[x+17, y+2],
BACTERIAS = a[x+4, y+6],
PIOCITOS = a[x+5, y+6]
# ...
) |> rbind(entry)
}
}
And finally the data:
head(entry)
#> NOMBRE FECHA MUESTRA place COLOR ASPECTO
#> 1 RUANO EDITH 44957.0 ORINA CEXTERNA AMARILLO LIG.TURBIO
#> 2 CUNIN ELVIRA 44957.0 ORINA HOSPÌTALIZACION AMARILLO TURBIO
#> 3 LOACHAMIN MARIA 44957.0 ORINA CEXTERNA AMARILLO TRANSPARENTE
#> 4 MANZANO RAUL 44957.0 ORINA CEXTERNA AMARILLO LIG.TURBIO
#> 5 MERCHAN IVAN 44957.0 ORINA HIDRATACION ANARANJADO TURBIO
#> 6 ACERO ANTHONY 44957.0 ORINA CEXTERNA AMARILLO LIG.TURBIO
#> DENSIDAD PH LEUCOCITOS BACTERIAS PIOCITOS
#> 1 1005.0 8.0 NEGATIVO + 1-2/CAMPO
#> 2 1020.0 5.0 +++ ++ CAMPO LLENO
#> 3 1005.0 6.0 NEGATIVO OCASIONALES 1-2/CAMPO
#> 4 1010.0 7.0 NEGATIVO + 3-7/CAMPO
#> 5 1015.0 6.0 ++ ++ 50-60/CAMPO
#> 6 1010.0 5.0 NEGATIVO + 0-2/CAMPO
Created on 2023-06-18 with reprex v2.0.2
You should extend the entry
data frame to grab all variables from your data. And then loop it through all excels you have.
Collected from the Internet
Please contact [email protected] to delete if infringement.
Comments