In the dataset I'm working on, the Adult dataset, the missing values are indicated with the "?"
string, and I want to discard the rows containing missing values.
The documentation of df.dropna()
lists no argument for passing a custom value to interpret as null/missing.
I know I can solve the problem with something like:
df_str = df.select_dtypes(['object'])  # get the columns containing strings
for col in df_str.columns:
    df = df[df[col] != '?']
but I was wondering whether there is a standard way of achieving this with the pandas API, ideally one that is more flexible and faster.
If you're importing the data from a CSV file, for example, you can use the na_values parameter of pd.read_csv
to define additional strings to recognise as NA/NaN.
Example:
import pandas as pd
from io import StringIO
data = \
"""
A;B;C
1;2;?
4;?;6
?;8;9
"""
df = pd.read_csv(StringIO(data),
                 delimiter=';',
                 na_values='?')
The resulting dataframe looks like this:
  A  |  B  |  C
-----|-----|-----
  1  |  2  | NaN
  4  | NaN |  6
 NaN |  8  |  9
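If the data is already loaded into a DataFrame, you don't need to re-read the CSV: a common pattern is to replace the sentinel string with NaN via df.replace and then call df.dropna(). A minimal sketch, using a small hypothetical frame in place of the Adult dataset:

```python
import pandas as pd
import numpy as np

# Hypothetical sample frame standing in for the Adult dataset,
# with "?" marking missing values
df = pd.DataFrame({
    "A": ["1", "4", "?"],
    "B": ["2", "?", "8"],
    "C": ["3", "6", "9"],
})

# Turn the sentinel into a real missing value, then drop incomplete rows
cleaned = df.replace("?", np.nan).dropna()
print(cleaned)  # only the first row survives
```

This keeps the whole operation vectorised instead of looping over columns, and dropna's subset/thresh parameters give extra control over which rows are discarded.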