Dealing with missing data (NA)
The challenges of descriptive statistics - Part 1
They are a constant, unloved companion in data analysis and yet unavoidable: missing values. Often abbreviated as “NA” (not available), they can be handled in many different ways. In this article we look at how missing values arise, which types exist and which methods can be used to minimise the loss of information they cause.
Data points can be missing for a variety of reasons, but most often simply because the data for a feature is not available. In medical data sets, this may mean that additional tests were performed on some patients but not on others. In questionnaires, missing values often occur because a participant overlooks a question and therefore does not answer it. The removal of outliers, as well as errors in a measurement instrument or in data entry, is also a common cause of missing values. Only a small proportion of missing values are structural. For example, a participant could answer the question “Do you smoke?” in the negative and thus make an answer to the follow-up question about the number of cigarettes smoked per day obsolete.
Types of missing values
Once missing values have arisen, a distinction is made between explicitly and implicitly missing values in a data set. Explicitly missing values are marked as such in the data set, and the marker differs depending on the software. In Excel, an empty cell would be such a marker; in other programs and programming languages, symbols such as “NA” or “.” within a cell stand for a missing value.
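As a minimal sketch of how such markers might be unified when reading data, the following Python snippet maps several markers to a single representation (`None`). The set of markers, the helper `parse_value` and the raw data are assumptions made for illustration; which strings count as missing depends on the data source.

```python
import csv
import io

# Hypothetical set of markers; adapt it to the software that produced the file.
MISSING_MARKERS = {"", "NA", "."}

def parse_value(cell):
    """Return None for a missing marker, otherwise the cell as a float."""
    text = cell.strip()
    return None if text in MISSING_MARKERS else float(text)

# Made-up raw export in which three different markers are used.
raw = "person,weight\nA,71.5\nB,NA\nC,.\nD,\nE,88.0\n"

rows = list(csv.DictReader(io.StringIO(raw)))
weights = [parse_value(row["weight"]) for row in rows]
print(weights)  # [71.5, None, None, None, 88.0]
```

After this step, all explicitly missing values share one marker and can be counted or filtered consistently.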
Implicitly missing values are simply not present in a data set, which is why they are much more difficult to detect. The following example will demonstrate the difference.
This data set contains both an explicit and an implicit missing value. The explicitly missing value is the one for March; the implicitly missing value is the one for August, which is much harder to spot. Especially in larger data sets, checking for implicitly missing values is very important. In the above example, the months present would be compared with a complete list of months and an entry for August would be added. In this way, an implicitly missing value becomes an explicitly missing value. Analysts like to summarise the difference with the saying “An explicit missing value is the presence of an absence; an implicit missing value is the absence of a presence”.
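The comparison against a complete list of months can be sketched in a few lines of Python. The monthly figures below are invented for illustration; only the pattern (March explicitly missing, August implicitly missing) follows the example in the text.

```python
# Made-up monthly values: March is explicitly missing (None),
# August has no entry at all and is therefore implicitly missing.
values = {
    "January": 12, "February": 15, "March": None, "April": 11,
    "May": 14, "June": 13, "July": 16, "September": 12,
    "October": 17, "November": 14, "December": 18,
}

ALL_MONTHS = [
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
]

# Months that should exist but have no entry are implicitly missing.
implicit = [month for month in ALL_MONTHS if month not in values]

# Adding them with an explicit marker turns them into explicit missing values.
complete = {month: values.get(month) for month in ALL_MONTHS}
explicit = [month for month, value in complete.items() if value is None]

print(implicit)  # ['August']
print(explicit)  # ['March', 'August']
```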
The structure of missing values
In the next step, the structure of the missing values is analysed in more detail, as the further procedure depends on it. A distinction is made between three mechanisms:
- Missing completely at random (MCAR)
- Missing at random (MAR)
- Missing not at random (MNAR)
Let us assume, for example, that we have a data set with the height, weight and blood pressure of different persons, in which the weight of several persons is missing. Missing completely at random would mean that the missing values occur purely by chance and depend neither on the weight variable itself (for example, it is not only the entries of overweight persons that are missing) nor on the other two variables. Missing at random means that the missing values occur randomly with respect to the weight variable but depend on one of the other two variables. For example, the values of people with low blood pressure could be systematically missing, while these people could have any weight. In the example, missing not at random would mean that the values of persons with particular weight characteristics, such as overweight persons, are systematically missing.
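To make the three mechanisms more tangible, here is a small simulation sketch in Python. The data are randomly generated, and the thresholds (30 %, blood pressure below 115, weight above 85 kg) are arbitrary assumptions chosen only for illustration. For simplicity the three variables are generated independently of each other.

```python
import random

random.seed(0)

# Made-up data: (height in cm, weight in kg, systolic blood pressure in mmHg).
people = [
    (random.gauss(172, 9), random.gauss(75, 12), random.gauss(125, 15))
    for _ in range(1000)
]

def delete_weight(rows, is_missing):
    """Return a copy in which weight is None wherever is_missing(h, w, bp) is true."""
    return [(h, None if is_missing(h, w, bp) else w, bp) for h, w, bp in rows]

# MCAR: every weight has the same 30 % chance of being missing.
mcar = delete_weight(people, lambda h, w, bp: random.random() < 0.3)
# MAR: missingness depends only on the observed blood pressure.
mar = delete_weight(people, lambda h, w, bp: bp < 115)
# MNAR: missingness depends on the (then unobserved) weight itself.
mnar = delete_weight(people, lambda h, w, bp: w > 85)

def mean_observed_weight(rows):
    """Mean of the weights that are still observed."""
    observed = [w for _, w, _ in rows if w is not None]
    return sum(observed) / len(observed)

for name, rows in [("full", people), ("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    print(name, round(mean_observed_weight(rows), 1))
```

In the MNAR case the mean of the observed weights is necessarily lower than the true mean, because exactly the heaviest persons have been removed.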
Missing not at random is problematic in modelling because the structure of the missing values may hide relevant information. To distinguish missing at random from missing not at random, it is necessary to obtain more precise information about the missing values.
A typical missing not at random setting can occur in drug trials: people who experience side effects drop out, and in the end only those who tolerated the drug well remain in the trial. This would bias the results, which is why studies must always document precisely when and why a participant drops out.
If values are missing not at random, the further procedure must be reconsidered; the procedures presented in the next section can only be applied if the values are at least missing at random.
Picture 1: from https://wikis.fu-berlin.de/display/fustat/Vom+Umgang+mit+fehlenden+Werten
Dealing with missing data
How best to deal with missing values has already been the subject of entire books. We only give a brief overview of the most common procedures here and recommend Raghunathan’s book for further reading (see Sources). An overview (sorted by complexity):
Exclusion of cases with missing values
This method is the simplest and most intuitive. All observations for which the value of at least one variable is unknown are removed from the data set entirely. This is only feasible if the data set is large enough to retain sufficient observations. If the missing values do not occur randomly but follow a pattern, applying the method is problematic and can lead to biased results. In any case, exclusion entails a loss of efficiency, since information that was collected is not included in the analysis.
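A minimal sketch of case exclusion in Python, using invented records in which `None` marks a missing value:

```python
# Made-up records; None marks a missing value.
records = [
    {"height": 180, "weight": 82.0, "bp": 130},
    {"height": 165, "weight": None, "bp": 118},
    {"height": 172, "weight": 70.5, "bp": None},
    {"height": 190, "weight": 95.0, "bp": 140},
]

# Keep only the records in which every variable is observed.
complete_cases = [r for r in records if all(v is not None for v in r.values())]

dropped = len(records) - len(complete_cases)
print(f"kept {len(complete_cases)} of {len(records)} records, dropped {dropped}")
# kept 2 of 4 records, dropped 2
```

The third record illustrates the loss of efficiency: its observed height and weight are discarded only because the blood pressure is missing.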
Replace with mean/median
The simplest way to replace missing values is to substitute the mean. In our example, the missing weight values would be replaced by the average weight in the data set. This method has the advantage that the height and blood pressure values of the affected individuals can still be used in the analysis and that the mean of the weight variable remains unchanged. However, it also has some disadvantages. Implausible values often arise: it makes little sense, for example, to assign the Austrian average weight of 74 kg to a man with a height of 2 metres.
Much more problematic, however, is that the dispersion of the weight variable is artificially reduced, as the data points pile up at the mean. Correlations between the variable with missing values and other variables can also no longer be calculated correctly, since the imputed constant cannot vary together with any other variable.
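The effect on the dispersion can be checked with a short Python sketch. The weights below are invented for illustration.

```python
import statistics

# Made-up weights in kg; None marks a missing value.
weights = [62.0, None, 74.0, 81.0, None, 68.0, 90.0, None, 77.0, 59.0]

observed = [w for w in weights if w is not None]
mean_weight = statistics.mean(observed)

# Mean imputation: every missing value is replaced by the observed mean.
imputed = [mean_weight if w is None else w for w in weights]

print("mean: ", statistics.mean(observed), statistics.mean(imputed))
print("stdev:", statistics.stdev(observed), statistics.stdev(imputed))
```

The mean is the same before and after imputation, while the standard deviation of the imputed series is smaller than that of the observed values.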
Estimating values with a linear model (regression imputation)
This method exploits the correlations between the variables and fits a model to estimate the missing values. For the estimates to be reasonable, the variables must explain each other to some degree. For example, the weight variable could be approximated quite well by height. The imputed values lie exactly on the regression line, which again leads to biased variances and covariances, so caution is advised here as well.
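A minimal sketch of regression imputation with a single predictor, written in plain Python. The height and weight pairs are invented, and a simple least-squares line of weight on height is assumed as the model.

```python
# Made-up (height in cm, weight in kg) pairs; None marks a missing weight.
data = [(160, 55.0), (165, 61.0), (170, None), (175, 72.0),
        (180, 78.0), (185, None), (190, 90.0)]

complete = [(h, w) for h, w in data if w is not None]
n = len(complete)
mean_h = sum(h for h, _ in complete) / n
mean_w = sum(w for _, w in complete) / n

# Ordinary least squares for weight = intercept + slope * height,
# fitted on the complete cases only.
slope = (sum((h - mean_h) * (w - mean_w) for h, w in complete)
         / sum((h - mean_h) ** 2 for h, _ in complete))
intercept = mean_w - slope * mean_h

# Replace each missing weight by the value predicted from height.
imputed = [(h, w if w is not None else intercept + slope * h) for h, w in data]

for h, w in imputed:
    print(h, round(w, 1))
```

Because every imputed weight is a prediction without any residual scatter, the imputed points sit exactly on the fitted line.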
Hot Deck Imputation
Hot deck imputation takes a different approach. It searches the data set for “similar” observations and replaces the missing value of the variable with the value of one of them. In this way, implausible values can be avoided, because the replacements are drawn from the data set itself. A disadvantage is that after imputation, the added values can no longer be distinguished from the collected values, and thus the uncertainty of the imputation is no longer taken into account.
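One possible way to define “similar” is the nearest neighbour on the other variables. The following Python sketch assumes this variant; the records and the unscaled squared Euclidean distance are illustrative choices, not a prescribed definition.

```python
# Made-up records: (height in cm, blood pressure in mmHg, weight in kg).
records = [
    (160, 110, 54.0),
    (172, 125, 70.0),
    (185, 135, None),   # recipient: weight is missing
    (188, 138, 92.0),
    (175, 120, 76.0),
]

# Only records with an observed weight can serve as donors.
donors = [r for r in records if r[2] is not None]

def distance(a, b):
    """Squared Euclidean distance on height and blood pressure (unscaled)."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def hot_deck(record):
    """Fill a missing weight with the weight of the most similar donor."""
    if record[2] is not None:
        return record
    donor = min(donors, key=lambda d: distance(d, record))
    return (record[0], record[1], donor[2])

imputed = [hot_deck(r) for r in records]
print(imputed[2])  # (185, 135, 92.0)
```

In practice the variables would usually be put on a comparable scale before distances are computed, and a donor may be drawn at random from several similar candidates.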
Multiple imputation
The most complex procedure that we would like to present in this article is multiple imputation. It takes into account the uncertainty that arises from imputation by replacing each missing value not with a single value but with a series of plausible values. This results in several data sets, on each of which the subsequent analysis is carried out separately. Provided that the values are at least missing at random, unbiased results are obtained by combining the individual results. The disadvantage of this method is its higher complexity and the associated longer computing time.
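The article does not specify how the results are combined; a common choice is Rubin's rules, which the following Python sketch assumes. The five estimates and their variances are invented and stand for the result of the same analysis on five imputed data sets.

```python
import statistics

def pool(estimates, variances):
    """Combine per-data-set results using Rubin's rules.

    estimates: the quantity of interest from each imputed data set
    variances: the squared standard error of each estimate
    """
    m = len(estimates)
    pooled = statistics.mean(estimates)
    within = statistics.mean(variances)        # average sampling variance
    between = statistics.variance(estimates)   # variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled, total

# Made-up results: mean weight and its squared standard error
# from m = 5 imputed data sets.
estimates = [73.8, 74.5, 74.1, 73.6, 74.0]
variances = [1.10, 1.05, 1.12, 1.08, 1.15]

pooled, total = pool(estimates, variances)
print(round(pooled, 3), round(total, 3))
```

The between-imputation term is what a single imputation cannot provide: it reflects how much the result changes depending on which plausible values were filled in.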
Dealing with missing data is complex and multi-layered, and a structured approach is very important. Whichever procedure is used, one must be aware of its advantages and disadvantages. We would like to offer a short recipe for handling missing values, which must, however, be adapted to the respective application:
- Determine the frequency of missing values
- Check for implicitly missing values
- Analyse the structure of the missing values (MCAR, MAR, MNAR)
- Choose an imputation procedure deliberately, weighing opportunities and risks
- Apply the procedure
- Summarise and interpret the results
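The first step of the recipe, determining the frequency of missing values, could be sketched in Python as follows. The records and the helper `missing_share` are invented for illustration.

```python
# Made-up records; None marks a missing value.
records = [
    {"height": 180, "weight": 82.0, "bp": 130},
    {"height": 165, "weight": None, "bp": 118},
    {"height": 172, "weight": 70.5, "bp": None},
    {"height": 190, "weight": None, "bp": 140},
    {"height": 158, "weight": 55.0, "bp": 112},
]

def missing_share(rows):
    """Share of missing values per variable, assuming all rows have the same keys."""
    keys = rows[0].keys()
    return {k: sum(r[k] is None for r in rows) / len(rows) for k in keys}

print(missing_share(records))  # {'height': 0.0, 'weight': 0.4, 'bp': 0.2}
```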
Sources
- Raghunathan, Trivellore (2016), Missing Data Analysis in Practice, Chapman & Hall