As you clean your data, create an action list of issues to look for. This list will supplement the standard set of cleaning operations for citizen science data discussed in this post. We will also walk through some errors found in a sanitary sewer overflow (SSO) dataset from the State of Maryland to demonstrate some advanced issues found in externally sourced datasets.
Tackling data QA/QC is best done with a variety of techniques and software solutions. With more tools in your toolbox, you can confidently and efficiently address your data’s problems. Although we are blessed with a wide universe of tools and functions, there is no substitute for methodically familiarizing yourself with the data. You should know the simple but ever-important statistics like your number of observations, value ranges, and other general descriptors. An intimate knowledge of your dataset, in conjunction with the tools discussed in this post, will equip you to catch and correct any errors your data might have.
Here are a few of the most common data problems and the tools I like to use to solve them. Note that if you are using Google Sheets, Excel, or any other tool without automatic logging (unlike a script), make sure you record every dataset manipulation in your documentation!
Outlier Values. Often data that hasn’t been cleaned will contain values that fall outside the logical range of possibility. For example, if your water quality monitoring dataset contains pH readings, all of your values should fall between 0 and 14. For data with a known range like pH, you can use the MIN and MAX functions in Excel or Google Sheets to quickly find any values that fall outside your desired range. For variables with a broader range of values, like precipitation or fecal coliform, you can use other measures like the average or standard deviation to identify possible outliers that deviate significantly from the norm. Sometimes these values are accurate, and other times they warrant removal. At the very least you should be aware of them and how they affect your data.
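If you prefer a script to spreadsheet formulas, both checks take only a few lines. This is a minimal sketch using Python’s standard library; the pH readings are made-up example values, and the two-standard-deviation cutoff is one common convention, not a universal rule.

```python
import statistics

# Hypothetical pH readings from a monitoring dataset (values invented for illustration)
ph_readings = [7.2, 6.8, 7.5, 71.0, 6.9, 7.1, 0.4, 7.3]

# Known-range check: valid pH falls between 0 and 14
out_of_range = [v for v in ph_readings if not 0 <= v <= 14]

# Open-ended check: flag values more than 2 standard deviations from the mean
mean = statistics.mean(ph_readings)
stdev = statistics.stdev(ph_readings)
suspect = [v for v in ph_readings if abs(v - mean) > 2 * stdev]

print(out_of_range)  # the 71.0 reading fails the range check
print(suspect)
```

Note that a single extreme value inflates the standard deviation itself, which is why a spreadsheet MIN/MAX scan of the raw values is still worth doing alongside any statistical flagging.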
Spatial Errors. Most environmental data is spatial in nature: it represents information collected at a specific location and will contain either addresses or coordinate pairs like latitude and longitude. Ensuring your data is in the right place is just as important as the data itself. If you are unfamiliar with GIS software packages like QGIS or ArcMap, you can easily use Google Earth to check your data for location accuracy. Simply open Google Earth, select File -> Import, and follow the prompts to use your latitude and longitude variables to map your points. You’ll quickly recognize any points that fall outside your geographic scope. Common spatial issues include swapped latitude and longitude pairs, longitude values missing a ‘-’ (if you’re in the western hemisphere), and misplaced decimal points, e.g. -768.9992 instead of -76.8992. Solving spatial errors is crucial if you plan to do any mapping with your data in Water Reporter or any other platform.
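The three common spatial errors above can also be caught with a bounding-box check before you ever open a map. Here is a rough sketch: the Maryland bounding box coordinates are approximate values I’ve assumed for illustration, and the example points are invented.

```python
# Approximate bounding box for Maryland (assumed values for illustration)
LAT_MIN, LAT_MAX = 37.8, 39.8
LON_MIN, LON_MAX = -79.6, -74.9

def diagnose(lat, lon):
    """Return a list of likely problems with one coordinate pair."""
    problems = []
    if LON_MIN <= lat <= LON_MAX and LAT_MIN <= lon <= LAT_MAX:
        problems.append("swapped lat/lon")          # pair fits the box when reversed
    elif LAT_MIN <= lat <= LAT_MAX and LON_MIN <= -lon <= LON_MAX:
        problems.append("longitude missing '-'")    # fits the box once negated
    elif not (LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX):
        problems.append("outside geographic scope") # e.g. a misplaced decimal point
    return problems

points = [
    (39.29, -76.61),    # fine
    (-76.61, 39.29),    # swapped pair
    (39.29, 76.61),     # missing '-'
    (39.29, -768.9992), # misplaced decimal
]
results = [diagnose(lat, lon) for lat, lon in points]
```

A check like this catches gross errors in bulk, but it still pays to eyeball the points in Google Earth, since a coordinate can sit inside the bounding box and still be in the wrong creek.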
Spelling Mistakes. Spelling errors might seem trivial, but they are a common issue in crowdsourced citizen science data. Take, for example, a qualitative category for weather, where allowed values include sunny, partly sunny, or cloudy. Your dataset might have erroneous or misspelled observations like overcast, partly cloudy, or suny (sic). A really nice tool for finding these pesky values is Google Sheets’ UNIQUE function (Excel’s solution is a bit more laborious). Once UNIQUE has surfaced a misspelling, use Ctrl + F to search for the value and correct it.
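In a script, the equivalent of UNIQUE is a distinct-value count, followed by a hand-built correction map. This is a sketch with invented observations; which correction each bad value maps to (e.g. folding “partly cloudy” into “partly sunny”) is a judgment call you make after reviewing the list, not something the code can decide.

```python
from collections import Counter

observations = ["sunny", "partly sunny", "cloudy", "suny",
                "partly cloudy", "sunny", "overcast"]
allowed = {"sunny", "partly sunny", "cloudy"}

# Equivalent of Google Sheets' UNIQUE: each distinct value with its count
counts = Counter(observations)
invalid = sorted(v for v in counts if v not in allowed)

# Hypothetical correction map, built by hand after reviewing `invalid`
corrections = {"suny": "sunny", "partly cloudy": "partly sunny",
               "overcast": "cloudy"}
cleaned = [corrections.get(v, v) for v in observations]
```

As always with non-scripted edits, if you do this in a spreadsheet instead, record the substitutions you made in your documentation.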
Outlier values, spatial errors, and spelling mistakes represent a large portion of common data issues, and checking for these three, in addition to your dataset’s specific concerns, is often sufficient QA/QC. However, as you become better acquainted with your dataset during the QA/QC process, you might uncover problems that weren’t obvious during data collection. Here are a couple of issues I found in my SSO dataset.
SSO events increased dramatically after 2016. Knowing your dataset and its real-world context can help explain quirky inconsistencies. A good example is the sharp rise in SSOs after 2016. Baltimore was sued by the EPA in 2002, and after 14 years of inaction, the City got its act together in 2016. Given that actual SSO events were relatively consistent over time, and that scrutiny of SSOs increased in 2016, it’s clear that reporting quality, not spill frequency, is what rose. Taking this into account, I decided to cut all events before 2016 so as not to skew any visualizations or statistics derived from the data.
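Cutting the pre-2016 records is a one-line filter in code. A minimal sketch, with hypothetical event records standing in for the real dataset:

```python
from datetime import date

# Hypothetical SSO records as (event_date, gallons) pairs
events = [
    (date(2014, 5, 1), 250),
    (date(2016, 3, 12), 99),
    (date(2018, 7, 4), 1200),
]

# Keep only events from 2016 onward, when reporting quality improved
cutoff = date(2016, 1, 1)
recent = [e for e in events if e[0] >= cutoff]
```

Whatever tool you use for the cut, the key step is documenting the cutoff and the reasoning behind it, so anyone downstream knows the earlier records were excluded deliberately.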
Statistical artifact of 99 gallons. The SSO dataset is sourced from self-reporting municipalities, who are subject to fines when an SSO occurs and additional fines if the spill is over 99 gallons. As a consequence, there is an abundance of 0s and 99s in the dataset (roughly 300 out of 1,400 observations). There are a variety of ways to fix this issue. One method is to remove all 99s, take an average of the remaining values, and apply that average to all observations containing 99. Another technique is to split the 99 observations into thirds and assign the first, second, and third quartile values of your dataset to each of the three bins. For the purposes of this project I took the easy route: I left the values in and simply documented the artifact in subsequent data products. If you plan to leave errors in your data, first evaluate the end use. If the plan is to send the data off to someone else, it’s not ideal to leave such an egregious error, but for my visualization purposes it was acceptable.
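For anyone who does want to impute rather than document, both techniques above can be sketched in a few lines. The spill volumes here are invented, and the quartiles come from Python’s `statistics.quantiles` with its default method; neither approach recovers the true spill sizes, so document whichever you pick.

```python
import statistics

# Hypothetical spill volumes in gallons; 99 marks the fine-threshold artifact
volumes = [10, 99, 45, 99, 500, 99, 250, 80, 99, 99, 99, 30]

real = [v for v in volumes if v != 99]
q1, q2, q3 = statistics.quantiles(real, n=4)  # quartiles of the non-99 values

# Method 1: replace every 99 with the mean of the remaining values
mean_fill = [v if v != 99 else statistics.mean(real) for v in volumes]

# Method 2: split the 99s into thirds; assign Q1, Q2, Q3 to each third in turn
indices = [i for i, v in enumerate(volumes) if v == 99]
third = len(indices) / 3
binned = volumes[:]
for rank, i in enumerate(indices):
    binned[i] = (q1, q2, q3)[min(int(rank // third), 2)]
```

Method 2 at least preserves some spread in the imputed values, where Method 1 collapses all 300-odd artifacts onto a single number, which will visibly distort a histogram.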
It can be difficult to catch and fix every error in a dataset. Remember that, like anything, data analysis is a learning process and the first go-round isn’t always perfect. It’s OK if you uncover more issues during your data analysis phase, so long as you fix and document the problems. Documentation will quicken your learning process by making you aware of the mistakes you made previously and how to catch them earlier next time.
If you missed the last piece about data collection, you can read it here. Next Friday’s piece on data analysis will contain the tofu and potatoes of this series so make sure to check it out!