Questions posed during the starting process should inform what datasets you need and where you can access them. For water quality organizations, it’s helpful to differentiate between internal data sources, such as your own water quality monitoring data, and external datasets, such as weather, pollution, and demographic data.
When collecting and organizing internal data, it’s crucial to gather and review all information about sampling methodology. This is especially important if you’re combining data across historical periods. For example, did your organization switch from collecting samples at the surface to the middle of the water column? Did you change your dissolved oxygen probe from one manufacturer to another? Did your station move from the east side to the west side of a river? Changes like these can have significant statistical implications. At the very least, document these differences so that when it comes time to analyze your data, you’re aware of possible statistical artifacts.
Complementing your internal datasets with external sources is a great way to add depth and nuance to your own data. Look for datasets from federal and state agencies. These sources tend to be more reliable and typically offer greater spatial coverage, broader timeframes, and finer data granularity.
Explore this shortlist of federal sources, including weather, demographic, and water quality data:
- National Water Quality Portal
- NOAA Weather
- USGS Monitoring Stations
- EPA Toxic Release Inventory
- American Fact Finder
Whether you’re downloading external files or uploading internal data, store them in a ‘raw data’ folder and make a ‘working’ copy before you begin any editing. This ensures you have a fallback when you inevitably make a mistake during the analysis process. The next step is to read all metadata associated with your datasets.
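As a quick illustration of that raw-versus-working split, here is a minimal Python sketch. The folder names and the file pattern are placeholders, not a required structure; adapt them to however your project is organized.

```python
import shutil
from pathlib import Path

RAW = Path("data/raw")          # downloads and exports land here and are never edited
WORKING = Path("data/working")  # all cleaning and analysis happens on these copies
WORKING.mkdir(parents=True, exist_ok=True)

# Copy every raw file into the working folder before touching it,
# so the originals remain an untouched fallback.
for raw_file in RAW.glob("*.csv"):
    shutil.copy2(raw_file, WORKING / raw_file.name)
```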
While researching Baltimore’s sewage issue, I decided to combine multiple data sources to construct a narrative around a particular period of intense precipitation, sewage overflows, and bacteria levels. I collected three datasets: water quality data from Blue Water Baltimore, precipitation data from NOAA, and sewage overflow data from the Maryland Department of the Environment.
Good data will come with some form of metadata: a file documenting variables, creation date, known data issues, and so on. Many mistakes can be avoided simply by familiarizing yourself with the metadata. Note in the NOAA metadata example above that ‘9999’ indicates missing data. Sentinel values like this are important to catch during data processing because they can significantly skew statistics and visualizations, particularly averages and charts. As a test, you should be able to explain what each row and column represents, along with any pitfalls or limitations of your dataset. If your dataset doesn’t include any metadata, generate your own!
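Here is a hedged pandas sketch of converting a sentinel like ‘9999’ into a true missing value before computing any statistics. The file name and the `PRCP` column are assumptions standing in for your own precipitation export, not NOAA’s exact schema.

```python
import numpy as np
import pandas as pd

# Hypothetical NOAA-style precipitation export; column name "PRCP" is an assumption.
precip = pd.read_csv("data/working/noaa_precipitation.csv")

# Treat the sentinel as truly missing; otherwise a handful of 9999s
# will wildly inflate any average you compute.
precip["PRCP"] = precip["PRCP"].replace(9999, np.nan)

# .mean() skips NaN values by default, so this now reflects real observations only.
print("Mean daily precipitation:", precip["PRCP"].mean())
```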
In addition to the provided metadata, I listed out my datasets and the key information that would shape my data processing and cleaning. The simple act of having all this information in one place can kickstart your thinking about what work needs to happen before any analysis occurs.
Note that I included a bullet for concerns to be mindful of later on. The Blue Water Baltimore Enterococcus data was particularly difficult to manage because it was collected at semi-regular weekly intervals, unlike the daily sewage and precipitation data. I decided to calculate a monthly share of stations exceeding a particular threshold instead of a daily or weekly average, a solution that struck the best balance between effort and data granularity. Irregular sampling is common with citizen science data, which rarely follows perfectly consistent sampling cycles. Data conundrums like these are often difficult to solve in one day, so I find it useful to list my concerns and then ruminate on how best to address them. Decisions made early on can make or break your project, so spend quality time thinking through your data concerns.
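To make that aggregation concrete, here is a minimal pandas sketch of rolling irregular weekly samples up to a monthly share of stations exceeding a threshold. The file name, column names, and the threshold value are all illustrative assumptions, not Blue Water Baltimore’s actual schema or standard.

```python
import pandas as pd

# Hypothetical Enterococcus export; column names and threshold are assumptions.
samples = pd.read_csv(
    "data/working/enterococcus.csv",
    parse_dates=["sample_date"],
)

THRESHOLD = 104  # illustrative bacteria threshold (MPN/100 mL); substitute your own standard

# Flag each sample, then compute the share of stations exceeding the
# threshold in each month, smoothing over the semi-regular weekly sampling.
samples["exceeds"] = samples["result_value"] > THRESHOLD
monthly = (
    samples
    .groupby([samples["sample_date"].dt.to_period("M"), "station_id"])["exceeds"]
    .any()             # did the station exceed at least once that month?
    .groupby(level=0)
    .mean()            # fraction of stations exceeding, per month
)
print(monthly.head())
```

The trade-off here is the one described above: a monthly summary sacrifices granularity, but it avoids pretending the weekly citizen science samples line up neatly with daily precipitation and overflow records.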