Drilling Detection with Machine Learning Part 1: Getting to Know The Training Data
Intern Sasha Bylsma explains the first steps in teaching computers how to detect oil and gas well pads from satellite imagery.
[This blog post is the first in a three-part series describing SkyTruth’s effort to automate the detection of oil and gas well pads around the world using machine learning. This tool will allow local communities, conservationists, researchers, policymakers, and journalists to see for themselves the growth of drilling in the areas they care about. By the end of the series, we will have demystified the basic principles of machine learning applied to satellite imagery, and we will have provided a technical step-by-step guide for others to follow. At SkyTruth, we aim to inspire people to protect the environment, and to educate those who want to build on our work by developing their own applications that protect people and the planet.]
Detecting environmental disturbances and land use change requires two things: the right set of technological tools to observe the Earth, and a dedicated team to discover, analyze, and report those changes. SkyTruth has both. It is this process of discovery, analysis, and publication, the indisputable transparency that SkyTruth offers by bringing to light the when, the where, and hopefully the who of environmental wrongdoing, that appealed to me most about this organization and ultimately led me to apply for its internship program this summer.
In my first weeks as an intern, I was tasked with analyzing dozens of ocean satellite images, searching for oil slicks left behind by vessels, which show up as long black streaks on the sea surface. As a student with an emerging passion for Geographic Information Systems (GIS), I was eager to find a more efficient way to scan the oceans for pollution. I wished I could simply press a “Next” button and have tiles of imagery presented to me, so I could search for patterns of oil quickly. It was a relief to find out that my coworkers were developing such a solution. Instead of relying on me and others to scan imagery and recognize the patterns of oil, they were training a computer to do it for us, a project called Cerulean. They were using machine learning and computer vision to teach a model the visual characteristics of oil slicks, pixel by pixel, by giving it many examples. I was eager to get involved in this kind of work, so I asked to join a different project that also uses machine learning: detecting new oil and gas drilling sites being built in important habitat areas in Argentina.
Creating the training data
One of my first tasks was to organize a handful of existing polygons that we would use to create training data, which is what we call the information used to teach the model. Using Google Earth Engine, I placed the polygons over imagery collected by the European Space Agency’s Sentinel-2 satellites. The imagery is fairly coarse: these satellites collect images at 10-meter spatial resolution, meaning that each pixel covers a 10-meter-by-10-meter square (100 square meters) on the ground. The imagery can also be quite cloudy, so one of the first things I had to do was remove the clouds by creating cloud-free composite images. This combines several images of the same place taken at different times, keeping only the pixels that aren’t cloudy. That let me create a single, cloud-free image of each of our sample areas. Once we’d done that, we were ready to make examples for the model to take in.
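If you’re curious what that compositing step can look like in practice, here is a minimal sketch using the Earth Engine Python API. The Sentinel-2 collection ID and the QA60 cloud bits are real, but the date range and region below are illustrative placeholders rather than our exact parameters.

```python
import ee

ee.Initialize()

def mask_s2_clouds(image):
    """Mask out cloudy pixels using Sentinel-2's QA60 bitmask band."""
    qa = image.select('QA60')
    cloud_bit = 1 << 10   # bit 10: opaque clouds
    cirrus_bit = 1 << 11  # bit 11: cirrus clouds
    mask = (qa.bitwiseAnd(cloud_bit).eq(0)
            .And(qa.bitwiseAnd(cirrus_bit).eq(0)))
    return image.updateMask(mask)

# Placeholder area of interest; in practice this covers each mapped region.
region = ee.Geometry.Rectangle([-104.9, 40.0, -104.6, 40.2])

composite = (ee.ImageCollection('COPERNICUS/S2')
             .filterBounds(region)
             .filterDate('2019-01-01', '2019-12-31')  # illustrative dates
             .map(mask_s2_clouds)
             .median()                     # per-pixel median of clear pixels
             .select(['B4', 'B3', 'B2']))  # red, green, blue bands at 10 m
```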
Figure 1 shows a visual representation of the process that my colleagues and I developed to create the training data. On the left is a view of two well pads in Colorado from the default satellite base map in Google Earth Engine. This is the same imagery you would see in the “Satellite” view of Google Maps; it’s very high-resolution commercial satellite imagery, so objects like these drilling sites are easy to see in great detail. In the middle is a Sentinel-2 image of the same well pads. Sentinel-2 imagery is freely and publicly available, and it is the imagery source we use for our model. On the right, the Sentinel-2 image is overlaid with the well pad polygons that I manually drew.
Figure 1: Overlaying well pad polygons onto Sentinel-2 images
From here, we want to select an area that captures each well pad and its surroundings. To do this, we take the center of each blue well pad polygon, create a 200-meter buffer around it, and drop a point at a random spot within that circular buffer zone; the point appears as a red dot in Figure 2, which illustrates this step.
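In code, that buffer-and-jitter step might look something like the sketch below, written with the shapely library and assuming the polygon’s coordinates are in a meter-based projection such as UTM. The function name is ours, for illustration only.

```python
import random
from shapely.geometry import Point, Polygon

def random_patch_center(pad: Polygon, radius_m: float = 200.0) -> Point:
    """Pick a random point within radius_m of the well pad's center."""
    zone = pad.centroid.buffer(radius_m)  # circular buffer around the centroid
    min_x, min_y, max_x, max_y = zone.bounds
    while True:  # rejection sampling: retry until we land inside the circle
        candidate = Point(random.uniform(min_x, max_x),
                          random.uniform(min_y, max_y))
        if zone.contains(candidate):
            return candidate
```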
Each red dot is then used as the center of a 256 pixel by 256 pixel square – what we call a patch – that will house the well pad and its surroundings. I’ve illustrated what this box looks like in Figure 3, using just the left well pad for simplicity.
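Once the composite has been pulled down as an array and the red dot converted to a row/column pixel index, cutting out the patch is a simple slice. The helper below is an illustrative sketch, not our exact code.

```python
import numpy as np

def extract_patch(image: np.ndarray, row: int, col: int,
                  size: int = 256) -> np.ndarray:
    """Slice a size-by-size patch centered on pixel (row, col)."""
    half = size // 2
    patch = image[row - half:row + half, col - half:col + half]
    if patch.shape[:2] != (size, size):
        raise ValueError('patch center is too close to the image edge')
    return patch
```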
Next, we need to classify the image into “well pad” and “no well pad” labels. We create a binary mask, with white representing well pad areas and black representing everything else. This mask covers the entire image; Figure 4 is a closeup of the two well pads we’ve been looking at, with the bounding box visible as well.
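One common way to produce such a mask is to rasterize the polygons, for example with the rasterio library. The sketch below assumes the polygons and the image’s affine transform share the same coordinate system; the function name is ours, for illustration.

```python
import numpy as np
from rasterio import features

def make_label_mask(pad_geometries, transform, height, width) -> np.ndarray:
    """Burn well pad polygons into a binary mask: 1 = well pad, 0 = background."""
    return features.rasterize(
        ((geom, 1) for geom in pad_geometries),  # burn value 1 inside pads
        out_shape=(height, width),
        transform=transform,  # maps pixel coordinates to map coordinates
        fill=0,               # everything else stays 0 ("no well pad")
        dtype='uint8',
    )
```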
Finally, let’s zoom in to the extent of the white bounding box and put it all together.
This pair of small pictures in Figure 5 – the image patch on the left and its label on the right – is what the model sees. Well, almost. Let’s break it down. Every image is made up of pixels, and every pixel has three numerical values: the amounts of red, green, and blue in that pixel. If you imagine each pixel as being three numbers deep, you can picture the colored image on the left as an array with dimensions 256 x 256 x 3. The image on the right is an array with dimensions 256 x 256 x 1, since it has just one channel storing the label: 0s for black pixels and 1s for white pixels. One of these pairs – an image and its label – constitutes a single example that will go into the model.
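A toy snippet makes those shapes concrete; the random values here simply stand in for a real patch, and the mask footprint is hypothetical.

```python
import numpy as np

# Random values stand in for a real patch; the shapes are what matter here.
image = np.random.randint(0, 256, size=(256, 256, 3), dtype=np.uint8)  # RGB
label = np.zeros((256, 256, 1), dtype=np.uint8)  # one channel of 0s and 1s
label[100:140, 90:150, 0] = 1  # hypothetical well pad footprint

print(image.shape, label.shape)  # (256, 256, 3) (256, 256, 1)
```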
Thousands of examples
In order for the model to learn, we needed to create thousands of examples. So we mapped well pads in Colorado, New Mexico, Utah, Nevada, Texas, Pennsylvania, West Virginia, and Argentina to use for training examples. My team and I kept a couple of additional things in mind. First, we created a dataset of “look-alike” well pads: areas the model could easily mistake for a well pad, such as square parking lots, plots of farmland, and housing developments, and made labels for them. These examples tell the model that although such features look similar, they are not well pads, which strengthens the model by refining its definition of what a well pad is: we show it what a well pad isn’t.
Second, we made sure to capture variation in the appearance of well pads. While some are very bright in contrast to their landscape, others are darker than their surroundings, and some, especially in the American West, are essentially a dirt patch on desert land. By collecting training data on both the obvious well pads and the harder-to-distinguish ones, we added variance and complexity to the model. Since in the real world some well pads are old, some have been overgrown by vegetation, and some are covered with equipment, it’s important to include examples of these special cases in the training data so that the model can recognize well pads regardless of the condition they might be in.
Prepping the training data for the model
To complete the process of preparing the training data, I packaged up the examples as TFRecords, a data format that is ideal for working with TensorFlow, a popular machine learning platform. These TFRecords will be fed into the model so that it can learn the visual characteristics of well pads well enough to be able to detect drilling sites in previously unseen imagery.
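To make that packaging step concrete, here is a minimal sketch of serializing one image/label pair with TensorFlow’s TFRecord writer. The feature names, output filename, and placeholder arrays are illustrative, not our exact schema.

```python
import numpy as np
import tensorflow as tf

# Placeholder pair; in practice these are the patch and mask described above.
image = np.zeros((256, 256, 3), dtype=np.uint8)
label = np.zeros((256, 256, 1), dtype=np.uint8)

def to_example(image: np.ndarray, label: np.ndarray) -> tf.train.Example:
    """Wrap one image/label pair as a tf.train.Example."""
    def bytes_feature(array):
        return tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[array.tobytes()]))
    return tf.train.Example(features=tf.train.Features(feature={
        'image': bytes_feature(image),  # 256 x 256 x 3 uint8
        'label': bytes_feature(label),  # 256 x 256 x 1 uint8
    }))

with tf.io.TFRecordWriter('well_pads.tfrecord') as writer:
    writer.write(to_example(image, label).SerializeToString())
```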
Now that we’ve discussed how to develop the training data, Geospatial Analyst Brendan Jarrell will explain how we developed our model in the second post in this series.