Training Computers to See What We See

To analyze satellite data for environmental impacts, computers need to be trained to recognize objects.

The vast quantities of satellite image data available these days provide tremendous opportunities for identifying environmental impacts from space. But for mere humans, there’s simply too much: there are only so many hours in the day. So at SkyTruth, we’re teaching computers to analyze many of these images for us, a process called machine learning. The potential for advancing conservation with machine learning is tremendous. Once taught, computers can potentially detect features such as roads infiltrating protected areas, logging decimating wildlife habitat, mining operations growing beyond permit boundaries, and other landscape changes that reveal threats to biodiversity and human health. Interestingly, the techniques we use to train computers rely on the same strategies that people use to identify objects.

Common Strategies for Detecting Objects

When people look at a photograph, they find it quite easy to identify shapes, features, and objects based on a combination of previous experience and context clues in the image itself. When a computer program is asked to describe a picture, it relies on the same two strategies. In the image above, both humans and computers attempting to extract meaning and identify object boundaries would use similar visual cues:

  • Colors (the bedrock is red)
  • Shapes (the balloon is oval)
  • Lines (the concrete has seams)
  • Textures (the balloon is smooth)
  • Sizes (the feet are smaller than the balloon)
  • Locations (the ground is at the bottom)
  • Adjacency (the feet are attached to legs)
  • Gradients (the bedrock has shadows)

While none of the observations in parentheses capture universal truths, they are useful heuristics: if you have enough of them, you can have some confidence that you’ve interpreted a given image correctly.

Pixel Mask

If our objective is to make a computer program that can find the balloon in the picture above as well as a human can, then we first need a way to compare the performances of computers and humans. One solution is to ask both a person and a computer to identify, or “segment,” all the pixels that are part of the balloon. If results from the computer agree with those from the person, then it is fair to say that the computer has found the balloon. The results are captured in an image called a “mask,” in which every pixel is either black (not balloon) or white (balloon), like the following image.

However, unlike humans, most computers don’t wander around and collect experiences on their own. Computers require datasets of explicitly annotated examples, called “training data,” to learn to identify and distinguish specific objects within data. The black and white mask above is one such example. After seeing enough examples of an object, a computer will have embedded some knowledge about what differentiates balloons from their surroundings.
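To make the mask comparison concrete, here is a minimal Python sketch that scores how closely a computer's mask agrees with a person's. The tiny 4×4 masks and the `pixel_agreement` function are hypothetical, invented for illustration; 1 stands for white (balloon) and 0 for black (not balloon).

```python
def pixel_agreement(human_mask, computer_mask):
    """Return the fraction of pixels on which two binary masks agree."""
    agree = 0
    total = 0
    for human_row, computer_row in zip(human_mask, computer_mask):
        for human_pixel, computer_pixel in zip(human_row, computer_row):
            agree += human_pixel == computer_pixel
            total += 1
    return agree / total


# Hypothetical masks: the person marks a 2x2 balloon,
# and the computer misses one of its four pixels.
human_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
computer_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]

agreement = pixel_agreement(human_mask, computer_mask)
print(f"Masks agree on {agreement:.1%} of pixels")
```

Here the two masks differ on 1 of 16 pixels, so they agree on 15 of 16.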

Well Pad Discovery

At SkyTruth, we are starting our machine learning process with oil and gas well pads. Well pads are the base of operations for most active oil and gas drilling sites in the United States, and we are identifying them as a way to quantify the impact of these extractive industries on the natural environment and neighboring communities. Well pads vary greatly in how they appear. Just take a look at how different these three are from each other.

Given this diversity, we need to provide the computer many examples, so that the machine learning model we are creating can distinguish between important features that characterize well pads (e.g. having an access road) and unimportant ones that are allowed to vary (e.g. the shape of the well pad, or the color of its surroundings). Our team generates masks (the black and white pixel labels) for these images by hand, and inputs them as “training data” into the computer. We provide both the image and its mask separately to the machine learning model, but for the rest of this post we will superimpose the mask in blue.

Finally, our machine learning model looks at each image (about 120 of them), learns a little bit from the mask provided with it, and then moves on to the next image. After looking at each picture once, it has already reached 92% accuracy. But we can then tell it to go back and look at each one again (about 30 times), and add a little more detail to its learning, until it reaches almost 98% accuracy.
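The rhythm of that training loop can be sketched in plain Python. This is a toy under loud assumptions, not the model described in this post: the "images" are short lists of random brightness values, the "mask" simply marks pixels brighter than 0.5, and the learner is a one-feature logistic classifier. Every name and number other than the 120 images and 30 passes is made up for illustration.

```python
import math
import random


def make_toy_dataset(n_images=120, pixels_per_image=16, seed=0):
    """Build hypothetical (image, mask) pairs; the mask marks bright pixels."""
    rng = random.Random(seed)
    dataset = []
    for _ in range(n_images):
        image = [rng.random() for _ in range(pixels_per_image)]
        mask = [1 if value > 0.5 else 0 for value in image]
        dataset.append((image, mask))
    return dataset


def predict(weight, bias, value):
    """Estimated probability that a pixel belongs in the mask."""
    return 1.0 / (1.0 + math.exp(-(weight * (value - 0.5) + bias)))


def pixel_accuracy(weight, bias, dataset):
    """Fraction of pixels whose thresholded prediction matches the mask."""
    correct = 0
    total = 0
    for image, mask in dataset:
        for value, label in zip(image, mask):
            guess = 1 if predict(weight, bias, value) > 0.5 else 0
            correct += guess == label
            total += 1
    return correct / total


def train(dataset, passes=30, learning_rate=0.5):
    """Look at every image once per pass, nudging the model a little each time."""
    weight, bias = 0.0, 0.0
    history = []
    for _ in range(passes):
        for image, mask in dataset:
            for value, label in zip(image, mask):
                error = label - predict(weight, bias, value)
                weight += learning_rate * error * (value - 0.5)
                bias += learning_rate * error
        history.append(pixel_accuracy(weight, bias, dataset))
    return weight, bias, history


dataset = make_toy_dataset()
weight, bias, history = train(dataset)
print(f"accuracy after pass 1: {history[0]:.3f}, after pass 30: {history[-1]:.3f}")
```

Printing `history` shows how pixel accuracy changes from one pass to the next. The exact values depend on the toy data and are not comparable to the 92% and 98% figures above.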

After the model is trained, we can feed it raw satellite images and ask it to create a mask that identifies all the pixels belonging to any well pads in the picture. Here are some actual outputs from our trained machine learning model:

The top three images show well pads that were correctly identified and fairly well masked; note the blue mask overlaying the well pads. The bottom three images do not contain well pads. Our model ignores forests, fields, and houses very well in the first two images, but it is a little confused by parking lots: it has incorrectly masked the parking lot in the third image in blue, as if it were a well pad. This is understandable, as parking lots share many features with well pads: they are usually rectangular and gray, contain vehicles, and have an access road. This is not the end of the machine learning process; rather, it is a first pass that tells us we need to capture more images of parking lots and further train the model to treat them as negative examples.

Image segmentation presents a number of challenges that we need to mitigate.

Biased Training Data

Predictions that the computer makes are based solely on training data, so it is possible for idiosyncrasies in the training data set to be encoded (unintentionally) as meaningful. For instance, imagine a model that detects a person’s happiness from a picture of their face. If it is only shown open-mouth smiles in the training data, then it is possible that when presented with real world images, it classifies closed-mouth smiles as unhappy.

This challenge often affects a model in unanticipated ways because those biases can be inherent in the data scientist. We try to mitigate this by making sure that our training dataset comes from the same set of images as those that we need the model to classify automatically. Two examples of how biased data might creep into our work are training a machine learning model on well pads in Pennsylvania and then asking it to identify pads in California (bias in the data source), or training a model on well pads centered in the picture and then asking it to identify them when they are halfway out of the image (bias in the data preprocessing).

Garbage In, Garbage Out

The predictions that the computer makes can only be as good as the samples that we provide in the training data. For instance, if the person responsible for training accidentally includes the string of a balloon in half of the images created for the training dataset and excludes it in the other half, then the model will be confused about whether or not to mask the string in its predictions. We try to mitigate this by adhering to strict, explicit guidelines about what constitutes the boundary of a well pad.

Measuring Success

In most other machine learning systems, it is useful to measure success as a product of two factors. First, was the guess right or wrong? And second, how confident was the guess? However, in image segmentation, that is not a great metric, because the model can be overwhelmed by an imbalance between the number of pixels in each class. For instance, imagine the task is to find a single ant on a large white wall. Out of 1,000 pixels, only 1 is gray; the other 999 are white. If your model searches long and hard and produces a mask that marks that one pixel correctly, then it gets 100% accuracy. However, a much simpler model would say there is no ant, that every pixel is white wall, and get rewarded with 99.9% accuracy. This second model is practically unusable, but is very easy for a training algorithm to achieve.
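The ant-on-a-wall arithmetic can be checked with a few lines of Python. This is only an illustration: the flat list of 1,000 pixels and the names `wall`, `careful_guess`, and `lazy_guess` are hypothetical, with 1 standing for the ant and 0 for white wall.

```python
def accuracy(truth, prediction):
    """Fraction of pixels where the predicted label matches the true label."""
    matches = sum(t == p for t, p in zip(truth, prediction))
    return matches / len(truth)


wall = [0] * 999 + [1]     # 999 wall pixels and a single ant pixel
careful_guess = list(wall)  # a model that actually finds the ant
lazy_guess = [0] * 1000     # a model that always answers "no ant"

careful_accuracy = accuracy(wall, careful_guess)
lazy_accuracy = accuracy(wall, lazy_guess)
print(f"careful: {careful_accuracy:.1%}, lazy: {lazy_accuracy:.1%}")
```

The lazy model misses the only pixel that matters and still scores within a tenth of a percentage point of the careful one.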

We mitigate this issue by using a metric known as the F-beta score, which for our purposes keeps very small objects from being ignored in favor of very large ones. If you’re hungry for a more technical explanation of this metric, check out the Wikipedia page.
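For readers who prefer code to formulas, here is a small sketch of the F-beta score computed from pixel counts, using the standard definition in terms of true positives, false positives, and false negatives. The value of `beta` is an assumption, since this post does not say which value is used, and returning 0.0 when there is nothing to score is a convention chosen for this sketch.

```python
def f_beta(truth, prediction, beta=1.0):
    """F-beta score for binary labels, where 1 marks the object of interest."""
    pairs = list(zip(truth, prediction))
    true_pos = sum(1 for t, p in pairs if t == 1 and p == 1)
    false_pos = sum(1 for t, p in pairs if t == 0 and p == 1)
    false_neg = sum(1 for t, p in pairs if t == 1 and p == 0)
    beta_sq = beta ** 2
    denominator = (1 + beta_sq) * true_pos + beta_sq * false_neg + false_pos
    if denominator == 0:
        return 0.0  # no object pixels in the truth or in the prediction
    return (1 + beta_sq) * true_pos / denominator


# The ant-on-a-wall example again: 1 = ant, 0 = white wall.
wall = [0] * 999 + [1]
no_ant_guess = [0] * 1000
perfect_guess = list(wall)

lazy_score = f_beta(wall, no_ant_guess)
careful_score = f_beta(wall, perfect_guess)
print(f"lazy: {lazy_score}, careful: {careful_score}")
```

Because the score counts only pixels involving the object, the "no ant" guess drops from 99.9% accuracy to an F-beta of 0, while the perfect guess scores 1. A larger `beta` weights missed object pixels more heavily than false alarms.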

Next Steps

In the coming weeks we will be creating an online environment in which our machine learning model can be installed and fed images with minimal human guidance. Our objective is to create two pipelines: the first allows training data to flow into the model, so it can learn. The second allows new images from satellites to flow into the model, so it can perform image segmentation and tell us the acreage dedicated to these well pads.
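The last step, turning a predicted mask into acreage, comes down to counting masked pixels and multiplying by the ground area each pixel covers. The sketch below assumes square pixels 10 meters on a side, a resolution picked purely for illustration since this post does not state one; `mask_acreage` and `predicted_mask` are hypothetical names.

```python
SQUARE_METERS_PER_ACRE = 4046.8564224  # international acre


def mask_acreage(mask, pixel_size_m):
    """Acres covered by the pixels marked 1 in a binary mask of square pixels."""
    pad_pixels = sum(sum(row) for row in mask)
    return pad_pixels * pixel_size_m ** 2 / SQUARE_METERS_PER_ACRE


# Hypothetical model output: four pixels flagged as well pad.
predicted_mask = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]

acres = mask_acreage(predicted_mask, pixel_size_m=10.0)
print(f"Estimated well pad area: {acres:.3f} acres")
```

With these assumed numbers, four 100-square-meter pixels come to just under a tenth of an acre. A real pipeline would read the pixel size from the image's georeferencing metadata rather than hard-coding it.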

We’ll keep you posted as our machine learning technology progresses. 

Note: Title was updated 10/2/19