New Data Available on the Footprint of Surface Mining in Central Appalachia

The area of Central Appalachia impacted by surface mining has increased — by an amount equal to the size of Liechtenstein — despite a decline in coal production.

SkyTruth is releasing an update for our Central Appalachian Surface Mining data showing the extent of surface mining in Central Appalachia. While new areas continue to be mined, adding to the cumulative impact of mining on Appalachian ecosystems, the amount of land being actively mined has declined slightly.

This data builds on our work published last year in the journal PLOS ONE, in which we produced the first map ever to show the footprint of surface mining in this region. We designed the data to be updated annually. Today we are releasing the data for 2016, 2017, and 2018.

Mountaintop mine near Wise, Virginia. Copyright Alan Gignoux; Courtesy Appalachian Voices; 2014.

Coal production from surface mines, as reported to the US Energy Information Administration (EIA), has declined significantly for the Central Appalachian region since its peak in 2008. Likewise, the area of land being actively mined each year has steadily decreased since 2007. But because new land continues to be mined each year, the overall disturbance to Appalachian ecosystems has increased. From 2016 to 2018, the newly mined areas combined totaled 160 square kilometers – an area the size of the nation of Liechtenstein. One of the key findings of our research published in PLOS ONE was that the amount of land required to extract a single metric ton of coal had tripled from approximately 10 square meters in 1985 to nearly 30 square meters in 2015. Our update indicates that this trend still holds true for the 2016-2018 period: Despite the overall decrease in production, in 2016 approximately 40 square meters of land were disturbed per metric ton of coal produced – an all-time high. This suggests that it is getting harder and harder for companies to access the remaining coal.

Active mine area (blue) and reported surface coal mine production in Central Appalachia (red) as provided by the US Energy Information Administration (EIA). The amount of coal produced has declined much more dramatically than the area of active mining.

This graph shows the disturbance trend for surface coal mining in Central Appalachia. Disturbance is calculated by dividing the area of actively mined land by the reported coal production for Central Appalachia as provided by the EIA.
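The disturbance calculation described above is simple enough to sketch in a few lines of code. The input figures below are illustrative placeholders, not actual EIA values.

```python
# Sketch of the disturbance metric: actively mined area divided by
# reported coal production. Inputs here are made up for illustration.

def disturbance_per_ton(active_area_km2, production_tonnes):
    """Square meters of actively mined land per metric ton of coal."""
    area_m2 = active_area_km2 * 1_000_000  # km^2 -> m^2
    return area_m2 / production_tonnes

# e.g. 600 km^2 of active mining and 15 million tonnes of coal produced
# works out to 40 m^2 disturbed per tonne.
print(disturbance_per_ton(600, 15_000_000))  # 40.0
```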

Tracking the expansion of these mines is only half the battle. We are also developing landscape metrics to assess the true impact of mining on Appalachian communities and ecosystems. We are working to generate a spectral fingerprint for each identified mining area using satellite imagery. This fingerprint will outline the characteristics of each site, including the amount of bare ground present and information about vegetation regrowing on the site. In this way we will track changes and measure recovery by comparing the sites over time to a healthy Appalachian forest.

Mining activity Southwest of Charleston, WV. Land that was mined prior to 2016 is visible in yellow, and land converted to new mining activity between 2016 and 2018 is displayed in red.

Recovery matters. Under federal law, mine operators are required to post bonds for site reclamation in order “to ensure that the regulatory authority has sufficient funds to reclaim the site in the case the permittee fails to complete the approved reclamation plan.” In other words, mining companies set aside money in bonds to make sure that funds are available to recover their sites for other uses once mining ends. If state inspectors determine that mine sites are recovered adequately, then mining companies recover their bonds.

But the regulations are opaque and poorly defined; most states set their own requirements for bond release and requirements vary depending on the state, the inspector, and local landscapes. And as demand for coal steadily declines, coal companies are facing increasing financial stress, even bankruptcy. This underlines the importance of effective bonding that actually protects the public from haphazardly abandoned mining operations that may be unsafe, or unusable for other purposes.

We are now working to track the recovery of every surface coal mine in Central Appalachia. By comparing these sites to healthy Appalachian forests we will be able to grade recovery. This will allow us to examine how fully these sites have recovered, determine to what degree there is consistency in what qualifies for bond-release, and to what extent the conditions match a true Appalachian forest.

Training Computers to See What We See

To analyze satellite data for environmental impacts, computers need to be trained to recognize objects.

The vast quantities of satellite image data available these days provide tremendous opportunities for identifying environmental impacts from space. But for mere humans, there’s simply too much — there are only so many hours in the day. So at SkyTruth, we’re teaching computers to analyze many of these images for us, a process called machine learning. The potential for advancing conservation with machine learning is enormous. Once trained, computers can potentially detect features such as roads infiltrating protected areas, logging decimating wildlife habitat, mining operations growing beyond permit boundaries, and other landscape changes that reveal threats to biodiversity and human health. Interestingly, the techniques we use to train computers rely on the same cues people use to identify objects.

Common Strategies for Detecting Objects

When people look at a photograph, they find it quite easy to identify shapes, features, and objects based on a combination of previous experience and context clues in the image itself. When a computer program is asked to describe a picture, it relies on the same two strategies. In the image above, both humans and computers attempting to extract meaning and identify object boundaries would use similar visual cues:

  • Colors (the bedrock is red)
  • Shapes (the balloon is oval)
  • Lines (the concrete has seams)
  • Textures (the balloon is smooth)
  • Sizes (the feet are smaller than the balloon)
  • Locations (the ground is at the bottom)
  • Adjacency (the feet are attached to legs)
  • Gradients (the bedrock has shadows)

While none of the observations in parentheses capture universal truths, they are useful heuristics: if you have enough of them, you can have some confidence that you’ve interpreted a given image correctly.

Pixel Mask

If our objective is to make a computer program that can find the balloon in the picture above as well as a human can, then we first need to create a way to compare the performances of computers and humans. One solution is to task both a person and a computer to identify, or “segment,” all the pixels that are part of the balloon. If results from the computer agree with those from the person, then it is fair to say that the computer has found the balloon. The results  are captured in an image called a “mask,” in which every pixel is either black (not balloon) or white (balloon), like the following image.

However, unlike humans, most computers don’t wander around and collect experiences on their own. Computers require datasets of explicitly annotated examples, called “training data,” to learn to identify and distinguish specific objects within data. The black and white mask above is one such example. After seeing enough examples of an object, a computer will have embedded some knowledge about what differentiates balloons from their surroundings.
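The mask-comparison idea above can be made concrete with a toy example. The 5x5 “image” and the agreement function below are purely illustrative, not part of our production pipeline.

```python
# A tiny 5x5 "mask": 1 marks a balloon pixel, 0 marks background.
mask = [
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]

def pixel_agreement(mask_a, mask_b):
    """Fraction of pixels on which two masks agree."""
    total = sum(len(row) for row in mask_a)
    agree = sum(
        1
        for row_a, row_b in zip(mask_a, mask_b)
        for a, b in zip(row_a, row_b)
        if a == b
    )
    return agree / total

# Identical masks agree on every pixel:
print(pixel_agreement(mask, mask))  # 1.0
```

If a computer-generated mask agrees with the human-drawn one on nearly every pixel, it is fair to say the computer found the balloon.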

Well Pad Discovery

At SkyTruth, we are starting our machine learning process with oil and gas well pads. Well pads are the base of operations for most active oil and gas drilling sites in the United States, and we are identifying them as a way to quantify the impact of these extractive industries on the natural environment and neighboring communities. Well pads vary greatly in how they appear. Just take a look at how different these three are from each other.

Given this diversity, we need to provide the computer many examples, so that the machine learning model we are creating can distinguish between important features that characterize well pads (e.g. having an access road) and unimportant ones that are allowed to vary (e.g. the shape of the well pad, or the color of its surroundings). Our team generates masks (the black and white pixel labels) for these images by hand, and inputs them as “training data” into the computer. We provide both the image and its mask separately to the machine learning model, but for the rest of this post we will superimpose the mask in blue.

Finally, our machine learning model looks at each image (about 120 of them), learns a little bit from the mask provided with it, and then moves on to the next image. After looking at each picture once, it has already reached 92% accuracy. But we can then tell it to go back and look at each one again (about 30 times), and add a little more detail to its learning, until it reaches almost 98% accuracy.

After the model is trained, we can feed it raw satellite images and ask it to create a mask that identifies all the pixels belonging to any well pads in the picture. Here are some actual outputs from our trained machine learning model:

The top three images show well pads that were correctly identified, and fairly well masked — note the blue mask overlaying the well pads. The bottom three images do not contain well pads, and you can see that our model ignores forests, fields, and houses very well in the first two images, but is a little confused by parking lots — it has masked the parking lot in the third image in blue (incorrectly), as if it were a well pad. This is reasonable, as parking lots share many features with well pads — they are usually rectangular, gray, contain vehicles, and have an access road. This is not the end of the machine learning process; rather, it is a first pass that tells us we need to capture more images of parking lots and further train the model to treat them as negative examples.

When working on image segmentation, there are a number of challenges that we need to mitigate. 

Biased Training Data

Predictions that the computer makes are based solely on training data, so it is possible for idiosyncrasies in the training data set to be encoded (unintentionally) as meaningful. For instance, imagine a model that detects a person’s happiness from a picture of their face. If it is only shown open-mouth smiles in the training data, then it is possible that when presented with real world images, it classifies closed-mouth smiles as unhappy.

This challenge often affects a model in unanticipated ways because the biases can originate with the data scientist, who may not notice them. We try to mitigate this by making sure that our training dataset comes from the same set of images as those that we need to be automatically classified. Two examples of how biased data might creep into our work are: training a machine learning model on well pads in Pennsylvania and then asking it to identify pads from California (bias in the data source), or training a model on well pads centered in the picture, and then asking it to identify them when halfway out of the image (bias in the data preprocessing).

Garbage In, Garbage Out

The predictions that the computer makes can only be as good as the samples that we provide in the training data. For instance, if the person responsible for training accidentally includes the string of a balloon in half of the images created for the training dataset and excludes it in the other half, then the model will be confused about whether or not to mask the string in its predictions. We try to mitigate this by adhering to strict explicit guidelines about what constitutes the boundary of a well pad.

Measuring Success

In most other machine learning systems, it is useful to measure success as a combination of two factors. First, was the guess right or wrong? And second, how confident was the guess? However, in image segmentation, that is not a great metric, because the model can be overwhelmed by an imbalance between the number of pixels in each class. For instance, imagine the task is to find a single ant on a large white wall. Out of 1000 white pixels, only 1 is gray. If your model correctly masks that single gray pixel (and every white one), it earns 100% accuracy. However, a much simpler model could declare that there is no ant, that every pixel is white wall, and be rewarded with 99.9% accuracy. This second model is practically unusable, but is very easy for a training algorithm to achieve.

We mitigate this issue by using a metric known as the F-beta score, which for our purposes keeps very small objects from being ignored in favor of very large ones. If you’re hungry for a more technical explanation of this metric, check out the Wikipedia page.
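The ant-on-a-wall example above can be run as a short sketch. The `f_beta` function below is a plain implementation of the standard F-beta formula, not our production metric code.

```python
# The "ant on a white wall" example: 1,000 pixels, only one belongs to the ant.
truth = [0] * 999 + [1]      # 1 = ant pixel
all_wall = [0] * 1000        # a model that never finds the ant

accuracy = sum(t == p for t, p in zip(truth, all_wall)) / len(truth)
print(accuracy)  # 0.999 -- looks great, but the ant was missed entirely

def f_beta(truth, pred, beta=1.0):
    """Standard F-beta score for binary masks flattened to lists."""
    tp = sum(t == 1 and p == 1 for t, p in zip(truth, pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(truth, pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(truth, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(f_beta(truth, all_wall))  # 0.0 -- the F-score exposes the useless model
```

Because the F-beta score ignores true negatives (all those correctly labeled wall pixels), the lazy “no ant anywhere” model scores zero instead of 99.9%.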

Next Steps

In the coming weeks we will be creating an online environment in which our machine learning model can be installed and fed images with minimal human guidance. Our objective is to create two pipelines: the first allows training data to flow into the model, so it can learn. The second allows new images from satellites to flow into the model, so it can perform image segmentation and tell us the acreage dedicated to these well pads.

We’ll keep you posted as our machine learning technology progresses.

Update 2019-12-13:

In a major step forward, we set up Google SQL and Google Storage environments to house a larger database of training data, containing over 2000 uniquely generated polygons that cover multiple states in the Colorado River Basin. The GeoJSON is publicly available for download. These data were used as fodder for a deep learning neural network, which was trained in this IPython notebook. We reached Dice scores of up to 86.3%. The trained models were then used to run inference on sites that were permitted for drilling, to identify the extent of the well pads, in this second IPython notebook.
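For readers unfamiliar with the Dice score mentioned above, here is a minimal sketch of how it compares a predicted mask to a reference mask. The two example masks are invented for illustration.

```python
def dice(mask_a, mask_b):
    """Dice coefficient for two flat binary masks: 2|A n B| / (|A| + |B|)."""
    intersection = sum(a and b for a, b in zip(mask_a, mask_b))
    return 2 * intersection / (sum(mask_a) + sum(mask_b))

# Two invented 8-pixel masks that agree on 3 of their 4 "well pad" pixels:
reference = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0]
print(dice(reference, predicted))  # 0.75
```

Like the F-beta score, Dice ignores true negatives, so a model cannot inflate its score by correctly labeling vast stretches of empty background.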

Visualizing the Expansion of Fracking in Pennsylvania: Part 3

If you have been following the first two posts in this series, you have been introduced to Pennsylvania’s hottest commodity: natural gas. The state has experienced a drilling boom with the development of the Utica and Marcellus shale formations, which underlie approximately 60% of the state. With dry natural gas reserves estimated at around 89.6 trillion cubic feet in 2017 (roughly ⅕ of the US total), natural gas development will likely play a big part in Pennsylvania’s future. The method for extracting natural gas from porous rock underneath the Earth’s surface, usually horizontal drilling paired with hydraulic fracturing (or “fracking”), is an extremely disruptive industrial process that can pose significant human health and environmental risks (see also this compendium of public health studies related to fracking). Allegheny County, the focal point of SkyTruth’s previous analyses, has survived largely unscathed to this point, but developers have high hopes of expanding into the county.

In order to see just how quickly natural gas development can expand, Allegheny residents need not look far. Allegheny’s neighbor to the south, Washington County, has become a critical site of natural gas production for the state of Pennsylvania. Not only does Washington County rank second in production among all Pennsylvania counties, but it also recently moved ahead of Susquehanna County as the home of the most active wells in Pennsylvania. Washington County is considered a part of the Pittsburgh metropolitan area, with a population of approximately 207,000. Though this is a fraction of the population of Allegheny County, its close proximity could prove indicative of what is to come in Allegheny County if stricter regulations are not put in place. In our final entry of this series, we will examine the expansion of drilling and fracking in Washington County, with eyes toward how the trends here might carry over to Allegheny County.



The area shown above lies close to the town of West Finley, PA and surrounds the perimeter of the Four Seasons Camping Resort (shown in the center of this image series). This area is right on the PA/WV border, within the heart of the Utica and Marcellus formations. These images show the growth of drilling infrastructure in a relatively low population setting.


The image above (courtesy Google) gives us a closer look at one of the drilling fluid impoundments that can be seen at the top left corner of the previous scene. SkyTruth recently wrapped up its 2017 FrackFinder update, which mapped the extent of new drilling in Pennsylvania between 2015 and 2017. According to our findings, the average size of one of these impoundments is 1.4 acres, slightly larger than the average football field. These ponds sometimes hold fresh water; at other times they temporarily store leftover fluid from the hydraulic fracturing process, which can contain volatile, toxic chemical additives.



This second area saw significant well pad development from 2008 to 2017. Several wells were constructed along this bend of I-70, right outside the small town of Bentleyville, PA. This area is made up of former coal towns, and mining facilities dot the landscape, indicating that residents of this area are no strangers to resource extraction.



This third series of images shows the massive development of the agricultural land surrounding Cross Creek Lake, located right outside of West Middletown. Cross Creek County Park (outlined in black), which encompasses the lake and its surrounding area, is the largest park in the county and serves as a convenient day retreat for residents of the city of Washington, PA, Washington County’s largest city. Many people come to the lake to fish, but the fracking operations in the park could prove to be detrimental to the health of the lake’s fish, according to recent research.



This close-up on an area at the southwestern portion of the park (courtesy Google Earth) shows a children’s playground that lies just under 1,500 feet away from an active drilling site (at lower right). This is well within the proximity suggested to be potentially hazardous to public health.



This final image series is taken from right outside the Washington County towns of McGovern and Houston. The drilling operations, which pop up in just four years, are located in close proximity to developing neighborhoods, parks, The Meadows Racetrack and Casino, and the Allison Park Elementary School. Unlike the other images depicted throughout this evaluation, this development takes place around a well-established suburban area, where public safety could be at risk should disaster strike at one of these drilling locations.



The image above (courtesy Google) presents yet another example of just how close these drilling sites are built to residential areas in some instances. Massive industrial development can be seen and heard from one’s back porch!

This is all happening directly south of Allegheny County, so it is plausible that similar development could take place there.

Allegheny County is in a unique situation given its location, its population density, and its relatively low levels of natural gas development. As pressures on Allegheny County mount, we hope that these bird’s eye view evaluations of drilling in nearby counties will help to enlighten and inform policy moving forward. To see SkyTruth’s analysis of the effect that setback distances can potentially have on natural gas development in Allegheny County, please follow the link provided here.

This is the final entry in a three-part series visually chronicling the expansion of fracking across Pennsylvania.  This series is meant to complement our work mapping setback distances and potential adverse public health consequences in Allegheny County, PA.  For more about our setbacks work, please check out our blog post and interactive web app. To read the first entry in this series, please follow this link. To see the second entry in the series, click here.


Using Artificial Intelligence to Save the Planet

A letter from our founder, John Amos

The trends aren’t encouraging:  Industrialization, urban development, deforestation, overfishing, mining and pollution are accelerating the rate of global warming and damaging ecosystems around the world. The pace of environmental destruction has never been as great as it is today. Despite this grim assessment, I believe there’s reason to be hopeful for a brighter future.

I’m optimistic because of a new and powerful conservation opportunity: the explosion of satellite and computing technology that now allows us to see what’s happening on the ground and on the water, everywhere, in near real-time.

Up until now we’ve been inspiring people to take action by using satellites to show them what’s already happened to the environment, typically months or even years ago. But technology has evolved dramatically since I started SkyTruth, and today we can show people what’s happening right now, making it possible to take action that can minimize or even stop environmental damage before it occurs. For example, one company, Planet, now has enough satellites in orbit to collect high-resolution imagery of all of the land area on Earth every day. Other companies and governments are building and launching fleets of satellites that promise to multiply and diversify the stream of daily imagery, including radar satellites that operate night and day and can see through clouds, smoke and haze.

Just a few of the Earth-observation satellites in orbit. Image courtesy NASA.

The environmental monitoring potential of all this new hardware is thrilling to our team here at SkyTruth, but it also presents a major challenge: it simply isn’t practical to hire an army of skilled analysts to look at all of these images, just to identify the manageable few that contain useful information.

Artificial intelligence is the key to unlocking the conservation power of this ever-increasing torrent of imagery.

Taking advantage of the same machine-learning technology Facebook uses to detect and tag your face in a friend’s vacation photo, we are training computers to analyze satellite images and detect features of interest in the environment: a road being built in a protected area, logging encroaching on a popular recreation area, a mining operation growing beyond its permit boundary, and other landscape and habitat alterations that indicate an imminent threat to biodiversity, ecosystem integrity, and human health.  By applying this intelligence to daily satellite imagery, we can make it possible to detect changes happening in the environment in near real-time. Then we can immediately alert anyone who wants to know about it, so they can take action if warranted: to investigate, to document, to intervene.

We call this program Conservation Vision.

And by leveraging our unique ability to connect technology and data providers, world-class researchers and high-impact conservation partners, we’re starting to catalyze action and policy success on the ground.

We’re motivated to build this approach to make environmental information available to people who are ready and able to take action. We’ve demonstrated our ability to do this through our partnership with Google and Oceana with the launch and rapid growth of Global Fishing Watch, and we’re already getting positive results automating the detection of fracking sites around the world. We have the technology. We have the expertise. We have the track record of innovation for conservation. And we’ve already begun the work.

Stay tuned for more updates and insights on how you can be part of this cutting-edge tool for conservation. 

Using machine learning to map the footprint of fracking in central Appalachia

Fossil fuel production has left a lasting imprint on the landscapes and communities of central and northern Appalachia.  Mountaintop mining operations, pipeline rights-of-way, oil and gas well pads, and hydraulic fracturing wastewater retention ponds dot the landscapes of West Virginia and Pennsylvania.  And although advocacy groups have made progress pressuring regulated industries and state agencies for greater transparency, many communities in central and northern Appalachia are unaware of, or unclear about, the extent of human health risks that they face from exposure to these facilities.

A key challenge is the discrepancy that often exists between what is on paper and what is on the landscape.  It takes time, money, and staff (three rarities for state agencies always under pressure to do more with less) to map energy infrastructure, and to keep those records updated and accessible for the public.  But with advancements in deep learning, and with the increasing amount of satellite imagery available from governments and commercial providers, it might be possible to track the expansion of energy infrastructure—as well as the public health risks that accompany it—in near real-time.

Figure 1.  Oil and gas well pad locations, 2005 – 2015.

Mapping the footprint of oil and gas drilling, especially unconventional drilling or “fracking,” is a critical piece of SkyTruth’s work.  Since 2013, we’ve conducted collaborative image analysis projects called “FrackFinder” to fill the gaps in publicly available information about the location of fracking operations in the Marcellus and Utica Shale.  In the past, we relied on several hundred volunteers to identify and map oil and gas well pads throughout Ohio, Pennsylvania, and West Virginia.  But we’ve been working on a new approach: automating the detection of oil and gas well pads with machine learning.  Rather than train several hundred volunteers to identify well pads in satellite imagery, we developed a machine learning model that could be deployed across thousands of computers simultaneously.  Machine learning is at the heart of many of today’s technology companies. It’s the technology that enables Netflix to recommend new shows that you might like, or that allows digital assistants like Google, Siri, or Alexa to understand requests like, “Hey Google, text Mom I’ll be there in 20 minutes.”

Examples are at the core of machine learning.  Rather than try to “hard code” all of the characteristics that define a modern well pad (they are generally square, generally gravel, and generally littered with industrial equipment), we teach computers what they look like by using examples.  Lots of examples. Like, thousands or even millions of them, if we can find them. It’s just like with humans: the more examples of something that you see, the easier it is to recognize that thing later. So, where did we get a few thousand images of well pads in Pennsylvania?  

We started with SkyTruth’s Pennsylvania oil and gas well pad dataset. The dataset contains well pad locations identified in National Agriculture Imagery Program (NAIP) aerial imagery from 2005, 2008, 2010, 2013, and 2015 (Figure 1).  We uploaded this dataset to Google Earth Engine, and used it to create a collection of 10,000 aerial images in two classes: “well pad” and “non-well pad.” We created the training images by buffering each well pad by 100 meters, clipping the NAIP imagery to the bounding box, and exporting each image.
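The “buffer by 100 meters, clip to the bounding box” step can be sketched in plain projected coordinates. This illustrates the geometry only, not our actual Earth Engine code (in Earth Engine the equivalent calls would be `buffer()` followed by `bounds()` on a point geometry), and the centroid coordinates below are hypothetical.

```python
# Geometry of the training-chip step: buffer a well pad centroid by 100 m,
# then take the square bounding box of that buffer. Coordinates are assumed
# to be in a projected CRS measured in meters.

def chip_bounds(x, y, buffer_m=100):
    """Bounding box (xmin, ymin, xmax, ymax) of a buffer_m buffer around (x, y)."""
    return (x - buffer_m, y - buffer_m, x + buffer_m, y + buffer_m)

# A hypothetical well pad centroid:
xmin, ymin, xmax, ymax = chip_bounds(500_000, 4_400_000)
print(xmax - xmin, ymax - ymin)  # 200 200 -- each chip covers 200 m x 200 m
```

Buffering each well pad this way guarantees that every training image includes both the pad and a margin of its immediate surroundings.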


The images above show three training examples from our “well pad” class. The images below show three training examples taken from our “non-well pad” class.


We divided the dataset into three subsets: a training set with 4,000 images of each class, a validation set with 500 images of each class, and a test set with 500 images of each class.  We combined this work in Google Earth Engine with Google’s powerful TensorFlow deep learning library.  We used our 8,000 training images (4,000 from each class, remember) and TensorFlow’s high-level Keras API to train our machine learning model.  So what, exactly, does that mean? Well, basically, it means that we showed the model thousands and thousands of examples of what well pads are (i.e., images from our “well pad” class) and what well pads aren’t (i.e., images from our “non-well pad” class).  We trained the model for twenty epochs, meaning that we showed the model the entire training set (8,000 images, remember) twenty times.  So, basically, the model saw 160,000 examples, and over time, it “learned” what well pads look like.

Our best model run returned an accuracy of 84%, precision and recall measures of 87% and 81%, respectively, and a false positive rate and false negative rate of 0.116 and 0.193, respectively.  We’ve been pleased with our initial model runs, but there is plenty of room for improvement. We started with the VGG16 model architecture that comes prepackaged with Keras (Simonyan and Zisserman 2014, Chollet 2018).  The VGG16 model architecture is no longer state-of-the-art, but it is easy to understand, and it was a great place to begin.  
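For reference, all of the metrics reported above can be written in terms of the four confusion-matrix counts. The counts in this sketch are invented to roughly echo our numbers; they are not taken from the actual model run.

```python
# The reported metrics expressed in confusion-matrix terms.
# tp/fp/tn/fn values below are invented for illustration.

def classification_metrics(tp, fp, tn, fn):
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),  # always equals 1 - recall
    }

m = classification_metrics(tp=81, fp=12, tn=88, fn=19)
print(round(m["precision"], 3), round(m["recall"], 3))  # 0.871 0.81
```

Note that the false negative rate is simply one minus recall, which is why our reported recall of 81% pairs with a false negative rate of 0.193.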

After training, we ran the model on a few NAIP images to compare its performance against well pads collected by SkyTruth volunteers for our 2015 Pennsylvania FrackFinder project.  Figures 4 and 6 depict the model’s performance on two NAIP images near Williamsport, PA. White bounding boxes indicate landscape features that the model predicted to be well pads.  Figures 5 and 7 depict those same images with well pads (shown in red) delineated by SkyTruth volunteers.

Figure 4.  Well pads detected by our machine learning algorithm in NAIP imagery from 2015.
Figure 5.  Well pads detected by SkyTruth volunteers in NAIP imagery from 2015.
Figure 6.  Well pads detected by our machine learning algorithm in NAIP imagery from 2015.
Figure 7.  Well pads detected by SkyTruth volunteers in NAIP imagery from 2015.

One of the first things that stood out to us was that our model is overly sensitive to strong linear features.  In nearly every training example, there is a clearly defined access road that connects to the well pad. As a result, the model regularly classified large patches of cleared land or isolated developments (e.g., warehouses) at the end of a linear feature as a well pad.  Another major weakness is that our model is also overly sensitive to active well pads.  Active well pads tend to be large, gravel squares with clearly defined edges. Although these well pads may be the biggest concern, there are many “reclaimed” and abandoned well pads that lack such clearly defined edges.  Regrettably, our model is overfit to highly visible active well pads, and it performs poorly on lower-visibility drilling sites that have lost their square shape or that have been revegetated by grasses.

Nevertheless, we think this is a good start.  Despite a number of false detections, our model was able to detect all of the well pads previously identified by volunteers in Figures 5 and 7 above.  In several instances, false detections consisted of energy infrastructure that, although not active well pads, remains of high interest to environmental and public health advocates as well as state regulators: abandoned well pads, wastewater impoundments, and recent land clearings.  NAIP imagery is only collected every two or three years, depending on funding. So, tracking the expansion of oil and gas drilling activities in near real-time will require access to a high-resolution, near real-time imagery stream (like Planet, for instance).  For now, we’re experimenting with more current model architectures and with reconfiguring the model for semantic segmentation — extracting polygons that delineate the boundaries of well pads which can be analyzed in mapping software by researchers and our partners working on the ground.

Keep checking back for updates.  We’ll be posting the training data that we created, along with our initial models, as soon as we can.