Drilling Detection with Machine Learning Part 3: Making and mapping predictions

SkyTruth Technical Program Director Ry Covington, PhD, explains the challenges of generating meaningful predictions from the machine learning model and outlines solutions.

[This is the final post in a 3-part blog series describing SkyTruth’s effort to automate the detection of oil and gas well pads around the world using machine learning. We hope that this series – and the resources we’ve provided throughout – will help other developers and scientists working on conservation issues to learn from our experience and build their own exciting tools.  You can read the two previous posts in the series here and here.]

In the first two posts, SkyTruth Intern Sasha Bylsma and Geospatial Analyst Brendan Jarrell explained how we create training data and implement a machine learning model to detect oil and gas well pads. So, now that we’ve got a trained model, we just need to run it on a few satellite images and put the predictions on a map. Seems easy enough…

We started with some Sentinel-2 imagery collected over the Neuquén Basin in central Argentina. This is one of the most heavily drilled areas in Argentina, and we’ve used the Google Earth Engine platform to export a few Sentinel-2 images that we could work with.  

Screenshot of the Neuquén basin in Argentina. 

The images come out of Earth Engine as GeoTIFFs – a pretty standard file format for overhead imagery. We’ve used some Earth Engine magic to reduce the file size of each image so that they’re easier to handle, but they’re still a bit big for the model. The model expects simple, small patches of images: 256 pixels high, 256 pixels wide, and three channels (e.g., Red, Green, Blue) deep. Our Sentinel-2 GeoTIFFs are about 11,000 pixels high by 11,000 pixels wide, so that left us with a few things to figure out:

  • First, the model is expecting small, simple patches of images – no frills, just pixel values. That means that we have to take the geographic information that’s embedded in the original GeoTIFF and set it aside. So, how do we do that?
  • Second, how can we evenly slice up the full image into the small patches that the model is expecting?
  • Third, for every small image patch that the model sees, it returns a small, single channel prediction image with values between zero and one. The closer a pixel is to one, the more likely it is that pixel belongs to a drilling site.  But once the model makes predictions on all of the small images, how do we put them back together in the right order to get the original image’s size and shape?
  • Lastly, how do we take the prediction image and convert it into polygons that we can overlay on a map?

These were all things that we’d never done before, so it’s taken us quite a bit of trial and error to figure out how to make things work. In fact, we’re still working on them – we’ve got a workflow in place, but we’re always trying to refine and improve it. For now, let’s just look at what we’ve got working. 

Step 1. Converting our GeoTIFF to a NumPy array

We started off with a pretty small subset of a Sentinel-2 image that we could experiment with. It’s 1,634 pixels high, 1,932 pixels wide, and 3 channels deep. In developer parlance, its shape is: (1634, 1932, 3). The image is of Mari Menuco Lake in Argentina. There are a few dozen drilling sites along the southern portion of the lake that seemed like an ideal place to test out our workflow. Once we had everything working as expected, we’d run the larger Sentinel-2 images through.

First, we used the GDAL Python API to load our image and collect its (a) geotransform and (b) projection. So, what are these two things? Well, basically, the geotransform is the formula that GDAL uses to go from pixel space (think rows and columns) to geographic space (think x [longitude] and y [latitude]), and the projection is just the coordinate reference system of the image. After we had those two pieces of information set aside for later, we pushed all of the image bands into a NumPy array.

 

from osgeo import gdal, osr
import numpy as np

# Open the GeoTIFF (path is illustrative).
tiff = gdal.Open("/gdrive/My Drive/mari_menuco.tif")

# Get geographic information: the geotransform and the projection.
geotransform = tiff.GetGeoTransform()
projection = tiff.GetProjection()

# Set spatial reference.
spatial_ref = osr.SpatialReference()     # Create empty spatial reference.
spatial_ref.ImportFromWkt(projection)    # Read the "projection" string.

# Collect all the bands in the .tif image.
bands = [tiff.GetRasterBand(band+1) for band in range(tiff.RasterCount)]

# Read each band as an array.
arrays = [band.ReadAsArray() for band in bands]

# Combine into a single array of shape (channels, height, width).
image = np.array(arrays)

# Format as (height, width, channels).
image = image.transpose(1, 2, 0)

GDAL reads images band-first – (channels, height, width) – while our model expects channels-last, so the last thing we did was transpose the axes to put our image in the format that we needed: height, width, channels.
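If it helps to see that “formula” spelled out: a GDAL geotransform is just six numbers, and going from a pixel’s row and column to map coordinates is one line of arithmetic. Here’s a quick sketch (the pixel indices are arbitrary, just for illustration):

# The geotransform is six numbers:
# (top-left x, pixel width, row rotation, top-left y, column rotation, pixel height).
gt = geotransform

row, col = 500, 750                      # An arbitrary pixel, for illustration.
x = gt[0] + col * gt[1] + row * gt[2]    # Map x (longitude or easting).
y = gt[3] + col * gt[4] + row * gt[5]    # Map y (latitude or northing).
print(x, y)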

Step 2. Slicing our NumPy array and running predictions 

The next bit was tricky for us to figure out. We needed to take our image – 1634 by 1932 by 3 (1634, 1932, 3) – and slice it into even squares of (256, 256, 3). Our first problem: neither 1634 nor 1932 divides by 256 evenly, so we needed to figure out how to make the image patches overlap just enough to get a clean division.  
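To make the overlap arithmetic concrete, here’s a minimal, standalone sketch using the 20% overlap we settled on and the 1,932-pixel width of our test image. The stride between patch origins works out to 204 pixels, and the last origin gets clamped so the final patch ends exactly at the image edge:

patchSize = 256
overlap = 0.2
width = 1932

stride = int(patchSize * (1 - overlap))   # 204 pixels between patch origins.

# Candidate origins, clamped so the final patch never runs past the border.
origins = [min(x, width - patchSize) for x in range(0, width - 1, stride)]

print(origins)   # [0, 204, 408, ..., 1632, 1676] -- the last origin is clamped to 1932 - 256.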

Our second problem: we also needed to keep track of where each patch lived in the larger image so that we could put the predictions back together in the right order later. We ended up giving each patch an ID and collecting the coordinates of their top-left pixel (their minimum x and minimum y). We pushed that information into a pandas dataframe – basically, a 2-D matrix of rows and columns – that we could set aside to rebuild our prediction image later.  

Many thanks to CosmiQ Works and all of the participants in the SpaceNet Challenges; the code snippets and GitHub repos that they’ve made available were immensely helpful for us as we tried to figure out how to implement this step.

 

import pandas as pd

# Set some variables.
patchSize = 256
overlap = 0.2
height, width, bands = image.shape
imgs, positions = [], []
columns = ['xmin', 'ymin']

# Work through the image and bin it up.
for x in range(0, width-1, int(patchSize*(1-overlap))):    
   for y in range(0, height-1, int(patchSize*(1-overlap))):
      
       # Get top-left pixel.
       xmin = min(x, width-patchSize)
       ymin = min(y, height-patchSize)

       # Get image patch.
       patch = image[ymin:ymin+patchSize, xmin:xmin+patchSize]

       # Set position.
       pos = [xmin, ymin]

       # Add to array.
       imgs.append(patch)
       positions.append(pos)

# Convert to NumPy array.
imageArray = np.array(imgs) / 255.0

# Create position dataframe.
df = pd.DataFrame(positions, columns=columns)
df.index = np.arange(len(positions))

Once we had the new array of patches – 80 patches of 256 by 256 by 3 – it was easy to run the model and generate some predictions.

# And, go. Don't forget the batch dimension.
predictions = model.predict(imageArray, batch_size=20, steps=4)

Step 3. Rebuilding our image

The model returns an array of predictions – (80, 256, 256, 1). The prediction values range from zero to one. So, a pixel value of .82 means that the model is 82% confident that pixel belongs to a drilling site.  

Side by side comparison of an image and its prediction.

We used the pandas dataframe that we made earlier to put all of these predictions back together in the right order and get the original image’s size and shape. The dataframe was where we recorded the ID and top-left pixel (the minimum x and minimum y) of each patch. First, we created an empty image that is the same size and shape as the original. Next, we went through the dataframe, took out each prediction patch, and added it to the new empty image in the right spot (its top-left pixel). Because the patches overlap, we also kept a count of how many patches touched each pixel, and divided by that count at the end to average the overlapping predictions.

 

# Create numpy zeros of appropriate shape.
empty_img = np.zeros((height, width, 1))

# Create another zero array to record where pixels get overlaid.
overlays = np.zeros((height, width, 1))

# Iterate through patches.
for index, item in df.iterrows():

    # Grab values for each row / patch.
    [xmin, ymin] = item

    # Grab the matching prediction patch.
    img_slice = predictions[index]

    x0, x1 = xmin, xmin + patchSize
    y0, y1 = ymin, ymin + patchSize

    # Add img_slice to empty_img.
    empty_img[y0:y1, x0:x1] += img_slice

    # Update count of overlapping pixels.
    overlays[y0:y1, x0:x1] += np.ones((patchSize, patchSize, 1))

# Normalize: where patches overlapped, average the predictions to keep values between 0 and 1.
rebuilt_img = np.divide(empty_img, overlays)

Most of our team are visual thinkers, so the easiest way for us to imagine rebuilding the image is like covering a blank sheet of white paper in pink sticky-notes, and then smoothing them all down to get a new, blank sheet of pink paper.   

Step 4. Converting our predictions to polygons

After rebuilding our prediction array to be the same size and shape as the original satellite image, we used the GDAL Python API to convert it into polygons that could go on a map. To try and clean things up a bit, we started by selecting only those pixels where the model was more than 70% confident they belonged to a drilling site. We set anything under that threshold to zero. This just helped us to eliminate some noise and clean up the predictions a bit. With that done, we used GDAL to convert the cleaned up prediction image into polygons and reattach the spatial information that we set aside at the beginning (i.e., the geotransform and the projection).  

  

from osgeo import ogr

# Keep only pixels above 70% confidence, and write them to an in-memory
# raster that carries the geotransform and projection we set aside earlier.
cleaned = np.where(rebuilt_img > 0.7, 1, 0).astype(np.uint8)
output = gdal.GetDriverByName("MEM").Create("", width, height, 1, gdal.GDT_Byte)
output.SetGeoTransform(geotransform)
output.SetProjection(projection)
output.GetRasterBand(1).WriteArray(cleaned[:, :, 0])

# Band to use.
sourceBand = output.GetRasterBand(1)

# Set up shapefile.
shpDrv = ogr.GetDriverByName("ESRI Shapefile")
outFile = shpDrv.CreateDataSource("/gdrive/My Drive/detections.shp")
layer = outFile.CreateLayer("detections", srs=spatial_ref)

# Add field.
idField = ogr.FieldDefn("id", ogr.OFTInteger)
layer.CreateField(idField)

# And, convert. Using the band as its own mask skips the zero-valued pixels.
gdal.Polygonize(sourceBand, sourceBand, layer, 0, [], None)

# Close the data source to flush the shapefile to disk.
outFile = None

And at this point we had our shapefile. From there, it was easy to upload that shapefile as an asset into our Earth Engine account and have a look at our predictions over satellite imagery. We did a bit of clean up and editing to make sure that all of the polygons looked good — what’s sometimes called “human-in-the-loop” modeling — but, for the most part, we were all done.

Screenshot of the polygons in EE.

Lastly, we did a quick assessment to see how well our from-scratch workflow functioned. In the image above, red points are drilling sites that we got correct (true positives), green points are drilling sites that we missed (false negatives), and blue points are places where the model thought there was a drilling site when there wasn’t (false positives). Here are the numbers: 

Total number of validated ground truth points: 239
True Positives: 107
False Positives: 50
False Negatives: 82
Precision: 0.682
Recall: 0.566
F1-score: 0.618

Precision, recall, and F1-score are all just metrics for understanding how a model performs. Your spam folder offers a good example. Imagine that your email’s spam model makes predictions on a batch of 30 emails. Precision would measure the percentage of emails flagged as spam that actually are spam. Recall would measure the percentage of actual spam emails that the model correctly flagged. Often, these two things are in tension – if the model gets more aggressive (i.e., lowers the threshold for an email to be classified as spam), recall goes up, since it captures more of the actual spam. But precision goes down, because it flags more emails overall, and many of those won’t be spam. The F1-score combines precision and recall, and it’s probably easiest to think of it as a measure of overall accuracy. In the context of the drilling site work, our precision and recall mean that about 68% of the things we predict as drilling sites actually are drilling sites, but we’re only picking up about 57% of the drilling sites that are actually out there in the world.
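For anyone who wants the formulas behind those numbers, here’s a minimal sketch that computes them from the true positive, false positive, and false negative counts above (the function name is just ours):

def precision_recall_f1(tp, fp, fn):
    # Of everything we flagged as a drilling site, how much was right?
    precision = tp / (tp + fp)
    # Of all the real drilling sites, how many did we find?
    recall = tp / (tp + fn)
    # Harmonic mean of the two.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(precision_recall_f1(tp=107, fp=50, fn=82))
# (0.6815..., 0.5661..., 0.6184...)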

We hope that this series has helped to demystify the machine learning workflow for others. Figuring out how to build a machine learning data processing pipeline from scratch was an exciting challenge for us. We’re encouraged by our progress so far, but as the metrics above indicate, there’s still lots of room for improvement. We will keep at it, because this technology stack — integrating remote sensing with cloud computing and machine learning — is at the heart of our Conservation Vision initiative to automate the analysis of imagery and solve challenging conservation problems.

So please stay tuned for updates in the future.  And don’t hesitate to send us a note if you have questions about what we’ve done or ideas about how we can improve things going forward.  We welcome the collaboration! 

 

SkyTruth 2020: What to Expect in the New Year

Oil pollution at sea, mountaintop mining, Conservation Vision and more on SkyTruth’s agenda.

SkyTruth followers know that we generated a lot of momentum in 2019, laying the groundwork for major impact in 2020. Here’s a quick list of some of our most important projects underway for the new year.

Stopping oil pollution at sea: SkyTruth has tracked oil pollution at sea for years, alerting the world to the true size of the BP oil spill, tracking the ongoing leak at the Taylor Energy site until the Coast Guard agreed to take action, and flagging bilge dumping in the oceans. Bilge dumping occurs when cargo vessels and tankers illegally dump oily wastewater stored in the bottom of ships into the ocean. International law specifies how this bilge water should be treated to protect ocean ecosystems. But SkyTruth has discovered that many ships bypass costly pollution prevention equipment by simply flushing the bilge water directly into the sea.

In 2019 SkyTruth pioneered the identification of bilge dumping and the vessels responsible for this pollution by correlating satellite imagery of oily slicks with Automatic Identification System (AIS) broadcasts from ships. For the first time, we can ID the perps of this devastating and illegal practice.

Figure 1. SkyTruth identified the vessel PERKASA dumping bilge water; the ship’s AIS broadcast track is overlain on a Sentinel-1 image.

But the Earth’s oceans are vast, and there’s only so much imagery SkyTruthers can analyze. So we’ve begun automating the detection of bilge dumping using an Artificial Intelligence (AI) technique called machine learning. With AI, SkyTruth can analyze thousands of satellite images of the world’s oceans every day – a process we call Conservation Vision – finding tiny specks on the oceans trailing distinctive oily slicks, and then naming names, so that the authorities and the public can catch and shame those skirting pollution laws when they think no one is looking.

A heads up to polluters: SkyTruth is looking. 

We got a big boost last month when Amazon Web Services (AWS) invited SkyTruth to be one of four nonprofits featured in its AWS re:Invent Hackathon for Good, and awarded SkyTruth one of seven AWS Imagine Grants. We’ll be using the funds and expertise AWS is providing to expand our reach throughout the globe and ensure polluters have nowhere to hide.

Protecting wildlife from the bad guys: Many scientists believe the Earth currently is facing an extinction crisis, with wildlife and their habitats disappearing at unprecedented rates.   

But SkyTruth’s Conservation Vision program using satellite imagery and machine learning can help. Beginning in 2020, SkyTruth is partnering with Wildlife Conservation Society to train computers to analyze vast quantities of image data to alert rangers and wildlife managers to threats on the ground. These threats include roads being built in protected areas, logging encroaching on important habitats, mining operations growing beyond permit boundaries, and temporary shelters hiding poachers. With better information, protected area managers can direct overstretched field patrols to specific areas and catch violators in the act, rather than arriving months after the fact.  It can alert rangers before they discover a poaching camp by chance (and possibly find themselves surprised and outgunned).

To make this revolution in protected area management possible we will be building a network of technology and data partners, academic researchers, and other tech-savvy conservationists to make the algorithms, computer code, and analytical results publicly available for others to use. By publicly sharing these tools, Conservation Vision will enable others around the world to apply the same cutting-edge technologies to protecting their own areas of concern, launching a new era of wildlife and ecosystem protection. In 2020 we expect to undertake two pilot projects in different locations to develop, refine, and test Conservation Vision and ultimately transform wildlife protection around the world.

Identifying mountaintop mining companies that take the money and run: SkyTruth’s Central Appalachia Surface Mining database has been used by researchers and advocates for years to document the disastrous environmental and health impacts of mountaintop mining. Now, SkyTruth is examining how well these devastated landscapes are recovering.

Figure 2. Mountaintop mine near Wise, Virginia. Copyright Alan Gignoux; Courtesy Appalachian Voices; 2014-2.

To do this, we are generating a spectral fingerprint using satellite imagery for each identified mining area. This fingerprint will outline the characteristics of each site, including the amount of bare ground present and information about vegetation regrowth. In this way we will track changes and measure recovery by comparing the sites over time to a healthy Appalachian forest. 
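As a rough illustration of the kind of measurement involved – not our production pipeline – here’s a minimal Earth Engine sketch that pulls a mean NDVI (a standard vegetation-greenness index) time series for a hypothetical mine polygon. The coordinates, dates, and cloud threshold are placeholders:

import ee
ee.Initialize()

# Placeholder polygon standing in for a reclaimed mine site (illustrative coordinates).
mine = ee.Geometry.Polygon([[[-82.72, 37.12], [-82.70, 37.12],
                             [-82.70, 37.10], [-82.72, 37.10]]])

# Sentinel-2 surface reflectance over one growing season, mostly cloud-free scenes.
collection = (ee.ImageCollection('COPERNICUS/S2_SR')
              .filterBounds(mine)
              .filterDate('2019-05-01', '2019-09-30')
              .filter(ee.Filter.lt('CLOUDY_PIXEL_PERCENTAGE', 20)))

def mean_ndvi(image):
    # Mean NDVI over the mine polygon for a single image.
    ndvi = image.normalizedDifference(['B8', 'B4'])
    stat = ndvi.reduceRegion(ee.Reducer.mean(), mine, scale=10)
    return ee.Feature(None, {'date': image.date().format('YYYY-MM-dd'),
                             'ndvi': stat.get('nd')})

# One NDVI value per image: a simple regrowth signal to track over time.
series = ee.FeatureCollection(collection.map(mean_ndvi))
print(series.aggregate_array('ndvi').getInfo())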

Under federal law, mining companies are required to set aside money in bonds to make sure that funds are available to recover their sites for other uses once mining ends. But the rules are vague and vary by state. If state inspectors determine that mine sites are recovered adequately, then mining companies reclaim their bonds, even if the landscape they leave behind looks nothing like the native forest they destroyed. In some cases, old mines are safety and health hazards as well as useless eyesores, leaving communities and taxpayers to foot the bill for recovery. SkyTruth’s analysis will provide the public, and state inspectors, an objective tool for determining when sites have truly recovered and bonds should be released, or when more should be done to restore local landscapes.

Characterizing toxic algal blooms from space: Harmful algal blooms affect every coastal and Great Lakes state in the United States. Normally, algae are harmless — simple plants that form the base of aquatic food webs. But under the right conditions, algae can grow out of control causing toxic blooms that can kill wildlife and cause illness in people. 

 SkyTruth is partnering with researchers at Kent State University who have developed a sophisticated technique for detecting cyanobacteria and other harmful algae in the western basin of Lake Erie — a known hotspot of harmful algal blooms. They hope to extend this work to Lake Okeechobee in Florida. But their method has limitations: It uses infrequently collected, moderate resolution 4-band multispectral satellite imagery to identify harmful blooms and the factors that facilitate their formation. SkyTruth is working to implement the Kent State approach in the more accessible Google Earth Engine cloud platform, making it much easier to generate updates to the analysis, and offering the possibility of automating the update on a regular basis.  We anticipate that this tool eventually will enable scientists and coastal managers to quickly identify which algal blooms are toxic, and which are not, simply by analyzing their characteristics on imagery.

Revealing the extent of fossil fuel drilling on public lands in the Colorado River Basin: Modern oil and gas drilling and fracking are a threat to public health, biodiversity and the climate. For example, researchers from Johns Hopkins University used our data on oil and gas infrastructure in Pennsylvania to examine the health effects on people living near these sites and found higher premature birth rates for mothers in Pennsylvania who live near fracking sites, as well as increased asthma attacks.

The Trump Administration is ramping up drilling on America’s public lands, threatening iconic places such as Chaco Culture National Historical Park in New Mexico. Chaco Canyon is a UNESCO World Heritage Site that contains the ruins of a 1,200-year-old city that is sacred to native people. According to the Center for Western Priorities, 91% of the public lands in Northwest New Mexico surrounding the Greater Chaco region are developed for oil and gas, and local communities complain of pollution, health impacts and more.

Figure 3. Chaco Canyon Chetro Ketl great kiva plaza. Photo courtesy of the National Park Service.

In 2020 SkyTruth will deploy a machine learning model we developed in 2019 that identifies oil and gas drilling sites in the Rocky Mountain West with 86.3% accuracy. We will apply it to the Greater Chaco Canyon region to detect all oil and gas drilling sites on high-resolution aerial survey photography. We hope to then use these results to refine and expand the model to the wider Colorado River Basin. 

Local activists in northwestern New Mexico have fought additional drilling for the past decade. Last year, New Mexico’s congressional delegation successfully led an effort to place a one-year moratorium on drilling within a 10-mile buffer around the park. Activists view this as a first step towards permanent protection. SkyTruth’s maps will help provide them with visual tools to fight for permanent protection.

A new SkyTruth website: We’ll keep you up to date about these projects and more on a new, revamped SkyTruth website under development for release later this year. Stay tuned for a new look and more great SkyTruthing in the year ahead!

CONSERVATION VISION

Using Artificial Intelligence to Save the Planet

A letter from our founder, John Amos

The trends aren’t encouraging:  Industrialization, urban development, deforestation, overfishing, mining and pollution are accelerating the rate of global warming and damaging ecosystems around the world. The pace of environmental destruction has never been as great as it is today. Despite this grim assessment, I believe there’s reason to be hopeful for a brighter future.

I’m optimistic because of a new and powerful conservation opportunity: the explosion of satellite and computing technology that now allows us to see what’s happening on the ground and on the water, everywhere, in near real-time.

Up until now we’ve been inspiring people to take action by using satellites to show them what’s already happened to the environment, typically months or even years ago. But technology has evolved dramatically since I started SkyTruth, and today we can show people what’s happening right now, making it possible to take action that can minimize or even stop environmental damage before it occurs. For example, one company, Planet, now has enough satellites in orbit to collect high-resolution imagery of all of the land area on Earth every day. Other companies and governments are building and launching fleets of satellites that promise to multiply and diversify the stream of daily imagery, including radar satellites that operate night and day and can see through clouds, smoke and haze.

Just a few of the Earth-observation satellites in orbit. Image courtesy NASA.

The environmental monitoring potential of all this new hardware is thrilling to our team here at SkyTruth, but it also presents a major challenge: it simply isn’t practical to hire an army of skilled analysts to look at all of these images, just to identify the manageable few that contain useful information.

Artificial intelligence is the key to unlocking the conservation power of this ever-increasing torrent of imagery.

Taking advantage of the same machine-learning technology Facebook uses to detect and tag your face in a friend’s vacation photo, we are training computers to analyze satellite images and detect features of interest in the environment: a road being built in a protected area, logging encroaching on a popular recreation area, a mining operation growing beyond its permit boundary, and other landscape and habitat alterations that indicate an imminent threat to biodiversity, ecosystem integrity, and human health.  By applying this intelligence to daily satellite imagery, we can make it possible to detect changes happening in the environment in near real-time. Then we can immediately alert anyone who wants to know about it, so they can take action if warranted: to investigate, to document, to intervene.

We call this program Conservation Vision.

And by leveraging our unique ability to connect technology and data providers, world-class researchers and high-impact conservation partners, we’re starting to catalyze action and policy success on the ground.

We’re motivated to build this approach to make environmental information available to people who are ready and able to take action. We’ve demonstrated our ability to do this through our partnership with Google and Oceana with the launch and rapid growth of Global Fishing Watch, and we’re already getting positive results automating the detection of fracking sites around the world. We have the technology. We have the expertise. We have the track record of innovation for conservation. And we’ve already begun the work.

Stay tuned for more updates and insights on how you can be part of this cutting-edge tool for conservation.