Community-Enhanced: Google Earth Engine Community Catalog Upgrades
When I created the Google Earth Engine Community Catalog, my primary focus was always on fostering collaboration and addressing any issues in our datasets with the help of valuable feedback from the geospatial community. What better way to learn from the community than through its feedback? So this one is about a lesson learned and an adventure in fixing and updating the Digital Earth Africa Cropland mask. One of the emails I received described what looked like an edge effect: tiles missing around the corners of regions when looking at the datasets.
The datasets were downloaded from the Amazon Open Data registry on AWS, and while STAC objects are unique within their collection, a multi-collection STAC has recurrent objects with the same id. This was not obvious, because bringing all images into the same folder, whether with an AWS sync operation or with STAC download tools, would simply overwrite or skip over existing files.
Note: This would happen if you were to bring all filtered, mask, and prob tif files into a single folder to combine them into a single collection. Edges between regions have recurrent tiles across collections, and that is what created the missing tiles effect.
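To make the overwrite concrete, here is a minimal sketch of what a flat copy does when two regions share a tile name. The folder layout, region names, and tile names are hypothetical, and `flatten_copy` is an illustrative helper, not the tool used for the catalog.

```python
import shutil
import tempfile
from pathlib import Path


def flatten_copy(source_root: Path, target: Path) -> list[str]:
    """Copy every .tif under source_root into one flat folder.

    Returns the file names that already existed in target, i.e. the
    ones a plain copy silently overwrites.
    """
    target.mkdir(parents=True, exist_ok=True)
    collisions = []
    for tif in sorted(source_root.rglob("*.tif")):
        destination = target / tif.name
        if destination.exists():
            collisions.append(tif.name)
        shutil.copy(tif, destination)
    return collisions


# Hypothetical layout: two regional collections sharing one edge tile.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "download"
    layout = {
        "region_a": ["x10_y20.tif", "x11_y20.tif"],
        "region_b": ["x11_y20.tif", "x12_y20.tif"],
    }
    for region, tiles in layout.items():
        (root / region).mkdir(parents=True)
        for tile in tiles:
            (root / region / tile).write_text(region)

    flat = Path(tmp) / "flat"
    collisions = flatten_copy(root, flat)
    remaining = sorted(p.name for p in flat.iterdir())
```

Four files go in, three come out, and nothing complains along the way. The tile that disappears is the one sitting on the edge between the two regions.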
Finding the problem
One of my favorite parts of digging for a problem or debugging is looking at it from every single angle. So I went digging into STAC. SpatioTemporal Asset Catalogs, for the uninitiated, serve out metadata and arrange it in hierarchies: collections (image collections) and items (the images themselves) 🗂️🌄. The benefit of STAC is standardization and interoperability with other STAC manifests, and each collection tree is unique in some way. While that is true, multiple collections create an interesting issue: objects that are unique within a collection may be repeated, just nested under different collections. This was an assumption on my part, so I had to check whether items in different collections could share the same ID. Turns out a few of them overlapped. So I made a mental note 📝. You can easily visualize the STAC collections using the STAC browser, just search for crop 🔍📊.
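The overlap check itself can be sketched in a few lines of Python. This assumes the item ids have already been listed per collection (for example from the STAC browser or a STAC client); the collection names and ids below are made up for illustration.

```python
from collections import defaultdict


def shared_item_ids(collections: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each item id found in more than one collection to those collections."""
    seen = defaultdict(set)
    for collection_id, item_ids in collections.items():
        for item_id in item_ids:
            seen[item_id].add(collection_id)
    return {
        item_id: sorted(names)
        for item_id, names in seen.items()
        if len(names) > 1
    }


# Hypothetical collection and item ids, for illustration only.
catalog = {
    "crop_mask_region_a": ["x10_y20", "x11_y20"],
    "crop_mask_region_b": ["x11_y20", "x12_y20"],
}
overlap = shared_item_ids(catalog)
```

An empty result would mean ids are unique across the whole catalog. Anything returned is an item that collides as soon as the collections are flattened together.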
All in the overlays
💡 Next on the agenda was the most obvious step: examining the actual datasets and pinpointing the missing tiles. 🧩 To make sure the tiles and versions were accurate, I re-downloaded all mask, probability, and filtered collections 🔄📥. The user's sample coverage also showed complete overlap, which prompted me to isolate each collection, with its image objects, separately 🗂️🔄🧩. This meticulous approach makes for a more refined and comprehensive geospatial data experience 🛠️🌍🚀.
Avoiding Overwrites in Earth Engine Collections 🛰️
When I first created the Earth Engine collections, my goal was to merge 8 different image collections into one comprehensive dataset. However, I ran into a challenge with the system index, which has to be unique. To address this, I decided to run a quick file name check. The process involved gathering the filenames for each collection and appending them to a master list. Running a “set” function, which counts each unique object only once, I was able to create 8 separate collections, each with its own distinct length. It was a significant breakthrough, as it shed light on the root of the problem: the edge tiles were causing regions to overwrite each other 🚧💡. With this file name check in place, I could ensure the integrity of my Earth Engine collections and a more streamlined geospatial data experience 🛠️🗺️🚀.
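The file name check described above can be sketched as follows. It is a minimal version under assumed inputs: the region names and filenames are hypothetical, and `uniqueness_report` is an illustrative helper rather than the exact script I ran.

```python
def uniqueness_report(filenames_by_collection: dict[str, list[str]]) -> dict[str, int]:
    """Compare the total number of filenames against the number of unique ones."""
    master = []
    for names in filenames_by_collection.values():
        master.extend(names)  # append every collection to the master list
    unique = set(master)      # a set keeps each name only once
    return {
        "total": len(master),
        "unique": len(unique),
        "lost_on_merge": len(master) - len(unique),
    }


# Hypothetical filenames; the real check covered 8 collections.
report = uniqueness_report({
    "region_a": ["x10_y20.tif", "x11_y20.tif"],
    "region_b": ["x11_y20.tif", "x12_y20.tif"],
})
```

If `lost_on_merge` is anything other than zero, a merge keyed on the file name alone will drop that many images.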
Fix and deploy
To adapt the STAC metadata for Google Earth Engine, I wrote custom code to handle the conversion. The remedy was simple yet effective: I added the region name at the beginning of each file name. This minor tweak let me merge all the images into a single Earth Engine image collection. I also added an “id_no” field that retains the original item id, which opened up a useful check: running the aggregate histogram tool on the “id_no” field. Most image ids showed a count of 1, while a few showed a count of 2. Those are the tiles repeated across regions, which validates the fix from both ends 📊🔍💡.
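Here is a rough sketch of the renaming and the histogram check, done locally. The region names, filenames, and the `build_assets` helper are hypothetical; `Counter` stands in for Earth Engine's `aggregate_histogram("id_no")` on the merged image collection, so this illustrates the idea rather than the exact conversion code.

```python
from collections import Counter


def build_assets(filenames_by_region: dict[str, list[str]]) -> list[dict[str, str]]:
    """Prefix each file name with its region and keep the original id as id_no."""
    assets = []
    for region, names in filenames_by_region.items():
        for name in names:
            item_id = name.rsplit(".", 1)[0]  # drop the file extension
            assets.append({
                "asset_name": f"{region}_{item_id}",  # unique across regions
                "id_no": item_id,                     # original item id
            })
    return assets


# Hypothetical inputs, for illustration only.
assets = build_assets({
    "region_a": ["x10_y20.tif", "x11_y20.tif"],
    "region_b": ["x11_y20.tif", "x12_y20.tif"],
})
asset_names = [a["asset_name"] for a in assets]

# Local stand-in for the aggregate histogram on the "id_no" field.
histogram = Counter(a["id_no"] for a in assets)
shared_tiles = sorted(i for i, count in histogram.items() if count > 1)
```

Every asset name is now unique, so nothing gets overwritten, while the histogram still shows which original ids appear in more than one region.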
The good news: I was able to fix this. While there are straightforward recommendations, dataset and catalog curation can benefit significantly from this type of additional information. Moving forward, our goal is to capture and highlight such insights, fostering a collaborative environment where users and dataset managers can learn together and propel our collective knowledge forward 📚🤝🚀. The updated and merged collections are now accessible to all, ready to empower your geospatial endeavors.