Microsoft Building Footprints in GEE: Revisiting Scale & Accessibility
Building footprints are probably one of the most visible modifications of the natural landscape. Building types vary, but the built class and the overall patterns of how these spaces evolve are tied to human action, growth, and decay. Mapping human settlements is not new, and everything from Night Lights to OpenStreetMap has generated massive feature maps that have helped expand this understanding. These maps are inherently tied to census features, roads, and distribution networks and serve as an effective means of understanding migration.
So why an entire story for a dataset, you might ask? After all, we have been running the community catalog for over a year now, with 100+ TB and 850+ dataset types. The story goes back to accessibility and what it took to navigate from in-memory reads to streaming GeoJSONs to understanding what really fits the user experience.
The MSBuildings dataset that I have ingested into Google Earth Engine includes earlier releases in addition to Microsoft’s 777 million global building footprints and, in its final state, stands at over 1 billion footprints (1,069,059,359). This is perhaps the largest vector ingest I have done, and the three subfolders cover the United States, Indonesia, and Nigeria. In addition, datasets such as Canada and Australia were merged into single vector composites.
High-Resolution Building Footprint
Large-scale building footprint mapping was an exercise in diving deep into a finer-grained understanding of building shape, size, and patterns. With OpenStreetMap, users were able to digitize the buildings of their choice and improve the richness of existing databases on building information. You can download OSM extracts using sites like the BBBike extractor or Geofabrik. Unfortunately, while many neighborhoods, municipalities, and local governments also hold datasets like this, there is no cohesive and easy approach to gathering them.
So in 2018, Microsoft started experimenting with releasing building footprints under an open license, a release covered by the likes of the New York Times. It was one of the most significant single-approach releases of a building dataset, and Microsoft followed it with releases across multiple countries and regions, along with further updates and release notes.
Google also released a dataset covering the continent of Africa in 2021, marking one of the most significant releases under a CC-BY 4.0 license, as part of Google’s Open Buildings initiative.
Performance across different approaches and models varies, as expected, and the type of data you need or want to use depends on the application in mind.
Microsoft Global Building Footprint Dataset
The MS global dataset contains 777M buildings derived from Bing Maps imagery captured between 2014 and 2021, including Maxar and Airbus imagery. It does not include earlier releases like Canada, the US, Australia, and so on, and its sheer size makes direct use and application difficult. It turns out you can search for all their releases simply using GitHub keywords like this. Not only that, you can get to the download links pretty easily.
Understanding the Source data: Sort and Split
For those who have worked with a few of these datasets, here are a few quick observations:
- The size of data subsets can range from a few KB to multiple GB. While we have come a long way to reduce client-side dependency, this could still challenge those with limited hardware and bandwidth.
- The global ML release links to earlier releases that can be aggregated to ensure the collection is truly global and complete.
- While some datasets are released as GeoJSON, others are released as line-delimited GeoJSON (GeoJSONL). The zipped downloads are already large enough to strain limited hardware, and the unzipped extracts are massive vector files. Hence the need to sort and split datasets.
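To make the sort-and-split step concrete, here is a minimal sketch in Python using only the standard library. It assumes the input is a GeoJSONL file with one feature per line; the function names, the part-file naming pattern, and the default chunk size are illustrative choices, not the pipeline actually used for this ingest.

```python
import json
from pathlib import Path


def split_geojsonl(src, out_dir, max_features=100_000):
    """Split a line-delimited GeoJSON file into parts of at most max_features lines."""
    src = Path(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    parts = []   # paths of the part files written so far
    out = None   # handle of the part currently being written
    count = 0    # features written to the current part
    try:
        with src.open("r", encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if not line:
                    continue  # skip blank lines
                json.loads(line)  # fail early on a malformed feature
                if out is None or count >= max_features:
                    if out is not None:
                        out.close()
                    # Hypothetical naming pattern: <stem>_part0000.geojsonl
                    part_path = out_dir / f"{src.stem}_part{len(parts):04d}.geojsonl"
                    out = part_path.open("w", encoding="utf-8")
                    parts.append(part_path)
                    count = 0
                out.write(line + "\n")
                count += 1
    finally:
        if out is not None:
            out.close()
    return parts


def sort_by_size(paths):
    """Return paths ordered from smallest to largest file size."""
    return sorted(paths, key=lambda p: Path(p).stat().st_size)
```

Because the file is read one line at a time, memory use stays small no matter how large the unzipped extract is. Sorting the parts by size afterwards lets you try the smallest ones first before committing to the long-running uploads.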
Google Earth Engine & Vector Spaces
Google Earth Engine is powerful for raster analysis and has periodically improved vector capabilities, including releases such as FeatureView for quick tile rendering. However, table or vector ingests are limited to specific file types, including CSVs and Shapefiles. This means you have to convert the source files into one of the supported formats.
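As a rough illustration of that conversion, the sketch below turns a GeoJSONL file into a CSV with one row per feature, storing each geometry as a GeoJSON string. It assumes your ingest settings accept a GeoJSON geometry string in a column; the column name `geometry` and the function name are placeholders, so check the current Earth Engine table upload documentation before relying on them.

```python
import csv
import json


def geojsonl_to_csv(src_path, csv_path, geometry_column="geometry"):
    """Write one CSV row per feature; geometry is stored as a GeoJSON string."""
    # First pass: collect every property name so the CSV header is complete.
    fields = []
    with open(src_path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                for key in (json.loads(line).get("properties") or {}):
                    if key not in fields:
                        fields.append(key)

    # Second pass: stream features into the CSV without loading the whole file.
    rows = 0
    with open(src_path, encoding="utf-8") as fh, \
            open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.DictWriter(out, fieldnames=fields + [geometry_column])
        writer.writeheader()
        for line in fh:
            if not line.strip():
                continue
            feature = json.loads(line)
            row = dict(feature.get("properties") or {})
            # geometry_column is an assumed name, not an Earth Engine requirement.
            row[geometry_column] = json.dumps(feature["geometry"])
            writer.writerow(row)
            rows += 1
    return rows
```

Reading the file twice costs extra I/O, but it keeps memory flat and produces a stable header even when features carry different property sets.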
There are some interesting performance behaviors during ingestion.
- Ingest times are not necessarily linear in file size; complex geometries seem to take longer to ingest, though this was not consistent enough to generalize.
- Large datasets were split into smaller subsets and ingested.
- Once the ingestion was completed, sub-parts in a folder could be merged, flattened, and exported with varying success.
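The merge-and-flatten step can be sketched with the Earth Engine Python API roughly as follows. This is an assumption-laden outline rather than the exact workflow used here: the function name, the asset IDs, and the task description are hypothetical, and the `ee` module is passed in as an argument only so the sketch can be read and exercised without credentials. In practice you would `import ee` and call `ee.Initialize()` first.

```python
def merge_parts(ee, part_asset_ids, merged_asset_id,
                description="merge_building_parts"):
    """Flatten several ingested table assets into one and export it as a new asset.

    ee              -- the initialized Earth Engine module (import ee; ee.Initialize())
    part_asset_ids  -- asset IDs of the ingested sub-parts (hypothetical values)
    merged_asset_id -- destination asset ID for the merged table (hypothetical)
    """
    # Load each sub-part as its own FeatureCollection.
    parts = [ee.FeatureCollection(asset_id) for asset_id in part_asset_ids]

    # A collection of collections, flattened into one collection of features.
    merged = ee.FeatureCollection(parts).flatten()

    # Queue a batch export of the merged table to a new asset.
    task = ee.batch.Export.table.toAsset(
        collection=merged,
        description=description,
        assetId=merged_asset_id,
    )
    task.start()
    return task
```

The export runs as a batch task on the server, and for very large merges it can fail or time out, which matches the "varying success" noted above. Keeping the list of part IDs explicit makes it easy to retry with a smaller group.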
Access to Accessibility
Accessibility goes beyond just making the data available; it can be limited by what you can and cannot do with the data. The attempt here was to make the dataset available to at least some users and to capture the steps needed to create your own pipeline. For now, you can read about this dataset and access it here.
At 201 objects, the folder’s total size, including all datasets, is 62.03 GB, which may be due to compression and optimization within Google Earth Engine. I am eager to see if users create their own subset FeatureView assets, a capability that came out a few weeks ago. The entire process was a fascinating deep dive into challenges in data subsetting, flattening, task failures, and reattempts. In the end, you have a complete Global ML Building Footprint dataset from Microsoft within Google Earth Engine, and there is more to come in the awesome GEE community datasets catalog.
If you like the effort, star the project to get more updates and support the community. You can also follow me on Twitter for more frequent updates.