Behind the Scenes: The Role of NSF’s Jetstream2 in Building the Awesome GEE Community Catalog
Delve into the world of data preprocessing as we demystify what happens behind the scenes at the Awesome GEE Community Catalog. We’ll look at how Jetstream2, a National Science Foundation computing resource, plays a vital role in making geospatial data more accessible and in revolutionizing research workflows via the community catalog.
We’ll trace the evolution of the community catalog, revealing how it has grown over time into a valuable resource for geospatial research.
Community and collaboration are at the core of this ecosystem. Explore the collaborative efforts, partnerships, and contributions that drive the Awesome GEE Community Catalog. We’ll also discuss NSF’s role in promoting open science through transformative projects like Jetstream2.
Evolution of the Awesome GEE Community Catalog
It all started approximately three years ago with a single dataset sourced from the High-Resolution Settlement Layers provided by Facebook. This humble beginning marked the inception of what would soon become an open data catalog, a valuable resource for the geospatial community. You can read about our approach to community commons and our early start here. Fast forward to today, and this experiment has flourished into an impressive project, boasting over 320 terabytes of datasets and encompassing more than 1300 unique datasets with over a million images and over a billion features.
Behind this transformation lies a commitment to fostering a geospatial data commons. The concept was elegantly simple: create a platform for datasets, and let the community nurture its growth. Over time, what began as the development of a single tool to fetch a lone dataset has evolved into a sophisticated system, encompassing over 120 thousand lines of code. This codebase facilitates seamless interactions with datasets, automates processes, handles downloads, and conducts preprocessing, all to ensure that the community catalog remains a valuable resource for geospatial researchers.
Throughout this journey, collaboration has been key. Individual data contributors have played a vital role in enriching the catalog, while tools have been crafted to assist organizations in sharing their datasets with ease. The process of building this community has been an ongoing effort, reflecting a shared commitment to advancing geospatial research.
Behind the Scenes of Data Preprocessing
Behind the scenes, the intriguing journey of data preprocessing unfolds, commencing with data requests or proactive updates and additions to the catalog. This process transcends the conventional approach of archiving data on personal websites or within data centers, aligning with the mandates of many funding agencies. For instance, anyone with geospatial data can initiate the journey by submitting their dataset for inclusion in the community catalog.
What sets this process apart is its inclusivity. You need not be the dataset author to contribute; if you’re utilizing specific products within your projects, you can suggest their inclusion. The fundamental idea is to pool these datasets together, beginning by selecting those that a few contributors have uploaded or preprocessed to some degree. This involves a meticulous verification process that encompasses license types, data sources, and more. The overarching goals are twofold: to enhance data accessibility and to significantly boost the reproducibility of scientific research.
To facilitate this intricate journey, an extensive arsenal of custom scripts has been developed, comprising over 120,000 lines of code. These scripts serve as the backbone, capable of handling datasets with varying levels of complexity. They excel at fetching data from diverse providers like Zenodo, Dryad, Figshare, and others, and at preprocessing and optimizing datasets to ensure seamless integration into the community catalog. All of this runs on NSF-funded Jetstream2 VMs managed through Exosphere and is backed by an impressive amount of always-on compute resources, which we will explore soon.
Tasks range from reprojecting data to conducting batch LZW compressions and format conversions, all seamlessly executed by these scripts. Moreover, they play a crucial role in keeping the community catalog continuously updated with the latest data. Finally, the datasets are staged in an intermediate Google Cloud Storage (GCS) bucket and ingested into Google Earth Engine from there.
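The reprojection, compression, and ingestion steps described above can be sketched roughly as follows. This is a minimal illustration, not the catalog’s actual scripts: the function name build_preprocess_commands, the file paths, the bucket name, the asset ID, and the EPSG:4326 target projection are all assumptions made for the example. The sketch only assembles command lines for the gdalwarp, gsutil, and earthengine command-line tools; nothing runs until the lists are handed to something like subprocess.run.

```python
import shlex
from pathlib import PurePosixPath


def build_preprocess_commands(src, out_dir, bucket, asset_id, target_srs="EPSG:4326"):
    """Return the commands for one raster, in the order they would run.

    All arguments are hypothetical placeholders chosen for this sketch.
    """
    src = PurePosixPath(src)
    out = PurePosixPath(out_dir) / f"{src.stem}_lzw.tif"
    gcs_uri = f"gs://{bucket}/{out.name}"
    return [
        # Reproject and write an LZW-compressed, tiled GeoTIFF in one pass.
        ["gdalwarp", "-t_srs", target_srs, "-of", "GTiff",
         "-co", "COMPRESS=LZW", "-co", "TILED=YES", str(src), str(out)],
        # Stage the result in the intermediate GCS bucket.
        ["gsutil", "cp", str(out), gcs_uri],
        # Ask Earth Engine to ingest the staged file as an image asset.
        ["earthengine", "upload", "image", f"--asset_id={asset_id}", gcs_uri],
    ]


if __name__ == "__main__":
    commands = build_preprocess_commands(
        "/data/in/tile_01.tif",                 # hypothetical input raster
        "/data/out",                            # hypothetical scratch directory
        "my-staging-bucket",                    # hypothetical GCS bucket
        "projects/my-project/assets/tile_01",   # hypothetical asset ID
    )
    for cmd in commands:
        print(shlex.join(cmd))
```

Keeping command construction separate from execution makes it easy to log or dry-run a batch before committing compute time and storage to it.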
For those seeking a deeper dive into this intricate journey, a Medium article and a presentation from the Geo for Good 2022 event provide detailed information to complement this narrative.
These steps ensure that datasets are not only accessible but also seamlessly integrated into the geospatial research ecosystem. Over the next few paragraphs, I am going to dive into Jetstream2’s role and setup and how this functions as the powerhouse behind a lot of this work.
Jetstream2’s Role in Geospatial Data Accessibility
Jetstream2 is a cloud-based on-demand computing and data analysis resource. Its significance extends beyond the boundaries of traditional high-performance computing, offering an accessible and user-friendly environment tailored to meet the evolving needs of researchers. This NSF-funded initiative plays a pivotal role in building the GEE community catalog and in supporting other researchers’ transformative research endeavors. This accessibility empowers a diverse array of institutions, including small colleges, historically black colleges and universities, minority-serving institutions, tribal colleges, and higher education institutions in EPSCoR states.
You can apply for access here. It’s allocated via ACCESS and all allocations are free to US-based researchers doing open, publishable science.
Jetstream2 extends its capabilities to encompass a broader range of hardware and services, accommodating diverse research needs. It serves as a valuable resource not only for individual researchers but also for gateway projects and other “always on” services, enhancing the collaborative nature of geospatial research. It does so by providing a user-friendly interface and a suite of features, including interactive virtual machines, secure data movement, and virtual desktops.
The National Science Foundation (NSF) has made significant investments toward making high-performance computing more accessible through projects like Stampede2, Frontera, and Jetstream2, to name a few. These resources have not only reduced the cost barriers associated with cutting-edge research applications but have also played a critical role in supporting large-scale research and development.
Jetstream2 Behind the Scenes
At the heart of Jetstream2’s accessibility is Exosphere, a user-friendly interface that simplifies resource management. Researchers can easily create and manage instances, volumes, and persistent IP addresses through Exosphere. It also provides convenient tools such as a one-click web shell (terminal) in your browser, a one-click desktop environment for running graphical software, and a browser-based file upload/download tool. For newcomers to Jetstream2 or those seeking a straightforward resource management solution, Exosphere is the ideal starting point.
Jetstream2 goes beyond simplicity by offering custom application-specific virtual machine (VM) configurations. It supports CPU and GPU nodes, providing researchers with tailored options to suit their specific computational needs. Researchers can submit allocation proposals, obtain approvals, and convert allocations into desired configurations, whether storage volumes or GPU hours. This flexibility ensures that Jetstream2 caters to a wide range of research requirements.
Jetstream2’s capabilities extend to its impressive connectivity. With a 100 Gbps network connection from compute hosts to the cloud’s internal network infrastructure and 2x100 Gbps uplinks from the cloud infrastructure to the data center infrastructure, it ensures rapid data transfer and computational performance.
Additionally, it boasts 100 Gbps connectivity to the Internet2 backbone and 100 Gbps connectivity to the XSEDE research network via virtualized links. These robust network connections empower researchers to seamlessly transfer large datasets and run computationally intensive applications without encountering performance bottlenecks.
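To put the 100 Gbps figures above in perspective, here is a small back-of-the-envelope calculation. It is a sketch under stated assumptions: the function name transfer_time_seconds and the efficiency parameter are illustrative, terabytes are treated as decimal (10^12 bytes), and the result is a theoretical lower bound, since real transfers are also limited by disks, protocols, and the remote end.

```python
def transfer_time_seconds(size_terabytes, link_gbps, efficiency=1.0):
    """Ideal time to move size_terabytes over a link_gbps link.

    efficiency is a hypothetical knob (0 < efficiency <= 1) for the fraction
    of the nominal line rate that a transfer actually sustains.
    """
    if size_terabytes < 0:
        raise ValueError("size_terabytes must be non-negative")
    if link_gbps <= 0:
        raise ValueError("link_gbps must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    bits = size_terabytes * 1e12 * 8          # decimal terabytes to bits
    bits_per_second = link_gbps * 1e9 * efficiency
    return bits / bits_per_second


if __name__ == "__main__":
    for label, size_tb in [("30 TB", 30), ("320 TB", 320)]:
        seconds = transfer_time_seconds(size_tb, 100)
        print(f"{label}: {seconds:.0f} s ({seconds / 3600:.2f} h) at 100 Gbps")
```

At the full line rate, 30 TB would take 2,400 seconds (40 minutes) and 320 TB would take 25,600 seconds (about 7.1 hours). Lowering efficiency to a value such as 0.5 doubles both numbers and gives a more conservative planning estimate.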
Efficient storage management is critical in geospatial computation, and Jetstream2 excels in this aspect. Under the project allocation, the community catalog gets approximately 30 TB of transient storage, distributed across multiple volumes that can be attached to or detached from instances as needed. What sets Jetstream2 apart is its hot swap capability, which allows users to detach a volume from an active instance and attach it to another, enhancing operational efficiency and data accessibility.
Geospatial Accessibility and Open Science
In the ever-evolving landscape of geospatial research, the fusion of open data, cutting-edge supercomputing, and high-performance clusters has propelled us into an era of unprecedented possibilities. At the forefront of this transformative journey stand programs like the NSF-funded Jetstream2, part of the National Science Foundation’s commitment to making research data not just accessible but also highly usable. You can cite Jetstream2 using the citation information here.
In the year of Open Source data and science, 2023, the principles of FAIR (Findability, Accessibility, Interoperability, and Reuse) resonate profoundly. These principles underpin our efforts, guiding us in our mission to build a geospatial data ecosystem that empowers researchers and advances the frontiers of knowledge.
The Awesome GEE Community Catalog, a testament to collaborative effort and innovation, embodies these principles in its very essence. You can also cite the community catalog using the citations available here.
Reflecting on the complexity of this endeavor keeps us humble. The scale of data, the intricacies of preprocessing, and the seamless usability of the catalog are a testament to the dedication of the geospatial community. As I often say, “Communities are what communities build together.” This sentiment embodies the spirit of collaboration, echoing the harmonious synergy of researchers, contributors, and organizations who have collectively constructed a resource that continues to drive geospatial research forward.