Physical asset management has well-established practices for managing physical quality through all stages of the asset lifecycle, from supply chain procurement to decommissioning. Though we speak of data as an asset, data quality practices for the data related to those assets often don't follow the same discipline. Great Expectations is a way to apply those principles, playing the role of the 'inspector' through the life cycle of the data asset.
Much of the data about assets relates to a location, either a physical location on earth or a position in a local coordinate system of the asset, a site, etc. The inspiration was to create a set of expectations that would allow for basic quality management of geometry data, both specific to geography and general to an arbitrary geometry. And while the inspiration came from asset management, the methods are general and can be used for consumer or device data, or for scientific or engineering data presented in a Cartesian coordinate system.
What it does
There are 7 expectations that validate a column of geometries given as "Well Known Text" (WKT), "Well Known Binary" (WKB), or X-Y (Longitude-Latitude) point tuples.
- ExpectColumnValuesGeometryDistanceToAddressToBeBetween (PR 4652)
- ExpectColumnValuesGeometryToBeWithinPlace (PR 4626)
- ExpectColumnMinimumBoundingRadiusToBeBetween (PR 4688)
- ExpectColumnValuesGeometryCentroidsToBeWithinShape (PR 4625)
- ExpectColumnValuesGeometryToBeNearShape (PR 4625)
- ExpectColumnValuesGeometryToBeWithinShape (PR 4625)
- ExpectColumnValuesGeometryToIntersectShape (PR 4625)
Most of the expectations are column map expectations, checking each geometry individually. The first 2 are specific to geocoding: they query a shape or a location from a geocoder (e.g. OpenStreetMap's Nominatim) and check whether the points are contained within the shape or lie within a specified distance of the location. 4 through 7 check that the given geometries interact in some way with a reference shape. These are generic and work in whatever coordinate system the user supplies. Within is useful for setting boundaries on points, whereas intersection applies more to lines and polygons that may extend well outside the reference shape. Using reference shapes is also best practice when working with geocoders. Heavy users of GE will make many calls to the geocoding service, and if an open source one is selected, it is best to cache the returned geometry locally rather than submit the same query multiple times or fire off many sequential queries.
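The caching advice can be sketched with the Python standard library alone. In this minimal example, `fetch_place_shape` is a hypothetical stand-in for a real geocoder request (it returns a fixed WKT string rather than querying Nominatim), and `functools.lru_cache` ensures that each distinct query reaches it only once.

```python
from functools import lru_cache

# Counts how many times the stand-in geocoding service is actually hit.
service_calls = {"count": 0}


def fetch_place_shape(query: str) -> str:
    """Hypothetical stand-in for a geocoder request; returns a fixed WKT polygon."""
    service_calls["count"] += 1
    return "POLYGON ((0 0, 10 0, 10 10, 0 10, 0 0))"


@lru_cache(maxsize=256)
def cached_place_shape(query: str) -> str:
    """Return the shape for a query, calling the service only on a cache miss."""
    return fetch_place_shape(query)


# Repeated validations against the same place reuse the cached geometry.
for _ in range(100):
    shape_wkt = cached_place_shape("Some Place")
```

An in-memory cache like this only lasts for the life of the process; writing the returned geometry to a local file would be the natural extension for repeated validation runs.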
How we built it
The goal was to get a broad selection of related expectations, rather than a single polished expectation. The expectations are only implemented for pandas to start.
Challenges we ran into
The challenge is knowing how users prefer to present the input data. A typical DB may have latitude and longitude as separate columns, or may present them as tuples. However, the convention in PostGIS and GIS-enabled MSSQL is to have a geometry column and to query it as WKT or WKB when it is being exported. These formats present the coordinates longitude first, which is the reverse of how they are given colloquially (latitude, then longitude). Ultimately, the user is responsible for the data formatting, but providing more options for input is an area that could be improved upon in the refinement of these expectations.
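To make the ordering issue concrete, the sketch below encodes one point in each of the three input formats using only the Python standard library. The coordinates and the helper names (`point_to_wkt`, `point_to_wkb`) are illustrative assumptions, not part of the expectations' API. The WKB layout shown is the standard little-endian 2D point: a byte-order flag, geometry type 1, then X and Y as doubles.

```python
import struct

# A location as it is often spoken or stored: latitude first.
latitude, longitude = 40.7128, -74.0060

# X-Y tuple: X is longitude, Y is latitude, so the order is swapped.
xy_point = (longitude, latitude)


def point_to_wkt(x: float, y: float) -> str:
    """Format a 2D point as Well Known Text (X first, then Y)."""
    return f"POINT ({x} {y})"


def point_to_wkb(x: float, y: float) -> bytes:
    """Encode a 2D point as little-endian Well Known Binary."""
    # "<" = little-endian, B = byte-order flag (1 = little-endian),
    # I = geometry type (1 = Point), dd = X and Y as 8-byte doubles.
    return struct.pack("<BIdd", 1, 1, x, y)


wkt_point = point_to_wkt(*xy_point)
wkb_point = point_to_wkb(*xy_point)
```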
Accomplishments that we're proud of
The ExpectColumnValuesGeometryDistanceToAddressToBeBetween is a very flexible expectation that can be used in many industries for many different purposes. The user can supply a simple mapping text query as input and get back an array of distance tests from a variety of point input formats. GeoPy also has broad support for geocoders, which allows the user to call paid services through API keys.
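The check at the heart of a distance expectation can be sketched as follows. This is not the expectation's implementation: the reference point is hard-coded where a geocoder result would normally be used, the haversine formula with a mean Earth radius of 6371.0088 km is an assumed distance method, and the function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumption of this sketch


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two (longitude, latitude) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distances_between(points, reference, min_km, max_km):
    """Return one True/False per point: is it between min_km and max_km from reference?"""
    ref_lon, ref_lat = reference
    return [
        min_km <= haversine_km(lon, lat, ref_lon, ref_lat) <= max_km
        for lon, lat in points
    ]


# Stand-in for a geocoded address, as (longitude, latitude).
reference_point = (0.0, 0.0)
column = [(0.0, 0.5), (0.0, 1.0), (3.0, 0.0)]
results = distances_between(column, reference_point, min_km=0, max_km=120)
```

The first two points lie within 120 km of the reference and the third does not, so the per-row results mirror what a column map expectation reports for each geometry.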
The ExpectColumnValuesGeometryToBeWithinPlace is similar, but instead of a point it allows the return of a polygon area, which is useful for querying countries, states, counties, cities, and geographical features or landmarks like lakes or islands. It becomes a very flexible tool that opens up possibilities for a variety of use cases.
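A containment test against a returned polygon can be illustrated with the classic ray-casting algorithm. This is a simplified sketch for a single ring without holes, with hypothetical names and a made-up square standing in for a geocoded place; the expectations themselves rely on a geometry library rather than hand-written code like this.

```python
def point_in_ring(point, ring):
    """Ray casting: True if the point lies inside the closed ring of (x, y) vertices.

    The ring repeats its first vertex at the end, as in WKT. Points exactly
    on the boundary are not handled reliably by this simple version.
    """
    x, y = point
    inside = False
    # Walk each edge (x1, y1) -> (x2, y2) of the ring.
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        # Does the edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # X coordinate where the edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            # Count only crossings to the right of the point.
            if x < x_cross:
                inside = not inside
    return inside


# Made-up square standing in for a place boundary returned by a geocoder.
place_ring = [(0, 0), (10, 0), (10, 10), (0, 10), (0, 0)]
column = [(5, 5), (15, 5), (-1, -1)]
results = [point_in_ring(p, place_ring) for p in column]
```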
What we learned
Getting a fully functional expectation that interacts with all areas of GE can be complex. These expectations need to be implemented in Spark and SQLAlchemy, and also need renderers for documentation. If these prove to be useful, they will also benefit from integration into profilers, etc., so it can become a complicated landscape to implement a well-rounded expectation.
I also had not previously contributed to a widely used open-source project. This was a first and there is definitely a learning curve that I'm happy to be climbing.
If these expectations prove to be useful, implementing the Spark and SQL versions and adding renderers would be the first step to open up user access. The negative versions may also be particularly useful to some (e.g. NOT within a specific place or shape). Column min and max aggregate expectations may also prove useful. Lastly, pygeos runs quickly and performs better than Shapely; however, those projects are being merged, and Shapely 2.0 will be the main interface to these functions. Once that has a stable release, moving the library to Shapely will be preferred.