Improved Automated QA for Recipes and Products #983
Replies: 1 comment
random thoughts:

- definitely like the focus on what columns a product needs, aka a product's expectations
- is a significant amount of this pain caused by the current
- I imagine this could go too far, but I think I like the general approach of something failing during a build rather than investing too much in pre-build checks. Definitely still love declaring expectations though!
- on the topic of yml files for modeling, here's an example of declaring tests for certain columns in a dbt project. that one only uses a couple of the built-in tests but there's tons of others
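To make the "fail during the build" idea concrete, here is a rough Python sketch loosely modeled on three of dbt's built-in generic tests (`not_null`, `unique`, `accepted_values`). It is not dbt itself and not how any existing build here works: the helper names, `BuildCheckError`, and the `facid` column are hypothetical, and rows are assumed to be plain dicts.

```python
class BuildCheckError(Exception):
    """Raised to stop a build when any declared column check fails."""


def not_null(rows, column):
    # Indexes of rows where the column is missing or empty.
    return [i for i, row in enumerate(rows) if row.get(column) in (None, "")]


def unique(rows, column):
    # Indexes of rows whose value was already seen in an earlier row.
    seen, dupes = set(), []
    for i, row in enumerate(rows):
        value = row.get(column)
        if value in seen:
            dupes.append(i)
        seen.add(value)
    return dupes


def accepted_values(rows, column, values):
    # Indexes of rows whose value is outside the allowed set.
    return [i for i, row in enumerate(rows) if row.get(column) not in values]


def run_checks(rows, checks):
    """Run every named check; raise once with all failures collected."""
    failures = {}
    for name, check in checks.items():
        bad_rows = check(rows)
        if bad_rows:
            failures[name] = bad_rows
    if failures:
        raise BuildCheckError(failures)


# Hypothetical declaration of expectations for one output table.
CHECKS = {
    "facid not_null": lambda rows: not_null(rows, "facid"),
    "facid unique": lambda rows: unique(rows, "facid"),
    "boro accepted_values": lambda rows: accepted_values(rows, "boro", {"BK", "MN"}),
}
```

Calling `run_checks(rows, CHECKS)` as a step inside the build would stop it with every failing check listed at once, rather than relying on a separate pre-build pass.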
In building FacDB, I ran into a few issues which were trivial to fix, but nonetheless took time to diagnose. They might provide a nice framework to discuss potential improvements.
Here's a sampling of issues I encountered building FacDB:

- `USECODE`s were corrupted on Bytes: they should have been 4 digit numerics, but were cast to numbers. E.g. "0211" -> 211.
- `edm-publishing` are out of our control.
- `borough` in `dsny_electronicsdrop`)
- `zip_code` in `dsny_fooddrop` was sometimes being read in as a float in pandas)
- `dot_parking` ballooned in size and changed completely. It would have been nice to catch this prior to import to EDM Recipes, rather than noticing the discrepancy in the output, and then having to purge the bad data from S3 (I could easily have forgotten to do that).

What I'm thinking for next steps: Perhaps model out the required/important fields in FacDB for 1) a subset of recipes 2) for the output to edm-publishing. The output of this exercise would be some declarative format about expectations (probably fields with types modeled in yml) which we'd then use to write automations to detect and potentially coerce out-of-spec data into a usable format. Modeling might be a nice way to indicate which columns actually matter at the periphery. E.g. at ingestion time, if we no longer have the `boro` field in `dsny_electronicsdrop`, does that actually matter?

There are some neat libraries (e.g. Cerberus) that we might make use of, though I think it would be nice to get our feet wet before making a decision about them.
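As a way of getting our feet wet, here is a minimal, dependency-free sketch of what "declare expectations, then detect and coerce" could look like. Everything in it is an assumption for illustration: the `EXPECTATIONS` dict stands in for a yml file, the spec keys (`width`, `required`) and function names are made up, and rows are plain dicts rather than a pandas DataFrame. It does not use Cerberus.

```python
# Hypothetical expectations; in practice these might be loaded from yml.
EXPECTATIONS = {
    "usecode": {"width": 4, "required": True},
    "zip_code": {"width": 5, "required": False},
}


def coerce_code(value, width):
    """Turn 211, 211.0 or "211" back into a zero-padded string like "0211"."""
    if isinstance(value, float) and value.is_integer():
        value = int(value)  # 11201.0 -> 11201, so no trailing ".0" survives
    text = str(value).strip()
    if not (text.isascii() and text.isdigit()) or len(text) > width:
        raise ValueError(f"cannot coerce {value!r} to a {width}-digit code")
    return text.zfill(width)


def apply_expectations(rows, expectations):
    """Return (clean_rows, problems) without mutating the input rows."""
    clean_rows, problems = [], []
    for i, row in enumerate(rows):
        clean = dict(row)
        for column, spec in expectations.items():
            value = row.get(column)
            if value is None or value == "":
                if spec["required"]:
                    problems.append(f"row {i}: missing required column {column!r}")
                continue
            try:
                clean[column] = coerce_code(value, spec["width"])
            except ValueError as err:
                problems.append(f"row {i}: {column}: {err}")
        clean_rows.append(clean)
    return clean_rows, problems
```

With this sketch, a row like `{"usecode": 211, "zip_code": 11201.0}` would come back as `{"usecode": "0211", "zip_code": "11201"}`, while values that cannot be repaired are reported in `problems` instead of being silently passed through. On the pandas side, reading such columns as strings in the first place (for example `pd.read_csv(path, dtype={"usecode": str})`) avoids the leading-zero loss before any coercion is needed.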
Thoughts, @fvankrieken, @damonmcc? Would love to hear about your pain points as well. If it'd be easier, we could just huddle and jot down some notes.