
add product conversion tests #53

Open
wants to merge 1 commit into base: add-complex-tests
Conversation

renancunha
Contributor

Similar to #52, this PR should not be merged into master. It was created to try to come closer to a real-world scenario, where we use shublang to do data conversion between schemas.

Initially, a specification (a set of rules) was created to transform the data contained in the file tests/resources/unified_schema_product.json into a custom product schema that is used at BV. We tried to cover all the fields present in the custom schema, according to the availability of data in the source schema.

Below are some considerations:

  1. We should start to think about functionality to solve Convert Free Form Text to Enum #26. It's very common to have to map free-form text to a set of predefined values (an Enum). In this specific example, we had to map currency symbols (like £) to currency abbreviations (like GBP), and also to map the stock situation (InStock, OutOfStock, etc.). Doing this kind of thing manually, using map, can get really weird, unreadable, and hard to maintain.

  2. We need a pipe to convert data to string. There are some fields that are integers in the unified-schema product but should be strings in the customer schema, like all the fields related to reviews.

  3. When we do a filtering operation (or anything else that results in a list) with jmespath, we need to take the first element twice to get exactly the data that we want. As an example, when we try to get a given property from the additionalProperty list, we can do that in two ways:

    • jmespath("additionalProperty[?name=='colour'].value | [0]") | first or
    • jmespath("additionalProperty[?name=='colour'].value") | first | first

    In these cases, we will need some kind of convention on which approach we should adopt.
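To make the equivalence in point 3 concrete, here is a dependency-free sketch. The list comprehension below simulates what `jmespath("additionalProperty[?name=='colour'].value")` returns (a list of matches), and the `first` helper is a stand-in for the shublang `first` pipe; the two approaches differ only in where the flattening happens.

```python
item = {
    "additionalProperty": [
        {"name": "colour", "value": "red"},
        {"name": "size", "value": "M"},
    ]
}

def first(seq):
    # shublang-style `first`: head of an iterable, None if empty
    return next(iter(seq), None)

# what jmespath("additionalProperty[?name=='colour'].value") yields:
# a list of every matching value
matches = [p["value"] for p in item["additionalProperty"] if p["name"] == "colour"]

# Approach 1: flatten inside the expression -- the "| [0]" projection
# is simulated here by indexing into the match list
value_a = matches[0]

# Approach 2: flatten outside, by applying `first` to the result list
value_b = first(matches)

assert value_a == value_b == "red"
```

Both approaches land on the same scalar, which is why a convention is needed: the choice is purely stylistic.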

It would be great to have discussions about the expressions created, possible optimizations, different approaches to get the same results, expressions complexity, etc.

@akshayphilar
Member

@ivankivanov could you please have a look at the missing fields and see if these can be mapped back to our schema
https://github.com/scrapinghub/shublang/pull/53/files#diff-ce8c62e2dd02132083d2a2d1a43490a8R48-R77

@BurnzZ (Contributor) left a comment


Hi @renancunha , you're off to a great start here! Thanks for putting this out. It really helps visualize where we currently are in terms of conversions. 🎉

In addition to the three (3) main points you've raised, I have the following food for thought (that might also address 2 and 3):

So right off the bat we could actually see that:

  • all data conversions start with jmespath
  • almost all data conversions end with first
  • there was a need to transform a number into a string

I was thinking that we could avoid these superfluous operations and make the type-conversions completely implicit. One way to achieve this is by defining the schema of the item transformation we want.

Such schema would contain:

  • the data type
  • the jmespath of the data source which complies with the Unified Schema

As a quick example, suppose we only need to retrieve a small subset of fields from the unified_schema_product.json you've used:

{
	"price_now": {
		"type": "str",
		"src": "offers[0].price"
	},

	# No need to do a `map(lambda x: str(x))`
	"rating_current": {
		"type": "str"
		"src": "aggregateRating.ratingValue",
	},

	"reviews_quantity_1_star": {
		"type": "str",

		# This actually would return: [4]
		# but we don't need to
		# - pipe jmespath with "... | [0]",
		# - do a `map(lambda x: str(x))`,
		# - and end it with a `first`
		# because the conversion module we'll be creating
		# would handle it based on the type

		"src": "ratingHistogram[?ratingValue=='1'].ratingPercentage"
	},

	"image_urls": {
		"type": "List[str]",

		# Similar to PEP-585, we can specify the expected elements
		# of any iterables. The conversion module would be able to 
		# adapt to it automatically.

		"src": "images"
	}
}

Notice that we didn't need to write jmespath and first in any of them. :D

This doesn't actually solve the 1st concern you've raised, which tackles #26 . I'll need to brew that one in my brain for a few days. 🤔

In any case, the schema we'll be defining would be used as input, alongside the data source, as you can see below:

[image: diagram of the schema and the data source fed as inputs to the conversion module]

Lemme know what you think. Cheers!
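To make the idea above concrete, here is a minimal sketch of such a conversion module (the function and coercer names are hypothetical, not an actual shublang API, and `lookup` only supports dotted paths so the example stays dependency-free; a real module would use jmespath for `src` resolution):

```python
def lookup(path, data):
    # Hypothetical resolver for simple dotted paths only;
    # a real implementation would call jmespath.search(path, data).
    for part in path.split("."):
        data = data[part]
    return data

# Illustrative coercer table keyed by the schema's "type" entry
COERCERS = {
    "str": str,
    "int": int,
    "List[str]": lambda v: [str(x) for x in v],
}

def convert(schema, item):
    out = {}
    for field, spec in schema.items():
        value = lookup(spec["src"], item)
        # implicit "first": unwrap a list when the target type is scalar
        if isinstance(value, list) and not spec["type"].startswith("List"):
            value = value[0]
        out[field] = COERCERS[spec["type"]](value)
    return out

item = {"aggregateRating": {"ratingValue": 4.3}, "images": ["a.jpg", "b.jpg"]}
schema = {
    "rating_current": {"type": "str", "src": "aggregateRating.ratingValue"},
    "image_urls": {"type": "List[str]", "src": "images"},
}

result = convert(schema, item)
assert result == {"rating_current": "4.3", "image_urls": ["a.jpg", "b.jpg"]}
```

Note how neither jmespath nor first appears in the schema itself: the module infers the flattening and the string conversion from the declared type.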

@renancunha
Contributor Author

Thanks for adding such a great analysis to our discussion here @BurnzZ! I agree with the points you have mentioned. Looking at the "specifications" created to do this transformation above, we see a repetitive pattern of getting the source field with jmespath and then selecting the first element with the first function.

Some considerations:

  • About the need to transform number to string: I'll open a PR adding this functionality soon.
  • About defining the schema of the item transformation we want: I think it makes total sense, and this new schema definition will act as a specification file for the output that we want. There are cases where we also need to apply some rules (shublang functions) to transform the data. Considering these cases, maybe we should also specify in the new schema definition, for each field (if needed), a rules pipeline that our conversion module should execute on the src data. As an example, let's take the category field, which should be mapped from breadcrumbs[*].name and then joined with join(" > "):
"category": {
    "type": "str",
    "src": "breadcrumbs[*].name",
    "rules": "join(' > ')"
}

By doing this we will have some kind of separation of concerns: the src, as its name suggests, will be responsible just for fetching some information from the data source, and the rules will be responsible just for the transformations. I don't know if naming it as rules is the best choice, it was just an idea.

I think that with these three pieces of information (type, src, and rules) for each field, our module will be capable of handling every single case that comes to mind right now.
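A quick sketch of this idea (the names are illustrative, not an actual shublang API): each field spec may carry an optional rules step applied after src resolution and before type coercion. Here the rules pipeline is modeled as a plain callable, rather than a parsed shublang expression, to keep the example self-contained.

```python
def convert_field(spec, value):
    # apply the optional transformation step first...
    if "rules" in spec:
        value = spec["rules"](value)
    # ...then coerce to the declared type
    return {"str": str, "int": int}[spec["type"]](value)

# what src: "breadcrumbs[*].name" would have fetched
breadcrumb_names = ["Home", "Clothing", "Shirts"]

spec = {
    "type": "str",
    # stands in for the shublang pipe join(" > ")
    "rules": lambda names: " > ".join(names),
}

assert convert_field(spec, breadcrumb_names) == "Home > Clothing > Shirts"
```

Fields without a rules entry fall through to plain type coercion, which keeps the common case as terse as in the earlier schema example.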

@BurnzZ
Contributor

BurnzZ commented Aug 25, 2020

The third part would be a great addition @renancunha ! You're absolutely right about the separation of concerns.

I don't know if naming it as rules is the best choice, it was just an idea.

What do you think about calling it transformations? Since a rule needs to actually enforce something, but in this case, we're transforming something.

So far, we have established a clear split among:

  1. type - enforces the data type of the value, be it int, float, List[int], List[str], dict, etc.
  2. src - where the field comes from with respect to the Unified Schema.
  3. transformations - could be optional if we need any further modifications to the current data format or value. But I think it's important to note that the type could handle the most basic of conversions, like if we have a value of 299.00 and it's specified as "type": "str", then it's an automatic conversion into "299.00", without the need for transformations. Another example would be if 299.00 is specified as "type": "int", then it would be converted automatically into 299.

I'm trying to think of anything else we might add, but I think these are the fundamental ones. Anything else might just be special or edge cases for some specific use cases.
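The automatic coercion described in point 3 can be sketched in a couple of lines (the coercer table is illustrative, not an actual shublang API):

```python
# `type` alone handles the basic conversions, no transformations needed
COERCERS = {"str": str, "int": int, "float": float}

def coerce(type_name, value):
    return COERCERS[type_name](value)

assert coerce("int", 299.00) == 299
assert coerce("str", 299.00) == "299.0"
```

One caveat worth noting: Python's str() renders 299.00 as "299.0", dropping the trailing zero, so price-like fields that must keep a fixed number of decimals may still want an explicit formatting transformation.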

Do you think the next step would be for you to try out implementing a basic PoC for these "specifications" we've been discussing? 😄

@renancunha
Contributor Author

What do you think about calling it transformations? Since a rule needs to actually enforce something, but in this case, we're transforming something.

Naming it transformations makes much more sense; people will look at it and immediately know what it is, great catch @BurnzZ. I think we can proceed with this choice.

Do you think the next step would be for you to try out implementing a basic PoC for these "specifications" we've been discussing?

Yeah, I think I can start to work on a PoC for this module, responsible for parsing a schema file (as a tree of fields), field by field, and executing the steps that we've discussed here (get the src, apply the transformations, resolve the result according to the type, etc.). I will just give attention to #26 first, because it will be necessary even in simple schema conversions.

My question is: where should I start implementing this PoC? It would be great to implement it in a repository where we can follow its progress (like we do here).
