Common: add an enrichment extracting canonical properties into dedicated contexts #47

Open
chuwy opened this issue Jun 19, 2020 · 5 comments
Labels
enrichment Add or fix existing enrichment RFC Improvement requires significant design efforts

Comments

@chuwy
Contributor

chuwy commented Jun 19, 2020

In order to refactor atomic events we need to extract all non-generic information from the fat table into dedicated contexts and preserve only common properties. As a first step, we can keep those properties both in the atomic event (as we do now, so as not to break data models) and in their dedicated tables/columns (to start writing new data models).

I tried to summarize which contexts and event-specific properties can be extracted out of the Event:

  1. app_id
  2. platform
  3. etl_tstamp
  4. collector_tstamp
  5. dvce_created_tstamp
  6. event
  7. event_id
  8. txn_id
  9. name_tracker
  10. v_tracker
  11. v_collector
  12. v_etl
  13. user_id
  14. user_ipaddress
  15. user_fingerprint
  16. domain_userid
  17. domain_sessionidx
  18. network_userid
  19. geo_country - MaxMind context
  20. geo_region - MaxMind context
  21. geo_city - MaxMind context
  22. geo_zipcode - MaxMind context
  23. geo_latitude - MaxMind context
  24. geo_longitude - MaxMind context
  25. geo_region_name - MaxMind context
  26. ip_isp - MaxMind context
  27. ip_organization - MaxMind context
  28. ip_domain - MaxMind context
  29. ip_netspeed - MaxMind context
  30. page_url - Web page context (source of truth)
  31. page_title - Web page context (source of truth)
  32. page_referrer - Referrer context (source of truth)
  33. page_urlscheme - Web page context
  34. page_urlhost - Web page context
  35. page_urlport - Web page context
  36. page_urlpath - Web page context
  37. page_urlquery - Web page context
  38. page_urlfragment - Web page context
  39. refr_urlscheme - Referrer context
  40. refr_urlhost - Referrer context
  41. refr_urlport - Referrer context
  42. refr_urlpath - Referrer context
  43. refr_urlquery - Referrer context
  44. refr_urlfragment - Referrer context
  45. refr_medium - Referrer context
  46. refr_source - Referrer context
  47. refr_term - Referrer context
  48. mkt_medium - Marketing campaign context
  49. mkt_source - Marketing campaign context
  50. mkt_term - Marketing campaign context
  51. mkt_content - Marketing campaign context
  52. mkt_campaign - Marketing campaign context
  53. contexts
  54. se_category - Struct event self-describing event
  55. se_action - Struct event self-describing event
  56. se_label - Struct event self-describing event
  57. se_property - Struct event self-describing event
  58. se_value - Struct event self-describing event
  59. unstruct_event
  60. tr_orderid - Ecommerce transaction self-describing event
  61. tr_affiliation - Ecommerce transaction self-describing event
  62. tr_total - Ecommerce transaction self-describing event
  63. tr_tax - Ecommerce transaction self-describing event
  64. tr_shipping - Ecommerce transaction self-describing event
  65. tr_city - Ecommerce transaction self-describing event
  66. tr_state - Ecommerce transaction self-describing event
  67. tr_country - Ecommerce transaction self-describing event
  68. ti_orderid - Ecommerce transaction item context
  69. ti_sku - Ecommerce transaction item context
  70. ti_name - Ecommerce transaction item context
  71. ti_category - Ecommerce transaction item context
  72. ti_price - Ecommerce transaction item context
  73. ti_quantity - Ecommerce transaction item context
  74. pp_xoffset_min - Page ping self-describing event
  75. pp_xoffset_max - Page ping self-describing event
  76. pp_yoffset_min - Page ping self-describing event
  77. pp_yoffset_max - Page ping self-describing event
  78. useragent - Browser context (but populated from different places)
  79. br_name - Browser context (but populated from different places) (ua-utils)
  80. br_family - Browser context (but populated from different places) (ua-utils)
  81. br_version - Browser context (but populated from different places) (ua-utils)
  82. br_type - Browser context (but populated from different places) (ua-utils)
  83. br_renderengine - Browser context (but populated from different places) (ua-utils)
  84. br_lang - Browser context (but populated from different places)
  85. br_features_pdf - Browser context (but populated from different places)
  86. br_features_flash - Browser context (but populated from different places)
  87. br_features_java - Browser context (but populated from different places)
  88. br_features_director - Browser context (but populated from different places)
  89. br_features_quicktime - Browser context (but populated from different places)
  90. br_features_realplayer - Browser context (but populated from different places)
  91. br_features_windowsmedia - Browser context (but populated from different places)
  92. br_features_gears - Browser context (but populated from different places)
  93. br_features_silverlight - Browser context (but populated from different places)
  94. br_cookies - Browser context (but populated from different places)
  95. br_colordepth - Browser context (but populated from different places)
  96. br_viewwidth - Browser context (but populated from different places)
  97. br_viewheight - Browser context (but populated from different places)
  98. os_name - Browser context (but populated from different places) (ua-utils)
  99. os_family - Browser context (but populated from different places) (ua-utils)
  100. os_manufacturer - Browser context (but populated from different places)
  101. os_timezone - Browser context (but populated from different places)
  102. dvce_type - Browser context (but populated from different places) (ua-utils)
  103. dvce_ismobile - Browser context (but populated from different places) (ua-utils)
  104. dvce_screenwidth - Browser context (but populated from different places)
  105. dvce_screenheight - Browser context (but populated from different places)
  106. doc_charset - Web page (or document) context
  107. doc_width - Web page (or document) context
  108. doc_height - Web page (or document) context
  109. tr_currency - Ecommerce transaction self-describing event
  110. tr_total_base - Ecommerce transaction self-describing event
  111. tr_tax_base - Ecommerce transaction self-describing event
  112. tr_shipping_base - Ecommerce transaction self-describing event
  113. ti_currency - Ecommerce transaction item context
  114. ti_price_base - Ecommerce transaction item context
  115. base_currency - Ecommerce transaction self-describing event
  116. geo_timezone - MaxMind context
  117. mkt_clickid - Marketing campaign context
  118. mkt_network - Marketing campaign context
  119. etl_tags
  120. dvce_sent_tstamp
  121. refr_domain_userid - Referrer context
  122. refr_dvce_tstamp - Referrer context
  123. derived_contexts
  124. domain_sessionid
  125. derived_tstamp
  126. event_vendor
  127. event_name
  128. event_format
  129. event_version
  130. event_fingerprint - This should remain in canonical event
  131. true_tstamp

The grouping is not strictly semantic; it is based mostly on the source of the information. For example, although browser/device info is semantically the same kind of information, some of the properties are passed through the tracker protocol and some are derived through user-agent enrichment.
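
The source-based grouping in the list above can be sketched as a lookup from column name to target entity. This is only an illustration: the entity labels (`maxmind`, `web_page`, and so on) and the `target_entity` helper are hypothetical names chosen for this sketch, not existing Snowplow schemas or APIs.

```python
from typing import Optional

# Columns whose target cannot be derived from the prefix alone.
# Entity labels are hypothetical, chosen only for this sketch.
EXACT = {
    "page_referrer": "referrer",  # page_* prefix, but listed under Referrer context
    "useragent": "browser",
    "base_currency": "ecommerce_transaction",
    "dvce_type": "browser",
    "dvce_ismobile": "browser",
    "dvce_screenwidth": "browser",
    "dvce_screenheight": "browser",
}

# Prefix-based mapping, following the annotations in the list.
PREFIXES = {
    "geo_": "maxmind",
    "ip_": "maxmind",
    "page_": "web_page",
    "doc_": "web_page",
    "refr_": "referrer",
    "mkt_": "marketing_campaign",
    "se_": "struct_event",
    "tr_": "ecommerce_transaction",
    "ti_": "ecommerce_transaction_item",
    "pp_": "page_ping",
    "br_": "browser",
    "os_": "browser",
}


def target_entity(column: str) -> Optional[str]:
    """Return the entity a column would move to, or None if it stays in the event."""
    if column in EXACT:
        return EXACT[column]
    for prefix, entity in PREFIXES.items():
        if column.startswith(prefix):
            return entity
    return None  # e.g. event_id, dvce_created_tstamp, true_tstamp
```

Exact names are checked before prefixes because a few columns (`page_referrer`, the `dvce_*` screen and type columns) do not follow their prefix: `dvce_created_tstamp` and `dvce_sent_tstamp` share the `dvce_` prefix but are not annotated with any context in the list.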

Contexts

  • MaxMind context
  • Web page context
  • Referrer context
  • Marketing campaign context
  • Ecommerce transaction item
  • Browser/device context (potentially multiple of them)

Self-describing events

  • Struct event
  • Ecommerce transaction
  • Page ping

Common properties

This leaves us with 31 core properties that can be set for almost all events/pipelines. Maybe some of them (user/device identification) can/should be moved into dedicated contexts.

  1. event_id - event identification
  2. app_id - event identification
  3. event - eventually will be discarded in favor of vendor/name/version
  4. txn_id - event identification
  5. event_vendor - event identification
  6. event_name - event identification
  7. event_format - event identification
  8. event_version - event identification
  9. event_fingerprint - event identification
  10. platform - probably should be moved as well
  11. dvce_created_tstamp - timestamps
  12. dvce_sent_tstamp - timestamps
  13. collector_tstamp - timestamps
  14. etl_tstamp - timestamps
  15. derived_tstamp - timestamps
  16. true_tstamp - timestamps
  17. user_id - user/device identification
  18. user_ipaddress - user/device identification
  19. user_fingerprint - user/device identification
  20. domain_userid - user/device identification
  21. domain_sessionidx - user/device identification
  22. domain_sessionid - user/device identification
  23. network_userid - user/device identification
  24. name_tracker - pipeline/aux
  25. v_tracker - pipeline/aux
  26. v_collector - pipeline/aux
  27. v_etl - pipeline/aux
  28. etl_tags - pipeline/aux
  29. unstruct_event - payload
  30. contexts - payload
  31. derived_contexts - payload
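
As a rough illustration of the split, here is a minimal sketch that partitions a flat event into the 31 core properties listed above and everything else. The `split_event` function and the choice to drop empty values from the extracted part are assumptions made for this example, not part of any existing enrichment.

```python
from typing import Any, Dict, Tuple

# The 31 core properties from the list above.
CORE_PROPERTIES = frozenset({
    # event identification
    "event_id", "app_id", "event", "txn_id", "event_vendor", "event_name",
    "event_format", "event_version", "event_fingerprint", "platform",
    # timestamps
    "dvce_created_tstamp", "dvce_sent_tstamp", "collector_tstamp",
    "etl_tstamp", "derived_tstamp", "true_tstamp",
    # user/device identification
    "user_id", "user_ipaddress", "user_fingerprint", "domain_userid",
    "domain_sessionidx", "domain_sessionid", "network_userid",
    # pipeline/aux
    "name_tracker", "v_tracker", "v_collector", "v_etl", "etl_tags",
    # payload
    "unstruct_event", "contexts", "derived_contexts",
})


def split_event(event: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split a flat event into (core, extracted).

    Assumption for this sketch: empty (None) non-core values are dropped,
    since an empty column would not produce a context.
    """
    core = {k: v for k, v in event.items() if k in CORE_PROPERTIES}
    extracted = {
        k: v for k, v in event.items()
        if k not in CORE_PROPERTIES and v is not None
    }
    return core, extracted


core, extracted = split_event(
    {"event_id": "e1", "app_id": "shop", "geo_city": "Berlin", "mkt_source": None}
)
```

During the transition described in the first step, the extracted values would be written to their dedicated contexts while the original columns stay populated as well.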
@chuwy
Contributor Author

chuwy commented Jun 19, 2020

Migrated from snowplow/snowplow#4244 (comments are auto-generated)

@chuwy chuwy added enrichment Add or fix existing enrichment RFC Improvement requires significant design efforts labels Jun 19, 2020
chuwy added a commit that referenced this issue Jun 7, 2021
@chuwy
Contributor Author

chuwy commented Jun 21, 2021

I've created a spreadsheet, proposing what new contexts and events should look like: https://docs.google.com/spreadsheets/d/1UaXrH92IvRWyXNU8wUQ-oxvEI9kJxoxbIcbRjna7RAI/edit#gid=0

@BioQwer

BioQwer commented Nov 8, 2022

@chuwy do you have an enrichments config for the full atomic schema?

@benjben
Contributor

benjben commented Nov 16, 2022

Hi @BioQwer, which config are you referring to? FYI this issue is still on our roadmap but it has not been prioritized yet.

@BioQwer

BioQwer commented Mar 16, 2023

I work with the open source version.
I have many empty values in atomic columns.
