
Releases: Parsely/parsely_raw_data

v2.4.0: Redshift Data Warehouse and ETL using dbt

13 Nov 19:58

This release adds a new dbt folder containing the code to run incremental ETL processes into a data warehouse in Redshift.

v2.3.0 Release: New columns and Apple News Real-Time data

18 Sep 17:30
303ae23

We're making improvements to the Parse.ly Data Pipeline! We listened to your feedback and have incorporated it into release 2.3.0, scheduled to go live in the next month.

There are parts of this new version that are not backwards compatible and could cause breaking changes depending on how your queries are configured. **To prepare for the release and update your queries**, please see all details in our release post.

Prepare for the release

What's changing?

We will be adding new data as well as new columns, so you will likely see an increase in the file sizes and number of rows. The updated schema documentation is available here.

Changed columns:

  • flags_is_amp: This was an undocumented field that has existed for a few years; however, it was typically populated as null instead of false. It will now be of type Boolean and always either true or false.

Additional columns:

  • Any column that does not appear for an event will receive a null value: we look at the schema we have defined and auto-populate missing columns with null. As a result, all columns will be available for every event, so each event will have 122 columns, with null values populated where appropriate.
  • channel: The Parse.ly-defined channel the event came in on. Can be strings like fbia (Facebook Instant Articles), apln-rta (Apple News), amp (Accelerated Mobile Pages), or website. If we add a new channel, that value will appear here. (Note that your Parse.ly account must be integrated with AMP, Facebook Instant, or Apple News for those values to appear.)
  • pageview_id: A unique identifier for the pageview associated with an event. This will remain consistent for all events for a given pageview. This allows for sophisticated aggregation in your data warehouse such as correlating all heartbeat events for each pageview event. This will either be null, or a long integer like 17542680 and will always appear.
  • pageload_id: A unique identifier for the pageload of an event. This is useful for single-page apps where there may be multiple calls to trackPageview for a single page load. This will either be null, or a long integer like 17542680 and will always appear.
  • videostart_id: A unique identifier for a given videostart event, allowing you to correlate other events, like vheartbeat to their originating videostart. This will either be null, or a long integer like 17542680 and will always appear.
  • schema_version: This new field indicates the matching schema version from PyPI and the parsely_raw_data repo; for this release it will be 2.3.0. The other versions on PyPI can be found here: https://pypi.org/project/parsely_raw_data/. We will also version our documentation by schema version, enabling back-referencing of older versions for historical data.
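The null-filling behavior described above can be sketched in Python. This is a simplified illustration, not code from parsely_raw_data: SCHEMA_COLUMNS stands in for the full 122-column schema, and normalize_event is a hypothetical helper.

```python
# Simplified sketch of the null-filling behavior: every event is padded to the
# full schema, with missing columns set to None (null). SCHEMA_COLUMNS stands
# in for the real 122-column schema; only a few columns are shown here.
SCHEMA_COLUMNS = [
    "action", "channel", "pageview_id", "pageload_id",
    "videostart_id", "schema_version", "flags_is_amp",
]

def normalize_event(raw_event):
    """Return the event with every schema column present, null-filled."""
    return {col: raw_event.get(col) for col in SCHEMA_COLUMNS}

event = normalize_event({"action": "pageview", "channel": "apln-rta"})
# event["pageview_id"] is None; event["channel"] == "apln-rta"
```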

Additional data:

  1. Parse.ly now supports a real-time Apple News integration, available in the dashboard and the data pipeline. Please reach out to your account manager for more info.
    1. Once your Apple News integration is complete, you can expect the following:

      1. apln-rta will appear as a value in the new channel column.
      2. For the Apple News events, you’ll also find some of the original data from Apple in the extra_data object under the apln key. This is a passthrough of what we receive from the Apple News API and contains fields specific to that platform. For more details on the values in extra_data and how to filter for Apple News Real-Time Data, see our post here.
    2. If using SQL with Parse.ly data, you'll likely want to GROUP BY the channel field to do metric calculations. This is because certain metrics—like unique visitor counts—only make sense on single channels.
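The per-channel grouping advice can also be sketched in Python over the raw JSON-lines events. This is a hypothetical helper, not part of parsely_raw_data; the channel and visitor_site_id field names follow the pipeline schema, and the file path is whatever you pass in.

```python
import json
from collections import defaultdict

# Hypothetical sketch of the GROUP BY advice above: compute unique visitors
# per channel from a file of JSON-lines pipeline events.
def unique_visitors_by_channel(path):
    visitors = defaultdict(set)
    with open(path) as f:
        for line in f:
            event = json.loads(line)
            # Unique-visitor counts only make sense within a single channel.
            channel = event.get("channel") or "website"
            visitors[channel].add(event.get("visitor_site_id"))
    return {channel: len(ids) for channel, ids in visitors.items()}
```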

2.2.5

01 Jun 15:23

Add oauth2client to requirements, since we use it for talking to BigQuery.

2.2.4

24 Jan 19:58

Unpin boto3 requirement to prevent conflicts when this module is used as a library.

2.2.3

08 Nov 14:54

This tiny release just fixes a Python 3 compatibility issue with raise syntax.

2.2.2

04 Oct 20:40

New Features

  • Option to preserve extra_data JSON fields in Redshift as VARCHAR
  • Documentation updates
  • Console scripts for schemas, Redshift access, BigQuery access, etc. (all documented now)
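When extra_data is preserved as VARCHAR in Redshift, each row holds a JSON string that can be decoded client-side after fetching. A minimal sketch, with an illustrative sample value:

```python
import json

# Sketch: an extra_data column kept as VARCHAR arrives as a JSON string per
# row. The sample value here is illustrative, not real pipeline output.
row_extra_data = '{"apln": {"shares": 3}}'
extra = json.loads(row_extra_data) if row_extra_data else {}
apln = extra.get("apln", {})  # platform-specific fields, when present
```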

Bugfixes

  • install_requires now includes all relevant dependencies, so setup.py can install the package and all its requirements together.
  • Move to psycopg2cffi-compat to make use on PyPy easier

Add event_id to schema

17 Mar 16:11
  • Add unique event_id hash values on data pipeline events to the public schema documentation

bugfix

09 Mar 21:03

Remove more Event imports.

bugfix - remove Event class entirely

09 Mar 20:51

Removes a lingering import of the Event class in the init file.

2.0.0

09 Mar 20:28
  • remove Event class

All items passed through the functions are now simple JSON lines.

  • add new fields to schema:
    campaign_id
    flags_is_amp
    utm_campaign
    utm_term
    utm_medium
    utm_source
    utm_content
    ip_market_nielsen
    ip_market_name
    ip_market_doubleclick

These new fields are enrichments/additions to data pipeline pixels.
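Since records are now plain JSON lines rather than Event objects, the new fields are ordinary dict keys once parsed. A minimal sketch with an illustrative sample line:

```python
import json

# Minimal sketch: each pipeline record is a plain JSON line (the Event class
# is gone), so the new enrichment fields are ordinary dict keys.
line = '{"action": "pageview", "utm_source": "newsletter", "campaign_id": "spring"}'
event = json.loads(line)
source = event.get("utm_source")       # "newsletter"
market = event.get("ip_market_name")   # None when the enrichment is absent
```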