
Comparing Data Sharing and Data Flow

Data Flow and Data Sharing are both export products, but they differ in the type of data they offer. This article digs deeper into what makes the two products fundamentally different in terms of the data available.

Nature of the Data

  • Data Flow delivers real-time data that is “raw”: only the basic properties are included, meaning the data originates from tag parameters and is provided by our streaming pipeline. It exports only the validated event properties that are stored and processed directly.

  • Data Sharing gives you access to Piano Analytics' underlying Snowflake database. As you know, it exposes two key tables:
    - The stream.events table that holds the raw real-time event data (similar in nature to what you would see in a Data Flow export).
    - The data.events table that is built from an overnight reprocessing of the raw events, and contains all the historical data.

This article will focus on Data Sharing, because we apply additional business logic, enrichment, and calculations there that differentiate it from Data Flow and create additional value.

Calculated/Enriched properties for Data Sharing

At Piano Analytics, we have implemented a complementary mechanism to ensure your event data (historical data) is complete and reliable, even when some information arrives late or out of order. Here is how it works:

  • 1) The real-time streaming module

As users navigate the site, events can be linked to different IDs (pageview_id, visit_id, etc.).
This module temporarily stores key properties linked to the same ID (in cache memory) so it can enrich every event carrying that ID with the values of those properties. The module then acts as a scanner for each event:
- If an event provides a new property linked to the ID, it’s saved and used to enrich future events in that visit.
- If an event is missing a property value, the module can fill in the gap using the cached data.

This ensures richer data in real time, even when events arrive partially or out of sequence. This data is available in the stream.events table of Data Sharing, and the same module also produces the real-time data sent in Data Flow files.
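The cache-based enrichment described above can be sketched as follows. This is a minimal illustration, not Piano Analytics code; the function name, event fields (`visit_id`, `campaign`, `page`), and cache layout are all assumptions.

```python
# Illustrative sketch of cache-based real-time enrichment (not Piano code).
# Events sharing a visit_id inherit property values already seen in that
# visit; in this real-time pass, later events cannot back-fill earlier ones
# (that is the job of the overnight reprocessing module).

def enrich_stream(events):
    cache = {}  # visit_id -> {property: last seen non-empty value}
    enriched = []
    for event in events:
        props = cache.setdefault(event["visit_id"], {})
        present = {k: v for k, v in event.items() if v is not None}
        # Fill the event's gaps from the cache; the event's own values win.
        enriched.append({**props, **present})
        # Save any new property values for future events in this visit.
        props.update(present)
    return enriched

stream = [
    {"visit_id": "v1", "campaign": "spring_sale", "page": "home"},
    {"visit_id": "v1", "campaign": None, "page": "product"},  # missing campaign
]
result = enrich_stream(stream)
# The second event inherits the cached campaign value.
```

Note that the first event of a visit gets nothing from the cache; only subsequent events benefit, which is why a purely real-time view can still contain gaps.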

  • 2) End-of-the-day processing module

To provide maximum data quality, we run an additional processing pass over all the streamed events.
- The module looks at all events tied to an ID and sorts them by timestamp.
- It recreates the complete history in the data.events table, copying missing information onto all the events concerned, even those collected before we received the new value to copy.

This guarantees accurate, complete datasets without the need for manual cleanup or completion.
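The overnight back-fill can be sketched like this, under assumed field names (`timestamp`, `user_segment`). Unlike the real-time pass, a value observed late in the visit is copied onto events collected earlier.

```python
# Illustrative sketch of the end-of-day reprocessing (not Piano code):
# sort a visit's events by timestamp, consolidate every property value
# observed anywhere in the visit, and copy the consolidated values onto
# all events -- including those collected before the value first appeared.

def reprocess_visit(events):
    ordered = sorted(events, key=lambda e: e["timestamp"])
    known = {}
    for event in ordered:
        known.update({k: v for k, v in event.items() if v is not None})
    # 'known' now holds the last observed value of each property.
    # Each event keeps its own non-empty values; gaps are back-filled.
    return [{**known, **{k: v for k, v in e.items() if v is not None}}
            for e in ordered]

history = reprocess_visit([
    {"timestamp": 2, "visit_id": "v1", "user_segment": "premium", "page": "checkout"},
    {"timestamp": 1, "visit_id": "v1", "user_segment": None, "page": "home"},
])
# The earlier event (timestamp 1) is back-filled with user_segment "premium".
```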

To be clear, this data is available in the data.events table of Data Sharing, which holds all your historical data starting at D-1.

As a result, the data.events table provides you with calculated or enhanced properties and events that are not available in our Data Flow service. Here is the list of the properties available in data.events but not in Data Flow:

Data Sharing - Differences between stream.events & data.events 08.12.22.xlsx

Some interesting properties available thanks to the data.events table:

  • Visit Duration: computed by aggregating events over a visit.

  • Entry and Exit pages: combine information from multiple events to define the visit scope.

  • Visit Page views: the number of pages viewed during each visit.

  • Page duration: calculated after the full visit concludes, giving the inter-page durations.
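To see why these pre-computed properties save effort, here is a hedged sketch of what recreating them from raw page-view events would involve. Field names (`timestamp` in seconds, `page`) are illustrative assumptions.

```python
# Illustrative sketch (not Piano code) of recomputing visit-level
# properties from the raw page-view events of a single visit.

def visit_metrics(page_events):
    ordered = sorted(page_events, key=lambda e: e["timestamp"])
    timestamps = [e["timestamp"] for e in ordered]
    return {
        "visit_duration": timestamps[-1] - timestamps[0],
        "entry_page": ordered[0]["page"],
        "exit_page": ordered[-1]["page"],
        "visit_page_views": len(ordered),
        # Inter-page durations: time spent on each page before the next
        # page view; the exit page's duration is unknown from events alone.
        "page_durations": [t2 - t1 for t1, t2 in zip(timestamps, timestamps[1:])],
    }

metrics = visit_metrics([
    {"timestamp": 0, "page": "home"},
    {"timestamp": 40, "page": "product"},
    {"timestamp": 100, "page": "checkout"},
])
# metrics["visit_duration"] -> 100; metrics["page_durations"] -> [40, 60]
```

Doing this at scale also means first regrouping every event of every visit, which is exactly the work the overnight reprocessing already performs for you.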

Choosing between Data Flow and Data Sharing is first of all a matter of use case. If your analysis or reporting workflow depends on these additional calculations and enrichments, Data Sharing (especially reading from the fully reprocessed data.events table) provides a more comprehensive set of data and saves you precious time by not having to recreate the calculations on your own.

Use-cases for Data Sharing

Whether clients use Data Sharing in Pay Per Use (PPU) or Deliver To Your Database (DTYD) mode, their usage supports three primary scenarios:

  1. Ad hoc data querying

  2. To power BI and reporting

  3. Data export/copy for external storage or processing

Use-case 1: Ad hoc data querying

The client performs SQL queries in the Snowflake user interface to answer specific business questions.
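As an example of the kind of ad hoc question this covers, the query below counts visits per device over the last week. The column names (`event_time`, `device_type`, `visit_id`) are assumptions; check your actual schema in Snowflake before running it.

```python
# Hypothetical ad hoc query against the shared data.events table.
# Column names are illustrative, not the guaranteed Piano Analytics schema.

DAILY_VISITS_BY_DEVICE = """
SELECT
    DATE(event_time)         AS day,
    device_type,
    COUNT(DISTINCT visit_id) AS visits
FROM data.events
WHERE event_time >= DATEADD(day, -7, CURRENT_DATE)
GROUP BY day, device_type
ORDER BY day, visits DESC
"""
# Paste this into a Snowflake worksheet, or run it through a connector
# such as snowflake-connector-python.
```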

Use-case 2: To power BI and reporting

The client queries the shared tables (stream.events or data.events) via Snowflake in real time (or near real time), without moving or duplicating data. These simple or complex SQL queries are used to:

  • Generate BI reports (Power BI, Tableau, Looker, Qlik, etc.)

  • Feed customized dashboards

  • Build ad hoc exports for advanced analysis

Please find the list of Snowflake third party integrations here: https://analytics-docs.piano.io/en/analytics/v1/3rd-party-connectors

Benefits

  • Real-time access to the latest data

  • No data duplication

  • Centralized data within Snowflake

Use-case 3: Data export/copy for external storage

Clients replicate or export all (or a subset) of their Piano Analytics data to external cloud storage or data warehouses (AWS S3/Redshift, Google Cloud Storage/BigQuery, Azure Blob/Synapse) for custom processing, data lakes, or machine learning workloads.
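One common way to drive such an export is Snowflake's `COPY INTO <location>` statement, sketched below. The stage name (`@my_s3_stage`) and the `event_time` column are assumptions; adapt them to your own external stage and schema.

```python
# Hypothetical export of yesterday's events to external cloud storage via
# a Snowflake external stage. Stage and column names are illustrative.

EXPORT_YESTERDAY = """
COPY INTO @my_s3_stage/piano_events/
FROM (
    SELECT *
    FROM data.events
    WHERE DATE(event_time) = DATEADD(day, -1, CURRENT_DATE)
)
FILE_FORMAT = (TYPE = PARQUET)
OVERWRITE = TRUE
"""
# Scheduled daily (e.g. with a Snowflake task or an external orchestrator),
# this keeps a Parquet copy of the fully reprocessed data in your data lake.
```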

Benefits

  • Seamless integration with existing data lakes

  • Support for large-scale processing

  • Cost-effective long-term storage and archiving

What about Data Flow?

As you have probably understood, Data Flow does not offer this diversity of practices. Files containing your daily data are dropped into a bucket every 15, 30, or 60 minutes. To get a complete view of a visit, the events of the day need to be compared and matched to recreate your visitors' journeys, which requires resources and time.
Besides, your “historical data” only starts accumulating from the moment you subscribe to the option. However, if you really need your historical data (up to 25 months of retention), we have options for you. Please contact your Account Manager.
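The matching work that Data Flow leaves to the client can be sketched as follows: before any visit-level metric can be computed, a day's file drops must at minimum be regrouped by visit and re-ordered. Field names (`visit_id`, `timestamp`, `page`) are assumptions about the exported properties.

```python
# Illustrative sketch (not Piano code) of the client-side stitching that
# Data Flow implies: group a day's raw events by visit_id and order them
# by timestamp to rebuild each visitor's journey.

from collections import defaultdict

def rebuild_journeys(daily_events):
    journeys = defaultdict(list)
    for event in daily_events:
        journeys[event["visit_id"]].append(event)
    return {vid: sorted(evts, key=lambda e: e["timestamp"])
            for vid, evts in journeys.items()}

journeys = rebuild_journeys([
    {"visit_id": "v2", "timestamp": 5, "page": "home"},
    {"visit_id": "v1", "timestamp": 3, "page": "pricing"},
    {"visit_id": "v1", "timestamp": 1, "page": "home"},
])
# journeys["v1"] is now ordered: home (t=1) then pricing (t=3).
```

Only after this step could the visit-level properties that data.events ships ready-made (duration, entry/exit pages, page views) be recomputed.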
