Narrative Snapshot Documentation

Learn how Narrative utilizes snapshots to express how datasets change over time.

Introduction 

Narrative uses Apache Iceberg tables to store large amounts of data. One of Iceberg's core principles is using snapshots to store and reference a collection of files. A snapshot represents the state of a dataset at a point in time and is used to access the complete set of data files behind that dataset. Snapshots are recorded through manifest files, which contain a row for each data file in the table along with its metrics. The complete picture of a dataset is formed by combining all the data files listed in a snapshot's manifests. Every manifest also holds metadata about the data, including partition stats and data file counts. These stats are used to skip manifests that are not required for an operation. Narrative relies on these efficiencies to save time when displaying information about a dataset's statistics and retention policies.
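The manifest-skipping idea can be sketched in a few lines of Python. This is a simplified illustration, not Iceberg's or Narrative's actual implementation: the `ManifestSummary` class, its fields, and the file names are hypothetical stand-ins for the partition stats a manifest carries, and the partition value is assumed to be a calendar day.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ManifestSummary:
    """Hypothetical, simplified stand-in for the partition stats kept per manifest."""
    path: str
    min_day: date         # lowest partition value covered by the manifest
    max_day: date         # highest partition value covered by the manifest
    data_file_count: int  # number of data files the manifest lists


def manifests_to_scan(manifests, start, end):
    """Keep only manifests whose partition range overlaps [start, end]."""
    return [m for m in manifests if m.max_day >= start and m.min_day <= end]


manifests = [
    ManifestSummary("m-001.avro", date(2023, 1, 1), date(2023, 1, 31), 120),
    ManifestSummary("m-002.avro", date(2023, 2, 1), date(2023, 2, 28), 95),
    ManifestSummary("m-003.avro", date(2023, 3, 1), date(2023, 3, 31), 110),
]

# A query over mid-February only needs the second manifest.
selected = manifests_to_scan(manifests, date(2023, 2, 10), date(2023, 2, 20))
print([m.path for m in selected])  # ['m-002.avro']
```

Because the range check only reads the summary, the data files behind the skipped manifests are never opened.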

Narrative Snapshot Use Cases 

Dataset Retention Policies 

Narrative wants to prevent suppliers from having to store data that they consider useless. Dataset retention policies allow users to control the size of their datasets by defining how long they want to keep data stored in Narrative's ecosystem. Properly configured retention policies allow providers to keep storage costs down and prevent buyers from scanning unnecessarily old data.

Snapshots assist retention policies by allowing sellers to query previous versions of their dataset to understand how its size changes over time. This is useful for sellers who want to know how much data was expired and when. Storing data in snapshots also allows providers to delete data at the file level instead of the row level, which avoids the costly processing required to rewrite the entire dataset.

Users may update a dataset's retention policy one dataset at a time. For example, the following retention policy (JSON request) will expire data once it has been in the dataset for more than 30 days when sent to: https://api.narrative.io/datasets/{dataset_id}/admin/retention-policy

{
  "type": "expire_when",
  "expression": {
    "type": "snapshot_age",
    "operator": ">",
    "period": "P30D"
  }
}
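One way to read this policy is as a predicate over snapshot timestamps. The Python sketch below is only an interpretation of the request body, not Narrative's implementation: the helper names, snapshot labels, and timestamps are made up for illustration, and only the day-only period form (such as `P30D`) is handled.

```python
import re
from datetime import datetime, timedelta, timezone

# Same shape as the retention policy request body.
policy = {
    "type": "expire_when",
    "expression": {"type": "snapshot_age", "operator": ">", "period": "P30D"},
}


def parse_day_period(period):
    """Parse the day-only ISO 8601 form 'PnD' (e.g. 'P30D'); other forms are rejected."""
    match = re.fullmatch(r"P(\d+)D", period)
    if match is None:
        raise ValueError(f"unsupported period: {period!r}")
    return timedelta(days=int(match.group(1)))


def is_expired(snapshot_time, policy, now):
    """Return True when the snapshot's age exceeds the policy period."""
    expression = policy["expression"]
    if expression["type"] != "snapshot_age" or expression["operator"] != ">":
        raise ValueError("only snapshot_age with '>' is sketched here")
    return now - snapshot_time > parse_day_period(expression["period"])


# Hypothetical snapshots, evaluated against a fixed "now" so the result is repeatable.
now = datetime(2023, 6, 30, tzinfo=timezone.utc)
snapshots = {
    "snapshot-a": datetime(2023, 5, 1, tzinfo=timezone.utc),   # 60 days old
    "snapshot-b": datetime(2023, 6, 20, tzinfo=timezone.utc),  # 10 days old
}
expired = [name for name, ts in snapshots.items() if is_expired(ts, policy, now)]
print(expired)  # ['snapshot-a']
```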

The dataset's JSON response will include a subsection that looks similar to the following:

"name": "Dataset Name",
"retention_policy": {
  "type": "expire_when",
  "expression": {
    "type": "snapshot_age",
    "operator": ">",
    "period": "P30D"
  }
}

Once a dataset's retention policy is in place, buyers and sellers can call the following endpoint to understand which snapshots from their dataset are set to expire: 

https://api.narrative.io/datasets/{dataset_id}/admin/retention-policy/preview

The response will contain an array of relevant snapshots:

{
  "snapshots_to_expire": []
}

Dataset Statistics

Dataset statistics provide insights into the size and counts of a single dataset. Providers want to know how much data is in their dataset to:

  1. Audit that their ingestion is working as expected
  2. Understand how the current state of the dataset is contributing to their storage costs
  3. View insights into the distribution of values behind their dataset 

Narrative addresses all three of these use cases by pulling the history of dataset statistics through individual snapshots. The change in row counts (rows added or deleted) is captured in the delta between snapshots. For example, suppose snapshot one was captured on Tuesday and indicates that a dataset has 2000 rows, and snapshot two was captured on Wednesday and indicates that the dataset has 3000 rows. The user can infer that a net 1000 rows were added between Tuesday and Wednesday. This system of record is how Narrative calculates a user's dataset storage costs over time.
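The Tuesday/Wednesday arithmetic above can be written as a small Python sketch. The `row_deltas` helper and the `(label, total_rows)` tuples are illustrative assumptions, not part of Narrative's API.

```python
def row_deltas(snapshots):
    """Return (label, net row change since the previous snapshot) for each consecutive pair."""
    return [
        (curr_label, curr_rows - prev_rows)
        for (_, prev_rows), (curr_label, curr_rows) in zip(snapshots, snapshots[1:])
    ]


# Snapshots in capture order: (label, total rows in the dataset at that snapshot).
history = [("Tuesday", 2000), ("Wednesday", 3000)]
print(row_deltas(history))  # [('Wednesday', 1000)]
```

A negative delta would indicate that more rows were deleted than added between the two snapshots.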

Taking the union of all the files listed in the manifests of each snapshot, while filtering out the files that have been deleted, gives a holistic view of a dataset's history. Users can pull dataset statistics by calling the following endpoint: https://api.narrative.io/datasets/{dataset_id}/stats

The response will look similar to the following depending on a dataset's schema and row counts: 

{
  "dataset_id": xxx,
  "records": [
    {
      "snapshot_id": 8116371869825870170,              # snapshot_id
      "total_dataset_files": 924178,                   # total_table_files
      "total_dataset_records": 3392852797,             # total_table_records
      "snapshot_added_files": 2,                       # snapshot_added_files
      "snapshot_added_records": 203023,                # snapshot_added_records
      "snapshot_deleted_files": 0,                     # snapshot_deleted_files
      "snapshot_deleted_records": 0,                   # snapshot_deleted_records
      "active_dataset_stored_bytes": 410850194493,     # total_table_size_in_bytes
      "est_dataset_stored_files": 924178,              # estimated_total_table_files
      "snapshot_added_deleted_files": 0,               # snapshot_added_delete_files
      "snapshot_added_stored_bytes": 17048471,         # snapshot_added_size_in_bytes
      "est_dataset_stored_records": 3392852797,        # estimated_total_table_records
      "snapshot_removed_deleted_files": 0,             # snapshot_removed_delete_files
      "snapshot_removed_bytes": 0,                     # snapshot_removed_size_in_bytes
      "est_dataset_total_stored_bytes": 410850194493,  # estimated_total_table_size_in_bytes
      "columns_summary": [
        {
          "name": "unique_id",    # name
          "type": "string",       # type
          "nanValue": "NaN",      # nanValue
          "nullValue": "",        # nullValue
          "valueCount": 3000,     # valueCount
          "columnSizes": 30323,   # columnSizes
          "lower_bounds": true,   # lowerBounds
          "upper_bounds": false   # upperBounds
        }
      ]
    }
  ]
}
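As a rough illustration of how one record from this response might be consumed, the Python sketch below derives the net change a snapshot contributed. The field names and values are copied from the sample above; the `summarize` helper is hypothetical, and the `#` annotations in the sample are assumed to be documentation only, so they are dropped here.

```python
# A trimmed copy of one entry from "records" in the sample response.
record = {
    "snapshot_id": 8116371869825870170,
    "total_dataset_records": 3392852797,
    "snapshot_added_records": 203023,
    "snapshot_deleted_records": 0,
    "active_dataset_stored_bytes": 410850194493,
    "snapshot_added_stored_bytes": 17048471,
    "snapshot_removed_bytes": 0,
}


def summarize(record):
    """Derive the net record and byte change contributed by a single snapshot."""
    return {
        "snapshot_id": record["snapshot_id"],
        "net_records": record["snapshot_added_records"] - record["snapshot_deleted_records"],
        "net_bytes": record["snapshot_added_stored_bytes"] - record["snapshot_removed_bytes"],
    }


summary = summarize(record)
print(summary["net_records"])  # 203023
print(summary["net_bytes"])    # 17048471
```

Applying the same helper to every entry in `records` would give a per-snapshot view of how the dataset grew or shrank.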