Data Cloud experience using Snowflake
Data Cloud Experience with Snowflake - ONE PLATFORM, MANY WORKLOADS, NO DATA SILOS
Cloud storage is a model of computer data storage in which the digital data is stored in logical pools, said to be on ‘the cloud’. The physical storage spans multiple servers (sometimes in multiple locations), and the physical environment is typically owned and managed by a hosting company.
Snowflake is one such cloud data storage provider, responsible for keeping the data available and accessible, and the physical environment secured, protected, and running.
In this article, we will take an overview of Snowflake's concepts and features. The purpose is to enable decision-making on why and when to use Snowflake as a Software-as-a-Service (SaaS) solution, based on various driving factors.
Snowflake Inc.
Snowflake Inc. is a cloud computing-based data warehousing company based in Bozeman, Montana. It was founded in July 2012 and was publicly launched in October 2014 after two years in stealth mode. The company is credited with reviving the data warehouse industry by building and perfecting a cloud-based data platform.
About Snowflake
Snowflake offers a cloud-based data storage and analytics service, generally termed Software-as-a-Service (SaaS). Snowflake enables data storage, processing, and analytics solutions that are faster, easier to use, and far more flexible than traditional offerings.
The Snowflake data platform is not built on any existing database technology or “big data” software platforms such as Hadoop. Instead, Snowflake combines a completely new SQL query engine with an innovative architecture natively designed for the cloud.
Snowflake’s cloud data platform supports multiple data workloads, from Data Warehousing and Data Lake to Data Engineering, Data Science, and Data Application development across multiple cloud providers and regions from anywhere in the organization. Snowflake’s unique architecture delivers near-unlimited storage and computing in real-time to virtually any number of concurrent users in the Data Cloud. The Data Cloud is Snowflake’s vision of a world without data silos, allowing organizations to access, share, and derive better insights from their data.
Market Adoption
Snowflake is used globally by organizations of all sizes across a broad range of industries. As of July 31, 2020, there were 3,117 customers, increasing from 1,547 customers as of July 31, 2019. As of July 31, 2020, customers included seven of the Fortune 10 and 146 of the Fortune 500, based on the 2020 Fortune 500 list.
The number of customers that contributed more than $1 million in trailing 12-month product revenue increased from 22 as of July 31, 2019, to 56 as of July 31, 2020.
How it’s different from other cloud providers
Snowflake is a true SaaS offering. More specifically:
● There is no hardware (virtual or physical) to select, install, configure, or manage.
● There is virtually no software to install, configure, or manage.
● Ongoing maintenance, management, and upgrades are handled by Snowflake.
Snowflake runs completely on cloud infrastructure. All components of Snowflake’s service (other than optional command-line clients, drivers, and connectors) run in public cloud infrastructure. Snowflake uses virtual compute instances for its compute needs and a storage service for persistent storage of data. Snowflake cannot be run on private cloud infrastructure (on-premises or hosted).
Snowflake is a single integrated system with fully independent scaling for compute, storage, and services. Unlike shared-nothing architectures that tie storage and compute together, Snowflake enables automatic, independent scaling of storage, analytics, or workgroup resources for any job, instantly and easily.
Snowflake is not a packaged software offering that can be installed by a user. Snowflake manages all aspects of software installation and updates.
Snowflake Features
The Cloud Data Platform is built on a cloud-native architecture that leverages the massive scalability and performance of the public cloud. Key elements of the Snowflake platform include:
- Diverse data types — It integrates and optimizes both structured and semi-structured data as a common data set, without sacrificing performance or flexibility (see the sketch after this list).
- Massive scalability of data volumes — It leverages the scalability and performance of the public cloud to support growing data sets without sacrificing performance.
- Multiple use cases and users simultaneously — It makes compute resources dynamically available to address the demand of as many users and use cases as needed.
- Optimized price-performance — It uses advanced optimizations to efficiently access only the data required to deliver the desired results. It delivers speed without the need for tuning or the expense of manually organizing data prior to use.
- Easy to use — It delivers instant time to value with a familiar query language and consumption-based business model, reducing hidden costs.
- Delivered as a service with no overhead — It is delivered as a service, eliminating the cost, time, and resources associated with managing the underlying infrastructure.
- Multi-cloud and multi-region — It is available on 3 major public clouds across 22 regional deployments around the world. These deployments are interconnected to create a single Cloud Data Platform, delivering a consistent, global user experience.
- Seamless and secure data sharing — It enables governed and secure sharing of live data within an organization and externally across customers and partners, generally without copying or moving the underlying data.
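As an illustration of the semi-structured support mentioned above, here is a minimal sketch of storing and querying JSON in a VARIANT column; the table name, column names, and JSON shape are hypothetical:

```sql
-- A hypothetical table holding raw JSON events in a VARIANT column.
CREATE TABLE raw_events (
    event_id NUMBER,
    payload  VARIANT  -- semi-structured data, stored natively
);

-- Insert a JSON document without defining its schema up front.
INSERT INTO raw_events
  SELECT 1, PARSE_JSON('{"user": {"id": 42, "country": "FR"}, "action": "click"}');

-- Query JSON fields with path notation and cast them to SQL types.
SELECT
    payload:user.id::NUMBER      AS user_id,
    payload:user.country::STRING AS country,
    payload:action::STRING       AS action
FROM raw_events;
```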
Snowflake Architecture
Snowflake’s architecture is a hybrid of traditional shared-disk and shared-nothing database architectures. Similar to shared-disk architectures, Snowflake uses a central data repository for persisted data accessible from all compute nodes in the platform.
Snowflake’s cloud-native architecture consists of three independently scalable layers across storage, compute, and cloud services. The storage layer ingests massive amounts and varieties of structured and semi-structured data to create a unified data record. The compute layer provides dedicated resources so that users can simultaneously access common data sets for many use cases without resource contention. The cloud services layer intelligently optimizes each use case’s performance requirements with no administration.
Snowflake’s unique architecture consists of three key layers:
- Database Storage
- Query Processing
- Cloud Services
- Database Storage
When data is loaded into Snowflake, Snowflake reorganizes that data into its internal optimized, compressed, columnar format. It stores this optimized data in cloud storage.
Snowflake manages all aspects of how this data is stored — the organization, file size, structure, compression, metadata, statistics, and other aspects of data storage are handled by Snowflake. The data objects stored by Snowflake are neither directly visible nor accessible by customers; they are accessible only through SQL query operations run using Snowflake.
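To make the storage model concrete, here is a minimal sketch of loading a CSV file through a table stage; all object names and the file path are hypothetical, and the PUT command assumes a client such as SnowSQL:

```sql
-- Hypothetical table; Snowflake converts loaded rows into its internal
-- compressed, columnar storage automatically.
CREATE TABLE customers (id NUMBER, name STRING, signup_date DATE);

-- Upload a local file to the table's built-in stage, then load it.
PUT file:///tmp/customers.csv @%customers;

COPY INTO customers
  FROM @%customers
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);
```

After the COPY, the underlying files are no longer something you interact with directly; the data is reachable only through SQL.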
- Query Processing
Query execution is performed in the processing layer. Snowflake processes queries using “virtual warehouses”. Each virtual warehouse is an MPP (massively parallel processing) compute cluster composed of multiple compute nodes allocated by Snowflake from a cloud provider.
Each virtual warehouse is an independent compute cluster that does not share compute resources with other virtual warehouses. As a result, each virtual warehouse has no impact on the performance of other virtual warehouses.
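Here is a minimal sketch of how such a warehouse is provisioned and scaled; the name and settings are illustrative:

```sql
-- Provision an isolated compute cluster.
CREATE WAREHOUSE analytics_wh
  WAREHOUSE_SIZE = 'MEDIUM'
  AUTO_SUSPEND   = 300   -- suspend after 5 idle minutes to stop compute billing
  AUTO_RESUME    = TRUE; -- resume automatically on the next query

-- Point the current session at this warehouse; other warehouses are unaffected.
USE WAREHOUSE analytics_wh;

-- Resize on demand without touching any other workload.
ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'LARGE';
```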
- Cloud Services
The cloud services layer is a collection of services that coordinate activities across Snowflake. These services tie together all of Snowflake's different components to process user requests, from login to query dispatch. The cloud services layer also runs on compute instances provisioned by Snowflake from the cloud provider.
Services managed in this layer include:
- Authentication
- Infrastructure management
- Metadata management
- Query parsing and optimization
- Access control
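Of these, access control is the service most visible to users. A minimal role-based sketch, with hypothetical role, object, and user names:

```sql
-- Grant a hypothetical analyst role read access to one schema,
-- then attach the role to a user.
CREATE ROLE analyst;
GRANT USAGE ON DATABASE sales_db      TO ROLE analyst;
GRANT USAGE ON SCHEMA sales_db.public TO ROLE analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA sales_db.public TO ROLE analyst;
GRANT ROLE analyst TO USER jane_doe;
```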
Components that make the Snowflake Data Cloud
Each component of the Snowflake Data Cloud addresses a part of the complete data solution:
Cloud Data Warehouse
- Compute isolation
- Connect existing tools and reports
- Be the center of your Business Intelligence Strategy
Cloud Data Lake
- Centralized repository to store any type of data
- Structured
- Unstructured
Data Engineering
- Build reliable data pipelines with Snowflake automation
- Streams
- Tasks
- Snowpipe
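A minimal sketch of how streams and tasks combine into an automated pipeline; the table names, warehouse, and schedule are illustrative (a Snowpipe sketch appears later, under Workloads):

```sql
-- Track row-level changes on a source table.
CREATE STREAM orders_stream ON TABLE raw_orders;

-- A scheduled task that drains the stream into a cleaned table,
-- but only when the stream actually has new data.
CREATE TASK refresh_orders
  WAREHOUSE = etl_wh
  SCHEDULE  = '5 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('orders_stream')
AS
  INSERT INTO clean_orders
  SELECT order_id, amount, CURRENT_TIMESTAMP()
  FROM orders_stream
  WHERE METADATA$ACTION = 'INSERT';

-- Tasks are created suspended; start it explicitly.
ALTER TASK refresh_orders RESUME;
```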
Data Science
- Prepare, standardize, and serve data for building models
- Feature Store
- Experiment and Coefficient History
Data Applications
- Availability of data and compute is handled for you
- Access and store your data anywhere across clouds
Data Exchange and Sharing
- Access external datasets as if they were your own, without having to move or ingest the data
- Share your data inside or outside the business, with security enforced
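A minimal sketch of how such a share is set up on the provider side; the share, database, and consumer account names are hypothetical:

```sql
-- Create a share and expose one table to it; nothing is copied.
CREATE SHARE sales_share;
GRANT USAGE  ON DATABASE sales_db            TO SHARE sales_share;
GRANT USAGE  ON SCHEMA sales_db.public       TO SHARE sales_share;
GRANT SELECT ON TABLE sales_db.public.orders TO SHARE sales_share;

-- Invite a consumer account, which then queries the live data in place.
ALTER SHARE sales_share ADD ACCOUNTS = partner_account;
```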
Workloads handled by Snowflake
All data workloads follow a common pattern and Snowflake helps simplify these with solutions tailored for each task:
Data Ingestion
- Snowpipe (see the sketch after this list)
Data Pipelines
- Streams
- Tasks
Data Analytics
- SQL
Availability
- Data Sharing
- Data Exchange
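For the ingestion workload, here is a minimal Snowpipe sketch; the pipe, stage, and table names are hypothetical:

```sql
-- Continuously load new files landing in an external stage.
CREATE PIPE orders_pipe
  AUTO_INGEST = TRUE  -- trigger loads from cloud-storage event notifications
AS
  COPY INTO raw_orders
  FROM @landing_stage/orders/
  FILE_FORMAT = (TYPE = JSON);
```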
Cost of Snowflake
Snowflake’s pricing model is based on per-second usage. The pricing calculation consists of two factors: the cost of storage and the cost of compute resources consumed.
While using Snowflake, virtual warehouses can be turned on and off as many times as required, and compute is billed only while they run. Snowflake also offers the ability to track costs for each step of the data lifecycle.
Virtual warehouses are available in eight “T-shirt” sizes: X-Small, Small, Medium, Large, and X-Large through 4X-Large. Each size has a compute credit designation, and performance improves as warehouse size increases.
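To make the credit math concrete, here is a rough, illustrative calculation (credit consumption doubles with each size step, from 1 credit per hour for X-Small up to 128 for 4X-Large, and the dollar price per credit varies by edition, cloud, and region; $3 per credit is assumed here). A Medium warehouse consumes 4 credits per hour, so running it 8 hours a day for 22 business days uses about 4 × 8 × 22 = 704 credits, or roughly $2,100 for the month, with storage billed separately per compressed terabyte. Because compute is billed per second (with a 60-second minimum each time a warehouse resumes), aggressive auto-suspend settings directly lower this figure.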
Based on credit usage, the following examples give indicative annual costs for small, medium, and large Snowflake deployments.
Small
Billing for this tier is typically between $25k and $75k per year.
The following usage assumptions have been made:
● 10–20 analytics users
● 10–20 ELT pipelines
● Under 5 TB of data
● Most work is done by analysts during business hours
● Small and Large-sized warehouses to perform ELT work
● Medium-sized warehouses for analytics
Medium
Billing for this tier is typically between $100k and $200k per year.
The following usage assumptions have been made:
● 30–50 analytics users & ELT pipelines
● Under 50 TB of data
● Most work is done by analysts during business hours
● Small and Large-sized warehouses to perform ELT work
● Medium-sized warehouses for analytics
Large
Billing for this tier is typically between $300k and $500k per year.
The following usage assumptions have been made:
● 100s-1000s of analytics users & ELT pipelines
● 100+ TB of data
● Work being done around the clock
● Small, Medium, and Large-sized warehouses to perform ELT work
● Medium and Large-sized warehouses for analytics
Data Security and Governance
- Choice of the geographic region where your data is stored.
- User authentication through standard user/password credentials.
- Enhanced authentication:
  - Multi-factor authentication (MFA).
  - Federated authentication and single sign-on (SSO).
  - OAuth.
- All communication between clients and the server is protected through TLS.
- Deployment inside a cloud platform VPC (AWS) or VNet (Azure).
- Isolation of data (for loading and unloading) using:
  - Amazon S3 policy controls.
  - Azure storage access controls.
  - Google Cloud Storage access permissions.
- Support for PHI data (in compliance with HIPAA & HITRUST CSF regulations).
- Automatic data encryption by Snowflake using Snowflake-managed keys.
- Object-level access control.
- Snowflake Time Travel (1 day standard for all accounts; additional days, up to 90, allowed with Snowflake Enterprise), shown in the sketch after this list, for:
  - Querying historical data in tables.
  - Restoring and cloning historical data in databases, schemas, and tables.
- Snowflake Fail-safe (7 days standard for all accounts) for disaster recovery of historical data.
- Column-level Security to apply masking policies to columns in tables or views.
- Row-level Security to apply row access policies to tables and views.
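A minimal sketch of the Time Travel and column-level security features referenced above; the table names, the policy, and the statement ID are hypothetical:

```sql
-- Query a table as it looked 30 minutes ago.
SELECT * FROM orders AT (OFFSET => -60 * 30);

-- Clone a table as it was just before a given (hypothetical) statement ID.
CREATE TABLE orders_restored CLONE orders
  BEFORE (STATEMENT => '01a2b3c4-0000-0000-0000-000000000000');

-- Mask an email column for everyone except a privileged role.
CREATE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
  CASE WHEN CURRENT_ROLE() = 'PII_ADMIN' THEN val ELSE '***MASKED***' END;

ALTER TABLE customers MODIFY COLUMN email
  SET MASKING POLICY email_mask;
```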
When to select Snowflake
When you deal with many consumers with different data volumes and processing needs, you usually tend toward a multi-cluster organization of your data warehouse, where each cluster is dedicated to a workload category: I/O-intensive, storage-intensive, or compute-intensive.
This design gives teams more velocity. You can decide to have one cluster for each team, for example, one for finance, one for marketing, one for product, and so on. Teams generally no longer have resource-related issues, but new kinds of problems can emerge: data freshness and consistency across clusters.
Indeed, multi-clustering involves synchronization between clusters to ensure that the same complete data is available on every cluster on time. This complicates the overall system and thus results in a loss of agility.
In most cases, thousands of queries have to run on a single cluster, so very different workloads occur concurrently:
- a Drivy fraud application frequently queries the voluminous web and mobile app tracking data to detect fraudulent devices,
- the main business reporting runs a large computation on multiple tables,
- the ETL pipeline that dumps and enriches the production database is running,
- the ETL pipeline responsible for the tracking is running,
- an exploration software extracts millions of records.
In order to improve overall performance, shorten our SLAs, and make room for every analyst who wants to sandbox a complex analysis, we were looking for a solution that would increase the capabilities of the current system without adding new struggles.
It had to ensure the following:
- ANSI SQL support and ACID transactions.
- Petabyte scale.
- A fully managed solution.
- Seamless scaling capability, ideally the ability to scale compute and storage independently.
- Cost-effectiveness.
- Sharing Data via a Cloud Network.
- Collaboration in the Data Cloud.
- Centralized Secure Data Sharing.
- Faster Data Integration.
Snowflake meets all of those requirements: it has a cloud-agnostic (Azure, AWS, or GCP) shared-data architecture and elastic on-demand virtual warehouses that all access the same data layer.
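A minimal sketch of what that architecture looks like in practice, with one warehouse per team over the same data layer; the names and sizes are illustrative:

```sql
-- One warehouse per team, all querying the same shared tables;
-- no per-cluster copies and no synchronization between them.
CREATE WAREHOUSE finance_wh   WAREHOUSE_SIZE = 'SMALL'  AUTO_SUSPEND = 60 AUTO_RESUME = TRUE;
CREATE WAREHOUSE marketing_wh WAREHOUSE_SIZE = 'MEDIUM' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE;
CREATE WAREHOUSE etl_wh       WAREHOUSE_SIZE = 'LARGE'  AUTO_SUSPEND = 60 AUTO_RESUME = TRUE;
```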
Based on the features, functionality, and differentiating factors discussed above, an organization should find it straightforward to decide whether or not to proceed with Snowflake as a data warehouse SaaS solution.
References
1. Key Concepts & Architecture — Snowflake Documentation
2. The Data Cloud For Dummies®
3. What is Snowflake? 8 Minute Demo | Snowflake Inc.
4. Snowflake Tutorial | What Is Snowflake [Complete Beginners Tutorial] — MindMajix
5. https://www.phdata.io/blog/what-is-the-snowflake-data-cloud/
6. https://www.sec.gov/Archives/edgar/data/1640147/000162828020013010/snowflakes-1.htm
7. https://community.snowflake.com/s/article/Use-Case-Why-we-ve-chosen-Snowflake-as-our-Data-Warehouse