Cloud technology allows start-ups and big companies alike to build complex data solutions at a fraction of the standard cost. Amazon Web Services is at the forefront of providing such solutions in the cloud, ranging from virtual machines through databases to Big Data platforms. By combining services like Redshift, S3 and Elastic MapReduce, it is possible to build data warehousing and analytical platforms on a minimal budget and with minimal resources.
In this article we look at the various Amazon Web Services components, provide their operational costs in some standard setups and then describe a sample E-Commerce Data Platform architecture. Be sure to check out the AWS Simple Monthly Calculator for the most up-to-date cost calculations.
Amazon Redshift

Redshift is a Massively Parallel Processing (MPP) data warehouse hosted in the cloud. Based on ParAccel technology, it keeps its nuts and bolts hidden from the user. It is a columnar, shared-nothing architecture designed to run queries in parallel on huge data sets, with data automatically exchanged across nodes when needed. Amazon Redshift exposes a PostgreSQL-compatible ODBC/JDBC endpoint, making it very easy to integrate with other tools. A simple web interface allows the cluster to be scaled up or down with a few clicks of a mouse. It is trivial to maintain and operate, but gives you hardly any performance-tuning capabilities.
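Because the endpoint speaks the PostgreSQL wire protocol, any standard driver can talk to a cluster. A minimal sketch in Python, assuming the psycopg2 driver – the host, credentials and the "orders" table are illustrative placeholders:

```python
# Minimal sketch: querying Redshift through its PostgreSQL-compatible endpoint.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123xyz.eu-west-1.redshift.amazonaws.com",
    port=5439,  # Redshift's default port
    dbname="analytics",
    user="analyst",
    password="...",
)
with conn.cursor() as cur:
    cur.execute("SELECT order_date, SUM(total) FROM orders GROUP BY order_date;")
    for row in cur.fetchall():
        print(row)
conn.close()
```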
From a business perspective, Amazon Redshift comes with no hardware or software costs apart from the pure AWS charges. This removes the need for a dedicated DBA/tech team to maintain your data warehouse on costly hardware. ParAccel transparently transforms SQL queries into C++ code, compiles it and executes it on the data nodes. It lacks some standard database features like indexes, triggers and stored procedures, but once you get a grip on how it works internally, it is very easy to use, with a minimal learning curve. Data can be encrypted in storage in case you worry about hypervisor exploits. Clusters are available in both US and EU regions, but be sure to consult your legal department before storing customer information in Redshift.
A data warehouse should be the central element of your Data Platform. It is where analysts will work on their day-to-day projects, prepare data for deep dives and run recurring reports. Heavy data crunching should be offloaded to computational clusters, but a well-crafted SQL query can provide a lot of business insight.
|Nodes|Parameters|Pricing model|Initial cost|Monthly cost|Full cost|
|---|---|---|---|---|---|
|One XL node|2 TB storage, 2 CPU cores, 15 GB RAM|On-Demand|N/A|$622|$7464 a year|
|One XL node|2 TB storage, 2 CPU cores, 15 GB RAM|1 yr Reserved|$2500|$157|$4384 a year|
|One 8XL node|16 TB storage, 16 CPU cores, 120 GB RAM|3 yr Reserved|$24000 for the instance + $2026 for Business support|$667|$50000 for 3 years|
Relational Database Service (RDS)
While Redshift is a great data warehousing solution, some simple queries may take unexpectedly long to complete. When you need an SQL solution for staging data, cutting through smaller data sets or delivering analytical results to business analysts, a traditional row-oriented database will usually be the better choice. Pre-crunch your data in Redshift and export the results into an operational relational database, as sketched below. This not only makes things faster, but also removes load from the central data warehouse and frees up capacity for the data scientists.
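One hedged way to implement the export half of that workflow is Redshift's UNLOAD command, which writes a query result to S3 as flat files that can then be bulk-loaded into the RDS mart. Cluster address, bucket, table and credentials below are placeholders:

```python
# Sketch: pre-aggregate in Redshift and UNLOAD the result set to S3, from
# where it can be bulk-loaded into an operational RDS database.
import psycopg2

conn = psycopg2.connect(host="my-cluster.abc123xyz.eu-west-1.redshift.amazonaws.com",
                        port=5439, dbname="analytics",
                        user="analyst", password="...")
with conn.cursor() as cur:
    cur.execute("""
        UNLOAD ('SELECT customer_id, SUM(total) FROM orders GROUP BY customer_id')
        TO 's3://my-dwh-bucket/exports/customer_totals_'
        CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'
        DELIMITER ',' ALLOWOVERWRITE;
    """)
conn.commit()
conn.close()
```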
An operational data mart can be built using Amazon RDS – a hosted database solution that gives you a fully managed RDBMS without the need to run a whole server yourself. All modern RDBMS engines are supported and come preconfigured to work on Amazon's infrastructure. You can choose from MySQL (with support for Read Replicas), PostgreSQL (with PostGIS, language extensions, Full Text Search and JSON data types) or the commercial Microsoft SQL Server and Oracle, both in "License included" and "Bring your own license" pricing models. An interesting feature of RDS is the ability to run a commercial database in a pay-by-the-hour model without owning a full (and rather expensive) license. Software updates and backups are handled automatically, and you have a choice between standard Amazon instances (with poor I/O performance) and special Provisioned IOPS instances with high I/O throughput.
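Provisioning is a single API call. A sketch with boto3, the AWS SDK for Python – the identifier and credentials are illustrative, and the size mirrors the smallest setup in the table that follows:

```python
# Sketch: creating a small RDS MySQL instance programmatically with boto3.
import boto3

rds = boto3.client("rds", region_name="eu-west-1")
rds.create_db_instance(
    DBInstanceIdentifier="ecommerce-datamart",  # placeholder name
    Engine="mysql",
    DBInstanceClass="db.m1.small",
    AllocatedStorage=20,  # GB
    MasterUsername="admin",
    MasterUserPassword="...",
)
```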
Some common setups and cost breakdowns:
|Engine|Instance|Parameters|Pricing model|Initial cost|Monthly cost|Full cost|
|---|---|---|---|---|---|---|
|MySQL|db.m1.small|20 GB storage, 1 CPU core, 1.7 GB RAM|On-Demand|N/A|$60|$720 a year|
|MySQL|db.m1.xlarge|200 GB storage, 4 CPU cores, 15 GB RAM, High I/O|On-Demand|N/A|$490|$5880 a year|
|MySQL|db.m1.small|20 GB storage, 1 CPU core, 1.7 GB RAM|1 yr Reserved|$170|$20|$410 a year|
|MySQL|db.m1.xlarge|200 GB storage, 4 CPU cores, 15 GB RAM, High I/O|1 yr Reserved|$1400|$267|$4604 a year|
|Oracle + License|db.m1.small|20 GB storage, 1 CPU core, 1.7 GB RAM|On-Demand|N/A|$105|$1260 a year|
|Oracle + License|db.m1.xlarge|200 GB storage, 4 CPU cores, 15 GB RAM, High I/O|On-Demand|N/A|$847|$10164 a year|
|Oracle + License|db.m1.small|20 GB storage, 1 CPU core, 1.7 GB RAM|1 yr Reserved|$316|$35|$736 a year|
|Oracle + License|db.m1.xlarge|200 GB storage, 4 CPU cores, 15 GB RAM, High I/O|1 yr Reserved|$2613|$376|$7125 a year|
Simple Storage Service (S3)
S3 is a redundant storage system that can be treated as a "hard drive in the cloud". It is not as flexible as Dropbox, but it is the universal way of storing and accessing data for other AWS systems. It is organized into buckets with a hierarchical key/folder structure. Files can be up to 5 TB in size and are never transmitted across AWS regions. Redundancy is implemented by copying data across multiple S3 nodes within a single region, providing high durability and high availability. Data access can be regulated on a per-user level, but I personally find it quite difficult to manage at a very granular level.
In a data warehousing infrastructure, S3 is used as a universal data store – a place to keep raw data files, e.g. flat CSV files to be bulk-loaded into Redshift or pieces of script/code for Elastic MapReduce. The service is charged based on both the storage used and the number of GET/PUT operations performed on the data. A medium setup could look like this:
|Data size|PUT reqs/mo|GET reqs/mo|Data IN/mo|Data OUT/mo|Monthly cost|Total cost|
|---|---|---|---|---|---|---|
|1 TB|100,000|500,000|60 GB|10 GB|$100|$1200 a year|
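To make the Redshift/S3 interplay concrete, here is a hedged sketch of the usual loading pattern: a raw CSV extract is pushed to S3 with boto3 and then bulk-loaded with Redshift's COPY command. Bucket, table and credential values are placeholders:

```python
# Sketch: upload a raw CSV to S3, then bulk-load it into Redshift via COPY.
import boto3
import psycopg2

s3 = boto3.client("s3")
s3.upload_file("daily_orders.csv", "my-dwh-bucket", "raw/daily_orders.csv")

conn = psycopg2.connect(host="my-cluster.abc123xyz.eu-west-1.redshift.amazonaws.com",
                        port=5439, dbname="analytics",
                        user="analyst", password="...")
with conn.cursor() as cur:
    cur.execute("""
        COPY orders
        FROM 's3://my-dwh-bucket/raw/daily_orders.csv'
        CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'
        CSV;
    """)
conn.commit()
conn.close()
```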
Elastic Compute Cloud (EC2)
This is Amazon's best-known service – virtual machines for running your servers and computations. Deployed as standalone machines or as parts of a Virtual Private Cloud (VPC), EC2 machines are the workhorse behind Amazon Web Services. Internally powered by the Xen hypervisor, machines range from small single-CPU instances to powerful 32-vCPU machines and GPGPU clusters with NVIDIA cores.
Virtual machines are ideal for running custom ETL processes or R analytical clusters. The AWS Marketplace contains many user-made AMIs (system images) with software specifically tailored for data mining and analytics. The number of possible setups is huge, so make sure you fully understand the differences between individual instance types.
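Launching such a machine takes a few lines of boto3; the AMI ID, key pair and instance type below are placeholders for whatever image you pick from the Marketplace:

```python
# Sketch: launching an EC2 worker from a chosen AMI with boto3.
import boto3

ec2 = boto3.resource("ec2", region_name="eu-west-1")
instances = ec2.create_instances(
    ImageId="ami-12345678",   # e.g. an AMI with R or your ETL stack preinstalled
    InstanceType="m1.xlarge",
    MinCount=1,
    MaxCount=1,
    KeyName="my-keypair",
)
print(instances[0].id)
```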
Elastic MapReduce (EMR)

Amazon Elastic MapReduce is the way to bring Big Data into your organization. It is a Hadoop platform running either vanilla (Amazon) Hadoop or a MapR distribution on EC2 machines, all configured transparently behind the scenes. There is no need to set up a cluster of machines or worry about network infrastructure, hardware or maintenance – Elastic MapReduce delivers all of that (and more), and you only pay for the actual computation time. Depending on the task, this can be as low as a couple of dollars per hour, and you are free to choose any cluster size, from a few machines up to hundreds of GPGPU instances. Supported Hadoop technologies include Hive, Pig, HBase, custom Java MapReduce jobs and the Hadoop Streaming API. Data can be stored on S3, native HDFS clusters or DynamoDB (Amazon's NoSQL offering), or imported from (and exported to) any other system, e.g. Redshift, through the AWS Data Pipeline.
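A hedged sketch of starting such a cluster programmatically with boto3 – the AMI version, cluster size, IAM role names and the Hive script location are illustrative placeholders:

```python
# Sketch: spin up a transient EMR cluster and run one Hive script on it.
import boto3

emr = boto3.client("emr", region_name="eu-west-1")
response = emr.run_job_flow(
    Name="clickstream-analysis",
    AmiVersion="3.11.0",  # legacy Hadoop AMI; newer clusters use ReleaseLabel instead
    Instances={
        "MasterInstanceType": "m1.xlarge",
        "SlaveInstanceType": "m1.xlarge",
        "InstanceCount": 20,
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate when the step finishes
    },
    Steps=[{
        "Name": "hive-sessions",
        "ActionOnFailure": "TERMINATE_JOB_FLOW",
        "HadoopJarStep": {
            "Jar": "s3://elasticmapreduce/libs/script-runner/script-runner.jar",
            "Args": ["s3://elasticmapreduce/libs/hive/hive-script",
                     "--run-hive-script", "--args",
                     "-f", "s3://my-dwh-bucket/scripts/sessions.hql"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```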
Standard use-cases for E-Commerce include large file transformations (transaction or application logs) and clickstream analysis. Some sample setups:
|Cluster size|Instance parameters|Usage|Monthly cost|
|---|---|---|---|
|20 x m1.small|1.7 GB RAM, 1 CPU core, 160 GB storage|1 h/day|$46|
|20 x m1.xlarge|15 GB RAM, 4 CPU cores, 1690 GB storage|2 h/day|$720|
AWS Data Pipeline

To easily move data around the various AWS services, Amazon provides yet another tool: the AWS Data Pipeline. It lets the user create data-processing workflows through a graphical UI, JSON descriptions or API calls. These workflows can be triggered manually or run on a user-defined schedule.
Data pipelines are composed of data nodes, data activities, logical preconditions and computational resources (EC2 instances or Elastic MapReduce clusters). Data sources include DynamoDB, MySQL, Redshift and S3 files. Multiple complex activities can be used in the processing workflow, including:
- copying data from one location to another,
- Elastic MapReduce job flows,
- Hive queries,
- Pig scripts,
- Redshift copy/unload operations,
- running a shell command,
- querying a database through SQL.
All these operations can be used for simple intra-AWS orchestration, and standard workflow templates are provided in the user interface. Complex ETL processes are easier to implement with custom solutions deployed on EC2 machines (e.g. Talend, which has good support for AWS).
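For the API route, a rough boto3 sketch of creating and activating a trivial pipeline – the shell command and all object names are placeholders, and a real definition would flesh out the Ec2Resource and a schedule:

```python
# Sketch: define a one-step ShellCommandActivity pipeline via the API.
import boto3

dp = boto3.client("datapipeline", region_name="eu-west-1")
pipeline_id = dp.create_pipeline(name="daily-etl", uniqueId="daily-etl-001")["pipelineId"]
dp.put_pipeline_definition(
    pipelineId=pipeline_id,
    pipelineObjects=[
        {"id": "Default", "name": "Default", "fields": [
            {"key": "scheduleType", "stringValue": "ondemand"},
            {"key": "role", "stringValue": "DataPipelineDefaultRole"},
            {"key": "resourceRole", "stringValue": "DataPipelineDefaultResourceRole"},
        ]},
        {"id": "EtlHost", "name": "EtlHost", "fields": [
            {"key": "type", "stringValue": "Ec2Resource"},
        ]},
        {"id": "RunEtl", "name": "RunEtl", "fields": [
            {"key": "type", "stringValue": "ShellCommandActivity"},
            {"key": "command", "stringValue": "/opt/etl/run_daily.sh"},  # placeholder
            {"key": "runsOn", "refValue": "EtlHost"},
        ]},
    ],
)
dp.activate_pipeline(pipelineId=pipeline_id)
```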
Amazon Glacier

Large organizations generate a lot of data which is processed and then stored in archives for future use. Such information is not accessed frequently, and AWS offers Glacier, a low-cost service for the durable storage of backups and archives. Data cannot be retrieved instantly, but the cost is lower than that of on-premises solutions.
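Archiving a processed file is a single call; a minimal sketch with boto3, assuming a vault named ecommerce-archive already exists:

```python
# Sketch: push one compressed extract into a Glacier vault for cold storage.
import boto3

glacier = boto3.client("glacier", region_name="eu-west-1")
with open("orders_2013.csv.gz", "rb") as body:
    archive = glacier.upload_archive(vaultName="ecommerce-archive", body=body)
# Keep the returned id - it is the only handle for retrieving the archive later.
print(archive["archiveId"])
```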
AWS Import/Export

A rather unexpected offering, AWS Import/Export is a service where you ship physical storage media (hard drives or USB sticks) to Amazon and have them imported into your S3 buckets. This is extremely useful when there are terabytes of data to be moved into or out of your E-Commerce Data Platform. If you do the maths, you will find that it is often faster to put the data on a hard drive and send it over than to pipe it through the Internet.
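A quick back-of-the-envelope calculation illustrates the point, assuming a dedicated 100 Mbit/s uplink and a 5 TB data set:

```python
# Rough estimate: days needed to upload 5 TB over a 100 Mbit/s link.
size_bytes = 5 * 1024**4            # 5 TB expressed in bytes
bytes_per_second = 100e6 / 8        # 100 Mbit/s in bytes per second
days = size_bytes / bytes_per_second / 86400
print(round(days, 1))               # ~5.1 days, before any protocol overhead
```

A courier, by comparison, delivers a hard drive across most of the world in a day or two.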
Amazon Kinesis

The newest addition to AWS is Amazon Kinesis. It is only available in a limited preview at the time of writing, but its operation closely resembles real-time stream-processing systems like Storm. Data is loaded into Kinesis through simple HTTP PUT requests and is then delivered to multiple data-processing applications within seconds. Records are sharded across data processors using a user-defined key, much like in other MapReduce solutions. Information is retained for 24 hours and can be read, analyzed or moved to durable storage. Typical E-Commerce scenarios include:
- Web site clickstream analysis in real-time.
- Processing marketing information into performance metrics.
- Analyzing social media interactions, e.g. tweet streams.
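On the ingestion side, each event is a single PUT. A sketch using today's boto3 Kinesis client (the API may still change while the service is in preview); the stream name and event fields are made up:

```python
# Sketch: push one clickstream event into a Kinesis stream.
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")
event = {"user_id": "u-42", "page": "/checkout", "ts": 1700000000}
kinesis.put_record(
    StreamName="clickstream",
    Data=json.dumps(event).encode(),
    PartitionKey=event["user_id"],  # the user-defined key that drives sharding
)
```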
E-Commerce Data Platform
All these Amazon Web Services components can be combined to build a robust data platform for storing, processing and analyzing E-Commerce marketing data. A sample architecture might be composed of:
- An ETL system implemented in Talend and deployed on an EC2 virtual machine.
- Raw data files stored on S3 and used to supply data warehouses, data marts and MapReduce systems with information.
- A Redshift cluster where analysts run ad-hoc queries or scheduled reports.
- A data mart implemented in RDS for data integration and tactical data delivery to reporting tools such as QlikView.
- EC2 virtual machines running analytical and statistical software for data scientists.
- An Elastic MapReduce Hadoop cluster for Big Data analytics and deep data dives, with queries performed through Pig and Hive.
- Data movement workflows and orchestration implemented in the AWS Data Pipeline.
- A historical data archive in Glacier for occasional analysis.
Initial data loads are best performed through the AWS Import/Export service or as incremental daily uploads. Once Kinesis becomes generally available, it should be included for real-time customer behavior analytics alongside the static data. A sample cost breakdown for operating a medium-sized E-Commerce Data Platform:
|Component|Service and instances|Parameters|Pricing model|Initial cost|Monthly cost|Total per year|
|---|---|---|---|---|---|---|
|ETL system|EC2, 1x m1.xlarge|15 GB RAM, 4 vCPU, 1690 GB storage|Reserved 1 yr Heavy|$1352|$82|$2335|
|Data storage|S3|2 TB, 100k PUT, 500k GET, 60 GB IN/mo, 10 GB OUT/mo|N/A|$0|$180|$2160|
|Data warehouse|Redshift, one XL node|2 TB storage, 2 vCPU, 15 GB RAM|Reserved 1 yr|$2500|$157|$4384|
|Operational data marts|RDS|850 GB storage, 2 vCPU, 7.5 GB RAM|Reserved 1 yr|$676|$84|$1672|
|Data science machines|EC2|32 GB RAM, 4 vCPU, 850 GB storage|Reserved 1 yr Heavy|$1578|$99|$2766|
|Hadoop cluster|Elastic MapReduce, 20x m1.xlarge, MapR M7, 2 h/day|15 GB RAM, 4 vCPU, 1690 GB storage|N/A|$0|$988|$11856|
Note that the above is just an estimate and does not include some additional charges, e.g. for traffic. Of course, not all companies will require the full suite of tools, so make sure you tailor the final infrastructure to your needs and budget. And remember about data security: there is no better target for industrial espionage than a complete E-Commerce Data Platform.