Thursday, October 29, 2009

Amazon adding more components/utilities as services in AWS

Amazon just launched the Relational Database Service (RDS) as a new service in its AWS cloud. Werner Vogels's blog post "Expanding the Cloud: The Amazon Relational Database Service (RDS)" gives more detail on the motivations behind its S3, EC2, SimpleDB, EBS, and now RDS offerings. We can find the business rationale for these services there.

Amazon's technical style:
In the Amazon services architecture, each service is responsible for its own data management, which means that each service team can pick exactly those solutions that are ideally suited for the particular application they are implementing. It allows them to tailor the data management system so that they get maximum reliability and guaranteed performance at the right cost as the system scales up.

So, toward the goals of scalability, reliability, performance, and cost-effectiveness:
- Key-value storage solutions are widely needed, which led to the creation of S3.
- Simple structured data management without complex transactions, relations, or a rigid schema is widely needed, which led to the creation of SimpleDB.
- Amazon then found that many applications running on EC2 instances want to use an RDBMS, so EBS was launched to provide scalable and reliable storage volumes for persisting the databases.
- To free users to focus on their applications and business, RDS is now ready. Users need not maintain their DB or consider how to scale it.
- Another case is the AWS Elastic MapReduce service.

RDS is essentially MySQL hosted in the AWS cloud.

Now, AWS users have three ways to run a database:
(1) RDS
(2) EC2 AMI+EBS
(3) SimpleDB (no relational model)
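Since RDS speaks the native MySQL protocol, option (1) needs no AWS-specific client code; an application points a standard MySQL client at the RDS endpoint. A minimal sketch of the idea (the endpoint, database, and credentials below are hypothetical examples, not real values):

```python
# Sketch: connecting to RDS looks just like connecting to any MySQL server;
# only the hostname (the RDS "endpoint") is AWS-specific.

def rds_connection_params(endpoint, database, user, password, port=3306):
    """Build the parameter dict a standard MySQL client library would accept."""
    return {
        "host": endpoint,    # e.g. mydb.abc123.us-east-1.rds.amazonaws.com
        "port": port,        # MySQL's default port; RDS keeps it
        "user": user,
        "password": password,
        "database": database,
    }

params = rds_connection_params(
    "mydb.abc123.us-east-1.rds.amazonaws.com", "shop", "admin", "secret")
print(params["host"])
```

Existing MySQL applications can therefore move to RDS by changing only the hostname, which is the point of the service.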

We can re-read the good paper "Amazon Cloud Architecture" to better understand the strategy of AWS.

References:
[1] Amazon Relational Database Service (RDS): http://aws.amazon.com/rds/
[2] Werner Vogels's blog: http://www.allthingsdistributed.com/2009/10/amazon_relational_database_service.html
[3] Amazon Cloud Architecture: http://jineshvaria.s3.amazonaws.com/public/cloudarchitectures-varia.pdf

Friday, October 2, 2009

The Integration of Analytic DBMS and Hadoop

Recently, two well-known vendors of analytic DBMSs, Vertica and Aster Data, announced integrations with Hadoop. Analytic DBMSs and Hadoop address distinct but complementary problems in managing large data.

Vertica:

Currently it is a light integration:
  • ETL, ELT, data cleansing, data mining, etc.
  • Moving data between Hadoop and Vertica.
  • InputFormat (InputSplit, VerticaRecord, push down relational map operations by parameterizing the database query).
  • OutputFormat (to existing or create a new table).
  • Easy for Hadoop developers to push down Map operations to Vertica databases in parallel by specifying parameterized queries, which yield pre-aggregated data for each mapper.
  • Supports the Hadoop streaming interface.
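Streaming support means Vertica-bound Hadoop jobs can still be written as plain stdin/stdout scripts rather than Java classes. A minimal word-count-style streaming mapper sketch (the tab-separated key/value convention is Hadoop streaming's default; the script itself is a generic example, not Vertica-specific code):

```python
import sys

def map_line(line):
    """Emit one 'word<TAB>1' record per word, Hadoop streaming's key/value form."""
    return ["%s\t1" % word for word in line.strip().split()]

if __name__ == "__main__":
    # Hadoop streaming feeds input splits on stdin and collects stdout.
    for line in sys.stdin:
        for record in map_line(line):
            print(record)
```

A matching reducer would read the sorted `key<TAB>value` stream on stdin and sum the counts per key.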

Typical usages:

(1) Raw data -> Hadoop (ETL) -> Vertica (for fast ad-hoc queries, near real time)
(2) Vertica -> Hadoop (ETL) -> Vertica (for fast ad-hoc queries, near real time)
(3) Vertica -> Hadoop (sophisticated queries for analysis or mining)

We can expect to see tighter integration and higher performance.

References
[1] The Scoop on Hadoop and Vertica: http://databasecolumn.vertica.com/2009/09/the_scoop_on_hadoop_and_vertic.html
[2] Using Vertica as a Structured Data Repository for Apache Hadoop: http://www.vertica.com/MapReduce
[3] Cloudera DBInputFormat interface: http://www.cloudera.com/blog/2009/03/06/database-access-with-hadoop/
[4] Managing Big Data with Hadoop and Vertica: http://www.vertica.com/resourcelogin?type=pdf&item=ManagingBigDatawithHadoopandVertica.pdf

AsterData:

Aster Data already provides in-database MapReduce.

The new Aster-Hadoop Data Connector utilizes Aster's patent-pending SQL-MapReduce capabilities for two-way, high-speed data transfer between Apache Hadoop and Aster Data's massively parallel data warehouse.
  • Do ETL processing or data mining in Hadoop, then pull that data into Aster for interactive queries or ad-hoc analytics at massive data scales.
  • The Connector utilizes key new SQL-MapReduce functions to provide ultra-fast, two-way data loading between HDFS (Hadoop Distributed File System) and Aster Data’s MPP Database.
  • Parallel loader.
  • LoadFromHadoop: Parallel data loading from HDFS to Aster nCluster.
  • LoadToHadoop: Parallel data loading from Aster nCluster to HDFS.
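The two loader directions can be pictured as SQL-MapReduce function calls issued from the database side. LoadFromHadoop and LoadToHadoop are the names Aster announced; the invocation shapes below are purely illustrative string builders, not Aster's actual SQL syntax:

```python
# Hypothetical sketches of the connector's two directions. Only the function
# names come from Aster's announcement; argument syntax is made up here.

def load_from_hadoop_sql(target_table, hdfs_path):
    """HDFS -> Aster nCluster: workers pull their shares of the file in parallel."""
    return ("INSERT INTO %s SELECT * FROM LoadFromHadoop(hdfs_path('%s'))"
            % (target_table, hdfs_path))

def load_to_hadoop_sql(source_table, hdfs_path):
    """Aster nCluster -> HDFS: export a table's rows to Hadoop in parallel."""
    return ("SELECT * FROM LoadToHadoop(ON %s hdfs_path('%s'))"
            % (source_table, hdfs_path))
```

Because the load runs as a SQL statement inside the database, it inherits Aster's transactional guarantees, which is how the "data load as a transaction" property described below is achieved.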

Key advantages of Aster’s Hadoop Connector include:
  • High-performance: Fast, parallel data transfer between Hadoop and Aster nCluster.
  • Ease-of-use: Analysts can now seamlessly invoke a SQL command for ultra-simple import of Hadoop-MapReduce jobs, for deeper data analysis. Aster intelligently and automatically parallelizes the load.
  • Data Consistency: Aster Data's data integrity and transactional consistency capabilities treat the data load as a 'transaction', ensuring that the data load or export is always consistent and can be carried out while other queries are running in parallel in Aster.
  • Extensibility: Customers can easily further extend the Connector using SQL-MapReduce, to provide further customization for their specific environment.

The typical usages are similar to Vertica's.

References
[1] Aster Data Announces Seamless Connectivity With Hadoop: http://www.nearshorejournal.com/2009/10/aster-data-announces-seamless-connectivity-with-hadoop/ and http://www.asterdata.com/news/091001-Aster-Hadoop-connector.php
[2] DBMS2 - MapReduce tidbits: http://www.dbms2.com/2009/10/01/mapreduce-tidbits/#more-983
[3] Aster Data Blog: Aster Data Seamlessly Connects to Hadoop: http://www.asterdata.com/blog/index.php/2009/10/05/aster-data-seamlessly-connects-to-hadoop/

Another case of integrating an analytic DBMS with Hadoop is the HadoopDB project: http://db.cs.yale.edu/hadoopdb/hadoopdb.html