Open Source (OSS) frameworks have improved the quality of Big Data processing with their diverse set of tools addressing numerous use cases.

In fact, if you are a part of a team working on building a modern data architecture, chances are high you are using an open-source stack.

Similarly, Cloud Computing has enabled scalable and cost-effective Big Data solutions in the analytics space.

Open Source and Cloud: The Correlation

In the cloud ecosystem, many commercially available cloud services relate to an OSS framework in one of three ways:

Similar to an OSS ➡ Offers similar features. (Eg: AWS Step Functions and Apache Airflow)
Modeled after an OSS ➡ Follows/inherits the design principles of an existing open-source framework. (Eg: AWS Kinesis and Apache Kafka)
Managed service of an OSS ➡ Takes care of deploying and maintaining the OSS framework so that it is ready to use. (Eg: AWS RDS Postgres and PostgreSQL)
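To make the three relationships concrete, here is a minimal Python sketch that encodes them together with the examples given above. The names `OssRelation`, `EXAMPLES`, `runs_oss_framework` and `describe` are hypothetical, chosen only for illustration; this is one possible way to model the idea, not an official taxonomy.

```python
from enum import Enum


class OssRelation(Enum):
    """The three relationships described above (illustrative names)."""
    SIMILAR_TO = "Similar to"
    MODELED_AFTER = "Modeled after"
    MANAGED_SERVICE_OF = "Managed service of"


# (AWS service, OSS framework, relationship), taken from the examples above.
EXAMPLES = [
    ("AWS Step Functions", "Apache Airflow", OssRelation.SIMILAR_TO),
    ("AWS Kinesis", "Apache Kafka", OssRelation.MODELED_AFTER),
    ("AWS RDS Postgres", "PostgreSQL", OssRelation.MANAGED_SERVICE_OF),
]


def runs_oss_framework(relation: OssRelation) -> bool:
    """Assumption for this sketch: only a managed service deploys the OSS framework itself."""
    return relation is OssRelation.MANAGED_SERVICE_OF


def describe(service: str, oss: str, relation: OssRelation) -> str:
    """Render one example as a readable sentence fragment."""
    return f"{service} -> {relation.value} {oss}"


for example in EXAMPLES:
    print(describe(*example))
```

The distinction matters in practice: with a managed service you keep using the OSS framework's own interfaces, while the other two categories give you a different product with comparable features or design.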

To understand more, let's touch upon the basics...

Getting to know the cloud

When getting to know cloud services, the first hurdle for many of us is figuring out where to start among the plethora of services available.

So, for ease of understanding, and irrespective of the cloud provider (AWS, Azure, GCP, etc.), let's group the Big Data related cloud services into five stages: data ingestion, data storage, data processing, data analysis and visualization, and deployment.

Now, let's try to understand the cloud ecosystem by comparing AWS services with their equivalent open-source frameworks. (A similar comparison can be drawn for Azure and GCP as well.)

📍 Data Ingestion:

| AWS Service | What it does | Relation with OSS | OSS Alternative |
|---|---|---|---|
| Kinesis | Stream processing | Modeled after | Apache Kafka |
| SQS | Message queue | Similar to | RabbitMQ |
| Managed Streaming for Kafka (MSK) | Stream processing | Managed service of | Apache Kafka |

📍 Data Storage:

| AWS Service | What it does | Relation with OSS | OSS Alternative |
|---|---|---|---|
| S3 | Object store | Similar to | Minio, Swift, Ceph, ... |
| RDS | Relational database | Managed service of | MariaDB, MySQL, Postgres |
| DynamoDB | NoSQL database | Similar to | Apache Cassandra |
| ElastiCache | In-memory cache | Managed service of | Memcached, Redis |
| Neptune | Graph database | Similar to | Neo4j |
| Amazon QLDB | Ledger database | Modeled after | Hyperledger |
| Amazon DocumentDB | Document database | Similar to | MongoDB |
| AWS Lake Formation | Data lake | Similar to | HDFS |
| EC2 EBS | Block storage for EC2 | Similar to | OpenEBS, Portworx |

📍 Data Processing:

| AWS Service | What it does | Relation with OSS | OSS Alternative |
|---|---|---|---|
| Elastic MapReduce (EMR) | Hadoop cluster computing | Managed service of | Hadoop |
| Step Functions | Workflow orchestrator | Similar to | Apache Airflow, Flyte |
| AWS Glue | ETL | Managed service of | Apache Spark |
| Lambda | Serverless | Similar to | Knative, OpenFaaS, Fn |
| Batch | Batch job computing | Similar to | Apache Airflow on Kubernetes |

📍 Data Analysis & Visualization:

| AWS Service | What it does | Relation with OSS | OSS Alternative |
|---|---|---|---|
| Amazon Redshift | Data warehousing | Similar to | Spark SQL, Apache Hive, Presto |
| Athena | Data warehousing | Similar to | Spark SQL, Apache Hive, Presto |
| CloudSearch | Search | Similar to | Elasticsearch |
| Elasticsearch Service | Search | Managed service of | Elasticsearch |
| QuickSight | Business analytics | Similar to | PowerBI |

📍 Deployment:

| AWS Service | What it does | Relation with OSS | OSS Alternative |
|---|---|---|---|
| Elastic Container Registry (ECR) | Container registry | Managed service of | Docker Registry, Quay |
| Elastic Container Service (ECS) | Container orchestration | Similar to | Kubernetes, Marathon |
| Elastic Kubernetes Service (EKS) | Container orchestration | Managed service of | Kubernetes |
| CloudFormation | Infrastructure as code | Similar to | Terraform |
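The tables above can also be read in the other direction: starting from an OSS framework you already know and asking which AWS services relate to it. The following is a rough Python sketch of that lookup over a handful of rows copied from the tables. `ROWS`, `index_by_oss` and `BY_OSS` are hypothetical names used only for this illustration, and the row subset is arbitrary.

```python
from collections import defaultdict

# (stage, AWS service, relation, OSS alternatives): a subset of the rows above.
ROWS = [
    ("Data Ingestion", "Kinesis", "Modeled after", ["Apache Kafka"]),
    ("Data Ingestion", "SQS", "Similar to", ["RabbitMQ"]),
    ("Data Ingestion", "MSK", "Managed service of", ["Apache Kafka"]),
    ("Data Storage", "RDS", "Managed service of", ["MariaDB", "MySQL", "Postgres"]),
    ("Data Storage", "ElastiCache", "Managed service of", ["Memcached", "Redis"]),
    ("Data Processing", "Step Functions", "Similar to", ["Apache Airflow", "Flyte"]),
    ("Deployment", "EKS", "Managed service of", ["Kubernetes"]),
]


def index_by_oss(rows):
    """Build a reverse index: OSS framework -> [(AWS service, relation), ...]."""
    index = defaultdict(list)
    for _stage, service, relation, alternatives in rows:
        for oss in alternatives:
            index[oss].append((service, relation))
    return dict(index)


BY_OSS = index_by_oss(ROWS)

# Which AWS services relate to Apache Kafka, and how?
for service, relation in BY_OSS["Apache Kafka"]:
    print(f"{service}: {relation} Apache Kafka")
```

For a team already running Kafka, a lookup like this surfaces both options at once: a managed version of the same framework (MSK) and a service that follows a similar design (Kinesis).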

Some notable cloud adoption figures with respect to Big Data:

- To date, AWS users have launched more than 15 million Hadoop clusters (EMR / containerized versions).
- "Container-as-a-service" (EKS, ECS) and "database-as-a-service" (RDS, DynamoDB) are the most commonly used managed services in 2020.
- Database services usage is up 127% year over year.

Next Steps...

  1. You can see how these services are put to use in real-world use cases in this article.
  2. This whitepaper from AWS on Big Data is a good place to understand its services.
  3. And start getting hands-on by following this repo.

Going forward, I'll publish detailed posts on tools and frameworks used by Data Engineers day in and day out.

Follow for updates.