Sharding is a technique for horizontally partitioning a data store into smaller, more manageable fragments called shards, which are distributed across multiple servers or nodes.
How is sharding different from Spark partitioning its RDDs across different nodes? New learner here, just curious. Does the concept of sharding come into play at all when we handle data in Spark?
They are two different concepts, @sajid007.
Sharding stores your data across multiple nodes in a cluster; which piece of data goes to which node is determined by the shard key.
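To make that concrete, here is a minimal, database-agnostic sketch of how a shard key might route a record to a node (the node names and the hashing scheme are just assumptions for illustration):

```python
# Minimal sketch of shard-key routing: the shard key is hashed to decide
# which node permanently stores a given record.
# The node names and the use of md5 are illustrative assumptions.
import hashlib

SHARDS = ["node-a", "node-b", "node-c"]  # hypothetical cluster nodes

def shard_for(shard_key: str) -> str:
    """Map a shard key (e.g. a user_id) to the node that stores that record."""
    digest = hashlib.md5(shard_key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

record = {"user_id": "user-42", "name": "Alice"}
print(shard_for(record["user_id"]))  # the same key always routes to the same node
```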
At processing time, Spark uses a MapReduce-style model: it splits the data across multiple nodes so it can be processed in parallel, then reduces the results of those parallel tasks into the final result. This is compute-time partitioning only (temporary), whereas sharding is permanent: it defines how the data is actually stored.
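As a rough PySpark sketch (assuming a local Spark installation; the numbers and lambdas are just placeholders), you can see that the partitions exist only for the lifetime of the job:

```python
# Rough sketch of compute-time partitioning in Spark. Assumes a local Spark
# installation; the data and the doubling/summing logic are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").appName("partition-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1_000_000), numSlices=8)  # split into 8 partitions
print(rdd.getNumPartitions())                        # -> 8, only for this job

# The map phase runs in parallel across partitions; reduce combines the results.
total = rdd.map(lambda x: x * 2).reduce(lambda a, b: a + b)
print(total)

spark.stop()  # the partitions vanish with the job; nothing was stored this way
```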
Spark generally reads its data from HDFS or S3, which internally behave much like a sharded data store.
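For example (the path is a hypothetical placeholder), the blocks of an HDFS file surface as the initial partitions of the RDD Spark builds from it:

```python
# Sketch: when Spark reads from HDFS or S3, the input is already split into
# blocks/objects, and those splits become the initial RDD partitions.
# The path below is a hypothetical placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-read-demo").getOrCreate()
lines = spark.sparkContext.textFile("hdfs:///data/events/*.log")
print(lines.getNumPartitions())  # roughly one partition per HDFS block of input
spark.stop()
```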
Hope that explains it. Please let me know if you have any further questions.
Sharding is for optimized data storage and retrieval, whereas Spark partitioning is for parallel processing.