Network bandwidth considerations for performance
Network bandwidth and latency, file format, and compression settings all affect performance in hybrid cloud environments, where compute resources run in the cloud and data remains on-premises.
- Remote data access is a practical model for bursty workloads; however, performance is heavily impacted by available network bandwidth.
- Columnar file formats (e.g., Parquet, ORC) drastically reduce execution time and data transfer compared to CSV, because a query reads only the columns it needs. This makes them a prerequisite for hybrid setups.
- Bandwidth constraints (e.g., 5 Gbit/s) increase execution time and reduce CPU efficiency for I/O-intensive queries.
- Gzip compression significantly reduces data transfer volume, improving performance under limited bandwidth. Snappy offers only minor gains, but with lower CPU overhead.
- Not all queries are impacted equally. CPU-bound queries run efficiently even under constrained bandwidth, while I/O-bound queries degrade sharply.
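
To make the bandwidth and compression points concrete, here is a minimal back-of-the-envelope sketch in Python. It only computes a lower bound on transfer time from data volume, link speed, and compression ratio. The 100 GB scan size and the compression ratios are hypothetical placeholders, not measured results; only the 5 Gbit/s figure comes from the text above.

```python
def transfer_seconds(data_bytes: float,
                     bandwidth_gbit_s: float,
                     compression_ratio: float = 1.0) -> float:
    """Lower bound on the time needed to move data over the link.

    compression_ratio is uncompressed size divided by compressed size
    (1.0 means no compression). Latency and protocol overhead are ignored.
    """
    if bandwidth_gbit_s <= 0 or compression_ratio <= 0:
        raise ValueError("bandwidth and compression ratio must be positive")
    bytes_per_second = bandwidth_gbit_s * 1e9 / 8  # Gbit/s -> bytes/s
    return data_bytes / compression_ratio / bytes_per_second


if __name__ == "__main__":
    scan_bytes = 100e9  # hypothetical 100 GB scan
    # Ratios below are illustrative assumptions, not benchmark results.
    for label, ratio in [("uncompressed", 1.0),
                         ("snappy (assumed 1.5x)", 1.5),
                         ("gzip (assumed 3x)", 3.0)]:
        seconds = transfer_seconds(scan_bytes, 5.0, ratio)
        print(f"{label}: at least {seconds:.0f} s over 5 Gbit/s")
```

Real transfers will take longer than this bound, but the model shows why a higher compression ratio matters most when the link, not the CPU, is the bottleneck.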
The strategic use of columnar formats and compression enables many workloads to run efficiently in hybrid environments, even with limited network capacity. For CPU-intensive Spark jobs, remote data access is a viable architecture for burst-to-cloud use cases. In contrast, I/O-intensive jobs remain highly sensitive to network limits, which makes this approach less suitable for data-heavy pipelines without further optimization.
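
One rough way to reason about whether a job is a good burst-to-cloud candidate is to compare its compute time with the time needed to fetch its input. The sketch below is an assumed, simplified model, not a method taken from the measurements summarized here: the function names and the example timings are hypothetical, and it treats CPU efficiency as the fraction of wall-clock time spent computing.

```python
def cpu_efficiency(compute_s: float, transfer_s: float,
                   overlap: bool = True) -> float:
    """Fraction of wall-clock time spent on useful compute.

    With overlap=True, transfer and compute are assumed to be fully
    pipelined, so wall time is the larger of the two. With overlap=False
    they run back to back. Both are idealized assumptions.
    """
    if compute_s < 0 or transfer_s < 0:
        raise ValueError("times must be non-negative")
    wall = max(compute_s, transfer_s) if overlap else compute_s + transfer_s
    return compute_s / wall if wall > 0 else 1.0


def is_io_bound(compute_s: float, transfer_s: float) -> bool:
    """A job is treated as I/O-bound when fetching data takes longer than computing."""
    return transfer_s > compute_s


if __name__ == "__main__":
    # Hypothetical jobs: (name, compute seconds, transfer seconds)
    jobs = [("cpu-heavy aggregation", 300.0, 40.0),
            ("wide table scan", 30.0, 160.0)]
    for name, compute_s, transfer_s in jobs:
        kind = "I/O-bound" if is_io_bound(compute_s, transfer_s) else "CPU-bound"
        eff = cpu_efficiency(compute_s, transfer_s)
        print(f"{name}: {kind}, CPU efficiency about {eff:.0%}")
```

Under this model, a CPU-bound job hides its transfer time behind computation, while an I/O-bound job leaves the CPUs idle for most of the run, which matches the qualitative behavior described above.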
