Network bandwidth considerations for performance

Network bandwidth, network latency, file format, and compression settings all affect performance in hybrid cloud environments, in which compute resources run in the cloud and data remains on premises.

Consider the following factors when optimizing network performance:

  • Remote data access is a practical model for bursty workloads; however, performance depends heavily on the available network bandwidth.
  • Columnar file formats (for example, Parquet or ORC) drastically reduce execution time and data transfer compared with CSV, because a query reads only the columns it needs instead of entire rows. This makes them a prerequisite for hybrid setups.
  • Bandwidth constraints (for example, 5 Gbit/s) increase execution time and reduce CPU efficiency for I/O-intensive queries.
  • Gzip compression significantly reduces data transfer volume, which improves performance under limited bandwidth. Snappy offers smaller gains but with lower CPU overhead.
  • Not all queries are impacted equally. CPU-bound queries run efficiently even under constrained bandwidth, while I/O-bound queries degrade sharply.
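
The effect of compression on transfer volume can be illustrated with a minimal sketch that uses only the Python standard library. It is not a benchmark: the input is synthetic, highly repetitive CSV-like text (an assumption made for illustration), so the ratio it prints is likely more favorable than for real data. The helper name transfer_seconds is hypothetical, the 5 Gbit/s link speed is taken from the example above, and Snappy is not shown because it is not part of the standard library.

```python
import gzip


def transfer_seconds(num_bytes: int, gbit_per_s: float) -> float:
    """Ideal time to move num_bytes over a link.

    Ignores latency, protocol overhead, and contention (assumption).
    """
    return num_bytes * 8 / (gbit_per_s * 1e9)


# Synthetic CSV-like rows (assumption: real data usually compresses less well).
rows = "".join(f"{i},sensor-{i % 10},OK\n" for i in range(100_000))
raw = rows.encode("utf-8")
packed = gzip.compress(raw)

print(f"raw: {len(raw)} bytes, gzip: {len(packed)} bytes")
print(f"compression ratio: {len(raw) / len(packed):.1f}x")
print(
    f"ideal transfer at 5 Gbit/s: {transfer_seconds(len(raw), 5.0):.6f} s raw, "
    f"{transfer_seconds(len(packed), 5.0):.6f} s gzip"
)
```

Fewer bytes on the wire translate directly into a shorter ideal transfer time, at the cost of the CPU time spent compressing and decompressing, which this sketch does not measure.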

The strategic use of columnar formats and compression enables many workloads to run efficiently in hybrid environments, even with limited network capacity. For CPU-intensive Spark jobs, this setup is a viable architecture for burst-to-cloud use cases. In contrast, I/O-intensive jobs remain highly sensitive to network limits, making this approach less suitable for data-heavy pipelines without further optimization.
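
As a rough way to reason about which jobs suit this model, the following sketch compares the ideal time to transfer a job's input with the compute time the job would need if the data were local. Everything in it is an illustrative assumption: the QueryProfile class, the bottleneck function, and the byte counts and CPU times are hypothetical and not measured results. The comparison also ignores latency, protocol overhead, and any overlap between transfer and compute.

```python
from dataclasses import dataclass


@dataclass
class QueryProfile:
    name: str
    bytes_read: int     # bytes fetched from on-premises storage (hypothetical)
    cpu_seconds: float  # compute time if the data were local (hypothetical)


def bottleneck(query: QueryProfile, link_gbit_per_s: float) -> str:
    """Return "network" if the ideal transfer time exceeds the compute time."""
    transfer_s = query.bytes_read * 8 / (link_gbit_per_s * 1e9)
    return "network" if transfer_s > query.cpu_seconds else "cpu"


# Hypothetical profiles, chosen only to illustrate the two cases.
cpu_heavy = QueryProfile("cpu_heavy", bytes_read=20 * 10**9, cpu_seconds=600.0)
io_heavy = QueryProfile("io_heavy", bytes_read=2 * 10**12, cpu_seconds=300.0)

for q in (cpu_heavy, io_heavy):
    print(q.name, bottleneck(q, 5.0))
# prints:
# cpu_heavy cpu
# io_heavy network
```

At 5 Gbit/s, the 20 GB input of the first profile needs about 32 seconds of ideal transfer time, well below its compute time, while the 2 TB input of the second needs about 3,200 seconds and dominates the job.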