Summary
This report converts an original ERA5 file in GRIB1 format into GRIB2, NetCDF3, and NetCDF4 files under different compression algorithms, additionally compresses each of them with bz2, and compares the resulting file sizes. The conclusion is that, with or without bz2 compression, the smallest files belong to the GRIB2 family of formats.
Data Introduction
The raw data used in this test is a GRIB file downloaded from the ERA5 reanalysis dataset. The raw data is in GRIB1 format. To compare sizes across formats, I use several tools to convert the data into each target format.
Original data download address: https://doi.org/10.5281/zenodo.6348679
Tools and Methods
The command line tools used in this article include:
- ecCodes
- wgrib2
- netCDF4
- bzip2
Among them, ecCodes and netCDF4 can be installed directly using conda, while bzip2 and wgrib2 must be installed by whatever method suits your operating system. This article runs wgrib2 from a Docker image:
$ docker run agilesrc/wgrib2 bash
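Before running the commands in this article, it may help to confirm that the required executables are on the PATH. The following is a minimal sketch, not part of the original workflow: `missing_tools` is a hypothetical helper name, and the tool names in the example call are simply the executables invoked later in this article.

```shell
#!/bin/bash
# Hypothetical helper: print the name of every given tool that is NOT on PATH.
# Prints nothing when all tools are found.
missing_tools() {
    local tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
        fi
    done
}

# Example call, using the executables this article invokes:
# missing_tools grib_set grib_ls grib_to_netcdf wgrib2 nccopy bzip2
```

If wgrib2 is only available inside the Docker container, it will be reported as missing on the host, which is expected.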
Size Comparison of GRIB1 and GRIB2
The GRIB1 format does not support compression, while the GRIB2 format does, so comparing the sizes of GRIB1 and GRIB2 files essentially means comparing the GRIB1 file against GRIB2 files under various compression methods.
Since our original file era5-sample.grib is in GRIB1 format, we first convert it to GRIB2 by executing the following command in the terminal:
grib_set -s edition=2 era5-sample.grib era5-sample.grib2
Check out the new GRIB2 file:
$ grib_ls era5-sample.grib2
era5-sample.grib2
edition  centre  date      dataType  gridType    stepRange  typeOfLevel        level  shortName  packingType
2        ecmf    20211028  fc        regular_ll  8          heightAboveGround  10     10u        grid_simple
2        ecmf    20211028  fc        regular_ll  8          heightAboveGround  10     10v        grid_simple
2        ecmf    20211028  fc        regular_ll  8          heightAboveGround  2      2d         grid_simple
2        ecmf    20211028  fc        regular_ll  8          heightAboveGround  2      2t         grid_simple
2        ecmf    20211028  fc        regular_ll  8          surface            0      fal        grid_simple
2        ecmf    20211028  fc        regular_ll  7-8        surface            0      slhf       grid_simple
2        ecmf    20211028  fc        regular_ll  7-8        surface            0      ssr        grid_simple
2        ecmf    20211028  fc        regular_ll  7-8        surface            0      str        grid_simple
2        ecmf    20211028  fc        regular_ll  8          surface            0      sp         grid_simple
2        ecmf    20211028  fc        regular_ll  7-8        surface            0      sshf       grid_simple
2        ecmf    20211028  fc        regular_ll  7-8        surface            0      ssrd       grid_simple
2        ecmf    20211028  fc        regular_ll  7-8        surface            0      strd       grid_simple
2        ecmf    20211028  fc        regular_ll  7-8        surface            0      tp         grid_simple
13 of 13 messages in era5-sample.grib2
13 of 13 total messages in 1 files
It can be seen that the GRIB edition is now 2. Next, we compress the converted GRIB2 file. In an environment where wgrib2 is installed and bash is available, run the following script in the data directory:
#!/bin/bash
# Rewrite the GRIB2 file once per wgrib2 packing type.
ctypes=("ieee" "simple" "complex1" "complex2" "complex3" "jpeg" "aec" "same")
for ctype in "${ctypes[@]}"
do
    wgrib2 -set_grib_type "$ctype" era5-sample.grib2 -grib_out "era5-sample-$ctype.grib2"
done
This results in the following files:
$ ls -lh era5-sample-*.grib2
-rw-r--r--  1 clarmylee  staff    36M  3 12 14:02 era5-sample-aec.grib2
-rw-r--r--  1 clarmylee  staff    34M  3 12 14:01 era5-sample-complex1.grib2
-rw-r--r--  1 clarmylee  staff    27M  3 12 14:01 era5-sample-complex2.grib2
-rw-r--r--  1 clarmylee  staff    27M  3 12 14:02 era5-sample-complex3.grib2
-rw-r--r--  1 clarmylee  staff   322M  3 12 14:01 era5-sample-ieee.grib2
-rw-r--r--  1 clarmylee  staff    36M  3 12 14:02 era5-sample-jpeg.grib2
-rw-r--r--  1 clarmylee  staff    65M  3 12 14:02 era5-sample-same.grib2
-rw-r--r--  1 clarmylee  staff    65M  3 12 14:01 era5-sample-simple.grib2
It can be seen that the largest output is ieee (322M) and the smallest is complex3 (27M, tied with complex2 at the resolution of this listing); the original GRIB1 file is larger than every output except ieee. The GRIB2 file converted directly with grib_set is slightly smaller than the original file, while the complex3 file is about 1/3 the size of the original.
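Because ls -lh rounds to whole megabytes, ties such as the 27M shown for both complex2 and complex3 cannot be resolved from the listing alone. A minimal bash sketch for comparing exact byte counts follows; `sizes_sorted` is a hypothetical helper name, not something used by the original author.

```shell
#!/bin/bash
# Hypothetical helper: print "<bytes> <filename>" for each existing file
# argument, sorted from smallest to largest. Missing files are skipped.
sizes_sorted() {
    local f bytes
    for f in "$@"; do
        if [ -f "$f" ]; then
            bytes=$(wc -c < "$f")
            # $((bytes)) strips the leading spaces some wc versions emit.
            printf '%s %s\n' "$((bytes))" "$f"
        fi
    done | sort -n
}

# Example call in the data directory:
# sizes_sorted era5-sample.grib era5-sample*.grib2
```

The first line of the output is then the smallest file by exact size.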
All of the GRIB files above remain directly readable. Next, we additionally compress each of them to .bz2 by executing
$ bzip2 -k *grib*
in the terminal, which yields the following files:
$ ls -lh *.bz2
-rw-r--r--  1 clarmylee  staff    25M  3 12 14:02 era5-sample-aec.grib2.bz2
-rw-r--r--  1 clarmylee  staff    33M  3 12 14:01 era5-sample-complex1.grib2.bz2
-rw-r--r--  1 clarmylee  staff    26M  3 12 14:01 era5-sample-complex2.grib2.bz2
-rw-r--r--  1 clarmylee  staff    26M  3 12 14:02 era5-sample-complex3.grib2.bz2
-rw-r--r--  1 clarmylee  staff    55M  3 12 14:01 era5-sample-ieee.grib2.bz2
-rw-r--r--  1 clarmylee  staff    26M  3 12 14:02 era5-sample-jpeg.grib2.bz2
-rw-r--r--  1 clarmylee  staff    31M  3 12 14:02 era5-sample-same.grib2.bz2
-rw-r--r--  1 clarmylee  staff    31M  3 12 14:01 era5-sample-simple.grib2.bz2
-rw-r--r--  1 clarmylee  staff    52M  3 12 13:58 era5-sample.grib.bz2
-rw-r--r--  1 clarmylee  staff    52M  3 12 13:58 era5-sample.grib2.bz2
It can be seen that after bz2 compression, the smallest file is the one packed with the aec method (25M). bz2 has the most visible effect on ieee (322M down to 55M), while the files that were already smallest (complex1, complex2, and complex3) barely shrink.
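To put numbers on how much bz2 helps each packing type, the saving can be computed from the two listings. The sketch below is illustrative only: `bz2_saving_pct` is a hypothetical helper, and the inputs are the rounded megabyte figures from ls -lh, so the results are approximate.

```shell
#!/bin/bash
# Hypothetical helper: percentage of space saved by bz2, given the size
# before and after compression (both in the same unit).
bz2_saving_pct() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", (1 - after / before) * 100 }'
}

# Sizes in MB, read off the ls -lh listings in this section.
bz2_saving_pct 322 55   # ieee
bz2_saving_pct 36 25    # aec
bz2_saving_pct 27 26    # complex3
```

With these inputs the script prints 82.9, 30.6, and 3.7, which matches the observation that bz2 matters most for ieee and least for the complex packings.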
From the above, within the GRIB ecosystem and purely from the perspective of file size, storing the data as GRIB2 with complex3 packing is the best option when the file must remain directly readable. If direct readability can be sacrificed (that is, an extra bz2 decompression step is acceptable), aec packing can also be considered.
Size Comparison Between NetCDF3 and NetCDF4
Next, we compare the file sizes of NetCDF3 and NetCDF4. Similar to GRIB, the older NetCDF3 does not support native compression; to compress it you need an external tool such as bzip2. The newer NetCDF4 supports native compression, so comparing the two formats is really a comparison between NetCDF3 and NetCDF4 at different compression levels.
The grib_to_netcdf command supports converting GRIB to the following four NetCDF storage formats:
- netCDF classic file format
- netCDF 64 bit classic file format (Default)
- netCDF-4 file format
- netCDF-4 classic model file format
We will not go into the underlying differences between these four formats here and only compare their sizes. We first convert the GRIB file to each of the four NetCDF formats:
$ grib_to_netcdf -k 1 -o era5-sample-class.nc3 era5-sample.grib
$ grib_to_netcdf -k 2 -o era5-sample-64class.nc3 era5-sample.grib
$ grib_to_netcdf -k 3 -o era5-sample.nc4 era5-sample.grib
$ grib_to_netcdf -k 4 -o era5-sample-class.nc4 era5-sample.grib
$ ls -lh *nc*
-rw-r--r--  1 clarmylee  staff   161M  3 12 15:26 era5-sample-64class.nc3
-rw-r--r--  1 clarmylee  staff   161M  3 12 15:26 era5-sample-class.nc3
-rw-r--r--  1 clarmylee  staff   161M  3 12 15:27 era5-sample-class.nc4
-rw-r--r--  1 clarmylee  staff   161M  3 12 15:26 era5-sample.nc4
It can be seen that there is no significant difference in size among the NetCDF formats converted directly by the grib_to_netcdf command. Let's first use nccopy to apply native compression to the nc4 file by executing the following script:
#!/bin/bash
# Recompress the NetCDF-4 file at every deflate level from 0 (none) to 9.
clevels=(0 1 2 3 4 5 6 7 8 9)
for level in "${clevels[@]}"
do
    nccopy -k 'netCDF-4' -d "$level" era5-sample.nc4 "era5-sample-c$level.nc4"
done
Check the result:
$ ls -lh era5-sample-*nc*
-rw-r--r--  1 clarmylee  staff   161M  3 12 15:26 era5-sample-64class.nc3
-rw-r--r--  1 clarmylee  staff   161M  3 12 15:44 era5-sample-c0.nc4
-rw-r--r--  1 clarmylee  staff    44M  3 12 15:44 era5-sample-c1.nc4
-rw-r--r--  1 clarmylee  staff    44M  3 12 15:44 era5-sample-c2.nc4
-rw-r--r--  1 clarmylee  staff    44M  3 12 15:44 era5-sample-c3.nc4
-rw-r--r--  1 clarmylee  staff    44M  3 12 15:44 era5-sample-c4.nc4
-rw-r--r--  1 clarmylee  staff    44M  3 12 15:44 era5-sample-c5.nc4
-rw-r--r--  1 clarmylee  staff    44M  3 12 15:44 era5-sample-c6.nc4
-rw-r--r--  1 clarmylee  staff    44M  3 12 15:44 era5-sample-c7.nc4
-rw-r--r--  1 clarmylee  staff    44M  3 12 15:44 era5-sample-c8.nc4
-rw-r--r--  1 clarmylee  staff    44M  3 12 15:44 era5-sample-c9.nc4
-rw-r--r--  1 clarmylee  staff   161M  3 12 15:26 era5-sample-class.nc3
-rw-r--r--  1 clarmylee  staff   161M  3 12 15:27 era5-sample-class.nc4
It can be seen that in the uncompressed state (level 0) the nc4 file is 161M, while compression levels 1 through 9 all produce 44M. In other words, for NetCDF4 the difference between compressed and uncompressed is large, the difference between compression levels is small, and uncompressed NetCDF4 is as bulky as NetCDF3.
Let's apply bz2 again by executing
$ bzip2 -k era5-sample-*nc*
and then plot the results:
As can be seen from the above figure, without bz2 the largest file is the uncompressed NetCDF4 format and the smallest is NetCDF4 at compression level 9.
After bz2 compression, however, NetCDF3 ends up smaller than NetCDF4.
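A rough text-only stand-in for such a size plot can be sketched in bash. This is an illustration, not the plotting method used for the original figure: `text_bars` is a hypothetical helper that reads "<size> <label>" lines and scales each bar relative to the largest size. The example input uses two sizes from the listing above.

```shell
#!/bin/bash
# Hypothetical helper: read "<size> <label>" lines on stdin and print one
# text bar per line, scaled so the largest size spans `width` characters.
text_bars() {
    local width="${1:-40}"
    awk -v width="$width" '
        { size[NR] = $1 + 0; label[NR] = $2; if ($1 + 0 > max) max = $1 + 0 }
        END {
            for (i = 1; i <= NR; i++) {
                # Round to the nearest whole character.
                n = (max > 0) ? int(size[i] / max * width + 0.5) : 0
                bar = ""
                for (j = 0; j < n; j++) bar = bar "#"
                printf "%-28s %6s %s\n", label[i], size[i], bar
            }
        }'
}

# Example: sizes in MB taken from the ls -lh listing above.
printf '%s\n' '161 era5-sample-c0.nc4' '44 era5-sample-c9.nc4' | text_bars 20
```

Labels are assumed to contain no spaces, which holds for the file names used in this article.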
Cross-comparison of 4 Format Sizes
The following combines all of the formats above, compressed and uncompressed, at their different compression levels, and compares their sizes.
The above picture is sorted by non-bz2 size from largest to smallest. Among the files that remain directly readable, GRIB2 with complex3 packing is still the smallest.
Sorted by size after bz2 compression, the smallest is still GRIB2 with aec packing.
Conclusion
From the above comparison, we can conclude that, with or without bz2 compression, the smallest files belong to the GRIB2 family. Without bz2, the smallest is the GRIB2 file using complex3 packing; with bz2, the smallest is the GRIB2 file using aec packing. Of course, this considers only file size. Examining read speed would require additional experiments and tests.
To cite this article, please use the following citation format:
Wentao Li. (2022). 一份GRIB与NetCDF的体积对比报告 [A Size Comparison Report on GRIB and NetCDF] (Version v1). Zenodo. https://doi.org/10.5281/zenodo.6348695