@joanpau
Last active August 29, 2015 14:01
NetCDF text attribute encoding test.
%TESTNCTEXT NetCDF text attribute test.
%
% Input and output of text attributes to NetCDF files is not consistent when
% they contain non-ASCII characters. Saving the attribute to a file and loading
% it again does not recover the original value.
%
% The cause of the problem seems to be the different data types used by MATLAB
% and NetCDF to represent character data, and that it is not documented how
% the conversion is done:
%
% - CHAR class in MATLAB is 2 bytes, and characters are encoded in what seems
% to be UCS-2 (2-byte Universal Character Set), equivalent to UTF-16
% without surrogate pairs.
%
% - NetCDF data type NC_CHAR is 1 byte, and the format does not specify any
% encoding.
%
% The conversion for writing text attributes seems to follow these rules:
%
% - The text attribute in the NetCDF file and the corresponding MATLAB value
% have exactly the same length: a 13-element CHAR array is written as a
% 13-element NC_CHAR attribute value.
%
% - Each NC_CHAR value is set to the least significant byte of the respective
% CHAR value. Thus only CHAR codes in the range from 0 to 255 are stored
% unaltered in the NetCDF file.
%
% To read text attributes the conversion seems to be done as follows:
%
% - The sequence of NC_CHAR elements is decoded according to the current
% character set.
%
% - The null character, if present, terminates the string no matter the value
% of the following NC_CHAR elements.
%
% If the above assumptions are true, by using UCS-2 for internal representation
% of character data but storing only the least significant byte when writing to
% NetCDF, MATLAB encodes the text attributes according to iso8859-1 (latin1).
% Thus if a different character set is in use (default is UTF-8) the encoding
% and decoding procedures are not consistent.
%
% For example, if the character set is UTF-8, only the ASCII characters are
% preserved. They are the NC_CHAR values with codes in the range from 0 to 127.
% Codes from 128 to 255 are replaced by the 'replacement character' (U+FFFD,
% 0xfffd in UTF-16, decimal value 65533), because as isolated bytes they do
% not form valid UTF-8 sequences.
%
% Hence, there are two ways to achieve write-read consistency:
%
% - Keep the current encoding approach, and always decode text attributes
% assuming they are encoded in iso8859-1 (latin1) and clearly state the
% text encoding and its limitations in the documentation. This requires
% a trivial modification in NETCDF.GETATT. Text attributes should be
% read as bytes in the NETCDFLIB call and then:
% attrvalue = native2unicode(attrbytes, 'latin1')
%
% - Keep the current decoding approach, and explicitly encode text attributes
% according to the current character set. This requires modifying the
% NETCDFLIB mex interface, whose code is not available. A hacky alternative
% is to perform the encoding in NETCDF.PUTATT, before the NETCDFLIB call:
% attrvalue = char(unicode2native(attrvalue))
%
% The above solutions would only provide MATLAB session read-write consistency.
% To achieve complete compatibility, the user should be able to set the
% encoding of the text attributes when reading and when writing, either as an
% option in the function calls or as a preference, with a sensible default
% value (e.g. the default character set). This would probably require
% modifications to the mex interface NETCDFLIB and/or to the functions
% NETCDF.GETATT and NETCDF.PUTATT.
%
% All this should be noted in the documentation.
%
% See also:
% NATIVE2UNICODE
% UNICODE2NATIVE
%
% Author: Joan Pau Beltran
% Email: joanpau.beltran@socib.cat
%% Create test file.
% Create a file with several text attributes,
% some of them with non-ASCII characters.
nc_globalid = netcdf.getConstant('NC_GLOBAL');
vocals = 'aàáeèéiíoòóuú';
codes = char(1:255);
complete = ['stop' char(0) 'here'];
ncid_out = netcdf.create('vocals.nc', 'NC_CLOBBER');
netcdf.putAtt(ncid_out, nc_globalid, 'vocals', vocals);
netcdf.putAtt(ncid_out, nc_globalid, 'codes', codes);
netcdf.putAtt(ncid_out, nc_globalid, 'complete', complete);
netcdf.close(ncid_out);
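%% Sketch of the assumed write conversion (illustrative addition).
% The conversion rules described in the header can be imitated in plain
% MATLAB, without going through the NetCDF library. This is only a sketch,
% under the assumption that NETCDF.PUTATT keeps the least significant byte of
% each CHAR code. The names STORED_BYTES, AS_LATIN1 and AS_UTF8 are introduced
% here for illustration and are not part of the original test.
stored_bytes = uint8(mod(double(vocals), 256));      % one byte per character
as_latin1 = native2unicode(stored_bytes, 'latin1');  % decode as iso8859-1
as_utf8 = native2unicode(stored_bytes, 'UTF-8');     % decode as UTF-8
% All codes in VOCALS are below 256, so decoding as latin1 should give back
% the original text, while decoding as UTF-8 is expected to differ.
disp(isequal(as_latin1, vocals));
disp(isequal(as_utf8, vocals));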
%% Load test file.
% Load the file again and try to read the same attributes.
% KOMPLETE is truncated at the null character. This is acceptable.
% Other attributes should return the same contents,
% but they do not if character set is UTF-8.
ncid_in = netcdf.open('vocals.nc', 'NC_NOWRITE');
vokals = netcdf.getAtt(ncid_in, nc_globalid, 'vocals');
kodes = netcdf.getAtt(ncid_in, nc_globalid, 'codes');
komplete = netcdf.getAtt(ncid_in, nc_globalid, 'complete');
netcdf.close(ncid_in);
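%% Compare written and read values (illustrative addition).
% A minimal check that makes the outcome of the test explicit: ISEQUAL tells
% whether each attribute survived the round trip. The results depend on the
% character set in use, so no particular output is claimed here.
fprintf('vocals preserved: %d\n', isequal(vocals, vokals));
fprintf('codes preserved:  %d\n', isequal(codes, kodes));
fprintf('complete read as: %s\n', komplete);
%% Workaround sketch: encode explicitly before writing (illustrative addition).
% This applies the second approach from the header on the caller side: the
% text is encoded with the default character set before NETCDF.PUTATT, so that
% the decoding presumably done by NETCDF.GETATT can undo it. It relies on the
% conversion rules assumed in the header. The file name 'vocals_encoded.nc'
% and the variables below are hypothetical names chosen for this example.
vocals_encoded = char(unicode2native(vocals));  % bytes stored as CHAR codes
ncid_enc = netcdf.create('vocals_encoded.nc', 'NC_CLOBBER');
netcdf.putAtt(ncid_enc, nc_globalid, 'vocals', vocals_encoded);
netcdf.close(ncid_enc);
ncid_enc = netcdf.open('vocals_encoded.nc', 'NC_NOWRITE');
vokals_encoded = netcdf.getAtt(ncid_enc, nc_globalid, 'vocals');
netcdf.close(ncid_enc);
fprintf('vocals preserved with explicit encoding: %d\n', ...
        isequal(vocals, vokals_encoded));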