joanpau/testnctext.m

## testnctext.m
%TESTNCTEXT  NetCDF text attribute test.
%
%  Input and output of text attributes to NetCDF files is not consistent when
%  they contain non-ASCII characters. Saving the attribute to a file and loading
%  it again does not recover the original value.
%
%  The cause of the problem seems to be the different data types used by MATLAB
%  and NetCDF to represent character data, and that it is not documented how
%  the conversion is done:
%
%    - CHAR class in MATLAB is 2 bytes, and characters are encoded in what seems
%      to be UCS-2 (2-byte Universal Character Set), equivalent to UTF-16
%      without surrogate pairs.
%
%    - NetCDF data type NC_CHAR is 1 byte, and the format does not specify any
%      encoding.
%
%  The conversion for writing text attributes seems to follow these rules:
%
%    - The text attribute in the NetCDF file and the corresponding MATLAB value
%      have exactly the same length: a 13-element CHAR array is written as a
%      13-element NC_CHAR attribute value.
%
%    - Each NC_CHAR value is set to the least significant byte of the respective
%      CHAR value. Thus only CHAR codes in the range from 0 to 255 are stored
%      unaltered in the NetCDF file.
%
%  To read text attributes the conversion seems to be done as follows:
%
%    - The sequence of NC_CHAR elements is decoded according to the current
%      character set.
%
%    - The null character, if present, terminates the string no matter the value
%      of the following NC_CHAR elements.
%
%  If the above assumptions are true, by using UCS-2 for internal representation
%  of character data but storing only the least significant byte when writing to
%  NetCDF, MATLAB encodes the text attributes according to iso8859-1 (latin1).
%  Thus if a different character set is in use (default is UTF-8) the encoding
%  and decoding procedures are not consistent.
%
%  For example, if the character set is UTF-8, only the ASCII characters are
%  preserved. They are the NC_CHAR values with codes in the range from 0 to 127.
%  Codes from 128 to 255 are replaced by the 'replacement character' (U+FFFD,
%  0xfffd in UTF-16, decimal value 65533), because a lone byte in that range
%  is not a valid UTF-8 sequence.
%
%  Hence, there are two ways to achieve write-read consistency:
%
%    - Keep the current encoding approach, and always decode text attributes
%      assuming they are encoded in iso8859-1 (latin1) and clearly state the
%      text encoding and its limitations in the documentation. This requires
%      a trivial modification in NETCDF.GETATT. Text attributes should be
%      read as bytes in the NETCDFLIB call and then:
%        attrvalue = native2unicode(attrbytes, 'latin1')
%
%    - Keep the current decoding approach, and explicitly encode text attributes
%      according to the current character set. This requires modifying the
%      NETCDFLIB mex interface, whose code is not available. A hacky alternative
%      is to perform the encoding in NETCDF.PUTATT, before the NETCDFLIB call:
%        attrvalue = char(unicode2native(attrvalue))
%
%  The above solutions would only provide MATLAB session read-write consistency.
%  To achieve complete compatibility, the user should be able to set the
%  encoding of the text attributes when reading and when writing, either as an
%  option in the function calls or as a preference, with a sensible default
%  value (e.g. the default character set). This would probably require
%  modifications to the mex interface NETCDFLIB and/or to the functions
%  NETCDF.GETATT and NETCDF.PUTATT.
%
%  All this should be noted in the documentation.
%
%  See also:
%    NATIVE2UNICODE
%    UNICODE2NATIVE
%
%  Author: Joan Pau Beltran
%  Email: joanpau.beltran@socib.cat
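
%% Sketch: simulate the suspected conversion without NetCDF.
% Illustrative addition, not part of the original test. It mimics the rules
% assumed in the header: keep the least significant byte of each CHAR on
% write, then decode the bytes with a given character set on read.
% The variable names and the sample text are hypothetical.
sample = char([97 224 8364]);  % 'a', U+00E0 and U+20AC (euro sign)
written_bytes = uint8(mod(double(sample), 256));  % low byte of each CHAR
% Decoding as latin1 maps every byte to the code point of equal value,
% so codes up to 255 survive and U+20AC collapses to its low byte 0xAC.
latin1_decoded = native2unicode(written_bytes, 'latin1');
% Decoding with the default character set may differ (e.g. under UTF-8).
default_decoded = native2unicode(written_bytes);
fprintf('written bytes:    %s\n', mat2str(written_bytes));
fprintf('latin1 decoding:  %s\n', mat2str(double(latin1_decoded)));
fprintf('default decoding: %s\n', mat2str(double(default_decoded)));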


%% Create test file.
% Create a file with several text attributes,
% some of them with non-ASCII characters.
nc_globalid = netcdf.getConstant('NC_GLOBAL');
vocals = 'aàáeèéiíoòóuú';
codes = char(1:255);
complete = ['stop' char(0) 'here'];
ncid_out = netcdf.create('vocals.nc', 'NC_CLOBBER');
netcdf.putAtt(ncid_out, nc_globalid, 'vocals', vocals);
netcdf.putAtt(ncid_out, nc_globalid, 'codes', codes);
netcdf.putAtt(ncid_out, nc_globalid, 'complete', complete);
netcdf.close(ncid_out);


%% Load test file.
% Load the file again and try to read the same attributes.
% KOMPLETE is truncated at the null character. This is acceptable.
% Other attributes should return the same contents,
% but they do not if character set is UTF-8.
ncid_in = netcdf.open('vocals.nc', 'NC_NOWRITE');
vokals = netcdf.getAtt(ncid_in, nc_globalid, 'vocals');
kodes = netcdf.getAtt(ncid_in, nc_globalid, 'codes');
komplete = netcdf.getAtt(ncid_in, nc_globalid, 'complete');
netcdf.close(ncid_in);
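

%% Sketch: compare written and read values.
% Illustrative addition, not part of the original test. It only reports
% whether each attribute survived the round trip. The result depends on
% the character set in use, so no particular outcome is assumed here.
same_vocals = isequal(vocals, vokals);
same_codes = isequal(codes, kodes);
same_complete = isequal(complete, komplete);
fprintf('vocals preserved:   %d\n', same_vocals);
fprintf('codes preserved:    %d\n', same_codes);
fprintf('complete preserved: %d\n', same_complete);
% Report the first altered code when the lengths still match.
if ~same_codes && numel(codes) == numel(kodes)
  changed = find(codes ~= kodes);
  fprintf('first altered code: %d -> %d\n', ...
          double(codes(changed(1))), double(kodes(changed(1))));
end


%% Sketch: encode explicitly before writing.
% Tries the 'hacky alternative' from the header in user code instead of in
% NETCDF.PUTATT. The file name 'vocals_encoded.nc' and the variable names
% are hypothetical. Whether the value is recovered is only reported.
encoded = char(unicode2native(vocals));  % one CHAR per encoded byte
ncid_enc = netcdf.create('vocals_encoded.nc', 'NC_CLOBBER');
netcdf.putAtt(ncid_enc, nc_globalid, 'vocals', encoded);
netcdf.close(ncid_enc);
ncid_enc = netcdf.open('vocals_encoded.nc', 'NC_NOWRITE');
recovered = netcdf.getAtt(ncid_enc, nc_globalid, 'vocals');
netcdf.close(ncid_enc);
fprintf('vocals recovered after explicit encoding: %d\n', ...
        isequal(vocals, recovered));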