This is a spec for http://dev.ckeditor.com/ticket/9991.
Content pasted from MS Word should be processed in two steps. First, the pasted content needs to be normalized, which means transforming it from MS Word pseudo-HTML into valid (although messy) HTML. This step will be performed by a new PFW filter (the pastefromword plugin allows more than one filter). Then, in the second step, the already valid HTML needs to be transformed and filtered according to the editor's configuration. This step will be performed by the Advanced Content Filter, which features two mechanisms: content transformation and content filtering.
- Normalization.
- Utilize CKEDITOR.htmlParser to parse the content. May utilize CKEDITOR.htmlParser.filter to process the content.
- Should:
- Remove things that are totally unnecessary.
- Remove MS Word specific styles which can't be interpreted by us.
- Generate lists out of MS Word's paragraphs.
- Split spans applying multiple styles into more spans applying single styles. That's because our features must be able to match CKEDITOR.style to a single element.
- When splitting spans, remember about the order. Styles like underline should be inside styles like fg color, so the underline has the color of the font.
- It's not super clear where normalization ends and filtering starts. A pragmatic approach must be used. If something is much easier to do in an MS Word specific normalization, it should be implemented there. If something can be done as a content transformation, it's better to add it as a transformation.
- Filtering and transformations.
- Read all 3 articles in http://docs.ckeditor.com/#!/guide/dev_advanced_content_filter
- Read http://docs.ckeditor.com/#!/api/CKEDITOR.filter-method-addContentForms and http://docs.ckeditor.com/#!/api/CKEDITOR.filter-method-addTransformations and related tests.
- Possible problems:
- If we get <span style="font: Foo bar 12px whatever"> from MS Word, we will need to split it into multiple spans, either during the normalization step or as a content transformation (if that's possible). That's because features like font size allow style{font-size}, so the font style will not be allowed.
- Transformations must be registered by specific plugins. E.g. currently the basicstyles plugin registers a couple of them. That keeps the code size low.
- IIRC, transformations are executed even when the ACF is disabled. This is important, because it means that e.g. a <font> element will be replaced with a <span> even if someone disabled the ACF.
- Plugins which may require additional transformations (or content forms):
- colorbutton - <font> and <span style=color/bgcolor>,
- font - <font> and <span style=font-*>,
- justify - from align attr to a style or config.justifyClasses,
- indent - from style to class if config.indentClasses are defined.
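The font shorthand problem mentioned above could perhaps be handled by a transformation rule along these lines. This is only a sketch: expandFontShorthand, fontShorthandRule and registerFontShorthandTransformation are hypothetical names, the shorthand parsing is deliberately naive (it only looks for a size token and treats everything after it as the family), and it assumes that el.styles already holds the parsed style attribute when the rule runs. Whether a transformation is the right place for this at all still needs to be checked.

```javascript
// Matches a CSS length such as "12px" or "10.5pt", optionally followed by
// "/line-height", as the font shorthand allows.
const SIZE_RE = /^(\d*\.?\d+)(px|pt|pc|em|rem|%|cm|mm|in)(\/\S+)?$/i;

// Hypothetical helper: pulls longhand values out of a font shorthand value.
// Returns null when no size token is found, i.e. the value cannot be read.
function expandFontShorthand(value) {
  const tokens = value.trim().split(/\s+/);
  const sizeIndex = tokens.findIndex((token) => SIZE_RE.test(token));
  if (sizeIndex === -1) {
    return null;
  }

  const longhands = {};
  const [size, lineHeight] = tokens[sizeIndex].split('/');
  longhands['font-size'] = size;
  if (lineHeight) {
    longhands['line-height'] = lineHeight;
  }

  // Everything after the size token is treated as the font family list.
  const family = tokens.slice(sizeIndex + 1).join(' ');
  if (family) {
    longhands['font-family'] = family;
  }

  // Before the size only a few keywords are recognized; the rest is dropped.
  for (const token of tokens.slice(0, sizeIndex)) {
    const keyword = token.toLowerCase();
    if (keyword === 'italic' || keyword === 'oblique') {
      longhands['font-style'] = keyword;
    } else if (keyword === 'bold' || /^[1-9]00$/.test(keyword)) {
      longhands['font-weight'] = keyword;
    }
  }
  return longhands;
}

// A rule in the { element, left, right } shape accepted by addTransformations.
// Assumption: el.styles is the parsed style attribute of the element.
const fontShorthandRule = {
  element: 'span',
  // Apply the rule only to spans that carry the font shorthand.
  left(el) {
    return Boolean(el.styles && el.styles.font);
  },
  // Replace the shorthand with whatever longhands could be extracted.
  right(el) {
    const longhands = expandFontShorthand(el.styles.font);
    delete el.styles.font;
    Object.assign(el.styles, longhands || {});
  }
};

// Registers a single group that contains the single rule above.
function registerFontShorthandTransformation(editor) {
  editor.filter.addTransformations([[fontShorthandRule]]);
}
```

With the example from above, a span styled with "font: Foo bar 12px whatever" would end up with font-size:12px and font-family:whatever, and the unknown Foo and bar tokens would be dropped. The font plugin would be the natural place to register such a rule.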
Notes:
- Currently, in the pastefromword/filter/ directory we have the default.js file that implements the current default filter. I guess that this file needs to be renamed to e.g. legacy.js and the new filter will take its place.
- Normalization is the most important part. Transformations come second, and when implementing them take plugins' popularity as the priority indicator.
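Since normalization is the priority, here is a rough sketch of one of its steps: splitting a span that applies multiple styles into nested single-style spans, with decorations such as underline placed inside colors. It works on plain strings to stay self-contained; a real implementation would operate on CKEDITOR.htmlParser elements. STYLE_ORDER, parseStyle, rank and splitSpan are hypothetical names, and the order itself is an assumption that needs to be verified against real Word content.

```javascript
// Assumed nesting order, from the outermost to the innermost span.
// text-decoration goes last, so an underline inherits the font color.
const STYLE_ORDER = ['font-family', 'font-size', 'background-color', 'color', 'text-decoration'];

// Parses "a: b; c: d" into [['a', 'b'], ['c', 'd']].
// Naive on purpose: a semicolon inside url(...) would break it.
function parseStyle(styleText) {
  const styles = [];
  for (const declaration of styleText.split(';')) {
    const colon = declaration.indexOf(':');
    if (colon === -1) {
      continue;
    }
    const name = declaration.slice(0, colon).trim().toLowerCase();
    const value = declaration.slice(colon + 1).trim();
    if (name && value) {
      styles.push([name, value]);
    }
  }
  return styles;
}

// Styles missing from STYLE_ORDER get -1 and therefore end up outermost.
function rank(name) {
  return STYLE_ORDER.indexOf(name);
}

// Returns nested spans, one per style, wrapped around innerHtml.
function splitSpan(styleText, innerHtml) {
  const styles = parseStyle(styleText).sort((a, b) => rank(a[0]) - rank(b[0]));
  // Build from the innermost span outwards.
  return styles.reduceRight(
    (html, [name, value]) => `<span style="${name}:${value}">${html}</span>`,
    innerHtml
  );
}

// The underline lands inside the color span, whatever the input order was.
console.log(splitSpan('text-decoration: underline; color: #f00', 'text'));
```

Each resulting span applies exactly one style, which is what lets a feature match its CKEDITOR.style to a single element.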
- Normalization tests.
- The biggest part.
- Should test the PFW's new filter behavior. It can be done by firing the editor#paste event with fixture data and checking what's left of it. See the existing tests, such as the many PFW tests and e.g. the tests for ticket #9456. Note that config.pasteFilter should not need to be disabled, because the PFW plugin sets pasteEvt.dontFilter to true. This must be checked, as so far we have tended to disable the paste filter in PFW tests and I'm not sure why.
- It'll be best to keep fixtures in separate HTML files (again, see #9456). However, unlike in #9456, I would recommend keeping the input HTML (the HTML that we got from MS Word) and the normalized HTML in separate files, because the way they are kept in #9456 proved confusing in the past.
- Transformation tests.
- Filtering does not need additional tests (at least, not many) as it's already very extensively tested.
- Transformations are also tested but they may need to be extended (both in core/filter.js and in specific plugins such as colorbutton and font).
- Integration tests.
- We know that normalization works, we know that filtering and transformations work, but we also need to make sure that all of this works together. Hence, a few tests which fire editor#paste and apply the editor.filter (see TODO) to the resulting HTML need to be implemented.
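One way to keep the input and the normalized HTML in separate files, as recommended for the normalization tests, is to pair them by name. The sketch below assumes a hypothetical naming scheme of <case>.input.html and <case>.expected.html; neither the scheme nor the pairFixtures helper exists in the current test suite.

```javascript
// Hypothetical naming scheme: "<case>.input.html" holds the HTML that came
// from MS Word, "<case>.expected.html" holds the normalized HTML.
function pairFixtures(fileNames) {
  const cases = new Map();

  for (const fileName of fileNames) {
    const match = /^(.+)\.(input|expected)\.html$/.exec(fileName);
    if (!match) {
      continue; // Not a fixture file, skip it.
    }
    const [, name, kind] = match;
    if (!cases.has(name)) {
      cases.set(name, { name, input: null, expected: null });
    }
    cases.get(name)[kind] = fileName;
  }

  const all = [...cases.values()];
  return {
    // Cases that can be run: both files are present.
    pairs: all.filter((item) => item.input && item.expected),
    // Names of cases with one file missing, so they can be reported.
    incomplete: all.filter((item) => !(item.input && item.expected)).map((item) => item.name)
  };
}

const result = pairFixtures([
  'bold.input.html',
  'bold.expected.html',
  'lists.input.html',
  'readme.txt'
]);
console.log(result.pairs.map((item) => item.name)); // [ 'bold' ]
console.log(result.incomplete); // [ 'lists' ]
```

A test runner could then fire editor#paste with the content of each input file and compare the result with the matching expected file, failing early on any incomplete case.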
The new filter does not have to be super compatible with the old one. If the old one does something wrong or does something that highly complicates it while not being very beneficial, then the new one can work differently.
To compare the behavior of the new filter to the old one, you can configure two editors (note – they need to be on separate pages, as the filter is exposed by a global CKEDITOR.cleanWord method). In the sample with the new one, disable the ACF. In the sample with the old one, set the config.pasteFromWordRemove* options to false and also disable the ACF. Both editors should load the full preset. Such a configuration should give pretty similar results.
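The two configurations could look roughly like this. It is a sketch that assumes the ACF is disabled with allowedContent: true and that the pasteFromWordRemove* pattern refers to pasteFromWordRemoveFontStyles and pasteFromWordRemoveStyles; the 'editor' element id in the comment is hypothetical.

```javascript
// Page 1: the editor with the new filter. Only the ACF is switched off.
const newFilterConfig = {
  allowedContent: true // Disables the ACF.
};

// Page 2: the editor with the old filter. The ACF is switched off and the
// old clean-up options are set to false, so the old filter keeps the styles.
const oldFilterConfig = {
  allowedContent: true,
  pasteFromWordRemoveFontStyles: false,
  pasteFromWordRemoveStyles: false
};

// On each page, with the full preset loaded, something like:
// CKEDITOR.replace('editor', newFilterConfig);
// CKEDITOR.replace('editor', oldFilterConfig);
```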
Note: The old configuration options will have no effect on the new filter. Those options were predecessors of the ACF, so the rewrite is exactly about removing them. Therefore, the documentation will need to be updated.
- Tests should be easily portable to CKE5, so they must be shaped nicely.
- It will be nice to write down some discoveries that would not be normally commented in the code.