Reinmar/pfw.md Secret

Paste From Word filter

This is a spec for http://dev.ckeditor.com/ticket/9991.

The process

Content pasted from MS Word should be processed in two steps. First, the pasted content needs to be normalized, which means transforming it from MS Word's pseudo-HTML into valid (although messy) HTML. This step will be performed by a new PFW filter (the pastefromword plugin allows more than one filter). Then, in the second step, the now valid HTML needs to be transformed and filtered according to the editor's configuration. This step will be performed by the Advanced Content Filter, which features two mechanisms: content transformation and content filtering.

Normalization.


Utilize CKEDITOR.htmlParser to parse the content. May utilize CKEDITOR.htmlParser.filter to process the content.
Should:

Remove things that are totally unnecessary.
Remove MS Word specific styles which can't be interpreted by us.
Generate lists out of MS Word's paragraphs.
Split spans applying multiple styles into several spans, each applying a single style. That's because our features must be able to match a CKEDITOR.style to a single element.
When splitting spans, mind the nesting order. Styles like underline should be placed inside styles like foreground color, so that the underline takes the color of the font.
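The span splitting rule above can be illustrated with a minimal, string-based sketch. It is not the planned implementation (which would work on CKEDITOR.htmlParser elements); the function names and the ranking policy below are assumptions made for illustration only, and the sketch does no escaping and does not handle semicolons inside style values.

```javascript
// Assumed nesting policy (not from the spec): color styles go outermost,
// text-decoration goes innermost, everything else sits in between.
function nestingRank(name) {
  if (name === 'color' || name === 'background-color') return 0;
  if (name === 'text-decoration') return 2;
  return 1;
}

// 'color: red; text-decoration: underline' -> [['color', 'red'], ['text-decoration', 'underline']]
function parseStyles(styleText) {
  return styleText
    .split(';')
    .map(function (declaration) {
      const colon = declaration.indexOf(':');
      if (colon === -1) return null;
      const name = declaration.slice(0, colon).trim().toLowerCase();
      const value = declaration.slice(colon + 1).trim();
      return name && value ? [name, value] : null;
    })
    .filter(Boolean);
}

// Wraps innerHtml in one <span> per style, ordered from outermost to innermost.
function splitSpan(styleText, innerHtml) {
  const styles = parseStyles(styleText).sort(function (a, b) {
    return nestingRank(a[0]) - nestingRank(b[0]);
  });
  // Build from the inside out, so the last style in the sorted list wraps the content first.
  return styles.reduceRight(function (html, pair) {
    return '<span style="' + pair[0] + ':' + pair[1] + '">' + html + '</span>';
  }, innerHtml);
}
```

Calling `splitSpan('text-decoration: underline; color: red', 'x')` puts the color span outside and the underline span inside, regardless of the order in the input.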


It's not entirely clear where normalization ends and filtering starts, so a pragmatic approach must be used. If something is much easier to do in the MS Word specific normalization, it should be implemented there. If something can be done as a content transformation, it's better to add it as a transformation.


Filtering and transformations.


Read all 3 articles in http://docs.ckeditor.com/#!/guide/dev_advanced_content_filter
Read http://docs.ckeditor.com/#!/api/CKEDITOR.filter-method-addContentForms and http://docs.ckeditor.com/#!/api/CKEDITOR.filter-method-addTransformations and related tests.
Possible problems:

If we get <span style="font: Foo bar 12px whatever"> from MS Word, we will need to split it into multiple spans, either during the normalization step or as a content transformation (if that's possible). That's because features like font size allow style{font-size}, so the font style itself will not be allowed.
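To make the font shorthand problem concrete, here is a simplified sketch of expanding a `font` value into longhand properties, which could then be split into separate spans. The function name is hypothetical and the parser is deliberately naive: it assumes a size token with a unit or size keyword is present, that everything after it is the family, and it does not handle forms such as `12px / 1.5` with spaces around the slash.

```javascript
// Matches a font-size token: a number with a unit, or a size keyword.
const FONT_SIZE_RE = /^(\d*\.?\d+(px|pt|em|rem|%|cm|mm|in|pc)|xx-small|x-small|small|medium|large|x-large|xx-large|smaller|larger)$/i;

// Returns an object of longhand properties, or null if the value cannot be parsed.
function expandFontShorthand(value) {
  const tokens = value.trim().split(/\s+/);
  const sizeIndex = tokens.findIndex(function (token) {
    return FONT_SIZE_RE.test(token.split('/')[0]);
  });
  // A size followed by at least one family token is required.
  if (sizeIndex === -1 || sizeIndex === tokens.length - 1) return null;

  const result = {};
  const sizeParts = tokens[sizeIndex].split('/');
  result['font-size'] = sizeParts[0];
  if (sizeParts[1]) result['line-height'] = sizeParts[1];
  result['font-family'] = tokens.slice(sizeIndex + 1).join(' ');

  // Tokens before the size may carry style, variant and weight; unknown ones are ignored.
  tokens.slice(0, sizeIndex).forEach(function (token) {
    const keyword = token.toLowerCase();
    if (keyword === 'italic' || keyword === 'oblique') result['font-style'] = keyword;
    else if (keyword === 'small-caps') result['font-variant'] = keyword;
    else if (/^(bold|bolder|lighter|[1-9]00)$/.test(keyword)) result['font-weight'] = keyword;
  });
  return result;
}
```

Whether this expansion belongs in the normalization step or in a transformation is the open question raised above; the sketch only shows that the split itself is mechanical for well-formed values.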


Transformations must be registered by specific plugins. E.g. currently the basicstyles plugin registers a couple of them. That keeps the code size low.
IIRC, transformations are executed even when the ACF is disabled. This is important, because it means that e.g. a <font> element will be replaced with a <span> even if someone disabled the ACF.
Plugins which may require additional transformations (or content forms):

colorbutton - <font> and <span style=color/bgcolor>,
font - <font> and <span style=font-*>,
justify - from align attr to a style or config.justifyClasses,
indent - from style to class if config.indentClasses are defined.
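As an illustration of the justify case in the list above, a transformation rule could look roughly like the sketch below. The rule object follows the `element` / `left` / `right` shape accepted by CKEDITOR.filter#addTransformations, but `applyRule` and the plain object standing in for a parsed element are assumptions added so the snippet can run without the editor.

```javascript
// Hypothetical rule: move the align attribute of a paragraph to a text-align style.
const alignToStyle = {
  element: 'p',
  // Condition: the rule applies only when the align attribute is present.
  left: function (el) {
    return !!el.attributes.align;
  },
  // Action: set the style (unless one is already present) and drop the attribute.
  right: function (el) {
    if (!el.styles['text-align']) {
      el.styles['text-align'] = el.attributes.align.toLowerCase();
    }
    delete el.attributes.align;
  }
};

// Stand-in for the filter's rule runner, for illustration only.
function applyRule(rule, el) {
  if (el.name === rule.element && rule.left(el)) {
    rule.right(el);
    return true;
  }
  return false;
}

// A plain object standing in for an element as seen by the filter.
const paragraph = { name: 'p', attributes: { align: 'CENTER' }, styles: {} };
applyRule(alignToStyle, paragraph);

// In a plugin, the rule would be registered along the lines of:
// editor.filter.addTransformations( [ [ alignToStyle ] ] );
```

A variant targeting config.justifyClasses would set a class instead of a style in `right`.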


Notes:

Currently, the pastefromword/filter/ directory contains the default.js file, which implements the current default filter. I guess that this file needs to be renamed to e.g. legacy.js, and the new filter will take its place.
Normalization is the most important part. Transformations come second; when implementing them, use the plugins' popularity to set priorities.

Tests


Normalization tests.


The biggest part.
Should test the new PFW filter's behavior. This can be done by firing the editor#paste event with fixture data and checking the resulting HTML. See the existing tests, such as the many PFW tests and e.g. the tests for ticket #9456. Note that config.pasteFilter should not need to be disabled, because the PFW plugin sets pasteEvt.dontFilter to true. This must be verified, because so far we have tended to disable the paste filter in PFW tests and I'm not sure why.
It'll be best to keep fixtures in separate HTML files (again, see #9456). However, unlike in #9456, I would recommend keeping the input HTML (the HTML that we got from MS Word) and the normalized HTML in separate files, because the way they are kept in #9456 has caused confusion in the past.
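One possible way to organize the separate input and normalized files is sketched below. The `<name>.input.html` / `<name>.expected.html` naming convention and the `pairFixtures` helper are hypothetical, not something the existing tests use; the point is only that a test runner can pair the files by name and flag fixtures that miss one half.

```javascript
// Groups fixture file names into input/expected pairs by their base name.
function pairFixtures(fileNames) {
  const byName = new Map();
  fileNames.forEach(function (fileName) {
    const match = /^(.+)\.(input|expected)\.html$/.exec(fileName);
    if (!match) return; // Ignore files that do not follow the assumed convention.
    if (!byName.has(match[1])) {
      byName.set(match[1], { name: match[1], input: null, expected: null });
    }
    byName.get(match[1])[match[2]] = fileName;
  });

  const complete = [];
  const incomplete = [];
  Array.from(byName.keys()).sort().forEach(function (name) {
    const entry = byName.get(name);
    (entry.input && entry.expected ? complete : incomplete).push(entry);
  });
  return { complete: complete, incomplete: incomplete };
}
```

Each complete pair would then drive one test case: paste the input file's content and compare the result with the expected file.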


Transformation tests.


Filtering does not need additional tests (at least, not many) as it's already very extensively tested.
Transformations are also tested, but the tests may need to be extended (both in core/filter.js and in specific plugins such as colorbutton and font).


Integration tests.


We know that normalization works and that filtering and transformations work, but we also need to make sure that all of this works together. Hence, a few tests which fire editor#paste and apply the editor.filter (see TODO) to the resulting HTML need to be implemented.

Compatibility with the old filter

The new filter does not have to be super compatible with the old one. If the old one does something wrong or does something that highly complicates it while not being very beneficial, then the new one can work differently.
To compare the behavior of the new filter to the old one, you can configure two editors (note: they need to be on separate pages, as the filter is exposed through the global CKEDITOR.cleanWord method). In the sample with the new filter, disable the ACF. In the sample with the old filter, set the config.pasteFromWordRemove* options to false and also disable the ACF. Both editors should load the full preset. Such a configuration should give fairly similar results.
Note: The old configuration options will have no effect on the new filter. Those options were predecessors of the ACF, so the rewrite is exactly about removing them. Therefore, the documentation will need to be updated.
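The comparison setup described above could be written down as two config objects, one per page. This is a sketch under assumptions: the variable names and the textarea id in the comment are made up, and it assumes that pasteFromWordRemoveFontStyles and pasteFromWordRemoveStyles are the pasteFromWordRemove* options meant here. Setting allowedContent to true is the standard way to disable the ACF.

```javascript
// Page A: editor with the new filter, ACF disabled.
const newFilterConfig = {
  allowedContent: true
};

// Page B: editor with the old filter, its own cleanup options switched off
// and the ACF disabled as well.
const oldFilterConfig = {
  allowedContent: true,
  pasteFromWordRemoveFontStyles: false,
  pasteFromWordRemoveStyles: false
};

// On each page (full preset build), something like:
// CKEDITOR.replace( 'editor1', newFilterConfig ); // or oldFilterConfig on page B
```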
CKE5


Tests should be easily portable to CKE5, so they must be cleanly structured.
It will be nice to write down discoveries that would not normally be documented in code comments.