
Item5138: Wysiwyg pickaxe destroys non-US-ASCII content

Item Form Data

AppliesTo: Extension
Component: TinyMCEPlugin
Priority: Urgent
CurrentState: Closed
WaitingFor:
TargetRelease: minor
ReleasedIn:


Detail

Testcase:

  • Edit a topic in WYSIWYG mode
  • Use the omega-sign (Insert special character) to insert e.g. a heart
  • Click the pickaxe - the heart is translated to
  • Click WYSIWYG - the heart is translated to \x{2665}
  • Click the pickaxe - the heart is translated to \w{2665}

The testcase can be expanded by flipping between WYSIWYG and raw edit a few times. Ordinary Danish characters suffer from this as well.

-- TWiki:Main/SteffenPoulsen - 17 Dec 2007

I do not see this on mine, so it must depend on how TWiki is set up.

-- TWiki:Main.KennethLavrsen - 17 Dec 2007

Do you have {Site}{CharSet} set to utf8? I can reproduce Steffen's observation with a charset of iso-8859-15, but with utf8 the pickaxe works fine. This is probably just another incarnation of Item4840.

I've created LitterTray.Item5138Demo but unfortunately WYSIWYG gets stuck at "Please wait ... retrieving page from server".

-- TWiki:Main.HaraldJoerg - 17 Dec 2007

I get the same, with utf8 things work ok.

Looks like UTF82SiteCharSet is not being called / is not effective?

-- TWiki:Main.SteffenPoulsen - 17 Dec 2007

I have uploaded a Wireshark trace of two situations.

A topic with the text "Very short topic <><>"

And one where the text instead contains the Greek letter delta.

In both cases it is the trace of clicking the pickaxe and nothing else.

-- TWiki:Main.KennethLavrsen - 18 Dec 2007

On the #twiki_release channel CDot said:

  • the pickaxe works as follows:
  1. The DOM representing the topic content is converted to HTML
  2. The HTML is sent to the server
  3. The server converts the HTML to TML
  4. The server sends the TML back to the browser
  5. The browser injects the TML (text) into the textarea

I compared the data sent from the browser to the server

  1. on a normal TMCE save and
  2. on the pickaxe

I inserted a Greek delta symbol into the topic text. In the first case the delta-sym was sent to the save script url encoded as %26%23948%3B (ampersand, hash, 948, semicolon), and (ampersand, hash, 948, semicolon) was saved in the topic file. In the second case the delta-sym was sent to the html2tml rest script as %25u03B4; which gets decoded as %u03B4 and is then finally displayed in the textarea of rawedit.

The transformation fails on step 1 (of 5). TMCE url encodes the already url encoded data again (the percent sign from the first encoding is represented as %25). Then TMCE takes a different representation of the wide char (delta-sym): u03B4 instead of 948.

As far as I can see we have to fix this bug in TMCE.

-- TWiki:Main.OliverKrueger - 18 Dec 2007
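The two wire formats described in the comment above can be reproduced with a minimal Python sketch. It only illustrates the encodings; it is not TMCE or TWiki code, and the variable names are made up for this example. The `%uXXXX` form is built by hand here, since how TMCE produced it is not shown in the trace.

```python
from urllib.parse import quote, unquote

delta = "\u03b4"  # Greek small letter delta, code point 948 (0x03B4)

# Normal save: the character becomes a numeric HTML entity,
# which is then URL-encoded once.
entity = f"&#{ord(delta)};"
normal_save = quote(entity, safe="")

# Pickaxe: the character is already in a %uXXXX form, and the
# percent sign itself is URL-encoded a second time (% -> %25).
percent_u = f"%u{ord(delta):04X}"
pickaxe = quote(percent_u, safe="")

# One round of URL decoding on the server side.
normal_decoded = unquote(normal_save)
pickaxe_decoded = unquote(pickaxe)

print(normal_save, "->", normal_decoded)
print(pickaxe, "->", pickaxe_decoded)
```

After a single URL decode the normal save yields an entity that HTML understands, while the pickaxe path leaves a literal `%u03B4` string that nothing further downstream decodes.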

I noted that when Michael reverted the patch in Item4946 the pickaxe problem changed behaviour. Not fixed, but better. Instead of destroying all the 8 bit characters when adding a 16 bit one, the 8 bit characters stay OK, but the 16 bit character and any HTML entity of 16 bit type becomes unicoded (%uXXXX) and is then not visible as the character it is supposed to be. This problem only exists when the conversion is done through the pickaxe. When doing a normal save from TMCE these advanced characters are translated into a numerical HTML encoded character and work fine.

-- TWiki:Main.KennethLavrsen - 19 Dec 2007

I discovered that the way I had coded the pickaxe skipped an HTML cleanup step in TMCE that was converting unicode to entities. I enabled that step, and the behaviour is better again, but still not right.

-- TWiki:Main.CrawfordCurrie - 19 Dec 2007

Much better now, yes. We do have people using special versions of " when cut'n'pasting from Word etc, so entities are still important - but it is a much less critical bug now.

-- TWiki:Main.SteffenPoulsen - 19 Dec 2007

Thanks to assistance from some testers, this is finally fixed I think.

CC

There is a new behaviour now - which seems to be correct HTML-wise, but means that Danish characters are now translated into HTML entities:

&aelig;&oslash;&aring; 

This is also on ordinary saves (not using the pickaxe).

-- SteffenPoulsen - 19 Dec 2007

I just came here to report the same.

This is unacceptable. Letters should be seen as the letters that are entered. 8-bit characters must not be translated into HTML entities. It makes it impossible to later edit in Raw mode, and searches break. Searching in non-English text will in practice be totally broken because of this bug.

I implemented this change on our production server and I had to revert back to the code from this morning. A simple search for Søren did not work any longer. We have plenty of these searches for people's names.

There is no way we can release with conversions like these. And this time it is even when you save normally. Not just the pickaxe case.

-- TWiki:Main.KennethLavrsen - 19 Dec 2007
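The search breakage reported above is easy to reproduce in isolation. The following Python sketch assumes a plain substring search over the stored topic text, which is a simplification of how TWiki search works; the sample text is invented for illustration.

```python
import html

# Topic text as stored after the editor converted 8-bit characters to entities.
stored = "Contact: S&oslash;ren"
query = "S\u00f8ren"  # "Søren" as typed into the search box

# Searching the stored text directly misses the name.
found_in_stored = query in stored

# Searching after decoding entities back to characters finds it.
found_after_decode = query in html.unescape(stored)

print(found_in_stored, found_after_decode)
```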

The current behaviour is the default TMCE behaviour on save, when cleanup is enabled. TMCE cleanup is not sophisticated enough to distinguish "nice" and "nasty" characters. So I guess the only option is to post-process the save \o/

-- TWiki:Main.CrawfordCurrie - 19 Dec 2007

Is this the same thing as the terrible translation of left and right quotes (single and double)? This is on rc1, I paste in something unfortunately containing those quotes, and it ends up as:

Michael Brecker, %u201CPilgrimage%u201D (Head%u2019s Up)

Now, I don't know offhand what "%u201C" is, but it's not valid HTML of any sort, and Firefox on Linux, in any case, doesn't render it as anything but that noisy string.

-- TWiki:Main.WhitBlauvelt - 20 Dec 2007
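For reference, `%uXXXX` is the non-standard form produced by JavaScript's legacy `escape()` function for characters above U+00FF, and `%u201C` is the left double quotation mark. Whether TMCE called `escape()` directly is an assumption; the Python sketch below only shows how such strings map back to characters, using a hypothetical helper name.

```python
import re

def decode_percent_u(text):
    """Replace each %uXXXX sequence (four hex digits) with its character."""
    return re.sub(
        r"%u([0-9A-Fa-f]{4})",
        lambda m: chr(int(m.group(1), 16)),
        text,
    )

sample = "Michael Brecker, %u201CPilgrimage%u201D (Head%u2019s Up)"
decoded = decode_percent_u(sample)
print(decoded)
```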

Yes, it's probably the same thing.

I added a post-conversion of entities with 8-bit character codes back to characters. Seems to work (for me, using 8859-1, anyway).

CC

Looks better now.

I noticed one problem, and it may end up being not too bad to resolve.

A commonly used pair of HTML entities is the one for less than and greater than, "<" and ">", which are otherwise interpreted as tags and sometimes make text invisible where you do not want it.

Does TMCE convert these to entities so we have to convert them back or can they be excepted?

I do not know how many other characters could have a problem.

I can confirm that Danish letters are converted fine now.

One thing I never understood.

Before the pickaxe thing was introduced we could save things fine without all this conversion. It was the pickaxe cycle that created the initial problem. How come the pickaxe function "save" was so different from the normal save?

-- TWiki:Main.KennethLavrsen - 20 Dec 2007

Another problem. You cannot write any code inside verbatim anymore without getting html entities converted.

I think the initial and very bad problem started when you enabled Tiny's HTML encoding on save in the first place. There is too much encoding and decoding.

Maybe going back to the very original behaviour and giving up the pickaxe feature is at the end of the day the best option.

-- TWiki:Main.KennethLavrsen - 20 Dec 2007

On our production installation I have now reverted TMCE back to the version in SVN 16042, before these conversion fixes.

It may have a UTF problem but I can edit and save latin characters and I can open and close old topics without seeing verbatim text getting molested by strange conversions.

I recommend going back to before 16058 and fixing the original problem in a totally different way. This conversion of conversion of conversion seems to me bad by concept, and the current performance is not acceptable.

I can live without the pickaxe feature if needed. I'd rather have the normal edit/save work properly than having the pickaxe thing and this conversion problem.

-- TWiki:Main.KennethLavrsen - 21 Dec 2007

"Before the pickaxe thing was introduced we could save things fine without all this conversion" - no. The problem predates the pickaxe. There are a number of other bug reports that illustrate the issues with UTF8 and international characters.

I have actually reduced the number of conversions in the course of this work.

-- TWiki:Main.CrawfordCurrie - 21 Dec 2007

I have restricted the conversions to characters in the range 128..255. But you can't have it both ways; either these 8 bit characters are represented as entities, or they are represented as 8 bit characters. I can't do both, context-sensitive depending on where the character is.

Please, if you have an issue, provide a testcase stating explicitly what the problem is. I don't use 8 bit characters myself, and am dependent on your support for debugging.

CC

Good idea to keep the number of conversions as low as possible. We will probably never have an "idempotent" roundtrip for the encoding/decoding independent of (server/browser) versions, charsets and environments, so we "just" need to find out where the acceptable compromise is.

For non-technical use, the wysiwyg editor does a pretty nice job now I think. But it is a problem that for example entities in verbatim sections are not left alone anymore (e.g. non-breaking space). Of course non-verbatim html might also rely on nbsp, e.g. in tables (these are translated also).

Verbatim sections in a topic should disable the editor if current en/decode behaviour is accepted as OK.

-- SteffenPoulsen - 21 Dec 2007

The most recent checkin limited the conversion to characters in the range 128..255, and symbolic names for characters in that range except nbsp. If there is still an issue with conversion, it is a bug, in which case please provide a testcase.

CC
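The rule described above (convert entities back to characters only for code points 128..255, leaving nbsp alone) can be sketched in Python as follows. This is an illustration under stated assumptions, not the code from the checkin: the function name is hypothetical, and treating the numeric form `&#160;` the same as `&nbsp;` is an assumption.

```python
import re
from html.entities import name2codepoint

NBSP = 160  # left as an entity, per the exception described above

def entities_to_8bit(text):
    """Turn entities for code points 128..255 (except nbsp) into characters."""
    def repl(match):
        body = match.group(1)
        if body.startswith("#"):
            try:
                if body[1:2] in ("x", "X"):
                    codepoint = int(body[2:], 16)
                else:
                    codepoint = int(body[1:])
            except ValueError:
                return match.group(0)  # malformed numeric entity: leave as-is
        else:
            codepoint = name2codepoint.get(body)
        if codepoint is not None and 128 <= codepoint <= 255 and codepoint != NBSP:
            return chr(codepoint)
        return match.group(0)  # outside the range: keep the entity

    return re.sub(r"&(#[xX]?[0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);", repl, text)

danish = entities_to_8bit("&aelig;&oslash;&aring;")
untouched = entities_to_8bit("&lt;b&gt; &nbsp; &#948;")
print(danish, untouched)
```

Entities such as `&lt;`, `&gt;` and `&#948;` fall outside the range and stay as entities, which matches the behaviour reported in the thread for Danish letters versus wide characters.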

I will test this over the next few days.

Steffen's idea to disable TMCE when you have verbatim is not really a solution. If you have a topic with a couple of lines in verbatim it is a bit dramatic to suddenly disable the editor.

When we look at typical use cases, the kind of user base that uses verbatim is typically programmers who use TWiki to document software and for bug tracking.

The typical text between the verbatim tags is code. And the code where we have the problem is where someone tries to document some HTML and wants to write some entities.

So the 128..255 range may not be such a big problem. We may consider these last use cases as getting more exotic, where disabling the editor for that topic altogether may be the only way.

Let us test this latest approach for a while and report problems we find with actual examples.

-- TWiki:Main.KennethLavrsen - 22 Dec 2007

Issue on nbsp opened as Item5165.

-- TWiki:Main.SteffenPoulsen - 22 Dec 2007

ItemTemplate
Summary Wysiwyg pickaxe destroys non-US-ASCII content
ReportedBy TWiki:Main.SteffenPoulsen
Codebase 4.2.0, ~twiki4
SVN Range TWiki-4.3.0, Sat, 15 Dec 2007, build 16003
AppliesTo Extension
Component TinyMCEPlugin
Priority Urgent
CurrentState Closed
WaitingFor

Checkins TWikirev:16058 TWikirev:16068 TWikirev:16069 TWikirev:16073 TWikirev:16074
TargetRelease minor
ReleasedIn

Topic attachments
I Attachment History Action Size Date Who Comment
Texttxt pickaxe.txt r1 manage 22.5 K 2007-12-18 - 00:08 KennethLavrsen  
Topic revision: r32 - 2007-12-22 - SteffenPoulsen
 