Midrange News for the IBM i Community


Posted by: clbirk
unencoding html characters in a field
Published: 18 Aug 2012
Revised: 23 Jan 2013 - 4082 days ago



I receive an archive that contains, for example, &amp; and &#E8; &#x2013; etc., and I need them converted to the appropriate EBCDIC values. Is there a function that unencodes the values in a string that is passed to it? Currently I have written logic to handle the most common ones I get, but...

 

Example:

Jack &amp; Jill   would become Jack & Jill
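For readers who want to see the requested transformation outside RPG, here is a minimal Python sketch (not the RPG answer the thread is about). It assumes the target code page is EBCDIC 037 (`cp037`), which the post does not state, and the helper name `decode_to_ebcdic` is made up for illustration. It decodes character references with the standard-library `html.unescape`, then encodes the result as EBCDIC.

```python
import html


def decode_to_ebcdic(text: str, codepage: str = "cp037") -> bytes:
    """Decode HTML/XML character references, then encode the result as EBCDIC.

    The code page is an assumption; substitute the CCSID your system uses.
    """
    decoded = html.unescape(text)
    # Characters with no mapping in the code page (for example U+2013,
    # the en dash, in cp037) are replaced with '?' instead of raising.
    return decoded.encode(codepage, errors="replace")


ebcdic = decode_to_ebcdic("Jack &amp; Jill")
print(ebcdic.decode("cp037"))
```

Note that a reference such as &#x2013; decodes to a Unicode character that a single-byte EBCDIC code page may not contain, so some substitution policy is needed whichever tool does the decoding.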

 


COMMENTS

Posted by: bobcozzi
Site Admin ****
Chicagoland
Comment on: unencoding html characters in a field
Posted: 11 years 7 months 9 days 20 hours 54 minutes ago

You can use the unescape() function in COZTOOLS.

 

  cleanData = unEscape( 'Jack &amp; Jill');

 

 

Posted by: Ringer
Premium member *
Comment on: unencoding html characters in a field
Posted: 11 years 7 months 7 days 23 hours 47 minutes ago

Try the RPG xml-sax opcode. It unencodes all the entities including &lt; &gt; &#x2B; (hex), &#39; (decimal) and unicode too (2 byte chars).

Chris Ringer

Posted by: bobcozzi
Site Admin ****
Chicagoland
Comment on: unencoding html characters in a field
Posted: 11 years 7 months 7 days 22 hours 48 minutes ago

I wouldn't bother with XML-SAX, I'd use XML-INTO. If you are on v6.1 or later there is almost never a reason to bother with the complexity of XML-SAX. It is there (mostly) for folks who had previous experience with XML parsers in other languages and want to continue to use it.

Don't get me wrong, if you know it, use it. But if you're an RPG programmer in the 99.9999% bracket (meaning you probably are or wouldn't be asking about escape characters) then I'd go with XML-INTO. Here's an example that does what you want it to do. (Note that you did have some errors in your escape characters. For example, &#E8; is wrong; it should be &#xE8; (hex form).) But anyway, here's an RPG solution not using any 3rd-party software at all:

 

D szXML           S            512A   Inz('<myxml> +                 
D                                          <encoded> Bob Cozzi +     
D                                            likes &#x2013; pickles +
D                                               &#xE8;     &#x2013;+ 
D                                                  &#x2013; pickles+ 
D                                          </encoded> +              
D                                        </myxml>')                  
D encoded         S            512A                                  
C                   MOVE      *ON           *INLR                    
 /free                                                               
                                                                     
       xml-into encoded %xml( szXML  : 'path=myxml/encoded');
 /end-free

 

Posted by: DaleB
Premium member *
Reading, PA
Comment on: unencoding html characters in a field
Posted: 11 years 7 months 7 days 4 hours 40 minutes ago

Maybe we need clarification from clbirk, but his topic was that this is html, not xml.

Posted by: Ringer
Premium member *
Comment on: unencoding html characters in a field
Posted: 11 years 7 months 7 days 3 hours 59 minutes ago

As far as I know, HTML and XML share the same entities.

http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references

Chris R

Posted by: DaleB
Premium member *
Reading, PA
Comment on: unencoding html characters in a field
Posted: 11 years 7 months 7 days 2 hours 41 minutes ago

Yes, but using RPG XML-xxx opcodes probably won't work on HTML (though I've never tried it myself, so can't say for sure).

Posted by: Ringer
Premium member *
Comment on: unencoding html characters in a field
Posted: 11 years 7 months 7 days 2 hours 12 minutes ago

Well formed HTML can be parsed (which was the point of XHTML a few years ago), such as every <p> having a </p> and <br> instead being <br />, etc. If clbirk already has the extracted string, then wrap any <tag> around it and parse it as XML.

The XHTML !DOCTYPE spec died because most HTML web pages are not well formed, and strict XML parsing would break them (they would not parse correctly).

http://en.wikipedia.org/wiki/XHTML

Chris R
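The "wrap any tag around it and parse it as XML" idea can be sketched in Python with the standard-library `xml.etree.ElementTree` (an illustration only, not the RPG opcodes discussed here; the wrapper element `<x>` and the function name `decode_via_xml` are arbitrary choices). One caveat the sketch makes visible: XML itself predefines only five named entities (`amp`, `lt`, `gt`, `apos`, `quot`), so an HTML-only name such as `&nbsp;` is rejected by a plain XML parser unless a DTD declares it, while numeric references work either way.

```python
import xml.etree.ElementTree as ET


def decode_via_xml(fragment: str) -> str:
    """Wrap an already-escaped text fragment in a dummy element and
    let the XML parser resolve the character references."""
    root = ET.fromstring("<x>" + fragment + "</x>")
    # itertext() also collects text nested inside child elements.
    return "".join(root.itertext())


print(decode_via_xml("Jack &amp; Jill"))
print(decode_via_xml("likes &#x2013; pickles &#xE8;"))

try:
    decode_via_xml("a&nbsp;b")  # HTML-only named entity, undefined in XML
except ET.ParseError as err:
    print("not parseable as XML:", err)
```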

Posted by: bobcozzi
Site Admin ****
Chicagoland
Comment on: unencoding html characters in a field
Posted: 11 years 7 months 6 days 19 hours 20 minutes ago

I actually tested my example (above) and it works. It doesn't care what the content is; if it is HTML, it just has to be well-formed, XML-like HTML. But it sounded like he just wanted to un-escape the URL-encoded data stream. To do that, he can smash it between an XML tag and let XML-INTO have at it.

Posted by: bobcozzi
Site Admin ****
Chicagoland
Comment on: unencoding html characters in a field
Posted: 11 years 7 months 5 days 17 hours 2 minutes ago
Edited: Wed, 22 Aug, 2012 at 19:35:12 (4236 days ago)

Further testing shows that XML-INTO (and I assume XML-SAX) will not un-escape (decode) a URL-encoded string. Therefore, URL encoding such as %23, the plus sign, and so on is not decoded by XML-INTO. Only stuff embedded in XML using full escape codes (not URL-encoded shortcuts) will be decoded.
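The distinction between the two encodings can be shown with a short Python sketch (illustrative only; the sample strings are invented). URL percent-encoding and XML/HTML character references are separate schemes, and each decoder leaves the other's notation untouched, which matches the behavior reported for XML-INTO.

```python
import html
from urllib.parse import unquote_plus

url_encoded = "Jack+%26+Jill+%231"    # percent-encoding: %26 is '&', %23 is '#'
entity_encoded = "Jack &amp; Jill"    # XML/HTML character reference

# An entity decoder does nothing to percent-encoded text...
print(html.unescape(url_encoded))
# ...and a URL decoder does nothing to entity-encoded text.
print(unquote_plus(entity_encoded))

# Each decoder applied to its own format:
print(unquote_plus(url_encoded))
print(html.unescape(entity_encoded))
```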

Posted by: clbirk
Premium member *
Comment on: unencoding html characters in a field
Posted: 11 years 7 months 5 days 4 hours 29 minutes ago

It is xml (if you can call it that). It is an archive that I get with orders and the person who "designed" the "xml", well I shall byte my tongue.

 

These are typically the &  ; sort of things, and of course many like spaces are not encoded.

<info1>Jack &amp; Jill</info1>

 

Sometimes they double encode it, like:

<info1>Jack &amp;amp; Jill</info1>  They are not smart enough to realize they are calling the encoding function twice and I haven't been able to get that fixed by the outside developer.
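One way to cope with inconsistent double encoding is to decode repeatedly until the text stops changing. The Python sketch below illustrates the idea (the function name `unescape_fully` and the pass limit are made up for the example, and this is not something the thread's RPG tools do for you). The trade-off is that text which legitimately contains an escaped reference, such as `&amp;lt;` meant to display as `&lt;`, will be decoded one level too far, so this only suits data where that never occurs.

```python
import html


def unescape_fully(text: str, max_passes: int = 5) -> str:
    """Apply html.unescape until the text stops changing.

    max_passes is an arbitrary safety limit for this sketch.
    """
    for _ in range(max_passes):
        decoded = html.unescape(text)
        if decoded == text:
            break  # nothing left to decode
        text = decoded
    return text


print(unescape_fully("Jack &amp; Jill"))
print(unescape_fully("Jack &amp;amp; Jill"))
```

Both calls yield the same result, so the caller does not need to know how many times the sender ran the encoding function.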

 

Thanks for the information.