XML support #22

olivergeorge · 2014-12-23T02:36:49Z

I've done a little work to make hickory support basic XML parsing and rendering. I'm wondering if this is something you think could be added to the main hickory repo.

My goal was to parse an XML document, make changes to it and then serialise it back to XML. Where possible I wanted the rendered XML to be as close to the original as practical to simplify debugging.

The changes needed were all quite minor really:

Use "application/xml" when calling js/DOMParser
Respect case of tag names
Render with preamble
Render a closed tag if no children are present
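
The rendering side of the list above can be sketched in plain JavaScript. This is only an illustration under stated assumptions, not the code in this PR: the node shape (`type`, `tag`, `attrs`, `content`) is a plain-object stand-in for a hickory element map, the function names `escapeText`, `escapeAttr`, `renderNode` and `renderDocument` are hypothetical, and the preamble is hard-coded rather than copied from the source document.

```javascript
// Hypothetical plain-object stand-in for a hickory element node:
//   { type: 'element', tag: 'Order', attrs: { Id: '7' }, content: [nodes or strings] }

// Escape characters that are not allowed verbatim in XML text.
function escapeText(s) {
  return s.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

// Attribute values additionally need their double quotes escaped.
function escapeAttr(s) {
  return escapeText(s).replace(/"/g, '&quot;');
}

function renderNode(node) {
  if (typeof node === 'string') return escapeText(node);
  const attrs = Object.entries(node.attrs || {})
    .map(([name, value]) => ` ${name}="${escapeAttr(String(value))}"`)
    .join('');
  const children = node.content || [];
  // Tag names are emitted exactly as stored: no lower-casing.
  if (children.length === 0) return `<${node.tag}${attrs}/>`;
  return `<${node.tag}${attrs}>${children.map(renderNode).join('')}</${node.tag}>`;
}

// Assumption: a fixed XML 1.0 / UTF-8 preamble is acceptable.
function renderDocument(root) {
  return '<?xml version="1.0" encoding="UTF-8"?>\n' + renderNode(root);
}
```

For example, a root `Order` element with attribute `Id="7"`, an empty `lineItem` child and the text `a < b` renders as the preamble followed by `<Order Id="7"><lineItem/>a &lt; b</Order>`: case is preserved and the childless element is self-closed.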

I cut and pasted a lot to keep things simple (for me) but I think you will have ideas on a more integrated approach.

I'm conscious that this code relies on js/DOMParser, which might be more restrictive than hickory is otherwise. In particular it's an IE9+ approach.

davidsantiago · 2014-12-23T02:49:21Z

Interesting!!! That's really cool. My main hesitation here is, I'll be honest, I'm not super up on the nitty gritty of XML and particularly how XML differs from HTML. I only had HTML in mind when I originally designed Hickory. If the data model is that compatible, that would be awesome.

We could surely implement some XML parser for the Clojure version. I'm not a ClojureScript user, so I am not sure what the XML parsing situation is in JavaScript land. IE9+ doesn't sound great, but I defer to @jeluard on these issues. He selected an HTML parser for the CLJS version, and although none of the options were superb, I do think he made the best compromise possible, so I would think staying as close to that as possible would be best.

davidsantiago · 2014-12-23T02:53:28Z

A JS-expert friend of mine just said that IE9+ is "pretty reasonable in the modern era."

olivergeorge · 2014-12-23T06:21:10Z

I think one of the fiddly bits is that the code base tries to do case
insensitive searching by calling string/lower-case before tests. That
makes it a little messier. Perhaps there's a way to test "case
sensitivity" and handle both cases.
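
One way to "handle both cases" is to make the normalisation step a parameter of the selector instead of always lower-casing. The sketch below shows the idea in plain JavaScript; `tagSelector` and its `caseSensitive` option are hypothetical names, and the node shape is a plain-object stand-in for a hickory element map rather than anything in the hickory code base.

```javascript
// Build a predicate that matches element nodes by tag name.
// With caseSensitive: false (HTML-style) both sides are lower-cased
// before comparing; with caseSensitive: true (XML-style) they are
// compared exactly as written.
function tagSelector(name, { caseSensitive = false } = {}) {
  const normalize = caseSensitive ? (s) => s : (s) => s.toLowerCase();
  const wanted = normalize(name);
  return (node) =>
    node !== null &&
    typeof node === 'object' &&
    node.type === 'element' &&
    normalize(node.tag) === wanted;
}
```

With this shape, `tagSelector('lineitem')` matches a node tagged `lineItem`, while `tagSelector('lineitem', { caseSensitive: true })` does not, so the HTML and XML selectors could share one implementation.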

olivergeorge · 2014-12-23T06:23:51Z

Yeah, I'm less worried about that. As a baseline it would be quite broad
support and there are 101 tricks to expand reach to other browsers (with a
bit of patience).

davidsantiago · 2014-12-23T07:03:18Z

Hm, yeah. It's supposed to be case insensitive for HTML. I'm not sure what to do there. Rather than trying to squeeze a round peg through a square hole, maybe we should introduce a new namespace, like hickory.select-xml, with selectors that are adapted specifically for the XML usage case. What would you think of that?

I think the thing is some people want to use it outside of browsers.

olivergeorge · 2014-12-23T07:37:16Z

Sounds fine to me. I expect it will be able to reuse most things.

jeluard · 2014-12-23T12:41:44Z

IE9+ sounds good enough, and the code is pretty similar to the existing code, so it all sounds good.

I agree there will probably be people expecting this feature to work outside the browser, including on the Node platform. I am also wondering whether it would make more sense to add ClojureScript support to data.xml rather than XML support to hickory.

olivergeorge · 2014-12-23T22:58:53Z

That sounds strategically sane.

I imagine building on Google Closure's goog.dom.xml namespace would avoid
some browser specific considerations. I've never thought about it before,
does Google Closure play nice on node?

Either way, for my current project I think I'll stick with my current
approach. I won't be offended if you think it's better not to expand
hickory to cover the slightly stricter approach required to play nice with
XML documents.

cheers, Oliver

olivergeorge · 2014-12-23T23:03:28Z

Quick update regarding node support. Looks like "no browser = no dom = no
goog.dom.xml"

https://code.google.com/p/closure-library/wiki/NodeJS

Does Closure-Library-on-Node.js behave differently than Closure Library in
the browser?
...
Obviously, any libraries in Closure Library that use the DOM will not work
on NodeJS.

Seems sensible to switch in the loadXml call though. I'll patch to use
this now.

goog.dom.xml.loadXml = function(xml) {
  if (typeof DOMParser != 'undefined') {
    return new DOMParser().parseFromString(xml, 'application/xml');
  } else if (typeof ActiveXObject != 'undefined') {
    var doc = goog.dom.xml.createMsXmlDocument_();
    doc.loadXML(xml);
    return doc;
  }
  throw Error('Your browser does not support loading xml documents');
};
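
Since the snippet above depends on a global DOMParser, one possible way to keep the same feature-detection idea usable outside the browser is to let the caller pass in the environment to probe. The following is a minimal sketch under that assumption, not Closure Library code: the `env` parameter and the error message are hypothetical, and the ActiveXObject branch is left out for brevity.

```javascript
// Parse an XML string using whatever DOMParser the given environment
// provides. In a browser the default (globalThis) is enough; elsewhere
// the caller can supply an object exposing a compatible DOMParser.
function loadXml(xml, env = globalThis) {
  if (typeof env.DOMParser !== 'undefined') {
    return new env.DOMParser().parseFromString(xml, 'application/xml');
  }
  throw new Error('No DOMParser available; supply a DOM implementation for this platform');
}
```

In a browser, `loadXml('<a/>')` behaves like the first branch of the Closure function. On a platform without a DOM, the call fails with a clear error unless a parser is injected, for example `loadXml('<a/>', { DOMParser: SomeParser })`, where `SomeParser` stands for any DOMParser-compatible constructor.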

olivergeorge · 2014-12-23T23:31:18Z

Okay, those two changes are added.

davidsantiago · 2014-12-24T21:29:45Z

Sorry guys, was traveling yesterday.

@jeluard Good question about data.xml. I assume Oliver had a reason for doing this work, though. Although I don't have a specific use case in mind for why a library that can work with both XML and HTML is useful, it seems like something that could be useful.

But @olivergeorge, you said something above about doing this requiring "a slightly stricter approach." What did you mean by that? I didn't understand there to be any changes to the semantics of HTML handling in any of these patches. Also, I assumed that more selectors would need modification to be correct for XML semantics. There's really only just the one selector that should be case-sensitive? If it's just that one, maybe we should just do a namespace like hickory.xml and stick all of the XML related functions together in there?

lvh · 2015-09-22T23:52:27Z

I also would like XML support. I'm currently doing it like this:

(defn parse-xml
  "Parse an XML string into DOM objects."
  [s]
  (.parseFromString (js/DOMParser.) s "text/xml"))

I'm using as-hiccup now and it seems to work fine. I definitely understand why you'd want to use goog.dom.xml though.

Perhaps this just needs a few tests and it's good to go?

lvh · 2015-09-29T20:17:42Z

So, the thing that I have there doesn't really work because my code doesn't fix as-hiccup making all tags lowercase, which this PR handles.

lvh · 2015-09-29T22:19:13Z

I just backported this, and it worked like a charm. I also added:

(def as-hiccup-xml
  (comp hickory-to-hiccup as-hickory-xml))

What would it take to get this merged? Should I add unit tests or something?

lvh · 2015-09-29T22:49:53Z

I wonder if it might make sense to use the browser DOM APIs to reproduce a document and then serialize it.

port19x · 2023-04-11T09:15:02Z

I believe data.xml is the proper tool to parse XML, and I will soon-ish document that in a prominent place, in the hope that people seeking an XML parser will find an XML parser and not "misuse" an HTML parser for that use case.

> IE9+ sounds good enough, and the code is pretty similar to the existing code, so it all sounds good.
>
> I agree there will probably be people expecting this feature to work outside the browser, including on the Node platform. I am also wondering whether it would make more sense to add ClojureScript support to data.xml rather than XML support to hickory.

The main considerations here are scope creep and expertise.

Expertise: I write web-scraping programs and am familiar with the ins and outs of network requests and HTML, not XML.
Scope Creep: Explicit XML support would nearly double the scope of the project and take resources away from enhancing hickory's quality as an HTML parser.

I'll try my best to make hickory extensible and applicable to other hiccup producing libraries (#24)
I won't support XML (#29) (#63)

olivergeorge added 3 commits December 23, 2014 11:18

Add parse-xml and as-hickory-xml functions

446c604

Add hickory-to-xml

6c95422

Render empty XML tags based on presence of content

0642b19

olivergeorge and others added 2 commits December 23, 2014 14:14

Update select.cljx

1a39238

Make case insensitive attr selector work even when actual attributes may have case.

Revert "Update select.cljx"

736f1bc

This reverts commit 1a39238. Easier to start fresh I think.

olivergeorge added 2 commits December 24, 2014 10:09

Use goog.dom.xml to parse xml with greater browser compatibility

1eca312

add hickory.select-xml namespace with case insensitive attr selector

5511992

gersak mentioned this pull request May 6, 2016

Case sensitive tags/attributes #37

Closed

port19x added the "status: dubious" and "category: governance" labels Apr 11, 2023

port19x closed this Apr 11, 2023

port19x mentioned this pull request Apr 11, 2023

Explicit namespace support for as-hickory for XML #29

Closed
