Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

XML support #22

Closed
wants to merge 7 commits into from
Closed

XML support #22

wants to merge 7 commits into from

Conversation

olivergeorge
Copy link

I've done a little work to make hickory support basic XML parsing and rendering. I'm wondering if this is something you think could be added to the main hickory repo.

My goal was parse an XML document, make changes to it and then serialise the back to XML. Where possible I wanted the rendered XML to be as close to the original as practical to simplify debugging.

The changes needed were all quite minor really:

  • Use "application/xml" when calling js/DOMParser
  • Respect case of tag names
  • Render with preamble
  • Render a closed tag if no children are present

I cut and pasted a lot to keep things simple (for me) but I think you will have ideas on a more integrated approach.

I'm concious that this code relies on js/DOMParser which might be more restrictive than hickory is otherwise. In particular it's an IE9+ approach.

@davidsantiago
Copy link
Collaborator

Interesting!!! That's really cool. My main hesitation here is, I'll be honest, I'm not super up on the nitty gritty of XML and particularly how XML differs from HTML. I only had HTML in mind when I originally designed Hickory. If the data model is that compatible, that would be awesome.

We could surely implement some XML parser for the Clojure version. I'm not a Clojurescript user, so I am not sure what the XML parsing situation is in Javascript land. IE9+ doesn't sound great, but I defer to @jeluard on these issues. He has selected an HTML parser for the CLJS version, and although none of those options were superb, I do think he made the best compromise possible, so I would think trying to be as close to there would be best.

@davidsantiago
Copy link
Collaborator

A JS-expert friend of mine just said that IE9+ is "pretty reasonable in the modern era."

olivergeorge and others added 2 commits December 23, 2014 14:14
Make case insensitive attr selector work even when actual attributes may have case.
This reverts commit 1a39238.

Easier to start fresh I think.
@olivergeorge
Copy link
Author

I think one of the fiddly bits is that the code base tries to do case
insensitive searching by calling string/lower-case before tests. That
makes it a little messier. Perhaps there's a way to test "case
sensitivity" and handle both cases.

On 23 December 2014 at 13:49, David Santiago [email protected]
wrote:

Interesting!!! That's really cool. My main hesitation here is, I'll be
honest, I'm not super up on the nitty gritty of XML and particularly how
XML differs from HTML. I only had HTML in mind when I originally designed
Hickory. If the data model is that compatible, that would be awesome.

We could surely implement some XML parser for the Clojure version. I'm not
a Clojurescript user, so I am not sure what the XML parsing situation is in
Javascript land. IE9+ doesn't sound great, but I defer to @jeluard
https://github.com/jeluard on these issues. He has selected an HTML
parser for the CLJS version, and although none of those options were
superb, I do think he made the best compromise possible, so I would think
trying to be as close to there would be best.


Reply to this email directly or view it on GitHub
#22 (comment).

@olivergeorge
Copy link
Author

Yeah, less worried about that. As a base line it would be quite broad
support and there are 101 tricks to expand reach to other browsers (with a
bit of patience).

On 23 December 2014 at 13:53, David Santiago [email protected]
wrote:

A JS-expert friend of mine just said that IE9+ is "pretty reasonable in
the modern era."


Reply to this email directly or view it on GitHub
#22 (comment).

@davidsantiago
Copy link
Collaborator

Hm, yeah. It's supposed to be case insensitive for HTML. I'm not sure what to do there. Rather than trying to squeeze a round peg through a square hole, maybe we should introduce a new namespace, like hickory.select-xml, with selectors that are adapted specifically for the XML usage case. What would you think of that?

I think the thing is some people want to use it outside of browsers.

@olivergeorge
Copy link
Author

Sounds fine to me. I expect it will be able to reuse most things.

On 23 December 2014 at 18:03, David Santiago [email protected]
wrote:

Hm, yeah. It's supposed to be case insensitive for HTML. I'm not sure what
to do there. Rather than trying to squeeze a round peg through a square
hole, maybe we should introduce a new namespace, like hickory.select-xml,
with selectors that are adapted specifically for the XML usage case. What
would you think of that?

I think the thing is some people want to use it outside of browsers.


Reply to this email directly or view it on GitHub
#22 (comment).

@jeluard
Copy link
Collaborator

jeluard commented Dec 23, 2014

IE9+ sounds good enough and the code is pretty similar to the existing so it all sounds good.

I agree there probably will be people expecting this feature to work outside browser, including on node platform. Also I am wondering if it would not make more sense to add ClojureScript support to data.xml rather than XML support to hickory.

@olivergeorge
Copy link
Author

That sounds strategically sane.

I imagine building on Google Closure's goog.dom.xml namespace would avoid
some browser specific considerations. I've never thought about it before,
does Google Closure play nice on node?

Either way, for my current project I think I'll stick with my current
approach. I won't be offended if you think it's better not to expand
hickory to cover the slightly stricter approach required to play nice with
XML documents.

cheers, Oliver

On 23 December 2014 at 23:41, Julien Eluard [email protected]
wrote:

IE9+ sounds good enough and the code is pretty similar to the existing so
it all sounds good.

I agree there probably will be people expecting this feature to work
outside browser, including on node platform. Also I am wondering if it
would not make more sense to add ClojureScript support to data.xml
https://github.com/clojure/data.xml rather than XML support to hickory.


Reply to this email directly or view it on GitHub
#22 (comment).

@olivergeorge
Copy link
Author

Quick update regarding node support. Looks like "no browser = no dom = no
goog.dom.xml"

https://code.google.com/p/closure-library/wiki/NodeJS

Does Closure-Library-on-Node.js behave differently than Closure Library in
the browser?

...
Obviously, any libraries in Closure Library that use the DOM will not work
on NodeJS.

Seems sensible to switch in the loadXml call though. I'll patch to use
this now.

goog.dom.xml.loadXml = function(xml) {
  if (typeof DOMParser != 'undefined') {
    return new DOMParser().parseFromString(xml, 'application/xml');
  } else if (typeof ActiveXObject != 'undefined') {
   var doc = goog.dom.xml.createMsXmlDocument_();
    doc.loadXML(xml);
   return doc;
  }  throw Error('Your browser does not support loading xml documents');
};

@olivergeorge
Copy link
Author

Okay, those two changes are added.

@davidsantiago
Copy link
Collaborator

Sorry guys, was traveling yesterday.

@jeluard Good question about data.xml. I assume Oliver had a reason for doing this work, though. Although I don't have a specific use case in mind for why a library that can work with both XML and HTML is useful, it seems like something that could be useful.

But @olivergeorge, you said something above about doing this requiring "a slightly stricter approach." What did you mean by that? I didn't understand there to be any changes to the semantics of HTML handling in any of these patches. Also, I assumed that more selectors would need modification to be correct for XML semantics. There's really only just the one selector that should be case-sensitive? If it's just that one, maybe we should just do a namespace like hickory.xml and stick all of the XML related functions together in there?

@lvh
Copy link

lvh commented Sep 22, 2015

I also would like XML support. I'm currently doing it like this:

(defn parse-xml
  "Parse an XML string into DOM objects."
  [s]
  (.parseFromString (js/DOMParser.) s "text/xml"))

I'm using as-hiccup now and it seems to work fine. I definitely understand why you'd want to use goog.dom.xml though.

Perhaps this just needs a few tests and it's good to go?

@lvh
Copy link

lvh commented Sep 29, 2015

So, the thing that I have there doesn't really work because my code doesn't fix as-hiccup making all tags lowercase, which this PR handles.

@lvh
Copy link

lvh commented Sep 29, 2015

I just backported this, and it worked like a charm. I also added:

(def as-hiccup-xml
  (comp hickory-to-hiccup as-hickory-xml))

What would it take to get this merged? Should I add unit tests or something?

@lvh
Copy link

lvh commented Sep 29, 2015

I wonder if it might make sense to use the browser DOM APIs to reproduce a document and then serialize it.

@port19x port19x added status: dubious Where do I even begin? category: governance Sustainable development labels Apr 11, 2023
@port19x
Copy link
Contributor

port19x commented Apr 11, 2023

I believe data.xml is the proper tool to parse XML and will soon-ish document it in a prominent place in hopes that people seeking an XML parser will find an XML parser and not "misuse" an HTML parser for that usecase.

IE9+ sounds good enough and the code is pretty similar to the existing so it all sounds good.

I agree there probably will be people expecting this feature to work outside browser, including on node platform. Also I am wondering if it would not make more sense to add ClojureScript support to data.xml rather than XML support to hickory.

The main considerations here are scope creep and expertise.

Expertise: I write web-scraping programs and am familiar with the ins and outs of network requests and HTML, not XML.
Scope Creep: Explicit XML support would near double the scope of the project and take resources away from enhancing hickory's quality as an HTML parser.

I'll try my best to make hickory extensible and applicable to other hiccup producing libraries (#24)
I won't support XML (#29) (#63)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
category: governance Sustainable development status: dubious Where do I even begin?
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants