Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX: XML Entity file handling #701

Draft
wants to merge 2 commits into
base: integration
Choose a base branch
from

Conversation

carlwilson
Copy link
Member

@carlwilson carlwilson commented Jan 6, 2022

Fixes #700 output is now a more informative:

  Status: Not well-formed
  ErrorMessage: File not found: /home/cfw/proj/opf/jhove/article.dtd (No such file or directory)
   ID: XML-HUL-10
  MIMEtype: text/xml

- moved the `Jhove` and `JhoveView` classes from default packageas Java 11 tooling complained;
- updated `jhove-apps` POM to add new `Jhove` class package to manifest line;
- updated windows batch and linux shell start scripts to add new `Jhove` class package to manifest line; and
- updated Maven dependencies for `izpack -> 5.1.3` and `jacoco -> 0.8.7`for Java 11 compatibility.
- fixed misfiring entity processor; and
- fixed package warnings.
@carlwilson carlwilson added bug A product defect that needs fixing P1 High priority issues to be scheduled in the upcoming release labels Jan 6, 2022
@carlwilson carlwilson self-assigned this Jan 6, 2022
@carlwilson carlwilson marked this pull request as draft January 6, 2022 23:30
@pwinckles
Copy link
Contributor

Observation, if you're going to use an automated code formatter, it would make sense to run it on the entire project to create a baseline. This would allow for cleaner PRs and ease the burden of resolving spurious merge conflicts.

ent = new InputSource(conn.getInputStream());
}

ent = new InputSource(obj.openStream());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change here means that the code will no longer follow cross-protocol redirects (eg http->https). By default, HttpUrlConnection does follow redirects. However, it does not follow redirects across protocols.

@pwinckles
Copy link
Contributor

I've spent a fair bit of time looking at this function the past couple of days. Here is the modified implementation that I arrived at: https://github.com/UW-Madison-Library/jhove/blob/48d083b2394fc8b9426d9567744eff519284b830/jhove-modules/xml-hul/src/main/java/edu/harvard/hul/ois/jhove/module/xml/XmlModuleHandler.java#L332

Things to note:

  1. The code in this PR removes the protocol redirect handling. This means that it will no longer be able to resolve schemas like https://www.openarchives.org/OAI/2.0/oai_dc.xsd, which imports http://dublincore.org/schemas/xmls/simpledc20021212.xsd and is in turn redirected to https://dublincore.org/schemas/xmls/simpledc20021212.xsd
  2. The original code had what I thought was a bug in that after making the initial HTTP request it did not set ent, instead it only set ent if the request encountered a redirect. However, this turned out not to be a bug. The sole purpose of the code was actually to resolve the redirect. If resolveEntity() returns null, the calling code will attempt to resolve the systemId itself, which it is able to do so long as there are no redirects.
  3. Finally, resolveEntity() should set the public id and system id on the returned InputSource as these values are used to resolve relative child entities.

@david-russo
Copy link
Member

david-russo commented Feb 4, 2022

Hi Peter, thanks for your super helpful review. This all sounds a little familiar, since I did a review of the XML module code a year or two ago, but can barely remember the details now. I fixed a few bugs, and did a lot of house keeping. The PR (#634) is still awaiting integration, but you might find it of interest, if only to prevent duplicating efforts in places.

A fix for this issue which still retains the old redirection logic was included in commit d223a28 of that PR. I'm not sure if it addresses all of your concerns, since it's been too long for me to remember, but let me know if it doesn't, and maybe I can amend that older PR.

@pwinckles
Copy link
Contributor

@david-russo Yeah, it looks like your old PR does address the same issue as is addressed in this PR. It does still have the bug when resolving relative entities but that's a different problem.

I haven't looked at your PR in detail, but it looks like there's some good stuff in there, including a few changes that I have also made. It would be really nice if PRs didn't languish here for years, forcing us to duplicate each other's work and creating an ever growing merge problem for PR authors if/when PRs ever start to be merged again.

@david-russo
Copy link
Member

Cheers. Does this comment relate to the relative entity resolution issue you're describing? If so, from what I remember/wrote, it sounds like there's a hacky solution in the module at the moment, but a proper fix may require a change to what information JHOVE passes its modules.

@pwinckles
Copy link
Contributor

pwinckles commented Feb 4, 2022

Yes, but it can be addressed in some (perhaps not all) cases, simply by setting the system id on the returned InputSource. See here

I'll create a ticket for it with a demonstration of the problem later today.

[Edit] That is for entities that are relative a parent. This fix has no affect on relative entities without parents.

@carlwilson carlwilson added this to the JHOVE 1.28 milestone Nov 23, 2022
@carlwilson carlwilson removed this from the JHOVE 1.28 milestone Jun 21, 2023
@carlwilson carlwilson modified the milestones: JHOVE 1.32, JHOVE 1.34 Aug 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug A product defect that needs fixing P1 High priority issues to be scheduled in the upcoming release
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

Missing XML DTD/Entity file resource causes ClassCastException
3 participants