Wednesday, February 13, 2008

Loading an XML file into Java objects with Castor

A common problem: you have XML files in a semi-fixed format. There is no document type definition (DTD) and no XML schema definition (XSD), just some "agreed-on" XML structure. You want to load a bunch of those files into Java and work with them, ideally by mapping them to Java classes or beans. Castor allows this; see http://www.castor.org/.

Install Castor 1.2. You'll need the complete version with source code; otherwise, it seems to be missing some of the dependencies, e.g., velocity-1.5.jar. The scripts to run are found in <CASTOR>/bin as .sh and/or .bat scripts, e.g., classpath.bat/.sh, which is used in the following.

Step 1: Generating a schema definition

Castor is able to generate an xsd file from an XML instance file. This schema may be neither complete nor correct, yet it is a good starting point. You will possibly have to patch it, either to remove nodes that have no well-known structure or to add others that don't appear in the selected instance file.
classpath.bat 
  org.exolab.castor.xml.schema.util.XMLInstance2Schema 
  input.xml [output.xsd]
If no output file is given, the schema is written to standard out. Alternatively, you can use the class from your own code:
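import java.io.FileOutputStream;
import java.io.PrintWriter;
import java.io.Writer;
import org.exolab.castor.xml.schema.Schema;
import org.exolab.castor.xml.schema.util.XMLInstance2Schema;
import org.exolab.castor.xml.schema.writer.SchemaWriter;

// Derive a schema from a single instance file.
XMLInstance2Schema instance2Schema = new XMLInstance2Schema();
Schema schema = instance2Schema.
  createSchema("input.xml");
System.out.println(schema);

// Write the schema to a file; copied from XMLInstance2Schema#main
Writer dstWriter = new PrintWriter(
  new FileOutputStream("output.xsd"), true);
SchemaWriter schemaWriter = new SchemaWriter(dstWriter);
schemaWriter.write(schema);
dstWriter.flush();
dstWriter.close();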
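Since the generated schema only reflects the one instance file it was derived from, it can help to compare the element structure of several instance files before patching. The following is a minimal sketch using only the JDK's DOM parser, not Castor; the class name StructureSummary and the <note> element are made up for illustration, and the other element names are guessed from the classes used in Step 4.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Castor.
class StructureSummary {

    /** Maps each element name to the set of child element names seen under it. */
    static Map<String, Set<String>> summarize(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            Map<String, Set<String>> result = new TreeMap<>();
            collect(doc.getDocumentElement(), result);
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /** Records the child element names of this element, then recurses into them. */
    private static void collect(Element element, Map<String, Set<String>> result) {
        Set<String> children =
            result.computeIfAbsent(element.getTagName(), k -> new TreeSet<>());
        for (Node n = element.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element) {
                Element child = (Element) n;
                children.add(child.getTagName());
                collect(child, result);
            }
        }
    }

    public static void main(String[] args) {
        // Two instance files that disagree about what a subItem contains.
        String a = "<top><subItem someId='1'><p>text</p></subItem></top>";
        String b = "<top><subItem someId='2'><note/></subItem></top>";
        System.out.println(summarize(a));
        System.out.println(summarize(b));
    }
}
```

Elements that show up in one summary but not in the other are candidates for manual additions to the generated schema.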

Step 2: Patch the generated schema

Often, changes to the generated schema file are necessary. The input.xml may, for example, contain a set of nodes that are not really well agreed on, that change regularly, or that differ a lot between instance files. In our case, it was some HTML-formatted text that was just barely made XML-compatible by making sure each <p> had a matching </p> ... not even XHTML, I'd say. So, we replaced a complex node structure
{sequence}
 {element name="p"}
  {complexType}
   {all}
    {element name="i"}
     {complexType mixed="true"}
      {sequence}
      [...]
with a simple
{element name="p" type="xsd:anyType" /}
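To check that a patched schema still accepts the instance files, the JDK's own javax.xml.validation API can be used. This is only a sketch under assumptions: the schema below is a made-up minimal one (a top element holding p elements of type xsd:anyType), not the schema generated above, and the class name AnyTypeCheck is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of Castor.
class AnyTypeCheck {

    // Made-up minimal schema: <top> holds one or more <p> of type xsd:anyType.
    static final String XSD =
        "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'>"
      + "<xsd:element name='top'><xsd:complexType><xsd:sequence>"
      + "<xsd:element name='p' type='xsd:anyType' maxOccurs='unbounded'/>"
      + "</xsd:sequence></xsd:complexType></xsd:element>"
      + "</xsd:schema>";

    /** Compiles the schema text; fails loudly if the schema itself is broken. */
    static Schema compile(String xsd) {
        try {
            return SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(xsd)));
        } catch (SAXException e) {
            throw new IllegalStateException("Schema does not compile", e);
        }
    }

    /** Returns true if the XML text validates against the schema above. */
    static boolean isValid(String xml) {
        try {
            compile(XSD).newValidator()
                .validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // validation error or malformed XML
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Arbitrary nested markup inside <p> is accepted by xsd:anyType.
        System.out.println(isValid(
            "<top><p>Some <i>italic</i> and <b>bold <i>nested</i></b> text</p></top>"));
        // An element the schema does not know at this position is rejected.
        System.out.println(isValid("<top><q>text</q></top>"));
    }
}
```

The point of xsd:anyType here is that the content of p is no longer checked, so loosely structured markup cannot break validation, while the structure around it is still enforced.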

Step 3: Generate the Java classes

In the next step, Castor generates Java classes from the schema definition. Again, this can be done either with the sourceGen.bat provided with Castor, or programmatically via org.exolab.castor.builder.SourceGeneratorMain.main(new String[] {param1, param2, ...}).
sourceGen.bat -i output-patched.xsd 
  -package my.package.name -dest src -f -types j2
-f suppresses any non-fatal warnings, including the one about overwriting existing files. -types j2 uses java.util.List for collections, and even List<Type> when the Java version is set to 5.0 as below. For each type Type of the schema, a my.package.name.Type java file is generated, along with a my.package.name.descriptors.TypeDescriptor for Castor's own use. Oh, I also put a castorbuilder.properties file into the current directory, which contained
# Defines the XML parser to be used by Castor.
# The parser must implement org.xml.sax.Parser.
org.exolab.castor.parser=org.xml.sax.helpers.XMLReaderAdapter

# Defines the (default) XML serializer factory to use by Castor, which must
# implement org.exolab.castor.xml.SerializerFactory; default is 
# org.exolab.castor.xml.XercesXMLSerializerFactory
org.exolab.castor.xml.serializer.factory=org.exolab.castor.xml.XercesJDK5XMLSerializerFactory

# Defines the default XML parser to be used by Castor.
org.exolab.castor.parser=com.sun.org.apache.xerces.internal.parsers.SAXParser

org.exolab.castor.builder.javaVersion=5.0
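Note that the key org.exolab.castor.parser appears twice in this file. Assuming the file is read through java.util.Properties (an assumption about Castor's internals, not something verified here), the later entry silently replaces the earlier one. A small sketch with the plain JDK shows that behaviour; the class name DuplicateKeyCheck is made up.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper, not part of Castor.
class DuplicateKeyCheck {

    /** Loads the properties text and returns the value that ends up stored for key. */
    static String effectiveValue(String propertiesText, String key) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(key);
    }

    public static void main(String[] args) {
        // The two parser entries from the file above, in the same order.
        String text =
            "org.exolab.castor.parser=org.xml.sax.helpers.XMLReaderAdapter\n"
          + "org.exolab.castor.parser=com.sun.org.apache.xerces.internal.parsers.SAXParser\n";
        // Prints the second value: the last duplicate wins.
        System.out.println(effectiveValue(text, "org.exolab.castor.parser"));
    }
}
```

If that assumption holds, only the com.sun.org.apache.xerces.internal.parsers.SAXParser entry takes effect, and the first parser entry could be dropped.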
Castor Source-Generation

Step 4: Use the classes

Write some code that unmarshals the XML file(s), and prints the resulting objects. toString() is not overridden, so you have to query each attribute and subnode individually.
import java.io.FileReader;
import org.exolab.castor.xml.Unmarshaller;

TopType top = (TopType) Unmarshaller.unmarshal(
  TopType.class, new FileReader("input.xml"));
// top.getSubItem() returns SubItem[]
for (SubItem item : top.getSubItem()) {
  System.out.printf("SubItem id: %s; value: %s%n",
    item.getSomeId(), item.getSomeValue());
  // p is just the anyType object from above; its toString()
  //  prints the XML content as a fragment.
  System.out.println(item.getP());
}
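Because the generated classes don't override toString(), a small reflection helper can save some typing while exploring the unmarshalled objects. The following is a rough sketch, not something Castor provides: BeanDump is a made-up name, and the nested SubItem class is only a stand-in for a generated class so that the example runs on its own.

```java
import java.lang.reflect.Method;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Castor.
class BeanDump {

    /** Calls every public no-argument getXxx() method and collects the results by name. */
    static Map<String, Object> describe(Object bean) {
        Map<String, Object> values = new TreeMap<>();
        try {
            for (Method m : bean.getClass().getMethods()) {
                String name = m.getName();
                boolean isGetter = name.startsWith("get")
                    && name.length() > 3
                    && m.getParameterCount() == 0
                    && m.getDeclaringClass() != Object.class; // skips getClass()
                if (isGetter) {
                    values.put(name.substring(3), m.invoke(bean));
                }
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not inspect " + bean.getClass(), e);
        }
        return values;
    }

    /** Stand-in for a generated class; the real one comes out of sourceGen. */
    public static class SubItem {
        private final String someId;
        private final String someValue;

        public SubItem(String someId, String someValue) {
            this.someId = someId;
            this.someValue = someValue;
        }

        public String getSomeId() { return someId; }
        public String getSomeValue() { return someValue; }
    }

    public static void main(String[] args) {
        System.out.println(describe(new SubItem("42", "hello")));
    }
}
```

Indexed getters such as getSubItem(int) are skipped on purpose, since they take a parameter.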
Have fun with it!
