Overview of XML Convert and XFlat

Introduction

Companies have started using XML to send application data to browsers and to business applications. XML is well suited for the interchange of data, since XML documents are self-describing, easily parsed and can represent complex data structures. Also, there is a wide variety of high-quality, inexpensive tools for parsing and transforming XML documents. When using XML for data interchange, ideally, the sending application will be able to export an XML document, and the receiving application will be able to import an XML document. Unfortunately, many legacy applications use flat files to import or export data. So, companies will need to convert flat files into XML documents when sending data to XML-capable applications. Likewise, companies will need to convert XML documents into flat files that can be imported into legacy applications.

Flat Files

Flat files contain machine-readable data that is typically encoded as printable characters. A flat file usually contains a series of records (or lines), where each record is a sequence of fields. A field contains an atomic piece of data (e.g., a postal code).

Let's look at a simple flat file containing employee data. The file contains one or more employee records. Each record contains the following three fields:

  • Employee's social security number (ssn)
  • Employee's full name (last name followed by a comma followed by a space followed by the first name)
  • Employee's salary

The following are the contents of the employees flat file:

123456789,"Carr, Lisa",100000.00
444556666,"Barr, Clark",87000.00
777227878,"Parr, Jack",123000.00
998877665,"Charr, Lee",92000.00

Each record contains information about one employee. The format of the flat file is Comma-Separated Values (CSV), which means that each record is terminated by the operating system's line separator and the fields within a record are separated by a comma. In addition, a field value may be enclosed in quotes, which escape any commas or line terminator characters that appear within the field value. Note that the quotes that surround the field value are not actually part of the field value. Also, if a field value contains a quote character, then the field value must be surrounded by quotes and the quote character in the field value must be escaped by prefixing it with an additional quote.
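
The quoting rules above can be tried out with a short script. The following is a minimal sketch that uses the csv module from Python's standard library, not XML Convert; the function name read_records and the third sample record are made up for illustration.

```python
import csv
import io

# Two records from the employees flat file, held in a string for the example.
EMPLOYEES_CSV = (
    '123456789,"Carr, Lisa",100000.00\n'
    '444556666,"Barr, Clark",87000.00\n'
)

def read_records(text):
    """Return each record as a list of field values, with the quoting removed."""
    return list(csv.reader(io.StringIO(text)))

records = read_records(EMPLOYEES_CSV)

# The comma inside the quoted name does not split the field, and the
# surrounding quotes are not part of the value.
print(records[0])  # ['123456789', 'Carr, Lisa', '100000.00']

# A doubled quote inside a quoted field stands for one literal quote.
nickname = read_records('1,"Lee ""Ace"" Charr",5.00\n')[0]
print(nickname[1])  # Lee "Ace" Charr
```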

Let's look at a more complicated flat file, the structure of which is similar to the structure of a Windows Configuration Settings file (e.g., an INI file, such as win.ini). The flat file contains a list of contacts. The following are the contents of the contacts flat file:

[contact]
name=Nancy Magill
email=lil.magill@blackmountainhills.com
phone=(100) 555-9328
[contact]
email=molly.jones@oblada.com
name=Molly Jones
[contact]
phone=(200) 555-3249
name=Penny Lane
email=plane@bluesuburbanskies.com

Each contact consists of a begin contact record followed by three optional records. The begin contact record consists of the string "[contact]". The three optional records, which can appear in any order, are as follows:

  • The name record, which contains the full name of the contact. This record begins with a "name=" label followed by the name.
  • The email record, which contains the email address of the contact. This record begins with an "email=" label followed by the email address.
  • The phone record, which contains the phone number of the contact. This record begins with a "phone=" label followed by the phone number.

Each record in the flat file is a line that is terminated with the operating system's line separator.
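
To make the record grouping concrete, here is a minimal sketch in Python that reads a contacts file into one dictionary per contact. It is hand-written for this one format (the kind of custom script that a schema-driven tool replaces), and the names parse_contacts and LABELS are assumptions made for this example.

```python
LABELS = {"name", "email", "phone"}  # the three optional record types

def parse_contacts(text):
    """Group the records of a contacts flat file into one dict per contact."""
    contacts = []
    for line in text.splitlines():
        if line == "[contact]":
            contacts.append({})  # begin contact record starts a new contact
            continue
        label, sep, value = line.partition("=")
        if not sep or label not in LABELS or not contacts:
            raise ValueError(f"unexpected record: {line!r}")
        contacts[-1][label] = value  # name, email or phone record, in any order
    return contacts

CONTACTS = """\
[contact]
name=Nancy Magill
email=lil.magill@blackmountainhills.com
phone=(100) 555-9328
[contact]
email=molly.jones@oblada.com
name=Molly Jones
"""

contacts = parse_contacts(CONTACTS)
print(len(contacts))        # 2
print(contacts[1]["name"])  # Molly Jones
```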

You might be wondering why these two files are considered "flat". The term "flat" means that the file is not indexed. The term also implies that a flat file does not have a hierarchical structure; however, many flat files do have a hierarchical structure. Even simple flat files, such as the employees file above, contain a sequence of records where each record contains a sequence of fields. Many flat files, such as those used to exchange insurance claims, have more complicated data structures, such as multiple record types, groups of records, nested groups of records, repeating groups, etc.

Flat files are commonly used to transfer data between applications, since many business applications (e.g., CRM systems, ERP systems, EDI translators, legacy applications, etc.) use flat files to import and export data. For example, when a company receives an EDI invoice from a vendor, it will use an EDI translator to convert the invoice data from the EDI data format (e.g., X12) into the data format required by the accounts payable system. The EDI translator will typically produce a flat file containing the converted invoice data. The accounts payable system then imports this flat file.

In the future, many business applications will be able to import and export XML. Until then, there will be a need for conversion tools that can convert complex legacy data into XML documents, and vice versa.

Conversion Between Flat Files and XML

Companies will need to convert flat files to XML when transferring data from legacy applications to XML-capable applications (e.g., an ERP system, Microsoft's Internet Explorer, etc.). Companies will also convert legacy data into XML when they need to display the data on a non-XML-capable browser, since it is easy to convert XML to HTML using XSLT.

Companies will need to convert XML into flat files when transferring data from XML-capable systems to a legacy system.

Conversion between flat file data and XML can be done via generic conversion tools (e.g., XML Convert) or custom scripts (e.g., a Perl script). Generic conversion tools are schema-driven, so that they can handle a wide range of legacy data formats. Such a conversion tool uses the schema of the flat file to parse the file and convert it to an XML document. The conversion tool also needs the flat file schema when converting an XML document into a flat file that conforms to the flat file schema.

The XFlat Language

XFlat is an XML language for defining flat file schemas. An XFlat schema is an XML document that conforms to the XFlat language and that defines the structure and syntax of a class of flat files containing non-XML data. The same schema also defines the structure and syntax of the corresponding class of XFlat instances. An XFlat instance is an XML document whose structure and data are the same as those of a flat file. XML Convert uses XFlat schemas to transform flat files into XFlat instances and vice versa.

The flat file that is described by an XFlat schema must consist of records, where each record is a sequence of fields. A field is an atomic piece of data (e.g., a postal code). Records and fields may be delimited. A record separator (i.e., delimiter) occurs at the end of a record and helps the parser determine where the record ends. Likewise, a field separator occurs at the end of a field and helps a parser to determine where a field ends. Fields that are not delimited must meet one or both of the following constraints:

  • The field must be fixed length (i.e., the minimum length of the field must be equal to the maximum length of the field).
  • The set of characters that are allowed in the field value must be specified.
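
The two constraints above are what let a parser find the end of a field that has no separator. The following Python sketch illustrates the idea only; it is not how XML Convert is implemented, and the helper names and the sample record are hypothetical.

```python
def read_fixed(record, pos, length):
    """Fixed-length field: it ends after exactly `length` characters."""
    if pos + length > len(record):
        raise ValueError("record is too short for this field")
    return record[pos:pos + length], pos + length

def read_by_charset(record, pos, valid_chars):
    """Variable-length field: it ends at the first character that is
    not in the set of allowed characters (or at the end of the record)."""
    end = pos
    while end < len(record) and record[end] in valid_chars:
        end += 1
    return record[pos:end], end

# Hypothetical record with no separators: a 9-character number, a
# 2-character state code, a postal code made of digits and hyphens,
# and a 1-character flag.
record = "123456789CA94105-1234X"
ssn, pos = read_fixed(record, 0, 9)
state, pos = read_fixed(record, pos, 2)
postal_code, pos = read_by_charset(record, pos, "0123456789-")
flag, pos = read_fixed(record, pos, 1)

print(ssn, state, postal_code, flag)  # 123456789 CA 94105-1234 X
```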

The records may be grouped, and groups of records may be nested in a hierarchical structure (in other words, groups of records may contain subgroups). Note that XFlat supports nested data structures, but it does not support recursive data structures.

The XFlat element types are as follows:

  • XFlat, which is used to define an XFlat schema. The XFlat element is always the document element (i.e., root element) of an XFlat schema. An XFlat element must contain exactly one subelement, and this subelement must be a SequenceDef element, a ChoiceDef element or a RecordDef element.
  • SequenceDef, which is used to define a sequence of objects, where each object may be a sequence, a choice or a record. A SequenceDef element must contain one or more subelements; each of these subelements must be a SequenceDef element, a ChoiceDef element or a RecordDef element.
  • ChoiceDef, which is used to define a choice of one object from a set of objects, where each object in the set may be a choice, a sequence or a record. A ChoiceDef element must contain one or more subelements; each of these subelements must be a SequenceDef element, a ChoiceDef element or a RecordDef element.
  • RecordDef, which is used to define a record, which is essentially a sequence of fields. A RecordDef element must contain one or more FieldDef elements.
  • FieldDef, which is used to define a field. A field contains an atomic piece of data. A FieldDef element may not contain any subelements.
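
The containment rules in the list above are simple enough to check mechanically. The sketch below does so with Python's standard xml.etree.ElementTree module. It covers only the nesting rules stated above (no attributes), and the function names check_structure and check_schema are assumptions made for this example, not part of XML Convert.

```python
import xml.etree.ElementTree as ET

GROUP_CHILDREN = {"SequenceDef", "ChoiceDef", "RecordDef"}

def check_structure(elem):
    """Return a list of violations of the XFlat containment rules."""
    children = [child.tag for child in elem]
    errors = []
    if elem.tag == "XFlat":
        if len(children) != 1 or children[0] not in GROUP_CHILDREN:
            errors.append("XFlat must contain exactly one SequenceDef, ChoiceDef or RecordDef")
    elif elem.tag in ("SequenceDef", "ChoiceDef"):
        if not children or any(tag not in GROUP_CHILDREN for tag in children):
            errors.append(f"{elem.tag} must contain one or more SequenceDef, ChoiceDef or RecordDef elements")
    elif elem.tag == "RecordDef":
        if not children or any(tag != "FieldDef" for tag in children):
            errors.append("RecordDef must contain one or more FieldDef elements")
    elif elem.tag == "FieldDef":
        if children:
            errors.append("FieldDef may not contain subelements")
    else:
        errors.append(f"unknown element type: {elem.tag}")
    for child in elem:
        errors.extend(check_structure(child))
    return errors

def check_schema(xml_text):
    """Parse a schema document and check it from the document element down."""
    root = ET.fromstring(xml_text)
    if root.tag != "XFlat":
        return ["document element must be XFlat"]
    return check_structure(root)

good = '<XFlat Name="s"><RecordDef Name="r"><FieldDef Name="f"/></RecordDef></XFlat>'
bad = '<XFlat Name="s"><RecordDef Name="r"/></XFlat>'
print(check_schema(good))  # []
print(check_schema(bad))   # ['RecordDef must contain one or more FieldDef elements']
```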

An XFlat schema contains all the information needed to convert a flat file to XML (or vice versa). The MapToXml attribute in the XFlat language allows you to map each group, record and field to an XML element or to nothing. A field can also be mapped to an XML attribute.

Note that XFlat is a declarative language. A non-programmer who is familiar with flat files can create an XFlat schema.

For more information about XFlat (e.g., the definitions of the XML elements and attributes in the XFlat language), please refer to the XFlat Language page in the XML Convert documentation.

Example XFlat Schemas

Let's look at the XFlat schema for the employees flat file. The contents of that flat file were as follows:

123456789,"Carr, Lisa",100000.00
444556666,"Barr, Clark",87000.00
777227878,"Parr, Jack",123000.00
998877665,"Charr, Lee",92000.00

The following XFlat schema describes the layout of the employees flat file:

<?xml version='1.0'?>
<XFlat Name="employees_schema" Description="Schema for CSV flat file">
    <SequenceDef Name="employees" Description="employees flat file">
        <RecordDef Name="employee" FieldSep="," RecSep="\N" MaxOccur="0">
            <FieldDef Name="ssn" NullAllowed="No" 
                      MinFieldLength="9" MaxFieldLength="11"
                      DataType="Integer" MinValue="0"
                      QuotedValue="Yes"/>
            <FieldDef Name="name" NullAllowed="No"
                      QuotedValue="Yes"/>
            <FieldDef Name="salary" NullAllowed="No"
                      DataType="Float" MinValue="0"
                      QuotedValue="Yes"/>
        </RecordDef>
    </SequenceDef>
</XFlat>

Please note the following about this XFlat schema:

  • Each of the following is mapped to an XML element in the XFlat instance by default:
    • the flat file as a whole
    • the employee record
    • the three employee fields
  • The tags for the XFlat instance are specified by the Name attributes in the SequenceDef, RecordDef and FieldDef elements.
  • The RecordDef element contains the MaxOccur="0" attribute, which means that there is no upper limit on the number of employee records that may appear in the flat file.
  • The Description attribute, which is used in the XFlat and SequenceDef elements, contains free form text and is treated as a comment.
  • The record separator for the record is defined as "\N", which is an XFlat encoding for the line separator for the local operating system. On Unix, "\N" is converted to the line feed character (Unicode #xA). On Windows, "\N" is converted to the carriage return character (Unicode #xD) followed by the line feed character (Unicode #xA). If we were to put a literal line separator in the value of the RecSep attribute, then an XML parser would convert it to a space character during the attribute value normalization process. XFlat has encodings for special characters that are affected by attribute normalization or that may not appear in XML documents.
  • All the FieldDef elements contain the QuotedValue="Yes" attribute, since the flat file format is CSV. For the same reason, the RecordDef element contains the FieldSep="," attribute.
  • All three FieldDef elements contain the NullAllowed="No" attribute, since all three fields are mandatory (i.e., the length of the field value must be greater than or equal to one character).
  • The data type of the Social Security Number (ssn) field is declared as "Integer".
  • The minimum value of the Social Security Number (ssn) field is set to zero, since a negative social security number is invalid.
  • The data type of the salary field is declared as "Float".
  • The minimum value of the Salary field is set to zero, since a negative salary is invalid.
  • The FieldDef elements for the name and salary fields do not include the MaxFieldLength attribute. Thus, the default maximum field length (i.e., 80 characters) applies to both fields.
  • The minimum field length for the ssn field is set to 9, since social security numbers are 9 digits long. The maximum field length for the ssn field is set to 11, since the value of the ssn field might be enclosed in quotes.
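
The effect of attribute value normalization on a record separator can be observed with any XML parser. The sketch below uses the parser in Python's standard library (an assumption made for illustration; XML Convert itself is a Java application). It shows that a literal line feed in an attribute value arrives as a space, while the two ordinary characters of the "\N" encoding arrive unchanged.

```python
import xml.etree.ElementTree as ET

# A literal line feed typed inside the attribute value is normalized
# to a single space by the XML parser.
literal = ET.fromstring('<RecordDef RecSep="\n"/>').get("RecSep")

# The XFlat encoding consists of two ordinary characters, a backslash
# and an N, so the parser passes it through untouched and the
# application can interpret it itself.
encoded = ET.fromstring(r'<RecordDef RecSep="\N"/>').get("RecSep")

print(repr(literal))  # ' '
print(repr(encoded))  # '\\N'
```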

Now let's look at the XFlat schema for the contacts flat file. The following were the contents of that flat file:

[contact]
name=Nancy Magill
email=lil.magill@blackmountainhills.com
phone=(100) 555-9328
[contact]
email=molly.jones@oblada.com
name=Molly Jones
[contact]
phone=(200) 555-3249
name=Penny Lane
email=plane@bluesuburbanskies.com

The following XFlat schema describes the layout of the contacts flat file:

<?xml version='1.0'?>
<XFlat Name="contacts_schema" Description="Schema for contacts flat file">
    <SequenceDef Name="contacts" Description="Contacts flat file">
        <SequenceDef Name="contact" MinOccur="0" MaxOccur="0">
            <RecordDef Name="begin_contact" MapToXml="No" RecSep="\N">
                <FieldDef Name="label"
                          ValidValue="[contact]"
                          NullAllowed="No"
                          MapToXml="No"/>
            </RecordDef>
            <ChoiceDef Name="choice_of_one" MapToXml="No"
                       MinOccur="0" MaxOccur="3">
                <RecordDef Name="full_name" RecSep="\N" MapToXml="No">
                    <FieldDef Name="label"
                              ValidValue="name="
                              NullAllowed="No"
                              MapToXml="No"/>
                    <FieldDef Name="full_name"/>
                </RecordDef>
                <RecordDef Name="phone_num" RecSep="\N" MapToXml="No">
                    <FieldDef Name="label"
                              ValidValue="phone="
                              NullAllowed="No"
                              MapToXml="No"/>
                    <FieldDef Name="phone_number"/>
                </RecordDef>
                <RecordDef Name="email" RecSep="\N" MapToXml="No">
                    <FieldDef Name="label"
                              ValidValue="email="
                              NullAllowed="No"
                              MapToXml="No"/>
                    <FieldDef Name="email_address"/>
                </RecordDef>
            </ChoiceDef>
        </SequenceDef>
    </SequenceDef>
</XFlat>

Please note the following about this XFlat schema:

  • Each of the following is mapped to an XML element in the XFlat instance by default:
    • the flat file as a whole
    • the contact as a whole
    • the full_name field
    • the phone_number field
    • the email_address field
  • The tags for the XFlat instance are specified by the Name attributes in the SequenceDef and FieldDef elements.
  • The FieldDef elements for all the label fields contain the MapToXml="No" attribute, since we don't want to map the label fields to XML.
  • All four RecordDef elements contain the MapToXml="No" attribute, since there is only one interesting field in each of these four record types.
  • The ChoiceDef element specifies a choice of exactly one of the following: the full_name record, the phone_num record or the email record. The MaxOccur of the ChoiceDef element is three, since a contact may contain up to three records (i.e., a full_name record, a phone_num record and an email record).
  • The record separator for the records is defined as "\N", which is an XFlat encoding for the line separator for the local operating system.
  • Each of the full_name, phone_number and email_address fields is delimited by the record separator.
  • The label fields are not delimited by a field separator or record separator. Also, the label fields are variable length, since the default value of the MinFieldLength attribute is zero and the default value of the MaxFieldLength attribute is 80. In general, the FieldDef elements for variable-length, non-delimited fields must contain the ValidChars, InvalidChars or ValidValue attribute, so that XML Convert can figure out where the field ends. The FieldDef elements for all the label fields contain the ValidValue attribute; they also contain the NullAllowed="No" attribute, since the values of the label fields may not be null.
  • This schema does not describe all the syntax rules for this flat file. For example, a contact is not allowed to contain two name records; unfortunately, the XFlat language is not yet capable of expressing such a syntactical constraint.
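
A constraint that the schema cannot express, such as "at most one name record per contact", can be checked in a separate step on the XML side. The following Python sketch is one possible way to do that; it is not a feature of XML Convert, and the function name duplicate_fields is hypothetical. It assumes an XML document shaped like the output of the contacts schema (a contacts element containing contact elements).

```python
from collections import Counter
import xml.etree.ElementTree as ET

def duplicate_fields(xml_text):
    """Return (contact index, element name) for each element name that
    occurs more than once within a single contact."""
    problems = []
    contacts = ET.fromstring(xml_text).findall("contact")
    for index, contact in enumerate(contacts):
        counts = Counter(child.tag for child in contact)
        problems.extend((index, tag) for tag, n in counts.items() if n > 1)
    return problems

sample = (
    "<contacts>"
    "<contact><full_name>Penny Lane</full_name>"
    "<full_name>P. Lane</full_name></contact>"
    "<contact><full_name>Molly Jones</full_name></contact>"
    "</contacts>"
)
print(duplicate_fields(sample))  # [(0, 'full_name')]
```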

XML Convert

XML Convert 2.2 is a Java application that uses XFlat schemas to convert flat files into XML, and vice versa.

Features

The key features of XML Convert 2.2 include:

  • Converts flat files to XML, and vice versa. Also converts flat files from one format to another.
  • Handles a wide range of flat file formats, including the following:
    • variable length records
    • comma separated values (CSV)
    • fixed length fields and records
    • multiple record types
    • nested groups of records
    • records that contain a mix of delimited and non-delimited fields
    • records in which each field has a different delimiter
    • semi-structured data, such as news feeds and human-readable reports
    • many others
  • Handles flat files that contain control characters and non-printable characters.
  • Uses a simple XML language (i.e., XFlat) for the flat file schemas.
  • Handles very large files. Converts in stream (i.e., serial) mode, which means that XML Convert does not read the entire flat file or XFlat instance into memory.
  • Written in Java for portability.
  • Includes Windows executables for ease of use on a PC.
  • Includes Java applications that are invoked from the command line.
  • Includes a simple Java API, so that XML Convert can be easily invoked from the user's Java application.
  • The flat file to XML converter is also available as a SAX 2.0 driver, so that applications, such as the SAXON XSLT processor, can process a flat file as if it were an XML document.
  • The XML to flat file converter is also available as a SAX 2.0 content handler.
  • Includes an XT output method that allows XT to transform any XML document into a flat file, the format of which is described in an XFlat schema. XT is James Clark's XSLT processor.
  • Includes a Java application that invokes XT to transform a flat file as if it were an XML document.
  • Validates the input data (e.g., flat file data or an XFlat instance) against the XFlat schema. If an error is found in the input data, then the conversion process is terminated and a detailed error message is generated.
  • Error messages include a detailed description of the error and the location of the error (within the XFlat schema file, the XFlat instance or the flat file), so that the user can quickly troubleshoot the error.

When XML Convert transforms a flat file to an XML document (i.e., an XFlat instance), it will verify the structure of the flat file data and the data types of the fields using the XFlat schema. If the flat file does not pass this verification, then it is rejected. This verification minimizes the chance that an invalid XML document will be sent to the receiving application.

Likewise, when XML Convert transforms an XML document to a flat file, it will verify that the XML document conforms with the XFlat schema. This verification minimizes the chance that an invalid flat file will be imported into a business application.
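
To give a feel for the kind of field-level checks involved, here is a minimal Python sketch that tests one field value against the FieldDef attributes used in the example schemas (NullAllowed, MinFieldLength, MaxFieldLength, DataType and MinValue). It is an illustration only, not XML Convert's validation logic. The defaults of 0 and 80 for the field lengths come from the schema notes above; treating an empty value as null, and skipping the remaining checks for a permitted null, are assumptions of this sketch.

```python
def validate_field(value, attrs):
    """Check one field value against FieldDef-style attributes.

    `attrs` is a dict of attribute names to strings, as they would be
    read from a FieldDef element. Returns a list of error messages.
    """
    errors = []
    if value == "":
        # Assumption: an empty value is a null value.
        if attrs.get("NullAllowed") == "No":
            errors.append("null value not allowed")
        return errors
    min_len = int(attrs.get("MinFieldLength", 0))
    max_len = int(attrs.get("MaxFieldLength", 80))
    if not min_len <= len(value) <= max_len:
        errors.append(f"length {len(value)} outside {min_len}..{max_len}")
    data_type = attrs.get("DataType")
    if data_type in ("Integer", "Float"):
        try:
            number = int(value) if data_type == "Integer" else float(value)
        except ValueError:
            return errors + [f"not a valid {data_type}: {value!r}"]
        if "MinValue" in attrs and number < float(attrs["MinValue"]):
            errors.append(f"value below minimum {attrs['MinValue']}")
    return errors

ssn_def = {"NullAllowed": "No", "MinFieldLength": "9",
           "MaxFieldLength": "11", "DataType": "Integer", "MinValue": "0"}
print(validate_field("123456789", ssn_def))  # []
print(validate_field("12345", ssn_def))      # ['length 5 outside 9..11']
print(validate_field("", ssn_def))           # ['null value not allowed']
```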

Converting Between the Employees Flat File and XML

Using the XFlat schema for the employees flat file (see above), XML Convert would convert the employees flat file into the following XML document (i.e., XFlat instance):

<?xml version='1.0'?>
<employees>
    <employee>
        <ssn>123456789</ssn>
        <name>Carr, Lisa</name>
        <salary>100000.00</salary>
    </employee>
    <employee>
        <ssn>444556666</ssn>
        <name>Barr, Clark</name>
        <salary>87000.00</salary>
    </employee>
    <employee>
        <ssn>777227878</ssn>
        <name>Parr, Jack</name>
        <salary>123000.00</salary>
    </employee>
    <employee>
        <ssn>998877665</ssn>
        <name>Charr, Lee</name>
        <salary>92000.00</salary>
    </employee>
</employees>

In the reverse direction, using the same XFlat schema, XML Convert would convert this XFlat instance back into the original employees flat file.
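
For comparison, the same round trip can be written by hand for this one file format. The Python sketch below is such a custom script, with the field names hard-coded instead of being read from an XFlat schema; the function names csv_to_xml and xml_to_csv are made up for this example.

```python
import csv
import io
import xml.etree.ElementTree as ET

FIELDS = ("ssn", "name", "salary")  # field names copied from the schema

def csv_to_xml(text):
    """Build an <employees> element with one <employee> per CSV record."""
    root = ET.Element("employees")
    for row in csv.reader(io.StringIO(text)):
        if len(row) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} fields, got {len(row)}")
        employee = ET.SubElement(root, "employee")
        for tag, value in zip(FIELDS, row):
            ET.SubElement(employee, tag).text = value
    return root

def xml_to_csv(root):
    """Write one CSV record per <employee>, quoting only where needed."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for employee in root.findall("employee"):
        writer.writerow([employee.findtext(tag) for tag in FIELDS])
    return out.getvalue()

flat = (
    '123456789,"Carr, Lisa",100000.00\n'
    '444556666,"Barr, Clark",87000.00\n'
)
root = csv_to_xml(flat)
print(root[0].findtext("name"))  # Carr, Lisa
print(xml_to_csv(root) == flat)  # True
```

Every new file format would need another script like this one, which is the motivation for a schema-driven converter.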

Converting Between the Contacts Flat File and XML

Using the XFlat schema for the contacts flat file (see above), XML Convert would convert the contacts flat file into the following XML document:

<?xml version='1.0'?>
<contacts>
    <contact>
        <full_name>Nancy Magill</full_name>
        <email_address>lil.magill@blackmountainhills.com</email_address>
        <phone_number>(100) 555-9328</phone_number>
    </contact>
    <contact>
        <email_address>molly.jones@oblada.com</email_address>
        <full_name>Molly Jones</full_name>
    </contact>
    <contact>
        <phone_number>(200) 555-3249</phone_number>
        <full_name>Penny Lane</full_name>
        <email_address>plane@bluesuburbanskies.com</email_address>
    </contact>
</contacts>

In the reverse direction, using the same XFlat schema, XML Convert would convert this XFlat instance back into the original contacts flat file.
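
The reverse direction for the contacts file amounts to writing a begin contact record for each contact element and putting the label back in front of each field value. The Python sketch below shows this for the contacts format only; the LABELS table and the function name contacts_to_flat are assumptions made for this example, and an unknown element name would raise a KeyError.

```python
import xml.etree.ElementTree as ET

# Element name in the XFlat instance -> label written to the flat file.
LABELS = {
    "full_name": "name=",
    "phone_number": "phone=",
    "email_address": "email=",
}

def contacts_to_flat(xml_text):
    """Turn a contacts XML document back into the contacts flat file."""
    lines = []
    for contact in ET.fromstring(xml_text).findall("contact"):
        lines.append("[contact]")  # begin contact record
        for field in contact:      # the original record order is preserved
            lines.append(LABELS[field.tag] + (field.text or ""))
    return "".join(line + "\n" for line in lines)

instance = (
    "<contacts><contact>"
    "<email_address>molly.jones@oblada.com</email_address>"
    "<full_name>Molly Jones</full_name>"
    "</contact></contacts>"
)
print(contacts_to_flat(instance), end="")
```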

XSLT and XML Convert

After converting a flat file into XML using XML Convert, it may be necessary to transform the resulting XML document (i.e., the XFlat instance) before sending it to the receiving application. For example, if the resulting XFlat instance will be sent to a browser that does not support XML, then the XFlat instance should be converted from XML to HTML using an XSLT processor. If the output will be sent to an XML-capable application, then it may be necessary to use an XSLT processor to convert the XFlat instance into a new XML document whose structure meets the requirements of the receiving application. (Note that the output of the XSLT processor can be an XML document, an HTML document or text.)

If the resulting XFlat instance will be sent to an XML-capable browser, then the XFlat instance can specify a stylesheet, so that the browser renders the XML document as a nicely formatted web page.

When converting an XML document into a flat file, the XML document might not have the same structure as the target flat file. In this case, the user can use an XSLT processor to convert the XML document into an XFlat instance (i.e., an XML document that complies with the XFlat schema that describes the format of the target flat file). The user would then employ XML Convert to transform the XFlat instance into a flat file. XML Convert uses an XFlat schema to parse the XFlat instance and produce the target flat file.

If you plan to use an XSLT processor to transform the output of XML Convert into a new XML document, then keep in mind that most XSLT processors read the entire source document into memory, which can be a problem for very large files.
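
When the document is too large for that, simple processing can be done in a streaming fashion instead. The sketch below is one possible alternative, not a feature of XML Convert or of XSLT: it uses iterparse from Python's standard library to add up the salaries in an employees XFlat instance one record at a time. The function name total_salary is made up for this example.

```python
import io
import xml.etree.ElementTree as ET

def total_salary(xml_stream):
    """Sum the <salary> values while reading the document incrementally."""
    total = 0.0
    for event, elem in ET.iterparse(xml_stream, events=("end",)):
        if elem.tag == "employee":
            total += float(elem.findtext("salary"))
            # Release the record's children; the emptied <employee>
            # element itself stays attached to the root element.
            elem.clear()
    return total

instance = io.BytesIO(
    b"<employees>"
    b"<employee><ssn>123456789</ssn><name>Carr, Lisa</name>"
    b"<salary>100000.00</salary></employee>"
    b"<employee><ssn>444556666</ssn><name>Barr, Clark</name>"
    b"<salary>87000.00</salary></employee>"
    b"</employees>"
)
result = total_salary(instance)
print(result)  # 187000.0
```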

Please note that XML Convert and the XFlat language do not provide any XML to XML transformation features, since XSLT can be used to do XML to XML transformation.

Also note that an XSLT processor can convert an XML document into non-XML text, without any help from XML Convert. Thus, you could use an XSLT processor without XML Convert to transform an XML document into a flat file. However, it would be tedious to write an XSLT stylesheet that rejects an input document that cannot be transformed into a valid flat file. It's important to reject a source document that cannot be transformed into a valid flat file, so that the receiving application does not import an invalid flat file. XML Convert rejects the input data file when it does not conform to the XFlat schema. Also, for most flat files, you can write a single XFlat schema that can be used in both directions (i.e., conversion from flat file to XML, and conversion from XML to flat file).

Summary

XML Convert 2.2 is a Java application that uses XFlat schemas to convert flat files into XML and vice versa. XML Convert can also convert legacy data from one format to another. XFlat is an XML language for defining flat file schemas. XML Convert uses an XFlat schema to parse and validate the input file (i.e., the flat file or the XFlat instance), and to produce the output file. XML Convert supports a wide variety of legacy data formats, including CSV, semi-structured data (e.g., human-readable reports), fixed length records and fields, multiple record types, groups of records, nested groups, etc.