Converting a fixed width/ delimiter-separated text file content (Input File) to another formatted text file (Output File). What approach to use?

Ryan Chong
CERTIFIED EXPERT
The best way to learn is to teach
Published:
Updated:
In real business world data are crucial and sometimes data are shared among different information systems. Hence, an agreeable file transfer protocol need to be established.

Definitions:


Source System
The IT system that provide the original data.
Targeted System
The IT system that will received the output data from Source System.
Input File
Referring to the original data file that generated from the Source System.
Output File
Referring to the formatted Input File.
Source File
Referring to the Final Data File that will being used by the Targeted System.
File Convertor
The program that converts the Input File to Output File.
Template File
The configuration file that defines the rules how the Output File should be generated via the File Convertor.
Admin Executables / Template File Generator
The program that defines the configuration rules how the Output File should be generated via the File Convertor.


The Issue


Relating to this file transfer protocol, there are a few issues that need to be agreed on, especially to do with how the data is being transferred.


There are a few options available:

  1. Database to Database Integration Services - Data are being transferred seamlessly between Source System and Targeted System with authenticated credentials

  2. Data File Transfer - Data is first exported from the Source System as a file and then imported into the Target System

  3. Manual System Entries - Data is recreated manually via some sort of user interface in the Targeted System


What we will focus on in this article is the second scenario, Data File Transfer.


A common problem is there is no option available in the Source System to directly generate the Source File needed by Targeted System.


Figure 1


Hence, the Source File (Input File) needs to be further processed before it (Output File) can be recognized and used by the Targeted System.



Proposed Resolution


1. File Formats

To ensure the data to be transferred successfully from the Source System to Targeted System, first we need to determine the file format of Input File.

First of all, we need to determine what are the sections available in the Input File, most of the cases, there could be a Header section, a Detail section, with an optional Footer section. Each of the sections could be defined differently with different fixed width to define the data that being stored.

Similarly for the Output File, there could have a different format which come with a Header section, a Detail section or even with a Footer section, that's all depends on the Targeted System's requirement on what data format are accepted.

The difference could be in term of:
i) Number of fields required
ii) The sequence of fields
iii) The conversion of values
iv) The change of fixed width / delimiter
v) The conversion from fixed width to delimiter-separated, or vice versa
vi) The use of Derived fields

And hence, an Input & Output File Mapping documentation is useful to document all the requirements.



2. File Converter

A few things to consider:

i) User interface consideration

Do we need to build an User Interface to enable user to operate the program or we simply make it as simple as a Console Program or a Service that running at background?

ii) Programming Languages and Tools

What programming languages or tools to be used to generated the Output File?

There are few options available, such as:
a) SQL Server Integration Services (SSIS)
   We can create a Script Task (in C# or VB.NET) and write the logic within.

b) a Compiled Executables, VB Script, PowerShell script etc
   Depends on the efficiency and convenience of developer, different programming languages and tools could be used to develop the File Convertor

iii) Delimiter and Text Qualifier

In case the Delimiter was being used in either Input File or Output File, do consider to use the Delimiter as unique as possible, which is different from the data content.

A Text Qualifier could be very useful to use in case inevitably the data also contains the character of delimiter as part of the content.


iv) Other Consideration

Other than that, we may think what features we should build into the File Converter, such as:
a) Data verification
   To verify the data checksum, size, format, etc.

b) Derived fields
    To enable the File Converter to have capability to generate derived  fields based on certain logic, such as summation of fields, count of  records, etc.

3. Static or Dynamic Approach?

As what we can see from Figure in section: 1. File Formats above, it illustrates the Static mapping of Input File to Output File.

The question that may arise is how can have a more dynamic approach to make it possible for the File Convertor to generate different Output File with different output formats?

In real programming world, it's really not a practical approach to hard code the logic in the codes. Therefore if there's a chance, the developers will try to make things to be configurable as possible as it can.

I think to make this possible, we would need to have another Admin Executables (aka Template File Generator) in order to generate a Template File (we may also called it as a profile), so that when the File Converter starts to convert the Input File's content, it would actually read the configuration rules in the Template File and then do the conversion accordingly to generate the Output File.




How can this Template File looks like?

Well, it depends on what programming languages and tools we are using. As a general guideline, it probably will be in file format such as XML or Config file.

Sample using XML:


<?xml version="1.0"?>
<setting>
   <ver>1.0</ver>
   <input>
        <format>fixed width</format>
        <header>
            <code>header</code>
            <isdetail>false</isdetail>
            <field>
                <code>1</code>
                <name>Record Identifier</name>
                <type>Char</type>
                <size>34</size>
                <start>1</start>
                <end>34</end>
                <remarks></remarks>
            </field>
        </header>
        <header>
            <code>detail</code>
            <isdetail>true</isdetail>
            <field>
                <code>SubmitDate</code>
                <name>Submission Date</name>
                <type>Date</type>
                <format>YYYYMMDD</format>
                <size>8</size>
                <start>1</start>
                <end>8</end>
                <remarks></remarks>
            </field>            
            <field>
                <code>SerialNo</code>
                <name>Serial Number</name>
                <type>Numeric</type>
                <size>9</size>
                <start>9</start>
                <end>17</end>
                <remarks></remarks>
            </field>
            <field>
                <code>CustName</code>
                <name>Customer Name</name>
                <type>Char</type>
                <size>20</size>
                <start>18</start>
                <end>37</end>
                <remarks></remarks>
            </field>
        </header>
   </input>
   <output>
        <format>fixed width</format>
        <header>
            <code>AcctList</code>
            <remarks>This is the Account List</remarks>
            <printremarks>false</printremarks>
            <isdetail>true</isdetail>
            <field>
                <code>SubmitDate</code>
                <source>
                    <code>detail|SerialNo</code>
                </source>
                <trim>false</trim>                    
                <name>Submission Date</name>
                <type>Date</type>
                <format>MM/DD/YYYY</format>
                <size>10</size>
                <start>1</start>
                <end>10</end>
                <align>right</align>
                <remarks></remarks>
            </field>            
            <field>
                <code>AppCode</code>
                <source>
                    <code>detail|SerialNo</code>
                </source>                    
                <trim>true</trim>
                <formula></formula>                    
                <name>Application Code</name>
                <type>Numeric</type>
                <size>9</size>
                <start>11</start>
                <end>19</end>
                <align>left</align>
                <remarks></remarks>
            </field>
        </header>
        <header>
            <code>AcctTotal</code>
            <remarks>This is the Account Total</remarks>
            <printremarks>true</printremarks>
            <isdetail>false</isdetail>
            <field>
                <code>AppCode</code>
                <source>
                    <code>detail|SerialNo2</code>
                </source>
                <trim>true</trim>
                <formula>{SUM}(1)</formula>
                <name>Total Application</name>
                <type>Numeric</type>
                <format>#,#0</format>
                <size>7</size>
                <start>1</start>
                <end>7</end>
                <align>right</align>
                <remarks></remarks>
            </field>
        </header>
   </output>
</setting>

To give a sample to reading above template using C# codes, we can customize this sample:


using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

using System.Xml;
using System.IO;


namespace MyFileConvertor
{
    class FileConvertor
    {
        private XmlDocument doc;

        public String TemplateFileVersion { get; set; }
        public Input Input { get; set; }
        public Output Output { get; set; }

        public FileConvertor(String TemplatePath)
        {
            doc = new XmlDocument();
            Input = new Input();
            Output = new Output();

            doc.Load(TemplatePath);
            TemplateFileVersion = doc.SelectSingleNode("setting/ver").InnerText;
            
            Input.Format = doc.SelectSingleNode("setting/input/format").InnerText;

            XmlNodeList inputHeaders = doc.SelectNodes("setting/input/header");
            foreach (XmlNode inputHeader in inputHeaders)
            {
                InputHeader hdr = new InputHeader();
                hdr.Code = inputHeader.SelectSingleNode("code").InnerText;
                hdr.isDetail = inputHeader.SelectSingleNode("isdetail").InnerText == "true" ? true : false;

                XmlNodeList inputFields = inputHeader.SelectNodes("field");
                foreach (XmlNode inputField in inputFields)
                {
                    Field fld = new Field();
                    fld.Code = inputField.SelectSingleNode("code").InnerText;
                    fld.Name = inputField.SelectSingleNode("name").InnerText;
                    fld.Type = inputField.SelectSingleNode("type").InnerText;
                    fld.Format = getNodeText(inputField.SelectSingleNode("format"));
                    fld.Size = Convert.ToInt32(inputField.SelectSingleNode("size").InnerText);
                    fld.Start = Convert.ToInt32(inputField.SelectSingleNode("start").InnerText);
                    fld.End = Convert.ToInt32(inputField.SelectSingleNode("end").InnerText);
                    fld.Remarks = inputField.SelectSingleNode("remarks").InnerText;
                    hdr.Fields.Add(fld);
                }
                Input.InputHeaders.Add(hdr);
            }

            Output.Format = doc.SelectSingleNode("setting/output/format").InnerText;

            XmlNodeList outputHeaders = doc.SelectNodes("setting/output/header");
            foreach (XmlNode outputHeader in outputHeaders)
            {
                OutputHeader hdr = new OutputHeader();
                hdr.Code = outputHeader.SelectSingleNode("code").InnerText;
                hdr.Remarks = outputHeader.SelectSingleNode("remarks").InnerText;
                hdr.PrintRemarks = outputHeader.SelectSingleNode("printremarks").InnerText;
                hdr.isDetail = outputHeader.SelectSingleNode("isdetail").InnerText == "true" ? true : false;

                XmlNodeList outputFields = outputHeader.SelectNodes("field");
                foreach (XmlNode outputField in outputFields)
                {
                    Field fld = new Field();
                    fld.Code = outputField.SelectSingleNode("code").InnerText;
                    fld.Name = outputField.SelectSingleNode("name").InnerText;
                    fld.Type = outputField.SelectSingleNode("type").InnerText;
                    fld.Format = getNodeText(outputField.SelectSingleNode("format"));
                    fld.Size = Convert.ToInt32(outputField.SelectSingleNode("size").InnerText);
                    fld.Start = Convert.ToInt32(outputField.SelectSingleNode("start").InnerText);
                    fld.End = Convert.ToInt32(outputField.SelectSingleNode("end").InnerText);
                    fld.Alignment = outputField.SelectSingleNode("align").InnerText;
                    fld.Remarks = outputField.SelectSingleNode("remarks").InnerText;

                    fld.Formula = getNodeText(outputField.SelectSingleNode("formula"));
                    fld.IsTrim = getNodeText(outputField.SelectSingleNode("trim")) == "true" ? true : false;

                    XmlNodeList outputFieldCodes = outputField.SelectNodes("source");
                    foreach (XmlNode outputFieldCode in outputFieldCodes)
                    {
                        Source src = new Source();
                        src.Codes.Add(outputFieldCode.SelectSingleNode("code").InnerText);
                        fld.Sources.Add(src);
                    }
                    
                    hdr.Fields.Add(fld);
                }
                Output.OutputHeaders.Add(hdr);
            }


        }
        public String getNodeText(XmlNode node)
        {
            try
            {
                return node.InnerText;
            }
            catch (Exception)
            {
                return "";
            }
        }
        public String getNodeText(XmlNode node, String defaultReturn)
        {
            try
            {
                return node.InnerText;
            }
            catch (Exception)
            {
                return defaultReturn;
            }
        }
        
    }

    class Input
    {
        public String Format { get; set; }
        public List<InputHeader> InputHeaders;

        public Input()
        {
            InputHeaders = new List<InputHeader>();
        }
    }

    class InputHeader
    {
        public String Code { get; set; }
        public Boolean isDetail { get; set; }
        public List<Field> Fields;

        public InputHeader()
        {
            Fields = new List<Field>();
        }
    }

    class Output
    {
        public String Format { get; set; }
        public List<OutputHeader> OutputHeaders;

        public Output()
        {
            OutputHeaders = new List<OutputHeader>();
        }
    }

    class OutputHeader
    {
        public String Code { get; set; }
        public String Remarks { get; set; }
        public String PrintRemarks { get; set; }
        public Boolean isDetail { get; set; }
        public List<Field> Fields;

        public OutputHeader()
        {
            Fields = new List<Field>();
        }
    }

    class Field
    {
        public String Code { get; set; }

        public Boolean IsTrim { get; set; }
        public String Formula { get; set; }
        
        public String Name { get; set; }
        public String Type { get; set; }
        public String Format { get; set; }
        public int Size { get; set; }
        public int Start { get; set; }
        public int End { get; set; }
        public String Alignment { get; set; }
        public String Remarks { get; set; }
        
        public List<Source> Sources;

        public Field()
        {
            Sources = new List<Source>();
        }
    }

    class Source
    {
        public List<String> Codes;
        public Source()
        {
            Codes = new List<string>();
        }
    }

}


In general, we can build a much more complex rules in the Template File to meet the requirements.

So, the preferred Programming Languages and Tools will play a crucial role to read the settings from the Template File, which eventually based on the logic within to generate the Output File.

This may not be an easy task but it's worth a try :)


Conclusion:

As a whole, we need to ensure the data accuracy and integrity are take into consideration.

Some of other concerns:

1. We can build an application log to track the activities of during the file conversion process.
2. We can embed the security algorithms to protect our data if necessary.
3. Automate the whole process by integrate it as part of the ETL (Extract, Transform and Load) or Windows Scheduled Task process.
4. How the Template File to be updated/ stored if the File Converter had been deployed to many computer machines?

More references:

Script Task
https://docs.microsoft.com/en-us/sql/integration-services/control-flow/script-task

Text Import Wizard
https://support.office.com/en-us/article/Text-Import-Wizard-c5b02af6-fda1-4440-899f-f78bafe41857?ui=en-US&rs=en-US&ad=US&fromAR=1
(Read Step 2 of 3 for explanation of Delimiter and Text Qualifier)

0
1,957 Views
Ryan Chong
CERTIFIED EXPERT
The best way to learn is to teach

Comments (0)

Have a question about something in this article? You can receive help directly from the article author. Sign up for a free trial to get started.