After reading a post on the C# newsgroup asking for an EBCDIC to ASCII
converter, and seeing one solution, I decided to write my own implementation.
This page describes the implementation and its limitations, and a bit about
EBCDIC itself.
EBCDIC
Unfortunately it appears to be fairly tricky to get hold of many concrete
specifications of EBCDIC. This is what I've managed to glean from various
websites:
- Introduced by IBM, EBCDIC is an encoding mostly used on mainframes.
- Like "OEM", EBCDIC isn't a single character encoding:
  there are many EBCDIC encodings, suited to different cultures.
- It is primarily a single-byte encoding, ie each character is encoded
  as a single byte. However, there are two characters, "shift out" and
  "shift in" (0x0e and 0x0f respectively), which are used to change
  between this and a double-byte character set (DBCS). As far as I can
  tell, a single EBCDIC encoding doesn't specify which DBCS is to
  be used - in other words, you really need even more information
  before you can tell what's going on. Presumably the DBCS in question
  can't have any pairs beginning with byte 0x0f, as otherwise it would
  be confused with the "shift in" flag.
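To make the shifting idea concrete, here is a minimal sketch of how a byte stream could be split into single-byte and double-byte runs. It is only an illustration of the description above, not code from my library (which does no shifting at all), and the ShiftSplitter and Run names are made up for the example. It assumes that a stray "shift in" while already in single-byte mode is just an ordinary byte.

```csharp
using System;
using System.Collections.Generic;

public static class ShiftSplitter
{
    const byte ShiftOut = 0x0e; // switch to the double-byte set
    const byte ShiftIn = 0x0f;  // switch back to single-byte EBCDIC

    // One run of bytes, all read in the same mode. Shift bytes are not included.
    public sealed class Run
    {
        public readonly bool IsDoubleByte;
        public readonly byte[] Bytes;
        public Run(bool isDoubleByte, byte[] bytes)
        {
            IsDoubleByte = isDoubleByte;
            Bytes = bytes;
        }
    }

    public static List<Run> Split(byte[] data)
    {
        List<Run> runs = new List<Run>();
        List<byte> current = new List<byte>();
        bool doubleByte = false;
        int i = 0;
        while (i < data.Length)
        {
            byte b = data[i];
            // Only the shift that leaves the current mode is treated as a flag.
            bool isShift = doubleByte ? b == ShiftIn : b == ShiftOut;
            if (isShift)
            {
                Flush(runs, current, doubleByte);
                doubleByte = !doubleByte;
                i++;
            }
            else if (doubleByte)
            {
                // In double-byte mode the shift test happens only at the start
                // of a pair, which is why no pair may begin with 0x0f.
                if (i + 1 >= data.Length)
                {
                    throw new FormatException("Truncated double-byte pair at offset " + i);
                }
                current.Add(b);
                current.Add(data[i + 1]);
                i += 2;
            }
            else
            {
                current.Add(b);
                i++;
            }
        }
        Flush(runs, current, doubleByte);
        return runs;
    }

    static void Flush(List<Run> runs, List<byte> current, bool doubleByte)
    {
        if (current.Count == 0)
        {
            return;
        }
        runs.Add(new Run(doubleByte, current.ToArray()));
        current.Clear();
    }

    public static void Main()
    {
        byte[] data = { 0x88, 0x85, 0x0e, 0x42, 0x0f, 0x43, 0x44, 0x0f, 0x93 };
        foreach (Run run in Split(data))
        {
            Console.WriteLine("{0}: {1}",
                run.IsDoubleByte ? "double" : "single",
                BitConverter.ToString(run.Bytes));
        }
    }
}
```

Note that 0x0f is accepted as the second byte of a pair here; only the first byte of a pair is checked against "shift in".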
If you have any more information, particularly about the DBCS
aspect, please mail me at
skeet@pobox.com.
My EBCDIC Encoding implementation
I managed to get hold of details of 47 EBCDIC encodings from
http://std.dkuug.dk/i18n/charmaps/.
To be honest, I don't really know what DKUUG is, so I'm really just hoping that
the maps are accurate - they seem to be quite reasonable though. Each
encoding has a name and several have aliases, although I currently ignore
this aliasing.
My implementation consists of three projects, described below, of which only
the middle one is of any interest to most people.
- A character map reader
  This simply finds all of the files whose names begin with "EBCDIC-" in the
  current directory, reads them all in (warning of any oddities in the encoding,
  such as any non-zero byte having two distinct meanings) and writes a resource
  file out, ebcdic.dat. This is a console application built from a
  single C# source file.
- An encoding library
  This is a library built from two C# source files and the ebcdic.dat
  file generated by the reader. This library is all most users will need. More details
  are provided below.
- A test program
  This is a console application built from a single C# source file and requiring
  the library described above. Currently it just displays the encoded version of
  "hello" and then decodes it.
Using The Encoding Library
The encoding library is very simple to use, as the encoding class
(JonSkeet.Ebcdic.EbcdicEncoding) is a subclass of the standard
.NET System.Text.Encoding class. To obtain an instance of the appropriate
encoding, use EbcdicEncoding.GetEncoding (String) passing it the name of the
encoding you wish to use (eg EBCDIC-US). You can find out
the list of names of available encodings using the EbcdicEncoding.AllNames
property, which returns the names as an array of strings.
Once you have obtained an EbcdicEncoding instance, use it like any other
Encoding: call GetString, GetBytes etc. The encoding
does not save any state between requests, and can safely be used by many threads
simultaneously. There is no need (or indeed facility) to release an encoding's resources when
it is no longer needed. All encodings are created on the first use of the
EbcdicEncoding class, and maintained until the application domain is unloaded.
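As a quick illustration of the calls just described, the sketch below lists the available encoding names and round-trips a short string in memory. It uses only the members mentioned above (EbcdicEncoding.AllNames and EbcdicEncoding.GetEncoding) plus the standard Encoding methods, and it assumes the project references the encoding library; the RoundTrip class name is just for the example.

```csharp
using System;
using System.Text;
using JonSkeet.Ebcdic; // assumes a reference to the encoding library

public class RoundTrip
{
    public static void Main()
    {
        // Show every encoding name the library knows about.
        foreach (string name in EbcdicEncoding.AllNames)
        {
            Console.WriteLine(name);
        }

        // From here on it is used like any other System.Text.Encoding.
        Encoding ebcdic = EbcdicEncoding.GetEncoding("EBCDIC-US");
        byte[] encoded = ebcdic.GetBytes("hello");
        Console.WriteLine(BitConverter.ToString(encoded));
        Console.WriteLine(ebcdic.GetString(encoded));
    }
}
```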
Sample Code
The following is a sample program to convert a file from EBCDIC-US to ASCII. It should
be easy to see how to modify it to convert the other way, or to use a different
encoding (eg from EBCDIC-UK, or to UTF-8).
using System;
using System.IO;
using System.Text;
using JonSkeet.Ebcdic;

public class ConvertFile
{
    public static void Main(string[] args)
    {
        if (args.Length != 2)
        {
            Console.WriteLine
                ("Usage: ConvertFile <input file> <output file>");
            return;
        }

        string inputFile = args[0];
        string outputFile = args[1];

        // Decode with EBCDIC-US, re-encode as ASCII
        Encoding inputEncoding = EbcdicEncoding.GetEncoding ("EBCDIC-US");
        Encoding outputEncoding = Encoding.ASCII;

        try
        {
            using (StreamReader inputReader =
                       new StreamReader (inputFile, inputEncoding))
            {
                using (StreamWriter outputWriter =
                           new StreamWriter (outputFile, false, outputEncoding))
                {
                    // Copy in chunks of decoded characters until the reader is exhausted
                    char[] buffer = new char[8192];
                    int len;
                    while ( (len=inputReader.Read (buffer, 0, buffer.Length)) > 0)
                    {
                        outputWriter.Write (buffer, 0, len);
                    }
                }
            }
        }
        catch (IOException e)
        {
            Console.WriteLine ("Exception during processing: {0}", e.Message);
        }
    }
}
Limitations
Due to the lack of available information about the DBCS aspect of EBCDIC, this
encoding class makes no effort whatsoever to simulate proper shifting. Shift out and
shift in are merely encoded/decoded to/from their equivalent Unicode characters,
and bytes between them are treated as if the shift had not taken place. (This means
that decoding a byte array always yields a string of the same length as the byte array,
and vice versa.)
Any byte not recognised to be from the specific encoding being used is decoded to the
question mark character, '?'. Any character not recognised to be in the set of characters
encoded by the specific encoding being used is encoded to the byte representing the
question mark character, or to byte zero if the question mark character is not in the
character set either.
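The fallback rules above can be sketched with a pair of lookup tables. This is not the library's actual code, just a minimal illustration: the TinyTable class is hypothetical, and its five-entry map is a made-up excerpt (a real character map covers far more bytes).

```csharp
using System;
using System.Collections.Generic;

public static class TinyTable
{
    // Illustrative excerpt only: a handful of byte-to-character mappings.
    static readonly Dictionary<byte, char> ByteToChar = new Dictionary<byte, char>
    {
        { 0x88, 'h' }, { 0x85, 'e' }, { 0x93, 'l' }, { 0x96, 'o' }, { 0x6f, '?' }
    };

    static readonly Dictionary<char, byte> CharToByte = Invert(ByteToChar);

    static Dictionary<char, byte> Invert(Dictionary<byte, char> map)
    {
        Dictionary<char, byte> inverse = new Dictionary<char, byte>();
        foreach (KeyValuePair<byte, char> pair in map)
        {
            inverse[pair.Value] = pair.Key;
        }
        return inverse;
    }

    // One character per byte; unrecognised bytes become '?'.
    public static string Decode(byte[] bytes)
    {
        char[] chars = new char[bytes.Length];
        for (int i = 0; i < bytes.Length; i++)
        {
            char c;
            chars[i] = ByteToChar.TryGetValue(bytes[i], out c) ? c : '?';
        }
        return new string(chars);
    }

    // One byte per character; unrecognised characters become the byte
    // for '?', or zero if '?' itself is not in the character set.
    public static byte[] Encode(string text)
    {
        byte fallback;
        if (!CharToByte.TryGetValue('?', out fallback))
        {
            fallback = 0;
        }
        byte[] bytes = new byte[text.Length];
        for (int i = 0; i < text.Length; i++)
        {
            byte b;
            bytes[i] = CharToByte.TryGetValue(text[i], out b) ? b : fallback;
        }
        return bytes;
    }

    public static void Main()
    {
        Console.WriteLine(Decode(new byte[] { 0x88, 0x85, 0xff }));
        Console.WriteLine(BitConverter.ToString(Encode("h\u00e9llo")));
    }
}
```

Because every byte maps to exactly one character and every character to exactly one byte, the output is always the same length as the input, whether or not shift characters appear in it.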
The library doesn't currently have a strong name, so it can't be placed in the GAC. You
may, however, download the source and modify
Licence
This was just an interesting half-day project. I have no desire to make any money out
of this code whatsoever, but I hope it's interesting and useful to others. So,
feel free to use it. If you have any questions about it, or just find it useful and
wish to let me know, please mail me at skeet@pobox.com.
You may use this code in commercial projects, either in binary or source form. You
may change the namespace and the class names to suit your company, and modify
the code if you wish. I'd rather you didn't try to pass it off as your own work,
and specifically you may not sell just this code - at least not without asking me first.
I make no claims whatsoever about this code - it comes with no warranty, not even
the implied warranty of fitness for purpose, so don't sue me if it breaks something.
(Mail me instead, so we can try to stop it from happening again.)
Jon Skeet is a software engineer in Reading, England. He specialises in Java and C#,
and is considered by his colleagues to be a fount of useless information about both.
He can be reached at skeet@pobox.com, and has a C# website at http://www.pobox.com/~skeet/csharp