Automated Text File Format and Character Encoding Converter
Easily and quickly convert text file character encoding, and text file formats between TSV, CSV, XML, and more

Conversion Between Text File Formats
There are many ways to structure your data in a text file. Among them, the
CSV and tab-delimited formats are in widespread use and can be opened by many
kinds of applications, such as spreadsheet and database programs. You should avoid
constructing or editing such files by hand. One problem with tab-delimited
text files is that tabs are whitespace characters, and you may therefore easily
break the structure by replacing a tab with a space. In the case of CSV, the
comma is such a common character that the specification provides conventions
for avoiding delimiter collision,
so that a comma intended as part of the data is not interpreted as a delimiter
instead. It is thus far better to convert file formats using software like
Mergemill Pro.
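
As a minimal sketch of the delimiter collision described above, the following uses Python's standard csv module rather than Mergemill Pro. The sample names and values are made up for illustration. It shows that a field containing a comma is quoted on output and read back as a single value, while a naive split on commas breaks the record.

```python
import csv
import io

# Illustrative data only; the second record has a comma inside a field.
rows = [
    ["name", "city"],
    ["Smith, John", "Portland"],
]

# Write: csv.writer quotes any field that contains the delimiter.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()

# Read back: the quoted field comes out as one value again.
parsed = list(csv.reader(io.StringIO(csv_text)))

# A naive split on commas cuts the quoted field in two.
naive = csv_text.splitlines()[1].split(",")

print(csv_text)
print(parsed)
print(naive)
```

The same risk applies to hand editing: removing or misplacing one quotation mark changes how many fields the record has.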
XML (Extensible Markup Language) is one of the most widely used machine-readable
formats. For compatibility between database applications, it often helps to convert
tab-delimited and CSV files to XML. One important advantage, among
many, of using XML is that a file can declare the character encoding of its content.
This makes it very easy to migrate multilingual data.
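
The sketch below illustrates the idea of converting CSV to XML with a declared encoding, using only Python's standard library. It is not how Mergemill Pro does it. The function name csv_to_xml and the element names records and record are hypothetical, and the code assumes that the CSV header row contains valid XML element names.

```python
import csv
import io
import xml.etree.ElementTree as ET

def csv_to_xml(csv_text, encoding="utf-8"):
    """Convert CSV text with a header row into XML bytes in the given encoding."""
    root = ET.Element("records")  # hypothetical root element name
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = ET.SubElement(root, "record")  # hypothetical element name
        for column, value in row.items():
            # Assumes each column header is a valid XML element name.
            ET.SubElement(record, column).text = value
    # xml_declaration=True puts the encoding declaration on the first line.
    return ET.tostring(root, encoding=encoding, xml_declaration=True)

xml_bytes = csv_to_xml("name,city\nJosé,Málaga\n", encoding="iso-8859-1")
print(xml_bytes.decode("iso-8859-1"))
```

Because the declaration names the encoding, an XML parser can decode the accented characters without being told the encoding separately.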
Conversion Between Text File Character Encodings
In order to represent textual characters in a file, some sort of mapping must
be used to assign numeric values to the characters. The mapping
varies depending on the character set, which depends on the language being
used and other factors. Larger character sets, such as those that include Japanese
kanji, have more members than one byte can distinguish, so they use more bytes per character.
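
To make the mapping concrete, here is a small Python sketch that prints the numeric value assigned to a few characters and the number of bytes each one occupies in two encodings. The helper name byte_length and the sample characters are chosen for illustration only.

```python
def byte_length(character, encoding):
    """Return how many bytes one character occupies in the given encoding."""
    return len(character.encode(encoding))

for character in ["A", "漢"]:
    print(
        character,
        ord(character),                       # numeric value (code point)
        byte_length(character, "shift_jis"),  # a Japanese encoding
        byte_length(character, "utf-8"),      # a Unicode encoding
    )
```

The Latin letter fits in one byte in both encodings, while the kanji needs several.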
Interpretive problems may occur if a computer attempts to read data encoded
with a mapping different from what it expects. An example is when a Mac OS
application attempts to read a text file created on a Windows computer.
The Mac OS application may expect text to use the Mac OS Roman character set,
while the Windows file may use the Windows Latin-1 character set. So to handle
text correctly, some method of identifying the various mappings and converting
between them is necessary.
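
The mismatch described above can be reproduced in a few lines of Python. This sketch assumes that the Windows file uses code page 1252 (Python codec cp1252) and that the reader expects Mac OS Roman (Python codec mac_roman). The sample word is made up.

```python
original = "café"

# Bytes as a Windows application might save them (assumed code page 1252).
windows_bytes = original.encode("cp1252")

# Reading with the wrong mapping does not fail; it silently yields other characters.
misread = windows_bytes.decode("mac_roman")

# Reading with the mapping the file was written in recovers the text.
correct = windows_bytes.decode("cp1252")

print(repr(misread), repr(correct))
```

The plain ASCII letters survive either way, which is why such problems often go unnoticed until accented or special characters appear.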
Most character sets and character encoding schemes developed in the past are
limited in their coverage, usually supporting just one language or a small
set of languages. Multilingual software has traditionally had to implement
methods for supporting and identifying multiple character encodings.
A simpler solution is to combine the characters for all commonly used languages
and symbols into a single universal coded character set. Unicode is such a
universal coded character set, and offers the simplest solution to the problem
of text representation in multilingual systems. Because Unicode includes the
character repertoires of most common character encodings, it facilitates data
interchange with other platforms. Using Unicode, text shared across applications
and platforms can be encoded in a single coded character set.
The Mergemill Pro Advantage
Converting between common text file formats is easy with Mergemill Pro. You
simply choose to export data in CSV, XML, or tab-delimited text format. You
may also create a custom output format with no more than a few lines of script.
Converting between text encodings is even easier. Mergemill Pro lets you specify
the datafeed encoding and the output encoding, and it performs the character
encoding conversion as it generates the output. The Mergemill Pro interface elements,
internal data storage, and intermediate files created in running jobs are all
in UTF-8 Unicode.
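
For readers who want to see what an encoding conversion involves, the following Python sketch decodes a file from a stated input encoding and writes it out in a stated output encoding. It is only an illustration of the general technique, not Mergemill Pro's implementation. The function name convert_encoding and its parameters are hypothetical.

```python
from pathlib import Path

def convert_encoding(source_path, target_path, input_encoding, output_encoding="utf-8"):
    """Rewrite a text file from input_encoding to output_encoding."""
    # Work on raw bytes so that line endings pass through unchanged.
    text = Path(source_path).read_bytes().decode(input_encoding)
    # encode() raises UnicodeEncodeError if the output encoding
    # cannot represent a character found in the input.
    Path(target_path).write_bytes(text.encode(output_encoding))
```

A call such as convert_encoding("feed.txt", "feed-utf8.txt", "cp1252") would convert a Windows code page 1252 file to UTF-8. The file names here are placeholders.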
The biggest benefits of using Mergemill Pro are its automation features and
its powerful processing capabilities, which let you do far more than simple conversion.
You may set up a drop-in folder whose files Mergemill Pro processes
automatically at scheduled times.

