

The Southern California OS/2 User Group

August 2002

Some Tools for Conversion of Plain Text to HTML

by Dallas E. Legan

The purpose of this article is to document a pair of programs for the conversion of text to HTML. Their job is to convert any URLs in the text into active links, convert any characters that might be confused with HTML markup into appropriately encoded form, and in the process add a few frequently used boilerplate HTML tags. They do not do a complete job; they simply bring things to a state where the remaining work is typically just some touch-up that can easily be done with any text editor. I have no hard data, but think of it as about 90% of the work.
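The character-encoding half of that job can be sketched as follows. This is Python rather than the REXX the article describes, and the function name and its scope are illustrative assumptions; only the three entity mappings themselves are standard HTML.

```python
# Sketch: encode characters that would otherwise be parsed as HTML markup.
# The entity mappings are standard HTML; everything else here is an
# illustrative assumption, not the programs' actual code.
def encode_entities(text):
    # '&' must be replaced first, or the '&' inside '&lt;' and '&gt;'
    # would itself be re-encoded.
    for raw, entity in (("&", "&amp;"), ("<", "&lt;"), (">", "&gt;")):
        text = text.replace(raw, entity)
    return text
```

For example, `encode_entities("a < b & c")` yields `a &lt; b &amp; c`, which a browser renders back as the original text.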

These programs were originally written for putting out the SCOUG 'Download' column by Gary Wong and for my personal use, but they could easily be used for quickly putting technical documentation on a web site, or as a first step in converting a directory listing of a hard disk or removable media into HTML, to be followed by further enhancement into an HTML interface.

My thinking is that the web page author should concentrate first on the 'text,' and then move attention to the 'Hyper' part of the equation. Rather than making text obsolete, HTML enhances it, using technology to bring items that previously were too easily ignored or left out (illustrations, footnotes, bibliographic references) up to equality in ease of use with the rest of the document. The Internet's common FAQ, rather than being a new format for technical information, is an updating of the ancient 'dialog,' probably owing in part to the writings of Douglas Hofstadter and Raymond Smullyan.


These programs had their origin in a question about the sed stream editor from SCOUG member Peter Skye: whether it could be used to convert a list of URLs into HTML links. The question resulted in an article that was submitted to Skye's aborted 'Muscle Man' project, and later to 'Extended Attributes,' but was never published. Later, at a SCOUG general meeting, the website/newsletter editor complained about the difficulty of making web pages out of Gary Wong's 'Download' column. This seemed like the problem I'd given a little attention to in the unpublished sed article, and an interesting one to expand on. I decided to take a stab at doing it in REXX, partly to avoid making the user install another tool, partly to exercise some of the ideas I'd learned attending a REXX Language Association conference, and partly because REXX seemed like it might make a more versatile tool. sed is very powerful, but also extremely narrowly focused on its job as a programmable editor, and trying to add features like a help listing seemed to stretch things.


The programs originally began as one, URL2HTML, which did its basic task and converted a few characters as needed. As more features were added, the help listing gradually grew to more than an 80-column by 25-line screen, and I decided that some of the functionality could reasonably be split off into a separate program, comfortably bringing the help listing for both programs back within the bounds of 80 by 25. Logically, it was useful to keep the processing of URLs into links grouped with the conversion of problem characters into encoded form. If these were done separately, the problem would arise of accidentally converting the characters of the HTML links themselves into encoded form. As it is, the conversion can be handled while the URLs are separated from the rest of the text. I call this program URL2LINK.
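The ordering idea described above can be sketched like this: pull the URLs out of each line first, encode only the remaining text, then wrap the URLs as links, so the '<' and '>' of the generated tags are never themselves encoded. This is a Python illustration of the approach, not the REXX code; the regex and function names are assumptions.

```python
import re

# Capturing group so re.split() keeps the URLs in the result list
# (odd indices are the captured URLs).
URL_RE = re.compile(r'(https?://\S+)')

def encode(text):
    # '&' first, so entities already produced are not re-encoded.
    for raw, ent in (("&", "&amp;"), ("<", "&lt;"), (">", "&gt;")):
        text = text.replace(raw, ent)
    return text

def line_to_html(line):
    parts = URL_RE.split(line)
    out = []
    for i, part in enumerate(parts):
        if i % 2:  # a captured URL: wrap it, do not encode it
            out.append('<a href="%s">%s</a>' % (part, part))
        else:      # plain text: safe to encode
            out.append(encode(part))
    return "".join(out)
```

So `line_to_html("see http://scoug.com & more")` links the URL while the surrounding '&' is still encoded, with neither step corrupting the other.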

The other program, TXT2HTML, simply adds tags for paragraphs, line breaks, preformatted text, some generic HTML at the head and foot of the document, etc. As I got into the swing of things, I added features like the ability to insert a giant comment giving an outline of basic HTML, which can be used for cutting and pasting into the document and deleted when things are finished. Originally the header/footer feature was envisioned to be edited by the user in the program (and it still can be), but the idea of making provision for copying these from other files caught my imagination. So to the basic generic header/footer in the program's data were added some comments intended to let the user dump them out once, then edit the data to his specific needs. Once this is done, leaving these HTML marker comments in place, the file can be used as a source for cloning this information into other HTML files made with the program, minimizing the amount of editing needed on new files.
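The cloning idea amounts to copying whatever lies between a pair of marker comments in an existing HTML file. A minimal Python sketch of that, with invented marker strings (the actual programs define their own markers):

```python
# Sketch: copy the section of an HTML file bracketed by marker comments,
# markers included, so it can be pasted into a new file. The marker
# strings used by callers are illustrative, not the programs' actual ones.
def clone_section(html_lines, start_marker, end_marker):
    copying = False
    section = []
    for line in html_lines:
        if start_marker in line:
            copying = True
        if copying:
            section.append(line)
        if copying and end_marker in line:
            break
    return section
```

Run once against a hand-edited file, this lets every new page start from the same header and footer with no re-editing.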

Finally, the idea that the programs should be able to convert themselves into HTML that could be printed to a file from a browser and, with a minimum of editing, be made executable again grabbed my imagination as an ultimate test of how well they worked. I remember that back in the 1980s, in 'Computer Languages' magazine, there used to be tremendous debate over the idea that a great computer language should be able to implement its own compiler as a test of its versatility. This idea seems reasonable, until you start blindly applying it to some perfectly great special purpose languages, but it does have some merit. An excellent counter-example someone pointed out in the magazine was a special purpose language for controlling machine tools: it did its job fine, but expecting it to be useful for writing compilers was silly, and would have complicated it to unusability. Anyway, while these programs are not language compilers or interpreters by any means, the idea that they could process themselves, and still be executable after their HTML was 'interpreted' by a web browser, seemed like a good test of how cleanly they could convert text. This also simplifies distribution problems, such as differing end-of-line characters, since users can simply print the file out from a web browser and, with minimal editing, get it running.

Typically, I use:

   URL2LINK -r URL2LINK | TXT2HTML -sb  >  URL2LINK.html

where '-r' activates conversion of several characters into encoded character entities, and '-sb' sets up '<PRE>...</PRE>' tags and the basic header/footer templates respectively. As this indicates, both programs work as filters and can be used from the command line or in conjunction with any text editor that can route blocks of text through filters. This last point is useful for touching up additions to files already in HTML format. If you have a web browser that can easily toggle between viewing local files and editing them with such an editor (as Lynx can do), you have a simple web page development environment.
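The filter structure just described (read standard input, transform, write standard output) can be sketched as follows. This is Python rather than the article's REXX, and the names and the identity default transform are illustrative assumptions:

```python
import sys

# Sketch of a pipeline filter: apply a transform to each input line in
# order and return the results. Keeping the per-line logic in a plain
# function makes it testable apart from the stdin/stdout plumbing.
def run_filter(lines, transform):
    return [transform(line) for line in lines]

def main(transform=lambda line: line):
    # Reading stdin and writing stdout is what lets the program sit in a
    # shell pipeline or behind an editor's filter-through command.
    sys.stdout.writelines(run_filter(sys.stdin, transform))

if __name__ == "__main__":
    main()
```

Any program shaped this way composes with others via '|', which is exactly how the URL2LINK | TXT2HTML pipeline above works.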

Originally the intent was that the input text file would require no special markup other than that complete URLs needed to be specified, including the scheme, typically 'http://'. Because it required little additional code, provision was eventually added to allow a text label to accompany a link and be used in place of the bare URL. See the '-h' output of URL2LINK for details on this.

Requirements, Idiosyncrasies, Limitations

  • Of course, these programs require some kind of REXX interpreter to run.
  • Each URL must be entirely on one line; a URL cannot be split across two or more lines.
  • There can be more than one URL per line of text up to the recursion depth that the REXX interpreter can handle.
  • Because some of its capabilities conflict, or are seldom used in conjunction with each other, TXT2HTML simply defaults to echoing the input file to output. To make it do something to the input, you must use one of its command line options.
  • The comments used to tag header and footer HTML information for cloning must be preceded by at least one blank.
  • In order to get a print-to-file copy of the web pages running, you will have to remove the web page title, heading and some trailing text at the end of the file to clean it up slightly. You may also have to adjust the top line for OS/interpreter conventions, and (un)comment appropriate portions of code in a section labeled with comments as 'Function Corrections.'
  • The current version of the program should be able to handle several types of characters following a URL, where previous versions could only handle a limited set of punctuation. The safest separator, however, is always a space.
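The bullet above about recursion depth suggests that a line with several URLs is handled by processing one URL per call and recursing on the rest of the line. A Python sketch of that pattern (names and regex are illustrative assumptions, not the REXX code):

```python
import re

URL_RE = re.compile(r'https?://\S+')

# Sketch: convert the first URL found, then recurse on the remainder of
# the line, so each call handles exactly one URL. This is why the number
# of URLs a line can hold is bounded by the interpreter's recursion depth.
def link_line(line):
    m = URL_RE.search(line)
    if m is None:
        return line  # base case: no URLs left
    url = m.group(0)
    return (line[:m.start()]
            + '<a href="%s">%s</a>' % (url, url)
            + link_line(line[m.end():]))
```

Each recursive call sees only the text after the previous URL, so already-generated '<a>' tags are never reprocessed.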

Drilling Deeper

These programs provided a context to explore REXX. They provided opportunities to work on general solutions to recurring problems: how to handle a help display, a general way of processing command line options, initializing data, and advanced structuring of filter programs in REXX. While they have only been tested on two interpreters, references I'd encountered suggested steps to maximize portability, and these have been incorporated. In many cases this is for systems I have no immediate access to, but the steps were included anyway as a first stab if ever needed. Even with the two interpreters used, measures had to be taken to handle their peculiarities, providing information for future use.

While none of these are ultimate solutions, they should go a long way for me personally, and may help others. REXX's simplicity, originally intended to make the language easy to learn and understand, makes it a good ground for working out ideas and understanding all the problems involved.

These programs also provided me with a chance to study HTML from a different perspective. In writing them, I learned new things about the structure of URLs, basic and more involved facets of HTML, character entities, and probably many other things I cannot think of at the moment.


I've used the various versions these programs have gone through a fair amount, building much of my own web site with them. I don't know whether anyone outside the SCOUG website editor will find them of any use, but if anyone has problems, feel free to contact me and I'll try to iron them out. I've found the effort spent on them worthwhile from many points of view.

Note: Dallas continues to enhance and refine these tools, and he makes the latest versions of them available for copying from his web site.

Or, you can download zip files (as of August 7, 2002) of the REXX scripts.
Both were used in the conversion of this article from text to HTML.

The Southern California OS/2 User Group
P.O. Box 26904
Santa Ana, CA 92799-6904, USA

Copyright 2002 the Southern California OS/2 User Group. ALL RIGHTS RESERVED.

SCOUG, Warp Expo West, and Warpfest are trademarks of the Southern California OS/2 User Group. OS/2, Workplace Shell, and IBM are registered trademarks of International Business Machines Corporation. All other trademarks remain the property of their respective owners.