Structurer

By Daniel Barron

Overview

Structurer facilitates the 'Interconversion of Formulae and Structures'.

"Many databases, whilst contain useful chemical information can be accessed either by using formula or by using structures. Generally each record in the database will contain both of these, since interconversion is, in general, quite difficult."(1)

If you read only one section of this, read the explanation on how it works and then what it supports.

A commercial package is available; however, to date no proposed algorithm for this concept has actually been published.

Objectives

Report Format

This report is split into a number of sections. First the objectives and design brief (this page) followed by the Background information followed by the Results and finally the Conclusion and References & Bibliography.

The format of each section will be explained, if necessary, at the start of each section.

Any number, eg. '(23)' can be referred to in the bibliography section near the end of this report. All knowledge not referenced is either personal or research results from my work.

Background

History

There are five main reasons for the project. They are:

  1. Most importantly that there have been no previous publications regarding possible algorithms for this sort of conversion.
  2. There are large efficiency gains to be had from this method of processing chemical structures and names, in terms of both storage space and database-type retrieval, especially from a database on the Internet. Another efficiency gain from having this sort of algorithm available is that there is no need for a naming specialist to cumbersomely convert a whole database of structures, which reduces the costs involved.
  3. The spread of functionality of such an algorithm is far wider than a method such as a database. For example it is necessary only to program in the capability to handle alcohols once, but in a database it would be necessary to input all the possible structures with an alcohol group.
  4. To provide an online resource for schools, educational establishments, and other people who have access to the Internet and want to do a conversion or research into the subject.
  5. Chemists have always found it easy or acceptable to use notation for chemical structures, notably more than other disciplines of science. So the standard method for naming organic compounds (IUPAC) has developed fairly logically; logically enough, in fact, to consider attempting a computer algorithm.

Considerations

There are many, many, many problems when trying to develop such an algorithm. One needs to know not only exactly how the brain works out a solution, but also how to program each individual step and what data format the result of each step needs. One also needs to consider future expansion of the algorithm to handle more types of molecules and each of the individual exceptions.

I found that for every single step forward, or problem solved, I would find at least two more that needed to be resolved.

Just stop and consider for a moment something 'simple' like non-cyclo locant minimisation (this is where you decide which end of the hydride chain to start numbering from); to a human with a neural-net like brain, it is easy - you can just 'see' the right answer. A computer can not do that without complex neural-net simulation; you have to program each and every step in your thought process into the computer. You can try this now, or simply read on!

Literature Survey

As there is little or no published material available or even listed about previous work in this field, I cannot give an account of what has been done and published. So I will give my critical views on what is available. This section will be split into two parts; one talking about the literature available and the second about the software available.

There is only one organisation that seems to have done any research into this field and that is Beilstein(3). They have produced an MS Windows version of a program called Autonom(4), for Automatic Nomenclature.

There are a number of publications on this subject and I have listed them here:

A Computer-Oriented Linear Canonical Notational System For The Representation Of Organic Structures With Stereochemistry

AU: Agarwal_KK, Gelernter_HL

NA: Louisiana State University, Dept Comp Sci.

JN: Journal Of Chemical Information And Computer Sciences, 1994, Vol.34, No.3, pp.463-479

This publication describes a method of storing and manipulating construction information along with stereochemical information. It suggests having two separate copies of the molecular structure: one with the connectivity and atom information and the other with the stereochemical information. According to the article this system has a number of advantages over normal methods and has been successfully used for a number of years in the SYNCHEM2 organic synthesis discovery program. The stereochemical descriptors can be used in conjunction with other standard nomenclature systems to extend their range to include stereochemistry. I find the concept to be quite an obvious one, yet a good system. If my algorithm were ever to be extended I am sure that this method would be the ideal way of extending its ability into stereochemistry.

Systematic Chemical Nomenclatures In The Computer-Age

AU: Kirby_GH, Polton_DJ

NA: Hull University, Dept Comp Sci.

JN: Journal Of Chemical Information And Computer Sciences, 1993, Vol.33, No.4, pp.560-563

This paper draws upon the experience in the development of computer software which generates and analyses systematic organic chemical nomenclature and highlights some general points where nomenclature could be improved or tightened. The paper suggests that with the evolution of the software and nomenclature a new system for nomenclature could be developed which would be more logical and systematic. This new system could co-exist with current nomenclature and be supported by software to convert between the two. I am sure that this would be an interesting development, however the drastic change from IUPAC standards would need quite a push. This would lead to easier programming of a conversion algorithm such as mine.

Error-Detection, Recovery, And Repair In The Translation Of Inorganic Nomenclatures - A Study Of The Problem

AU: Ruiz_IL, Soto_JLC, Gomeznieto_MA

NA: University of Cordoba, Fac Sci, Spain

JN: Journal Of Chemical Information And Computer Sciences, 1996, Vol.36, No.1, pp.7-15

This paper discusses a method to overcome the problem with human-computer communication of errors in translation. The method is mainly concerned with lexicographic errors, since that stage of the translation process is completely independent of the model employed and might thus be useful in a more general sense. The information given is not really relevant to my project; however, if it were ever extended to convert IUPAC standard organic names to displayed structure, then it would be of importance.

Augmenting Connectivity Information By Compound Name Parsing - Automatic Assignment Of Stereochemistry And Isotope Labelling

AU: Ihlenfeldt_WD, Gasteiger_J

NA: University of Erlangen Nurnberg, Inst Organ Chem

JN: Journal Of Chemical Information And Computer Sciences, 1995, Vol.35, No.4, pp.663-674

Compound names found in the catalogues of fine chemical manufacturers generally contain information over and above the basic connectivity; however, the version of the catalogue distributed for computers usually contains only the basic connectivity. This paper describes algorithms and heuristics which extract the extra structural information, such as stereochemistry, from the name. This is not relevant here, but could become so should conversion from an IUPAC name to a displayed structure ever be attempted.

Nomenclature And Coding Of Fullerenes

AU: Babic_D, Balaban_AT, Klein_DJ

NA: Rudjer Boskovic Inst, Galveston

JN: Journal Of Chemical Information And Computer Sciences, 1995, Vol.35, No.3, pp.515-526

This paper explains several methods for nomenclature of fullerenes but the method illustrated most is the one where the name is described by a spiral which goes through each of the carbons. The numbering is taken from this. For my project this paper is just an interesting read and one would hope that it would be possible to code support for fullerenes into my algorithm one day.

Algorithm For Selecting The Parent Structural Unit Of A Ring Chain Assembly

AU: Davidson_S

NA: Comp DataSyst Inc, Rockville, MD

JN: Journal Of Chemical Information And Computer Sciences, 1992, Vol.32, No.3, pp.215-221

This paper discusses the method for selecting the parent hydride(5) in ring-chain assemblies. It reports on a recently improved method for uniquely identifying alkanes which defines the parent structure as the chain with the least complex side chains. This definition readily extends to ring-chain assemblies when rings are included, applies regardless of parent-unit size and is consistent with the IUPAC guidelines and examples given, such as the preference for diphenylmethane over benzyl benzene in naming Ph-CH2-Ph. An iterative procedure for selecting the parent unit and a simple method for linking units are described as part of a general skeleton-naming computer program. This paper is wholly significant to my project; however, for simplicity in the design of the prototype, I use the largest chain or ring as the parent hydride.

Software Review

This section of the literature search takes a critical look at Autonom using the publications that I could find. Due to space restrictions and its relative importance this review will be brief.

Version 1 of Autonom(6) was first released in December 1991 and ran on a PC under DOS.

Version 2 of Autonom(7) was first released in 1994 and runs on a PC under Windows and on a Mac under System 7.

It is a very comprehensive package but the interface is decidedly shoddy, especially the interface for version 1. It works in the same way as my algorithm; first find the parent hydride and then find the functional groups and put the name together according to IUPAC hierarchy. There is not much really to say other than "it works". None of its literature gives any additional information on how it actually works.

Publications in which one can find more information about, and reviews of, Autonom are listed here with the required details:

JN: Beilstein newsletter, [01.91], pp.2-3

Out of date

TI: Evaluation And Use Of Autonom, A Systematic Naming Software

JN: Abstracts Of Papers Of The American Chemical Society, 1993 Vol.206, No.Pt1, pp.108-COMP

Out of date

TI: Autonom, Version 1.0

JN: Journal Of The American Chemical Society, 1992, Vol.114, No.26, p.10680

Out of date

TI: Autonom - System For Computer Translation Of Structural Diagrams Into IUPAC-Compatible Names .2. Nomenclature Of Chains And Rings

JN: Journal Of Chemical Information And Computer Sciences, 1991, Vol.31, No.2, pp.216-225

Contains relevant material. Not completely out of date.

TI: Autonom - System For Computer Translation Of Structural Diagrams Into IUPAC-Compatible Names .1. General Design

JN: Journal Of Chemical Information And Computer Sciences, 1990, Vol.30, No.3, pp.324-332

Contains relevant material. Not completely out of date.

Results

This section reports on how the algorithm works. First I shall explain the overall idea behind the algorithm and how all the parts fit together. Then I shall explain how each part works and give examples as appropriate. Figures will only be given a reference number if required. To discuss the results, where a particular section gave me problems I will explain how I solved them. As well as discussing each part I will consider future expansion and improvement. After all the results I shall tabulate and explain the capabilities and limitations, giving examples.

Overview

The way my algorithm works is by breaking the problem down into stages. It first examines the connection table to determine what functional groups are present and how they are connected. Then it collects together the groups in the rough order in which they will be named and does further processing to convert the result into a better format from which the actual name can be derived. Once the name is decided a further set of rules is applied, such as whether or not to add an 'e' at the end of the unsaturation. The process is depicted in the diagram below:

Figure 1

I will assume that the reader understands all the principles and rules associated with simple IUPAC naming; however, should I feel that a certain point necessitates further elaboration I shall provide it. Due to the way the report is written, it is essential to read the whole of it through in sequence without skipping any part, as information vital to understanding a later part may otherwise be missed.

All sections are in order of execution.

The Input

My algorithm requires a very simple input; a dual connection table of the atoms and their bonds. One array for the atom connections and type and one for the bonds. The connection table format is not needed for the reader to understand how the algorithm works, but for those technically minded who want to write a replacement interface, you can email me. Hydrogen is not included in the connection table and assumed to take up any spare valency.
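As an illustration only, a dual connection table of this kind could be sketched in Python as follows. The real table format is not given in this report, so the names `Atom`, `Bond` and `implicit_hydrogens`, the `claimed` flag and the valency table are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical layout: the report does not specify the real table format.
@dataclass
class Atom:
    element: str           # e.g. "C", "N", "O"
    claimed: bool = False  # set once a fragment finder has used this atom

@dataclass
class Bond:
    a: int      # index of the first atom in the atom array
    b: int      # index of the second atom in the atom array
    order: int  # 1 = single, 2 = double, 3 = triple

# Usual valencies, assumed here for illustration only.
VALENCY = {"C": 4, "N": 3, "O": 2, "S": 2, "F": 1, "Cl": 1, "Br": 1, "I": 1}

def implicit_hydrogens(index, atoms, bonds):
    """Hydrogen is not stored; it fills whatever valency the bonds leave spare."""
    used = sum(b.order for b in bonds if index in (b.a, b.b))
    return VALENCY[atoms[index].element] - used

# Ethanol, C-C-O, with the hydrogens left implicit.
atoms = [Atom("C"), Atom("C"), Atom("O")]
bonds = [Bond(0, 1, 1), Bond(1, 2, 1)]
```

With this layout the hydrogens never need to be entered: the first carbon of ethanol has three spare valencies, the second has two and the oxygen has one.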

Step 1 (as on figure 1) Start looking for functional groups - Cyanides etc.

It is important to find functional groups which contain carbon first so that the carbons are not taken to be part of a carbon ring or chain. One such example is cyanide (CN). The algorithm (from now on known as Structurer) searches for a 'hetero chain' (8) consisting of a carbon triply bonded to a nitrogen. When found, it marks the atoms in the input table as claimed and puts an entry into a free space in a new array, with information about the fragment type and which atoms (i.e. which ones in the input array) it consists of. This is a common procedure and so I have written a hetero finding tool, which uses recursion, into Structurer.
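A recursive hetero-chain finder of this general kind might look like the sketch below. It is not Structurer's actual code: the function names, the tuple form of the bonds and the use of a Python set for the 'claimed' marks are assumptions. The pattern is given as a list of elements plus the bond orders that must join them, so the same tool can be pointed at different groups.

```python
def neighbours(bonds, i):
    """Yield (neighbour index, bond order) for every bond touching atom i."""
    for a, b, order in bonds:
        if a == i:
            yield b, order
        elif b == i:
            yield a, order

def match_from(atoms, bonds, claimed, path, elements, orders):
    """Recursively extend path until it spells out elements joined by orders."""
    if len(path) == len(elements):
        return path
    want_element = elements[len(path)]
    want_order = orders[len(path) - 1]
    for nxt, order in neighbours(bonds, path[-1]):
        if (nxt not in path and nxt not in claimed
                and atoms[nxt] == want_element and order == want_order):
            found = match_from(atoms, bonds, claimed, path + [nxt],
                               elements, orders)
            if found:
                return found
    return None

def find_hetero_chain(atoms, bonds, claimed, elements, orders):
    """Find one unclaimed match, mark its atoms as claimed and return them."""
    for start, element in enumerate(atoms):
        if element == elements[0] and start not in claimed:
            found = match_from(atoms, bonds, claimed, [start], elements, orders)
            if found:
                claimed.update(found)
                return found
    return None

# Ethanenitrile, C-C#N: atoms as element symbols, bonds as (a, b, order).
atoms = ["C", "C", "N"]
bonds = [(0, 1, 1), (1, 2, 3)]
claimed = set()
nitrile = find_hetero_chain(atoms, bonds, claimed, ["C", "N"], [3])
```

Because the matched atoms are added to `claimed`, a later ring or chain search that skips claimed atoms will not treat the nitrile carbon as part of a carbon chain.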

Continue looking for functional groups - Nitroso + Carboxylic Acids

This works in the same way as the cyanide functional group finder; it looks for O-C=O. However, the carboxylic acid finder only looks for them if there are enough carbons, as a single-carbon acid is not named as a carboxylic acid: it is called methan-1-oic acid. A later function looks for -oic acid-type functional groups. Note: the difference is in the name, not the chemistry.

Start to look for carbon structures - rings

In order to find homo-atom carbon ring structures (as well as other types of homo-atom rings) I have developed a general purpose tool for doing this. The way the tool works is by finding a start atom in the input array with the correct type, and then it recursively tries routes along the connections of atoms of the same type. It continues until it reaches the start atom. If this ring is the smallest then it is stored in a temporary result array and bond array. To make sure that each individual ring is found it tries to find the smallest ring first and works its way up. This is a start towards handling fused ring systems, which Structurer does not currently support.
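The recursive search described above can be sketched roughly as follows. This is a simplified illustration, not the tool itself: it returns only the single smallest ring, ignores the 'claimed' marks, and the adjacency-list input and the name `smallest_ring` are assumptions.

```python
def smallest_ring(atoms, adjacency, element):
    """Return the smallest ring made only of `element` atoms, or None.

    atoms: list of element symbols; adjacency: dict of atom index -> neighbours.
    """
    best = None

    def walk(path):
        nonlocal best
        if best is not None and len(path) >= len(best):
            return  # this route cannot beat the smallest ring found so far
        for nxt in adjacency[path[-1]]:
            if atoms[nxt] != element:
                continue
            if nxt == path[0] and len(path) >= 3:
                best = list(path)   # the route has closed back on the start atom
            elif nxt not in path:
                walk(path + [nxt])  # try the route through this neighbour

    for start, el in enumerate(atoms):
        if el == element:
            walk([start])
    return best

# Methylcyclopropane: ring atoms 0, 1, 2 with a methyl carbon (3) on atom 0.
atoms = ["C", "C", "C", "C"]
adjacency = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
ring = smallest_ring(atoms, adjacency, "C")
```

The methyl carbon is visited during the search but never closes a ring, so only the three ring atoms are returned.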

Continue finding rings - Nitrogen rings, Silicon rings, Phosphorus rings

This does the same as with carbon rings above, but with nitrogen, silicon and phosphorus. The naming of these non-carbon hydrides is supported. You can now see why IUPAC uses the term hydride rather than hydrocarbon.

Continue finding carbon structures - chains

In order to find homo-atom carbon chain structures (as well as other types of homo-atom chains) I have developed a general purpose tool for doing this. The way the tool works is by finding a start atom in the input array with the correct type, and then it recursively tries routes along the connections of atoms of the same type. It continues until it reaches the end of a chain of that type of unclaimed atom. If this chain is the largest chain connected to a claimed ring or chain atom then it is stored. If no chains are connected to a claimed ring or chain atom and the found chain is the largest it is stored.

To make sure that each individual chain is found it tries to find the largest chain first and works its way down. It is important to give searching priority to chains attached to claimed chain or ring atoms because otherwise incorrect results will be produced. See example here:


The left hand diagram is without attached priority which gives 2 fragments and an incorrect result. The right hand diagram is with attached priority which gives 3 fragments and a correct result.

Find the nitrogens - Amines & Imines

For functional groups such as amines & imines I chose to write specific searching routines. This part works by, again, looking through the input array for atoms of the required type. The required type this time is nitrogen, of course. After finding a nitrogen atom that is not marked as 'claimed' in the array, it looks at the connection information about the atom and calculates how many other atoms it is connected to, how many bonds it has attached to it and what the atoms are that it is attached to. From this information it works out what type of functional group it is. The two it recognises are amines (-NH2: 1 bond & 1 connection) and imines (=NH: 2 bonds & 1 connection). Once the functional group has been located it is stored.
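The bond-and-connection test can be illustrated with a few lines of Python. The function name and the tuple form of the bonds are assumptions made for the example; only the two counting rules come from the description above.

```python
def classify_nitrogen(index, bonds):
    """Return 'amine', 'imine' or None for the nitrogen atom at `index`.

    bonds is a list of (atom a, atom b, bond order) tuples.
    """
    orders = [order for a, b, order in bonds if index in (a, b)]
    connections = len(orders)  # how many other atoms the nitrogen touches
    total_bonds = sum(orders)  # a double bond counts as two bonds
    if connections == 1 and total_bonds == 1:
        return "amine"
    if connections == 1 and total_bonds == 2:
        return "imine"
    return None  # e.g. a nitrogen in the middle of a chain

ethanamine = [(0, 1, 1), (1, 2, 1)]  # C-C-N, the nitrogen is atom 2
ethanimine = [(0, 1, 1), (1, 2, 2)]  # C-C=N, the nitrogen is atom 2
```

A nitrogen with two or more neighbours falls through to `None`, which leaves it free to be picked up later by the nitrogen chain and ring finders.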

Continue finding chains - Nitrogen chains, Silicon chains, Phosphorus chains

Nitrogen chains must be found after searching for functional groups such as imines and amines otherwise the chain finding tool will think that they are methaazanes(9). Nitrogen, silicon and phosphorus chains are found in the same way as carbon chains.

Now find the other types of functional groups - Alcohols, Ketones, Aldehydes, Thiols

Carbonyl groups on the end of chains are known as aldehydes, and so a filtering routine is applied later on, once there is enough information to determine whether the ketone is actually on the end of a chain. These four types are found in a very similar way to the part which finds amines and imines. Alcohols (-OH) are oxygens singly bonded to a hydride. Ketones (=O) are oxygens doubly bonded to a hydride. Thiols (-SH) are sulfurs singly bonded to a hydride.

Now find more types of functional groups - Halogens

Halogens are simple to name and find. They are just single atoms singly bonded to a hydride. The names are prefixes 'iodo-, fluoro-, chloro- and bromo-'. This part looks through the input array for atoms of the type iodine, chlorine, bromine and fluorine. If it finds one that is not claimed then it assumes it to be singly bonded and stores the information.

Entering step 2 (as on figure 1) Find Connectivity

At this point it is time to do some processing on the results given in the new array (to be known from now onwards as array 2). What we have is a list of groups and which atoms from the input array are used in each group. What we need is a connection table of just fragments. In order to make this Structurer brings together the information about the connectivity of individual atoms (in the input array) and the information about which atoms make up the groups. The output is a new array. We shall call this array array 3.

It works by taking each fragment in array 2 in turn and going along its atoms to see whether there is a connection between one of them and an atom of another fragment. The other fragment can be a chain, a ring or a functional group such as a chlorine or an alcohol. The result is put into a free space in array 3. The result is similar to array 2, except that rather than holding a list of fragments with the atoms that the fragments are made of, it holds a list of fragments with the connections between each fragment and the others. This is the only routine that creates the entries in array 3.
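One way to picture the step from array 2 to array 3 is the sketch below. The data shapes are assumptions: each fragment is a (kind, atom list) pair, each bond an (a, b, order) tuple, positions within a fragment are 0-based, and every bonded atom is assumed to have been claimed by some fragment already.

```python
def fragment_connections(fragments, bonds):
    """Build a fragment-level connection table from atom-level bonds.

    Returns, for each fragment, a list of
    (position within this fragment, index of the other fragment, bond order).
    """
    owner = {}  # atom index -> (fragment index, position within the fragment)
    for f, (_, members) in enumerate(fragments):
        for position, atom in enumerate(members):
            owner[atom] = (f, position)

    table = [[] for _ in fragments]
    for a, b, order in bonds:
        (fa, pa), (fb, pb) = owner[a], owner[b]
        if fa != fb:  # bonds inside one fragment are not connections
            table[fa].append((pa, fb, order))
            table[fb].append((pb, fa, order))
    return table

# 2-chloropropane: carbons 0-1-2 with a chlorine (atom 3) on the middle carbon.
bonds = [(0, 1, 1), (1, 2, 1), (1, 3, 1)]
fragments = [("carbon chain", [0, 1, 2]), ("chlorine", [3])]
array3 = fragment_connections(fragments, bonds)
```

Here the chain's entry records that its position 1 is joined to fragment 1, and the chlorine's entry records the reverse link.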

Check ketones for aldehydes

Now that there is enough information, i.e. the fragment connections are known, the algorithm can look for carbonyls at the end of chains. These are not referred to as ketones, but as aldehydes. It works by looking through array 3 for ketones and, if one is connected to a chain at its start or end, changing its type to an aldehyde. This part does actually write to array 3, but only to change the fragment type.
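As a rough sketch of this filter, using the ketone and aldehyde values from the zone table given later in this report: the list-of-pairs layout, the 1-based positions and the function name are assumptions for the example, not the layout of array 3.

```python
KETONE, ALDEHYDE = 4350, 4400  # group values taken from the report's zone table

def promote_aldehydes(groups, chain_length):
    """Rewrite ketones sitting on either end of a chain as aldehydes, in place.

    groups: list of [group value, 1-based position on the chain].
    """
    for group in groups:
        if group[0] == KETONE and group[1] in (1, chain_length):
            group[0] = ALDEHYDE  # only the type changes, never the position
    return groups

butanal = promote_aldehydes([[KETONE, 1]], 4)    # carbonyl on the end carbon
butanone = promote_aldehydes([[KETONE, 2]], 4)   # carbonyl inside the chain
```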

Find carboxylic acids from aldehydes and alcohols

Aldehydes with an alcohol group on the same chain atom are known as carboxylic acids. The earlier function found carboxylic acids by using the hetero-chain finding tool. This part uses a different tool. Also this part is for single carbon atom molecules giving the name -oic acid which does not include the carbon atom in the name, unlike -carboxylic acid. From the results in array 3, it is easy to find carboxylic acids. Carboxylic acids are not found directly attached on rings due to valency limitations. It works by using a tool which searches through array 3 for two particular given groups on the same chain atom and deletes one and changes the other to another given group.

Explanation of the variable values and zones

At this point it is time to explain the values of the group name integers and the zone system I have designed for consistency throughout Structurer. Without it many parts would be far more complex than needed. The idea is that different parts of the molecule have different numerical values; for example, characteristic group(10) type groups all have a value between 4000 and 5000. Each type of group, e.g. the characteristic groups, is numerically ordered in order of priority. E.g. a carboxylic acid group has the largest numerical value as it is the type of characteristic group that has priority over most. Also the zones are numerically ordered into the order in which they would be displayed, i.e.

<functional groups><parent hydride><unsaturation><characteristic groups>

Below is a table showing all the known groups and fragments and also the values and which zone they are in:

Zone 1                Zone 2                  Zone 3                Zone 4
1000-1999             2000-2999               3000-3999             4000-4999
functional groups     parent hydrides         unsaturation          characteristic groups
hydrogen      1001    silicon chain     2200  triple bond     3500  imine           4200
carbon        1006    phosphorus chain  2220  double bond     3600  amine           4250
nitrogen      1007    nitrogen chain    2240  multiparent bd  3999  thiol           4275
oxygen        1008    silicon ring      2300                        alcohol         4300
fluorine      1009    phosphorus ring   2320                        ketone          4350
silicon       1014    nitrogen ring     2340                        aldehyde        4400
phosphorus    1015    carbon chain      2400                        carboxylic ad.  4800
sulfur        1016    carbon ring       2500                        -oic acid       4801
chlorine      1017
bromine       1035
iodine        1053
complex subs  1999

Note:

complex subs = general term for complex chain/ring substituents

multiparent bd = refers to bond between substituent & parent hydride
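To show why the zone system keeps later code simple, here is a small sketch. The function name `zone_of` and the sample list are made up for the example; the values and zone names are the ones tabulated above.

```python
ZONES = {1: "functional groups", 2: "parent hydrides",
         3: "unsaturation", 4: "characteristic groups"}

def zone_of(value):
    """Return the zone number of a group value, e.g. 4300 (alcohol) is zone 4."""
    return value // 1000

# alcohol, carbon chain, chlorine, double bond, in no particular order
groups = [4300, 2400, 1017, 3600]

# A plain numerical sort puts the groups into the order the zones are displayed.
ordered = sorted(groups)
zone_names = [ZONES[zone_of(v)] for v in ordered]
```

Since both the zone and the priority within a zone are encoded in a single integer, one integer comparison is enough to order any two groups.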

Decide which is the most important fragment - i.e. parent hydride

In order to name an organic molecule one needs to find the parent hydride. My parent hydride finding routine is very very basic, but works for the simple purposes that the prototype algorithm needs. When Structurer is improved then this would be one of the sections that needed improving or completely rewriting.

The way my parent hydride finding routine works is by looking for different types of groups in order of priority and length. I.e. largest of the highest priority found. It looks first for the longest and works its way down in size of carbon rings and then carbon chains etc. It uses a fragment type finding tool which looks through array 3 for entries of the type specified.
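Read literally, 'largest of the highest priority found' could be sketched as below. The (type value, size) pairs and the function name are assumptions, and the sketch takes the priority of a hydride type to be its value in the zone table, which is one reading of the description rather than something the report states outright.

```python
def pick_parent_hydride(fragments):
    """Return the index of the parent hydride, or None if there is no hydride.

    fragments: list of (type value, number of atoms); hydrides are 2000-2999.
    """
    candidates = [(value, size, i)
                  for i, (value, size) in enumerate(fragments)
                  if 2000 <= value <= 2999]
    if not candidates:
        return None
    # Highest-priority type first, then the largest fragment of that type.
    value, size, index = max(candidates, key=lambda c: (c[0], c[1]))
    return index

# carbon chain of 6, carbon ring of 5, carbon chain of 2, one alcohol
fragments = [(2400, 6), (2500, 5), (2400, 2), (4300, 1)]
parent = pick_parent_hydride(fragments)
```

Under this reading the five-membered carbon ring is chosen ahead of the longer carbon chain, because rings are searched for before chains.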

Re-do the numbering on chains and rings

"Starting to get complicated now". When numbering the atoms in sub chains and rings there is a rule. The rule is that all sub chains or rings have to have the atoms numbered with the minimum number (usually 1) attached to the parent hydride. Eg.:

So for a chain all you do, if a sub chain is not connected to the atom numbered 1, is flip the numbering from one end to the other. But, what happens when you encounter a sub ring? The IUPAC does not seem to mention this case, but following the logic and rules one would gather that you apply the rules of minimising locants. To describe this better have a look at this example:


Of course the ring numbering would have to be rotated, if necessary, first before flipping the ring numbering to get the minimum locants. My algorithm handles all this.

The way it works is by recursively going along the main fragment (parent hydride) entry and its sub chains and rings. As it goes along it checks that the atom number of the first attached atom in a sub chain or ring is 1, and if not flips and rotates the numbering as appropriate. It has a main procedure which starts the correct recursing sub-procedure. There are two sub-procedures: one for chains and one for rings. The rings are far more complex.

When it gets to a chain it checks that the chain has atom numbered 1 attached, if not it goes along the entry for that chain and flips the position numbers so that they are correct and it also corrects the references to the position in all the fragment entries which have a connection to it.

When the routine gets to a ring it does the same as for a chain, except for the actual method for changing the numbering. It rotates the numbers as many times as is necessary until atom 1 is the connecting atom and then it applies the rules of locant minimisation to decide whether or not to flip the numbering to obtain the lowest locants.

To obtain the minimum locants it actually extracts the positions first before any altering has been done (as is suggested in the previous paragraph) and rotates them by the calculated amount and stores the numbers in one array. Then it copies this array and flips the locants numbering and compares it with the previous. If the second set is lower, then it sets a flag to TRUE and returns it to the sub-procedure that called it to say 'yes flip the ring numbering'.
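The rotate-then-compare idea for a sub ring can be sketched as follows. This is an illustration under stated assumptions rather than Structurer's routine: locants are 1-based, only one class of substituent is considered, and the function names are invented for the example.

```python
def rotate(locants, ring_size, attach):
    """Renumber so that the atom currently numbered `attach` becomes atom 1."""
    return sorted((p - attach) % ring_size + 1 for p in locants)

def flip(locants, ring_size):
    """Number the ring in the opposite direction, keeping atom 1 where it is."""
    return sorted((ring_size + 1 - p) % ring_size + 1 for p in locants)

def renumber_sub_ring(locants, ring_size, attach):
    """Return (final locants, whether the ring numbering should be flipped)."""
    rotated = rotate(locants, ring_size, attach)
    flipped = flip(rotated, ring_size)
    should_flip = flipped < rotated  # the flag returned to the caller
    return (flipped if should_flip else rotated), should_flip

# Six-membered sub ring joined to the parent at its atom 3, substituent on atom 2.
result = renumber_sub_ring([2], 6, 3)
```

After rotation the substituent sits on atom 6; flipping brings it down to atom 2, so the flag comes back as True.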

Different types of 'groups' have different priorities. In this case the double bonds are given priority for minimum locants and then the triple bonds and then all the functional groups. See the worked example here:

Now entering stage 3 - Naming, part 1

To do the naming, Structurer has two array stages and one further stage to do the actual real naming. The second two are discussed later. The first takes all the functional groups, unsaturation, parent hydrides (and sub hydrides) and puts them into a new integer array called name array. Sub-hydrides are stored in separate entries and are structured recursively.

It works by recursively going along array 3 (starting from the parent hydride) and first inserting entries for the parent hydride, then the unsaturation and finally the functional groups. There is one main entry in name array per sub chain or ring. Then it sorts the entries in the main entry in order of category first and then position in the category. To sort it Structurer rebuilds the whole entry starting from the smallest. Once the smallest is found it is copied into the new ordered main array entry and the original copy is wiped. This procedure is repeated until no entries remain.

Entering step 3 (as on figure 1) Part 2a of naming

This second stage of naming has four parts. The other three are discussed later. This part of the second stage is where name array gets processed and output to name integer array 2. The processing involves recursively searching (starting from the parent hydride) for multi-groups, that is, groups that are the same and are on the same chain or ring. For example you would not name 1,1,1-trichloroethane as 1-chloro-1-chloro-1-chloroethane. The processing also provides a better format from which to do the actual naming.

It works by going along the entries for the parent hydride entry and finding groups that are the same and writing the results into name array 2. It does this by taking the first entry in the given entry and comparing it with all the other entries and storing the positions (on the chain or ring) of the ones that are the same and blanking their entry. When it has the results from this search it takes the next valid entry and does the same again. Once Structurer has done this for all the entries, it goes through the newly created main entry in name array 2 and recursively does the same for all the sub-hydride entries in order to find multiplicity of complex substituents.
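For simple substituents, collecting identical groups can be illustrated like this. The (locant, name) pairs, the function names and the short prefix table are assumptions for the example; the recursive handling of sub-hydrides is left out.

```python
PREFIXES = {1: "", 2: "di", 3: "tri", 4: "tetra", 5: "penta", 6: "hexa"}

def merge_identical(groups):
    """Collect the locants of identical groups: (locant, name) -> (locants, name)."""
    merged = {}
    for locant, name in groups:
        merged.setdefault(name, []).append(locant)
    return [(locants, name) for name, locants in merged.items()]

def render(merged):
    """Write each merged group as locants, multiplying prefix, then name."""
    parts = []
    for locants, name in merged:
        numbers = ",".join(str(n) for n in sorted(locants))
        parts.append(f"{numbers}-{PREFIXES[len(locants)]}{name}")
    return "-".join(parts)

chloros = merge_identical([(1, "chloro"), (1, "chloro"), (1, "chloro")])
name = render(chloros) + "ethane"
```

The three separate chloro entries collapse into one entry carrying three locants, which is what allows the 'tri' prefix to be chosen.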

The way Structurer works out whether multiple entries are the same is by comparing the chains or rings referred to, going along both at the same time and using recursive calls to itself to compare each individual group on the sub-hydride (including the hydride itself). If any of them differ whatsoever then FALSE is returned. A difference includes differing groups, a differing bond to the parent hydride, or differing positions of any groups. If the sub-hydrides (and their sub-hydrides, if any) are identical then TRUE is returned.

Part 2b of naming - getting the characteristic group

In order to name the molecule the algorithm needs to know what the characteristic group(10) is. The characteristic group is the functional group by which the organic compound is characterised. When an organic compound is named according to IUPAC, it can have many prefixes, but only one suffix (apart from unsaturation). Finding the characteristic group is simple, and although Structurer does it simply, this routine, unlike the parent hydride finding routine, should not need any enhancing.

It works by going along the sub-entries in the main entry for the parent hydride and finding the largest numerically valued characteristic group and returning that. It could just look for the last entry in the main entry as they are sorted by an earlier procedure, however for future expansion, and compatibility, it does not. Assumptions are not always a good thing in programming.
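In outline, and assuming the sub-entries are available as a plain list of group values (the function name is also an assumption), the search amounts to the following.

```python
def characteristic_group(entries):
    """Return the highest-valued zone 4 entry (4000-4999), or None if none exist."""
    zone4 = [value for value in entries if 4000 <= value <= 4999]
    return max(zone4) if zone4 else None

# chlorine, alcohol, double bond, carboxylic acid, deliberately left unsorted
found = characteristic_group([1017, 4300, 3600, 4800])
```

Because it scans every entry rather than trusting the last one, the result does not depend on the entries having been sorted beforehand.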

Entering step 5 (as on figure 1) Part 2c of naming - locant minimisation

When naming an organic compound the IUPAC standard is to try and minimise the locants. There is an order of priority; the minimum locants are determined for the most important functional group(s) of the parent hydride and are used from that. The order of preference, or importance, is shown here:

  1. Characteristic group
  2. Double bonds
  3. Triple bonds
  4. Substituents as prefixes

Comparison of numbers is not additive, but 'incremental': the sets are compared number by number, and the first point of difference decides. E.g. 4,5,6,7 is less than 5,6,7,8, which in turn is less than 5,7,8,9. If the locants for the characteristic group(s) are the same whether flipped or not then the preference goes to the next one down on the list and this process is repeated until a set of locants numbering from one way is less than a set of locants numbering from the other end. The process of flipping is shown here again:
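This kind of comparison is what Python does when it compares two lists, so the rule can be demonstrated in a few lines. The function name is an assumption made for the example.

```python
def lower_locants(a, b):
    """Return whichever locant set is lower at the first point of difference."""
    a, b = sorted(a), sorted(b)
    return a if a <= b else b  # lists compare element by element, left to right

# The sums are 12 and 9, yet 1,5,6 wins because 1 is lower than 2.
winner = lower_locants([2, 3, 4], [1, 5, 6])
```

An additive comparison would have picked 2,3,4 here, which is why the totals must not be used.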

The locant minimisation for sub rings has already been done and the re-numbering (flipping) of sub chains has also been done so this procedure is only concerned with the parent hydride.

When minimising locants on a ring parent hydride, the locants not only can be numerically flipped, they can also be rotated. See the example below:

It works by first copying the main entry for the parent hydride into a single dimension array for easy access. It then tries to find the 'best way' for the different functional groups (and unsaturation) locants in order of priority as discussed before.

The way Structurer's cyclo-locant minimisation works is by first extracting the locant values for the type(s) of functional group for which it has been instructed to find the 'best way' of numbering the ring. Then it makes a logically ordered array full of differently adjusted sets of locants. It calculates every single possible way of numbering the locants. Each different set has an integer stored at the end of the set to tell the computer what flipping and rotating was done to obtain that particular set. After creating the sets, the procedure goes through them to find the lowest and uses the information at the end of the set to rewrite the locant positions in name array 2. However, if there is more than one way to rewrite the locants it passes the remaining transforms to itself again, but this time with the next priority substituents, and so on.

Cyclo-locant minimisation is by far the most complex part of the entire project. This is now my 3rd attempt at producing a bug free implementation. The one I used for my 3rd year thesis made mistakes in a large proportion of ring minimisations. The 'fixed' version that was released on the cover of Computer Shopper magazine was usually right except in a few cases. I believe that I have finally made it work in 100% of cases, in almost a complete rewrite compared to the 3rd year thesis version. This is the part I am most proud of, the jewel in the crown of Structurer.
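The 'try every numbering, then break ties with the next priority' idea might be sketched as below. This is not the Structurer routine: it handles locants on atoms only (not bond locants for unsaturation), uses 1-based locants, loops over the priority classes instead of recursing, and all the names are assumptions.

```python
def renumber(p, ring_size, shift, flipped):
    """Map 1-based locant p under a rotation by `shift` and an optional flip."""
    q = (p - 1 + shift) % ring_size      # rotate, working 0-based
    if flipped:
        q = (ring_size - q) % ring_size  # reverse direction, atom 1 stays put
    return q + 1

def minimise_ring_locants(ring_size, classes):
    """classes: lists of locants, highest priority first. Returns them renumbered."""
    # Every possible numbering: each rotation, in both directions.
    transforms = [(s, f) for s in range(ring_size) for f in (False, True)]
    for locants in classes:
        def apply(t):
            return sorted(renumber(p, ring_size, *t) for p in locants)
        best = min(apply(t) for t in transforms)
        # Keep only the numberings that give the lowest set for this class.
        transforms = [t for t in transforms if apply(t) == best]
        if len(transforms) == 1:
            break  # no tie left for the lower priorities to settle
    shift, flipped = transforms[0]
    return [sorted(renumber(p, ring_size, shift, flipped) for p in locants)
            for locants in classes]

# Six-membered ring: characteristic group on atom 4, a prefix substituent on atom 6.
result = minimise_ring_locants(6, [[4], [6]])
```

Two numberings put the characteristic group on atom 1; the prefix substituent then decides between them, giving locant 3 rather than 5.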

Hydride chain locant minimisation is very simple and works in the same way as in the 'Redo numbering section'. It extracts the locants for the functional group range, makes a copy of the set, flips the copy and compares the two sets. It then either alters the locants for the parent hydride, if necessary, or tries the next group in the priority list.
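As a rough illustration of the flip-and-compare step just described, here is a minimal Python sketch. It is not Structurer's code: the function names and the list-of-lists layout are hypothetical, and only functional-group locants are handled. The flip uses the report's functional-group equation (new locant value = length of chain - old locant value); exactly what 'length' counts is not spelled out in the report, so the sketch simply takes it as a parameter.

```python
def flip_chain(locants, length):
    """Flip functional-group locants using: new = length - old."""
    return sorted(length - old for old in locants)


def should_flip(groups_by_priority, length):
    """Decide whether flipping the chain numbering gives lower locants.

    groups_by_priority: list of locant lists, highest-priority group first
    (a hypothetical layout, not Structurer's name array).
    """
    for locants in groups_by_priority:
        original = sorted(locants)
        flipped = flip_chain(locants, length)
        if flipped < original:
            return True   # flipped numbering is lower: renumber the chain
        if original < flipped:
            return False  # existing numbering is already the lowest
        # Tie: fall through to the next group in the priority list.
    return False  # every group tied, so either numbering will do
```

Python compares lists element by element, which gives the 'lowest set of locants' comparison directly.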

Structurer uses the following equations for flipping locant values on a chain:

  functional groups:  new locant value = length of chain - old locant value
  unsaturation:       new locant value = length of chain - old locant value - 1

For flipping of locant values on a ring:

  functional groups:  new locant value = (length of ring - old locant value - 1) MOD length of ring
  unsaturation:       new locant value = (length of ring - old locant value - 1) MOD length of ring

For rotating of locant values on a ring:

  functional groups:  new locant value = (no. of pos. rotations + old locant value) MOD length of ring
  unsaturation:       new locant value = (no. of pos. rotations + old locant value) MOD length of ring
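The ring flip and rotation equations above, combined with the priority-ordered search described for cyclo-locant minimisation, can be sketched in Python as follows. This is an illustration under stated assumptions rather than the actual implementation: locants are zero-based, a transform is recorded as a (flipped, rotation) pair instead of a single integer, and the names `transform` and `minimise_ring` are invented for the example.

```python
def transform(locants, ring_size, flipped, rotation):
    """Apply an optional flip, then a rotation, and return sorted locants."""
    result = []
    for old in locants:
        if flipped:
            old = (ring_size - old - 1) % ring_size   # ring flip equation
        result.append((rotation + old) % ring_size)   # ring rotation equation
    return sorted(result)


def minimise_ring(groups_by_priority, ring_size, transforms=None):
    """Return the (flipped, rotation) transforms that give the lowest locants.

    groups_by_priority: list of locant lists, highest priority first
    (a hypothetical layout, not Structurer's name array).
    """
    if transforms is None:
        # Start with every possible way of renumbering the ring.
        transforms = [(f, r) for f in (False, True) for r in range(ring_size)]
    if not groups_by_priority or len(transforms) == 1:
        return transforms
    locants = groups_by_priority[0]
    sets = [(transform(locants, ring_size, f, r), (f, r)) for f, r in transforms]
    lowest = min(s for s, _ in sets)
    survivors = [t for s, t in sets if s == lowest]
    # Ties are passed on, to be settled by the next-priority substituents.
    return minimise_ring(groups_by_priority[1:], ring_size, survivors)
```

For a six-membered ring with the top-priority groups at positions 2 and 4, two transforms tie on the locant set 0,2; a lower-priority group at position 5 then settles the tie in favour of the unflipped numbering rotated by 4.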

Part 2d of naming - recursively numerically sort names group procedure

This is the last part of the second part of the naming process. When naming an organic molecule with multiple functional groups(12) of the same sort, you place the locants first, then the multiplying prefix (di, tri, etc.), then the name of the type of functional group. The locants must, however, be in numerical order. See the example below:

Not only the numbering on the parent hydride needs ordering, but also all the numbering on any of the sub-hydrides.

It works by going along the parent hydride and sorting the locant positions in each individual sub-entry. If it reaches an entry which is not a group but a reference to a sub-hydride entry, it not only sorts that entry's locant numbers but also recursively calls itself to go along that sub-hydride.
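The recursive walk can be sketched in a few lines of Python. The nested list-of-dictionaries layout and the key names `locants` and `sub` are assumptions made for the example; the report's name array is organised differently.

```python
def sort_locants(hydride):
    """Sort the locants in every entry, recursing into sub-hydrides.

    hydride: list of entries; each entry is a dict with a "locants" list
    and, for a sub-hydride reference, a nested "sub" list of entries
    (a hypothetical layout).
    """
    for entry in hydride:
        entry["locants"].sort()
        if "sub" in entry:
            # The entry refers to a sub-hydride: walk along it as well.
            sort_locants(entry["sub"])
    return hydride
```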

Entering step 6 and 7 (as on figure 1) Part 3 of naming

This is the last stage of the naming process and it has only one part. This is the part that actually does the naming. It names the molecule and takes into consideration a number of rules. Here is a list of the rules (other than very simple ones):

  1. Multiple complex functional groups are denoted by brackets and a multiplicity denoted not by di, tri, tetra, etc, but by bis, tris, tetrakis, pentakis, hexakis(13), etc.
  2. Complex functional groups that themselves contain complex functional groups shall have square brackets(14). E.g. 2-[chloro-2-(1-hydroxymethyl) 5-oxohexyl] 4,5-dichlorocyclohexane-1-carboxylic acid(15).
  3. Multiple simple functional groups' multiplicities are denoted by di, tri, tetra, penta, hexa, hepta, etc.
  4. The bond order of a hydride functional group to the parent hydride is denoted by yl (single), ylidene (double), ylidyne(11) (triple). E.g. 2-methylidenepropane.
  5. A functional group is given a different name depending on whether it appears to the left of the parent hydride (as a prefix) or to the right of it (as a suffix)(16). E.g. oxo- & -one.
  6. The order in which the functional groups are named is alphabetical, not numerical. E.g. 4-amino-2,4-dimethyl-3-fluoro-3-(hydroxymethyl)-hexan-1-ol.
  7. For non-carbon homogeneous(17) chains and rings affix a -sil (silicon), a -az (nitrogen), or a -phosph (phosphorus) directly after the parent hydride name. E.g. cyclobutaazane.
  8. Vowels under IUPAC standardisation are: a, i, o, u, y.(18)
  9. If the name of the characteristic group has an IUPAC vowel at the beginning of its name, then the 'e' at the end of the parent hydride [including (un)saturation] is elided. E.g. propan-1-ol.
  10. The sequence "ao" is not allowed in a name. It is replaced by just "o".
  11. (Un)saturation is denoted by ane (no multiple bonds), ene (double bonds) and yne (triple bonds).
  12. Words shall not be separated by spaces, but by hyphens(19). E.g. 1-bromo-2-chloroethane.
  13. Internally the locants have been stored as 0, 1, 2, 3... etc., but the IUPAC standard is for them to be 1, 2, 3, 4... etc.
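A few of these rules (1, 3, 8, 9, 10 and 13) lend themselves to a short Python sketch. The helper names and the way groups are passed in are hypothetical, and the multiplicity tables stop at six for brevity. Rule 10 is applied exactly as stated in this report, to the suffix only.

```python
IUPAC_VOWELS = "aiouy"                                    # rule 8
SIMPLE = {2: "di", 3: "tri", 4: "tetra", 5: "penta", 6: "hexa"}        # rule 3
COMPLEX = {2: "bis", 3: "tris", 4: "tetrakis", 5: "pentakis", 6: "hexakis"}  # rule 1


def show_locants(internal):
    """Rule 13: turn stored 0-based locants into sorted 1-based text."""
    return ",".join(str(n + 1) for n in sorted(internal))


def prefix(internal_locants, group, is_complex=False):
    """Rules 1 and 3: locants, then multiplicity, then the group name."""
    count = len(internal_locants)
    locants = show_locants(internal_locants)
    if is_complex:
        return f"{locants}-{COMPLEX.get(count, '')}({group})"
    return f"{locants}-{SIMPLE.get(count, '')}{group}"


def add_suffix(parent, internal_locants, suffix):
    """Rules 9 and 10: attach the characteristic group to the parent hydride."""
    ending = SIMPLE.get(len(internal_locants), "") + suffix
    ending = ending.replace("ao", "o")                    # rule 10
    if parent.endswith("e") and ending[0] in IUPAC_VOWELS:
        parent = parent[:-1]                              # rule 9: elide the 'e'
    return f"{parent}-{show_locants(internal_locants)}-{ending}"
```

Checking the first letter of the whole ending, multiplicity included, keeps the 'e' in a name such as propane-1,2-diol while still eliding it in propan-1-ol.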

It works by recursively going along the name array 2 starting from main entry 0 which is the parent hydride entry. When it reaches a sub-hydride group it calls itself passing on the position of this group. When it has finished naming a hydride group it returns the name as a string. If the name is a complex one and it is for a sub-hydride then it puts a "(" at the start before it returns it so that the calling procedure (itself) knows that it is complex and can put a ")" at the end and any different multiplicity (i.e. bis, tris, etc).

It differentiates between the parent hydride naming (which has a characteristic group) and the sub-hydride naming by checking the status of a boolean flag. As it goes along the entry for the hydride in question, it adds each of the names to the relevant part: one part for functional groups, one part for the hydride, and one part for the characteristic group. The (un)saturation is added after all the groups have been added so that it can be worked out.

Once it is all put together, further processing of the name is done. It removes "ae" and double spaces, adds "(" to the start if it is a complex sub-hydride, and puts "-" in the appropriate places. If the name is the final complete name, it is also processed to put in square brackets, to replace spaces with "-" and to replace "_" with spaces. "_" is used to denote a 'hard' space, as in a name such as carboxylic acid, so that the final processing does not replace that space with a "-".
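The 'hard' space convention is easy to show in isolation. The Python sketch below covers only the space handling of the final pass (doubled spaces, spaces to hyphens, "_" to a real space); the "ae" removal and the bracket handling are left out, and the function name and draft strings are made up for the example.

```python
def finalise_name(draft):
    """Turn ordinary spaces into hyphens and '_' hard spaces into spaces."""
    words = draft.split()  # splitting on whitespace also drops doubled spaces
    return "-".join(words).replace("_", " ")
```

Because the hard space is only converted after the ordinary spaces have been hyphenated, the space in carboxylic acid survives the final pass.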

Capabilities

That is to say what it can do so far...

What it CAN handle

What it CAN'T handle

Table of substituents

To show at a glance which substituents my skeleton algorithm can handle, I have compiled the following table(20):

Class              Formula*
Alcohols           -OH
Aldehydes          -(C)HO
Amines             -NH2
Carboxylic acids   -COOH, -(C)OOH
Imines             =NH
Halogens           -Cl, -I, -Br, -F
Ketones            >(C)=O
Nitriles           -C≡N
Nitrosos           -NO
Thiols             -SH

* (C) designates a carbon atom that is included in the name of the parent hydride and does not belong to a group designated by a suffix or a prefix.

Examples of what it can do

Here are some examples of what my algorithm can do:

Discussion of Results

The whole point of this project is the results, and the discussion of them is not entirely relevant. However I shall consider the following points:

As they are now (the results)

The results show that my method/algorithm does actually work. Whether or not it can be extended to handle multiple fused rings, and esters is another matter. However, knowing and understanding the structure and workings of Structurer, I do not see any insurmountable problems for such expansion.

Future improvements and development

There are a number of areas that could be improved and worked upon. Some of the thoughts I have had on them include:

Esters & Ethers & Similar (structure-wise)

These are currently totally unsupported. The problem is that they consist of two hydride groups joined by an oxygen. This is not obviously compatible with my algorithm; however, all it would theoretically take is adding code to cover these where rings and chains are handled. The re-numbering and locant minimising routines would also have to take account of the different numbering.

Carboxylic acids

These are almost supported fully. However due to the ambiguities of the IUPAC system for these functional groups I was unable to come up with a standard system. For example methane-1-carboxylic acid is the same as ethan-1-oic acid. So which one should you choose? Further investigation is needed in this area.

The parent hydride finder

At the moment all it does is try to find the largest ring and then the largest chain of carbons, then of nitrogens, phosphorus and silicons. It takes the first found. This is not necessarily correct but works to a high standard. More work and investigation are needed here. Perhaps try the paper in Journal Of Chemical Information And Computer Sciences, 1992, Vol.32, No.3, pp.215-221.

Heterogeneous atom rings (heterocyclic)

There is currently no handling of this at all. The same sort of considerations as for esters are needed for this. More work needed.

Non-systematic ring name systems

There is currently no handling of this at all. The same sort of considerations as for esters are needed for this. More work needed. The most notable would be benzene.

Valence checking

A more complete algorithm would need to do valence checking and other such input checks before trying to name a structure. Mine does not. This would be very easy to implement. Autonom only did this from version 2.

Stereochemical descriptors

It would not be very difficult to add stereochemical handling code to my algorithm due to its modular design. For more information on a method for implementing this, try looking at the Journal Of Chemical Information And Computer Sciences, 1994, Vol.34, No.3, pp.463-479.

Conclusion

When you compare Structurer with Autonom version 1, the only real differences are that Autonom 1 handled esters, fused rings, a few more functional groups and non-systematic names. It took a team of programmers about three years to produce release version 1 of Autonom. My very simple interface is not much worse than their DOS interface for Autonom 1. That is not too bad a result in less than a fifth of the time and with fewer programmers. This newly written Java translation is less than 50 K, which is smaller than the original ARM BBC BASIC V version at about 68 K.

I have fulfilled enough of the criteria and even surpassed them in some areas, such as non-carbon homogeneous chains and rings. With more work my algorithm could be very comprehensive indeed. That is not to say that only a small amount of work has been done, as it is as complete as it was originally supposed to be.

There are certainly many ways in which the project could be improved and expanded; this shows its good original modular design. But what has been done is certainly enough to show that my design for this algorithm does actually work. After many hours looking into the idea of expansion, for at least all the sorts of organic structures that one is likely to find at a university, I have found no insurmountable problems.

600 hours or more spent on this project was well worth it and I have certainly learnt some new organic nomenclature.

In conclusion I find that this project has been wholly successful and has provided a very viable modular naming skeleton algorithm. I am very pleased with it.

References & Bibliography

[1] Ref. UMIST 3rd year chemistry exam information booklet 1995.

[2] Defn. Connection table "table showing the type and connection of atoms in a molecule".

[3] Ref. http://www.beilstein.com/

[4] Ref. http://www.beilstein.com/ and Beilstein Newsletter 01.91 page 2-3

[5] Defn. hydride "generally means hydrocarbon-like, but used by IUPAC in preference"

[6] Ref. Beilstein Newsletter 01.91 page 2-3

[7] Ref. http://www.beilstein.com/

[8] Defn. Hetero chain "chain of atoms of different types"

[9] Defn. Methaazanes "non-existent single atom nitrogen chain" Ref. Panico, Powell and Richer: A Guide to IUPAC Nomenclature of Organic Compounds 1993, section 2.2.2

[10] Defn. Characteristic group "the highest priority functional group that can be a characteristic group - the group that characterises the molecule" Ref. Panico, Powell and Richer: A Guide to IUPAC Nomenclature of Organic Compounds 1993, section 2.2.2 & 2.2.3

[11] Ref. Panico, Powell and Richer: A Guide to IUPAC Nomenclature of Organic Compounds 1993, section 3.1.1

[12] Ref. Panico, Powell and Richer: A Guide to IUPAC Nomenclature of Organic Compounds 1993, section 1.4.1

[13] Ref. Panico, Powell and Richer: A Guide to IUPAC Nomenclature of Organic Compounds 1993, section 1.4.2

[14] Ref. Panico, Powell and Richer: A Guide to IUPAC Nomenclature of Organic Compounds 1993, section 1.5.3

[15] Ref. Panico, Powell and Richer: A Guide to IUPAC Nomenclature of Organic Compounds 1993, section 4.2.4

[16] Ref. Panico, Powell and Richer: A Guide to IUPAC Nomenclature of Organic Compounds 1993, section 3.2.1.2

[17] Defn. Homogeneous chain "chain of atoms of the same type"

[18] Ref. Panico, Powell and Richer: A Guide to IUPAC Nomenclature of Organic Compounds 1993, section 1.7.1

[19] Ref. Panico, Powell and Richer: A Guide to IUPAC Nomenclature of Organic Compounds 1993, section 1.3.4

[20] Ref. Panico, Powell and Richer: A Guide to IUPAC Nomenclature of Organic Compounds 1993, section 4.1 table 9 & 10