Analysis and Maintenance of a Clean Virus Library

Vesselin Bontchev, research associate
Virus Test Center, University of Hamburg
Vogt–Koelln–Str. 30, 22527 Hamburg, Germany
E–mail: bontchev@fbihh.informatik.uni-hamburg.de

Abstract: A well–maintained virus library, or as it is often called, a virus collection, is an important tool for the anti–virus researcher. It can be used to test anti–virus software, to systematize the knowledge about the thousands of currently existing viruses, as a basis for information exchange with other anti–virus researchers, and so on. However, creating such a collection and maintaining it in a clean and well–ordered state is not a trivial task, especially with the huge number of currently existing viruses and new ones popping up literally every day. This paper describes the major guidelines and procedures used to maintain the virus collection at the Virus Test Center Hamburg.

2. Weeding

2.1. Unpacking

2.2. Removing the Duplicates

2.3. Corrupted Files

2.4. Envelopes

2.5. Non–Viruses

1. Introduction

Many people who like to call themselves "virus researchers" seem to think that keeping a big collection of viruses is enough to qualify them as such. Yet, they often fail to achieve even this task. Many of the so–called "virus collections" that we have seen being distributed in the computer underground, are in a very sorry state. They contain huge numbers of viruses, non–viruses, trojan horses, joke programs, intended viruses (programs written with the obvious intent to write a virus, but too buggy to replicate even once), corrupted files, text files, virus creation tools, completely innocent files, and so on.

A typical example of this is the so–called virus collection distributed (sometimes even sold) by John Buchanan. Several people in the anti–virus research community have received it (some have even paid for it) and have spent an enormous amount of time and effort just to discover that it consists mainly of junk. Yes, it contains thousands of files, megabytes of data. Many of those files are viruses: mostly well–known ones, or trivial modifications of the well–known ones, or even (from time to time) completely new ones. The hard thing is to find out which ones they are; i.e., to sieve the important things from the garbage.

Often, after the painful job of weeding is over, one wonders whether the results have been worth the effort. Very often they are not, but one never knows in advance. Yet the task often has to be done over and over: every now and then some "helpful" soul downloads such a "collection" from the virus exchange BBSes and sends it to us, with an intent to help, of course.

At the VTC–Hamburg we have to weed garbage virus collections once or twice per month on average. In this paper we shall try to describe the procedures that we use to keep our virus collection in a well–organized state when analyzing new virus collections and merging them into it.

The first thing to do when you receive a new virus collection is to remove the garbage from it. This is also often the most difficult task. Then the new viruses found in it must be replicated, classified, and merged with your own virus collection. Only then can the resulting updated collection be used for testing anti–virus software, for sending to other anti–virus researchers, and so on.

2. Weeding

In this section we shall describe the process of removing the garbage from the virus collection—i.e., its weeding.

2.1. Unpacking

The times when the number of existing viruses was half a dozen and they all fitted nicely on a single floppy disk are gone forever. Nowadays, a typical collection consists of thousands of files (as of the writing of this paper, June 1993, there are more than 2,800 known viruses), which occupy megabytes of disk space. That is why some form of compression and archiving is used to save space when transporting the collection. The problem is that no standard scheme is used.

The first thing to do when you receive a new virus collection is to unpack it. Unfortunately, this is often less trivial than it sounds. First of all, one needs a lot of free disk space, megabytes of free disk space. Not only for the collection itself—sometimes the collection can be found in a huge archive, which is itself stored on multiple floppy disks, using some kind of backup program. Therefore, one needs an amount of free disk space equal to the size of the collection in unpacked form plus the size of the archive that contains it. A collection of backup/restore programs is also desirable, in case the archive with viruses has been stored on diskettes using some kind of weird backup program, e.g., one that requires a particular version of DOS in order to run.

The next task is to unpack the archive. It is usually created with PKZIP or ARJ, but any of the other popular archivers could also be used, so we need to have all of them handy. When unpacking a ZIP archive, one must remember that it may have a directory structure, which will be lost unless the unarchiver is explicitly instructed to preserve it. In the case of ARJ archives we often see multi–volume archives, which span multiple floppies. They must also be treated with care.

Often the whole archive is encrypted—for security reasons. We have established lists of passwords that the other anti–virus researchers use when sending us viruses, but occasionally they decide to change the password (again for security reasons) and we have to contact them by telephone or fax and obtain the new password. Sometimes unpacking the main archive reveals a set of other archives—often packed with a different archiver and/or encrypted with different passwords. In some cases, those come from an underground virus exchange BBS and the password is unknown. Sometimes the encryption can be cracked or the password guessed, deduced, or just found out by studying the other files that come with the archive.

Even if the archives are not encrypted, unpacking them might pose some problems. Often several archives contain files with different contents but with one and the same name; README is a favorite one. Some archivers allow the user to specify an alternative destination name for the file being unpacked, if one with the original name already exists on the disk. Others offer only the choice between overwriting the existing file and not unpacking the new file at all. In those cases one must write down the names of the files that are not unpacked, then rename the existing ones, and then unpack the files missed during the first pass.
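The renaming procedure described above lends itself to automation. What follows is only a minimal sketch in Python, not the procedure used at the VTC. It assumes plain, unencrypted ZIP archives, and the function names and the numeric suffix scheme (README, README.1, README.2) are illustrative assumptions. It preserves the directory structure stored in the archive and writes a clashing file under a new name instead of overwriting the existing one.

```python
import zipfile
from pathlib import Path, PurePosixPath


def unique_path(target: Path) -> Path:
    """Return target, or target plus a numeric suffix if the name is taken."""
    if not target.exists():
        return target
    counter = 1
    while True:
        candidate = target.with_name(f"{target.name}.{counter}")
        if not candidate.exists():
            return candidate
        counter += 1


def unpack_without_overwriting(archive: Path, dest: Path) -> list:
    """Extract all files of a ZIP archive below dest.

    The stored directory structure is kept. A file whose name already
    exists on disk is written under a new name (hypothetical suffix
    scheme) rather than overwriting or being skipped.
    """
    written = []
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            # Drop root and parent components so nothing lands outside dest.
            parts = [p for p in PurePosixPath(info.filename).parts
                     if p not in ("/", "..")]
            if not parts:
                continue
            target = unique_path(dest.joinpath(*parts))
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(zf.read(info))
            written.append(target)
    return written
```

Unpacking two archives that both contain README into the same directory then yields README and README.1, so no manual second pass is needed.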

2.2. Removing the Duplicates

Once the collection is unpacked, the next task is to remove everything that you already have, that is, everything that is not new. We have found it useful to keep more than one infected file per virus. In fact, our collection is a superset of all collections that have been sent to us: all files in them have been merged into ours, with the duplicates removed. Since the same virus collections are often sent to us over and over (slightly updated each time), it is simpler to just keep a copy of them and, when each new collection arrives, simply to remove the files of which we already have copies.

Bearing in mind that we often have to deal with thousands of files (at the time of writing this paper, our own collection consists of more than 12,000 different files), the task of spotting the duplicated ones and removing them might seem extremely difficult. Fortunately, there is a wonderful shareware utility which helps us do that.

This is the program Duplicate File Locator (DFL) by William Ataras. It is able to scan a whole hard disk and locate the duplicate files, where the user can specify what exactly s/he means by "duplicate files". This can be files with the same names, with the same first part of the names, with the same contents, or even with the same CRC checksum. The latter is in practice equivalent to comparing the files by their contents, except that it is faster, and requires less memory. DFL even "knows" the format of several popular archives and, when the CRC–comparison mode is used, it is able to extract the value of the CRC field from the archive without even unpacking it. This also tends to speed things up a bit.
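The CRC comparison mode described above can be illustrated with a short Python sketch. It is not DFL and makes no claim about how DFL works internally. It assumes that only ZIP archives are looked into, that files with the same CRC–32 and the same size are treated as candidate duplicates, and that a byte–by–byte comparison would confirm them before anything is deleted. The function names are illustrative.

```python
import zipfile
import zlib
from collections import defaultdict
from pathlib import Path


def crc32_of_file(path: Path, chunk_size: int = 65536) -> int:
    """Compute the CRC-32 of a file without loading it whole into memory."""
    crc = 0
    with path.open("rb") as fh:
        while chunk := fh.read(chunk_size):
            crc = zlib.crc32(chunk, crc)
    return crc


def find_duplicates(root: Path) -> dict:
    """Group files below root by (CRC-32, size); return groups of 2 or more.

    Members of ZIP archives are listed as 'archive!member'. Their CRC is
    read from the archive directory, so nothing has to be unpacked.
    """
    groups = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        if zipfile.is_zipfile(path):
            with zipfile.ZipFile(path) as zf:
                for info in zf.infolist():
                    if not info.is_dir():
                        key = (info.CRC, info.file_size)
                        groups[key].append(f"{path}!{info.filename}")
        else:
            key = (crc32_of_file(path), path.stat().st_size)
            groups[key].append(str(path))
    return {key: names for key, names in groups.items() if len(names) > 1}
```

Reading the stored CRC field is what makes the archive case fast: the checksum recorded in a ZIP directory uses the same CRC–32 as zlib.crc32, so packed and unpacked copies of a file fall into the same group.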

The size of the hard disk being scanned is not important: DFL is able to create temporary files and use them as virtual memory, so any size of hard disk and number of files is supported, provided that there is enough free space for the temporary files. The speed of the program is also impressive; in our experience it was able to sort out about a dozen thousand files in less than half an hour.

The program is able to handle only whole volumes, not single directory trees, but this drawback can be easily circumvented by using the DOS command SUBST and assigning drive letters to the subdirectory trees one wants to process.

Once DFL has scanned the specified volumes and has determined the duplicates (according to the conditions specified by the user), it also allows the duplicated files to be marked and deleted, even if they reside inside an archive. One annoying limitation of the program is that one must manually tag every duplicated file that has to be removed, and there are often thousands of such files. It would be much simpler if the program allowed some way to mark all files residing on a specified drive or in a specified directory path at once.

After the files that we already have are removed, the size of the collection is greatly reduced, usually to about 10% of its original size. However, this does not mean that everything that remains is a new virus.

2.3. Corrupted Files

Many of the files that remain are simply corrupted. There are different kinds of corruption that we usually observe. We don't know what causes them, but they are often present in the unkempt collections, even though they are usually easy to spot.

One kind of corruption consists of the beginning of the file being overwritten by some text characters, seeming to be randomly typed from the keyboard. In other cases, the first byte of the file seems to be missing, as if somebody has cut it out and shifted the whole file one byte to the beginning. We say that such files are "out of phase".

Another kind of corruption consists of the entry point of the file pointing outside the loadable part of the file. Or, the file is infected (that is, a virus is appended to it), but the original bytes of the file are restored and the virus never gets control. This is probably the result of running some bad anti–virus program on the infected file. Depending on how they work, the different scanners report different things when used on such "partially disinfected" files. Some of them say that the files are not infected, which is, strictly speaking, correct, because the virus in them never gets the chance to execute. Other scanners claim that a new variant of the virus is present in the file. Yet others say simply that the file is infected by a particular virus. All such cases have to be investigated (usually by loading the suspected file with a debugger and inspecting it) and if the file is corrupted, it has to be removed.

In very rare cases, the virus code is slightly damaged. Sometimes this prevents the virus from working at all, sometimes only the payload is not working, and sometimes an entirely new, perfectly working variant of the virus is produced. Relatively often the corruption is caused by overwriting the replication part of the virus with a single INT3 instruction. This usually indicates that somebody has tried to replicate the virus with a debugger. Of course, viruses which are overwritten in such a way are not functional and have to be removed.

2.4. Envelopes

Very often some of the virus samples in the collections that appear on the underground virus exchange BBSes are packaged with some kind of envelope. Such enveloping always modifies the original file in some way and such a modified file usually escapes the initial sieving phase when the duplicate files are eliminated. There are different kinds of envelopes.

In the simplest case, it is just some string appended to the file. The most popular ones are "MsDos" (appended by the anti–virus package TNTVIRUS), a "signature" from Todor Todorov's Virus eXchange BBS, and another two signatures from a virus exchange BBS called Arrested Development. Less popular is the 10–byte checksum added to the files by McAfee's program VirusScan when used with the option /AV. In some cases we have seen several such signatures appended to one and the same file. We do not know what the reason for such "signing" is (maybe it is an incompetent attempt to trace the distribution of the file?), but it often causes some scanners to report a known virus as a new variant. Therefore, we use a small program which detects those signatures and automatically removes them from the files.
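The idea behind such a signature remover can be sketched in a few lines of Python. This is not the program used at the VTC. The only trailer taken from the text is the string "MsDos"; the byte patterns of the BBS signatures and of the VirusScan checksum are not given here, so they would have to be supplied by the user through the hypothetical trailers parameter. Since several signatures may be stacked on one file, the sketch repeats until nothing more can be removed.

```python
# Only "MsDos" comes from the text; any further entries are assumptions
# that the user would have to supply.
KNOWN_TRAILERS = (b"MsDos",)


def strip_trailers(data: bytes, trailers=KNOWN_TRAILERS) -> bytes:
    """Remove known signature strings from the end of a file image.

    Several signatures may be appended one after another, so the loop
    runs until no known trailer is left at the end.
    """
    changed = True
    while changed:
        changed = False
        for trailer in trailers:
            if trailer and data.endswith(trailer):
                data = data[:-len(trailer)]
                changed = True
    return data
```

A match at the end of a file is only a hint. A file may legitimately end with the same bytes, so the stripped result should be compared against a known clean replicant before the original is discarded.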

Another kind of packaging that modifies the initial file is to compress this file with one of the existing executable file compressors—LZEXE, PKLite, Diet, etc. For instance, the executable file compressor, known as ICE, was the favorite one used by the Italian virus writer known as Cracker Jack to distribute the first generation samples of his viruses.

Such compressed files have to be detected, recognized, and removed. Failing to do so may lead to a confusion like the "Ramvirus", described in Patricia Hoffman's hypertext document VSUM. In reality, this is a file, infected with both the Jerusalem and Cascade viruses and then compressed with Diet. Since the compression alters the image of the viruses present in the file, many scanners will not be able to detect the virus, even if it is known to them.

There are several scanners which are able to scan inside the files compressed with most of the popular compression programs, but none of them is able to handle all existing compression schemes. We therefore prefer to use a separate shareware program, UNP, which specializes in restoring files that have been compressed or scrambled in some way. It is able to handle all currently existing compression and scrambling schemes. However, care has to be taken, because some viruses actually spread in compressed form, and uncompressing them would create a new, different virus.

Finally, the most sophisticated kind of packaging used are the so–called immunization modules. These modules are small pieces of code that are attached to the executable files much like a virus. They have the task to check the file integrity at runtime and sometimes even to automatically restore the file, if it seems to be infected. The modules that we encounter most often are those added by CPAV, F-Xlock (a part of older versions of Fridrik Skulason's F-PROT), and VirusScan (when used with the /AG option).

We don't know why the samples in the virus collections have these modules so often attached to them. Probably the reason is that somebody runs a scanner (in order to determine which viruses are present in the collection) in some default mode that adds these modules automatically.

Those modules are generally a bad idea. They are not able to detect an infection by a stealth virus. They also modify the "protected" program and thus can prevent it from running, e.g., if it is a self–checking program. Also, some executables just cannot be immunized, because they refuse to run if something is appended to them. Examples of such programs include Windows applications, files with internal overlay structure, and so on.

The virus samples which are "immunized" in this way are not only modified (and thus evade the initial sieving phase, which removes the duplicate files); sometimes the "immunization" also effectively "hides" the virus from the scanners, and they stop recognizing the file as infected.

Again, in those cases we use specialized tools to remove the envelope. Often the product that appends the modules to the executable files has an option to remove them. Some of the scanners that we use have an undocumented option that forces them to remove some of these modules—much in the same way as they remove a virus infection.

One special case of envelopment is when a virus sample is additionally infected by another virus. This sometimes happens when the "collector" accidentally releases a virus and gets the files of his collection infected by that virus.

Most of the scanners that we use in our work have the option of removing the viruses from the infected files one by one, peeling them like an onion. This is useful, when the virus that interests us is deeper in the file than the enveloping infection. However, in some rare cases we are interested in the virus that has last infected the file and is therefore at the outermost level.

In those cases we separate the viruses, using several different techniques. If the two viruses in the file happen to infect under different conditions, then the task of separating them is a relatively easy one. One has just to create a set of conditions, under which only the virus that interests us will infect. For instance, if the "interesting" virus infects only COM files, while the "uninteresting" one infects only EXE files, then it is sufficient to provide the viruses with COM files only to infect and only the virus that interests us will infect them.

Unfortunately, in some cases the two viruses infect under one and the same conditions and it is impossible to separate them in that way. Then we use some kind of debugger or binary file editor to patch the virus that does not interest us in such a way that it becomes unable to replicate. Only then is it possible to extract the "interesting" virus.

2.5. Non–Viruses

Once all files are unpacked, the duplicates and the corrupted files removed, and the envelopes "peeled", one should not assume that all that remains is viruses. Very often the collectors gather programs that are not viruses, but which they feel belong in a virus collection. Examples of such programs are Trojan horses, joke programs, demos (programs that demonstrate some cute effect of a famous virus), first generation viruses, utility programs, and so on.

We often use our collection to test the detection rate of the scanners. By definition, a scanner is a program to detect viruses. It would be unfair to measure its detection rate on files which are not viruses and report them as "missed" by the scanner. Therefore, we must make sure that the viruses and the non–virus programs in our collection are kept separated.

The majority of the non–virus files in the virus collections are Trojan horses. By definition, a Trojan horse is a program that claims to perform some useful functions, while at the same time intentionally performing some harmful ones. Many of the Trojans that we see barely fulfill the above definition. Most of them are extremely mediocre programs, which try to format the hard disk (very often unsuccessfully, due to the incompetence of their author), or to delete all files in the current directory, or to wipe out the file allocation table, and so on. The "useful function" part is often completely ignored.

Often, it is possible to figure out whether a file contains a Trojan by just examining it with a file browser. The typical Trojan is a compiled program, written in a high–level language (usually C or Pascal, but we have also seen BASIC and even Modula–2), which contains some offending message that is supposed to be displayed after the Trojan is run. Of course, this is not always the case, because it is trivial to keep the message encrypted. However, most people who write Trojans are so incompetent that they rarely do that. In fact, very often they do not even strip the debugging information at the end of the compiled program, so one can easily see the symbolic names of the variables and routines used. Often these provide a hint about what the Trojan does; e.g., if it is a program written in C and the abswrite() function is used, then it is quite likely that the Trojan overwrites part of the disk.
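The kind of inspection described above can be partly mechanized. The following Python sketch is an assumption about how one might do it, not a tool mentioned in this paper. It extracts the runs of printable ASCII characters from a file image and reports those that contain a name from a watch list. Only abswrite comes from the text; the minimum run length of four characters and the function names are illustrative choices.

```python
import re


def printable_strings(data: bytes, min_len: int = 4) -> list:
    """Return all runs of at least min_len printable ASCII characters."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [match.decode("ascii") for match in re.findall(pattern, data)]


def suspicious_hits(data: bytes, keywords=("abswrite",)) -> list:
    """Return the printable strings that mention any watched keyword.

    The default watch list holds only the example from the text; a real
    list would be an assumption of whoever maintains it.
    """
    hits = []
    for text in printable_strings(data):
        lowered = text.lower()
        if any(keyword.lower() in lowered for keyword in keywords):
            hits.append(text)
    return hits
```

A hit proves nothing by itself, and an encrypted message or a stripped symbol table defeats the search entirely, so the result only decides which files deserve a closer look.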

Sometimes the Trojan programs are accompanied by a text file, which is a parody of real documentation and has the goal of convincing the user to execute the program. Very often those text files are written so incompetently (obviously by adolescent kids) that it is fairly obvious that they cannot describe a real, professional product. Nevertheless, all those Trojan horses keep appearing in the virus collections, so they have to be isolated and separated from the virus–infected files. We do not attempt to collect or classify such programs; we merely keep them as reference copies for the process that removes the files we already have from the unsorted collections.

Another kind of malicious software which we sometimes find in the virus collections is the so–called joke program. Joke programs appear to do something harmful, but in practice do not perform the claimed destructive action. For instance, one of the popular joke programs displays an animated picture of bugs which "eat" the text from the screen. When the user presses a key, the bugs disappear and the original contents of the screen are restored. Another joke program reverses the screen image, while yet another one claims to format the hard disk but performs non–destructive disk accesses instead.

Some authors prefer to classify the joke programs as a kind of Trojan horse. Indeed, whether you find their effect malicious or funny largely depends on your sense of humor and on whether you have been the victim of the particular joke. Regardless of how they are classified, however, one thing is certain: they are not viruses and have to be separated from the infected files in the virus collection. Fortunately, there are not so many jokes floating around, and their recognition is a relatively easy task.

Another kind of program that we often see mixed with viruses in the unkempt collections is the kind of programs that demonstrate the visual or sound effects of particular viruses. While those programs are relatively harmless, they do contain parts of the virus code. Occasionally, a scanner will use a scan string for a particular virus from the payload code of the virus. It wouldn't be surprising then if the scanner detects the innocent program that demonstrates the payload of the virus as "infected". Maybe the existence of such scanners explains why we sometimes see those demonstration programs mixed up with viruses in the collections. Most collectors are unable to disassemble a program and to determine whether it is a virus or not, so they just run their favorite scanner on their collection and blindly believe its reports. This is one more reason why the demonstration programs must be detected and removed from a well–organized virus collection—after all, they are not viruses.

Lately, we have observed another kind of weed in the virus collections that we receive. Doren Rosenthal from Rosenthal Engineering has produced (and distributes as shareware) a program which he calls a "virus simulator". The program generates executable files which are not viruses themselves, but which contain tiny pieces of code picked from different viruses: the parts of the virus code used as scan strings in different scanners. The idea is that those files can be used to "test" how well the scanners can detect infections, or to demonstrate in a safe environment what happens when a scanner detects a virus and triggers the alarm. This line of reasoning rests on two premises, both of which are untrue: that each virus has a unique signature somehow "attached" to it, and that all scanners work by simply looking for this signature in the files. Such reasoning demonstrates a deep lack of knowledge about how modern scanning technology works. The final result is that, of course, those "simulation files" are completely useless for the purposes of testing anti–virus products. In fact, they are even harmful, because they can lull the user into a false sense of security.

The fact is, however, that some of them really do trigger some scanners. This only increases the general confusion, because the incompetent collectors incorrectly conclude that the files contain viruses. After all, if the scanner says that the file contains a virus, it must be so, right? Unfortunately, this reasoning is completely wrong, and the net result is that our time is wasted on analyzing silly "simulated viruses" in order to detect them and remove them from the virus collections.

Occasionally, a badly maintained virus collection contains files with executable extensions (COM or EXE), which do not contain executable files. In some cases they contain images of infected boot sectors. Obviously, somebody has saved the image of a boot sector virus in a file with executable extension, with the intent to disassemble it further with some kind of disassembler or debugger, which is able to work only with files with executable extensions.

Sometimes, however, those files contain pure ASCII text. Two popular files which we keep seeing reappearing are HI.COM (containing a worm program for VAX/VMS written in DCL) and CHRISTMA.EXE (containing the text of the chain letter program written in REXX, which was widely spread on BITNET several years ago).

Usually it is relatively trivial to recognize such files and to remove them. We use a file browsing utility to inspect the contents of the files. This utility displays the file contents both in ASCII characters and in hexadecimal. The files which contain pure text are trivial to spot on inspection, and the images of boot sector viruses usually have a particular format, which also makes them relatively easy to detect. Sometimes there are additional levels of concealing—e.g., a text file, containing the description of a virus, is converted into a program that displays it with the popular utility TXT2COM and then is compressed with PKLite or some other executable file compressor program. This makes the detection of such files slightly more difficult, but even that becomes easy after one acquires some experience.
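The two checks described above, for pure text and for boot sector images, can be approximated programmatically. The Python sketch below is an illustration under stated assumptions, not the browsing utility used at the VTC. It assumes that a file counts as text when at least 95% of its bytes are printable ASCII or common control characters, and that a boot sector image is exactly 512 bytes long and ends with the bytes 55h AAh. Both thresholds and both function names are assumptions; the "particular format" is not specified further in this paper.

```python
# Tab, line feed, carriage return, and the DOS end-of-file mark (Ctrl-Z).
TEXT_CONTROL_BYTES = (9, 10, 13, 26)


def looks_like_text(data: bytes, threshold: float = 0.95) -> bool:
    """Guess whether a file image is plain ASCII text.

    The 95% threshold is an arbitrary assumption, not a value from the
    paper.
    """
    if not data:
        return False
    textual = sum(
        1 for byte in data
        if 32 <= byte <= 126 or byte in TEXT_CONTROL_BYTES
    )
    return textual / len(data) >= threshold


def looks_like_boot_sector(data: bytes) -> bool:
    """Guess whether a file image is a single saved boot sector.

    Assumes the image is exactly one 512-byte sector carrying the
    55h AAh marker in its last two bytes.
    """
    return len(data) == 512 and data[510:512] == b"\x55\xaa"
```

Both functions only flag candidates for visual inspection; a text file wrapped by a converter or an executable file compressor will pass neither test until the envelope is removed.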

However, undoubtedly the most frequently seen kind of "irregular" virus samples in the virus collections are the so–called first generation viruses. Those are programs which do contain virus code and will release it on execution, but which are not infected in a normal way. Some authors consider them a kind of Trojan horse, the harmful function of which is to release a virus. We, however, prefer to classify them separately and to split them into several groups: germs, droppers, and injectors.

Germs are programs which are produced by assembling the original source code of a virus and which cannot appear in the normal way of virus infection. For instance, a virus could contain a limitation to infect only files larger than a particular size, e.g., 1,000 bytes. Obviously, such a virus cannot infect a 10–byte program in a natural way. But, if one has the source code of the virus (or a good disassembly of it), it is very easy to put only a short host program in the beginning of the virus and to assemble the whole thing. The resulting program will release the virus, if executed. Even some anti–virus programs will be able to "disinfect" it by removing the virus and leaving only the tiny host program. However, such a file will never occur during the normal replication process of the virus.

Sometimes an easy way to detect the germs is to look at the beginning of the file. The germs are almost exclusively found in COM files. Many viruses infect such files by placing at the beginning of the file a JMP instruction which transfers control to the virus body. If the file is infected in the normal way, this will always be a JMP Near instruction (opcode 0E9h). However, if the virus is assembled from source and is attached to a tiny host program, the assembler usually optimizes the JMP instruction and generates a JMP Short instruction (opcode 0EBh). Therefore, if the first byte of an infected COM file contains 0EBh, chances are that this is a germ. However, this should be used only as a heuristic, not as a strict rule. There are viruses which do not use a JMP instruction to transfer control to their body, and there are viruses (usually of the prepending type) which begin with 0EBh in all infected files.
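The first–byte heuristic is simple enough to express directly. The Python sketch below checks the opcode at the start of a COM file image and, as an additional convenience that goes beyond the text, computes the file offset to which the initial jump transfers control. It assumes the standard COM load address of 100h; the function names are illustrative.

```python
JMP_NEAR = 0xE9   # JMP rel16, written by a normal infection
JMP_SHORT = 0xEB  # JMP rel8, typical of assembler output

COM_LOAD_OFFSET = 0x100  # a COM file image is loaded at offset 100h


def possible_germ(data: bytes) -> bool:
    """Heuristic only: does the COM image begin with a JMP Short?"""
    return len(data) > 0 and data[0] == JMP_SHORT


def jump_target(data: bytes):
    """Return the file offset the initial JMP leads to, or None.

    The displacement is relative to the address of the instruction that
    follows the JMP.
    """
    if len(data) >= 3 and data[0] == JMP_NEAR:
        rel = int.from_bytes(data[1:3], "little")
        ip = (COM_LOAD_OFFSET + 3 + rel) & 0xFFFF  # 16-bit wraparound
        return ip - COM_LOAD_OFFSET
    if len(data) >= 2 and data[0] == JMP_SHORT:
        rel = int.from_bytes(data[1:2], "little", signed=True)
        return 2 + rel
    return None
```

For an appending virus the target of a JMP Near normally lies close to the end of the file, which gives a second hint to weigh together with the opcode. As stated above, neither hint is a strict rule.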

The second type of first–generation virus programs are the droppers. These are programs which install a boot sector virus on a floppy disk, usually after requesting confirmation from the user. Since their main goal is not to conceal their actions, but merely to provide a convenient means of transporting boot sector viruses, the droppers are relatively easy to detect. Usually just inspecting the suspicious file with a binary editor is sufficient: one can easily spot the messages that describe what the program does and request confirmation from the user. The dropper programs are not normal replicants of a virus and should therefore be kept separate from the main virus collection. We usually run them on a sacrificial machine, then inspect the infected diskette that they create in order to determine exactly which virus is installed by the dropper.

The last variant of first–generation viruses are the so–called injectors. They are probably most similar to the normal Trojan horses. They are programs which are not infected themselves, but which do contain a virus (often—in some concealed form or another) and which release it on execution—usually unnoticeably to the user.

Injectors are the kind of non–natural virus samples which is the most difficult to detect—because the virus writer has taken some steps to conceal the fact that the program releases a virus. The easiest way to deal with such files is simply to try to replicate them and then compare the replicants of the virus with the original file.

Another kind of virus–like programs that often clutter the virus collections but which are not viruses, are the so–called intended viruses. Those are programs written with the obvious intent to write a virus, but which are so buggy (usually due to the lack of programming experience of their author), that they are just unable to replicate. Examples include "viruses" which hang the computer on execution, or which can infect only files with a particular length (e.g., 17 bytes), or which fail to point the entry point of the file to the virus body after the virus is appended to the file, and so on. Usually it is a waste of time to try to find all bugs in such programs. However, it is important to recognize them and to keep them separated from the real virus samples. Usually, the easiest way to do this is to try to replicate them and to check whether the replicants are able to replicate further. If they are, then they contain a real virus, otherwise, it is quite probably just yet another intended virus, written by yet another incompetent "wannabe" virus writer.

Finally, the virus collections sometimes contain perfectly innocent files. Some of them might have something to do with the virus collection, and some may be completely unrelated to it. Examples include archivers, programs which convert a diskette into a file image, goat files used to attach viruses to them (but with no viruses attached), programs which advertise (often with elaborate pseudo–graphic pictures) the different virus exchange and/or pirate BBSes, and so on. We have even seen programs such as FORMAT and CHKDSK included; obviously somebody has noticed that they perform direct disk access and has labeled them as "dangerous"...

While detecting, recognizing, and removing some of those non–virus programs is relatively easy, finding all of them, especially in an unkempt collection, can take a lot of time and effort. We solve the problem by keeping a copy of those programs in separate directories, which are clearly labeled as containing non–viruses. Since a copy of the files is already present, any additional copies will get automatically sieved out during the initial process of removing the duplicates.

An alternative solution is to use a database of checksums of those files and to automatically reject any file whose checksum is already contained in the database. One drawback of this method is that one might make a mistake and incorrectly label a virus as an innocent or non–working program. If a copy of the original file is kept, the mistake can still be discovered and corrected after the file is analyzed and its viral qualities are confirmed. This is a significantly more difficult thing to do if only a checksum of the file is kept.
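
The checksum–database approach can be sketched as follows. This is only an illustration under stated assumptions: the paper does not name a checksum algorithm, so SHA-256 is used here, and the function names (file_digest, build_known_db, split_incoming) are hypothetical. To address the drawback mentioned above, the database maps each checksum to the path of the retained copy instead of storing the checksum alone.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 digest of a file's contents as a hex string."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def build_known_db(non_virus_dir: Path) -> dict:
    """Map digest -> path of the retained copy, for every file under the
    directory tree that is labeled as containing non-viruses."""
    return {
        file_digest(p): p
        for p in sorted(non_virus_dir.rglob("*"))
        if p.is_file()
    }


def split_incoming(incoming_dir: Path, known: dict):
    """Separate incoming files into (already known, still to be analyzed)."""
    rejected, remaining = [], []
    for p in sorted(incoming_dir.rglob("*")):
        if not p.is_file():
            continue
        if file_digest(p) in known:
            rejected.append(p)
        else:
            remaining.append(p)
    return rejected, remaining
```

Because the retained copy stays reachable through the database, a file that was wrongly labeled as innocent can still be pulled out and re-analyzed later.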


3. Replication

Once the collection that is being analyzed is unpacked and all known files and obvious garbage have been removed from it, what remains is mostly viruses. They are either new replicants of known viruses (replicants which are simply not yet present in our collection), or samples of new viruses. Recognizing the known viruses is relatively trivial: one just needs to run a couple of good, up–to–date scanners on them and examine the generated report files. This will be described in detail in the next section. When the known viruses have been recognized and sorted appropriately, we finally end up with files which very probably contain mostly new viruses. The problem now is to figure out for which of them this is really so.

There is one ultimate proof of whether a program contains a virus. If part of its code is able to replicate and to attach itself to other programs, then it is a virus, regardless of what else it does. Unfortunately, if it does not replicate, this does not mean that it is not a virus. Some viruses are so poorly written that they require significant effort to replicate. The virus author has freely assumed that some set of conditions which happened to be true on his machine must also be true on all other machines, so his virus requires them in order to replicate. We say that viruses which are difficult to convince to replicate require "spoonfeeding".

The conditions on which these viruses rely in order to replicate can vary. For instance, some viruses (e.g., Ping Pong, Bebe) replicate only on a machine which has an 8088 CPU and do not run on anything more contemporary. Others require just the opposite: that the machine has an 80286 or above. The Uruguay.1 virus replicates only on machines with a V20 (an 8088 CPU clone). Some viruses (e.g., TenBytes) assume that the machine has 640 Kb of conventional memory and will not run on anything with less, e.g. a machine with 512 Kb of RAM.

Many viruses make assumptions about hard–coded addresses or undocumented structures in the operating system, and are therefore limited to a particular version of the operating system, or to a subset of versions. We have found that IBM's PC–DOS 3.30 is the most "virus compatible" version of the operating system; almost all "DOS version dependent" viruses are able to run under it. Examples of such viruses include Ping Pong, which does not recognize disk partitions formatted with DOS 4.0 or above; Darth Vader, which is supposed to infect files when they are copied but assumes that DOS uses a particular function call to open the destination file (since DOS 4.0 the command interpreter uses a different function call); and Dir-II, which does not run under DOS 5.0 or above and does not correctly infect the disk under Compaq DOS 3.31.

Other viruses impose limitations related to the file system. Some of the Rythem variants replicate only when they are not in the root directory. The Shifter virus infects only very large files. One of the Screaming Fist variants can correctly infect only files which are 17 bytes long. There may be other limitations too. For instance, the StarShip virus requires several conditions to be fulfilled, ranging from a particular version of the operating system to the presence of particular video controllers. On the 1st day of every month the Tony virus infects only files whose names begin with the letter 'B'; on the 2nd day it infects only files whose names begin with 'C', and so on.

Therefore, if a program does not replicate, it does not mean that it does not contain a virus. Maybe it is a virus which replicates only on Fridays, or which requires an XGA video controller, or which waits until the hard disk is 80% full. Or maybe it is just an intended virus, which is too buggy to replicate and which has not even been alpha–tested by its author. Such programs usually take most of our time. They have to be analyzed, and the bug (or the replication condition) has to be found, so that we can be 100% certain of whether the program is a virus or not.

For virus replication we use a sacrificial machine, running PC–DOS 3.30 and with 640 Kb of memory. Sometimes we change the DOS version—if analysis of the virus code has raised the suspicion that it is DOS version dependent. We also use a variety of goat files—do–nothing programs, which are given to the virus to infect. They have various sizes and names, because some viruses try to detect the goat files and refuse to infect them. In the future, we are planning to broaden our set of goat files to include different kinds of COM, EXE and SYS programs, tiny, middle–sized and large, plain and with internal overlays, Windows applications and so on. This will be useful in testing the detection rate of the anti–virus scanners—we have noted that several well–known scanners fail to detect the virus in all replicants (e.g., they don't look for it in SYS files—if the author of the anti–virus product has failed to notice that this particular virus can infect such files).
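
A set of goat files with varying sizes and names can be generated mechanically. The sketch below is a minimal illustration, not the tool actually used at the Virus Test Center; it assumes the simplest possible COM goat (a run of NOP instructions followed by INT 20h, the DOS call which terminates a COM program), and the function names and file names are hypothetical.

```python
from pathlib import Path

NOP = b"\x90"            # 8086 "no operation" instruction
INT_20H = b"\xCD\x20"    # DOS program-terminate call for COM programs
MAX_COM_SIZE = 0xFF00    # a COM image is loaded at offset 100h of one 64 Kb segment


def make_goat_com(size: int) -> bytes:
    """Build a do-nothing COM image of exactly `size` bytes:
    NOP padding followed by INT 20h."""
    if not len(INT_20H) <= size <= MAX_COM_SIZE:
        raise ValueError("size must be between 2 and 0xFF00 bytes")
    return NOP * (size - len(INT_20H)) + INT_20H


def write_goats(out_dir: Path, specs) -> list:
    """Write one goat file per (name, size) pair and return the paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, size in specs:
        path = out_dir / f"{name}.COM"
        path.write_bytes(make_goat_com(size))
        written.append(path)
    return written
```

For example, write_goats(Path("GOATS"), [("GOAT0001", 100), ("BAIT", 4000)]) would produce two goats of different sizes and names. EXE and SYS goats need proper headers and are not covered by this sketch.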

In fact, even if the program seems to replicate (i.e., infects some of our goat files), this still does not necessarily mean that it is a real virus. Some intended viruses contain bugs which let them attach themselves to the attacked files, but they either corrupt their own body or set the entry point of the file incorrectly, so that it never points to the right place in the virus code. This is why we must verify that the files infected on the first pass are able to replicate further; otherwise they are not real viruses. Examples include Vienna.963 (sets the entry point incorrectly), Virus-9 (corrupts itself), Druid (only the germ is able to replicate) and others.


4. Classification

Once all files which contain real viruses have been isolated, they have to be sorted and classified appropriately. In CARO (the Computer Anti–virus Researchers' Organization) we have agreed to classify the viruses into families, groups and variants, according to the similarity of their structure, code, and methods of infection—and not according to what the virus does, who has written it, or where it has been first detected, for instance.

This classification is expressed in a hierarchy, which is represented with a large directory tree (about 4,200 directories), the leaf directories of which contain the actual viruses. All files which are in the same directory are infected by one and the same virus variant, and if two viruses differ by at least one bit in their code or constant data parts, they are put into separate directories.

We are using different tools to help us in the process of virus classification.

First of all, we need to identify the previously known viruses—as they are already classified and the new samples have to be put in the same places of the classification hierarchy. As a primary tool, we are using the scanner FindVirus from Dr. Solomon's Anti–Virus ToolKit. We are always using the latest version—sometimes even a beta version, sent to us by Dr. Solomon. This scanner has an undocumented option, which forces it to perform exact identification of nearly all viruses it can find—and it can find thousands of them. Unfortunately, in this special mode the scanner tends to be rather slow—because it decrypts every encrypted virus, computes checksums of the constant parts of the virus body, and compares this "virus map" with one of the many maps which the scanner carries with itself. However, the speed is not important when sorting a virus collection—more important are the reliable results.

Unfortunately, the scanner sometimes fails to identify a virus exactly. In those cases we use our own virus mapping utility, called VIP (Virus Identification Program). It was developed from our design by Ivan Triffonov, a student in the Laboratory of Computer Virology at the Bulgarian Academy of Sciences, in Sofia, Bulgaria. VIP takes a file which is supposed to contain a particular virus and a text file containing the map of this virus. The map contains a description of the different areas in the virus (code, text, variable data, constant data, or unreferenced junk), together with their offsets from the beginning of the virus and checksums of the non–variable parts (code, text, and constants). VIP can even assist the user in creating the map: it can take a few infected files and try to guess what the map looks like. The rough map generated in this way can be further refined by the user, who can even add comments about the meaning of the different areas. However, VIP cannot be used as a scanner, because it is relatively slow and is able to handle only one virus at a time. Also, the current version still has problems with specifying the maps for some encrypted and/or polymorphic viruses, or for viruses which transfer control to their body in a non–standard way. We intend to improve these aspects of the program in the future.
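
The idea of a virus map can be illustrated with a small sketch. It is not VIP's actual map format or algorithm, neither of which is specified in this paper: the Area record, the kind labels, and the use of CRC-32 as the checksum are all assumptions made for the example. The sketch operates on an already isolated virus body, represented here as a plain byte string.

```python
import zlib
from dataclasses import dataclass

# Areas whose bytes must be identical in every sample of the same variant.
CONSTANT_KINDS = {"code", "text", "const"}


@dataclass(frozen=True)
class Area:
    kind: str      # "code", "text", "const", "var" or "junk"
    offset: int    # offset from the beginning of the virus body
    length: int
    crc: int = 0   # CRC-32 of the area; used only for constant kinds


def make_map(body: bytes, layout):
    """Build a map from (kind, offset, length) triples,
    checksumming only the non-variable areas."""
    areas = []
    for kind, off, ln in layout:
        crc = zlib.crc32(body[off:off + ln]) if kind in CONSTANT_KINDS else 0
        areas.append(Area(kind, off, ln, crc))
    return areas


def matches(body: bytes, virus_map) -> bool:
    """True if every non-variable area of `body` has the recorded checksum."""
    for area in virus_map:
        if area.kind not in CONSTANT_KINDS:
            continue
        chunk = body[area.offset:area.offset + area.length]
        if len(chunk) != area.length or zlib.crc32(chunk) != area.crc:
            return False
    return True
```

Two samples that differ only in a "var" area match the same map, while a single changed byte in a "code", "text" or "const" area makes the match fail, which corresponds to treating the sample as a different variant.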

If the virus identification tools turn out to be unable to identify the virus, it is very likely that it is either a new variant of one of the known viruses, or a completely new virus. In order to detect the viruses which belong to the former group (new variants of the known viruses), we are using a different scanner—Fridrik Skulason's F-PROT. Our experience has shown that this scanner is one of the best existing ones for this task—to determine whether a new virus is a variant of one of the already known ones.

Nevertheless, F-PROT's diagnosis is nothing more than a pretty good guess. We often need to inspect the virus manually with a debugger, in order to determine whether it really belongs to one of the already existing virus families. Sometimes even this does not help us to decide, and then we use yet another tool developed by Fridrik Skulason. It takes two files and computes the percentage of substrings of a particular length which occur in both files. In order to make the virus suitable for this kind of examination, it must be "purified". That is, the virus has to be cut out from the host file it has infected, and its data areas have to be filled with zeroes. This usually requires some effort and analysis of the code, but the results are usually worth the effort, because we then have a measurable indication of how similar a particular virus is to the other members of the same family.
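
A similarity measure of this kind can be sketched as follows. This is not Skulason's tool, whose exact definition is not given here; it is one plausible reading, in which the result is the percentage of the distinct n–byte substrings of the first file that also occur in the second. The function names, the default length of 8 bytes, and the decision to ignore the all–zero substring (which would otherwise match trivially between zero–filled data areas) are assumptions.

```python
def substrings(data: bytes, n: int) -> set:
    """Return the set of distinct n-byte substrings of `data`."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def shared_substring_percentage(a: bytes, b: bytes, n: int = 8) -> float:
    """Percentage of the distinct n-byte substrings of `a` that also occur
    in `b`, ignoring the substring made only of zero bytes."""
    zero = bytes(n)
    subs_a = substrings(a, n) - {zero}
    if not subs_a:
        return 0.0
    subs_b = substrings(b, n) - {zero}
    return 100.0 * len(subs_a & subs_b) / len(subs_a)
```

Note that this particular measure is not symmetric: comparing a short virus against a long one gives a different figure than the reverse comparison, so both directions may be worth computing.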

When the known viruses and the new variants of the known viruses have been successfully classified, and when we are sure that the rest are real viruses, then they are obviously completely new viruses. Usually there are not many of them; normally we find between a couple and half a dozen in a large, unkempt collection. Obviously, writing a virus from scratch is still too difficult a task for most people, so they prefer to produce copycat viruses or to use one of the existing virus generators. This suits us perfectly, because it means that most of the good scanners are already able to detect their viruses.

The new viruses have to be analyzed, named, and classified into new virus families. The naming is done according to the CARO naming scheme, the full description of which has been published elsewhere and is outside the scope of this paper [CARO].

During the virus analysis we also use different kinds of tools: mainly a disassembler (Sourcer from V Communications), debuggers (AFD Pro, Turbo Debugger, DOS' DEBUG, PDVIM), editors (PE2, MultiEdit), assemblers (TASM, MASM, A86), file editors (FED, FIX) and so on. Many people seem to think that just running Sourcer on a virus produces a disassembly of this virus. Unfortunately, this is only the first step. The disassembly has to be further refined (Sourcer is one of the best disassemblers around, but sometimes generates amazingly stupid things), it must be verified that it assembles to the original binary file, and then the virus has to be understood and detailed comments must be added by hand, since Sourcer automatically adds only some primitive comments.


5. Tests

We often use our virus collection to check how well (or how badly) some scanners detect the viruses in it. Some people seem to think that in order to test an anti–virus product, you just need to run it on a virus collection. We have seen a lot of such "evaluations" published in computer magazines. In many cases, even the virus collection used is a messy one, containing non–viruses, first generation viruses, and so on.

First of all, the only part of an anti–virus product that can be tested this way is the scanner, and anti–virus products often include (or even rely mainly on) other parts, which are completely independent of known–virus scanning. Second, the only thing that is easy to test is the detection of file infectors, since preparing and regularly updating all kinds of floppy disks infected with every boot sector virus we know of would take a lot of time and effort. Third, for a complete scanner test, one needs many samples of each virus, in different kinds of files. This is especially true for the polymorphic viruses; for a good test one needs hundreds, if not thousands, of samples of each of them. These would occupy too much disk space. Even now our collection occupies about 120 Mb. A really complete collection, with multiple samples of each virus in all kinds of files, would occupy gigabytes of disk space. There are some other important aspects of testing and evaluation of anti–virus software. Most of them are described in [Brunnstein].

Nevertheless, the results of running a scanner on our collection of file infectors can often be very useful to the producer of the scanner. They can reveal bugs in the scanner like unreliable detection (i.e., not all replicants of the virus are detected), false negatives (viruses which are not detected by the scanner at all), problems in the naming (how well the scanner adheres to the CARO naming scheme), identification problems (different viruses are reported with the same name or one and the same virus is reported with different names), multiple detections (reporting more than one virus when only a single one is present in the file), and so on. We are currently performing such test runs for most scanners produced by CARO members. In the future we intend to include more scanners and to make the results public. It should be well understood, however, that those are not complete evaluations of the particular product and not even professional scanner tests. They can give only an estimate of how well or how badly a scanner works.
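
Several of the problem classes listed above can be found mechanically once each sample is reduced to a pair of names. The following sketch assumes such pairs are available as (true CARO name, name reported by the scanner), with None standing for "nothing reported"; the function name, the result keys, and the placeholder virus names in the usage note are all hypothetical.

```python
from collections import defaultdict


def summarize(rows):
    """rows: iterable of (caro_name, reported_name or None), one per sample.

    Returns a dict with:
      missed     - viruses not detected in any sample (false negatives)
      unreliable - viruses detected in some samples but not in others
      split      - one virus reported under several different names
      merged     - one reported name used for several different viruses
    """
    reports_by_virus = defaultdict(list)
    for caro, reported in rows:
        reports_by_virus[caro].append(reported)

    missed, unreliable, split = [], [], {}
    viruses_by_report = defaultdict(set)
    for caro, names in reports_by_virus.items():
        found = {n for n in names if n is not None}
        if not found:
            missed.append(caro)
        elif None in names:
            unreliable.append(caro)
        if len(found) > 1:
            split[caro] = sorted(found)
        for n in found:
            viruses_by_report[n].add(caro)

    merged = {n: sorted(v) for n, v in viruses_by_report.items() if len(v) > 1}
    return {
        "missed": sorted(missed),
        "unreliable": sorted(unreliable),
        "split": split,
        "merged": merged,
    }
```

For instance, if two samples of a placeholder virus "Fam.A" yield one report and one miss, "Fam.A" is listed as unreliable; if a scanner also reports "Fam.B" under the name "Fam.A", that name appears under merged.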

In those test runs, we preprocess the raw report files generated by the scanners into a three–column report which contains the name of the file (with its directory path), the full CARO name of the virus in it, and the name reported by the scanner. Only the essential information provided by the scanner is kept, in order to make the report more readable. For instance, a report like "FILENAME.EXE has been identified to be infected by the Fish_6.B virus !!!" is converted into something like

FILENAME.EXE | Frodo.Fish_6.B | identified Fish_6.B |

This is performed automatically with a set of batch files, a few Unix utilities ported to MS–DOS (awk, sort, join, cut, paste, grep) and a set of AWK scripts.
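
The conversion of a single report line can be illustrated in a few lines. This is a sketch in Python, not the batch files and AWK scripts actually used; the regular expression covers only the example message quoted above (each real scanner needs its own pattern), and the function name to_row and the caro_names lookup table are assumptions.

```python
import re
from typing import Optional

# Matches only the example message shown above; real scanners differ.
REPORT_LINE = re.compile(
    r"^(?P<path>\S+) has been (?P<verb>identified) "
    r"to be infected by the (?P<name>\S+) virus"
)


def to_row(line: str, caro_names: dict) -> Optional[str]:
    """Convert one raw report line into a three-column row, or return
    None if the line does not look like a detection message.

    caro_names maps each file path to the full CARO name of the virus
    known to be in that file ("?" is used when the path is not listed).
    """
    m = REPORT_LINE.match(line.strip())
    if m is None:
        return None
    path = m["path"]
    caro = caro_names.get(path, "?")
    return f"{path} | {caro} | {m['verb']} {m['name']} |"
```

Lines which are not detection messages (banners, summaries, decorative text) yield None and can simply be dropped, which is what keeps only the essential information in the final report.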

However, for the above to be possible, the scanners must satisfy a set of conditions, i.e. they must be "testable". For instance, they must be able to generate a report file (some aren't); they must be able to run on a huge subdirectory tree (even some file managers crash when presented with our 4,200 subdirectories); they must not keep the report file in memory (some scanners are stupid enough to attempt that, and of course there is not enough memory to keep 12,000 entries in); they must output the name of the directory being scanned somewhere at the beginning of the report file (some put it at the end); the user must be able to specify an arbitrary name for the report file (some scanners use fixed names); the scanner must be able to output in the report file the names of all files being scanned (many scanners can output only the names of the files which they consider to be infected); the full paths of the files must be present (some scanners output only the file names, without the name of the directory they are in); and so on.

Testing scanners which do not have the capabilities listed above is very troublesome and requires a lot of effort. Therefore, for the time being, we do not intend to provide this kind of service to their producers. The scanners which we are currently able to handle are F-PROT, FindVirus, TbScan, IBM Antivirus/DOS, UTScan, AVP, VET, HTScan, AVScan, and PCVP.


6. Conclusion

Once a good, clean, rich, and well–organized collection of viruses is built, it has to be regularly maintained and updated. This is definitely not an easy job, and it requires a lot of effort, devotion, and qualification. Such a collection can be a useful tool for testing anti–virus programs and for systemizing the knowledge about computer viruses. The maintenance of such a collection is, however, only one of the tasks (and by no means the most difficult one) that a qualified anti–virus researcher must regularly perform.


7. References

[Timson] Timson, H., Virus Lab notebook, Virus News International, April 1992.
[CARO] Bontchev, V., Skulason, F., Solomon, A., A Virus Naming Convention.
[Brunnstein] Bontchev, V., Brunnstein, K., Jahn, W–D., Towards Antivirus Quality Evaluation, 3rd Annual EICAR Conference, Munich, 8 December 1992.
