Analysis and Maintenance of a Clean Virus Library

                Vesselin Bontchev, research associate
               Virus Test Center, University of Hamburg
             Vogt-Koelln-Str. 30, 22527 Hamburg, Germany
               bontchev@fbihh.informatik.uni-hamburg.de

    A well-maintained virus library, or as it is often called, a
    virus collection, is an important tool to the anti-virus
    researcher. It can be used to test anti-virus software, to
    systematize the knowledge about the thousands of currently
    existing viruses, as a basis of information exchange with
    other anti-virus researchers and so on. However, the
    creation of such a collection and its maintenance in a clean
    and well-ordered state is not a trivial task, especially with
    the huge number of currently existing viruses and new ones
    popping up literally every day. This paper describes the
    major guidelines and procedures used to maintain the virus
    collection in the Virus Test Center Hamburg.

1. Introduction.

Many people who like to call themselves "virus researchers" seem to
think that keeping a big collection of viruses is enough to qualify
them as such. Yet, they often fail to achieve even this task. Many
of the so-called "virus collections" that we have seen being
distributed in the computer underground, are in a very sorry state.
They contain huge numbers of viruses, non-viruses, trojan horses, joke
programs, intended viruses (programs written with the obvious intent
to write a virus, but too buggy to replicate even once), corrupted
files, text files, virus creation tools, completely innocent files,
and so on.

A typical example of this is the so-called virus collection,
distributed (sometimes even sold) by John Buchanan. Several people in
the anti-virus research community have received it (some have even
paid for it) and have spent an enormous amount of time and effort
just to discover that it consists mainly of junk. Yes, it contains
thousands of files, megabytes of data. Many of those files are
viruses - mostly well-known ones, or trivial modifications of the
well-known ones, or even (from time to time) completely new ones. The
hard thing is to find out which ones they are; i.e., to sieve the
important things from the garbage.

Often, after the painful job of weeding is over, one wonders whether
the results have been worth the effort. Very often they aren't - but
one never knows in advance. Yet the task often has to be done over
and over - every now and then some "helpful" soul downloads such a
"collection" from the virus exchange BBSes and sends it to us - with
an intent to help, of course.

At the VTC-Hamburg we have to perform the task of weeding garbage
virus collections once or twice per month on average. In this paper we
shall try to describe the procedures that we use to maintain our virus
collection in a well-organized state when analyzing new virus
collections and merging them into it.

The first thing to do when you receive a new virus collection is to
remove the garbage from it. This is also often the most difficult
task. Then the new viruses found in it must be replicated, classified
and merged with your own virus collection. Only then can the resulting
updated collection be used for testing anti-virus software, for
sending to other anti-virus researchers, and so on.

2. Weeding.

2.1. Unpacking.

The times when the number of existing viruses was half a dozen and
they all fitted nicely on a single floppy disk have gone forever.
Nowadays, a typical collection consists of thousands of files (as of
the writing of this paper - June 1993 - there are more than 2,800
known viruses), which occupy megabytes of disk space. That is why
some form of compression and archiving is used to save space when
transporting the collection. The problem is that there is no standard
scheme being used.

The first thing to do when you receive a new virus collection is to
unpack it. Unfortunately, this is often less trivial than it sounds.
First of all, one needs a lot of free disk space, megabytes of free
disk space. Not only for the collection itself - sometimes the
collection can be found in a huge archive, which is itself stored on
multiple floppy disks, using some kind of backup program. Therefore,
one needs an amount of free disk space equal to the size of the
collection in unpacked form plus the size of the archive that contains
it. A collection of backup/restore programs is also desirable, in
case the archive with viruses has been stored on diskettes using some
kind of weird backup program, e.g., one that requires a particular
version of DOS in order to run.

The next task is to unpack the archive. It is usually created with
PKZIP or ARJ, but any of the other popular archivers could also be
used, so we need to have all of them handy. When unpacking a ZIP
archive, one must remember that it may have a directory structure
which will be lost, unless the unarchiver is explicitly instructed to
preserve it. In the case of ARJ archives we often see multi-volume
archives, which span multiple floppies. They also must be treated
with care.

Often the whole archive is encrypted for security reasons. We have
established lists of passwords that the other anti-virus researchers
use when sending us viruses, but occasionally they decide to change
the password (again for security reasons) and we have to contact them
by telephone or fax and obtain the new password. Sometimes unpacking
the main archive reveals a set of other archives - often packed with a
different archiver and/or encrypted with different passwords. In some
cases, those come from an underground virus exchange BBS and the
password is unknown. Sometimes the encryption can be cracked or the
password guessed, deduced, or just found out by studying the other
files that come with the archive.

Even if the archives are not encrypted, unpacking them might pose some
problems. Often several archives contain files with different
contents, but with one and the same name - README is a favorite one.
Some archivers allow the user to specify an alternative destination
name for the file that is unpacked, if one with the original name
already exists on the disk. Others offer only the choice between
overwriting the existing file and not unpacking the new file at all.
In those cases one must write down the names of the files that are not
unpacked, then rename the existing ones, and then unpack the files
missed during the first pass.
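
The renaming procedure above can be automated. What follows is only a
minimal sketch in Python, not the tooling used at the VTC: it assumes
an unencrypted ZIP archive, and the names extract_without_overwrite
and unique_path are hypothetical. It preserves the directory structure
of the archive and gives a numeric suffix to any member whose name is
already taken, returning the list of renamed members for the record.

```python
import os
import zipfile


def unique_path(path):
    """Return path unchanged, or with a numeric suffix if it is taken."""
    if not os.path.exists(path):
        return path
    n = 1
    while os.path.exists(f"{path}.{n}"):
        n += 1
    return f"{path}.{n}"


def extract_without_overwrite(archive_path, dest_dir):
    """Unpack a ZIP archive into dest_dir without overwriting anything.

    Returns a list of (member name, path actually written) pairs for
    the members that had to be renamed.
    """
    renamed = []
    dest_root = os.path.abspath(dest_dir)
    with zipfile.ZipFile(archive_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            target = os.path.abspath(os.path.join(dest_root, info.filename))
            # Skip members whose stored path would land outside dest_dir.
            if not target.startswith(dest_root + os.sep):
                continue
            os.makedirs(os.path.dirname(target), exist_ok=True)
            final = unique_path(target)
            if final != target:
                renamed.append((info.filename, final))
            with open(final, "wb") as out:
                out.write(zf.read(info))
    return renamed
```

Unpacking two archives that both contain a README into the same
directory then yields README and README.1 instead of a lost file.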

2.2. Removing the duplicates.

Once the collection is unpacked, the next task is to remove everything
that you already have - everything that is not new. We have found it
useful to keep more than one infected file per virus. In fact, our
collection is a superset of all collections that have been sent to us
- all files in them have been merged in ours, with the duplicates
removed. Since the same virus collections are often sent to
us over and over (slightly updated each time), it is simpler to just
keep a copy of them and, when each new collection arrives, simply
remove the files of which we already have copies.

Bearing in mind that we often have to deal with thousands of files (at
the time of writing this paper, our own collection consists of more
than 12,000 different files), the task of spotting the duplicated ones
and removing them might seem an extremely difficult one. Fortunately,
there is a wonderful shareware utility, which helps us do that.

This is the program Duplicate File Locator (DFL) by William Ataras.
It is able to scan a whole hard disk and locate the duplicate files,
where the user can specify what exactly s/he means by "duplicate
files." This can be files with the same names, with the same first
part of the names, with the same contents, or even with the same CRC
checksum. The latter is in practice equivalent to comparing the files
by their contents, except that it is faster, and requires less memory.
DFL even "knows" the format of several popular archives and, when the
CRC-comparison mode is used, it is able to extract the value of the
CRC field from the archive without even unpacking it. This also tends
to speed things up a bit.
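
The CRC-comparison idea can be illustrated with a short Python sketch.
This is not DFL's actual algorithm, which we have not seen; it is an
illustration under the assumption that files are grouped by size and
CRC-32. The function names are hypothetical. The archive_crcs helper
shows how the CRC field stored in a ZIP directory can be read without
unpacking the members.

```python
import os
import zipfile
import zlib
from collections import defaultdict


def file_crc32(path, chunk_size=65536):
    """Compute the CRC-32 of a file, reading it in chunks."""
    crc = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF


def archive_crcs(zip_path):
    """Yield (name, size, CRC-32) from a ZIP directory, without unpacking."""
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            if not info.is_dir():
                yield info.filename, info.file_size, info.CRC


def find_duplicates(root):
    """Group the files under root that share both size and CRC-32."""
    groups = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            groups[(os.path.getsize(path), file_crc32(path))].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]
```

Two different files can in principle share a CRC-32, so a cautious
tool would confirm each group with a byte-by-byte comparison before
deleting anything.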

The size of the hard disk being scanned is not important - DFL is able
to create temporary files and use them as virtual memory, so any size
of hard disk and number of files is supported, provided that there is
enough free space for the temporary files. The speed of the program
is also impressive - in our experience it was able to sort out a dozen
thousand files in less than half an hour.

The program is able to handle only whole volumes, not single directory
trees, but this drawback can be easily circumvented by using the DOS
command SUBST and assigning drive letters to the subdirectory trees
you want to process.

Once DFL has scanned the specified volumes and has determined the
duplicates (according to the conditions specified by the user), it
also allows the duplicated files to be marked and deleted - even if
they reside inside an archive. One annoying limitation of the program
is that one must manually tag every duplicated file that has to be
removed - and those are often thousands of files. It would be much
simpler if the program allowed some way to mark all files residing on
a specified drive or in a specified directory path at once.

After the files that we already have are removed, the size of the
collection is greatly reduced - usually to about 10% of its original
size. However, this does not mean that everything that remains is a
new virus.

2.3. Corrupted files.

Many of the files that remain are simply corrupted. There are
different kinds of corruption that we usually observe. We don't know
what causes them, but they are often present in the unkempt
collections, even though they are usually easy to spot.

One kind of corruption consists of the beginning of the file being
overwritten by some text characters, seeming to be randomly typed from
the keyboard. In other cases, the first byte of the file seems to be
missing, as if somebody has cut it out and shifted the whole file one
byte to the beginning. We say that such files are "out of phase."

Another kind of corruption consists of the entry point of the file
pointing outside the loadable part of the file. Or, the file is
infected (that is, a virus is appended to it), but the original bytes
of the file are restored and the virus never gets control. This is
probably the result of running some bad anti-virus program on the
infected file. Depending on how they work, the different scanners
report different things when used on such "partially disinfected"
files. Some of them say that the files are not infected - which is,
strictly speaking, correct, because the virus in them never gets the
chance to execute. Other scanners claim that a new variant of the
virus is present in the file. Yet others say simply that the file is
infected by a particular virus. All such cases have to be
investigated (usually - by loading the suspected file with a debugger
and inspecting it) and if the file is corrupted, it has to be removed.
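
For EXE files, the first kind of corruption mentioned above (an entry
point outside the loadable part of the file) can be pre-screened
before reaching for the debugger. The Python sketch below reads the
standard MZ header fields; the function name is hypothetical, and the
check is only a filter for candidates, not a replacement for manual
inspection.

```python
import struct


def exe_entry_outside_image(data):
    """Check whether an MZ file's entry point lies outside its load image.

    Returns None if data does not start with an MZ header, True if the
    entry point is outside the loadable part, and False otherwise.
    """
    if len(data) < 0x1C or data[:2] not in (b"MZ", b"ZM"):
        return None
    # Offsets 2..9: bytes in last page, page count, relocation count,
    # header size in 16-byte paragraphs.
    last_page, pages, _relocs, header_paras = struct.unpack_from("<4H", data, 2)
    # Offsets 0x14 and 0x16: initial IP and (relative) CS.
    ip, cs = struct.unpack_from("<2H", data, 0x14)

    image_end = pages * 512
    if last_page:
        image_end -= 512 - last_page
    header_size = header_paras * 16
    # The loadable part cannot extend past the bytes actually present.
    load_size = min(image_end, len(data)) - header_size

    # Real-mode address arithmetic wraps around at 1 MB.
    entry = ((cs << 4) + ip) & 0xFFFFF
    return entry >= load_size
```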

In very rare cases, the virus code is slightly damaged. Sometimes
this prevents the virus from working at all, sometimes only the
payload is not working, and sometimes an entirely new, perfectly
working variant of the virus is produced. Relatively often the
corruption is caused by overwriting the replication part of the virus
with a single INT3 instruction. This usually indicates that somebody
has tried to replicate the virus with a debugger. Of course, viruses
which are overwritten in such a way are not functional and have to be
removed.

2.4. Envelopes.

Very often some of the virus samples in the collections that appear on
the underground virus exchange BBSes are packaged with some kind of
envelope. Such enveloping always modifies the original file in some
way and such a modified file usually escapes the initial sieving phase
when the duplicate files are eliminated. There are different kinds of
envelopes.

In the simplest case, it is just some string appended to the file.
The most popular ones are "MsDos" (appended by the anti-virus package
TNTVIRUS), a "signature" from Todor Todorov's Virus eXchange BBS, and
another two signatures from a virus exchange BBS, called Arrested
Development. Less popular is the 10-byte checksum, added to the files
by McAfee's program VIRUSCAN, when used with the option /AV. In some
cases we have seen several such signatures appended to one and the
same file. We don't know the reason for such "signing" (maybe
it is an incompetent attempt to trace the distribution of the file?),
but it often causes some scanners to report a known virus as a new
variant. Therefore, we are using a small program, which detects those
signatures and automatically removes them from the files.
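
The idea behind such a tool can be sketched in a few lines of Python.
This is not the program we use; it is an illustration only. The
"MsDos" string comes from the description above, while the list
KNOWN_TRAILERS and the function name strip_trailers are hypothetical,
and the exact bytes of the other signatures would have to be taken
from real samples.

```python
# "MsDos" is the string mentioned in the text; the byte patterns of the
# other signatures are not given here and would have to be added.
KNOWN_TRAILERS = [b"MsDos"]


def strip_trailers(data, trailers=KNOWN_TRAILERS):
    """Remove known signature strings from the end of a file image.

    Several signatures may be stacked on one file, so stripping is
    repeated until none of the known trailers matches any more.
    Returns (stripped data, list of removed trailers in removal order).
    """
    removed = []
    changed = True
    while changed:
        changed = False
        for trailer in trailers:
            if trailer and data.endswith(trailer):
                data = data[:-len(trailer)]
                removed.append(trailer)
                changed = True
    return data, removed
```

A file could legitimately end with one of these byte sequences, so it
is prudent to keep the original until the stripped copy has been
identified.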

Another kind of packaging that modifies the initial file is to
compress this file with one of the existing executable file
compressors - LZEXE, PKLite, Diet, etc. For instance, the executable
file compressor, known as ICE, was the favorite one used by the
Italian virus writer known as Cracker Jack to distribute the first
generation samples of his viruses.

Such compressed files have to be detected, recognized, and removed.
Failing to do so may lead to a confusion like the "Ramvirus",
described in Patricia Hoffman's hypertext document VSUM. In reality,
this is a file infected with both the Jerusalem and Cascade viruses
and then compressed with Diet. Since the compression alters the image
of the viruses present in the file, many scanners will not be able to
detect the virus, even if it is known to them.

There are several scanners which are able to scan inside the files
compressed with most of the popular compression programs, but none of
them is able to handle all existing compressing schemes. We therefore
prefer to use a separate shareware program, UNP, which is specialized
in restoring the files compressed or scrambled in some way. It is
able to handle all currently existing compression and scrambling
schemes. However, care has to be taken, because some viruses actually
spread in compressed form, and uncompressing them would create a new,
different virus.

Finally, the most sophisticated kind of packaging used are the
so-called immunization modules. These modules are small pieces of
code that are attached to the executable files much like a virus.
They have the task to check the file integrity at runtime and
sometimes even to automatically restore the file, if it seems to be
infected. The modules that we encounter most often are those added by
CPAV, F-Xlock (a part of older versions of Fridrik Skulason's F-Prot),
and VIRUSCAN (when used with the /AG option).

We don't know why the samples in the virus collections have these
modules so often attached to them. Probably the reason is that
somebody runs a scanner (in order to determine which viruses are
present in the collection) in some default mode that adds these
modules automatically.

Those modules are generally a bad idea. They are not able to detect
an infection by a stealth virus. They also modify the
"protected" program and thus can prevent it from running - e.g., if
this is a self-checking program. Also, some executables just cannot
be immunized, because they refuse to run if something is appended to
them. Examples of such programs include Windows applications, files
with internal overlay structure, and so on.

The virus samples which are "immunized" in this way are not only
modified (and thus evade the initial sieving phase, which removes the
duplicate files), but also sometimes the "immunization" effectively
"hides" the virus from the scanners and they stop recognizing the file
as infected.

Again, in those cases we use specialized tools to remove the envelope.
Often the product that appends the modules to the executable files has
an option to remove them. Some of the scanners that we use have an
undocumented option that forces them to remove some of these modules -
much in the same way as they remove a virus infection.

One special case of envelopment is when a virus sample is additionally
infected by another virus. This sometimes happens when the
"collector" accidentally releases a virus and gets the files of his
collection infected by that virus.

Most of the scanners that we use in our work have the option of
removing the viruses from the infected files one by one, peeling them
like an onion. This is useful, when the virus that interests us is
deeper in the file than the enveloping infection. However, in some
rare cases we are interested in the virus that has last infected the
file and is therefore at the outermost level.

In those cases we separate the viruses, using several different
techniques. If the two viruses in the file happen to infect under
different conditions, then the task of separating them is a relatively
easy one. One has just to create a set of conditions, under which
only the virus that interests us will infect. For instance, if the
"interesting" virus infects only COM files, while the "uninteresting"
one infects only EXE files, then it is sufficient to provide the
viruses with COM files only to infect and only the virus that
interests us will infect them.

Unfortunately, in some cases the two viruses infect under one and the
same conditions and it is impossible to separate them in that way.
Then we use some kind of debugger or binary file editor to patch the
virus that does not interest us in such a way that it becomes unable
to replicate. Only then is it possible to extract the "interesting"
virus.

2.5. Non-viruses.

Once all files are unpacked, the duplicates and the corruptions
removed, and the envelopes "peeled", one should not assume that
everything that remains is a virus. Very often the collectors gather
programs that are not viruses, but which they feel belong to a virus
collection.
Examples of such programs are Trojan horses, joke programs, demos
(programs that demonstrate some cute effect of a famous virus), first
generation viruses, utility programs, and so on.

We often use our collection to test the detection rate of the
scanners. By definition, a scanner is a program to detect viruses.
It would be unfair to measure its detection rate on files which are
not viruses and report them as "missed" by the scanner. Therefore, we
must make sure that the viruses and the non-virus programs in our
collection are kept separated.
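
The effect of mixing categories on a measured detection rate is easy
to show with a small Python example. The sample names, the category
labels, and the function detection_rate are all hypothetical; only
files labeled as viruses are counted.

```python
def detection_rate(samples, detected):
    """Fraction of virus samples flagged by a scanner.

    samples  - dict mapping a sample name to its category label
    detected - set of sample names that the scanner reported
    Non-virus files are excluded from both numerator and denominator.
    """
    viruses = [name for name, category in samples.items() if category == "virus"]
    if not viruses:
        raise ValueError("no virus samples to measure against")
    hits = sum(1 for name in viruses if name in detected)
    return hits / len(viruses)


samples = {
    "v1.com": "virus",
    "v2.com": "virus",
    "v3.exe": "virus",
    "v4.exe": "virus",
    "wipe.exe": "trojan",
}
detected = {"v1.com", "v2.com", "v3.exe"}
rate = detection_rate(samples, detected)  # 3 of 4 viruses detected
```

Counting the Trojan as a "missed virus" would lower the figure from
3/4 to 3/5 without saying anything about the scanner's ability to
detect viruses.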

The prevailing number of non-virus files in the virus collections are
Trojan horses. By definition, a Trojan horse is a program that claims
to perform some useful functions, while at the same time
intentionally performing some harmful ones. Many of the Trojans that we see
barely fulfill the above definition. Most of them are extremely
mediocre programs, which try to format the hard disk (very often -
unsuccessfully, due to incompetence of their author), or to delete all
files in the current directory, or to wipe out the file allocation
table, and so on. The "useful function" part is often completely
ignored.

Often, it is possible to figure out whether a file contains a Trojan
by just examining it with a file browser. The typical Trojan is a
compiled program, written in a high-level language (usually C or
Pascal, but we have also seen BASIC and even Modula-2), which contains
some offending message - a message that is supposed to be displayed
after the Trojan is run. Of course, this is not mandatory, because it
is trivial to keep the message encrypted. However, most people who
write Trojans are so incompetent, that they rarely do that. In fact,
very often they do not even strip the debugging information at the end
of the compiled program, so one can easily see the symbolic names of
the used variables and routines. Often they provide a hint about what
the Trojan does - e.g., if it is a program written in C and the
abswrite() function is used, then it is quite likely that the Trojan
overwrites part of the disk.
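
What the file browser shows can be approximated by extracting the
printable strings from a file, as in the Python sketch below. The
function names and the default keyword list are hypothetical;
"abswrite" is the only hint taken from the text above. A hit is only a
reason to look more closely, never a verdict.

```python
import re


def printable_strings(data, min_len=5):
    """Return the runs of printable ASCII of at least min_len characters."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]


def suspicious_hints(data, keywords=("abswrite",)):
    """Return the extracted strings that contain one of the keywords."""
    hits = []
    for text in printable_strings(data):
        lowered = text.lower()
        if any(keyword in lowered for keyword in keywords):
            hits.append(text)
    return hits
```

Encrypted messages and stripped symbol tables defeat this approach, so
an empty result proves nothing.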

Sometimes the Trojan programs are accompanied by a text file, which is
a parody of real documentation and has the goal of convincing the
user to execute the program. Very often those text files are written
so incompetently (obviously by adolescent kids), that it is fairly
obvious that they cannot describe a real, professional product.
Nevertheless all those Trojan horses keep appearing in the virus
collections, so they have to be isolated and separated from the
virus-infected files. We do not attempt to collect or classify such
programs; we merely keep them for reference for the process that
removes from the unsorted collections the files that we already have.

Another kind of malicious software which we find sometimes in the
virus collections are the so-called joke programs. Those are programs
which appear to do something harmful, but in practice do not perform
the claimed destructive action. For instance, one of the popular joke
programs displays an animated picture of bugs which "eat" the text
from the screen. When the user presses a key, the bugs disappear and
the original contents of the screen are restored. Another joke program
reverses the screen image, while yet another one claims to format the
hard disk but performs non-destructive disk accesses instead.

Some authors prefer to classify the joke programs as a kind of Trojan
horses. Indeed, whether you will find their effect malicious
or funny largely depends on your sense of humor and on whether you
have been made a victim of the particular joke. Regardless of how they
are classified, however, one thing is certain - they are not viruses
and have to be separated from the infected files in the virus
collection. Fortunately, there are not so many jokes floating around,
and their recognition is a relatively easy task.

Another kind of program that we often see mixed with viruses in the
unkempt collections are the kind of programs that demonstrate the
visual or sound effects of particular viruses. While those programs
are relatively harmless, they do contain parts of the virus code.
Occasionally, a scanner will use a scan string for a particular virus
from the payload code of the virus. It wouldn't be surprising then if
the scanner detects the innocent program that demonstrates the payload
of the virus as "infected." Maybe the existence of such scanners
explains why we sometimes see those demonstration programs mixed up
with viruses in the collections. Most collectors are unable to
disassemble a program and to determine whether it is a virus or not,
so they just run their favorite scanner on their collection and
blindly believe its reports. This is one more reason why the
demonstration programs must be detected and removed from a
well-organized virus collection - after all, they are not viruses.

Lately, we have observed another kind of weed in the virus collections
that we receive. Doren Rosenthal from Rosenthal Engineering has
produced (and distributes as shareware) a program which he calls a
"virus simulator." The program generates executable files, which are
not viruses themselves, but which contain tiny pieces of code picked
from different viruses - the parts of the virus code used as scan
strings in different scanners. The idea is that those files can be
used to "test" how well the scanners can detect infections, or to
demonstrate in a safe environment what happens when a scanner detects
a virus and triggers the alarm. This line of reasoning assumes two
main premises, both of which are untrue - that each virus has a unique
signature somehow "attached" to it, and that all scanners work by
simply looking for this signature in the files. Such reasoning
demonstrates a deep lack of knowledge about how the modern scanning
technology works. The final result is that, of course, those
"simulation files" are completely useless for the purposes of
anti-virus products testing. In fact, they are even harmful, because
they can trick the user into a false sense of security.

The fact is, however, that some of them really do trigger some scanners.
This only increases the general confusion, because the incompetent
collectors incorrectly conclude that the files contain viruses. After
all, if the scanner says that the file contains a virus, it should be
so, right? Unfortunately, this reasoning is completely wrong and the
net result of it is that our time is wasted on analyzing silly
"simulated viruses", in order to detect them and to remove them from
the virus collections.

Occasionally, a badly maintained virus collection contains files with
executable extensions (COM or EXE), which do not contain executable
files. In some cases they contain images of infected boot sectors.
Obviously, somebody has saved the image of a boot sector virus in a
file with executable extension, with the intent to disassemble it
further with some kind of disassembler or debugger, which is able to
work only with files with executable extensions.

Sometimes, however, those files contain pure ASCII text. Two popular
files which we keep seeing are HI.COM (containing a worm
program for VAX/VMS written in DCL) and CHRISTMA.EXE (containing the
text of the chain letter program written in REXX, which was widely
spread on BITNET several years ago).

Usually it is relatively trivial to recognize such files and to remove
them. We use a file browsing utility to inspect the contents of the
files. This utility displays the file contents both in ASCII
characters and in hexadecimal. The files which contain pure text are
trivial to spot on inspection, and the images of boot sector viruses
usually have a particular format, which also makes them relatively
easy to detect. Sometimes there are additional levels of concealing -
e.g., a text file, containing the description of a virus, is converted
into a program that displays it with the popular utility TXT2COM and
then is compressed with PKLite or some other executable file
compressor program. This makes the detection of such files slightly
more difficult, but even that becomes easy after one acquires some
experience.
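
A first pass over such files can be automated before the manual
inspection. The Python sketch below rests on two assumptions that are
not taken from our procedures: that a file consisting almost entirely
of printable characters is text, and that a boot sector image is
exactly 512 bytes long and ends with the usual 55h AAh marker. All
names and the threshold are hypothetical, and files wrapped with
TXT2COM or an executable compressor will not be recognized this way.

```python
def looks_like_text(data, threshold=0.95):
    """True if nearly all bytes are printable ASCII, TAB, CR, or LF."""
    if not data:
        return False
    printable = sum(1 for b in data if 32 <= b < 127 or b in (9, 10, 13))
    return printable / len(data) >= threshold


def looks_like_boot_sector(data):
    """True for a 512-byte image ending with the 55h AAh marker."""
    return len(data) == 512 and data[510:512] == b"\x55\xaa"


def classify_payload(data):
    """Rough guess at what a file with a COM or EXE extension contains."""
    if data[:2] in (b"MZ", b"ZM"):
        return "exe"
    if looks_like_boot_sector(data):
        return "boot-image"
    if looks_like_text(data):
        return "text"
    return "unknown"
```

Everything reported as "unknown" still has to be examined by hand; a
genuine COM program has no header to recognize it by.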

However, undoubtedly the most often seen kind of "irregular" virus
samples in the virus collections are the so-called first generation
viruses. Those are programs which do contain virus code and will
release it on execution, but which are not infected in a normal way.
Some authors consider them as kind of Trojan horses, the harmful
function of which is to release a virus. We, however, prefer to classify
them separately and to split them into several groups - germs,
droppers, and injectors.

Germs are the programs which are produced by assembling the original
source code of a virus and which cannot appear in the normal way of
virus infection. For instance, a virus could contain a limitation to
infect only files larger than a particular size, e.g., 1,000 bytes.
Obviously, such a virus cannot infect a 10-byte program in a natural
way. But, if one has the source code of the virus (or a good
disassembly of it), it is very easy to put only a short host program
in the beginning of the virus and to assemble the whole thing. The
resulting program will release the virus, if executed. Even some
anti-virus programs will be able to "disinfect" it by removing the
virus and leaving only the tiny host program. However, such a file
will never occur during the normal replication process of the virus.

Sometimes an easy way to detect the germs is to look at the beginning
of the file. The germs are almost exclusively found in COM files.
Many viruses infect such files by placing a JMP instruction in the
beginning of the file, which instruction transfers control to the
virus body. If the file is infected in a normal way, this will be
always a JMP Near instruction (opcode 0E9h). However, if the virus is
assembled from source and is attached to a tiny host program, the
assembler usually optimizes the JMP instructions and generates JMP
Short instructions (opcode 0EBh). Therefore, if the first byte of an
infected COM file contains 0EBh, chances are that this is a germ.
However, this should be used only as a heuristic; not as a strict
rule. There are viruses which don't use a JMP instruction to transfer
control to their body and there are viruses (usually of the prepending
type) which begin with 0EBh in all infected files.
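
The heuristic can be written down as a short Python sketch. The
function names are hypothetical; the opcodes 0E9h and 0EBh are those
given above. The jump target is reported as an offset from the
beginning of the file, which helps when locating the virus body.

```python
JMP_NEAR = 0xE9   # 3-byte jump with a 16-bit displacement
JMP_SHORT = 0xEB  # 2-byte jump with a signed 8-bit displacement


def describe_com_entry(data):
    """Return (kind, target offset) for the first instruction of a COM file.

    kind is "near", "short", or "none"; the target is None when the
    file does not begin with a jump.
    """
    if len(data) >= 3 and data[0] == JMP_NEAR:
        disp = int.from_bytes(data[1:3], "little")
        return "near", (3 + disp) & 0xFFFF  # offsets wrap within the segment
    if len(data) >= 2 and data[0] == JMP_SHORT:
        disp = int.from_bytes(data[1:2], "little", signed=True)
        return "short", 2 + disp
    return "none", None


def may_be_germ(data):
    """Heuristic only: an infected COM file that starts with a short jump."""
    return describe_com_entry(data)[0] == "short"
```

As noted above, a positive answer only marks the file for closer
inspection, because some viruses begin with 0EBh in every infected
file.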

The second type of first-generation virus programs are the droppers.
These are programs, which install a boot sector virus on a floppy disk
- usually, after the confirmation of the user. Since their main goal
is not to conceal their actions, but merely to provide a convenient
means for transport of boot sector viruses, the droppers are
relatively easy to detect. Usually just inspecting the suspicious
file with a binary editor is sufficient - one can easily spot the
messages that describe what the program does and request the
confirmation from the user. The dropper programs are not normal
replicants of a virus, and should therefore be kept separate from the
main virus collection. We usually run them on a sacrificial machine,
then inspect the infected diskette that they create, in order to
determine exactly which virus is installed by the dropper.

The last variant of first-generation viruses are the so-called
injectors. They are probably most similar to the normal Trojan
horses. They are programs which are not infected themselves, but
which do contain a virus (often - in some concealed form or another)
and which release it on execution - usually unnoticeably to the user.

Injectors are the kind of non-natural virus samples which is the most
difficult to detect - because the virus writer has taken some steps to
conceal the fact that the program releases a virus. The easiest way
to deal with such files is to simply try to replicate them and then
compare the replicants of the virus with the original file.

Another kind of virus-like programs that often clutter the virus
collections but which are not viruses, are the so-called intended
viruses. Those are programs written with the obvious intent to write
a virus, but which are so buggy (usually due to the lack of
programming experience of their author), that they are just unable to
replicate. Examples include "viruses" which hang the computer on
execution, or which can infect only files with a particular length
(e.g., 17 bytes), or which fail to point the entry point of the file
to the virus body after the virus is appended to the file, and so on.
Usually it is a waste of time to try to find all bugs in such
programs. However, it is important to recognize them and to keep them
separated from the real virus samples. Usually, the easiest way to do
this is to try to replicate them and to check whether the replicants
are able to replicate further. If they are, then they contain a real
virus; otherwise, it is quite probably just yet another intended
virus, written by yet another incompetent "wannabe" virus writer.

Finally, the virus collections sometimes contain perfectly innocent
files. Some of them might have something to do with the virus
collection, and some may be completely unrelated to it. Examples
include archivers, programs which convert a diskette into a file
image, goat files used to attach viruses to them (but with no viruses
attached), programs which advertise (often with elaborate
pseudo-graphic pictures) the different virus exchange and/or pirate
BBSes, and so on. We have even seen such programs as FORMAT and
CHKDSK included - obviously somebody has noticed that they perform
direct disk access and has labeled them as "dangerous"...

While detecting, recognizing, and removing some of those non-virus
programs is relatively easy, finding all of them, especially in an
unkempt collection, can take a lot of time and effort. We solve the
problem by keeping a copy of those programs in separate directories,
which are clearly labeled as containing non-viruses. Since a copy of
the files is already present, any additional copies will get
automatically sieved out during the initial process of removing the
duplicates.

An alternative solution is to use a database of checksums of those
files and to automatically reject any file, the checksum of which is
already contained in the database. One drawback of this method is
that one might make a mistake and incorrectly label a virus as an
innocent or non-working program. If a copy of the original file is
kept, the mistake can still be discovered and corrected - after the
file is analyzed and its viral qualities are confirmed. This is a
significantly more difficult thing to do if only a checksum of the
file is kept.
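
The checksum approach can be sketched as follows. This is only an
illustration: the paper does not say which checksum is used, so SHA-256
is an assumption, and the names `file_digest` and `sieve` are
hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file (assumed checksum choice)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def sieve(incoming_dir, known_digests):
    """Split incoming files into (new, rejected) by checksum lookup.

    known_digests maps a digest to the label given to the file seen
    earlier, e.g. "non-virus" or "intended". Only the digest is stored,
    so a wrong label cannot be re-examined later without the file.
    """
    new, rejected = [], []
    for path in sorted(Path(incoming_dir).rglob("*")):
        if not path.is_file():
            continue
        digest = file_digest(path)
        if digest in known_digests:
            rejected.append((path, known_digests[digest]))
        else:
            new.append(path)
    return new, rejected
```

The sketch makes the drawback visible: a rejected file is identified
only by its digest and label, so a mislabeled virus cannot be analyzed
again unless a copy of the file was kept elsewhere.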

3. Replication.

Once the collection that is being analyzed is unpacked and all known
files and obvious garbage have been removed from it, what remains is
mostly viruses. They are either new replicants of the known viruses
(replicants, which just are not present in our collection), or they
are samples of new viruses. The recognition of the known viruses is
relatively trivial - one just needs to run a couple of good and
up-to-date scanners on them and to examine the generated report files.
This will be described in detail in the next section. When the known
viruses are recognized and sorted appropriately, we end up finally
with files which very probably contain mostly new viruses. The
problem now is to figure out for which of them this is really so.

There is one ultimate proof of whether a program contains a virus. If
part of its code is able to replicate and to attach itself to other
programs, then it is a virus, regardless of what else it does.
Unfortunately, if it doesn't replicate, this does not mean that it is
not a virus. Some viruses are so poorly written, that they require
significant effort to replicate. The virus author has freely assumed
that some set of conditions which have happened to be true on his
machine must be also true on all other machines, so his virus requires
them, in order to replicate. We say that viruses which are
difficult to persuade to replicate require "spoonfeeding."

The conditions on which these viruses rely in order to replicate
vary. For instance, some viruses (e.g., Ping Pong, Bebe) replicate
only on a machine which has an 8088 CPU and don't run on anything
more contemporary. Others require just the opposite - that the
machine has an 80286 or above. The Uruguay.1 virus replicates only
on machines with a V20 (an 8088 CPU clone). Some viruses (e.g.,
TenBytes) assume that the machine has 640 Kb of conventional memory
and will not run on a machine with less, e.g. one with 512 Kb of
RAM.

Many viruses make assumptions about hard-coded addresses or
undocumented structures in the operating system - and are therefore
limited to a particular version of the operating system - or a subset
of versions. We have found that IBM's PC-DOS 3.30 is the most "virus
compatible" version of the operating system - almost all "DOS version
dependent" viruses are able to run under it. Examples include Ping
Pong, which does not recognize disk partitions formatted with DOS 4.0
or above, Darth Vader, which is supposed to infect files when they are
copied but assumes that DOS uses a particular function call to open
the destination file (since DOS 4.0 the command interpreter uses a
different function call), Dir-II, which doesn't run under DOS 5.0 or
above and does not infect the disk correctly under Compaq DOS 3.31,
and so on.

Other viruses impose some limitations on the file system. Some of the
Rythem variants replicate only when they are not in the root
directory. The Shifter virus infects only very large files. One of
the Screaming Fist variants can infect correctly only files which are
17 bytes long. There might be some other limitations too - for
instance, the StarShip virus requires several conditions to be
fulfilled - from a particular version of the operating system, to the
presence of particular video controllers. On the 1st day of every
month, the Tony virus infects only files whose names begin with the
letter 'B'; on the 2nd day it infects only files whose names begin
with a 'C', and so on.

Therefore, if a program does not replicate, it doesn't mean that it
does not contain a virus. Maybe it is a virus which replicates only
on Fridays, or which requires an XGA video controller, or which waits
until the hard disk is 80% full. Or maybe it is just an intended
virus, which is too buggy to replicate and which has not even been
alpha-tested by its author. Such programs usually take most of our
time. They have to be analyzed, and the bug (or the replication
condition) has to be found - so that we can be 100% certain of
whether it is a virus or not.

For virus replication we use a sacrificial machine, running PC-DOS
3.30 and with 640 Kb of memory. Sometimes we change the DOS version -
if analysis of the virus code has raised the suspicion that it is DOS
version dependent. We also use a variety of goat files - do-nothing
programs, which are given to the virus to infect. They have various
sizes and names, because some viruses try to detect the goat files and
refuse to infect them. In the future, we are planning to broaden our
set of goat files to include different kinds of COM, EXE and SYS
programs, tiny, medium-sized and large, plain and with internal
overlays, Windows applications and so on. This will be useful in
testing the detection rate of the anti-virus scanners - we have noted
that several well-known scanners fail to detect the virus in all
replicants (e.g., they don't look for it in SYS files - if the author
of the anti-virus product has failed to notice that this particular
virus can infect such files).

In fact, even if the program seems to replicate (i.e., infects some of
our goat files), this still doesn't mean that it is a real virus.
Some intended viruses contain bugs, which make them attach themselves
to the attacked files, but either corrupt their own body, or set
the entry point of the file incorrectly - so that it never points to
the right place in the virus code. That is why we must verify that
the files infected on the first pass are able to replicate further -
otherwise they are not real viruses. Examples include Vienna.963
(sets the entry point incorrectly), Virus-9 (corrupts itself), Druid
(only the germ is able to replicate) and others.

4. Classification.

Once all files which contain real viruses have been isolated, they
have to be sorted and classified appropriately. In CARO (the Computer
Anti-virus Researchers' Organization) we have agreed to classify the
viruses into families, groups and variants, according to the
similarity of their structure, code, and methods of infection - and
not according to what the virus does, who has written it, or where it
has been first detected, for instance.

This classification is expressed in a hierarchy, which is represented
with a large directory tree (about 4,200 directories), the leaf
directories of which contain the actual viruses. All files which are
in the same directory are infected by one and the same virus variant,
and if two viruses differ by at least one bit in their code or
constant data parts, they are put into separate directories.

We are using different tools to help us in the process of virus
classification.

First of all, we need to identify the previously known viruses - as
they are already classified and the new samples have to be put in the
same places of the classification hierarchy. As a primary tool, we
are using the scanner FindVirus from Dr. Solomon's Anti-Virus
ToolKit. We always use the latest version - sometimes even a
beta version, sent to us by Dr. Solomon. This scanner has an
undocumented option, which forces it to perform exact identification
of nearly all viruses it can find - and it can find thousands of them.
Unfortunately, in this special mode the scanner tends to be rather
slow - because it decrypts every encrypted virus, computes checksums
of the constant parts of the virus body, and compares this "virus map"
with one of the many maps which the scanner carries with itself.
However, the speed is not important when sorting a virus collection -
more important are the reliable results.

Unfortunately, sometimes the scanner fails to identify some viruses
exactly. In those cases we are using our own virus mapping utility,
called VIP (Virus Identification Program). It was developed from our
design by Ivan Triffonov - a student in the Laboratory of Computer
Virology at the Bulgarian Academy of Sciences, in Sofia, Bulgaria.
VIP takes a file which is supposed to contain a particular virus and a
text file, containing the map of this virus. The map contains a
description of the different areas in the virus - code, text, variable
data, constant data, or unreferenced junk - together with their
offsets from the beginning of the virus, and checksums of the
non-variable parts (code, text, and constants). VIP can even assist
the user to create the map - it can take a few infected files and try
to guess what the map looks like. The rough map generated in this way
can be further refined by the user, who can even add comments about
the meaning of the different areas. However, VIP cannot be used as a
scanner - it is relatively slow and is able to handle only one virus
at a time. Also, the current version still has problems with
specifying the maps for some encrypted and/or polymorphic viruses, or
with viruses which transfer control to their body in a non-standard
way. We intend to improve these aspects of the program in the future.
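
The idea of a virus map can be illustrated with a small sketch. The
actual map file format of VIP is not described in this paper, so the
structure below (a list of areas, each with a kind, an offset, a length
and a CRC-32 checksum) and all names in it are assumptions made purely
for illustration. A harmless byte string stands in for the virus body.

```python
import zlib
from dataclasses import dataclass
from typing import Optional

CONSTANT_KINDS = {"code", "text", "const"}  # areas that must match exactly


@dataclass
class Area:
    kind: str                       # "code", "text", "const", "var" or "junk"
    offset: int                     # offset from the beginning of the virus
    length: int
    checksum: Optional[int] = None  # CRC-32; only set for constant kinds


def build_map(virus_body, layout):
    """Create a map from a known sample; layout is [(kind, offset, length)]."""
    areas = []
    for kind, offset, length in layout:
        crc = None
        if kind in CONSTANT_KINDS:
            crc = zlib.crc32(virus_body[offset:offset + length])
        areas.append(Area(kind, offset, length, crc))
    return areas


def matches(virus_body, areas):
    """True if every constant area of the sample has the mapped checksum."""
    for area in areas:
        if area.kind not in CONSTANT_KINDS:
            continue  # variable data and junk may differ between samples
        chunk = virus_body[area.offset:area.offset + area.length]
        if len(chunk) != area.length or zlib.crc32(chunk) != area.checksum:
            return False
    return True
```

With such a map, two samples that differ only in a variable area are
treated as the same variant, while a change of a single bit in a code,
text, or constant area makes the comparison fail.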

If the virus identification tools turn out to be unable to identify
the virus, it is very likely that it is either a new variant of one of
the known viruses, or a completely new virus. In order to detect the
viruses which belong to the former group (new variants of the known
viruses), we are using a different scanner - Fridrik Skulason's
F-Prot. Our experience has shown that this scanner is one of the best
existing ones in this task - to determine whether a new virus is a
variant of one of the already known ones.

Nevertheless, F-Prot's diagnosis is nothing more than a pretty good
guess. We often need to inspect the virus manually with a debugger,
in order to determine whether it really belongs to one of the already
existing virus families. Sometimes even this cannot help us to
decide, and then we are using yet another tool, developed by Fridrik
Skulason. It can take two files and compute the percentage of
substrings with a particular length, which occur in both files. In
order to make the virus suitable for this kind of examination, it must
be "purified." That is, the virus has to be cut out from the host file
it has infected, and its data areas have to be filled with zeroes.
This usually requires some effort and analysis of the code, but the
results are usually worth the effort - because then we have a
measurable indication of how similar a particular virus is to the
other members of the same family.
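
A measure of this kind can be sketched in a few lines. The exact
definition used by the tool is not given here, so the version below is
an assumption: it counts the distinct substrings of length `n` in the
first file and reports the percentage of them which also occur in the
second file. The function names and the default length of 8 are
hypothetical.

```python
def substrings(data, n):
    """Return the set of all distinct substrings of length n in data."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def similarity(a, b, n=8):
    """Percentage of distinct length-n substrings of a that also occur in b."""
    subs_a = substrings(a, n)
    if not subs_a:
        return 0.0
    common = subs_a & substrings(b, n)
    return 100.0 * len(common) / len(subs_a)
```

Note that this particular definition is not symmetric: `similarity(a, b)`
and `similarity(b, a)` can differ when the two purified bodies have
different lengths. It also shows why the data areas have to be zeroed
first - otherwise variable data would lower the score.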

When the known viruses and the new variants of the known viruses have
been successfully classified, and when we are sure that the rest are
real viruses, then they are obviously completely new viruses. Usually
they are not that many - normally we find a couple to half a dozen of
them in a large, unkempt collection. Obviously, writing a virus from
scratch is still too difficult a task for most people, so they prefer
to produce copycat viruses or to use one of the existing virus
generators. This perfectly suits us, because it means that most of
the good scanners are already able to detect their viruses.

The new viruses have to be successfully analyzed, named, and
classified in new virus families. The naming is done according to the
CARO naming scheme, the full description of which has been published
elsewhere and is out of the scope of this paper [CARO].

During the virus analysis we also use different kinds of tools -
mainly a disassembler (Sourcer from V Communications), debuggers (AFD
Pro, Turbo Debugger, DOS' DEBUG, PDVIM), editors (PE2, MultiEdit),
assemblers (TASM, MASM, A86), file editors (FED, FIX) and so on.
Many people seem to think that just running Sourcer on a virus
produces a disassembly of this virus. Unfortunately, this is only the
first step - the disassembly has to be further refined (Sourcer is one
of the best disassemblers around, but sometimes generates amazingly
stupid things), it must be verified that it assembles to the original
binary file, then the virus has to be understood and detailed comments
must be added by hand - Sourcer puts only some primitive comments
automatically.

5. Tests.

We often use our virus collection to check how well (or how badly) some
scanners detect the viruses in it. Some people seem to think that in
order to test an anti-virus product, you just need to run it on a
virus collection. We have seen a lot of such "evaluations" published
in the computer magazines. In many cases, even the virus collection
used is a messy one, containing non-viruses, first generation viruses,
and so on.

First of all, the only part of an anti-virus product that can be
tested this way is the scanner - and anti-virus products often
include (or even rely mainly on) different parts, which are completely
independent of known-virus scanning. Second, the only thing that is
easy to test is the detection of file infectors - since the
preparation and regular change of all kinds of floppy disks infected
with every boot sector virus we know of would take a lot of time and
effort. Third, for a complete scanner test, one needs many samples
of each virus - in different kinds of files. This is especially true
for the polymorphic viruses - for a good test one needs hundreds, if
not thousands of each of them. These would occupy too much disk
space. Even now our collection occupies about 120 Mb. A really
complete collection, with multiple samples of each virus in all kinds
of files would occupy gigabytes of disk space. There are some other
important aspects of testing and evaluation of anti-virus software.
Most of them are described in [Brunnstein].

Nevertheless, the results of running a scanner on our collection of
file infectors can often be very useful to the producer of the
scanner. It can reveal bugs in the scanner like unreliable detection
(i.e., not all replicants of the virus are detected), false negatives
(viruses which are not detected by the scanner at all), problems in
the naming (how well the scanner adheres to the CARO naming scheme),
identification problems (different viruses are reported with the same
name or one and the same virus is reported with different names),
multiple detections (reporting more than one virus when only a single
one is present in the file), and so on. We are currently performing
such test runs for most scanners produced by CARO members. In the
future we intend to include more scanners and to make the results
public. It should be well understood, however, that those are not
complete evaluations of the particular product and not even
professional scanner tests. They can give only an estimate of how
well or how badly a scanner works.
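
Some of the problems listed above can be found mechanically once, for
every file, both the CARO name of the virus in it and the name reported
by the scanner are known. The following sketch is only an illustration
under that assumption; the function name, the result keys, and the
virus names used with it are hypothetical, and multiple detections are
not covered.

```python
from collections import defaultdict


def summarize(rows):
    """rows: (file, caro_name, reported) tuples; reported is None if missed."""
    by_virus = defaultdict(list)
    for _file, caro_name, reported in rows:
        by_virus[caro_name].append(reported)

    missed, unreliable, many_names = [], [], []
    viruses_per_name = defaultdict(set)
    for virus, reports in sorted(by_virus.items()):
        hits = [r for r in reports if r is not None]
        if not hits:
            missed.append(virus)            # false negative
        elif len(hits) < len(reports):
            unreliable.append(virus)        # only some replicants detected
        if len(set(hits)) > 1:
            many_names.append(virus)        # one virus, several names
        for name in set(hits):
            viruses_per_name[name].add(virus)

    # One reported name used for several different viruses.
    shared_names = sorted(n for n, v in viruses_per_name.items() if len(v) > 1)
    return {"missed": missed, "unreliable": unreliable,
            "many_names": many_names, "shared_names": shared_names}
```

The grouping relies on the property of the collection described
earlier: all files in one leaf directory carry the same virus variant,
so every file has exactly one expected name.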

In those test runs, we preprocess the raw report files generated by
the scanners into a three-column report, which contains the name of the
file (with its directory path), the full CARO name of the virus in it,
and the name reported by the scanner. Only the essential information
provided by the scanner is kept, in order to make the report more
readable. For instance, a report like "FILENAME.EXE has been
identified to be infected by the Fish_6.B virus !!!" is converted
into something like

FILENAME.EXE | Frodo.Fish_6.B | identified Fish_6.B |

This is performed automatically with a set of batch files, a few Unix
utilities ported to MS-DOS (awk, sort, join, cut, paste, grep) and a
set of AWK scripts.
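
The actual conversion is done with batch files and AWK scripts which
are not reproduced here. Purely as an illustration of the idea, the
following Python sketch converts the single example message quoted
above. The regular expression, the function name, and the lookup table
that supplies the CARO name are all assumptions; every real scanner has
its own report format and needs its own pattern.

```python
import re

# Pattern for the one example message quoted in the text; real scanners
# each need their own pattern (this one is an assumption).
LINE = re.compile(
    r"^(?P<file>\S+) has been (?P<verb>identified) to be infected "
    r"by the (?P<name>\S+) virus"
)


def convert(raw_lines, caro_names):
    """Yield 'file | CARO name | scanner verdict |' rows.

    caro_names maps a file path to the CARO name under which the file
    is stored in the collection.
    """
    for line in raw_lines:
        m = LINE.match(line.strip())
        if m is None:
            continue  # banners, statistics and other noise are dropped
        path = m.group("file")
        verdict = f"{m.group('verb')} {m.group('name')}"
        yield f"{path} | {caro_names.get(path, '?')} | {verdict} |"
```

Lines which do not match the pattern are dropped silently, which is
what keeps only the essential information in the resulting report.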

However, for the above to be possible, the scanners must satisfy
a set of conditions, i.e. they must be "testable." For instance, they
must be able to generate a report file (some aren't); must be able to
run on a huge subdirectory tree (even some file managers crash when
presented with our 4,200 subdirectories); must not keep the report
file in memory (some scanners are stupid enough to attempt that, and
of course there is not enough memory to keep 12,000 entries in); must
output the
name of the directory being scanned somewhere in the beginning of the
report file (some put it at the end); the user must be able to specify
an arbitrary name for the report file (some scanners use fixed names);
the scanner must be able to output in the report file the names of all
files being scanned (many scanners can output only the names of the
files which they consider to be infected); the full paths of the files
must be present (some scanners output only the file names, without the
name of the directory they are in) and so on.

Testing scanners which do not have the capabilities listed above is
very troublesome and requires a lot of effort. Therefore, for the
time being, we do not intend to provide this kind of service to their
producers. The scanners which we are currently able to handle are
F-Prot, FindVirus, TbScan, IBM Antivirus/DOS, UTScan, AVP, VET,
HTScan, AVScan, and PCVP.

6. Conclusion.

Once a good, clean, rich, and well-organized collection of viruses is
built, it has to be regularly maintained and updated - definitely
not an easy job, which requires a lot of effort, devotion, and
qualification. Such a collection can be a useful tool for testing of
anti-virus programs and systemizing the knowledge about computer
viruses. The maintenance of such a collection is, however, only one
of the tasks (and by no means the most difficult one) that a qualified
anti-virus researcher must regularly perform.

References

[Timson]     Hariet Timson, "Virus Lab notebook", Virus News
             International, April 1992.

[CARO]       Vesselin Bontchev, Fridrik Skulason, Dr. Alan Solomon,
             "A Virus Naming Convention", available electronically via
             anonymous ftp as
             ftp.informatik.uni-hamburg.de:/pub/virus/texts/tests/naming.zip

[Brunnstein] Vesselin Bontchev, Klaus Brunnstein, Wolf-Dieter Jahn,
             "Towards Antivirus Quality Evaluation", 3rd annual EICAR
             Conference, Munich, 8 December, 1992.