Macro Virus Identification Problems

Vesselin Bontchev, anti–virus researcher
FRISK Software International
Postholf 7180, 127 Reykjavik, ICELAND
E–mail: bontchev@complex.is

Abstract: Computer viruses written in the macro programming languages of popular office applications such as Microsoft Word have become extremely widespread. Unlike MS–DOS viruses, which are single entities, macro viruses often consist of entire sets of several independent macros. This poses some interesting theoretical problems for virus–specific anti–virus software that attempts to identify exactly the viruses it detects. Two viral sets of macros can have common subsets, or one of the sets can be a subset of the other. The paper deals with the problems caused by this, some of which are extremely difficult, if not impossible, to solve. Emphasis is put on how virus writers could exploit these difficulties and on how anti–virus products should be improved to resist such attacks and to avoid damaging the user’s documents by misidentifying the virus in them and attempting to remove the wrong virus variant.

1. The Need for Exact Virus Identification

3. Easy Macro Virus Identification Problems

3.1. Devolving Viruses or the Rapi Virus Problem

3.2. Missing Macros or the Dzt Virus Problem

3.3. Variable Macro Sets or the CAP Virus Problem

3.4. Mass–Replicators or the Cebu Virus Problem

4. Difficult Macro Virus Identification Problems

4.1. Richard’s Problem

4.2. Igor’s Problem

4.3. The Importance of Identifying the Macro Names

5. VBA5 Identification Problems

5.1. Empty Lines

5.2. White Space

5.3. Ambiguous Upconversion

5.4. Letter Case in the Identifiers

5.5. Other VBA5 Identification Problems

6. Artificially Created Macro Virus Identification Problems

6.1. Insertion of Do–Nothing Lines

6.2. Variable and String Modification

6.3. Line Swapping

6.4. Commenting and Uncommenting Lines

6.5. Encryption

6.6. Parasitic Infection

1. The Need for Exact Virus Identification

Before we begin tackling the macro virus identification problems, it is worthwhile mentioning why exact virus identification in general and exact macro virus identification in particular are important. After all, historically, most scanners have always worked by picking some small part of the virus and using it as a scan string to detect all other instances of that virus. However, such an approach has several drawbacks.

First of all, it carries the very real danger of misidentification, i.e., confusing one virus (e.g., a destructive one) with another (which, for instance, is not intentionally destructive). In the past year we saw one anti–virus producer make a fool of itself in public by publishing a press release which warned about the imminent activation of the destructive virus Tedious which was, according to the press release, widespread, and which, of course, urged everybody to get that producer's anti–virus program. That press release initially caused significant puzzlement among competent anti–virus researchers. Even leaving aside the ethical problems of using scare tactics to convince the public to buy one's product, it was relatively well known that Tedious is not widespread at all and, most importantly, does not have an activation date or any payload whatsoever. That is, it is not intentionally destructive. As it turned out, the scanner of the anti–virus producer in question used a scan string which was unable to distinguish between Tedious and Bandung. The latter virus is both widespread and destructive, which led to the confusion in the press release and, undoubtedly, to negative publicity for the people who published it. That incident was relatively benign for the users, but one could easily imagine the opposite mistake caused by misidentification: a destructive virus could be reported by the anti–virus product as a non–destructive one, thus failing to warn the user.

Second, precise identification of the virus found is particularly important when virus removal (i.e., disinfection) is involved. Here, misidentification could lead to an attempt to remove the wrong virus variant, with fatal consequences for the infected object, which could be damaged beyond repair. This is already important enough in the world of DOS viruses, but it is even more important in the world of macro viruses. It could be argued that the proper way of removing DOS viruses is to destroy the infected objects and replace them with virus–free backup copies. In such cases it doesn't matter much whether the virus in them has been identified exactly or not, because the infected object is destroyed anyway. Macro viruses, however, usually reside in documents, which are bound to change often and for which virus–free backup copies are usually not available. Therefore, disinfection of macro viruses by removing them from the infected documents is a must, and it is of the utmost importance that it is accomplished correctly, without damaging the document or any user macros present in it. This goal simply cannot be achieved reliably enough without the means of exact virus identification.
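The danger to user macros can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: a document is modelled as a dictionary mapping macro names to macro bodies, the variant and its macro bodies are made up, and CRC32 stands in for whatever checksum a real product would use. It compares a disinfector that removes macros by name alone with one that removes a macro only when its checksum also matches.

```python
import zlib
from typing import Dict

Macros = Dict[str, bytes]  # hypothetical model: macro name -> macro body

# Made-up variant used only for illustration; not a real virus.
VARIANT_B: Macros = {
    "AutoOpen": b"copy macros to global template",
    "FileSaveAs": b"copy macros to document",
    "ToolsMacro": b"hide the macro list",
}
# Signature of the variant: macro name -> checksum of the macro body.
SIGNATURE_B = {name: zlib.crc32(body) for name, body in VARIANT_B.items()}


def remove_by_name(document: Macros, signature: Dict[str, int]) -> Macros:
    """Sloppy disinfection: delete every macro whose name is in the signature."""
    return {n: b for n, b in document.items() if n not in signature}


def remove_exact(document: Macros, signature: Dict[str, int]) -> Macros:
    """Careful disinfection: delete a macro only if name and checksum both match."""
    return {n: b for n, b in document.items()
            if signature.get(n) != zlib.crc32(b)}


# A document infected by a related variant that has no ToolsMacro of its own;
# the ToolsMacro present here belongs to the user.
document: Macros = {
    "AutoOpen": b"copy macros to global template",
    "FileSaveAs": b"copy macros to document",
    "ToolsMacro": b"user's own menu customisation",
}

print(sorted(remove_by_name(document, SIGNATURE_B)))  # the user's macro is lost
print(sorted(remove_exact(document, SIGNATURE_B)))    # the user's macro survives
```

In this toy model the name-based removal destroys the user's macro because the virus was taken for the wrong variant, while the checksum-based removal leaves it alone.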

Third, exact virus identification is often necessary for the purposes of technical support. The users often ask us "what does this or that virus do?"—because the virus in question has been found on their machine and they want to know what it could have done to their data. Answering this question correctly is often impossible if the scanner which has found the virus is unable to identify the virus exactly. Often the difference between a virus which is extremely destructive (e.g., a data diddling virus like WM/Wazzu.A) and a variant of the same virus which does nothing is only a single byte—or even a single bit. Such a level of identification is practically impossible to achieve with scan strings alone. The only way to achieve it is to compute some kind of checksum of every single bit of the non–modifiable parts of the virus body.
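As a rough illustration of the difference, the following Python sketch compares a scan string with a checksum of the whole body. The two "variants" are invented byte strings that differ in a single byte, the names Example.A and Example.B are hypothetical, and CRC32 is only one possible choice of checksum. For simplicity the sketch assumes that the entire body is constant; a real scanner would first have to skip the modifiable parts.

```python
import zlib
from typing import Optional

# Invented macro bodies that differ in exactly one byte ("2" versus "0").
VARIANT_A = b"Sub MAIN\nIf Rnd() < 0.2 Then SwapWords\nEnd Sub\n"
VARIANT_B = b"Sub MAIN\nIf Rnd() < 0.0 Then SwapWords\nEnd Sub\n"

# A short scan string picked from a part that both variants share.
SCAN_STRING = b"SwapWords"

# Exact identification: checksum of every byte of the (constant) body.
KNOWN_CHECKSUMS = {
    zlib.crc32(VARIANT_A): "Example.A",
    zlib.crc32(VARIANT_B): "Example.B",
}


def detect_by_scan_string(body: bytes) -> bool:
    """Report 'infected' if the scan string occurs anywhere in the body."""
    return SCAN_STRING in body


def identify_exactly(body: bytes) -> Optional[str]:
    """Return the variant name only if the whole body matches a known checksum."""
    return KNOWN_CHECKSUMS.get(zlib.crc32(body))


for body in (VARIANT_A, VARIANT_B):
    # The scan string fires on both; only the checksum tells them apart.
    print(detect_by_scan_string(body), identify_exactly(body))
```

The scan string reports both bodies as the same virus, whereas the checksum assigns each of them its own variant name and returns nothing for a body it has never seen.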

Fourth, exact virus identification is necessary for properly reporting and tracking the spread of computer viruses. One authoritative source of such information is the so–called WildList, maintained by the anti–virus researcher Joe Wells. Many testers use it to decide which viruses to include in their in–the–wild test sets. Recently, the fact that this list does not identify exactly some of the viruses listed on it caused our scanner to score unfavorably in a comparative review. The list had the Plagiarist virus listed on it. As it turns out, there are several different viruses, all of them members of the Plagiarist family. Our scanner could detect one of them: the one which is really in–the–wild and which was originally reported to Joe Wells. It could not, however, detect another of the variants, one which is not in–the–wild. Since the WildList mentions just "Plagiarist" and does not identify the particular variant, one tester used that other variant, which our scanner could not detect, and wrote in the review that our product does not have 100% detection of the viruses known to be in–the–wild. One could easily imagine a similar mishap involving a macro virus.

Fifth, exact virus identification is a very powerful protection against false positives. It is well–known that scanners which use only simple scan strings to detect viruses often cause false positives, i.e., report a virus in some innocent file which just happens to contain the same sequence of bytes picked by the anti–virus producer as a means of detecting the virus. Exact virus identification, by contrast, can never lead to such mishaps: if it identifies a virus exactly in a file, the virus is there, no doubt about it, since every single bit of it has been identified and found to be present. And indeed, the products which use exact virus identification cause false positives significantly less often than those which rely on scan strings (almost never, unless they also try to detect new variants of the virus, because then exact identification cannot be used).

Sixth, exact virus identification is essentially the only way to handle VBA macro viruses (i.e., macro viruses written in the programming language of Excel and the Office 97 suite). This is because, due to their design, VBA programs contain lots of variable areas (which contain pointers to a common pool of identifiers—common to all VBA modules in the document). As a result of this the average length of the possible scan strings is only two bytes—clearly unsuitable for any practical use. The problem can be partially circumvented by using very long wildcard scan strings—i.e., scan strings which contain "don't care" bytes in the positions of the variable pointers to the common pool of identifiers. Unfortunately, this is not a good solution either, because several different pieces of VBA code can compile to exactly the same image—with the differences becoming apparent only after the pointers to the identifiers are resolved. Clearly, this will increase the danger of false positives even further.
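The weakness of wildcard scan strings can be sketched in Python with a deliberately simplified model. The token layout used here (one opcode byte followed by a two-byte little-endian index into the identifier pool) is an assumption invented for this example and is not the real VBA storage format; the opcodes and identifier names are likewise made up.

```python
from typing import List, Optional, Sequence, Tuple

Pattern = Sequence[Optional[int]]  # None marks a "don't care" byte


def wildcard_match(image: bytes, pattern: Pattern) -> bool:
    """Return True if the pattern matches the image at any offset."""
    span = len(pattern)
    for offset in range(len(image) - span + 1):
        if all(p is None or image[offset + i] == p
               for i, p in enumerate(pattern)):
            return True
    return False


def resolve(image: bytes, pool: List[str]) -> List[Tuple[int, str]]:
    """Toy format: opcode byte + 2-byte little-endian index into the pool."""
    tokens = []
    for i in range(0, len(image), 3):
        index = int.from_bytes(image[i + 1:i + 3], "little")
        tokens.append((image[i], pool[index]))
    return tokens


# The same compiled image found in two documents with different identifier pools.
IMAGE = bytes([0x41, 0x00, 0x00, 0x42, 0x01, 0x00])
VIRAL_POOL = ["MacroCopy", "NormalTemplate"]   # hypothetical viral code
BENIGN_POOL = ["MsgBox", "Greeting"]           # hypothetical user code

# Wildcard scan string: fixed opcodes, "don't care" bytes over the pointers.
PATTERN: Pattern = [0x41, None, None, 0x42, None, None]

print(wildcard_match(IMAGE, PATTERN))   # matches regardless of the pool
print(resolve(IMAGE, VIRAL_POOL))       # what the viral module really says
print(resolve(IMAGE, BENIGN_POOL))      # what the innocent module really says
```

Under these assumptions the wildcard pattern cannot tell the two modules apart, because it never looks at what the pointers refer to. Only after the identifiers are resolved does the difference between them become visible.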

For all the reasons mentioned above, we maintain that it is very important for virus–specific anti–virus products (e.g., scanners and disinfectors) to attempt to identify exactly the viruses they claim to detect. Exact identification is even more important in the case of macro viruses, because a significant change in the behavior of a macro virus can result from a minuscule change in just one of its macros. Consequently, our own macro virus scanner rigorously attempts to identify exactly every single macro virus it claims to detect, and we urge all other anti–virus producers to take the same approach when implementing their scanners. Fortunately, many of them have already understood the benefits of exact macro virus identification, and we are seeing many anti–virus products begin to use it. In many cases, even if the scanner does not apply exact identification to the DOS viruses it detects, it at least applies it to the macro viruses.

Document:	Global template:
`AutoOpen`	`RpAO`
`RpAE`	`RpAE`
`RpFO`	`RpFO`
`RpFS`	`RpFS`
`RpFSA`	`RpFSA`
`RpTC`	`ToolsCustomize`
`RpTM`	`ToolsMacro`
	`AutoExec`
	`FileOpen`
	`FileSave`
	`FileSaveAs`
