Saturday, July 25, 2009

File Weeding: Keep your collection free of duplicate files

Hi, malware collectors of the world!

Today I will talk about file weeders.

A file weeder is a tool that looks for identical files and, either by default or on demand, deletes them. Some weeders let the user choose which duplicates to delete and which to keep.
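To make the idea concrete, here is a minimal Python sketch of the detection half of a weeder. It is not the code of any weeder mentioned in this post: the function names file_digest and find_duplicates are my own, and SHA-256 is simply an assumed choice of hash. It only reports groups of identical files and deletes nothing.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Map each digest to the list of files under root that share it."""
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            groups[file_digest(path)].append(path)
    # Keep only digests seen more than once: those are the duplicates.
    return {d: paths for d, paths in groups.items() if len(paths) > 1}
```

Deciding which file of each group to keep is left to the caller, which is the choice some weeders hand to the user.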

For malware collecting, the most important thing to consider before choosing a file weeder is which hash algorithm it uses.

In the early years of collecting, most collectors used ThunderByte Weeder, aka TbWeeder. This weeder was written by the author of ThunderByte Antivirus, Frans Veldman, and it used a CRC16 hash.

Some years later the first CRC16 collisions (different files with the same hash) appeared in virus collections, so collectors switched to weeders using CRC32.

Around 2000 some collectors started to use MD5, while others stayed with CRC32.

After 2000 the story repeated itself and the first CRC32 collisions appeared in virus collections. As a workaround, two weeders (VirWeed and FWeeder) were created that combined the CRC32 hash with a file size check to identify duplicates.
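As an illustration of that workaround, a combined key could be computed like this in Python. This is only a sketch of the general technique, not the actual code of VirWeed or FWeeder; the function name crc32_size_key is made up for the example.

```python
import os
import zlib


def crc32_size_key(path, chunk_size=1 << 20):
    """Return (crc32, size) for a file; two files match only when both agree."""
    crc = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            # zlib.crc32 accepts the running value, so large files
            # can be processed without loading them fully in memory.
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF, os.path.getsize(path)
```

Two files are treated as duplicates only when both parts of the key are equal, which is stricter than CRC32 alone but still far from a guarantee.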

At first I thought it was impossible for two different files to have the same CRC32 and the same file size, but this proved to be wrong. That was the end of CRC32 among virus collectors.
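A rough way to see why accidental collisions eventually show up is the birthday approximation. The Python sketch below assumes hash values are spread uniformly, which is an idealization, and it says nothing about deliberately crafted collisions. The function name collision_probability is my own.

```python
import math


def collision_probability(n_files, hash_bits):
    """Approximate chance that at least two of n_files share a hash value."""
    space = 2 ** hash_bits
    pairs = n_files * (n_files - 1) / 2
    # Birthday approximation: p ~ 1 - exp(-pairs / space)
    return 1.0 - math.exp(-pairs / space)


# Compare a 32-bit hash for collections of different sizes.
for n in (1_000, 10_000, 100_000):
    print(n, round(collision_probability(n, 32), 4))
```

Under this assumption a collection of 100,000 files is already more likely than not to contain at least one CRC32 collision. Adding the file size helps less than one might hope, because file sizes are not spread evenly: many samples cluster around the same sizes.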

Nowadays traders use MD5 or SHA-1 weeders. Some months ago I decided to change the hash used by my file weeder. Initially I considered MD5, but I was told that generating MD5 collisions was simple, so I decided to go with SHA-256.

I'm not aware of any MD5 collisions in malware collections, so I'd say that for the moment a weeder that uses MD5 is safe.

If you decide to use an MD5 weeder, I recommend FAST! File Weeder (FWeeder) by my friend Bumblebee. It is now open source.

You can get the source code here.

You can get the binary here.

I will give a brief description of how to use FWeeder.

FWeeder is a command line tool. Run "fweeder -h" to get the help screen.

To create a database of your collection, run fweeder -c followed by the directory. Example: fweeder -c c:\virus

To add new entries to the database (new files you have added to your collection), run fweeder -a followed by the directory. Example: fweeder -a c:\newvirus

To look for duplicate files, run fweeder -v followed by the directory. Example: fweeder -v c:\test

By default FWeeder will not delete duplicate files. To delete them you must add the "-k" switch. Examples:

fweeder -c c:\virus -k
fweeder -a c:\newvirus -k
fweeder -v c:\test -k
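The workflow behind these commands can be pictured with a small Python sketch. It is not FWeeder's code or database format: the dict standing in for the database, the scan function and its record and delete parameters are all hypothetical, and SHA-256 is used only for illustration.

```python
import hashlib
import os


def sha256_of(path):
    """Return the SHA-256 hex digest of a file."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def scan(root, db, record=True, delete=False):
    """Walk root and return the files whose digest is already in db.

    db maps digest -> path of the first file seen with that digest.
    record=True stores unseen digests (create/add); record=False only checks.
    delete=True removes each duplicate found (off by default).
    """
    duplicates = []
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            path = os.path.abspath(os.path.join(dirpath, name))
            digest = sha256_of(path)
            known = db.get(digest)
            if known is None:
                if record:
                    db[digest] = path
            elif known != path:
                # Skip the very file the entry was recorded from, so
                # re-scanning the same collection never deletes the only copy.
                duplicates.append(path)
                if delete:
                    os.remove(path)
    return duplicates
```

In this sketch, scan(collection, db) on an empty dict loosely corresponds to creating the database, scan(new_files, db) to adding entries, scan(test_dir, db, record=False) to looking for duplicates, and delete=True to an explicit delete switch. Saving the dict to disk between runs is left out.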

With that you have the basics needed to weed your collection.

Old weeders were dangerous in inexperienced hands. Some collectors deleted their entire collections because they created a database and then looked for duplicates in that same collection!

FWeeder has a "newbie" protection to avoid that situation, but it's always a good idea to make a backup of your collection anyway.

I will write a post devoted to backups and how important they are, but before I do... make a backup.
