Sunday, July 11, 2010

Find Dead Files


Image: Simon Howden / FreeDigitalPhotos.net



I've been doing a lot of web development recently for a friend's website. One of the problems I always seem to have is cleaning up files that are no longer used as the site evolves. This is particularly true with images. I solved this problem by writing a simple Python script to do the following:

1) Produce a list of files in the current directory and all subdirectories (minus a few file types that are ignored)
2) Search all the files in the current directory for references to those files
3) Output a list of files that do not appear to be referenced
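
For readers who want the gist before the full listing, the three steps above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the script itself: it assumes Python 3, uses os.walk in place of the script's recursive helper, lists files only (not directories), and the names find_unreferenced, SKIP_NAMES and BINARY_EXTS are made up for illustration.

```python
import os

# Assumed values for illustration; the real script keeps its own constants.
SKIP_NAMES = (".svn", ".py", ".bat")     # names left out of the scan entirely
BINARY_EXTS = (".jpg", ".png", ".gif")   # files whose contents are not searched


def find_unreferenced(root="."):
    """Return relative paths under root that no top-level file mentions."""
    # Step 1: list every file under root, relative to root, with "/" separators.
    candidates = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune skipped directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if not d.endswith(SKIP_NAMES)]
        for name in filenames:
            if name.endswith(SKIP_NAMES):
                continue
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            candidates.append(rel.replace(os.sep, "/"))

    # Step 2: read the searchable files that sit directly in root.
    unreferenced = set(candidates)
    for name in os.listdir(root):
        path = os.path.join(root, name)
        if not os.path.isfile(path) or name.endswith(SKIP_NAMES + BINARY_EXTS):
            continue
        with open(path, "r", errors="ignore") as handle:
            text = handle.read()
        # Plain substring test: comments and partial matches count as references.
        unreferenced = {c for c in unreferenced if c not in text}

    # Step 3: whatever is left was never mentioned.
    return sorted(unreferenced)


if __name__ == "__main__":
    for path in find_unreferenced("."):
        print(path)
```

As in the script, a page that nothing links to (the home page, for example) will show up in the result, so the list is a starting point for review rather than a delete list.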

Here's some sample output:
Creating list of files...

Searching files...

The following files are not referenced from files in this directory:

images/oldcontactus.jpg
images/header.jpg
images/photos.jpg
index2.html

Every week or so I run the script and clean out any lingering dead files. In the hope that it might be useful to others, I have uploaded it to the blog. There are a couple of things to keep in mind if you intend to use it:

1) The search is not intelligent, so it will count any text that contains the file name as a reference (including comments and the occasional lucky substring).
2) It only searches the content of files in the current directory, which assumes that your HTML is in the current directory and that images, .js files, etc. are in subdirectories.
3) If you need to modify the file types that are ignored, edit the .py file and change the constants at the top.
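
To make the "lucky substring" caveat concrete, here is a small sketch comparing the plain substring test with a stricter check. The stricter function is hypothetical and not part of the script; both function names are made up for this example, and the regular expression is only one possible way to tighten the match.

```python
import re


def is_mentioned_loosely(name, text):
    # Same idea as the script: any occurrence of the name counts as a reference.
    return name in text


def is_mentioned_strictly(name, text):
    # Hypothetical stricter check: the name may not be glued onto a longer
    # file name, so the character before it cannot be a letter, digit,
    # "_", "." or "-", and the character after it cannot be a letter,
    # digit, "_" or "-".
    pattern = r"(?<![\w.-])" + re.escape(name) + r"(?![\w-])"
    return re.search(pattern, text) is not None


page = '<a href="threshold.html">Thresholds</a>'

# "old.html" happens to be the tail end of "threshold.html".
print(is_mentioned_loosely("old.html", page))   # True
print(is_mentioned_strictly("old.html", page))  # False
```

Even the stricter check still matches file names inside comments, so it narrows the first caveat but does not remove it.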

Download the script here.

I've included a listing of the source code below. Feedback and constructive criticism are always welcome, since I'm looking for ways to improve my code. Enjoy.


#########################################################################
# find_dead_files (version 0.1) by Andrew Burke
#
# This utility builds a list of files in the current directory and all
# subdirectories. It then searches the files in the current directory
# for references to these files.
# If a file is identified that has no references to it, it
# is reported. The algorithm used is not an intelligent
# search, so comments and raw text WILL match.
#
# This script is intended to help identify files that are no longer used
# in a website.
#
# Usage: Place the .py file in the root of the directory structure you'd
# like to analyze and execute the script. To change the file types that are
# ignored altogether or not searched, please modify the "Constants" section
# below.
#
# IMPORTANT NOTE: Files that are referenced from external sources only
# WILL be reported by this tool as being unreferenced. Please use care
# before permanently deleting any files.
#
# Copyright (C) 2010 Andrew Burke
# This work is licensed under a Creative Commons GNU General Public License
# http://creativecommons.org/licenses/GPL/2.0/
#
# Revision history:
# v0.1 - First version (Andrew Burke)
# v0.2 - Corrections from PyChecker

import os

# Constants
NON_SEARCHABLE_FILES = ".jpg,.png,.gif" #Files that should not be searched for keywords
IGNORABLE_FILES = ".svn,.py,.bat" #Files that should not be referenced anyway
FOLDER = "." #Folder to start from (default is current directory)

def does_string_contain_keywords(in_str, cur_keywords):
    # Search the given string for any of the given comma-separated keywords

    for search_term in cur_keywords.split(","):
        if in_str.find(search_term) > -1:
            return True

    # If we get this far, none of the keywords were found
    return False

def is_ignored_file(search_filename):
    # Should we ignore this file altogether?
    return does_string_contain_keywords(search_filename, IGNORABLE_FILES)

def is_not_searchable_file(search_filename):
    # Is this a file whose contents we should skip when searching?
    return does_string_contain_keywords(search_filename, NON_SEARCHABLE_FILES)

def concat_with_slash(str1, str2):
    # Put two files/directories together; an empty part is simply dropped,
    # so two empty parts give back an empty string
    if str1 == "":
        return str2
    elif str2 == "":
        return str1
    else:
        return str1 + "/" + str2

def add_files_to_list(subdir, root_path, mylist, search_subdir):
    # Recurse the folder structure and build a list of files
    # mylist is modified as a pass-by-ref parameter

    cur_path = concat_with_slash(root_path, subdir)

    for cur_filename in os.listdir(cur_path):
        if is_ignored_file(cur_filename):
            continue

        if os.path.isdir(concat_with_slash(cur_path, cur_filename)) and search_subdir:
            new_subdir = concat_with_slash(subdir, cur_filename)
            add_files_to_list(new_subdir, root_path, mylist, True)

        mylist.append(concat_with_slash(subdir, cur_filename))

def search_file_for_keywords(cur_path, search_filename, cur_keywords):
    # Search the specified file for keywords and remove the keyword if found
    # cur_keywords is modified as a pass-by-ref parameter

    if is_not_searchable_file(search_filename):
        return

    full_path = concat_with_slash(cur_path, search_filename)
    if os.path.isdir(full_path):
        return

    try:
        # Open relative to cur_path so that FOLDER values other than "." work
        cur_file = open(full_path, "r")
    except IOError:
        print ("ERROR: Cannot open file: " + search_filename)
        return

    contents = cur_file.read()
    cur_file.close()

    # Iterate over a copy, because matches are removed from cur_keywords
    keyword_iterator = list(cur_keywords)
    for cur_keyword in keyword_iterator:
        if contents.find(cur_keyword) > -1:
            cur_keywords.remove(cur_keyword)

# Main script
filelist = []
keywords = []

print ("Creating list of files...")
add_files_to_list("", FOLDER, keywords, True)
add_files_to_list("", FOLDER, filelist, False)

print ("\nSearching files...")

# Search each file for the filenames
for filename in filelist:
    search_file_for_keywords(FOLDER, filename, keywords)

# Print out the list of files that are not referenced
print ("\nThe following files are not referenced from files in this directory:\n")
for keyword in keywords:
    print (keyword)
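
One detail of the listing worth illustrating: the ignore and non-searchable checks look for each keyword anywhere in the file name, not only at the end. The sketch below contrasts that with a suffix-only test. The suffix version is an assumed alternative, not something the script does, and the function names and the sample file name are invented for the example.

```python
# Same values as IGNORABLE_FILES in the listing, written as a tuple.
IGNORABLE_SUFFIXES = (".svn", ".py", ".bat")


def contains_any(name, csv_keywords):
    # Mirrors the listing: True if any keyword appears anywhere in the name.
    return any(term in name for term in csv_keywords.split(","))


def ends_with_any(name, suffixes):
    # Assumed alternative: only match when the name ends with a keyword.
    # str.endswith accepts a tuple of suffixes.
    return name.endswith(suffixes)


# A hypothetical page whose name merely contains ".py" in the middle.
print(contains_any("faq.python.html", ".svn,.py,.bat"))        # True
print(ends_with_any("faq.python.html", IGNORABLE_SUFFIXES))    # False
print(ends_with_any("find_dead_files.py", IGNORABLE_SUFFIXES)) # True
```

With the substring test, a page like the hypothetical one above would be skipped silently; whether that matters depends on how the files in a given site are named.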
