Dualdflipflop at 06:01, 29 January 2007

2007-01-29T06:01:54Z

New page

[[pyFileSlice]] is a simple utility that chops out the section of a file lying between common starting and ending tags. It grew out of a small piece of research prompted by the need to pull the page referrers section out of an [http://awstats.sf.net AWStats] data file for further analysis. Three methods were tried: one that runs a regex pattern check over each element of a list, one that uses startswith(), and one that uses startswith() without reading the whole file at once. The last of these, presented here, was the fastest and used the least memory. [http://www.cs.ucr.edu/~nsoracco/py/fileSlice.html Source Code]
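The slower variants are not shown on this page. For contrast, here is a rough guess at what the first one (a regex check over every element of a list already held in memory) might have looked like. The function name slice_with_regex, its parameters, and the sample lines are all made up for illustration and are not taken from the original source.

```python
import re


def slice_with_regex(lines, start_tag="BEGIN_PAGEREFS", end_tag="END_PAGEREFS"):
    """Return the lines strictly between the tag lines of an in-memory list.

    Hypothetical reconstruction: the whole file is assumed to have been
    read into `lines` already (e.g. with readlines()), which is what
    makes this variant memory-hungry on large data files.
    """
    start_re = re.compile("^" + re.escape(start_tag))
    end_re = re.compile("^" + re.escape(end_tag))
    inside = False
    result = []
    for line in lines:
        if not inside:
            # Still looking for the opening tag.
            if start_re.match(line):
                inside = True
        elif end_re.match(line):
            # Closing tag reached; stop without including it.
            break
        else:
            result.append(line.rstrip("\n"))
    return result


# Invented sample data, shaped loosely like a tagged section.
sample = [
    "HEADER\n",
    "BEGIN_PAGEREFS 2\n",
    "http://a.example/ 3\n",
    "http://b.example/ 1\n",
    "END_PAGEREFS\n",
    "TRAILER\n",
]
print(slice_with_regex(sample))
```

On a real file this would be called as slice_with_regex(open(path).readlines()), so the entire file sits in memory before any line is examined. The script on this page avoids that by reading one line at a time.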

#!/usr/bin/env python
#
# Simple tool to spit out referrer information from an awstats database
# for later searching and analysis. A good example of file slicing!

__author__ = "Nick Guy & Brian Guy"
__license__ = "GPL"

import sys

# Python has no argc, so derive it from sys.argv.
argc = len(sys.argv)

if argc > 2:
    print(sys.argv[0] + " [filename]")
    print("[filename] is optional, leave out to use stdin")
    sys.exit(1)

# Variables are initialized here to keep them in file scope.
awsdata = []
infile = None

if argc == 2:
    try:
        infile = open(sys.argv[1], 'r')
    except IOError:
        print("Can't open " + sys.argv[1] + " for reading.")
        sys.exit(2)

if argc == 1:
    infile = sys.stdin

# Fastest method. The strings passed to startswith() are the start and
# end block tokens; the delimiter lines themselves are NOT included in
# the final output.
line = infile.readline()
while line and not line.startswith("BEGIN_PAGEREFS"):
    line = infile.readline()

# Priming read followed by a while loop: a stand-in for do/while.
# readline() returns '' at end of file, which also ends the loop.
line = infile.readline()
while line and not line.startswith("END_PAGEREFS"):
    awsdata.append(line.rstrip("\n"))  # remove trailing \n, similar to chomp in Perl
    line = infile.readline()
infile.close()

# Send data to stdout.
for line in awsdata:
    print(line)
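
The same line-at-a-time idea can be wrapped in a reusable generator so the tags are not hard-coded. This is only a sketch: the name slice_between, its parameters, and the demo text are assumptions added for illustration and do not come from the original script.

```python
import io


def slice_between(stream, start_tag, end_tag):
    """Yield the lines strictly between the start and end tag lines.

    `stream` is any iterable of lines (an open file, sys.stdin, ...).
    Lines are consumed one at a time, so the file is never held in
    memory as a whole. A missing tag simply ends the iteration.
    """
    lines = iter(stream)
    # Skip everything up to and including the start tag line.
    for line in lines:
        if line.startswith(start_tag):
            break
    # Emit lines until the end tag line (or end of input) is reached.
    for line in lines:
        if line.startswith(end_tag):
            return
        yield line.rstrip("\n")


# Invented demo input standing in for a data file.
demo = io.StringIO(
    "HEADER\n"
    "BEGIN_PAGEREFS 2\n"
    "http://a.example/ 3\n"
    "http://b.example/ 1\n"
    "END_PAGEREFS\n"
    "TRAILER\n"
)
for row in slice_between(demo, "BEGIN_PAGEREFS", "END_PAGEREFS"):
    print(row)
```

Because both loops share one iterator, the second loop picks up exactly where the first one stopped, and nothing after the end tag is read.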

Python File Slicing - Revision history
