Website Word Count

Tony's Wiki | bash

Get the word count for a list of webpages to estimate cost for localization of a website.

A fellow freelance translator asked what the easiest way was to get the word count for a list of pages on a website (for estimation purposes for a translation project). This script now not only gives an estimated word count, but will generate an estimated price, as well.

Currently, this is what I have:

#!/bin/bash
# get word counts and generate an estimated price for localization of a website
# by tony baldwin /
# with help from the linuxfortranslators group on yahoo!
# released according to the terms of the GNU General Public License, v. 3 or later
# collecting necessary data:
read -p "Please enter the per word rate (only numbers, like 0.12): " rate
read -p "Enter currency (letters only, EU, USD, etc.): " cur
read -p "Enter url (do NOT include http://www, only the domain, like example.com): " url
# if we've run this script in this dir before, old files will mess us up
for i in pagelist.txt wordcount.txt plist-wcount.txt; do
	if [[ -f $i ]]; then
		echo "removing old $i"
		rm $i
	fi
done
# downloading webpages from the indicated domain with wget,
# rejecting a long list of irrelevant files, like music, images, tarballs, zips, etc.
echo "getting pages ...  this could take a bit ... "
wget -m -q -E -R jpg,tar,gz,png,gif,mpg,mp3,iso,wav,ogg,ogv,css,zip,djvu,js,rar,mov,3gp,tiff,mng $url
# generating a list 
find . -type f | grep html > pagelist.txt
echo "okay, counting words...yeah...we're counting words..."
for file in $(cat pagelist.txt); do
	lynx -dump -nolist $file | wc -w >> wordcount.txt
done
# pasting together list of pages and wordcounts
paste pagelist.txt wordcount.txt > plist-wcount.txt
echo "adding up totals...almost there..."
total=0
for t in $(cat wordcount.txt); do
	total=$((total + t))
done
echo "calculating price ... "
price=`echo "$total * $rate" | bc`
echo -e "\n------------------------\nTOTAL WORD COUNT = $total" >> plist-wcount.txt
echo -e "at $rate, the estimated price is $cur $price
------------------------------" >> plist-wcount.txt
echo "Okay, that should just about do it!"
echo -----------------------------------------
# pretty stuff up a bit more
sed 's/\.\///g' plist-wcount.txt > $url.estimate.txt
rm plist-wcount.txt
cat $url.estimate.txt
echo "--------------------------------
this information is saved in $url.estimate.txt"
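Incidentally, the summing loop near the end could be collapsed into a single awk pass over wordcount.txt; a minimal equivalent sketch:

```shell
# total the per-page counts in one pass, same result as the shell loop above
awk '{ sum += $1 } END { print sum }' wordcount.txt
```

Either way works; the loop is just easier to read if you don't know awk.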

I ran this, and it gave me a text file in the end, $url.estimate.txt, that looks like this (the page paths are trimmed here for brevity; in the real file each count sits next to its page):

	38  38  52  38  52  322  774  494  382  289  618  93  1205
	133  56  38  38  38  85  65  38  671  2027  2027  96  82

	------------------------
	TOTAL WORD COUNT = 9789
	at 0.12, the estimated price is USD 1174.68
	------------------------------
	this information is saved in $url.estimate.txt

Single page estimate

Now, if you only need the word count for one page, we can really simplify this, of course.

#!/bin/bash
# add up wordcounts for one webpage
# take the url from the first argument, or prompt for it
if [[ -z $1 ]]; then
	read -p "Please enter a webpage url: " url
else
	url=$1
fi
read -p "How much do you charge per word? " rate
count=`lynx -dump -nolist $url | wc -w`
price=`echo "$count * $rate" | bc`
echo -e "$url has $count words. At $rate, the price would be US\$$price."

The great part of this last one is that it is independent of the file extension in question. My script above grabbed HTML files only (although that can easily be amended). Also, since there is no need to list files, an ftp account is not required for this one.
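For instance, one way to amend the main script to pick up pages with other extensions is to swap the plain grep that builds pagelist.txt for an extended pattern (the extension list here is just an example):

```shell
# match .html/.htm plus a few other common page extensions
find . -type f | grep -E '\.(html?|php|asp|shtml)$' > pagelist.txt
```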

Thanks to the Linux4Translators group on Yahoo! for some assistance.


tonybaldwin 2012.