Website Word Count


Get the word count for a list of webpages to estimate cost for localization of a website.

A fellow freelance translator asked for the easiest way to get the word count for a list of pages on a website, in order to estimate a translation project. This script now gives not only an estimated word count but also an estimated price.

Currently, this is what I have:

#!/bin/bash
 
# get word counts and generate estimated price for localization of a website
# by tony baldwin / baldwinsoftware.com
# with help from the linuxfortranslators group on yahoo!
# released according to the terms of the GNU General Public License, v. 3 or later
 
# collecting necessary data:
read -p "Please enter the per word rate (only numbers, like 0.12): " rate
read -p "Enter currency (letters only, EUR, USD, etc.): " cur
read -p "Enter url (do NOT include http://www, only the domain, like somedomain.com) : " url
 
# if we've run this script in this dir, old files will mess us up
for i in pagelist.txt wordcount.txt plist-wcount.txt; do
	if [[ -f $i ]]; then
		echo "removing old $i"
		rm "$i"
	fi
done
 
 
# downloading webpages from the indicated domain with wget,
# rejecting a long list of irrelevant files, like music, images, tarballs, zips, etc.
echo "getting pages ...  this could take a bit ... "
wget -m -q -E -R jpg,tar,gz,png,gif,mpg,mp3,iso,wav,ogg,ogv,css,zip,djvu,js,rar,mov,3gp,tiff,mng "$url"
 
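# generating a list of the downloaded html pages
# (matching on the file name, so a directory called "html" is not picked up by accident)
find . -type f -name '*.html' > pagelist.txt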
 
echo "okay, counting words...yeah...we're counting words..."
# reading the list line by line, so file names containing spaces stay intact
while IFS= read -r file; do
	lynx -dump -nolist "$file" < /dev/null | wc -w >> wordcount.txt
done < pagelist.txt
 
# pasting together list of pages and wordcounts
paste pagelist.txt wordcount.txt > plist-wcount.txt
 
echo "adding up totals...almost there..."
total=0
for t in $(cat wordcount.txt); do
	total=$((total + t))
done
 
echo "calculating price ... "
price=$(echo "$total * $rate" | bc)
 
echo -e "\n------------------------\nTOTAL WORD COUNT = $total" >> plist-wcount.txt
echo -e "at $rate, the estimated price is $cur $price
------------------------------" >> plist-wcount.txt
 
echo "Okay, that should just about do it!"
echo -----------------------------------------
 
# pretty stuff up a bit more
# strip only the leading ./ that find puts in front of each path
sed 's|^\./||' plist-wcount.txt > "$url.estimate.txt"
rm plist-wcount.txt
cat "$url.estimate.txt"
echo "--------------------------------
this information is saved in $url.estimate.txt"
 
exit
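
The rate typed at the prompt goes to bc unchecked, so an entry such as 0,12 or $0.12 would make the price calculation fail. Here is a minimal sketch of a check that could run right after the prompt. The function name is_valid_rate is made up for illustration and is not part of the script above:

```bash
# Hypothetical helper (not in the original script): succeeds only when the
# argument consists of digits with at most one decimal point, e.g. 0.12 or 12.
is_valid_rate() {
	case $1 in
		'' | . | *[!0-9.]* | *.*.*) return 1 ;;   # empty, lone dot, stray characters, two dots
		*) return 0 ;;
	esac
}

# Possible use after the prompt (commented out so the sketch has no side effects):
# until is_valid_rate "$rate"; do
#	read -p "Please enter the per word rate (only numbers, like 0.12): " rate
# done
```

A plain case statement is used rather than a regular expression so that the check behaves the same in any POSIX shell.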

I ran the site-wide script on tonybaldwin.net, and it gave me a text file in the end, $url.estimate.txt (here tonybaldwin.net.estimate.txt), that looks like this:

tonybaldwin.net/log/archives/environment/index.html	38
tonybaldwin.net/log/archives/cuisine/index.html	38
tonybaldwin.net/log/archives/music/index.html	52
tonybaldwin.net/log/archives/philosophy/index.html	38
tonybaldwin.net/log/archives/nanoblogger-help/index.html	52
tonybaldwin.net/log/archives/2011/09/11/911/index.html	322
tonybaldwin.net/log/archives/2011/09/index.html	774
tonybaldwin.net/log/archives/2011/09/01/mit_intro_to_cs_and_programming_assignment_1/index.html	494
tonybaldwin.net/log/archives/2011/08/26/come_on_irene/index.html	382
tonybaldwin.net/log/archives/2011/08/26/welcome_to_nanoblogger_3_4_2/index.html	289
tonybaldwin.net/log/archives/2011/08/26/here_we_roll_again/index.html	618
tonybaldwin.net/log/archives/2011/08/27/couldnt_stand_the_weather/index.html	93
tonybaldwin.net/log/archives/2011/08/index.html	1205
tonybaldwin.net/log/archives/2011/index.html	133
tonybaldwin.net/log/archives/technology/index.html	56
tonybaldwin.net/log/archives/politic/index.html	38
tonybaldwin.net/log/archives/religion/index.html	38
tonybaldwin.net/log/archives/art/index.html	38
tonybaldwin.net/log/archives/index.html	85
tonybaldwin.net/log/archives/personal/index.html	65
tonybaldwin.net/log/archives/health/index.html	38
tonybaldwin.net/log/articles/about/index.html	671
tonybaldwin.net/log/index.html	2027
tonybaldwin.net/log.1.html	2027
tonybaldwin.net/index.html	96
tonybaldwin.net/social.html	82

------------------------
TOTAL WORD COUNT = 9789
at 0.12, the estimated price is USD 1174.68
------------------------------
this information is saved in tonybaldwin.net.estimate.txt
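
Since each page line in the estimate file is a path, a tab, and a count, the total and the price can be re-derived from the file later, for instance with a different rate. What follows is a rough sketch that uses awk instead of the shell loop and bc. The function name reprice is hypothetical, and it assumes the count is the second tab-separated field:

```bash
# Hypothetical helper (not part of the original script): re-derive the total
# word count and the price from "path<TAB>count" lines.
#   $1 = estimate file, $2 = per word rate
reprice() {
	awk -F'\t' -v rate="$2" '
		$2 ~ /^[0-9]+$/ { total += $2 }   # only lines whose second field is a number
		END { printf "%d words, price %.2f\n", total, total * rate }
	' "$1"
}

# Possible use (commented out here):
# reprice tonybaldwin.net.estimate.txt 0.15
```

The summary lines at the bottom of the file contain no tab, so they are skipped and the total is not counted twice.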

Single page estimate

Now, if you only need the word count for a single page, this can be simplified considerably:

#!/bin/bash
 
# add up wordcounts for one webpage
 
# take the url from the first argument, or ask for it if none was given
if [[ -z $1 ]]; then
	read -p "Please enter a webpage url: " url
else
	url=$1
fi
read -p "How much do you charge per word? " rate
 
count=$(lynx -dump -nolist "$url" | wc -w)
 
price=$(echo "$count * $rate" | bc)
 
echo -e "$url has $count words. At $rate, the price would be US\$$price."
 
exit

The great part of this last one is that it does not depend on the file extension of the page. The site-wide script above only picks up html files (although that can easily be amended). Also, since there are no files to list on the server, no ftp account is required for this one.
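
As for amending the site-wide script to pick up more than html files, one possible approach is to widen the find step. This is a sketch under assumptions: the function name list_pages and the extra extensions (htm, txt) are only examples, and lynx would still have to be able to render whatever ends up in the list:

```bash
# Hypothetical replacement for the find step that builds pagelist.txt:
# list files under directory $1 whose names end in one of several extensions.
list_pages() {
	find "$1" -type f \( -name '*.html' -o -name '*.htm' -o -name '*.txt' \) | sort
}

# Possible use in the script (commented out here). This assumes wget mirrored
# the site into a directory named after the domain, so the script's own
# .txt work files in the current directory are not listed.
# list_pages "$url" > pagelist.txt
```

Note that wget's -E option already appends .html to pages served as text/html, so extra extensions mostly matter for other kinds of text files.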

Thanks to the Linux4Translators group on Yahoo! for some assistance.

Enjoy!

tonybaldwin 2012.01.03.23.10

