Bash “while read line” vs Awk Large File Processing

Recently I had to fudge some data so that it could be imported into a database, after an outage caused our PHP data loader to try to allocate a crazy amount of memory and die fantastically.

Being a fan of automating everything I can, I started out down the trail of “okay, let’s script this”. A few moments later I had a simple bash script looking somewhat like:

#!/bin/sh
filename=$1
# Read each line, grab the necessary fields and create the insert statements.
while read line; do
     field1=`echo ${line} | awk '{print $1}'`
     field2=`echo ${line} | awk '{print $7}'`
     echo "INSERT INTO testtable VALUES ('${field1}',UNIX_TIMESTAMP(${field2}));" >> data.in
done < ${filename}
# Assume all is good and just feed the file to mysql for processing.
mysql -u root testdatabase < data.in

A simplistic script, using my favourite awk statement for field delimitation (so much easier than relying on cut). Now I’m aware awk can do a lot more, however every time I pick up a book or decide to learn some, I end up being told “IT’S URGENT!” and having to drop everything to “just get it working”.
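As an aside, the reason I reach for awk over cut is that awk splits fields on any run of whitespace, while cut treats every single delimiter character as a field boundary. A quick illustration (the sample record below is made up, not from the real data):

```shell
#!/bin/sh
# A record with runs of spaces between fields, as real log output often has.
line="host01   1699999999   OK"

echo "${line}" | awk '{print $1}'  # awk: field 1 is host01
echo "${line}" | cut -d' ' -f1     # cut: field 1 is also host01, but...
echo "${line}" | cut -d' ' -f2     # cut: field 2 is EMPTY - each space starts a new field
```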

It’s a common problem, I feel, when you’re really just support staff trying to keep everything hunky-dory.

This time around, while work finished at 17:30, it was now 19:00 and my script was still running, having been set off at about 10am. A quick wc -l and some dodgy division told me that it still had about another 56 hours to run. I was processing 2,647,012 lines and wasn’t even above 500,000 lines yet.
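The dodgy division goes roughly like this. The progress figure below is an assumption (the post only says “not above 500,000”), so the printed estimate is illustrative rather than exact:

```shell
#!/bin/sh
# Back-of-envelope remaining-time estimate, integer maths only.
total=2647012        # from wc -l on the input file
processed=350000     # ASSUMED progress; the post only says "not above 500,000"
elapsed_hours=9      # roughly 10am to 19:00

rate=$((processed / elapsed_hours))          # lines per hour so far
echo "$(( (total - processed) / rate )) hours left"
```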

Although I had backgrounded the process, I didn’t want it to fail without me knowing. Anxious, and having already done 1.5 hours of unpaid overtime, I decided to see if there was a better way of doing the job.

Sure enough, AWK reared its head again. It’s a Turing-complete programming language specifically meant for text processing, so why not have a decent look at it now that everyone else in the office has gone home and you’re stuck here until it’s done?

So after a poke around the internet for a bit of guidance on AWK, I came out with a solution of:

#!/bin/sh
filename=$1
awk 'BEGIN{
     # Special characters represented by octal values to prevent any escaping issues.
     q="\047"  # single quotation mark
     lb="\050" # left bracket
     rb="\051" # right bracket
     c="\054"  # comma
     sc="\073" # semi-colon
}
{
     print "INSERT INTO testtable VALUES " lb q $1 q c "UNIX_TIMESTAMP(" $7 rb rb sc
}' ${filename} > data.in
mysql -u root testdatabase < data.in
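For reference, those octal escapes do map to the punctuation the comments claim. A quick sanity check (this one-liner is mine, not from the original script):

```shell
#!/bin/sh
# Print the five octal escapes used above, in order: quote, brackets, comma, semi-colon.
awk 'BEGIN{ printf "%s%s%s%s%s\n", "\047", "\050", "\051", "\054", "\073" }'
# prints: '(),;
```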

Time taken for processing: under 90 seconds.

Time taken for bash to process: Over 60 hours.

Just had to wait then for mysql to catch up and import the 2.6 million entries. Left work at 10:30pm with the solution in place.

There was a further issue. Every hour we collect new data, and normally the PHP data loader works through those few thousand lines in a few minutes with no issues; with the backlog we had mounted up, however, it could no longer be trusted. Unfortunately the bash solution also took so long that while it processed entries, the “queue” of new entries got longer every hour as it fell behind. With the awk solution in place, it now takes less than 30 seconds to get the entries into the database.

Lesson Learned: Never use bash while loops for iterating through large text files.
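The real killer in the bash version is the two awk subprocesses forked for every single line: over five million forks for 2.6 million lines. If a bash loop is truly unavoidable, letting read itself split the fields at least removes the per-line forks, though it is still far slower than a single awk pass. A sketch under the same assumptions as the scripts above (same table name and field positions):

```shell
#!/bin/sh
filename=$1
# read splits the line on $IFS itself, so no subprocess is forked per line.
# The underscores are throwaway names for fields 2-6 that we don't need.
while read -r field1 _ _ _ _ _ field2 rest; do
     echo "INSERT INTO testtable VALUES ('${field1}',UNIX_TIMESTAMP(${field2}));"
done < "${filename}" > data.in
```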