Archive for the ‘Geek’ Category

The Most Common Things You Do To A Large Data File With Bash

I find that whenever I get a large data file from somewhere (e.g. extract some data from a database, or crawl some sites and dump the data in a file), I always need to do just that little bit of extra processing before I can actually use it. This processing is always just non-trivial enough, and I do it just infrequently enough, that I always forget exactly how to go about it. Of course, this is to be expected: if you learn something and want it to stick, you have to keep doing it. It’s all part and parcel of how our brain works when it comes to learning new skills, but that doesn’t make it any less annoying.

Back to our data file. I find that I almost always need to do three things (amongst others) before doing anything else with it:

  • delete the first line (especially when pulling data out of the database)
  • delete the last line
  • remove all blank lines

Don’t ask me why but for whatever reason, you always get an extraneous first line and unexpected blank lines (and less often an extraneous last line) no matter how you produce the file :) .

Anyway, my tool of choice here is bash; the job is too trivial to warrant anything else (plus I love the simplicity and power of the shell). So, to make sure I never forget again, here is the easiest way of doing all three things above using sed:

sed -e 1d -e '$d' -e '/^$/d' input_file > output_file
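
To see what each expression does, here is a small sketch run against a throwaway file. The file name sample.txt and its contents are made up for illustration.

```shell
# Build a tiny sample file: a header line, data with blank lines, and a footer.
tmpdir=$(mktemp -d)
printf 'id,name\n1,alice\n\n2,bob\n\n(2 rows)\n' > "$tmpdir/sample.txt"

# 1d drops the first line, $d drops the last line, /^$/d drops empty lines.
cleaned=$(sed -e 1d -e '$d' -e '/^$/d' "$tmpdir/sample.txt")
printf '%s\n' "$cleaned"
```

Note that the single quotes around $d matter: without them the shell would try to expand $d as a variable before sed ever sees it.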

Of course since we’re using bash, there should be numerous ways of doing the above.

You can remove the first line using awk:

awk 'FNR>1' input_file

but I don’t know how to remove the last line using awk. Anyone?
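
One possible answer to the question above, as a sketch: awk cannot look ahead, but it can hold each line back by one and print it only when the next line arrives, so the last line is never printed. The variable name prev and the sample file are just for illustration.

```shell
tmpdir=$(mktemp -d)
printf 'header\nrow1\nrow2\nfooter\n' > "$tmpdir/sample.txt"

# Print the previous line whenever a new line arrives; the final line is
# stored in prev but never printed.
no_last=$(awk 'NR > 1 { print prev } { prev = $0 }' "$tmpdir/sample.txt")

# Same idea, also skipping the first line: start printing at line 3,
# when prev holds line 2.
no_first_last=$(awk 'NR > 2 { print prev } { prev = $0 }' "$tmpdir/sample.txt")

printf '%s\n' "$no_last" "---" "$no_first_last"
```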

You can use head or tail to get rid of the first and last line:

head --lines=-1 input_file | tail --lines=+2

but not to remove blank lines.

You can use grep to remove blank lines:

grep -v "^$" input_file

but it would be silly to try and use it to remove the first and last line (possible though).
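
The head, tail and grep pieces above can also be chained into a single pipeline. This is a sketch that assumes GNU head, since a negative line count is not supported everywhere; the sample file is made up. The second variant is my own addition: it also treats lines containing only whitespace as blank.

```shell
tmpdir=$(mktemp -d)
printf 'header\nrow1\n\nrow2\n   \nfooter\n' > "$tmpdir/sample.txt"

# tail -n +2 starts at line 2, head -n -1 (GNU) stops before the last line,
# grep -v '^$' drops lines that are completely empty.
strict=$(tail -n +2 "$tmpdir/sample.txt" | head -n -1 | grep -v '^$')

# Variant: also drop lines that contain nothing but spaces or tabs.
loose=$(tail -n +2 "$tmpdir/sample.txt" | head -n -1 | grep -v '^[[:space:]]*$')

printf '%s\n' "$strict" "---" "$loose"
```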

If you know of an easier way to do the above three things in a one-liner using bash – do share it.

What are some of the most common (but non-trivial enough) things that you find yourself doing with bash when it comes to pre-processing that large data file?

WebSVN Error ‘svn: Unable to open an ra_local session to URL’

Tonight I was setting up Subversion and WebSVN on a small CentOS VPS and came across an issue where, after committing a change, my WebSVN interface would break, complaining “svn: Unable to open an ra_local session to URL”. There was also a permission denied error mixed in there, which immediately pointed to file permissions. The commit itself went through fine:


$ svn commit -m "testing commit"
Sending        testing.txt
Transmitting file data .
Committed revision 4.

However, when I then logged into WebSVN, I saw the following:

Error running this command: svn --non-interactive --config-dir /tmp log --xml --verbose --limit 2 'file:///var/svn-repos/mytestrepo/trunk/testing.txt'
svn: Unable to open an ra_local session to URL
svn: Unable to open repository 'file:///var/svn-repos/mytestrepo/trunk/testing.txt'
svn: Can't open file '/var/svn-repos/mytestrepo/trunk/testing.txt/format': Permission denied

Basically, when I commit a change, I am overwriting files in my repository, including the permissions of those files. So, as a quick fix, I simply changed my ssh user’s primary group to the same group I have set up on the svn-repos directory, and added myself to any secondary groups.

So to create a user whose primary group is “subversion” (for an account that already exists, use usermod -g subversion jake instead):

$ useradd -g subversion jake

Then view that user:

$ id jake

Then add user ‘jake’ to any other secondary user groups:

$ usermod -a -G ftp jake

Please note that the syntax can vary when using Debian. There, adding an existing user to a group is along the lines of: adduser <user> <group>
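
Before changing group membership, it can help to confirm which repository files the group actually cannot read. The following is only a sketch, run against a throwaway directory rather than the real /var/svn-repos path; the directory layout and file modes are made up to imitate the situation described above.

```shell
# Stand-in for a repository directory (hypothetical layout).
repo=$(mktemp -d)
mkdir "$repo/db"
touch "$repo/format" "$repo/db/current"
chmod 640 "$repo/format"      # owner read/write, group read
chmod 600 "$repo/db/current"  # owner only, imitating a file rewritten by a commit

# List regular files that lack the group read bit (octal 040).
unreadable=$(find "$repo" -type f ! -perm -040)
printf '%s\n' "$unreadable"
```

Pointing the same find command at the real repository path shows whether the files touched by a commit are the ones the web server’s group is locked out of.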

An Obsession With Code Names

Code names have been around for a long time. Remember the Manhattan Project in the 1940s? That turned out to be the atomic bomb. Thankfully, not all code names hide such sinister projects.

Code names can be about secrecy, but when it comes to software development, it’s usually not so much about secrecy as it is about the convenience of having a name for a specific version of a piece of software. It can be very practical to have a unique identifier for a project to get everyone on the same page and avoid confusion.

And we want to name our darlings, don’t we?


HOW TO: Sort web server logs to find top users

Sometimes you need to determine quickly which user(s) are causing high load on a particular page. Instead of tailing high-speed logs and giving yourself a headache, throw in a one-line piped cat command to give you the info you’re after without the foreplay.

cat /path/to/access.log | awk '{print $1}' | sort | uniq -c | sort -n | tail
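
Here is a variation on the pipeline above, sketched against a made-up log. It assumes the common/combined log format, where the first field is the client address and the seventh is the request path; the sample addresses and the /slow path are hypothetical. Sorting in reverse and using head puts the busiest client first, and filtering on the seventh field narrows the count to one page.

```shell
tmpdir=$(mktemp -d)
cat > "$tmpdir/access.log" <<'EOF'
192.0.2.1 - - [01/Jan/2024:00:00:01 +0000] "GET /slow HTTP/1.1" 200 10
192.0.2.1 - - [01/Jan/2024:00:00:02 +0000] "GET /slow HTTP/1.1" 200 10
192.0.2.1 - - [01/Jan/2024:00:00:03 +0000] "GET /fast HTTP/1.1" 200 10
192.0.2.2 - - [01/Jan/2024:00:00:04 +0000] "GET /slow HTTP/1.1" 200 10
192.0.2.2 - - [01/Jan/2024:00:00:05 +0000] "GET /fast HTTP/1.1" 200 10
192.0.2.3 - - [01/Jan/2024:00:00:06 +0000] "GET /fast HTTP/1.1" 200 10
EOF

# Requests per client across the whole log, busiest first.
top_all=$(awk '{ print $1 }' "$tmpdir/access.log" | sort | uniq -c | sort -rn | head -n 3)

# Requests per client for one page only ($7 is the request path).
top_page=$(awk -v page=/slow '$7 == page { print $1 }' "$tmpdir/access.log" | sort | uniq -c | sort -rn | head -n 3)

printf '%s\n' "$top_all" "---" "$top_page"
```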