
Text processing performance: Perl vs sed

Post first published in nixtip

Here’s the situation:

We have a large file to process, say 897,872 lines (72 MB); a small sample looks like this:

TheWhisperers23Chapter 22    9781442305151    "WHISPERERS"    "CONNOLLY, JOHN" 2010
TheWhisperers23Chapter 22    9781442305151    "WHISPERERS"    "TORRES, FERNANDO" 2010

Our goal is to re-order and re-case the names to:

TheWhisperers23Chapter 22    9781442305151    "WHISPERERS"    "John Connolly" 2010
TheWhisperers23Chapter 22    9781442305151    "WHISPERERS"    "Fernando Torres" 2010

PERL

At first glance Perl seems to be the right tool for the job.

First we need to find the right regexp.

The facts are:

  • Names are enclosed between double quotes.
  • We have a comma as surname – firstname separator.
  • These two facts create a unique pattern to search and replace.

Now we have to alter the order, and the case, of the name and surname.

One approach is to capture portions of the pattern so we can reuse them in the substitution.

So CONNOLLY, JOHN will match: a quote, a single letter, the rest of the word, some optional spaces, a comma, some optional spaces, a single letter, the rest of the letters of the first name, plus the closing quote.

The translation to a Perl regexp could be:

/"(\w)(\w+)\s*,\s*(\w)(\w+)"/

We use four capture groups (the parentheses) to remember the parts we have to change.

The complete substitution will be:

s/"(\w)(\w*)\s*,\s*(\w)(\w*)"/"$3\L$4 \U$1\L$2"/

Note the use of the upper/lowercase flags and how the order of the words is altered.

Perl supports in-place editing, so we can use this one-liner to get the requested result:

perl -pi -e 's/"(\w)(\w*)\s*,\s*(\w)(\w*)"/"$3\L$4 \U$1\L$2"/;' file

SED

It will be pretty much the same, the regexp is:

/"\([A-Z]\)\([A-Z]\{1,\}\) *, *\([A-Z]\)\([A-Z]\{1,\}\)"/

In standard sed we don't have Perl's handy \w word class, so we have to do it the old-fashioned way (the names are uppercase, hence [A-Z]).

In-place substitution is not part of standard sed either, but GNU sed offers it through the -i option. Note that the \L and \U case-conversion flags in the replacement are GNU extensions too.

The complete command will be:

sed -i -e 's/"\([A-Z]\)\([A-Z]\{1,\}\) *, *\([A-Z]\)\([A-Z]\{1,\}\)"/"\3\L\4 \U\1\L\2"/' file
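The same sanity check works for the sed flavor (again, \L and \U require GNU sed):

```shell
# Check the case/reorder substitution on one sample name
echo '"TORRES, FERNANDO"' | sed 's/"\([A-Z]\)\([A-Z]\{1,\}\) *, *\([A-Z]\)\([A-Z]\{1,\}\)"/"\3\L\4 \U\1\L\2"/'
# "Fernando Torres"
```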

PERFORMANCE

OK, before this experiment I would have bet on Perl, but the results are clear…

It's a draw (if not a sed win):

$ time -p perl -pi -e 's/"(\w)(\w*)\s*,\s*(\w)(\w*)"/"$3\L$4 \U$1\L$2"/' file
real 6.71
user 6.38
sys 0.32
$ time -p sed -i -e 's/"\([A-Z]\)\([A-Z]\{1,\}\) *, *\([A-Z]\)\([A-Z]\{1,\}\)"/"\3\L\4 \U\1\L\2"/' file
real 6.51
user 6.19
sys 0.31

Notes on the Perl execution flags (from the Perl documentation):

-i[extension] specifies that files processed by the <> construct are to be edited in-place. It does this by renaming the input file, opening the output file by the same name, and selecting that output file as the default for print statements. The extension, if supplied, is added to the name of the old file to make a backup copy. If no extension is supplied, no backup is made. Saying perl -p -i.bak -e "s/foo/bar/;" performs the same substitution while keeping a .bak backup of each input file.

-p causes perl to assume the following loop around your script, which makes it iterate over filename arguments somewhat like sed:

while (<>) {
    …            # your script goes here
} continue {
    print;
}

Note that the lines are printed automatically. To suppress printing use the -n switch. A -p overrides a -n switch.

-e commandline may be used to enter one line of script. Multiple -e commands may be given to build up a multi-line script. If -e is given, perl will not look for a script filename in the argument list.

Use of the ternary operator, an awk example.

Post first published in nixtip

Ok, suppose this source file:

$ cat infile
21/tcp   closed ftp
22/tcp   open   ssh
23/tcp   closed telnet
80/tcp   closed http
90/tcp   closed dnsix
95/tcp   closed supdup
100/tcp  closed newacct
162/tcp  closed snmptrap
205/tcp  closed at-5
335/tcp  closed unknown
435/tcp  closed mobilip-mn
555/tcp  closed dsf
8080/tcp closed http-proxy
8081/tcp closed blackice-icecap

Our mission will be to get a formatted output port:state:service like this

21:1:ftp
22:0:ssh
23:1:telnet
80:1:http

… and so on …

All closed ports should be marked as 1; the rest will be 0.

As always, on *nix systems we have plenty of tools (and approaches) to get the expected result; let's try the awk way…

At first sight we can identify three fields in our input file and three tasks to be solved.

  1. Get rid of the slash + tcp string of the first field.
  2. Change the value of the second field for 1 or 0.
  3. Field separator should be :

A simple text replacement is a straightforward way to get the expected result:

$ awk '{sub(/\/.*closed +/,":1:");sub(/\/.*open +/,":0:")}1' infile

Here’s the internals:

  • We look for a string starting with a slash (note the escape \/), followed by any number of any character (dot + star, .*), followed by the string closed and ended by one or more spaces ( +), and we replace that match with :1:. For the first line, 21/tcp   closed ftp becomes 21:1:ftp.

  • Same thing for open, with :0: as the substitution string; for example, 22/tcp   open   ssh becomes 22:0:ssh.
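Running the two-sub program on a couple of sample lines (spacing as in the input file) confirms the replacement:

```shell
# Sanity-check the two-sub version on a closed and an open port line
printf '21/tcp   closed ftp\n22/tcp   open   ssh\n' |
  awk '{sub(/\/.*closed +/,":1:");sub(/\/.*open +/,":0:")}1'
# 21:1:ftp
# 22:0:ssh
```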

Our initial tasks are solved, but we can refine our efforts.

Let’s use the conditional operator.

expr ? action1 : action2

It's pretty straightforward: if expr is true, action1 is performed/evaluated; if not, action2.
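A minimal stand-alone illustration of the operator (the threshold and labels are made up for this example):

```shell
# Classify a number as "big" or "small"; assigning the result to a
# variable sidesteps awk's print-grammar ambiguities
echo 5 | awk '{r = ($1 > 3) ? "big" : "small"; print r}'
# big
```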

For our example, field two must change to 1 if its value is closed; if not, it should be 0.

The needed conditional operator:

$2=="closed" ? "1" : "0"

Depending on the second field's value, our program performs a different action; in this case it returns a string: 1 or 0.

At this point, a variable is needed to store it:

n= $2=="closed" ? "1" : "0"

Finally we perform the text substitution:

awk '{n= $2=="closed" ? "1" : "0";sub(/\/.*(open|closed) +/,":"n":")}1' infile

Note that we reduce the calls to the sub function to just one.
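A quick check that the single-sub version handles both port states:

```shell
# The single-sub version gives the same output as the two-sub one
printf '21/tcp   closed ftp\n22/tcp   open   ssh\n' |
  awk '{n= $2=="closed" ? "1" : "0";sub(/\/.*(open|closed) +/,":"n":")}1'
# 21:1:ftp
# 22:0:ssh
```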

A final (and totally different) approach: field substitution instead of text replacement.

Remember our tasks:

a) Get rid of the slash+tcp string of the first field.
b) Change the value of the second field to 1 or 0.
c) Field separator should be :

Our input file has naturally three fields (by the default awk FS ):

21/tcp   closed ftp
22/tcp   open   ssh
23/tcp   closed telnet

It's clear that we can think of each line as having four fields if we add the slash / to our field separators by using a regex FS='( *)|(/)', where ( *) represents any run of spaces as separator and (/) represents the slash:

So:

awk '{print $1,$2,$3,$4}' OFS='>' FS='( *)|(/)'  infile|head -3
21>tcp>closed>ftp
22>tcp>open>ssh
23>tcp>closed>telnet

Note that the Output Field Separator OFS is changed to > for clarity.

Now we want to get rid of the second field. Technically that's not possible, but we can assign the null value (an empty string) to it:

awk '{$2=""}1' OFS='>' FS='( *)|(/)'  infile|head -4
21>>closed>ftp
22>>open>ssh
23>>closed>telnet
80>>closed>http

Attention: an explicit print statement is not needed; awk prints the current input line whenever a pattern evaluates to true.

The assignment $2="" is an action with no print of its own, so we force the default print by placing the always-true pattern 1 at the end of the program.

If we set the OFS to null value:

awk '{$2=""}1' OFS= FS='( *)|(/)'  infile|head -4
21closedftp
22openssh
23closedtelnet
80closedhttp

We're close to our goal; the last step is to process the third field:

$3=="closed" ? ":1:" : ":0:"

As we saw before, we need to assign it to a variable… note the trick:

$3= $3=="closed" ? ":1:" : ":0:"

We say: hey! change $3 depending on its previous value. So:

awk '{$2="";$3=$3=="closed" ? ":1:" : ":0:"}1' OFS= FS='( *)|(/)'  infile|head -4
21:1:ftp
22:0:ssh
23:1:telnet
80:1:http

A final optimization: the conditional operator always assigns a non-empty string to $3, and a non-empty string used as a pattern is true, which triggers the default print action. So:

awk '{$2="";$3=$3=="closed" ? ":1:" : ":0:"}1' OFS= FS='( *)|(/)' infile

Is equivalent to:

awk '$2="";$3=$3=="closed" ? ":1:" : ":0:"' OFS= FS='( *)|(/)'  infile
21:1:ftp
22:0:ssh
23:1:telnet
80:1:http
90:1:dnsix
95:1:supdup
100:1:newacct
162:1:snmptrap
205:1:at-5
335:1:unknown
435:1:mobilip-mn
555:1:dsf
8080:1:http-proxy
8081:1:blackice-icecap

We’re done.

Using sed + xargs to rename multiple files

Post first published in nixtip

Let's say that we have a bunch of .txt files and we need to rename them to .sql.

$ touch a.txt  b.txt  c.txt  d.txt  e.txt  f.txt
$ ls
a.txt  b.txt  c.txt  d.txt  e.txt  f.txt

We can use ls combined with sed and xargs to achieve our goal.

$ ls | sed -e "p;s/\.txt$/\.sql/"|xargs -n2 mv
$ ls
a.sql  b.sql  c.sql  d.sql  e.sql  f.sql

How it works:

$ ls | sed -e "p;s/\.txt$/\.sql/"
a.txt
a.sql
b.txt
b.sql
c.txt
c.sql
d.txt
d.sql
e.txt
e.sql
f.txt
f.sql

The ls output is piped to sed; the p command prints the pattern space without modifications, in other words the original name of the file.

The next step is to use the substitute command to change the file extension.

NOTE: the dot is a regex metacharacter, so we escape it with a backslash (\.) to match a literal dot.

The result is a combined output consisting of a sequence of old_file_name / new_file_name pairs.

Finally we pipe the resulting feed through xargs to perform the actual rename:

$ ls | sed -e "p;s/\.txt$/\.sql/"|xargs -n2 mv

PS: An alternative path to take care of spaces in the file names:

$ touch "a a d.txt.txt" "b b b.txt" "c c.txt" d.txt e.txt f.txt
$ ls
a a d.txt.txt  b b b.txt      c c.txt        d.txt          e.txt          f.txt

Here's the command:

$ ls | awk '{gsub(/^|$/,"\"");print;gsub(/\.txt\"$/,".sql\"")}1' |xargs -n2 mv

Result:

$ ls
a a d.txt.sql  b b b.sql      c c.sql        d.sql          e.sql          f.sql

From the man page:

DESCRIPTION

xargs combines the fixed initial-arguments with arguments read from standard input to execute the specified command one or more times. The number of arguments read for each command invocation and the manner in which they are combined are determined by the options specified.

The -n parameter:

-n number: Execute command using as many standard input arguments as possible, up to number arguments maximum. Fewer arguments are used if their total size is greater than size bytes, and for the last invocation if there are fewer than number arguments remaining. If option -x is also coded, each number arguments must fit in the size limit.

The -n2 flag forces xargs to take two arguments at a time from the piped output and pass them to the mv command to get the job done.
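A handy way to see how -n2 pairs the arguments is a dry run with echo in front of the real command:

```shell
# Each invocation receives two arguments, so echo shows the mv calls
# that would be executed
printf 'a.txt\na.sql\nb.txt\nb.sql\n' | xargs -n2 echo mv
# mv a.txt a.sql
# mv b.txt b.sql
```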

Print lines between two patterns, the awk way…

Post first published in nixtip

Example input file:

test -3
test -2
test -1
OUTPUT
top 2
bottom 1
left 0
right 0
page 66
END
test 1
test 2
test 3

The standard way…

awk '/OUTPUT/ {flag=1;next} /END/{flag=0} flag {print}' infile
top 2
bottom 1
left 0
right 0
page 66

Self-explanatory indented code:

awk '
/OUTPUT/ {flag=1;next} # Initial pattern found --> turn on the flag and read the next line
/END/    {flag=0}      # Final pattern found   --> turn off the flag
flag     {print}       # Flag on --> print the current line
' infile

The first optimization is to get rid of the print: in awk, when a condition evaluates to true, print is the default action, so whenever the flag is true the line gets echoed.

To delete the next statement, and still prevent the OUTPUT tag line from being printed, we need to activate the flag after the flag evaluation, that is, place the /OUTPUT/ rule at the end.

A slight variation of the program flow and we’re done:

awk '/END/{flag=0}flag;/OUTPUT/{flag=1}' infile
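A quick check of the compact version against a reduced sample of the post's input:

```shell
# Only the lines strictly between OUTPUT and END are printed
printf 'test -1\nOUTPUT\ntop 2\npage 66\nEND\ntest 1\n' |
  awk '/END/{flag=0}flag;/OUTPUT/{flag=1}'
# top 2
# page 66
```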

PS: What if we only want to print the lines enclosed between the OUTPUT && END tags?

© Juan Diego Godoy Robles