command line data processing: adding awk to the arsenal

2021-07-21

 | 

~3 min read

 | 

404 words

Continuing my exploration of slicing and dicing data from the command line, today I want to write about awk. In many ways awk is similar to grep, however, it seems more ergonomic for certain types of conditional filtering to me.

For example, one of the questions I asked in Combining sed, cut, and xargs To Clean Data was “What is the list of primary keys where the attribute is true?”

As a reminder, I was dealing with an example.csv where the data looked like this:1

example.csv
"pk","fk","hasAttribute","_count"
"69870","80576","true","12"
"142927","80776","true","6"
"98738","57138","false","4496"
"98828","57138","true","200"

What Is awk

Fundamentally, the way awk works is that it will break a file up into space delimited records (i.e., a row).

Using the -F flag, we can change the delimiter however:

awk -F, # comma delimited
awk -F\; # semicolon delimited
awk -F\| # pipe delimited

Once you have the delimiters in place, awk will separate a record by variable placeholders (similar to SQL). $0 is the full row, while $1, $2, etc. are each individual attribute.

So, in our example, for the first record:

  • $0 is "pk","fk","hasAttribute","_count"
  • $1 is "pk"
  • $2 is "fk"
  • $3 is "hasAttribute"
  • $4 is "_count"

Using awk

Now that we have the basics down, let’s use awk to filter our file to only show the foreign keys where the attribute is true.

% awk -F, '$3~/true/' example.csv
"69870","80576","true","12"
"142927","80776","true","6"
"98828","57138","true","200"

But that’s not just the column we care about - to do that, we’d pass in $2 to the action print:

% awk -F, '$3~/true/ {print $2}' example.csv
"80576"
"80776"
"57138"

With this list in hand, we can return to what we learned about using sed and xargs to create a single list of comma separated keys. (We don’t need cut because awk handled that with the print action):

% awk -F, '$3~/true/ {print $2}' example.csv |
sed 's/"//g' |
xargs echo |
sed 's/ /,/g'
80576,80776,57138

Et voila! We have our filtered list!

Additional Resources

Footnotes

  • 1 For what it’s worth, I know that this example could be easily accomplished with grep, but that’s a feature of the fact that there’s only one boolean flag in a row. If the data were more complicated, awk’s column specificity would prove invaluable.

Related Posts
  • Combining sed, cut, and xargs To Clean Data


  • Hi there and thanks for reading! My name's Stephen. I live in Chicago with my wife, Kate, and dog, Finn. Want more? See about and get in touch!