Two Flavors of Pipes

On June 1, I started my journey at DocuSign. While starting remote has been a bit weird, I have had an incredible experience so far: shipped some code, learned some things, and even made a fool of myself a couple of times. I am extremely grateful to have found employment in these crazy times, especially meaningful and interesting employment. There is lots of C# at DocuSign, so I have been migrating away from C++- and Linux-focused development into new, uncharted waters. A side effect of this is moving from Unix to PowerShell for some of the automated scripting. Just before I started at DocuSign, I gained a newfound appreciation for Unix (and Unix pipes) after reading Brian Kernighan's latest book, Unix: A History and a Memoir. Never used pipes? Well, today is your lucky day! This post will show you how to use pipes in Unix and then how a similar effect can be achieved in PowerShell.

Pipes

Ah yes, so what is a pipe? This is a good place to start. A pipe, or more formally a pipeline, is basically a mechanism for chaining multiple processes together by feeding the output of one into the input of the next. A slightly more computer-sciency way of saying this is "a set of processes chained together by their standard streams, so that the output text of each process (stdout) is passed directly as input (stdin) to the next one" (Wikipedia). So yes, a pipeline is probably almost exactly what you thought it was.
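For a quick taste before we dig in, here is a minimal (and classic) illustration: who prints one line per logged-in user, and piping its output into wc -l counts those lines. The number you get will, of course, depend on your machine:

> who | wc -l
1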


Douglas McIlroy

This concept was first brought into the world of computing by Douglas McIlroy at Bell Labs (and subsequently implemented in Unix by the mythical Ken Thompson). Here is a fantastic quote about the invention of the pipe: “His ideas were implemented in 1973 when (“in one feverish night”, wrote McIlroy) Ken Thompson added the pipe() system call and pipes to the shell and several utilities in Version 3 Unix. “The next day”, McIlroy continued, “saw an unforgettable orgy of one-liners as everybody joined in the excitement of plumbing.”” (Wikipedia).

Pipes in Unix

Ok, time for an example. For anyone running a version of Unix - macOS, Ubuntu, any other Linux derivative, etc. - this one is for you (feel free to open up a terminal and follow along - I ran each line of code as I wrote this, so the examples should work). I will be creating an example very similar to Brian Kernighan's favorite: we will be determining the misspelled words in a text file.

For starters, we have a text file (text.txt):

I like to write commands using pipes and pipes are what I like to wriite

First, we want to read from the file. We can do this using cat:

> cat text.txt
I like to write commands using pipes and pipes are what I like to wriite

Then we want to lowercase all the words, since our dictionary only contains lowercase letters. To do this to the text from our file, we use the | operator to pipe the output of our first command into the input of our second command:

> cat text.txt | sed -e 's/.*/\L&/g'
i like to write commands using pipes and pipes are what i like to wriite

The sed command takes a regular expression, matches every occurrence of that expression, and replaces each match with the second expression. In this case our find expression is .*, which matches the entire line, and our replace expression is \L&, where & stands for the matched text and \L converts it to lowercase. (Note that \L is a GNU sed extension, so it works on Linux but not with the BSD sed that ships with macOS.)
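As an aside, if you are following along on macOS (or any system without GNU sed), a more portable way to get the same lowercasing effect is the tr command, which translates characters from one set into the corresponding characters of another:

> cat text.txt | tr '[:upper:]' '[:lower:]'
i like to write commands using pipes and pipes are what i like to wriite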

Now, we want to put every word on its own line. Using the sed command again, we match against whitespace \s and we replace with a newline \n:

> cat text.txt | sed -e 's/.*/\L&/g' | sed -e 's/\s/\n/g'
i
like
to
write
commands
using
pipes
and
pipes
are
what
i
like
to
wriite

The next step is to alphabetize our list and then remove duplicates. Note that uniq only collapses adjacent duplicate lines, which is why we sort first. Removing duplicates also lets us perform fewer checks against our dictionary in the next step:

> cat text.txt | sed -e 's/.*/\L&/g' | sed -e 's/\s/\n/g' | sort | uniq
and
are
commands
i
like
pipes
to
using
what
wriite
write

Finally, it is time to check for misspelled words. To do this, we check whether each word is in the dictionary; if it is not, it must be misspelled. Assuming the computer has a text file called dictionary with one word per line, we will use the grep command to do just that. The grep command is a Unix command for searching for lines of text that match a certain pattern (regex again!!). By using it with a few flags - v, which inverts the match and returns the non-matching lines; f, which reads the patterns to match from a file; and x, which matches only against whole lines - we get the desired effect:

> cat text.txt | sed -e 's/.*/\L&/g' | sed -e 's/\s/\n/g' | sort | uniq | grep -xvf dictionary
wriite

Yep, so next time you’re in a Zoom party with friends on a Friday night, share your screen and impress them with this.

Pipes in PowerShell

Ok, so you run Windows but still want to impress your friends. How does the above translate? Are there pipes in PowerShell? Questions, questions, questions.

It turns out there are pipes, and there have been for quite a while. However, they come with one small caveat… when piping in PowerShell, output is passed as objects, not just unformatted streams of text. This actually turns out to be quite useful! Let's dive in and see how we can recreate the above.
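Before recreating the example, here is a small illustration of what "objects, not text" buys you (the processes and numbers will differ on your machine): Get-Process emits Process objects, so the next stage of the pipeline can filter on a property like CPU directly, rather than parsing columns of text:

> Get-Process | Where-Object {$_.CPU -gt 100} | Select-Object Name, CPU

This prints a Name/CPU table for any processes over the threshold, with no awk-style text surgery required.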

To start with, we could use cat, but that seems like cheating… let's use a PowerShell cmdlet! Cmdlets are PowerShell's native commands. Many useful ones are included with Windows and you can of course write your own as well.
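As a quick aside on writing your own: a pipeline-aware command can be as small as a function with a process block, which runs once per piped-in object. Here is a toy sketch (Get-Shout is a made-up name, purely for illustration):

> function Get-Shout { process { $_.ToUpper() } }
> "hello" | Get-Shout
HELLO

For our purposes, though, a built-in cmdlet will do the reading: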

> Get-Content text.txt
I like to write commands using pipes and pipes are what I like to wriite

Ok, so far so good. Now let's unpack what I meant by passing objects instead of unstructured text. When we use the Get-Content command here, we get a string. In the next stage of the pipeline, % (shorthand for the ForEach-Object cmdlet) runs its script block once for each object piped in, and the special $_ variable holds the current object (variables in PowerShell are denoted by the $). Thus, we can combine the splitting of the line on whitespace and the lowercase conversion in one go (in a much friendlier, more readable way… no regex!):

> Get-Content text.txt | % {$_.Split(" ").ToLower()}
i
like
to
write
commands
using
pipes
and
pipes
are
what
i
like
to
wriite

Now it is time to sort and remove duplicates. PowerShell ships with commands similar to the ones we used in Unix: Sort (an alias for Sort-Object) and Get-Unique:

> Get-Content text.txt | % {$_.Split(" ").ToLower()} | Sort | Get-Unique
and
are
commands
i
like
pipes
to
using
what
wriite
write

Wow - this is almost a 1:1 translation so far and we even avoided regex! Ok, now it gets a little messy. Unlike grep above, Compare-Object does not limit itself to one side of the comparison: it returns every difference between the two sets, each tagged with a SideIndicator, <= meaning the item appears only in the left (reference) set and => meaning only in the right (difference) set. As you may imagine, this is rather unfortunate when comparing against a large dictionary. Since the dictionary is our reference set, the words unique to our input are tagged =>, so that is what we filter for:

> Get-Content text.txt | % {$_.Split(" ").ToLower()} | Sort | Get-Unique | Compare-Object (Get-Content dictionary) | Where-Object {$_.SideIndicator -eq "=>"}
InputObject SideIndicator
----------- -------------
wriite       =>

Ah, so we have what we want, but now it is in this wonky table. IMO, this is an unfortunate side effect of PowerShell's object approach. However, it is not that hard to transform the output into the desired format:

> Get-Content text.txt | % {$_.Split(" ").ToLower()} | Sort | Get-Unique | Compare-Object (Get-Content dictionary) | Where-Object {$_.SideIndicator -eq "=>"} | Select-Object "InputObject" | Format-Table -HideTableHeaders
wriite

My apologies, I realize that this one-liner is now flowing into the abyss off the right side of the code box. That being said, I actually believe this one-liner is much more readable than the Unix one, although that readability takes away some of the fantastical magic when you try to impress your friends.

Conclusion

So, besides this being an interesting exercise, what's the point? My point is that after two summers deep in the world of cryptocurrency in San Francisco, I was indoctrinated into a certain mindset: one where Unix was the only environment worth taking seriously and anything Windows-flavored was dismissed out of hand.

This is of course a minor exaggeration. That being said, I am learning (or rather relearning, if you talk to my 11th grade English professor) that it is almost always best to take a nuanced approach - by considering the tradeoffs of a decision and both/all points of view, you end up with a much stronger, more comprehensive final outcome. In software this means using Linux when appropriate and Windows when it is not. In astrophotography, it means taking some wide-angle shots and some close-ups. In collegiate track and field 400m training, it means 1000m workouts and 50m speed work (yes, 1k's are no fun for short sprinters, but hey, I cut 4 seconds off my 400m PR in one season, thanks Dave). So, since I'll probably be living in both Windows and Unix systems for the next X years of my software engineering career, it'll pay off to understand both.