NCBI NOW, Lecture 3, Introduction to the Linux Shell


Hello, this is Ben Busby, Genomics Outreach Coordinator
for NCBI again. And today we’re going to learn
about basic Linux commands. Two caveats before we begin. First, if you are a Linux user, this tutorial is going
to be too basic for you. What I would suggest is turn off this video and go
to module four, “DNA Seq Mapping.” If you’re not sure
if you’re enough of a Linux user
to benefit from this, you might as well watch
the first few minutes and then you can skip
through until you feel like we’ve met you at your
particular skill level. Caveat number two
is that we’ve based this heavily on the Software
Carpentry program. While we’re not using
any of their data files, we use some fairly
similar examples. And I will try to point out
when we do that. And I would really
encourage you, if you are new to Linux, to go try out
the Software Carpentry thing. Additionally, if you
are at an institution where they offer Software
Carpentry courses, go check them out. And if you’re not, you should
encourage your institution to offer Software Carpentry, or a similar thing. Here I am on a command line
of a Mac. There are many ways
to get into the command line, and some popular ones are
outlined in the handout for the “DNA Seq” section,
so you can grab them. The first command I
like to use is “whoami.” This tells me my user name
on whatever system I’m on. Another question I can
ask is “Where am I?”, which I do by typing “pwd.” This will tell me where,
on the system, I am. As you can see, I’m a couple
of levels up from the bottom, and for the purposes
of this lecture, we are going to stay there. I can also ask
the question very easily, “What was I just doing?” I do this by typing “ls -ltr.”
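As a rough sketch, the three questions so far map onto commands like the ones below. The scratch directory and its two file names are made up for illustration, so that the ordering is easy to see; they are not from the lecture.

```shell
# Who am I? Print the current user name.
whoami

# Where am I? Print the full path of the current working directory.
pwd

# Hypothetical scratch directory holding two files with known timestamps.
demo_dir=$(mktemp -d)
touch -t 202001010000 "$demo_dir/older.txt"
touch -t 202101010000 "$demo_dir/newer.txt"

# What was I just doing? -l long format, -t sort by modification time,
# -r reverse the order, so the most recently changed entry is listed last.
ls -ltr "$demo_dir"
```

Running "ls -ltr" with no directory name lists the current directory instead, which is how it is used in the lecture.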
Great. So I can see all of
the directories that are here in my home directory
on this computer. Now I’d like to go with you
through what “ls -ltr” means. So “ls” simply lists what’s
in my current directory. “ls -l” lists the stuff
in my directory and it shows its
long form attributes. Periodically I am
going to type “clear,” and that’s simply
to erase the screen and move my cursor
up to the top. If I type “ls -lt” I can see
the timestamps of things, shown here, in most
recent to oldest. But say I had a lot of files
in a particular directory and the listing scrolled off the screen. I would want to see the most
recent at the bottom, and that is why I use the “ls
-ltr” command shown here; the “r” reverses the order. Another thing that you might want
to do when you’re first starting to learn Linux is
to make directories. The way you would
make a directory here is you would say “mkdir”
and a directory name. I like to use “foo” for things
when I’m doing demonstrations. As you will learn when you
progress in your Linux life, “foo” is a metasyntactic
variable. It can be used
for lots of things. But say you write
some sort of program that creates 10,000
different temporary files. You’ll want an easy way
to delete them, and putting “foo” in each of the file names is
a great way to do that. So I can make
a directory called “foo.” Now, if I hit “ls -ltr,” which I can do by finding it
in my recent commands just by pressing– I’m now just
pressing the up arrow and then pressing
the down arrow, if I go up and find it I can now
see that the most recent thing I’ve done is to create
the directory “foo.” I can see this is a directory because this character
here is a “d.” These other
characters describe read, write, and execute access for me, my group, and all users. As you can see, not all users have write access
to this directory. I can change that by using
the command “chmod.” For the purposes
of this workshop, we’re going to use
the command “chmod 777,” but you might want
to talk to your sysadmin or someone who can advise you
on how to use Linux about what settings to set. When you initialize the cloud
instance for the hands-on part of this workshop you’re going
to use the command “chmod 400.” That’s an important thing
to do for internet security. So, once again, we’ve made
the directory “foo.” We’re going to check, once again, by pressing up, and we’ll see now that the directory “foo”
is still there and everyone now has write
access to that directory. Great.
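Here is a minimal sketch of those permission steps, run in a throwaway directory. The file name "key.pem" is hypothetical; the lecture does not say which file gets "chmod 400" on the cloud instance.

```shell
# Work in a scratch directory so nothing in the home directory is touched.
cd "$(mktemp -d)"

mkdir foo
ls -ld foo          # the leading "d" marks a directory

# 777 = read, write, and execute for the owner, the group, and all users.
chmod 777 foo
ls -ld foo          # the permission string now reads drwxrwxrwx

# 400 = read-only for the owner, no access for anyone else.
# key.pem is a hypothetical stand-in for a private file.
touch key.pem
chmod 400 key.pem
ls -l key.pem       # the permission string now reads -r--------
```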
So I’m going to clear the screen and now I’m going
to move into the directory. We move between directories
with the “cd” command. Also, I can tab
complete things in Linux; this is very important. So if I hit “cd f” I can then press “Tab”
to complete that command, as you can see here. Now I can check
by pressing the “Up” button if there is anything in “foo.” Sure enough, there is nothing in “foo.” And I’d like to show you
how to make a file. So what we’re going to do
is use the command “echo.” And we’re going to echo “hello!” We’re going to press “Enter.” Now, what that does is it just
echoes the string “hello!” to our screen.
Alternatively, by pressing up to
bring the command back, we can echo the word “hello!” to a file by redirecting it with “>.” Call this file “bar.” Now, if we look at our files, we have one file
that is called “bar.” Note that it is not a directory, and that can be seen here. We can make
a directory called “foo2,” and change
the directory into it. If we want to go back
to our directory “foo,” we can go back one level. And, once again, we can always
check what’s in that directory. We see that we
have a file called “bar” and the directory called
“foo2.” Now, if we want to see
what’s in “bar” there are many ways to do this. Three popular ways are “more,”
“cat,” and “less.” I’m going to go over the “cat” and “less” commands, as they will be useful
to us later in this discussion. So we can say “cat bar,” and that will print out
the contents of the file “bar.” We can also say “less bar” and that will take us
into the “less” interface where we can
see everything that’s in there. If we want to get out of
the “less” interface we can hit “q.”
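The steps so far can be sketched roughly as follows, again in a throwaway directory. The single quotes around hello! are my addition; they keep the exclamation mark from being treated specially in an interactive bash session.

```shell
# Start from a scratch directory and recreate the layout from the lecture.
cd "$(mktemp -d)"
mkdir foo
cd foo                 # interactively: type "cd f" and press Tab to complete

echo 'hello!'          # prints hello! to the screen
echo 'hello!' > bar    # ">" sends the same text into the file bar instead

mkdir foo2
cd foo2
cd ..                  # ".." means one level up, so we are back in foo

ls -ltr                # lists the file bar and the directory foo2
cat bar                # prints the contents of bar to the screen
# less bar             # opens bar in the interactive pager; press q to quit
```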
Also, for any command
in the Linux environment, there should be a “man” page,
which is a manual. So we can go to “man less” and look at the description
of this program. So here are commands, and the nice thing about “less” is we can navigate around using
the up and down keys, the page up and down keys. We can also search for, say,
the word “right” by hitting “/” and then typing
what we’re searching for. We can get out of
the “less” functionality, because man pages are, by default, displayed in “less,” by using the letter “q.” We can also look
at a man page for “ls.” “ls” means
to list directory contents. And you can see
that there are many, many options here that I will
let you explore on your own. So, to recap
what we’ve done so far, we’ve created two directories, and we’ve also made
a file called “bar” which contains the string
“hello!.” Once again, by typing or scrolling up
to “ls -ltr” we can see how we did that. If we remember
that we ran a command but we can’t remember exactly
what we typed, we can use the “ctrl+r” function to search for the thing
we typed. So, for example, if we wanted to go back to our
last make directory command we would press “ctrl+r” and type “mkdir.” We can see that the last thing
we typed was to make directory “foo2.” But running
that again is not what we want. So what we’ll do now is press
“ctrl+c” and get out of that. That’s an important note. “Ctrl+c” will get out of most
things in the Linux environment. Now what I would like
to show you how to do is use a text editor
in the Linux environment. Most Linux environments come with several
simple text editors, including nano and vim. I personally use vim when I’m in
the Linux environment, but there’s a bit
of a steep learning curve, and I’d encourage you
to check it out and maybe print a cheat sheet. However, for the purposes
of this course, we are going to use nano. If you’re using Git Bash,
nano may not work, but what you can do is you can
make a file in a text editor, such as TextPad– please don’t use Microsoft Word
for this exercise– and then save it
as a plain text file, and then you can
open it wherever your Linux implementation is. So now what I’m going to do
is I am going to go into nano. When I’m in nano I would like to
make a file with three columns. So in the first column I want to
put the letters “A,” “B,” “C.” In the second column I am going
to put the numbers “1,” “2,” and “3.” And then in the third column I am
going to put a bunch of names, like “Wayne,” “Ben,” “Jonathan.”
Great. Then what I will do
is press “ctrl+x” to exit, and it will ask me
if I want to save. I do want to save, and I can call the file
whatever I want. In this case, I’m going to call
the file “hi mom.” Now, once again, I can take a look
at my file by “cat” or “less.” I’m going to use “cat” here. And, once again, I can tab complete
by simply typing the word “hi” and then
finishing it up with that. So here “cat” will allow me
to look at that file. I can make another file,
if I like, and this one I’m going to make
a comma separated value file. So what I’m going to do is I’m going to use a different
string of numbers– I’m sorry, letters, and a different
string of numbers. Oh, I told you I would make
this a comma separated file, so I’m going to simply do this. And I’m going to use
some different names. By the way, these are
all names of people that have contributed
to this course. So this is my way
of thanking them. So what else can I
do with these files? Well, you’ve seen
the “cat” command, but what “cat” actually
stands for is “concatenate.” Once again, we can look at the man page, which is kind of humorous
because it’s “man cat.” We can see
what “cat” is capable of. The main thing
that people use “cat” for is to join different files. So we could concatenate “hi mom” and “fake_csv.” However, the problem, as we will see later, is that these two
files have different delimiters. To recap, we’ve made a bunch
of files in the directory “foo.” We can check what those files
are by typing “ls -ltr.” Remember that we
have the two files, “hi mom” and “fake_csv”
that look like this. One thing we might want to do– and this is very
common in genomics– is change the delimiters
in one of the files so it matches the other one. There are many ways to do this. One way would be to use
“sed” and another way would be to use “tr.” When I was putting together this presentation I considered
using “sed” to do this. However, upon Googling it, it became very obvious
by looking at Stack Overflow that it is fairly complicated
to use “sed” for this process, but by using “tr” it’s a very,
very simple thing. So I’m going to show you
how to do this using “tr.” Also, Stack Overflow is
an invaluable resource for asking simple Linux
and programming questions. So, using “tr,” I could actually simply take
the command from Stack Overflow, I can copy it here, and paste it in here. Now, one thing I need
to do is replace the file name with my actual file name and replace whatever they
were trying to replace in the original question
in Stack Overflow with the character
I’m interested in. And now we can see that we get
the output we wanted. So one thing we could do
is redirect the output that we just saw
to an intermediate file called “foo10,” and then “cat” “hi mom” and “foo10”
to either the screen or a file. The nice thing about having
a file called “foo10” is then I know that later on I can
probably delete this file. So I
“cat” the two into a file called “bar100,” and then look at “bar100.”
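A minimal sketch of the delimiter change and the join, under several assumptions: the first file is taken to be named "hi_mom" with an underscore, its columns are taken to be separated by single spaces, and the contents of "fake_csv" (apart from the name Monica and the values 100, 1000, and 10000 that show up later) are invented here.

```shell
cd "$(mktemp -d)"

# Hypothetical stand-ins for the two files typed into nano.
printf 'A 1 Wayne\nB 2 Ben\nC 3 Jonathan\n' > hi_mom
printf 'M,100,Steve\nE,1000,Monica\nD,10000,Lisa\n' > fake_csv

# tr reads standard input, so the file is fed in with "<".
# Every comma becomes a space; the result is saved as foo10.
tr ',' ' ' < fake_csv > foo10

# Now that the delimiters match, join the two files into bar100.
cat hi_mom foo10 > bar100
cat bar100
```

Because "foo10" is only an intermediate file, it can be removed with "rm foo10" once "bar100" looks right.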
But say I had a file that was a million lines long. There are several things I might
want to know how to do with it. For one thing, I might just want
to look at the first five lines. One way to do that would be
to type “head -5 bar100.” Or, to
look at the last three lines, I can type “tail
-3 bar100.” If I typed this, “bar100,” I am going to
get something totally incorrect. So the best thing to do in this case is
to tab complete what I want. Here I can see that
I’ve tab completed bar. If I press “Tab” two more times I can see that there’s
a file “bar” and a file “bar100.” So if I type “1” I can tab
complete “bar100.” So, once again, now I have the last
three lines of “bar100.” This is very important
when you use large files. You could also sort “bar100,” and what you would
get are these lines in alphabetical order
for the first column. You can also sort by column two by typing “sort -k 2.”
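Here is a rough sketch of those "head," "tail," and "sort" calls, including the numeric flag that comes up next. The contents of "bar100" are a hypothetical reconstruction, and "head -n 5" is the longer spelling of "head -5."

```shell
cd "$(mktemp -d)"

# Hypothetical reconstruction of bar100 (six space-separated lines).
printf 'A 1 Wayne\nB 2 Ben\nC 3 Jonathan\n' > bar100
printf 'M 100 Steve\nE 1000 Monica\nD 10000 Lisa\n' >> bar100

head -n 5 bar100        # first five lines
tail -n 3 bar100        # last three lines

sort bar100             # alphabetical, starting from the first column

# Column two compared as text: 100, 1000, and 10000 sort ahead of 2 and 3.
sort -k 2 bar100

# -n compares column two as numbers: 1, 2, 3, 100, 1000, 10000.
sort -n -k 2 bar100
```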
For more information on that you could check out
the “man” page of “sort.” One thing we might want
to check out is this option, numeric sort. So we can see that
even though it sorted all the lines on these values, it puts 100 and 1,000
and 10,000 between one and two. So what we can do is add
the “n” flag after “sort.” And now we see that it
does an actual numeric sort. Let’s check
what files we’ve made so far. One thing I would like to show
you now is how to use pipes. I have really co-opted quite a bit of this discussion
from Software Carpentry. So thank you to Greg Wilson
and his associates. So, once again, I can take
the head of “bar100” by typing “head -2 bar100.” And I can get the tail
of that file by typing “tail” and “bar100.” But say I wanted to get the two
middle lines of “bar100.” One easy way to do that would be
to take “head -2” of “bar100” (sorry, “head -4” of “bar100”) and then
take the last two lines of this. I can do that with a simple
character called “pipe,” written “|.” What I can then do
is take “head -4 bar100 | tail -2,” and that will give me the two
middle lines of this file. I find that to be
extremely useful. Additionally, I could use this in conjunction with
the “sort” command. So if I wanted the two first
names in alphabetical order, I could add “sort -k3” to the beginning of my command,
like this. And I said I wanted the first
two names of “bar100.” However, this command is wrong, and it is a common mistake
students make with pipes. The reason this command is wrong is that the file name has to go with
the first command in the pipe. If I press “Enter”
this will simply hang and I will need
to press “ctrl+c.” I need to clear this. What I will then
do is correct this by moving the file name
into the first command. And, as we can see, this command now
works perfectly. Another thing I can do is I can
search for something in a file. So if we look
at “bar100,” perhaps we’d like to search for the line
that contains the name “Monica.” If you’re starting
to do genomics work, you can easily imagine ways you’d want to search
very large files, or cases in which you might want
to search very large files. For that, I use
the command “grep.” And I can “grep” for
the term “Monica” in this file. And it will print me
out the line that contains the name “Monica.” If I wanted to grep any name
with the character “M” in it I can look at the file (typically I wouldn’t
look at a big file) and then I would type
“grep M bar100.” Now, one problem
with that is if there was an “M” anywhere else in the file
it would also print those lines. If you want more information about how to
make those searches more specific, please check out
the grep man page. There are many, many options,
and it is very, very powerful. So say we wanted
to grep the lines with names starting with “M,”
just like we did, but then we wanted to sort
by the first column, we could do that very easily by piping the grep command
to the sort command, as we learned
in the last example. So now that we’ve seen how
to pipe two commands together, we can make what is called
a “bash script” with these two commands. What we can do is echo “grep
M bar100” into a file, my first script. And you want to end the file name
with the suffix “.sh.” So then what we can do is run
that script by typing “bash” and then tab completing
the file name. And we get exactly
the same output we did when we ran it
from the command line. For those of you who are new
to Linux, congratulations, you’ve written your
first computer program. Another neat thing we could do is go
into our script by using nano, and we can replace “M”
with a variable. In this case, we can
use “$1.” We’ll exit and we’ll save it. Then, when we run the script, we put in a value
for “$1.” So, in this case, we can use “M.”
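The pipes, the grep searches, and the script from the last few minutes can be sketched roughly like this. The contents of "bar100" and the script name "my_first_script.sh" are assumptions, and the script is written with "echo" here rather than edited in nano.

```shell
cd "$(mktemp -d)"

# Hypothetical reconstruction of bar100 (six space-separated lines).
printf 'A 1 Wayne\nB 2 Ben\nC 3 Jonathan\n' > bar100
printf 'M 100 Steve\nE 1000 Monica\nD 10000 Lisa\n' >> bar100

# Two middle lines: keep the first four lines, then the last two of those.
head -n 4 bar100 | tail -n 2

# First two names alphabetically. The file name belongs to the FIRST command.
# Wrong (sort waits on the keyboard and hangs): sort -k3 | head -n 2 bar100
sort -k3 bar100 | head -n 2

grep Monica bar100       # only the line containing Monica
grep M bar100            # any line with a capital M anywhere in it
grep M bar100 | sort     # the same lines, sorted from the first column

# A one-line script holding the pipeline, run with bash.
echo 'grep M bar100 | sort' > my_first_script.sh
bash my_first_script.sh

# Same script with a variable: $1 is the first argument on the command line.
# Single quotes keep $1 from being expanded while the file is written.
echo 'grep "$1" bar100 | sort' > my_first_script.sh
bash my_first_script.sh M
bash my_first_script.sh W
```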
We could also use “W” or “B.” I encourage you
to try this at home. So now you’ve written
a script with a variable. Now that we’ve written a script, I’d like to show you
another command and then come back
to our script. So another thing we might
want to do is just to print the first column
of a particular file, or the second column,
or the third column. There are many ways
to do this in Linux, but I like to use
something called “awk.” “Awk” is a programming language which can be used
to do many things. One reason that “awk”
is important in genomics is
there are instances, like in epigenomics, where you may want to cut
the fourth or fifth column of a three-million-line file. In the next two minutes
it should become apparent how you might do this with awk. So, for example, I could
say “awk '{print $1}' bar100.” That would give me the first column. As
I’m sure you can guess, if I use “print $2” or “print $3”
I get similar results. The nice thing is I can
also “print $1” and separate it with a comma
from “$3.” Hence, it should be intuitive how you
could add or delete a column of X things. For example, ones. Let me show you that explicitly. So if I want to insert
a column of ones into my file, I can simply do
that by replacing the comma with “space 1 space”
and press “Enter.” One common application of Linux in genomics is finding
each element of a column from one file in another file. One way to do this is to use
a “for” loop with grep and awk. In the final lesson of this particular lecture
that’s what we’ll go over. So, first, let’s
define a variable by saying “for i in `awk
'{print $3}' foo10`.” So what we’ve said here is
for each item in column three, define a variable “i” in a loop.
Then we say “do,” and then we’re going
to say “echo $i” and “done.” So this is simply going
to print each item in the third column of “foo10.”
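Here is a minimal sketch of the awk calls and the loop, plus the grep variant that the lecture turns to next. The file contents are hypothetical, "$(...)" is used in place of backticks (the two are equivalent here), and "print $1, 1, $3" is my reading of how the column of ones was inserted.

```shell
cd "$(mktemp -d)"

# Hypothetical reconstructions of foo10 and bar100.
printf 'M 100 Steve\nE 1000 Monica\nD 10000 Lisa\n' > foo10
printf 'A 1 Wayne\nB 2 Ben\nC 3 Jonathan\n' > bar100
cat foo10 >> bar100

awk '{print $1}' bar100          # first column only
awk '{print $1, $3}' bar100      # first and third columns
awk '{print $1, 1, $3}' bar100   # a column of ones between them

# Loop over every item in column three of foo10 and print it.
for i in $(awk '{print $3}' foo10); do
    echo "$i"
done

# Same loop, but look each item up in bar100 instead of printing it.
for i in $(awk '{print $3}' foo10); do
    grep "$i" bar100
done
```

The loop splits awk's output on whitespace, so this approach assumes the column values themselves contain no spaces.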
Now, using the same logic,
I could also “grep $i” in “bar100.” And I see
the results we would expect. Finally, to obtain
a similar result, one could call the script
that we previously wrote. And we see that, again,
we get the intended results. Thank you very much. I hope this has been a useful
introduction to Linux. This should give you the basics to be able to execute
all of the commands in the next four lectures
as well as the hands-on. And I hope,
for those of you new to Linux, this will be a jumping-off
point to being able to process lots of data sets
on the command line. Thanks and have a nice day.