3. TECHNICAL

3.1. More detailed explanation of basic sed

Sed takes a script of editing commands and applies each command, in order, to each line of input. After all the commands have been applied to the first line of input, that line is output. A second input line is taken for processing, and the cycle repeats. Sed scripts can address a single line by line number or by matching a /RE pattern/ on the line. An exclamation mark '!' after a regex ('/RE/!') or line number will select all lines that do NOT match that address. Sed can also address a range of lines in the same manner, using a comma to separate the 2 addresses.

     $d               # delete the last line of the file
     /[0-9]\{3\}/p    # print lines with 3 consecutive digits
     5!s/ham/cheese/  # except on line 5, replace 'ham' with 'cheese'
     /awk/!s/aaa/bb/  # unless 'awk' is found, replace 'aaa' with 'bb'
     17,/foo/d        # delete all lines from line 17 up to 'foo'
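As a quick check of the addressing forms above, the following sketch runs three of them against a small made-up input. The sample text and the variable names are illustrative assumptions, not part of any standard script.

```shell
# Hypothetical five-line input; any text file would do.
sample='ham 1
ham 2
ham 3 awk
ham 4
ham 5'

# '$d' deletes the last line of the input.
no_last=$(printf '%s\n' "$sample" | sed '$d')

# '2!s/ham/cheese/' substitutes on every line except line 2.
not_two=$(printf '%s\n' "$sample" | sed '2!s/ham/cheese/')

# '/awk/!s/ham/cheese/' substitutes unless 'awk' occurs on the line.
not_awk=$(printf '%s\n' "$sample" | sed '/awk/!s/ham/cheese/')

printf '%s\n---\n%s\n---\n%s\n' "$no_last" "$not_two" "$not_awk"
```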

Following an address or address range, sed accepts curly braces '{...}' so several commands may be applied to that line or to the lines matched by the address range. On the command line, semicolons ';' separate each instruction and must precede the closing brace.

     sed '/Owner:/{s/yours/mine/g;s/your/my/g;s/you/me/g;}' file
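To see the brace group at work, here is a small sketch with invented input. Only the line matching /Owner:/ is edited, and the order of the three substitutions matters: the longest word is replaced first so that 's/you/me/g' does not damage "your" or "yours".

```shell
# Made-up input: only the first line contains "Owner:".
input='Owner: you said your book is yours
Guest: you said your book is yours'

# The three substitutions run in order, but only on lines matching /Owner:/.
result=$(printf '%s\n' "$input" |
  sed '/Owner:/{s/yours/mine/g;s/your/my/g;s/you/me/g;}')

printf '%s\n' "$result"
# Owner: me said my book is mine
# Guest: you said your book is yours
```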

Range addresses operate differently depending on which version of sed is used (see section 3.4, below). For further information on using sed, consult the references in section 2.3, above.

3.1.1. Regular expressions on the left side of "s///"

All versions of sed support Basic Regular Expressions (BREs). For the syntax of BREs, enter "man ed" at a Unix shell prompt. A technical description of BREs from IEEE POSIX 1003.1-2001 and the Single UNIX Specification Version 3 is available online at: http://www.opengroup.org/onlinepubs/007904975/basedefs/xbd_chap09.html#tag_09_03

Sed normally supports BREs plus '\n' to match a newline in the pattern space, plus '\xREx' as equivalent to '/RE/', where 'x' is any character other than a newline or another backslash.

Some versions of sed support supersets of BREs, or "extended regular expressions", which offer additional metacharacters for increased flexibility. For additional information on extended REs in GNU sed, see sections 3.7 ("GNU/POSIX extensions to regular expressions") and 6.7.3 ("Special syntax in REs"), below.

Though not required by BREs, some versions of sed support \t to represent a TAB, \r for carriage return, \xHH for direct entry of hex codes, and so forth. Other versions of sed do not.

ssed (super-sed) introduced many new features for LHS pattern matching, too many to give here. The complete list is found in section 6.7.3.H ("ssed"), below.

3.1.2. Escape characters on the right side of "s///"

The right-hand side (the replacement part) in "s/find/replace/" is almost always a string literal, with no interpolation of these metacharacters:

       .   ^   $   [   ]   {   }   (   )  ?   +   *

Three things are interpolated: ampersand (&), backreferences, and options for special seds. An ampersand on the RHS is replaced by the entire expression matched on the LHS. There is never any reason to use grouping like this:

       s/\(some-complex-regex\)/one two \1 three/

since you can do this instead:

       s/some-complex-regex/one two & three/

To enter a literal ampersand on the RHS, type '\&'.
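A short illustration of both forms, using made-up input strings:

```shell
# '&' on the RHS stands for whatever the LHS matched.
wrapped=$(echo 'error 42 at line 7' | sed 's/[0-9][0-9]*/<&>/g')

# '\&' yields a literal ampersand instead.
literal=$(echo 'salt and pepper' | sed 's/and/\&/')

printf '%s\n%s\n' "$wrapped" "$literal"
# error <42> at line <7>
# salt & pepper
```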

Grouping and backreferences: All versions of sed support grouping and backreferences on the LHS and backreferences only on the RHS. Grouping allows a series of characters to be collected in a set, indicating the boundaries of the set with \( and \). Then the set can be designated to be repeated a certain number of times:

       \(like this\)*   or   \(like this\)\{5,7\}.

Groups can also be nested "\(like \(this\) is here\)" and may contain any valid RE. Backreferences repeat the contents of a particular group, using a backslash and a digit (1-9) for each corresponding group. In other words, "/\(pom\)\1/" is another way of writing "/pompom/". If groups are nested, backreference numbers are counted by matching \( in strict left to right order. Thus, in /..\(the \(word\)\) \("foo"\)../, \1 refers to "the word", \2 to "word", and \3 to the quoted "foo". Backreferences can be used in the LHS, the RHS, and in normal RE addressing (see section 3.3). For example,

       /\(.\)\1\(.\)\2\(.\)\3/;      # matches "bookkeeper"
       /^\(.\)\(.\)\(.\)\3\2\1$/;    # finds 6-letter palindromes

Seds differ in how they treat invalid backreferences where no corresponding group occurs. To insert a literal ampersand or backslash into the RHS, prefix it with a backslash: \& or \\.
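The sketch below exercises backreferences in all three places: the RHS of a substitution, and two addresses. The word lists are invented for the demonstration.

```shell
# Swap two words: \1 and \2 on the RHS refer to the two groups on the LHS.
swapped=$(echo 'hello world' | sed 's/\(hello\) \(world\)/\2 \1/')

# A backreference inside an address: print only 6-letter palindromes.
palins=$(printf 'redder\nrandom\nhannah\n' |
  sed -n '/^\(.\)\(.\)\(.\)\3\2\1$/p')

# Three doubled letters in a row, as in "bookkeeper".
doubled=$(printf 'bookkeeper\nbookmark\n' |
  sed -n '/\(.\)\1\(.\)\2\(.\)\3/p')

printf '%s\n%s\n%s\n' "$swapped" "$palins" "$doubled"
```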

ssed, sed16, and sedmod permit additional options on the RHS. They all support changing part of the replacement string to upper case (\u or \U), lower case (\l or \L), or to end case conversion (\E). Both sed16 and sedmod support awk-style word references ($1, $2, $3, ...) and $0 to insert the entire line before conversion.

     echo ab ghi | sed16 "s/.*/$0 - \U$2/"   # prints "ab ghi - GHI"

*Note:* This feature of sed16 and sedmod will break sed scripts which put a dollar sign and digit into the RHS. Though this is an unlikely combination, it's worth remembering if you use other people's scripts.

3.1.3. Substitution switches

Standard versions of sed support 4 main flags or switches which may be added to the end of an "s///" command. They are:

       N      - Replace the Nth match of the pattern on the LHS, where
                N is an integer between 1 and 512. If N is omitted,
                the default is to replace the first match only.
       g      - Global replace of all matches to the pattern.
       p      - Print the results to stdout, even if -n switch is used.
       w file - Write the pattern space to 'file' if a replacement was
                done. If the file already exists when the script is
                executed, it is overwritten. During script execution,
                w appends to the file for each match.
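The four standard flags can be tried out as follows. This is only a sketch: the input strings are made up, and the temporary file is created with mktemp, which is assumed to be available on the system.

```shell
line='a-a-a-a'

third=$(echo "$line" | sed 's/a/X/3')     # Nth match only
all=$(echo "$line" | sed 's/a/X/g')       # every match

# With -n, 'p' prints only the lines where a substitution happened.
printed=$(printf 'foo\nbar\n' | sed -n 's/foo/baz/p')

# 'w file' writes each changed pattern space to a file (a temp file here).
tmp=$(mktemp)
printf 'foo 1\nbar\nfoo 2\n' | sed -n "s/foo/baz/w $tmp"
written=$(cat "$tmp")
rm -f "$tmp"

printf '%s\n%s\n%s\n%s\n' "$third" "$all" "$printed" "$written"
```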

GNU sed 3.02 and ssed also offer the /I switch for doing a case-insensitive match. For example,

     echo ONE TWO | gsed "s/one/unos/I"      # prints "unos TWO"

GNU sed 4.x and ssed add the /M switch, to simplify working with multi-line patterns: when it is used, ^ or $ will match BOL or EOL. \` and \' remain available to match the start and end of pattern space, respectively.
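A brief sketch of both switches, assuming that the 'sed' on your PATH is GNU sed 4.x (other seds will probably reject these flags):

```shell
# /I: case-insensitive match (GNU extension).
ci=$(echo 'ONE TWO' | sed 's/one/unos/I')

# /M: with two lines in the pattern space, ^ also matches just after
# the embedded newline, so the second line can be anchored.
ml=$(printf 'alpha\nbeta\n' | sed 'N;s/^beta/BETA/M')

printf '%s\n%s\n' "$ci" "$ml"
```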

ssed supports two more switches, /S and /X, when its Perl mode is used. They are described in detail in section 6.7.3.H, below.

3.1.4. Command-line switches

All versions of sed support two switches, -e and -n. Though sed usually separates multiple commands with semicolons (e.g., "H;d;"), in many versions certain commands cannot be followed by a semicolon command separator. These include :labels, 't', and 'b'. Such commands must come last in a script fragment, with the fragments separated by -e option switches. For example:

     # The 'ta' means jump to label :a if last s/// returns true
     sed -e :a -e '$!N;s/\n=/ /;ta' -e 'P;D' file

The -n switch turns off sed's default behavior of printing every line. With -n, lines are printed only if explicitly told to. In addition, for certain versions of sed, if an external script begins with "#n" as its first two characters, the output is suppressed (exactly as if -n had been entered on the command line). A list of which versions appears in section 6.7.2., below.
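The effect of -n is easiest to see side by side. In this sketch (the three-line input is made up), the same 'p' command is run with and without the switch:

```shell
input='one
two
three'

# Default: every line is printed, and 'p' prints line 2 a second time.
with_default=$(printf '%s\n' "$input" | sed '2p')

# With -n: only the explicit 'p' produces output.
with_n=$(printf '%s\n' "$input" | sed -n '2p')

printf '%s\n---\n%s\n' "$with_default" "$with_n"
```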

GNU sed 4.x and ssed support additional switches. -l (lowercase L), followed by a number, lets you adjust the default length of the 'l' and 'L' commands (note that these implementations of sed also support an argument to these commands, to tailor the length separately of each occurrence of the command).

-i activates in-place editing (see section 4.41.1, below). -s treats each file as a separate stream: sed by default joins all the files, so $ represents the last line of the last file; 15 means the 15th line in the joined stream; and /abc/,/def/ might match across files.

When -s is used, however, all addresses refer to single files. For example, $ represents the last line of each input file; 15 means the 15th line of each input file; and /abc/,/def/ will be "reset" (in other words, sed will not execute the commands and start looking for /abc/ again) if a file ends before /def/ has been matched. Note that -i automatically activates this interpretation of addresses.
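The difference can be demonstrated with two throwaway files. This sketch assumes GNU sed 4.x (for -s) and the mktemp utility; the file names are arbitrary.

```shell
dir=$(mktemp -d)
printf 'a1\na2\n' > "$dir/a.txt"
printf 'b1\nb2\n' > "$dir/b.txt"

# Default: the files form one stream, so '$' is the last line of b.txt.
joined=$(sed -n '$p' "$dir/a.txt" "$dir/b.txt")

# With -s: '$' matches the last line of each file.
separate=$(sed -s -n '$p' "$dir/a.txt" "$dir/b.txt")

rm -rf "$dir"
printf '%s\n---\n%s\n' "$joined" "$separate"
```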

3.2. Common one-line sed scripts

A separate document of over 70 handy "one-line" sed commands is available at

       http://sed.sourceforge.net/sed1line.txt

Here are several common sed commands for one-line use. MS-DOS users should replace single quotes ('...') with double quotes ("...") in these examples. A specific filename usually follows the script, though the input may also come via piping or redirection.

   # Double space a file
   sed G file

   # Triple space a file
   sed 'G;G' file

   # Under UNIX: convert DOS newlines (CR/LF) to Unix format
   sed 's/.$//' file    # assumes that all lines end with CR/LF
   sed 's/^M$//' file   # in bash/tcsh, press Ctrl-V then Ctrl-M

   # Under DOS: convert Unix newlines (LF) to DOS format
   sed 's/$//' file                     # method 1
   sed -n p file                        # method 2

   # Delete leading whitespace (spaces/tabs) from front of each line
   # (this aligns all text flush left). '^t' represents a true tab
   # character. Under bash or tcsh, press Ctrl-V then Ctrl-I.
   sed 's/^[ ^t]*//' file

   # Delete trailing whitespace (spaces/tabs) from end of each line
   sed 's/[ ^t]*$//' file               # see note on '^t', above

   # Delete BOTH leading and trailing whitespace from each line
   sed 's/^[ ^t]*//;s/[ ^t]*$//' file   # see note on '^t', above

   # Substitute "foo" with "bar" on each line
   sed 's/foo/bar/' file        # replaces only 1st instance in a line
   sed 's/foo/bar/4' file       # replaces only 4th instance in a line
   sed 's/foo/bar/g' file       # replaces ALL instances within a line

   # Substitute "foo" with "bar" ONLY for lines which contain "baz"
   sed '/baz/s/foo/bar/g' file

   # Delete all CONSECUTIVE blank lines from file except the first.
   # This method also deletes all blank lines from top and end of file.
   # (emulates "cat -s")
   sed '/./,/^$/!d' file       # this allows 0 blanks at top, 1 at EOF
   sed '/^$/N;/\n$/D' file     # this allows 1 blank at top, 0 at EOF

   # Delete all leading blank lines at top of file (only).
   sed '/./,$!d' file

   # Delete all trailing blank lines at end of file (only).
   sed -e :a -e '/^\n*$/{$d;N;};/\n$/ba' file

   # If a line ends with a backslash, join the next line to it.
   sed -e :a -e '/\\$/N; s/\\\n//; ta' file

   # If a line begins with an equal sign, append it to the previous
   # line (and replace the "=" with a single space).
   sed -e :a -e '$!N;s/\n=/ /;ta' -e 'P;D' file
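Three of the less obvious one-liners above are traced here on tiny made-up inputs, so the effect of the N, D, and 't' commands can be observed directly. The expected results were worked out for GNU sed; other versions should agree but were not checked.

```shell
# Join continuation lines that end with a backslash.
joined=$(printf 'first \\\nsecond\nthird\n' |
  sed -e :a -e '/\\$/N; s/\\\n//; ta')

# Squeeze runs of blank lines down to one (like "cat -s").
squeezed=$(printf 'a\n\n\n\nb\n' | sed '/^$/N;/\n$/D')

# Append lines that begin with '=' to the previous line.
appended=$(printf 'key\n=value\nnext\n' |
  sed -e :a -e '$!N;s/\n=/ /;ta' -e 'P;D')

printf '%s\n---\n%s\n---\n%s\n' "$joined" "$squeezed" "$appended"
```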

3.3. Addressing and address ranges

Sed commands may have an optional "address" or "address range" prefix. If there is no address or address range given, then the command is applied to all the lines of the input file or text stream. Three commands cannot take an address prefix:

  • labels, used to branch or jump within the script
  • the close brace, '}', which ends the '{' "command"
  • the '#' comment character, also technically a "command"

An address can be a line number (such as 1, 5, 37, etc.), a regular expression (written in the form /RE/ or \xREx where 'x' is any character other than '\' and RE is the regular expression), or the dollar sign ($), representing the last line of the file. An exclamation mark (!) after an address or address range will apply the command to every line EXCEPT the ones named by the address. A null regex ("//") will be replaced by the last regex which was used. Also, some seds do not support \xREx as regex delimiters.

     5d               # delete line 5 only
     5!d              # delete every line except line 5
     /RE/s/LHS/RHS/g  # substitute only if RE occurs on the line
     /^$/b label      # if the line is blank, branch to ':label'
     /./!b label      # ... another way to write the same command
     \%.%!b label     # ... yet another way to write this command
     $!N              # on all lines but the last, get the Next line

Note that an embedded newline can be represented in an address by the symbol \n, but this syntax is needed only if the script puts 2 or more lines into the pattern space via the N, G, or other commands. The \n symbol does not match the newline at an end-of-line because when sed reads each line into the pattern space for processing, it strips off the trailing newline, processes the line, and adds a newline back when printing the line to standard output. To match the end-of-line, use the '$' metacharacter, as follows:

     /tape$/       # matches the word 'tape' at the end of a line
     /tape$deck/   # matches the word 'tape$deck' with a literal '$'
     /tape\ndeck/  # matches 'tape' and 'deck' with a newline between
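The point about \n can be verified with a two-line input (invented here). The same address matches nothing when lines are read one at a time, and matches once N has joined them:

```shell
input='tape
deck'

# One line at a time, the pattern space never contains a newline,
# so this address matches nothing.
single=$(printf '%s\n' "$input" | sed -n '/tape\ndeck/p')

# After N joins the two lines, the embedded newline can be matched.
paired=$(printf '%s\n' "$input" | sed -n 'N;/tape\ndeck/p')

# '$' anchors at the end of the line instead.
at_end=$(printf 'red tape\ntape deck\n' | sed -n '/tape$/p')

printf '[%s]\n[%s]\n[%s]\n' "$single" "$paired" "$at_end"
```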

The following sed commands usually accept only a single address. All other commands (except labels, '}', and '#') accept both single addresses and address ranges.

     =       print to stdout the line number of the current line
     a       after printing the current line, append "text" to stdout
     i       before printing the current line, insert "text" to stdout
     q       quit after the current line is matched
     r file  prints contents of "file" to stdout after line is matched

Note that we said "usually." If you need to apply the '=', 'a', 'i', or 'r' commands to each and every line within an address range, this behavior can be coerced by the use of braces. Thus, "1,9=" is an invalid command, but "1,9{=;}" will print each line number followed by its line for the first 9 lines (and then print the rest of the file normally).
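A scaled-down version of the brace trick, with a range of 2 lines instead of 9 and a made-up three-line input, plus the single-address 'q' command for comparison:

```shell
# '=' prints the line number on its own line, before the line itself.
numbered=$(printf 'a\nb\nc\n' | sed '1,2{=;}')

# 'q' takes a single address: quit after printing line 2.
first_two=$(printf 'a\nb\nc\n' | sed '2q')

printf '%s\n---\n%s\n' "$numbered" "$first_two"
```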

Address ranges occur in the form

       <address1>,<address2>    or    <address1>,<address2>!

where the address can be a line number or a standard /regex/. <address2> can also be a dollar sign, indicating the end of file. Under GNU sed 3.02+, ssed, and sed15+, <address2> may also be a notation of the form +num, indicating the next num lines after <address1> is matched.

Address ranges are:

(1) Inclusive. The range "/From here/,/eternity/" matches all the lines containing "From here" up to and including the line containing "eternity". It will not stop on the line just prior to "eternity". (If you don't like this, see section 4.24.)

(2) Plenary. They always match full lines, not just parts of lines. In other words, a command to change or delete an address range will change or delete whole lines; it won't stop in the middle of a line.

(3) Multi-linear. Address ranges normally match 2 lines or more. The second address will never match the same line the first address did; therefore a valid address range always spans at least two lines, with these exceptions which match only one line:

  • if the first address matches the last line of the file
  • if using the syntax "/RE/,3" and /RE/ occurs only once in the file at line 3 or below
  • if using HHsed v1.5. See section 3.4.

(4) Minimalist. In address ranges with /regex/ as <address2>, the range "/foo/,/bar/" will stop at the first "bar" it finds, provided that "bar" occurs on a line below "foo". If the word "bar" occurs on several lines below the word "foo", the range will match all the lines from the first "foo" up to the first "bar". It will not continue hopping ahead to find more "bar"s. In other words, address ranges are not "greedy," like regular expressions.

(5) Repeating. An address range will try to match more than one block of lines in a file. However, the blocks cannot nest. In addition, a second match will not "take" the last line of the previous block. For example, given the following text,

       start
       stop  start
       stop

the sed command '/start/,/stop/d' will only delete the first two lines. It will not delete all 3 lines.

(6) Relentless. If the address range finds a "start" match but doesn't find a "stop", it will match every line from "start" to the end of the file. Thus, beware of the following behaviors:

     /RE1/,/RE2/  # If /RE2/ is not found, matches from /RE1/ to the
                  # end-of-file.

     20,/RE/      # If /RE/ is not found, matches from line 20 to the
                  # end-of-file.

     /RE/,30      # If /RE/ occurs any time after line 30, each
                  # occurrence will be matched in sed15+, sedmod, and
                  # GNU sed v3.02+. GNU sed v2.05 and 1.18 will match
                  # from the 2nd occurrence of /RE/ to the end-of-file.
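Several of the range properties listed above can be confirmed with a few throwaway inputs. In this sketch the 'd' command is used so that whatever survives is, by definition, outside the matched range.

```shell
# Repeating, but no overlap: line 2 ends the first block, so its
# "start" cannot open a second one. Only line 3 survives.
repeating=$(printf 'start\nstop  start\nstop\n' | sed '/start/,/stop/d')

# Minimalist: the range ends at the first "bar" after "foo".
minimal=$(printf 'foo\nbar\nx\nbar\n' | sed '/foo/,/bar/d')

# Relentless: with no "bar" at all, the range runs to end-of-file.
relentless=$(printf 'keep\nfoo\n3\n4\n' | sed '/foo/,/bar/d')

printf '%s\n---\n%s\n---\n%s\n' "$repeating" "$minimal" "$relentless"
```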

If these behaviors seem strange, remember that they occur because sed does not look "ahead" in the file. Doing so would stop sed from being a stream editor and have adverse effects on its efficiency. If these behaviors are undesirable, they can be circumvented or corrected by the use of nested testing within braces. The following scripts work under GNU sed 3.02:

     # Execute your_commands on range "/RE1/,/RE2/", but if /RE2/ is
     # not found, do nothing.
     /RE1/{:a;N;/RE2/!ba;your_commands;}

     # Execute your_commands on range "20,/RE/", but if /RE/ is not
     # found, do nothing.
     20{:a;N;/RE/!ba;your_commands;}

As a side note, once we've used N to "slurp" lines together to test for the ending expression, the pattern space will have gathered many lines (possibly thousands) together and concatenated them as a single expression, with the \n sequence marking line breaks. The REs within the pattern space may have to be modified (e.g., you must write '/\nStart/' instead of '/^Start/' and '/[^\n]*/' instead of '/.*/') and other standard sed commands will be unavailable or difficult to use.

     # Execute your_commands on range "/RE/,30", but if /RE/ occurs
     # on line 31 or later, do not match it.
     1,30{/RE/,$ your_commands;}
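Here is a runnable sketch of the first workaround, with BEGIN and END standing in for RE1 and RE2 and 's/\n/+/g' standing in for your_commands (all three are placeholders chosen for the demonstration). It assumes GNU sed, which prints the pattern space when N runs out of input; other seds may discard it instead.

```shell
script='/BEGIN/{:a;N;/END/!ba;s/\n/+/g;}'

# END is present: the whole block is collected, then edited as one unit.
found=$(printf 'a\nBEGIN\nb\nEND\nc\n' | sed "$script")

# END is missing: N runs out of input and GNU sed prints the pattern
# space unchanged, so the substitution never runs.
missing=$(printf 'a\nBEGIN\nb\n' | sed "$script")

printf '%s\n---\n%s\n' "$found" "$missing"
```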

For related suggestions on using address ranges, see sections 4.2, 4.15, and 4.19 of this FAQ. Also, note the following section.

3.4. Address ranges in GNU sed and HHsed

(1) GNU sed 3.02+, ssed, and sed15+ all support address ranges like:

       /regex/,+5

which match /regex/ plus the next 5 lines (or EOF, whichever comes first).

(2) GNU sed v3.02.80 (and above) and ssed support address ranges of:

       0,/regex/

as a special case to permit matching /regex/ if it occurs on the first line. This syntax permits a range expression that matches every line from the top of the file to the first instance of /regex/, even if /regex/ is on the first line.

(3) HHsed (sed15) has an exceptional way of implementing

       /regex1/,/regex2/

If /regex1/ and /regex2/ both occur on the same line, HHsed will match that single line. In other words, an address range block can consist of just one line. HHsed will then look for the next occurrence of /regex1/ to begin the block again.

Every other version of sed (including sed16) requires 2 lines to match an address range, and thus /regex1/ and /regex2/ cannot successfully match just one line. See also the comments at section 7.9.4, below.

(4) BEGIN~STEP selection: ssed and GNU sed (v2.05 and above) offer a form of addressing called "BEGIN~STEP selection". This is not a range address, which selects an inclusive block of consecutive lines from /start/ to /finish/. But I think it seems to belong here.

Given an expression of the form "M~N", where M and N are integers, GNU sed and ssed will select every Nth line, beginning at line M. (With gsed v2.05, M had to be less than N, but this restriction is no longer necessary). Both M and N may equal 0 ("0~0" selects every line). These examples illustrate the syntax:

     sed '1~3d' file      # delete every 3rd line, starting with line 1
                          # deletes lines 1, 4, 7, 10, 13, 16, ...

     sed '0~3d' file      # deletes lines 3, 6, 9, 12, 15, 18, ...

     sed -n '2~5p' file   # print every 5th line, starting with line 2
                          # prints lines 2, 7, 12, 17, 22, 27, ...

(5) Finally, GNU sed v2.05 has a bug in range addressing (see section 7.5), which was fixed in the higher versions.
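The GNU-specific address forms in this section can be tried as follows. The sketch assumes GNU sed 4.x on the PATH; the inputs are invented.

```shell
numbers=$(printf '%s\n' 1 2 3 4 5 6 7 8 9)

# addr,+N: the matching line plus the next 2 lines.
plus=$(printf '%s\n' "$numbers" | sed -n '/3/,+2p')

# 0,/regex/ can end on line 1; 1,/regex/ cannot, because the second
# address is first tested on line 2.
zero=$(printf 'foo\nx\nfoo\n' | sed '0,/foo/d')
one=$(printf 'foo\nx\nfoo\n' | sed '1,/foo/d')

# first~step: every 3rd line, starting with line 2.
step=$(printf '%s\n' "$numbers" | sed -n '2~3p')

printf '%s\n---\n%s\n---\n[%s]\n---\n%s\n' "$plus" "$zero" "$one" "$step"
```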

3.5. Debugging sed scripts

The following two debuggers should make it easier to understand how sed scripts operate. They can save hours of grief when trying to determine the problems with a sed script.

(1) sd (sed debugger), by Brian Hiles

This debugger runs under a Unix shell, is powerful, and is easy to use. sd has conditional breakpoints and spypoints of the pattern space and hold space, on any scope defined by regex match and/or script line number. It can be semi-automated, can save diagnostic reports, and shows potential problems with a sed script before it tries to execute it. The script is robust and requires the Unix shell utilities plus the Bourne shell or Korn shell to execute.

       http://sed.sourceforge.net/grabbag/scripts/sd.ksh.txt (2003)
       http://sed.sourceforge.net/grabbag/scripts/sd.sh.txt  (1998)

(2) sedsed, by Aurelio Jargas

This debugger requires Python to run it, and it uses your own version of sed, whatever that may be. It displays the current input line, the pattern space, and the hold space, before and after each sed command is executed.

       http://sedsed.sourceforge.net

3.6. Notes about s2p, the sed-to-perl translator

s2p (sed to perl) is a Perl program to convert sed scripts into the Perl programming language; it is included with many versions of Perl. These problems have been found when using s2p:

(1) Doesn't recognize the semicolon properly after s/// commands.

       s/foo/bar/g;

(2) Doesn't trim trailing whitespace after s/// commands. Even lone trailing spaces, without comments, produce an error.

(3) Doesn't handle multiple commands within braces. E.g.,

       1,4{=;G;}

will produce Perl code that omits the "G" command. In fact, every command after the first one is dropped from the Perl output script, which will also contain mismatched braces.

3.7. GNU/POSIX extensions to regular expressions

GNU sed supports "character classes" in addition to regular character sets, such as [0-9A-F]. Like regular character sets, character classes represent any single character within a set.

"Character classes are a new feature introduced in the POSIX standard. A character class is a special notation for describing lists of characters that have a specific attribute, but where the actual characters themselves can vary from country to country and/or from character set to character set. For example, the notion of what is an alphabetic character differs in the USA and in France." [quoted from the docs for GNU awk v3.1.0.]

Though character classes don't generally conserve space on the line, they help make scripts portable for international use. The equivalent character sets for U.S. users follow:

     [[:alnum:]]  - [A-Za-z0-9]     Alphanumeric characters
     [[:alpha:]]  - [A-Za-z]        Alphabetic characters
     [[:blank:]]  - [ \x09]         Space or tab characters only
     [[:cntrl:]]  - [\x00-\x1F\x7F] Control characters
     [[:digit:]]  - [0-9]           Numeric characters
     [[:graph:]]  - [!-~]           Printable and visible characters
     [[:lower:]]  - [a-z]           Lower-case alphabetic characters
     [[:print:]]  - [ -~]           Printable (non-Control) characters
     [[:punct:]]  - [!-/:-@[-`{-~]  Punctuation characters
     [[:space:]]  - [ \t\n\v\f\r]  All whitespace chars
     [[:upper:]]  - [A-Z]           Upper-case alphabetic characters
     [[:xdigit:]] - [0-9a-fA-F]     Hexadecimal digit characters

Note that [[:graph:]] does not match the space " ", but [[:print:]] does. Some character classes may (or may not) match characters in the high ASCII range (ASCII 128-255 or 0x80-0xFF), depending on which C library was used to compile sed. For non-English languages, [[:alpha:]] and other classes may also match high ASCII characters.
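A few character classes in use, on made-up input. Note the doubled brackets: the class name, with its own brackets, goes inside a bracket expression.

```shell
# One or more digits, written with a POSIX class instead of [0-9].
digits=$(echo 'abc 123 DEF' | sed 's/[[:digit:]][[:digit:]]*/N/')

# Upper-case letters only.
upper=$(echo 'abc 123 DEF' | sed 's/[[:upper:]]/x/g')

# Strip leading blanks (spaces or tabs) without typing a literal tab.
trimmed=$(printf ' \t  indented\n' | sed 's/^[[:blank:]]*//')

printf '%s\n%s\n%s\n' "$digits" "$upper" "$trimmed"
```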

