Go Recipes - Strings

Strings

These recipes offer some tips for handling Go strings. The standard library has some excellent packages for this, most notably the strings package, but these recipes talk you through some of the subtleties of Go strings.

For other recipes see here.

Strings one character at a time

Summary

If you just want to go through the string from start to end then you can use a simple range loop. This will present the characters one at a time, along with the byte index at which each one starts, as follows:

s := "Pi: π Pie: tasty"

for i, r := range s {
	fmt.Printf("%2d: %c\n", i, r)
}
fmt.Println()

If you want random access to the runes (rune is Go's name for a Unicode code point) then you need to convert the string to a slice of runes first.

s := "Pi: π"
r := []rune(s)
fmt.Printf("%c\n", r[4])

Discussion

Let's establish what we mean by 'character'. In Go a string is a sequence of bytes, but a byte is not what we are talking about. By 'character' we mean a Unicode code point or, as Go calls it, a rune.

This is an important distinction but one which native speakers of English may never be aware of. Go encodes runes using UTF-8, which needs only a single byte for each of the 128 ASCII characters, so in text that is purely ASCII each character is exactly one byte and the difference never shows. Let's look at an example with non-ASCII characters.

s := "Pi: π"
fmt.Printf("s: %q\n", s)
fmt.Println("len(       s ):", len(s))
fmt.Println("len([]rune(s)):", len([]rune(s)))

The string has the Greek letter pi as its 5th character. This will print the string and show it as having length 6 as a string but only 5 as a slice of runes.

What's happening? There are 6 bytes being used to encode the 5 characters (runes). Let's look inside the string. We'll add a period at the end of the string to illustrate the difference.

s := "Pi: π."

for i, r := range s {
	fmt.Printf("[%d]%c 0x%[2]x %08[2]b  ", i, r)
}
fmt.Println()

for i, r := range []rune(s) {
	fmt.Printf("[%d]%c 0x%[2]x %08[2]b  ", i, r)
}
fmt.Println()

When you run this you will see almost exactly the same results for each loop, but the index values have a gap in them for the loop over the string (the period is at index 6, not 5) and no gap for the loop over the slice of runes. So let's look at what's happening in the string.

Note that the Printf statements each use an explicit argument index (such as %[1]x) to print the same value three times: as a character, in hexadecimal and in binary.

s := "Pi: π"

fmt.Printf("       s [4]%c 0x%[1]x %08[1]b\n", s[4])
fmt.Printf("       s [5]%c 0x%[1]x %08[1]b\n", s[5])
fmt.Printf("[]rune(s)[4]%c 0x%[1]x %08[1]b\n", []rune(s)[4])

When you run this to examine the contents of the string at byte indexes 4 and 5 you'll see that s[4] is shown as Ï and s[5] as a non-printing character (PAD). This is the UTF-8 encoding at work: these are the two bytes (0xcf and 0x80) which together form the UTF-8-encoded rune for the Greek letter pi, each printed as if it were a character in its own right.
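The byte-level view above can also be explored with the standard unicode/utf8 package. The following is a minimal sketch, not part of the original recipe; the helper name encodedBytes is made up for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodedBytes (a hypothetical helper) returns the UTF-8 bytes
// that encode a single rune.
func encodedBytes(r rune) []byte {
	buf := make([]byte, utf8.UTFMax) // large enough for any rune
	n := utf8.EncodeRune(buf, r)
	return buf[:n]
}

func main() {
	s := "Pi: π"

	// The rune for pi occupies two bytes in the string.
	fmt.Printf("% x\n", encodedBytes('π'))

	// DecodeRuneInString reads one whole rune starting at a byte
	// offset and reports how many bytes it consumed.
	r, size := utf8.DecodeRuneInString(s[4:])
	fmt.Printf("%c uses %d bytes\n", r, size)

	// RuneCountInString counts runes without building a rune slice.
	fmt.Println(len(s), utf8.RuneCountInString(s))
}
```

Decoding from byte offset 4 works here because that is where the rune starts; decoding from offset 5 would land in the middle of the encoding and yield utf8.RuneError.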

Substrings

Go offers several ways of extracting substrings but you need to understand strings and how they are encoded to know which is best.

Summary

To get the last character of a string, a simple approach is to convert it into a slice of runes and then take the last element of that slice as follows:

s := "Pi: π"
r := []rune(s)
fmt.Printf("r[end] = %c\n", r[len(r)-1])

To get a substring of the first two characters we can use standard slicing operations as follows:

s := "Pi: π"
r := []rune(s)
fmt.Printf("r[:2] = %s\n", string(r[:2]))

Note that we have to convert the slice of runes back into a string in order to print it.

Discussion

You can take a substring in a number of ways but you need to be aware of the distinction between bytes and runes. A string behaves like a read-only slice of bytes and the same slicing techniques can be applied to it, but the indexes count bytes. If a character is not represented by a single byte then you will need to know how many bytes it occupies or else (safer) convert the string into a slice of runes and work with that.
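If you would rather stay with byte offsets, the unicode/utf8 package and a range loop can find the rune boundaries for you. This is a sketch under stated assumptions: the helper names lastRune and firstN are invented for this example, and firstN assumes n is not negative.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// lastRune (hypothetical helper) returns the final rune of s without
// building a rune slice. For an empty string it returns utf8.RuneError.
func lastRune(s string) rune {
	r, _ := utf8.DecodeLastRuneInString(s)
	return r
}

// firstN (hypothetical helper) returns the prefix of s holding at most
// n runes. It assumes n >= 0.
func firstN(s string, n int) string {
	count := 0
	for i := range s { // i is the byte index where each rune starts
		if count == n {
			return s[:i]
		}
		count++
	}
	return s // s has n runes or fewer
}

func main() {
	fmt.Printf("%c\n", lastRune("Pi: π"))
	fmt.Println(firstN("Pi: π", 2))
	fmt.Println(firstN("πr² is the area", 3))
}
```

Both helpers slice the original string, so no copy of the whole string is made.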

What are the downsides of this? There is a cost to constructing the rune slice. The slice of runes takes up to four times as much space as the string, since each byte in an ASCII string will be held as a 4-byte rune in the slice. If the string has a lot of non-ASCII characters then the inflation will be less: for alphabets whose letters take two bytes in UTF-8 (Greek, Cyrillic, Hebrew and Arabic, for example) the slice of runes takes twice as much space, strings of Chinese characters will increase in size by a third and strings of 4-byte emojis will suffer no increase in size.

There is also the runtime cost of constructing the rune slice out of the string. If you are just doing this occasionally then you can ignore the cost, but if you are doing it in some high-performance loop you should be aware that converting strings to slices of runes is not free. Also, be aware that strings with multi-byte runes will take longer to convert to a rune slice than strings holding just single-byte runes (ASCII characters). The more multi-byte runes the longer it takes, but there is an initial cost as soon as there is even a single multi-byte rune.
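The space figures can be checked with a few lines of code. This sketch compares only the character data and ignores the small fixed headers of the string and the slice; the helper name sizes and the sample strings are chosen for illustration.

```go
package main

import "fmt"

// sizes (hypothetical helper) reports the bytes held by s and the
// bytes held by the elements of []rune(s). A rune is an int32, so
// each element takes 4 bytes.
func sizes(s string) (strBytes, runeBytes int) {
	return len(s), 4 * len([]rune(s))
}

func main() {
	// ASCII, Greek, Chinese and an emoji, in that order.
	for _, s := range []string{"tasty", "αβγ", "你好", "😀"} {
		sb, rb := sizes(s)
		fmt.Printf("%q: string %d bytes, rune slice %d bytes\n", s, sb, rb)
	}
}
```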

References

See the standard packages unicode/utf8, strings and regexp, and the Go blog entry by Rob Pike, Strings, bytes, runes and characters in Go.

Splitting a string into words

Summary

The simplest way is to use the Fields function from the strings package to divide the string up around whitespace, as follows:

words := strings.Fields("The quality of mercy")
fmt.Printf("%q\n", words)

Alternatively you can use the strings.Split function:

words := strings.Split("The quality of mercy", " ")
fmt.Printf("%q\n", words)

Finally, you can use the regexp package, like this:

re := regexp.MustCompile(`\s+`)
words := re.Split("The quality of mercy", -1)
fmt.Printf("%q\n", words)

Note that you must pass the Split method a final parameter of -1 so that it will return all the substrings.

Discussion

If you have run these code fragments in gosh you will notice that they all produce the same results, so what's the difference? To understand this let's take a look at how these approaches work with a different string. Let's try a string with multiple spaces and some non-space whitespace. First with the strings.Fields version:

words := strings.Fields("The quality of  mercy\nis not strained ")
fmt.Printf("%q\n", words)

Now with strings.Split:

words := strings.Split("The quality of  mercy\nis not strained ", " ")
fmt.Printf("%q\n", words)

And using the regexp version:

re := regexp.MustCompile(`\s+`)
words := re.Split("The quality of  mercy\nis not strained ", -1)
fmt.Printf("%q\n", words)

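If you run these you'll see three different results. The strings.Fields version yields only non-empty strings separated by one or more whitespace characters. The strings.Split version returns every string (possibly empty) that lies between single space characters, so the double space produces an empty string and the newline is not treated as a separator. The regexp Split version returns strings separated by one or more whitespace characters, but the trailing space still leaves an empty string at the end.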

The disadvantages with the regexp approach are that it's slower, there's a little more code needed (an extra line), you need to understand regular expressions and the Split method is a bit more complicated to use - it's like the SplitN function from the strings package. So why is it presented here? Regular expressions are very powerful and can do a lot more than this so they are introduced here to make you aware of them.
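To illustrate the comparison with SplitN, here is a small sketch of how the final parameter of the regexp Split method behaves. The names spaces and splitWords are made up for this example.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// spaces matches a run of one or more whitespace characters.
var spaces = regexp.MustCompile(`\s+`)

// splitWords (hypothetical helper) splits s around runs of whitespace.
// It returns at most n substrings when n > 0, all of them when n < 0
// and nil when n == 0.
func splitWords(s string, n int) []string {
	return spaces.Split(s, n)
}

func main() {
	s := "The quality of mercy"

	fmt.Printf("%q\n", splitWords(s, -1)) // every word
	fmt.Printf("%q\n", splitWords(s, 2))  // first word, then the unsplit rest

	// strings.SplitN takes the same kind of limit.
	fmt.Printf("%q\n", strings.SplitN(s, " ", 2))
}
```

Compiling the expression once at package level, as done here, avoids repeating that work on every call.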

The strings package also has a FieldsFunc function that allows you to provide a function that will categorise the characters as separators or not (strings.Fields uses this if the string contains non-ASCII characters).
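As a sketch of FieldsFunc in use, the following treats anything that is not a letter or a digit as a separator, so punctuation is dropped along with the whitespace. The helper name wordsOf and the choice of separator rule are assumptions made for this example.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// wordsOf (hypothetical helper) splits s into runs of letters and
// digits, treating every other rune as a separator.
func wordsOf(s string) []string {
	return strings.FieldsFunc(s, func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

func main() {
	fmt.Printf("%q\n", wordsOf("The quality of mercy, is not strained."))
}
```

Note that this rule also splits at apostrophes, so a word such as "it's" would come back as two fields; a different function can be supplied if that matters.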

References

See the standard packages strings and regexp.

Author: Nick Wells

Created: 2021-06-22 Tue 14:19
