Previously we relied heavily on regexp to find and capture all
indentation whitespace, and then to strip away the indentation shared
across all lines. This was reasonably fast. However, I wanted to see if I
could make it faster by manually iterating over the input. Turns out
doing so makes it roughly 15 to 20 times faster.
The code is a lot more complicated, but I'll attempt to break it
down. There are three main phases to it:
1. Iterate over every character of the input to locate all
   line-feed (\n) characters, storing their indexes in an integer slice.
2. Iterate over the list of line-feed indexes, and for each line-feed,
   scan forward until a non-whitespace character is found, counting how
   many whitespace characters we encountered directly after the
   line-feed. If the number is lower than our previously lowest number
   of leading whitespace characters, store that as the new lowest
   number.
3. Now that we know the lowest number of leading whitespace characters
   common across every line of the input, we can iterate over the list
   of line-feed indexes again, this time to build the final output by
   reading all characters from the line-feed index + whitespace number
   until the next line-feed character, or end of input.
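The three phases above can be sketched roughly as follows. This is a simplified illustration, not the actual implementation: the name `undentSketch` is hypothetical, a virtual line-feed at index -1 stands in for the start of the first line, only spaces and tabs count as indentation, and whitespace-only lines are skipped when looking for the lowest indent. All of those are assumptions made for the sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// undentSketch is a hypothetical, simplified illustration of the three
// phases described above. It is not the library's real code.
func undentSketch(s string) string {
	if len(s) == 0 {
		return ""
	}

	// Phase 1: record the index of every line-feed. The -1 entry is a
	// virtual line-feed marking the start of the first line (assumption).
	lfs := []int{-1}
	for i := 0; i < len(s); i++ {
		if s[i] == '\n' {
			lfs = append(lfs, i)
		}
	}

	// Phase 2: find the lowest number of leading spaces/tabs directly
	// after any line-feed. Whitespace-only lines are skipped (assumption).
	minIndent := -1
	for _, lf := range lfs {
		j := lf + 1
		for j < len(s) && (s[j] == ' ' || s[j] == '\t') {
			j++
		}
		if j == len(s) || s[j] == '\n' {
			continue // blank or whitespace-only line
		}
		if n := j - lf - 1; minIndent == -1 || n < minIndent {
			minIndent = n
		}
	}
	if minIndent < 0 {
		minIndent = 0
	}

	// Phase 3: build the output by copying each line from just past its
	// line-feed plus the shared indent, up to the next line-feed or end.
	var b strings.Builder
	b.Grow(len(s))
	for i, lf := range lfs {
		start := lf + 1 + minIndent
		end := len(s)
		if i+1 < len(lfs) {
			end = lfs[i+1]
		}
		if lf >= 0 {
			b.WriteByte('\n')
		}
		if start < end { // lines shorter than the indent become empty
			b.WriteString(s[start:end])
		}
	}
	return b.String()
}

func main() {
	in := "\t\thello\n\t\t\tworld\n\t\tend"
	fmt.Printf("%q\n", undentSketch(in)) // "hello\n\tworld\nend"
}
```

Working on byte indexes avoids running a regexp over the input and keeps the work to three linear passes, which is presumably where most of the speed-up comes from.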
Overall this approach yields a 15-20x speed improvement over the old
method.
Benchmarks, before:
goos: darwin
goarch: amd64
pkg: github.com/jimeh/undent
cpu: Intel(R) Core(TM) i7-1068NG7 CPU @ 2.30GHz
BenchmarkBytes/empty-8 78280611 15.18 ns/op
BenchmarkBytes/single-line-8 2361297 515.1 ns/op
BenchmarkBytes/single-line_indented-8 317440 3618 ns/op
BenchmarkBytes/multi-line-8 630370 1920 ns/op
BenchmarkBytes/multi-line_space_indented-8 156266 7664 ns/op
BenchmarkBytes/multi-line_space_indented_without_any_leading_line-breaks-8 155672 8168 ns/op
BenchmarkBytes/multi-line_space_indented_with_leading_line-breaks-8 144655 8165 ns/op
BenchmarkBytes/multi-line_tab_indented-8 206425 5462 ns/op
BenchmarkBytes/multi-line_tab_indented_without_any_leading_line-breaks-8 223620 5542 ns/op
BenchmarkBytes/multi-line_tab_indented_with_leading_line-breaks-8 208132 5857 ns/op
BenchmarkBytes/multi-line_tab_indented_with_tabs_and_spaces_after_indent-8 199480 5687 ns/op
BenchmarkBytes/multi-line_space_indented_with_blank_lines-8 148402 8072 ns/op
BenchmarkBytes/multi-line_tab_indented_with_blank_lines-8 200929 5691 ns/op
BenchmarkBytes/multi-line_space_indented_with_random_indentation-8 197412 6515 ns/op
BenchmarkBytes/multi-line_tab_indented_with_random_indentation-8 281493 4272 ns/op
BenchmarkBytes/long_block_of_text-8 9894 115752 ns/op
BenchmarkString/empty-8 100000000 12.75 ns/op
BenchmarkString/single-line-8 2224165 529.0 ns/op
BenchmarkString/single-line_indented-8 314088 3784 ns/op
BenchmarkString/multi-line-8 645804 1968 ns/op
BenchmarkString/multi-line_space_indented-8 149310 8103 ns/op
BenchmarkString/multi-line_space_indented_without_any_leading_line-breaks-8 145390 8496 ns/op
BenchmarkString/multi-line_space_indented_with_leading_line-breaks-8 145579 8161 ns/op
BenchmarkString/multi-line_tab_indented-8 223596 5487 ns/op
BenchmarkString/multi-line_tab_indented_without_any_leading_line-breaks-8 214842 5641 ns/op
BenchmarkString/multi-line_tab_indented_with_leading_line-breaks-8 209067 5685 ns/op
BenchmarkString/multi-line_tab_indented_with_tabs_and_spaces_after_indent-8 210307 5584 ns/op
BenchmarkString/multi-line_space_indented_with_blank_lines-8 133948 9280 ns/op
BenchmarkString/multi-line_tab_indented_with_blank_lines-8 178296 5769 ns/op
BenchmarkString/multi-line_space_indented_with_random_indentation-8 206030 6222 ns/op
BenchmarkString/multi-line_tab_indented_with_random_indentation-8 236450 4259 ns/op
BenchmarkString/long_block_of_text-8 10000 113065 ns/op
PASS
ok github.com/jimeh/undent 44.800s
Benchmarks, after:
goos: darwin
goarch: amd64
pkg: github.com/jimeh/undent
cpu: Intel(R) Core(TM) i7-1068NG7 CPU @ 2.30GHz
BenchmarkBytes/empty-8 596493562 2.074 ns/op
BenchmarkBytes/single-line-8 20044598 60.64 ns/op
BenchmarkBytes/single-line_indented-8 12449749 84.43 ns/op
BenchmarkBytes/multi-line-8 5086376 232.3 ns/op
BenchmarkBytes/multi-line_space_indented-8 3077774 400.4 ns/op
BenchmarkBytes/multi-line_space_indented_without_any_leading_line-breaks-8 3011881 386.6 ns/op
BenchmarkBytes/multi-line_space_indented_with_leading_line-breaks-8 3034299 402.9 ns/op
BenchmarkBytes/multi-line_tab_indented-8 4500271 266.2 ns/op
BenchmarkBytes/multi-line_tab_indented_without_any_leading_line-breaks-8 4355886 277.5 ns/op
BenchmarkBytes/multi-line_tab_indented_with_leading_line-breaks-8 3758012 289.5 ns/op
BenchmarkBytes/multi-line_tab_indented_with_tabs_and_spaces_after_indent-8 4425787 271.9 ns/op
BenchmarkBytes/multi-line_space_indented_with_blank_lines-8 3035809 412.2 ns/op
BenchmarkBytes/multi-line_tab_indented_with_blank_lines-8 3771512 334.2 ns/op
BenchmarkBytes/multi-line_space_indented_with_random_indentation-8 4461404 275.6 ns/op
BenchmarkBytes/multi-line_tab_indented_with_random_indentation-8 6960343 174.6 ns/op
BenchmarkBytes/long_block_of_text-8 315788 3776 ns/op
BenchmarkString/empty-8 338024905 3.761 ns/op
BenchmarkString/single-line-8 20067831 59.28 ns/op
BenchmarkString/single-line_indented-8 13826002 88.16 ns/op
BenchmarkString/multi-line-8 4451938 261.6 ns/op
BenchmarkString/multi-line_space_indented-8 2911797 411.1 ns/op
BenchmarkString/multi-line_space_indented_without_any_leading_line-breaks-8 2699631 416.5 ns/op
BenchmarkString/multi-line_space_indented_with_leading_line-breaks-8 2737174 436.3 ns/op
BenchmarkString/multi-line_tab_indented-8 4208000 304.6 ns/op
BenchmarkString/multi-line_tab_indented_without_any_leading_line-breaks-8 4029422 295.8 ns/op
BenchmarkString/multi-line_tab_indented_with_leading_line-breaks-8 3929960 310.3 ns/op
BenchmarkString/multi-line_tab_indented_with_tabs_and_spaces_after_indent-8 3978992 292.5 ns/op
BenchmarkString/multi-line_space_indented_with_blank_lines-8 2829766 428.5 ns/op
BenchmarkString/multi-line_tab_indented_with_blank_lines-8 3788185 304.8 ns/op
BenchmarkString/multi-line_space_indented_with_random_indentation-8 4104337 279.4 ns/op
BenchmarkString/multi-line_tab_indented_with_random_indentation-8 7092417 177.4 ns/op
BenchmarkString/long_block_of_text-8 283140 4398 ns/op
PASS
ok github.com/jimeh/undent 47.252s
The old method signature was just nonsensical, as you would always be
providing indented values via a string literal. So it makes much more
sense to have all methods accept a string argument, and then return
different types.
This also allows use of a `Bytesf` method.
This is technically a breaking change, but I'm classifying it as a
bugfix because the old method signature was basically useless.
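To illustrate the shape of the API after this change, here is a small self-contained sketch. The function names match the ones mentioned here, but the exact signatures and the simplified `strings`-based undent logic are assumptions made for the example, not the package's real code. In particular, undenting the format string before formatting in `Bytesf` is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// String is a stand-in that strips the indentation shared by all
// non-blank lines. It only exists to illustrate the API shape.
func String(s string) string {
	lines := strings.Split(s, "\n")
	indent := -1
	for _, l := range lines {
		trimmed := strings.TrimLeft(l, " \t")
		if trimmed == "" {
			continue // ignore blank lines when measuring indent
		}
		if n := len(l) - len(trimmed); indent == -1 || n < indent {
			indent = n
		}
	}
	if indent < 0 {
		indent = 0
	}
	for i, l := range lines {
		if len(l) > indent {
			lines[i] = l[indent:]
		} else {
			lines[i] = "" // empty or whitespace-only line
		}
	}
	return strings.Join(lines, "\n")
}

// Bytes accepts a string, like String, and only differs in return type.
func Bytes(s string) []byte {
	return []byte(String(s))
}

// Bytesf undents the format string first, then formats it (assumption).
func Bytesf(format string, a ...interface{}) []byte {
	return []byte(fmt.Sprintf(String(format), a...))
}

func main() {
	b := Bytesf("\t\tname: %s\n\t\tport: %d", "api", 8080)
	fmt.Printf("%q\n", b) // "name: api\nport: 8080"
}
```

With every function taking a string, the indented literal can be passed directly, and a format-style variant such as `Bytesf` falls out naturally.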
This effectively cleans up what I consider syntactic sugar required
by Go's syntax. For example:
    str := undent.String(`
        hello
        world`,
    )
In the above example I would consider the initial line-break after the
opening back-tick (`) character syntactic sugar, and hence it should be
discarded from the final undented string.
However, if the string literal contains more than one initial line-break,
only the first one should be removed, as the rest would intentionally be
part of the input.
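The rule itself can be sketched in a few lines. The helper name `trimLeadingLineBreak` is hypothetical and only illustrates the "remove at most one leading line-break" behaviour. How this step is combined with the undenting itself is not shown here.

```go
package main

import (
	"fmt"
	"strings"
)

// trimLeadingLineBreak removes a single leading line-feed, if present.
// strings.TrimPrefix strips the prefix at most once, so any further
// leading line-breaks are kept as part of the input.
func trimLeadingLineBreak(s string) string {
	return strings.TrimPrefix(s, "\n")
}

func main() {
	fmt.Printf("%q\n", trimLeadingLineBreak("\nhello\nworld"))   // "hello\nworld"
	fmt.Printf("%q\n", trimLeadingLineBreak("\n\nhello\nworld")) // "\nhello\nworld"
	fmt.Printf("%q\n", trimLeadingLineBreak("hello\nworld"))     // "hello\nworld"
}
```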