Overview

Package scanner provides a scanner and tokenizer for UTF-8-encoded text. It
takes an io.Reader providing the source, which then can be tokenized through
repeated calls to the Scan function. For compatibility with existing tools, the
NUL character is not allowed. If the first character in the source is a UTF-8
encoded byte order mark (BOM), it is discarded.

By default, a Scanner skips white space and Go comments and recognizes all
literals as defined by the Go language specification. It may be customized to
recognize only a subset of those literals and to recognize different identifier
and white space characters.
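
The scan loop described above can be sketched as follows. This is a minimal illustration, assuming the source is an in-memory string wrapped in strings.NewReader; the helper name tokenize is hypothetical and not part of the package.

```go
package main

import (
	"fmt"
	"strings"
	"text/scanner"
)

// tokenize scans src with the default settings (Mode == GoTokens) and
// returns the source text of every token up to EOF.
func tokenize(src string) []string {
	var s scanner.Scanner
	s.Init(strings.NewReader(src)) // any io.Reader can serve as the source
	var toks []string
	for tok := s.Scan(); tok != scanner.EOF; tok = s.Scan() {
		toks = append(toks, s.TokenText())
	}
	return toks
}

func main() {
	// The trailing comment is skipped by default; ':' and '=' come back
	// as individual characters because they are not literals.
	fmt.Printf("%q\n", tokenize("x := 3.14 // pi"))
}
```

Note that operators are not tokens in their own right: the scanner only knows identifiers, literals, and comments, and returns everything else one character at a time.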


Constants

    const (
        ScanIdents     = 1 << -Ident
        ScanInts       = 1 << -Int
        ScanFloats     = 1 << -Float // includes Ints
        ScanChars      = 1 << -Char
        ScanStrings    = 1 << -String
        ScanRawStrings = 1 << -RawString
        ScanComments   = 1 << -Comment
        SkipComments   = 1 << -skipComment // if set with ScanComments, comments become white space
        GoTokens       = ScanIdents | ScanFloats | ScanChars | ScanStrings | ScanRawStrings | ScanComments | SkipComments
    )

Predefined mode bits to control recognition of tokens. For instance, to
configure a Scanner such that it only recognizes (Go) identifiers, integers, and
skips comments, set the Scanner’s Mode field to:

    ScanIdents | ScanInts | SkipComments

With the exception of comments, which are skipped if SkipComments is set,
unrecognized tokens are not ignored. Instead, the scanner simply returns the
respective individual characters (or possibly sub-tokens). For instance, if the
mode is ScanIdents (not ScanStrings), the string "foo" is scanned as the token
sequence '"' Ident '"'.
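
To make the effect of the mode bits concrete, here is a small sketch that scans the same input under two modes. The helper name kinds is an assumption made for this illustration; it relies only on Scan, the Mode field, and TokenString.

```go
package main

import (
	"fmt"
	"strings"
	"text/scanner"
)

// kinds scans src under the given mode and returns TokenString for each
// result, so token classes and plain characters are easy to tell apart.
func kinds(src string, mode uint) []string {
	var s scanner.Scanner
	s.Init(strings.NewReader(src))
	s.Mode = mode // set after Init, which resets Mode to GoTokens
	var out []string
	for tok := s.Scan(); tok != scanner.EOF; tok = s.Scan() {
		out = append(out, scanner.TokenString(tok))
	}
	return out
}

func main() {
	src := `"foo"`
	fmt.Println(kinds(src, scanner.GoTokens))   // a single String token
	fmt.Println(kinds(src, scanner.ScanIdents)) // quote character, Ident, quote character
}
```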

    const (
        EOF = -(iota + 1)
        Ident
        Int
        Float
        Char
        String
        RawString
        Comment
    )

The result of Scan is one of these tokens or a Unicode character.

    const GoWhitespace = 1<<'\t' | 1<<'\n' | 1<<'\r' | 1<<' '

GoWhitespace is the default value for the Scanner’s Whitespace field. Its value
selects Go’s white space characters.
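
As an illustration of customizing the Whitespace field, the sketch below clears the newline bit so that Scan returns '\n' instead of skipping it, which allows tokens to be grouped by line. The helper name splitLines is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
	"text/scanner"
)

// splitLines groups the tokens of src by line. It clears the '\n' bit in
// Whitespace so that Scan returns newlines instead of skipping them.
func splitLines(src string) [][]string {
	var s scanner.Scanner
	s.Init(strings.NewReader(src)) // Whitespace is now GoWhitespace
	s.Whitespace &^= 1 << '\n'     // tabs, carriage returns, and spaces are still skipped

	lines := make([][]string, 1)
	for tok := s.Scan(); tok != scanner.EOF; tok = s.Scan() {
		if tok == '\n' {
			lines = append(lines, nil) // start collecting the next line
			continue
		}
		last := len(lines) - 1
		lines[last] = append(lines[last], s.TokenText())
	}
	return lines
}

func main() {
	fmt.Printf("%q\n", splitLines("a b\nc"))
}
```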

func TokenString

    func TokenString(tok rune) string

TokenString returns a printable string for a token or Unicode character.

type Position

    type Position struct {
        Filename string // filename, if any
        Offset   int    // byte offset, starting at 0
        Line     int    // line number, starting at 1
        Column   int    // column number, starting at 1 (character count per line)
    }

A source position is represented by a Position value. A position is valid if
Line > 0.

func (*Position) IsValid

    func (pos *Position) IsValid() bool

IsValid reports whether the position is valid.

func (Position) String

    func (pos Position) String() string
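
A short sketch of how Position values might be used for diagnostics follows. It assumes a Filename is assigned by the caller (the Scanner never sets it), and the helper name tokenPositions is made up for this example.

```go
package main

import (
	"fmt"
	"strings"
	"text/scanner"
)

// tokenPositions returns one "position token" entry per token in src,
// using name as the Filename reported in each position.
func tokenPositions(name, src string) []string {
	var s scanner.Scanner
	s.Init(strings.NewReader(src))
	s.Filename = name // promoted from the embedded Position; left untouched by the Scanner
	var out []string
	for tok := s.Scan(); tok != scanner.EOF; tok = s.Scan() {
		// s.Position is the start of the token that Scan just returned.
		out = append(out, s.Position.String()+" "+s.TokenText())
	}
	return out
}

func main() {
	for _, line := range tokenPositions("example", "a\n  b") {
		fmt.Println(line)
	}
}
```

Lines and columns are 1-based, so the token b on the second line, after two spaces, is reported at line 2, column 3.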

type Scanner

    type Scanner struct {

        // Error is called for each error encountered. If no Error
        // function is set, the error is reported to os.Stderr.
        Error func(s *Scanner, msg string)

        // ErrorCount is incremented by one for each error encountered.
        ErrorCount int

        // The Mode field controls which tokens are recognized. For instance,
        // to recognize Ints, set the ScanInts bit in Mode. The field may be
        // changed at any time.
        Mode uint

        // The Whitespace field controls which characters are recognized
        // as white space. To recognize a character ch <= ' ' as white space,
        // set the ch'th bit in Whitespace (the Scanner's behavior is undefined
        // for values ch > ' '). The field may be changed at any time.
        Whitespace uint64

        // IsIdentRune is a predicate controlling the characters accepted
        // as the ith rune in an identifier. The set of valid characters
        // must not intersect with the set of white space characters.
        // If no IsIdentRune function is set, regular Go identifiers are
        // accepted instead. The field may be changed at any time.
        IsIdentRune func(ch rune, i int) bool

        // Start position of most recently scanned token; set by Scan.
        // Calling Init or Next invalidates the position (Line == 0).
        // The Filename field is always left untouched by the Scanner.
        // If an error is reported (via Error) and Position is invalid,
        // the scanner is not inside a token. Call Pos to obtain an error
        // position in that case, or to obtain the position immediately
        // after the most recently scanned token.
        Position
        // contains filtered or unexported fields
    }

A Scanner implements reading of Unicode characters and tokens from an io.Reader.
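
The following sketch shows two of the customization hooks in use: an IsIdentRune predicate that accepts '-' inside identifiers, and an Error handler that collects messages instead of printing them. The helper names scanKebab and collectErrors, and the kebab-case rule itself, are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"strings"
	"text/scanner"
	"unicode"
)

// scanKebab tokenizes src while accepting '-' inside identifiers, so that
// "max-width" is a single Ident instead of three tokens.
func scanKebab(src string) []string {
	var s scanner.Scanner
	s.Init(strings.NewReader(src))
	s.IsIdentRune = func(ch rune, i int) bool {
		// i is the index of ch within the identifier being scanned.
		return ch == '_' || unicode.IsLetter(ch) ||
			i > 0 && (ch == '-' || unicode.IsDigit(ch))
	}
	var toks []string
	for tok := s.Scan(); tok != scanner.EOF; tok = s.Scan() {
		toks = append(toks, s.TokenText())
	}
	return toks
}

// collectErrors scans src and returns the messages passed to Error
// together with the final ErrorCount.
func collectErrors(src string) (msgs []string, count int) {
	var s scanner.Scanner
	s.Init(strings.NewReader(src)) // Init sets Error to nil, so install the handler afterwards
	s.Error = func(s *scanner.Scanner, msg string) {
		pos := s.Position
		if !pos.IsValid() {
			pos = s.Pos() // not inside a token: fall back to the current position
		}
		msgs = append(msgs, fmt.Sprintf("%d:%d: %s", pos.Line, pos.Column, msg))
	}
	for tok := s.Scan(); tok != scanner.EOF; tok = s.Scan() {
		// Tokens are discarded; only the reported errors matter here.
	}
	return msgs, s.ErrorCount
}

func main() {
	fmt.Printf("%q\n", scanKebab("max-width 10"))
	msgs, n := collectErrors(`x = "unterminated`)
	fmt.Println(n, msgs)
}
```

ErrorCount is incremented whether or not an Error function is installed, so it can serve as a cheap success check after scanning.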

func (*Scanner) Init

    func (s *Scanner) Init(src io.Reader) *Scanner

Init initializes a Scanner with a new source and returns s. Error is set to nil,
ErrorCount is set to 0, Mode is set to GoTokens, and Whitespace is set to
GoWhitespace.
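
Because Init resets these fields, any custom configuration has to be applied after each call to Init. The sketch below reuses one Scanner for several sources to show this; the helper names initResets and countIdents are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
	"text/scanner"
)

// initResets reports whether Init restores the default Mode and
// Whitespace even when both fields were changed beforehand.
func initResets() bool {
	var s scanner.Scanner
	s.Mode = 0
	s.Whitespace = 0
	s.Init(strings.NewReader(""))
	return s.Mode == scanner.GoTokens && s.Whitespace == scanner.GoWhitespace
}

// countIdents reuses one Scanner for several sources and returns the
// number of identifiers found in each.
func countIdents(sources ...string) []int {
	var s scanner.Scanner
	counts := make([]int, 0, len(sources))
	for _, src := range sources {
		s.Init(strings.NewReader(src)) // resets Error, ErrorCount, Mode, and Whitespace
		s.Mode = scanner.ScanIdents    // so the custom mode is applied again each time
		n := 0
		for tok := s.Scan(); tok != scanner.EOF; tok = s.Scan() {
			if tok == scanner.Ident {
				n++
			}
		}
		counts = append(counts, n)
	}
	return counts
}

func main() {
	fmt.Println(initResets())
	fmt.Println(countIdents("a 1 b", "x y z"))
}
```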

func (*Scanner) Next

    func (s *Scanner) Next() rune

Next reads and returns the next Unicode character. It returns EOF at the end of
the source. It reports a read error by calling s.Error, if not nil; otherwise it
prints an error message to os.Stderr. Next does not update the Scanner’s
Position field; use Pos() to get the current position.

func (*Scanner) Peek

    func (s *Scanner) Peek() rune

Peek returns the next Unicode character in the source without advancing the
scanner. It returns EOF if the scanner’s position is at the last character of
the source.
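
One way Peek and Next can complement Scan is to assemble multi-character operators, which the scanner otherwise returns one character at a time. The sketch below merges ':' followed by '=' into a single ":=" entry; the helper name mergeAssign is an assumption for this illustration.

```go
package main

import (
	"fmt"
	"strings"
	"text/scanner"
)

// mergeAssign tokenizes src and joins a ':' that is immediately followed
// by '=' into the single operator ":=".
func mergeAssign(src string) []string {
	var s scanner.Scanner
	s.Init(strings.NewReader(src))
	var out []string
	for tok := s.Scan(); tok != scanner.EOF; tok = s.Scan() {
		text := s.TokenText() // read before Next, which stops token text collection
		if tok == ':' && s.Peek() == '=' {
			s.Next() // consume the '=' that Peek reported
			text = ":="
		}
		out = append(out, text)
	}
	return out
}

func main() {
	fmt.Printf("%q\n", mergeAssign("x := 1"))
}
```

Keep in mind that calling Next invalidates the Position field, so any position needed for the merged operator should be saved before the call.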

func (*Scanner) Pos

    func (s *Scanner) Pos() (pos Position)

Pos returns the position of the character immediately after the character or
token returned by the last call to Next or Scan. Use the Scanner’s Position
field for the start position of the most recently scanned token.
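
Combining the Position field with Pos gives the extent of each token. The following sketch records start and end byte offsets, assuming the source has no leading byte order mark; the span type and the helper name tokenSpans are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
	"text/scanner"
)

// span holds the byte offsets of a token: src[Start:End] is its text.
type span struct{ Start, End int }

// tokenSpans returns the byte range covered by every token in src.
func tokenSpans(src string) []span {
	var s scanner.Scanner
	s.Init(strings.NewReader(src))
	var out []span
	for tok := s.Scan(); tok != scanner.EOF; tok = s.Scan() {
		// Position marks where the token starts; Pos marks the
		// character immediately after it.
		out = append(out, span{s.Position.Offset, s.Pos().Offset})
	}
	return out
}

func main() {
	src := "ab cd"
	for _, sp := range tokenSpans(src) {
		fmt.Println(sp.Start, sp.End, src[sp.Start:sp.End])
	}
}
```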

func (*Scanner) Scan

    func (s *Scanner) Scan() rune

Scan reads the next token or Unicode character from source and returns it. It
only recognizes tokens t for which the respective Mode bit (1<<-t) is set. It
returns EOF at the end of the source. It reports scanner errors (read and token
errors) by calling s.Error, if not nil; otherwise it prints an error message to
os.Stderr.

func (*Scanner) TokenText

    func (s *Scanner) TokenText() string

TokenText returns the string corresponding to the most recently scanned token.
It is valid after calling Scan().
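
TokenText yields the token exactly as it appears in the source, so string literals still carry their quotes and escape sequences. A minimal sketch of decoding them with strconv.Unquote follows; the helper name stringValues is an assumption for this example.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"text/scanner"
)

// stringValues returns the decoded value of every string literal in src.
// Tokens other than String and RawString are ignored.
func stringValues(src string) ([]string, error) {
	var s scanner.Scanner
	s.Init(strings.NewReader(src))
	var vals []string
	for tok := s.Scan(); tok != scanner.EOF; tok = s.Scan() {
		if tok != scanner.String && tok != scanner.RawString {
			continue
		}
		// TokenText includes the surrounding quotes; Unquote strips
		// them and interprets any escape sequences.
		v, err := strconv.Unquote(s.TokenText())
		if err != nil {
			return nil, fmt.Errorf("%s: %w", s.Position, err)
		}
		vals = append(vals, v)
	}
	return vals, nil
}

func main() {
	vals, err := stringValues(`say "a\tb"`)
	fmt.Printf("%q %v\n", vals, err)
}
```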