An HTML file
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Title</title>
</head>
<body>
<div id="test">
<ul>
<li>0</li>
<li>1</li>
<li>2</li>
<li>3</li>
</ul>
<div>
<p>Hexilee</p>
<p>20</p>
<p>true</p>
</div>
<p>Hello World!</p>
<p>10</p>
<p>3.14</p>
<p>true</p>
</div>
</body>
</html>
Read it
AllTypeHTML, _ := ioutil.ReadFile("testHTML/all-type.html")
If we want to parse it and get the values we want, like the following structs, how should we do it?
package example
type (
PartTypesStruct struct {
Slice []int
Struct TestUser
String string
Int int
Float64 float64
Bool bool
}
TestUser struct {
Name string
Age uint
LikeLemon bool
}
)
The traditional way is to do it by hand, like this:
package example
import (
"bytes"
"github.com/PuerkitoBio/goquery"
"strconv"
)
func parsePartTypesLogically() (PartTypesStruct, error) {
doc, err := goquery.NewDocumentFromReader(bytes.NewReader(AllTypeHTML))
partTypes := PartTypesStruct{}
if err == nil {
selection := doc.Find(partTypes.Root())
partTypes.Slice = make([]int, 0)
selection.Find(`ul > li`).Each(func(i int, selection *goquery.Selection) {
Int, parseErr := strconv.Atoi(selection.Text())
if parseErr != nil {
err = parseErr
}
partTypes.Slice = append(partTypes.Slice, Int)
})
if err == nil {
partTypes.Struct.Name = selection.Find(`#test > div > p:nth-child(1)`).Text()
Int, parseErr := strconv.Atoi(selection.Find(`#test > div > p:nth-child(2)`).Text())
if err = parseErr; err == nil {
partTypes.Struct.Age = uint(Int)
Bool, parseErr := strconv.ParseBool(selection.Find(`#test > div > p:nth-child(3)`).Text())
if err = parseErr; err == nil {
partTypes.Struct.LikeLemon = Bool
String := selection.Find(`#test > p:nth-child(3)`).Text()
Int, parseErr := strconv.Atoi(selection.Find(`#test > p:nth-child(4)`).Text())
if err = parseErr; err != nil {
return partTypes, err
}
Float64, parseErr := strconv.ParseFloat(selection.Find(`#test > p:nth-child(5)`).Text(), 64)
if err = parseErr; err != nil {
return partTypes, err
}
Bool, parseErr := strconv.ParseBool(selection.Find(`#test > p:nth-child(6)`).Text())
if err = parseErr; err != nil {
return partTypes, err
}
partTypes.String = String
partTypes.Int = Int
partTypes.Float64 = Float64
partTypes.Bool = Bool
}
}
}
}
return partTypes, err
}
It works well enough, but it is tedious to write. Now you can do it like this instead:
package main
import (
"encoding/json"
"fmt"
"github.com/Hexilee/unhtml"
"io/ioutil"
)
type (
PartTypesStruct struct {
Slice []int `html:"ul > li"`
Struct TestUser `html:"#test > div"`
String string `html:"#test > p:nth-child(3)"`
Int int `html:"#test > p:nth-child(4)"`
Float64 float64 `html:"#test > p:nth-child(5)"`
Bool bool `html:"#test > p:nth-child(6)"`
}
TestUser struct {
Name string `html:"p:nth-child(1)"`
Age uint `html:"p:nth-child(2)"`
LikeLemon bool `html:"p:nth-child(3)"`
}
)
func (PartTypesStruct) Root() string {
return "#test"
}
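func main() {
	AllTypeHTML, err := ioutil.ReadFile("testHTML/all-type.html")
	if err != nil {
		panic(err)
	}
	allTypes := PartTypesStruct{}
	if err := unhtml.Unmarshal(AllTypeHTML, &allTypes); err != nil {
		panic(err)
	}
	result, _ := json.Marshal(&allTypes)
	fmt.Println(string(result))
}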
Result:
{
"Slice": [
0,
1,
2,
3
],
"Struct": {
"Name": "Hexilee",
"Age": 20,
"LikeLemon": true
},
"String": "Hello World!",
"Int": 10,
"Float64": 3.14,
"Bool": true
}
I think it really improves my development efficiency, but what about its performance?
There are two benchmarks:
func BenchmarkUnmarshalPartTypes(b *testing.B) {
assert.NotNil(b, AllTypeHTML)
for i := 0; i < b.N; i++ {
partTypes := PartTypesStruct{}
assert.Nil(b, Unmarshal(AllTypeHTML, &partTypes))
}
}
func BenchmarkParsePartTypesLogically(b *testing.B) {
assert.NotNil(b, AllTypeHTML)
for i := 0; i < b.N; i++ {
_, err := parsePartTypesLogically()
assert.Nil(b, err)
}
}
Test it:
> go test -bench=.
goos: darwin
goarch: amd64
pkg: github.com/Hexilee/unhtml
BenchmarkUnmarshalPartTypes-4 30000 54096 ns/op
BenchmarkParsePartTypesLogically-4 30000 45188 ns/op
PASS
ok github.com/Hexilee/unhtml 4.098s
That is not bad: Unmarshal is about 20% slower here (54096 vs. 45188 ns/op), and the demo HTML is small. In real development with more complicated HTML, the two approaches perform almost the same.
The only API this package exposes is the function:
func Unmarshal(data []byte, v interface{}) error
which has the same signature as Unmarshal in the standard library's json and xml packages. However, you can control how it works through the data types you define in your code.
This package supports all kinds of types in the reflect package except Ptr/Uintptr/Interface/Chan/Func.
The following fields are invalid and will cause an UnmarshalerItemKindError:
type WrongFieldsStruct struct {
Ptr *int
Uintptr uintptr
Interface io.Reader
Chan chan int
Func func()
}
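To check a struct against this rule before unmarshaling, the field kinds can be inspected with the reflect package. The sketch below uses only the standard library; unsupportedFields is a hypothetical helper written for illustration and is not part of unhtml, which performs its own check internally.

```go
package main

import (
	"fmt"
	"io"
	"reflect"
)

// WrongFieldsStruct mirrors the struct shown above.
type WrongFieldsStruct struct {
	Ptr       *int
	Uintptr   uintptr
	Interface io.Reader
	Chan      chan int
	Func      func()
}

// unsupportedFields is a hypothetical helper (not an unhtml API). It
// returns the names of the fields of struct value v whose kind is one of
// the five excluded kinds. v must be a struct value, not a pointer.
func unsupportedFields(v interface{}) []string {
	t := reflect.TypeOf(v)
	var names []string
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		switch f.Type.Kind() {
		case reflect.Ptr, reflect.Uintptr, reflect.Interface, reflect.Chan, reflect.Func:
			names = append(names, f.Name)
		}
	}
	return names
}

func main() {
	fmt.Println(unsupportedFields(WrongFieldsStruct{}))
	// prints [Ptr Uintptr Interface Chan Func]
}
```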
However, when you call the function Unmarshal, you MUST pass a pointer; otherwise you will get an UnmarshaledKindMustBePtrError.
a := 1
// Wrong
Unmarshal([]byte(""), a)
// Right
Unmarshal([]byte(""), &a)
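The pointer requirement exists because a function can only fill in a value it can reach through a pointer. A minimal sketch of such a check, assuming nothing about unhtml's internals: requirePointer and errNotPointer are hypothetical names, and the real package reports UnmarshaledKindMustBePtrError instead.

```go
package main

import (
	"errors"
	"fmt"
	"reflect"
)

// errNotPointer is a stand-in for this sketch only.
var errNotPointer = errors.New("unmarshal target must be a pointer")

// requirePointer is a hypothetical helper: it rejects any target whose
// kind is not a pointer, including a nil interface value.
func requirePointer(v interface{}) error {
	if reflect.ValueOf(v).Kind() != reflect.Ptr {
		return errNotPointer
	}
	return nil
}

func main() {
	a := 1
	fmt.Println(requirePointer(a))  // prints the error
	fmt.Println(requirePointer(&a)) // prints <nil>
}
```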
The Root() string method returns the root selector.
Define it only on the root type, like this:
func (PartTypesStruct) Root() string {
return "#test"
}
Suppose you define it for a field type instead, such as TestUser:
func (TestUser) Root() string {
return "#test"
}
In this case, the selector of that field in PartTypesStruct will be overridden.
type (
PartTypesStruct struct {
...
Struct TestUser `html:"#test > div"`
...
}
)
// what is actually used
type (
PartTypesStruct struct {
...
Struct TestUser `html:"#test"`
...
}
)
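The override described above can be pictured as an interface check: if the field's type has a Root() method, its result wins over the tag selector. This is only a sketch of that behavior under the assumption that it works this way; rooter and effectiveSelector are hypothetical names, not unhtml identifiers.

```go
package main

import "fmt"

// rooter is a hypothetical interface for types that define Root().
type rooter interface {
	Root() string
}

// TestUser defines Root(), so its selector overrides the field tag.
type TestUser struct{}

func (TestUser) Root() string { return "#test" }

// Plain has no Root() method, so the tag selector is kept.
type Plain struct{}

// effectiveSelector returns the Root() selector when the field value's
// type defines one, and the tag selector otherwise.
func effectiveSelector(field interface{}, tagSelector string) string {
	if r, ok := field.(rooter); ok {
		return r.Root()
	}
	return tagSelector
}

func main() {
	fmt.Println(effectiveSelector(TestUser{}, "#test > div")) // prints #test
	fmt.Println(effectiveSelector(Plain{}, "#test > div"))    // prints #test > div
}
```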
This package is based on github.com/PuerkitoBio/goquery and supports standard CSS selectors.
You can define the selector of a field in its tag, like this:
type (
PartTypesStruct struct {
...
Int int `html:"#test > p:nth-child(4)"`
...
}
)
In most cases, this package will find the #test > p:nth-child(4) element and try to parse its innerText as int.
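Parsing text according to the field type can be sketched with reflect and strconv. The helper below, setFromText, is hypothetical and only illustrates the idea for a few basic kinds; it is not how unhtml is necessarily implemented.

```go
package main

import (
	"fmt"
	"reflect"
	"strconv"
)

// setFromText is a hypothetical helper. dst must be a non-nil pointer;
// text is parsed according to the kind of the pointed-to value and then
// stored through the pointer.
func setFromText(dst interface{}, text string) error {
	v := reflect.ValueOf(dst).Elem()
	switch v.Kind() {
	case reflect.String:
		v.SetString(text)
	case reflect.Int, reflect.Int8, reflect.Int16, reflect.Int32, reflect.Int64:
		n, err := strconv.ParseInt(text, 10, v.Type().Bits())
		if err != nil {
			return err
		}
		v.SetInt(n)
	case reflect.Uint, reflect.Uint8, reflect.Uint16, reflect.Uint32, reflect.Uint64:
		n, err := strconv.ParseUint(text, 10, v.Type().Bits())
		if err != nil {
			return err
		}
		v.SetUint(n)
	case reflect.Float32, reflect.Float64:
		f, err := strconv.ParseFloat(text, v.Type().Bits())
		if err != nil {
			return err
		}
		v.SetFloat(f)
	case reflect.Bool:
		b, err := strconv.ParseBool(text)
		if err != nil {
			return err
		}
		v.SetBool(b)
	default:
		return fmt.Errorf("unsupported kind %s", v.Kind())
	}
	return nil
}

func main() {
	var n int
	if err := setFromText(&n, "10"); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(n) // prints 10
}
```

Text that does not match the kind, such as "abc" for an int, makes the helper return the strconv error unchanged.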
However, when the field type is Struct or Slice, things are more complex.
type (
PartTypesStruct struct {
...
Struct TestUser `html:"#test > div"`
...
}
TestUser struct {
Name string `html:"p:nth-child(1)"`
Age uint `html:"p:nth-child(2)"`
LikeLemon bool `html:"p:nth-child(3)"`
}
)
func (PartTypesStruct) Root() string {
return "#test"
}
First, it will call *goquery.Selection.Find("#test"), and we get:
<div id="test">
<ul>
<li>0</li>
<li>1</li>
<li>2</li>
<li>3</li>
</ul>
<div>
<p>Hexilee</p>
<p>20</p>
<p>true</p>
</div>
<p>Hello World!</p>
<p>10</p>
<p>3.14</p>
<p>true</p>
</div>
Then, it will call *goquery.Selection.Find("#test > div"), and we get:
<div>
<p>Hexilee</p>
<p>20</p>
<p>true</p>
</div>
Then, in TestUser, it will call:
*goquery.Selection.Find("p:nth-child(1)") // as Name
*goquery.Selection.Find("p:nth-child(2)") // as Age
*goquery.Selection.Find("p:nth-child(3)") // as LikeLemon
type (
PartTypesStruct struct {
Slice []int `html:"ul > li"`
...
}
)
func (PartTypesStruct) Root() string {
return "#test"
}
As above, we get
<div id="test">
<ul>
<li>0</li>
<li>1</li>
<li>2</li>
<li>3</li>
</ul>
<div>
<p>Hexilee</p>
<p>20</p>
<p>true</p>
</div>
<p>Hello World!</p>
<p>10</p>
<p>3.14</p>
<p>true</p>
</div>
Then it will call *goquery.Selection.Find("ul > li"), and we get:
<li>0</li>
<li>1</li>
<li>2</li>
<li>3</li>
Then, it will call *goquery.Selection.Each(func(int, *goquery.Selection)) to iterate over the list and parse each value into the slice.
This package supports three tags: html, attr and converter.
The html tag provides the CSS selector of this field.
By default, this package regards the innerText of an element as its value:
<a href="https://google.com">Google</a>
type Link struct {
Text string `html:"a"`
}
You will get Text = Google. However, what should we do if we want to get href?
type Link struct {
Href string `html:"a" attr:"href"`
Text string `html:"a"`
}
You will get link.Href == "https://google.com"
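These tags are ordinary Go struct tags, so they can be read with the reflect package. The sketch below shows one way to collect the three tag values per field; fieldTags and readTags are hypothetical names introduced for illustration, not part of unhtml.

```go
package main

import (
	"fmt"
	"reflect"
)

// Link mirrors the struct shown above.
type Link struct {
	Href string `html:"a" attr:"href"`
	Text string `html:"a"`
}

// fieldTags is a hypothetical container for the tag values of one field.
type fieldTags struct {
	Selector  string // value of the html tag
	Attr      string // value of the attr tag, empty if absent
	Converter string // value of the converter tag, empty if absent
}

// readTags is a hypothetical helper. It maps each field name of struct
// value v to its html, attr and converter tag values.
func readTags(v interface{}) map[string]fieldTags {
	t := reflect.TypeOf(v)
	tags := make(map[string]fieldTags, t.NumField())
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		tags[f.Name] = fieldTags{
			Selector:  f.Tag.Get("html"),
			Attr:      f.Tag.Get("attr"),
			Converter: f.Tag.Get("converter"),
		}
	}
	return tags
}

func main() {
	tags := readTags(Link{})
	fmt.Println(tags["Href"].Selector, tags["Href"].Attr) // prints a href
}
```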
Sometimes, you want to process the original data before it is stored:
<p>2018-10-01 00:00:01</p>
You might try to unmarshal it like this:
type Birthday struct {
Time time.Time `html:"p"`
}
func TestConverter(t *testing.T) {
birthday := Birthday{}
assert.Nil(t, Unmarshal([]byte(BirthdayHTML), &birthday))
assert.Equal(t, 2018, birthday.Time.Year())
assert.Equal(t, time.October, birthday.Time.Month())
assert.Equal(t, 1, birthday.Time.Day())
}
This will certainly fail, because nothing defines how to convert a string to time.Time, so unhtml will regard it as a struct.
However, you can use the converter tag:
type Birthday struct {
Time time.Time `html:"p" converter:"StringToTime"`
}
const TimeStandard = `2006-01-02 15:04:05`
func (Birthday) StringToTime(str string) (time.Time, error) {
return time.Parse(TimeStandard, str)
}
func TestConverter(t *testing.T) {
birthday := Birthday{}
assert.Nil(t, Unmarshal([]byte(BirthdayHTML), &birthday))
assert.Equal(t, 2018, birthday.Time.Year())
assert.Equal(t, time.October, birthday.Time.Month())
assert.Equal(t, 1, birthday.Time.Day())
}
Now it works.
A converter MUST have the type
func (inputType) (resultType, error)
resultType MUST be the same as the field type; it can be any type.
inputType MUST NOT violate the type requirements described above (no Ptr, Uintptr, Interface, Chan or Func).
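A converter named in a tag has to be found and called by name at run time. One way to do that with the standard library is reflect's MethodByName, sketched below. callConverter and birthdayYear are hypothetical helpers, and the sketch assumes the method takes a single string and returns (value, error); it is not taken from unhtml's source.

```go
package main

import (
	"fmt"
	"reflect"
	"time"
)

// Birthday and StringToTime mirror the example above.
type Birthday struct {
	Time time.Time `html:"p" converter:"StringToTime"`
}

const TimeStandard = `2006-01-02 15:04:05`

func (Birthday) StringToTime(str string) (time.Time, error) {
	return time.Parse(TimeStandard, str)
}

// callConverter is a hypothetical helper. It looks up the method called
// name on owner, calls it with text, and returns the two results. It
// assumes the method has the shape func(string) (T, error).
func callConverter(owner interface{}, name, text string) (interface{}, error) {
	method := reflect.ValueOf(owner).MethodByName(name)
	if !method.IsValid() {
		return nil, fmt.Errorf("converter %q not found", name)
	}
	out := method.Call([]reflect.Value{reflect.ValueOf(text)})
	if !out[1].IsNil() {
		return nil, out[1].Interface().(error)
	}
	return out[0].Interface(), nil
}

// birthdayYear is a hypothetical wrapper that converts text through
// Birthday.StringToTime and returns the year of the result.
func birthdayYear(text string) (int, error) {
	v, err := callConverter(Birthday{}, "StringToTime", text)
	if err != nil {
		return 0, err
	}
	return v.(time.Time).Year(), nil
}

func main() {
	fmt.Println(birthdayYear("2018-10-01 00:00:01")) // prints 2018 <nil>
}
```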