Auth middleware #23

Closed
wants to merge 9 commits into from
98 changes: 97 additions & 1 deletion README.md
@@ -15,6 +15,7 @@ Fast web scraping
- [Web Interface](#web-interface)
- [API](#api)
- [Healthchecks](#healthchecks)
- [Authorization](#authorization)
- [Database Options](#database-options)
- [Building and Developing](#building-and-developing)
- [Building](#building)
@@ -23,14 +24,17 @@ Fast web scraping
- [Acknowledgements](#acknowledgements)

## Description
`scrape` provides a self-contained, low-to-no-setup tool to grab metadata and text content from web pages. The server provides a REST API to scrape web metadata, with support for batches, using either a direct client or a headless browser option that's useful for pages that need JavaScript to load.

Results are stored, so subsequent fetches of a particular URL are fast. Install the binary, and operate it as a shell command or as a server with a REST API. The default SQLite storage backend is performance-optimized and can store to disk or in memory. MySQL is also supported. Resources are stored with a configurable TTL.

The `scrape` cli tool provides shell access to scraped content via command-line entry or CSV files, and also provides database management functionality. `scrape-server` provides web and API access to content metadata in one-offs or batches.

RSS and Atom feeds are supported via an endpoint in `scrape-server`. Loading a feed returns the parsed results for all item links in the feed.

Authorization via JWT tokens is supported with a configuration option. The companion `scrape-jwt-encode` tool can be used to generate tokens, and to make the secret you need to securely sign and verify them.

The `scrape` and `scrape-server` binaries should be buildable and runnable in any environment where `go` and `SQLite3` (or a `MySQL` server) are present. A docker build is also included. See `make help` for build instructions.

## Output Format
@@ -311,6 +315,98 @@ database runtime info.

This just returns a status `200` with the content `OK`

### Authorization

By default, `scrape` runs without any authorization; all endpoints are open. JWT-based authentication is supported via the `scrape-jwt-encode` tool and a `scrape-server` configuration option. Here's how to enable it:

#### Generate a Secret

(`./build` is the path for executables built locally with `make`; update this path if your binaries are elsewhere.)

Running `scrape-jwt-encode` with the `-make-key` flag will generate a cryptographically random HS256 secret and encode it as Base64.

```
scrape % ./build/scrape-jwt-encode -make-key
Be sure to save this key, as it can't be re-generated:
b4RThFbyMKfQE3+jAjJcR5rjVgVOeA2Ub9eethtX83M=
```

You will need this secret both to generate tokens and to configure the server for authentication. Save it somewhere safe, and don't share it over insecure channels.

If the key is in your environment as `SCRAPE_SIGNING_KEY` it'll be picked up by both the JWT encoding tool and the server.

#### Generate Tokens

Tokens are also generated using `scrape-jwt-encode`. Here's the output of `scrape-jwt-encode -h`:

```
scrape % ./build/scrape-jwt-encode -h

Generates JWT tokens for the scrape service. Also makes the signing key to use for the tokens.

Usage:
-----
scrape-jwt-encode -sub subject [-signing-key key] [-exp expiration] [-aud audience]
scrape-jwt-encode -make-key

-aud string
Audience (recipient) for the key (default "moz")
-exp value
Expiration date for the key, in RFC3339 format. Default is 1 year from now. (default 2025-05-16T11:45:45.842362-04:00)
-make-key
Generate a new signing key
-signing-key value
HS256 key to sign the JWT token
Environment: SCRAPE_SIGNING_KEY (default &[])
-sub string
Subject (holder name) for the key (required)
```

When generating a token, only a subject (`-sub`) is required. Authorization doesn't currently check this value, but it's a good idea to use a unique, identifying value here, since it may be used in the future and may also be logged.

Here's the output from token generation, after putting the key from above into the environment. The claims are printed out for reference, but it's the token at the bottom that you want to share with API consumers.

```
scrape % export SCRAPE_SIGNING_KEY=b4RThFbyMKfQE3+jAjJcR5rjVgVOeA2Ub9eethtX83M=
scrape % ./build/scrape-jwt-encode -sub some_user

Claims:
------
{
"iss": "scrape",
"sub": "some_user",
"aud": [
"moz"
],
"exp": 1747410570,
"iat": 1715874570
}

Token:
-----
eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJzY3JhcGUiLCJzdWIiOiJzb21lX3VzZXIiLCJhdWQiOlsibW96Il0sImV4cCI6MTc0NzQxMDU3MCwiaWF0IjoxNzE1ODc0NTcwfQ.4AapsWfYAK78JhP9AyhupmZAHLeRDyIPlE8ODwwsVRg
```

#### Using Tokens When Making API Requests

When authorization is enabled on the server, API requests must have an `Authorization` header that begins with the string `Bearer ` (with a space) followed by the token.
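
As a minimal sketch, a Go client might attach the token like this. The helper name `newAuthorizedRequest`, the URL, and the token value are placeholders for this example; substitute your server's address, an actual API path, and a real token.

```go
package main

import (
	"fmt"
	"net/http"
)

// newAuthorizedRequest builds a GET request that carries the token as a
// bearer credential. Hypothetical helper for illustration.
func newAuthorizedRequest(url, token string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	// The header value is the literal "Bearer ", then the token.
	req.Header.Set("Authorization", "Bearer "+token)
	return req, nil
}

func main() {
	// Placeholder address and token (assumptions, not scrape defaults).
	req, err := newAuthorizedRequest("http://localhost:8080/", "your.jwt.token")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Header.Get("Authorization"))
	// Send with http.DefaultClient.Do(req) once the URL points at a running server.
}
```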

#### Enabling Authorization In `scrape-server`

To enable authorization on the server, either:

1. Start the server with the `SCRAPE_SIGNING_KEY` environment variable set
2. Pass a `-signing-key [key]` argument to the server when starting up

When enabled, `scrape` rejects API requests with a `401` unless the `Authorization` header carries a token that meets all of the following criteria:

1. It's a valid JWT token
2. The token has been signed with the same signing key that the server is using
3. The token's issuer is `scrape`
4. The token is not expired

Healthcheck paths don't require authorization, and neither does the web interface at the root URL. The test console uses a short-lived token to authorize its requests. It _is_ possible to lift a token from there and use it for a little while; this is a deliberate balance between securing access and making the server easy to exercise. You will also need to reload the test console periodically, or calls made from it will return `401` as well.

## Database Options

`scrape` supports SQLite or MySQL for data storage. Your choice depends on your requirements and environment.
17 changes: 17 additions & 0 deletions cmd/scrape-server/main.go
@@ -19,6 +19,7 @@ import (
	"github.com/efixler/envflags"
	"github.com/efixler/scrape/fetch"
	"github.com/efixler/scrape/fetch/trafilatura"
	"github.com/efixler/scrape/internal/auth"
	"github.com/efixler/scrape/internal/cmd"
	"github.com/efixler/scrape/internal/headless"
	"github.com/efixler/scrape/internal/server"
@@ -34,6 +35,7 @@ const (
var (
	flags      flag.FlagSet
	port       *envflags.Value[int]
	signingKey *envflags.Value[*auth.HMACBase64Key]
	ttl        *envflags.Value[time.Duration]
	userAgent  *envflags.Value[*ua.UserAgent]
	dbFlags    *cmd.DatabaseFlags
@@ -56,13 +58,21 @@ func main() {
		headlessFetcher, _ = trafilatura.Factory(headlessClient)()
	}

	// TODO: Implement options pattern for NewScrapeServer
	ss, _ := server.NewScrapeServer(
		ctx,
		dbFactory,
		defaultFetcherFactory,
		headlessFetcher,
	)

	if sk := *signingKey.Get(); len(sk) > 0 {
		ss.SigningKey = sk
		slog.Info("scrape-server authorization via JWT is enabled")
	} else {
		slog.Info("scrape-server authorization is disabled, running in open access mode")
	}

	mux, err := server.InitMux(ss)
	if err != nil {
		slog.Error("scrape-server error initializing the server's mux", "error", err)
@@ -106,6 +116,13 @@ func init() {
	port = envflags.NewInt("PORT", DefaultPort)
	port.AddTo(&flags, "port", "Port to run the server on")

	signingKey = envflags.NewText("SIGNING_KEY", &auth.HMACBase64Key{})
	signingKey.AddTo(
		&flags,
		"signing-key",
		"Base64 encoded HS256 key to verify JWT tokens. Required for JWT auth, and enables JWT auth if set.",
	)

	ttl = envflags.NewDuration("TTL", resource.DefaultTTL)
	ttl.AddTo(&flags, "ttl", "TTL for fetched resources")

7 changes: 7 additions & 0 deletions internal/auth/jwt.go
@@ -62,6 +62,13 @@ func (c Claims) Sign(key HMACBase64Key) (string, error) {
	return c.Token().SignedString([]byte(key))
}

func ExpiresIn(d time.Duration) option {
	return func(c *Claims) error {
		c.ExpiresAt = jwt.NewNumericDate(time.Now().Add(d))
		return nil
	}
}

func ExpiresAt(t time.Time) option {
	return func(c *Claims) error {
		if t.Before(time.Now()) {
58 changes: 58 additions & 0 deletions internal/auth/middleware.go
@@ -0,0 +1,58 @@
package auth

import (
	"context"
	"fmt"
	"net/http"
	"strings"
)

// ClaimsAuthorizer performs an additional check on verified claims.
// A non-nil error rejects the request.
type ClaimsAuthorizer func(claims *Claims) error

//type AuthHandler func

// JWTAuthMiddleware checks the Authorization header for a JWT token and verifies it
// using the provided key. The token is always validated against the HMAC key, the
// issuer, and the Claims.Validate function.
//
// The ClaimsAuthorizer functions, if any, are called in order. If any of them returns
// an error, the request is rejected with a 401 Unauthorized status and the error
// message is written to the response body.
//
// If the token is valid, the claims are added to the request context under contextKey.
func JWTAuthMiddleware(
	key HMACBase64Key,
	contextKey any,
	cc ...ClaimsAuthorizer,
) func(http.HandlerFunc) http.HandlerFunc {
	return func(next http.HandlerFunc) http.HandlerFunc {
		return func(w http.ResponseWriter, r *http.Request) {
			_, token, found := strings.Cut(r.Header.Get("Authorization"), " ")
			if !found || (token == "") {
				http.Error(w, "No Authorization Passed", http.StatusUnauthorized)
				return
			}
			claims, err := VerifyToken(key, strings.TrimSpace(token))
			if err != nil {
				http.Error(
					w,
					fmt.Sprintf("Invalid Token %q: %v", token, err),
					http.StatusUnauthorized,
				)
				return
			}
			for _, c := range cc {
				if err := c(claims); err != nil {
					http.Error(
						w,
						fmt.Sprintf("Not authorized for this request: %v", err),
						http.StatusUnauthorized,
					)
					return
				}
			}
			r = r.WithContext(context.WithValue(r.Context(), contextKey, claims))
			next(w, r)
		}
	}
}
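
One observation on the header parsing in this middleware: it discards everything before the first space, so the scheme itself is not checked and a header such as `Basic <token>` is handled the same way as `Bearer <token>`. If stricter parsing is wanted, one possible variant is sketched below. The helper name `bearerToken` is hypothetical and is not part of this PR.

```go
package main

import (
	"fmt"
	"strings"
)

// bearerToken extracts the token from an Authorization header value,
// requiring the "Bearer" scheme (matched case-insensitively).
// Hypothetical helper for illustration.
func bearerToken(header string) (string, bool) {
	scheme, token, found := strings.Cut(header, " ")
	token = strings.TrimSpace(token)
	if !found || token == "" || !strings.EqualFold(scheme, "Bearer") {
		return "", false
	}
	return token, true
}

func main() {
	for _, h := range []string{"Bearer abc.def.ghi", "Basic abc.def.ghi", "Bearer", ""} {
		token, ok := bearerToken(h)
		fmt.Printf("%q -> %q, %v\n", h, token, ok)
	}
}
```

A middleware could call a helper like this first and return `401` when the second result is false, then hand the token to the verifier unchanged.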
98 changes: 98 additions & 0 deletions internal/auth/middleware_test.go
@@ -0,0 +1,98 @@
package auth

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"testing"
	"time"
)

func TestJWTAuthMiddleWare(t *testing.T) {
	t.Parallel()
	realKey := MustNewHS256SigningKey()
	c, _ := NewClaims(
		ExpiresAt(time.Now().Add(24*time.Hour)),
		WithSubject("subject"),
		WithAudience("audience"),
	)
	token, _ := c.Sign(realKey)
	tests := []struct {
		name         string
		key          HMACBase64Key
		authHeader   string
		extra        []ClaimsAuthorizer
		expectStatus int
	}{
		{
			name:         "no auth",
			key:          realKey,
			authHeader:   "",
			expectStatus: http.StatusUnauthorized,
		},
		{
			name:         "valid auth",
			key:          realKey,
			authHeader:   fmt.Sprintf("Bearer %s", token),
			expectStatus: http.StatusOK,
		},
		{
			name:         "No token in header",
			key:          realKey,
			authHeader:   "Bearer",
			expectStatus: http.StatusUnauthorized,
		},
		{
			name:         "Garbage token in header",
			key:          realKey,
			authHeader:   "Bearer llkllKjLKDLD.kkajhdakjsdhakdjh.ajkshdakjshd",
			expectStatus: http.StatusUnauthorized,
		},
		{
			name:         "Only token in header",
			key:          realKey,
			authHeader:   token,
			expectStatus: http.StatusUnauthorized,
		},
		{
			name:         "Key mismatch",
			key:          MustNewHS256SigningKey(),
			authHeader:   fmt.Sprintf("Bearer %s", token),
			expectStatus: http.StatusUnauthorized,
		},
		{
			name:         "With extra authorizer, passthru",
			key:          realKey,
			authHeader:   fmt.Sprintf("Bearer %s", token),
			extra:        []ClaimsAuthorizer{func(c *Claims) error { return nil }},
			expectStatus: http.StatusOK,
		},
		{
			name:         "With extra authorizer, reject",
			key:          realKey,
			authHeader:   fmt.Sprintf("Bearer %s", token),
			extra:        []ClaimsAuthorizer{func(c *Claims) error { return fmt.Errorf("nope") }},
			expectStatus: http.StatusUnauthorized,
		},
	}
	type contextKey struct{}
	for _, tt := range tests {
		req := httptest.NewRequest("GET", "http://example.com", nil)
		recorder := httptest.NewRecorder()
		req.Header.Set("Authorization", tt.authHeader)
		m := JWTAuthMiddleware(tt.key, contextKey{}, tt.extra...)

		m(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			claims, ok := r.Context().Value(contextKey{}).(*Claims)
			if !ok {
				t.Fatalf("[%s] JWTAuthMiddleware, expected claims, got %v", tt.name, claims)
			}
		}))(recorder, req)
		response := recorder.Result()
		if response.StatusCode != tt.expectStatus {
			body, _ := io.ReadAll(response.Body)
			t.Fatalf("[%s] JWTAuthMiddleware, expected status %d, got %d (%s)", tt.name, tt.expectStatus, response.StatusCode, body)
		}
	}
}
3 changes: 1 addition & 2 deletions internal/server/middleware.go
@@ -16,8 +16,7 @@ type middleware func(http.HandlerFunc) http.HandlerFunc

type payloadKey struct{}

// type fetcherKey struct{}

// Prepend the middlewares to the handler in the order they are provided.
func Chain(h http.HandlerFunc, m ...middleware) http.HandlerFunc {
	if len(m) == 0 {
		return h
5 changes: 5 additions & 0 deletions internal/server/middleware_test.go
@@ -257,6 +257,11 @@ func TestIsJson(t *testing.T) {
			content:  "multipart/form-data; charset=utf-8",
			expected: false,
		},
		{
			name:     "empty",
			content:  "",
			expected: true,
		},
	}
	for _, tt := range tests {
		req := httptest.NewRequest("POST", "http://example.com", nil)
5 changes: 4 additions & 1 deletion internal/server/pages/index.html
@@ -55,7 +55,10 @@
event.preventDefault();
const action = form.action;
const headers = new Headers();
headers.append('Authorization', 'Bearer not_implemented_yet');
const token = '{{AuthToken}}'
if (token) {
    headers.append('Authorization', `Bearer ${token}`);
}
const formData = new FormData(form);

try {