changefeed doesn't pick up IMPORT INTO #43784

Closed
dbist opened this issue Jan 7, 2020 · 12 comments
Labels
A-cdc Change Data Capture T-cdc

Comments

@dbist
Contributor

dbist commented Jan 7, 2020

Describe the problem
I recently truncated a table, created a changefeed, and imported data with the IMPORT INTO command, but the changefeed did not emit the imported rows.

To Reproduce
Create a changefeed on a table and perform IMPORT INTO.

If possible, provide steps to reproduce the behavior:

  1. Create a changefeed.
  2. Perform IMPORT INTO.
  3. Observe that no JSON is produced for the imported rows.
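A minimal sketch of the repro in SQL (the table definition, sink URI, and file path are illustrative, not from the original report):

```sql
-- Assumes an existing table; the schema here is hypothetical.
CREATE TABLE test (id INT PRIMARY KEY, name STRING);

-- Start a changefeed on the table, emitting to a sink.
CREATE CHANGEFEED FOR TABLE test INTO 'kafka://localhost:9092';

-- Import rows into the existing table.
IMPORT INTO test (id, name)
    CSV DATA ('nodelocal://1/test.csv');

-- On 19.2.2, the rows added by IMPORT INTO were never emitted to the sink.
```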

Expected behavior
Imported data should be picked up by the changefeed.


Environment:

  • CockroachDB version 19.2.2
  • Server OS: OSX
  • Client app: cockroach sql


@piyush-singh piyush-singh added the A-cdc Change Data Capture label Jan 7, 2020
@ajwerner
Contributor

ajwerner commented Jan 8, 2020

This is real. @danhhz do you have thoughts about the right behavior here?

@danhhz
Contributor

danhhz commented Jan 8, 2020

I suppose our two options are (1) emit the SSTables created by the IMPORT INTO, or (2) disconnect the changefeed with an error. Either way, I think RangeFeed will have to do something with AddSSTable requests.

@ajwerner
Contributor

> Either way, I think RangeFeed will have to do something with AddSSTables

Maybe eventually this is true but I don't think that'd be my first approach to fixing this issue. As of 20.1 we take tables offline when performing an IMPORT INTO. I think it'd be reasonable to instead perform a time-bound backfill when detecting a table coming back online. I think that'd be a smaller and less invasive change than emitting logical ops for AddSSTable requests. I worry that getting that to work with the buffering of rangefeeds might be quite hard with large requests.

@danhhz thoughts on this approach?

cc @mwang1026

@danhhz
Contributor

danhhz commented Jan 28, 2020

That's a nice idea, and I have the same concerns about rangefeed buffering. It definitely works for now. Do we have any plans to make IMPORT INTO work without taking the table offline? That would require something different.

@ajwerner
Contributor

> do we have any plans to make IMPORT INTO work without taking the table offline?

Only vague ideas as far as I know. If we had a replicated lock table that supports spans, then we could use that to make AddSSTable work transactionally; @dt brought this up recently. Replicated span locks don't seem like a 20.1 thing at this point. Maybe 20.2. That's probably the cleanest solution, though it will complicate this story a bit. I wonder if even in that case the right thing to do is to detect the commit of the removal of the span lock as the event for the AddSSTable, and maybe also attach metadata to the logical op entry with that "committed" span lock and just do a time-bound iteration of that span. For a given AddSSTable we should be writing into a span all at the same timestamp. If we really wanted to make IMPORT INTO transactional, then we'd need to grab an upgradeable read lock on the entire range being written before we picked a timestamp. I'm not sure if that will fly. I guess it's better than what happens today.

I'm sure we could cobble together something worse using partial offline states in table descriptors or something else nasty that might actually make implementing this a little easier.

<ramble>
I've been thinking generally about watchable locks and some sort of signaling read-write lock that is a common pattern when orchestrating distributed services. This is effectively also the primitive I want for changefeed leases. I'm not sure gossip is the best mechanism for notifying of changes but it's what I'm thinking about using.

The pattern I'd like here is for the lock holders to watch a changefeed for some sort of invalidating event; the writer of the event will then try to grab an exclusive lock. When it successfully acquires the exclusive lock, it knows that all of the other leaseholders have dropped theirs. It could be done without shared locks if you can grab an exclusive lock over a span. Long-lived transactions are mostly expensive because of transaction heartbeating, which we can probably eliminate without too much work.

I wonder how much of this would fall out cleanly if we implemented pg_advisory_lock and scoped changefeeds. I'm not sure what exactly is the right abstraction for distributed locking at the kv layer but it seems like there's a lot of other state that would benefit from shared leasing and distributed coordination primitives. There are existing subsystems it would simplify (jobs?) and features it should enable (enums? types?).

There's a bootstrapping problem to be worked around to use the changefeeds in the implementation but I'm going to assume we could work around it.
</ramble>

@dt
Member

dt commented Jan 29, 2020

my short term vote would probably be that active changefeeds should terminate with an error as soon as a table moves into a non-public state (offline, drop, etc).

That would mean that when you start IMPORT INTO, it kills your feeds, and you can't resume them (or start new ones) while the table is still offline. When the table comes back online, you could resume with a cursor from the time it stopped and you'd see all the imported keys (we timestamp them > offline time) just via a normal backfill. I don't think we need to do anything with RangeFeed and AddSSTable in this scenario (unless there's a race between noticing the table is offline and tearing the feed down, and the first SST arriving?). In theory, we could also pause the job and resume it automatically after the IMPORT so this could all go unnoticed, but at the low level, I don't think we want changefeeds active when there's a chance of AddSSTable requests messing with our assumptions.

I don't think it would actually be correct to emit contents of SSTs as they came in, since they're sorta "uncommitted" until the IMPORT finishes, but more importantly, none of this is expected to play nice with txns, respecting closed-ts, etc -- by taking a table offline, we're assuming we can throw all that out the window, and I imagine we'll violate some assumptions if we let a changefeed keep running on keyspace where we've locked out other SQL and said all bets are off.
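The fail-then-resume flow described above can be sketched in SQL (the table name, sink URI, and cursor timestamp are illustrative placeholders):

```sql
-- The changefeed fails when the table goes offline for the IMPORT.
-- Note the resolved timestamp the feed had reached before it stopped.

-- After the IMPORT finishes and the table is public again, start a new
-- changefeed from that timestamp. The normal backfill will emit all the
-- imported keys, because IMPORT writes them at timestamps above the
-- offline time.
CREATE CHANGEFEED FOR TABLE test
    INTO 'kafka://localhost:9092'
    WITH cursor = '1611234567000000000.0000000000';
```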

@ajwerner
Contributor

> my short term vote would probably be that active changefeeds should terminate with an error as soon as a table moves into a non-public state (offline, drop, etc).

Well aren’t you in luck! #44265 (comment)

@ajwerner
Contributor

ajwerner commented Feb 5, 2020

I just realized that #44663 makes this harder. I have a feeling that time-bound scans are not going to be a popular idea but I could be wrong.

@ajwerner
Contributor

CDC team, y'all should probably add a unit test exercising the IMPORT INTO scenarios to confirm that the changefeed does indeed fail (ideally with an error carrying an appropriate timestamp, having resolved everything up to that timestamp), and then close this issue.

@amruss
Contributor

amruss commented Feb 25, 2021

Acked @ajwerner

@shermanCRL
Contributor

I moved this back to triage, to ensure we get someone assigned.

stevendanna added a commit to stevendanna/cockroach that referenced this issue Apr 28, 2021
In 20.2.4, a changefeed would fail if IMPORT INTO was run against one
of its target tables. The failure would look like:

```
I210428 10:45:57.982012 2015 jobs/registry.go:1131 ⋮ [n1] CHANGEFEED job 653840730248282113: stepping through state failed with error: ‹relation› ‹"test"› is offline: ‹importing›
(1)
Wraps: (2) attached stack trace
  -- stack trace:
  | github.com/cockroachdb/cockroach/pkg/sql/catalog.FilterDescriptorState
  |     /Users/ssd/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/descriptor.go:387
  | github.com/cockroachdb/cockroach/pkg/sql/catalog/lease.storage.acquire.func1
  |     /Users/ssd/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/lease/lease.go:219
  | github.com/cockroachdb/cockroach/pkg/kv.(*DB).Txn.func1
  |     /Users/ssd/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:758
  | github.com/cockroachdb/cockroach/pkg/kv.(*Txn).exec
  |     /Users/ssd/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn.go:808
  | github.com/cockroachdb/cockroach/pkg/kv.(*DB).Txn
  |     /Users/ssd/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:757
  | github.com/cockroachdb/cockroach/pkg/sql/catalog/lease.storage.acquire
  |     /Users/ssd/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/lease/lease.go:193
  | github.com/cockroachdb/cockroach/pkg/sql/catalog/lease.acquireNodeLease.func1
  |     /Users/ssd/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/lease/lease.go:859
  | github.com/cockroachdb/cockroach/pkg/util/syncutil/singleflight.(*Group).doCall
  |     /Users/ssd/go/src/github.com/cockroachdb/cockroach/pkg/util/syncutil/singleflight/singleflight.go:128
  | runtime.goexit
  |     /usr/local/Cellar/go/1.16.3/libexec/src/runtime/asm_amd64.s:1371
Wraps: (3) ‹relation› ‹"test"› is offline: ‹importing›
Error
types: (1) *catalog.inactiveDescriptorError (2) *withstack.withStack (3) *errutil.leafError
```

We want the changefeed to fail when IMPORT INTO is run because changes
made via the AddSSTable mechanism are not currently reflected in the
changefeed, meaning we would fail to emit imported data.

The previous path that raised this failure depended on:

1) The descriptor being offline at the point we attempted to acquire a
   lease on it:

   https://github.com/cockroachdb/cockroach/blob/d1962910b58005096ce411bccbaddcd0c1d30cbd/pkg/ccl/changefeedccl/schemafeed/schema_feed.go#L514

2) The lease acquisition filtering out offline descriptors with an
error:

   https://github.com/cockroachdb/cockroach/blob/eda2309728392593162e962a61182eab6ab003ff/pkg/sql/catalog/descriptor.go#L209

3) The failure from the lease acquisition in the schemafeed being
   treated as a fatal error.

I believe our behaviour here has changed a few times on both the 20.2
branch and master because of changes in each of these 3 behaviours.

In this change, rather than relying on the lease acquisition, we
specifically check for offline tables in our ValidateTable
function. This function is called for every descriptor version we get
from the ExportRequest on the Descriptor table.

Currently, I believe that checking for the offline descriptors is
correct since it appears that only restore and import put tables into
an offline state.

Release note (enterprise change): CHANGEFEEDs more reliably fail when
IMPORT INTO is run against a targeted table.

Fixes cockroachdb#64276

See also cockroachdb#62585, cockroachdb#43784
stevendanna added a commit to stevendanna/cockroach that referenced this issue Apr 28, 2021
craig bot pushed a commit that referenced this issue Apr 28, 2021
64323: changefeedccl: fail changefeeds when tables go offline r=miretskiy a=stevendanna

Co-authored-by: Steven Danna <danna@cockroachlabs.com>
@amruss
Contributor

amruss commented Apr 28, 2021

Closing: Steven's PR (https://github.com/cockroachdb/cockroach/pull/64323/files) made sure we fail the changefeed when the table is taken offline for an IMPORT INTO, and added unit tests. Docs PR here: cockroachdb/docs#10452 (comment)

@amruss amruss closed this as completed Apr 28, 2021
stevendanna added a commit to stevendanna/cockroach that referenced this issue Apr 29, 2021