Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: add support for ARRAY type #2115

Closed
7 tasks done
petermattis opened this issue Aug 14, 2015 · 29 comments
Closed
7 tasks done

sql: add support for ARRAY type #2115

petermattis opened this issue Aug 14, 2015 · 29 comments
Assignees
Labels
A-sql-semantics C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
Milestone

Comments

@petermattis
Copy link
Collaborator

petermattis commented Aug 14, 2015

@vivekmenezes
Copy link
Contributor

I've been thinking about how youtube could store all their user comments on cockroachdb. On youtube you can post a comment against a video. You can also reply to an existing comment about a video. So the comment space can be modeled in a table where every row represents a video, with a column containing an array of all the comments related to a video. In addition comments can reference other comments, so the comment type needs to be a tuple like <id, other-comment-id, comment>

The array type should support fast append and update operations, to allow quick commenting and editing of comments.

@kevincox
Copy link

@vivekmenezes Your method would work for small comment sets but you would quickly run into an issue with contention on that array for videos with people commenting quickly and size problems with large amounts of comments. However I think that your suggestion of update operations is a great one. Having array operations built in would avoid having to read the value and then process it locally in a transaction before sending back the new array to the server. Using update commands could create much shorter transactions witch would obviously be good for performance.

@JackKrupansky
Copy link

  1. I'd recommend arrays within a row to be limited to "modest" size (dozens, maybe hundreds of entries at most.) In any case, a row should be a manageable size.
  2. User-defined types are great to define the tuple to be stored in an array. Synchronizing a bunch of parallel arrays (user ID, timestamp, comment text) is a pain. Need to look at how they would be queried, such as all comments by a particular user or all comments within a timestamp range.
  3. Non-modest arrays are best handled as rows in a table, accessed either vis JOIN, nested SELECT, or a separate manual query.
  4. Non-modest arrays frequently need to be queryable, so rows in a table make sense.
  5. The eternal question: Should we just do what Postgres does? (Or at least use it as a starting point for discussion.)
  6. Most comments these days are threaded (hierarchical), which further argues that user comments should be distinct rows.

@vivekmenezes
Copy link
Contributor

One of the reasons for me putting in a real world example, is to allow the
choice of an architecture where each element of the array maps to a
different kv pair on the underlying KV store be part of the consideration
here. It will still mean a lot of contention reading the entire array, but
perhaps we anyway need to support a read with a timestamp a few
milliseconds into the past, or perhaps a read up to a certain array index
to reduce contention.

Alternatively, propose a different schema for the domain. A table with each
individual comment stored in an individual row and the comment id acting as
the primary key, with a column for the video id and secondary index on it.
The comments are read using a secondary index on the video.

A key design consideration is the ability to read all the comments off a
single node (few nodes)

On Tue, Jan 26, 2016 at 11:33 AM Kevin Cox notifications@github.com wrote:

@vivekmenezes https://github.com/vivekmenezes Your method would work
for small comment sets but you would quickly run into an issue with
contention on that array for videos with people commenting quickly and size
problems with large amounts of comments. However I think that your
suggestion of update operations is a great one. Having array operations
built in would avoid having to read the value and then process it locally
in a transaction before sending back the new array to the server. Using
update commands could create much shorter transactions witch would
obviously be good for performance.


Reply to this email directly or view it on GitHub
#2115 (comment)
.

@kevincox
Copy link

WWPGD (What would postgres do) seems like a pretty good starting point for db related questions (it has served sqlite well although very different from cockroach).

I suspect that inline arrays should probably not go over a "modest" size. I'm not saying it should be forbidden but probably discouraged and not optimized for.

At the very least there comes a point where returning the whole array becomes not very useful and that probably means that you need to filter, and once you filter you should probably be using a separate table.

That's my 2 cents at least.

@JackKrupansky
Copy link

We might also consider the Cassandra model of collections - list, set, and map. Sure, sometimes people actually do want mathematical arrays, matrices, and multi-dimensional arrays (ala Postgres), but sometimes they just want simple lists or are trying to simulate a set using an array, or similarly storing keyword/value pairs in two-dimensional arrays or parallel linear arrays.

The map collection type is nice since it lets you store and query arbitrary semi-structured data, typically keyword/value pairs much more naturally.

Cassandra has convenient operations to operate on lists, sets, and maps, such as append and prepend for list, removing list elements, removing all occurrences of a value from a list or set, removing from a map by keyword, etc. Doing all this with raw arrays would be... not so fun.

And I would encourage the same philosophy that Cassandra adopted for lists, sets, and maps: "Collections are meant for storing/denormalizing relatively small amount of data. They work well for things like “the phone numbers of a given user”, “labels applied to an email”, etc. But when items are expected to grow unbounded (“all the messages sent by a given user”, “events registered by a sensor”, ...), then collections are not appropriate anymore and a specific table (with clustering columns) should be used."

It all depends on the use case.

See:
https://cassandra.apache.org/doc/cql3/CQL.html#collections

@kevincox
Copy link

The Cassandra collection model looks really nice. It also appears similar to what I was recommending. I have quoted some of what I deem as the important parts below:

  • "Collections are meant for storing/denormalizing relatively small amount of data."
  • "Collections are always read in their entirety"
  • "Collections cannot have more than 65535 elements."

That limit seems pretty high but I wouldn't be surprised if it is considered bad practice to get anywhere near that. Personally I would be thinking that embedded collections would be suited for "tens of elements" at most.

Again, there are multiple right answers, just sharing what I'm thinking :)

@MehSha
Copy link

MehSha commented Feb 10, 2016

+1 for Cassandra collection model.
arrays are very cool, reduce number of queries and checks to get data, both in one2many and many2many cases.
in many cases we emulate set by two queries (even if we have array).
and maps, eliminate two column tables and unnecessary relations.
all this inches us closed to embedding model (like in Mongodb) and increase both convenience (number of queries) and performance.

please implement

@petermattis petermattis added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) and removed SQL labels Feb 13, 2016
@amangoeliitb
Copy link

Any updates on this @petermattis ?

@petermattis
Copy link
Collaborator Author

@amangoeliitb ARRAY is on our roadmap, but not scheduled for the near term. Do you mind explaining what your use case is? Perhaps there is a suitable alternative using the existing SQL functionality.

@knz knz mentioned this issue Aug 4, 2016
3 tasks
@kulshekhar
Copy link

@petermattis One of the best use cases for ARRAY and JSON types would be to enable the implementation of a simple row level ACL system. They could also be a more performant alternative to JOINs in several cases, especially in a distributed setup.

@tlvenn
Copy link
Contributor

tlvenn commented Dec 6, 2016

I have seen quite a few Array related PRs been merged lately and was wondering how far we are from basic support of array type. Thanks !

@jhartman86
Copy link

+1 re: @kulshekhar's comment "could also be more performant alternative to joins in several cases"

@cuongdo
Copy link
Contributor

cuongdo commented Dec 8, 2016

What specific functionality do you need for arrays? Postgres exposes a lot of syntax and functions for arrays, so we've been limiting our work to what's needed to support Hibernate ORM.

@kulshekhar
Copy link

@cuongdo

  • index the array field
  • check if an array contains a particular value

These are the most important, from my perspective. Additional nice to have functionality would be

  • check if an array contains any value from another list
  • check if an array contains all values present in another list
  • append a value
  • merge/concatenate a new array array

@tlvenn
Copy link
Contributor

tlvenn commented Dec 9, 2016

Honestly, it would be already useful to be able to simply store and retrieve an array.

Then comes the list @kulshekhar posted and I agree, being able to index the array and check if an array contains a given value is probably the most desired features afterward.

@spencerkimball
Copy link
Member

@jordanlewis @nvanbenschoten could you guys update this issue?

@petermattis petermattis removed this from the 1.0 milestone Mar 29, 2017
@cuongdo cuongdo removed their assignment Apr 5, 2017
@cuongdo
Copy link
Contributor

cuongdo commented May 2, 2017

Another key functionality @bdarnell brought up is indexing of ARRAY columns.

@justinj
Copy link
Contributor

justinj commented May 30, 2017

Hi @MehSha, @kulshekhar, @tlvenn, @jhartman86, @amangoeliitb, @kajmagnus, @beldpro-ci, @thearchitect, @yinhm, @nickjackson, @arypurnomoz, @mbertschler, @sudhakar, @tlvenn, @bkleef, @domnulnopcea, @iamcarbon, @muesli, @heiskr, @arduanov, @daliwali, @manigandham since you expressed interest in this issue, I was wondering if you had any insight you could share on how you plan to use arrays within Cockroach!

I'm interested in a couple main questions:

  • How do you plan to (or how do you currently with Postgres) handle accessing the data inside the arrays, by indexing, via unnest, or something else?
  • How large do you expect your arrays to become (either in elements or in data)?
  • A couple of you mentioned indexing - are you referring to normal (non-GIN) indexing? Or GIN indexing? If the former, I'm interested in what kind of use-cases you have for that.
  • What's the appeal of using arrays for you, vs. using a normalized table?

Thanks so much!

@kevincox
Copy link

kevincox commented May 30, 2017

TL;DR I use arrays to contain "tags" which I almost exclusively query to see if some value is in the array of tags.

How do you plan to (or how do you currently with Postgres) handle accessing the data inside the arrays, by indexing, via unnest, or something else?

Most of my array usage is checking membership. For example I have short arrays of small data (numbers or small strings) and I do a query if x is a member of that array. Ocasionally I will return the whole set of array elements in a query.

My second most common use case is a small set of data such as email addresses which I will almost always return in it's entirety.

I almost never access by index.

How large do you expect your arrays to become (either in elements or in data)?

Very small. Most are [0 2] elements, a couple get to around 20. Again, these are integers or short strings.

A couple of you mentioned indexing - are you referring to normal (non-GIN) indexing? Or GIN indexing? If the former, I'm interested in what kind of use-cases you have for that.

I ocasionally use a GIN index to find rows that contain a certian element in an array.

What's the appeal of using arrays for you, vs. using a normalized table?

I expect that they will be stored closer to the row itself for quick checking as well as the ease of indexing and querying.

Hope that helps explain my use case. Feel free to ask for clarification or anything else.

@kulshekhar
Copy link

How do you plan to (or how do you currently with Postgres) handle accessing the data inside the arrays, by indexing, via unnest, or something else?

I mainly use these fields with queries. When I have to use the data from an array field, I normally parse it in the application.

How large do you expect your arrays to become (either in elements or in data)?

I have applications with Array field size ranging upto 20000 elements

A couple of you mentioned indexing - are you referring to normal (non-GIN) indexing? Or GIN indexing? If the former, I'm interested in what kind of use-cases you have for that.

GIN indexing does it for me

What's the appeal of using arrays for you, vs. using a normalized table?

The main use case is application level ACL - arrays allow row level permissions to be managed in a flexible and performant manner. It would also work better than joins in a distributed setup, I imagine

@heiskr
Copy link

heiskr commented May 31, 2017

I'll join in, why not? :)

How do you plan to (or how do you currently with Postgres) handle accessing the data inside the arrays, by indexing, via unnest, or something else?

Querying using ANY.

How large do you expect your arrays to become (either in elements or in data)?

Not unbounded, rarely more than a list of dozen IDs or so.

A couple of you mentioned indexing - are you referring to normal (non-GIN) indexing? Or GIN indexing? If the former, I'm interested in what kind of use-cases you have for that.

I'm not at the scale where that matters... yet ;)

What's the appeal of using arrays for you, vs. using a normalized table?

I'm not a 3rd normal form queen, just as long as I don't have duplicate data. I'm actually transitioning from NoSQL, so even just having a schema is a step in the traditional route. In these tables I'm storing versions of things, and the relationships are part of the versioning, so having to join over the
different versions in a different table... it'd be like (parent_kind, parent_id, parent_version_id, child_kind... etc) Not a pleasant query to do every time I want to read, which is way way more common than write.

I would say though that jsonb support is the bigger blocker at this point.

@kulshekhar
Copy link

I would say though that jsonb support is the bigger blocker at this point

I would agree with this. jsonb could also be used for arrays

@gr0uch
Copy link

gr0uch commented May 31, 2017

How do you plan to (or how do you currently with Postgres) handle accessing the data inside the arrays, by indexing, via unnest, or something else?

unnest. also array operators array_cat, array_append, array_remove, array_length, etc. I wouldn't be replacing entire arrays.

How large do you expect your arrays to become (either in elements or in data)?

Arbitrarily large, perhaps up to the row size limit.

A couple of you mentioned indexing - are you referring to normal (non-GIN) indexing? Or GIN indexing? If the former, I'm interested in what kind of use-cases you have for that.

I haven't had a use case that would benefit from indexing array columns yet.

What's the appeal of using arrays for you, vs. using a normalized table?

Better read performance and less joins, more flexible data modelling.

@justinj
Copy link
Contributor

justinj commented Jun 5, 2017

Thanks so much all of you for your input! It's much appreciated.

@alex-sherwin
Copy link

alex-sherwin commented Aug 18, 2017

I haven't had a use case that would benefit from indexing array columns yet.

A simple, and probably common scenario, would be a standard MIME email that has been broken out into parts so it's indexed/searchable.

Such a domain model would (usually) contain a small number of TO and CC headers representing a (typically) small number of recipients.

In this case, it's quite convenient to be able to have a native array type in a column that is indexed so you can search for a specific email address. In postgres, this is easily accomplished with a text[] column, a GIN index and array operators while querying (which GIN indexes support, see https://www.postgresql.org/docs/9.6/static/indexes-types.html)

@justinj
Copy link
Contributor

justinj commented Sep 19, 2017

Going to close this since the only outstanding issue is not blocking array usage and has its own issue.

@justinj justinj closed this as completed Sep 19, 2017
@alex-sherwin
Copy link

Going to close this since the only outstanding issue is not blocking array usage and has its own issue.

@justinj

It wasn't immediately obvious to me when checking on this issues updates to see that the feature was actually implemented and is part of 1.1 (currently beta).. just for anyone else looking here as well

knz added a commit to knz/cockroach that referenced this issue Nov 22, 2018
knz added a commit to knz/cockroach that referenced this issue Nov 22, 2018
knz added a commit to knz/cockroach that referenced this issue Nov 27, 2018
knz added a commit to knz/cockroach that referenced this issue Nov 27, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-sql-semantics C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
Projects
None yet
Development

No branches or pull requests