
Add server side parameters to session connection method #823

Merged
merged 10 commits into from
Jul 24, 2023

Conversation

@JCZuurmond (Collaborator) commented Jul 7, 2023

resolves #690

Description

Pass the existing server_side_parameters to the session connection wrapper and use them to configure the SparkSession.

Checklist

@JCZuurmond JCZuurmond requested a review from a team as a code owner July 7, 2023 08:44
@JCZuurmond JCZuurmond requested a review from mikealfare July 7, 2023 08:44
@cla-bot cla-bot bot added the cla:yes label Jul 7, 2023
@JCZuurmond (Collaborator, Author)

Prefer this PR over #691. @alarocca-apixio : We took your commits and touched up the code so that it can be merged.

@Fokko (Contributor) commented Jul 7, 2023

Works on my end:

cat ~/.dbt/profiles.yml 
dbt_tabular:
  outputs:
    dev:
      method: session
      schema: dbt_tabular
      type: spark
      host: NA
      server_side_parameters:
        "spark.driver.memory": "2g"
  target: dev

I can see that this is being picked up by the process:

ps aux | grep -i spark                          
fokkodriesprong  11191 150.0  0.4 413414240 269024 s008  S+   11:14AM   0:01.99 /opt/homebrew/Cellar/openjdk@11/11.0.19/libexec/openjdk.jdk/Contents/Home/bin/java -cp /opt/homebrew/lib/python3.9/site-packages/pyspark/conf:/opt/homebrew/lib/python3.9/site-packages/pyspark/jars/* -Xmx2g -XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED -Djdk.reflect.useDirectMethodHandle=false org.apache.spark.deploy.SparkSubmit --conf spark.driver.memory=2g --conf spark.sql.catalogImplementation=hive pyspark-shell

We can see that --conf spark.driver.memory=2g is set.
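For context, a minimal sketch of what the change does conceptually. This is not the actual dbt-spark code (the real wrapper lives in dbt/adapters/spark/session.py); `FakeBuilder` and `apply_server_side_parameters` are illustrative names standing in for `pyspark.sql.SparkSession.builder` and its `config` calls:

```python
from typing import Any, Dict, Optional


class FakeBuilder:
    """Stand-in for pyspark.sql.SparkSession.builder, for illustration only."""

    def __init__(self) -> None:
        self.conf: Dict[str, str] = {}

    def config(self, key: str, value: str) -> "FakeBuilder":
        # SparkSession.builder.config() also returns the builder for chaining.
        self.conf[key] = value
        return self


def apply_server_side_parameters(
    builder: FakeBuilder,
    server_side_parameters: Optional[Dict[str, Any]] = None,
) -> FakeBuilder:
    # Each entry from the profile's server_side_parameters mapping becomes
    # a session config setting, equivalent to a --conf flag on spark-submit.
    for key, value in (server_side_parameters or {}).items():
        builder = builder.config(key, str(value))
    return builder


builder = apply_server_side_parameters(
    FakeBuilder(), {"spark.driver.memory": "2g"}
)
print(builder.conf)  # {'spark.driver.memory': '2g'}
```

This matches the observable behavior above: the `spark.driver.memory` entry from profiles.yml surfaces as `--conf spark.driver.memory=2g` on the launched JVM.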

@JCZuurmond JCZuurmond changed the title Pass existing server_side_parameters to session connection wrapper and use to configure SparkSession. #691 Add server side parameters to session connection method Jul 7, 2023
@JCZuurmond (Collaborator, Author)

@Fleid: This PR is ready to be merged

@JCZuurmond (Collaborator, Author)

@colin-rogers-dbt: Could you fix the CI?

@colin-rogers-dbt (Contributor)

This looks ready to merge, but I think we should add a functional test case (we will need to think about where and how) and update our unit tests, as in #577.

@colin-rogers-dbt (Contributor)

Also, is this dependent on #577?

dbt/adapters/spark/session.py (review thread, resolved)
dbt/adapters/spark/connections.py (review thread, resolved)
@JCZuurmond JCZuurmond force-pushed the ADAP-405 branch 2 times, most recently from 9e6834e to ac402e7 Compare July 19, 2023 08:19
@@ -24,9 +24,10 @@ class Cursor:
https://github.com/mkleehammer/pyodbc/wiki/Cursor
"""

def __init__(self) -> None:
def __init__(self, *, server_side_parameters: Optional[Dict[str, Any]] = None) -> None:
Contributor commented:

I'm unfamiliar with this mechanic. What does the *, do in the signature? Is it similar to unpacking with my_arg, *_? Do we need it?

@JCZuurmond (Collaborator, Author) commented Jul 22, 2023

Yes, it is related to the unpacking you mentioned; in a signature this mechanic is called "packing", the opposite of unpacking (it is specified in PEP 3102, "Keyword-Only Arguments").

If you give the starred parameter a name, all remaining positional arguments are packed into that parameter as a tuple:

def sum(*numbers):
    total = 0
    for number in numbers:
        total += number
    return total
    
sum(1, 2, 3)
sum(5, 7, 9, 11)

The trick here is that a bare * (without a name) packs nothing: it simply forces every parameter after it to be passed as a keyword argument. I like to use that to improve readability:

Connection(foo)

vs

Connection(server_side_parameters=foo)

The latter is more readable.

foo is of course a badly chosen variable name here. My point: I expect the first positional argument of a Connection to be a connection string, as in conn = pyodbc.connect(connection_str, autocommit=True), rather than server_side_parameters, which is an optional additional parameter.

The * is not required; I added it to improve code readability.
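To make the keyword-only behavior concrete, here is a small sketch (a hypothetical connect function, not the dbt-spark or pyodbc API): the keyword call works, while the positional call raises a TypeError.

```python
from typing import Any, Dict, Optional


def connect(*, server_side_parameters: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    # The bare * makes server_side_parameters keyword-only.
    return dict(server_side_parameters or {})


# Keyword form: accepted.
print(connect(server_side_parameters={"spark.driver.memory": "2g"}))

# Positional form: rejected at call time.
try:
    connect({"spark.driver.memory": "2g"})
except TypeError as exc:
    print("rejected:", exc)
```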

@@ -159,6 +165,9 @@ class Connection:
https://github.com/mkleehammer/pyodbc/wiki/Connection
"""

def __init__(self, *, server_side_parameters: Optional[Dict[Any, str]] = None) -> None:
Contributor commented:

Same question as above.


Successfully merging this pull request may close these issues.

[ADAP-405] [Feature] Use server_side_parameters as SparkSession config in Spark "session" mode
5 participants