Henry Luan
The core helper function get_summary_codes in the sql.mechanic
package takes in a string that specifies a table’s database name, schema
name, and table name. It outputs an SQL query that summarizes the
table’s column statistics.
If you don’t have much time to read or just want a quick solution, jump to Example 2.
-
If you work with tables inside databases hosted on powerful servers (usually on-premise) but have limited compute resources (e.g. RAM, GPU, CPU) for advanced BI (e.g. Python, R, SAS, etc.)
-
If it’s much cheaper for you to use database servers than analytics servers (e.g. those for Python, R, etc.) either on-premise or on the cloud.
-
You probably want to understand how Example 2affect the CPU and disk usage of your database server, to avoid bad surprises on your server’s resource usage.
-
If you use the setup of Example 2on a cloud database, you MUST do some testing, to understand how the example affects the CPU and disk usage of your cloud resource. Please avoid potentially expensive mistakes.
-
Currently, the function only works with Microsoft SQL Server and Netezza databases. Feel free to contribute to the codes, if interested.
Specail credits: the R codes in this package are built on top of Gordon S. Linoff’s book Data Analysis Using SQL and Excel, 2nd Edition. His work has been a tremendous inspiration for the creation of this package.
You can install the library from my GitHub. If you have concerns regarding the package’s security, you can download, check, and use the “get_summary_codes.R” file directly.
# Install package
library(devtools)
install_github("casualcomputer/sql.mechanic",quiet=TRUE)
# Alternative: use the 'get_summary_codes.R' file only
# source("get_summary_codes.R")
The following codes 1) generate the SQL queries you need to summarize a table, and 2) copy (Ctrl+C) the codes to your clipboard. All you have to do is paste it to your SQL editor and execute the queries.
library(sql.mechanic)
#SQL codes for basic summary, Netezza database
sql_basic_netezza = get_summary_codes("DB_NAME.SCHEMA_NAME.TABLE_NAME", type="basic", dbtype="Netezza")
#SQL codes for advanced summary, Netezza database
sql_advanced_netezza = get_summary_codes("DB_NAME.SCHEMA_NAME.TABLE_NAME", type="advanced", dbtype="Netezza")
#SQL codes for basic summary, Microsoft SQL Server
sql_basic_mssql = get_summary_codes("DB_NAME.SCHEMA_NAME.TABLE_NAME", type="basic", dbtype="MSSQL")
#SQL codes for advanced summary, Microsoft SQL Server
sql_advanced_mssql = get_summary_codes("DB_NAME.SCHEMA_NAME.TABLE_NAME", type="advanced", dbtype="MSSQL")
#copy some of the sql queries to the clipboard
writeClipboard(sql_basic_netezza)
In case you are curious, the SQL copied to your clipboard looks like this.
SELECT REPLACE(REPLACE(REPLACE('<start> SELECT ''<col>'' as colname,
COUNT(*) as numvalues,
MAX(freqnull) as freqnull,
CAST(MIN(minval) as CHAR(100)) as minval,
SUM(CASE WHEN <col> = minval THEN freq ELSE 0 END) as numminvals,
CAST(MAX(maxval) as CHAR(100)) as maxval,
SUM(CASE WHEN <col> = maxval THEN freq ELSE 0 END) as nummaxvals,
SUM(CASE WHEN freq =1 THEN 1 ELSE 0 END) as numuniques
FROM (SELECT <col>, COUNT(*) as freq
FROM SCHEMA_NAME.<tab> GROUP BY <col>) osum
CROSS JOIN (SELECT MIN(<col>) as minval, MAX(<col>) as maxval, SUM(CASE WHEN <col> IS NULL THEN 1 ELSE 0 END) as freqnull
FROM (SELECT <col> FROM SCHEMA_NAME.<tab>) osum
) summary',
'<col>', column_name),
'<tab>', 'TABLE_NAME'),
'<start>',
(CASE WHEN ordinal_position = 1 THEN ''
ELSE 'UNION ALL' END)) as codes_data_summary
FROM (SELECT table_name, case when regexp_like(column_name,'[a-z.]|GROUP') then '"'||column_name||'"'
else column_name end as column_name , ordinal_position
FROM information_schema.columns
WHERE table_name ='TABLE_NAME') a;
This example shows you how you can summarize tables as you did in Example 1, with only a few lines of R codes.
# Install package
library(devtools)
install_github("casualcomputer/sql.mechanic",quiet=TRUE)
# Alternative: use the 'get_summary_codes.R' file only
# source("get_summary_codes.R")
# Load packages
library(sql.mechanic)
library(odbc)
library(DBI)
# Connect to database(s)
## Method 1: prompting user (you need to make some changes here)
con <- dbConnect(odbc(),
Driver = "SQL Server",
Server = "mysqlhost",
Database = "mydbname",
UID = "myuser",
PWD = "Database password",
Port = 1433, encoding = 'windows-1252') #'windows-1252' allows French to display properly
## Alternative Method: Using a DSN
#con <- dbConnect(odbc::odbc(), "DNS_NAMES", encoding = 'windows-1252')
sql_query = get_summary_codes("DB_NAME.SCHEMA_NAME.TABLE_NAME", type="basic", dbtype="Netezza")
res = dbSendQuery(con, sql_query) # part of Step 3 in "Example 1"
sql_query_mod = dbFetch(res) # part of Step 3 in "Example 1"
res = dbSendQuery(con, sql_query_mod) # part of Step 4 in "Example 1"
output_table = dbFetch(res) # part of Step 4 in "Example 1"
print(output) #desired summary table
dbDisconnect(con) #close database connection