Update with changes done in sqllineage v1.4.1 (#8)
* Update with changes done in sqllineage v1.4.1

* Remove walrus operator
nahuelverdugo authored Apr 3, 2023
1 parent b1bcad8 commit 87fc058
Showing 42 changed files with 973 additions and 350 deletions.
8 changes: 4 additions & 4 deletions .github/workflows/python-package.yml
@@ -18,15 +18,15 @@ jobs:
python-version: ['3.7', '3.8', '3.9', '3.10']

steps:
- uses: actions/checkout@v2
- uses: actions/checkout@v3
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v2
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}
- name: Set up NodeJS
uses: actions/setup-node@v2
uses: actions/setup-node@v3
with:
node-version: '14'
node-version: '16'
- name: Install
run: pip install tox codecov
- name: Script
4 changes: 2 additions & 2 deletions .github/workflows/python-publish.yml
@@ -13,9 +13,9 @@ jobs:
runs-on: ubuntu-latest

steps:
- uses: actions/checkout@v2
- uses: actions/checkout@v3
- name: Set up Python
uses: actions/setup-python@v2
uses: actions/setup-python@v4
with:
python-version: '3.x'
- name: Install dependencies
4 changes: 2 additions & 2 deletions .pre-commit-config.yaml
@@ -1,6 +1,6 @@
repos:
- repo: https://github.com/psf/black
rev: 22.8.0
rev: 23.1.0
hooks:
- id: black
language_version: python3.7
@@ -9,6 +9,6 @@ repos:
hooks:
- id: flake8
- repo: https://github.com/pre-commit/mirrors-mypy
rev: v0.971
rev: v1.0.1
hooks:
- id: mypy
35 changes: 33 additions & 2 deletions README.md
@@ -16,8 +16,10 @@ This is a fork authored by the OpenMetadata community, where we are adding `sqlf
Never get the hang of a SQL parser? SQLLineage comes to the rescue. Given a SQL command, SQLLineage will tell you its
source and target tables, without worrying about Tokens, Keyword, Identifier and all the jargon used by SQL parsers.

Behind the scene, SQLLineage uses the fantastic [`sqlparse`](https://github.com/andialbrecht/sqlparse) library to parse
the SQL command, and bring you all the human-readable result with ease.
Behind the scenes, SQLLineage leverages pluggable parser libraries ([`sqlfluff`](https://github.com/sqlfluff/sqlfluff)
and [`sqlparse`](https://github.com/andialbrecht/sqlparse)) to parse the SQL command, analyzes the AST, stores the lineage
information in a graph (using the graph library [`networkx`](https://github.com/networkx/networkx)), and brings you
human-readable results with ease.

## Demo & Documentation
Talk is cheap, show me a [demo](https://reata.github.io/sqllineage/).
@@ -95,6 +97,35 @@ Intermediate Tables:
db1.table1
```

### Dialect-Awareness Lineage
By default, sqllineage doesn't validate your SQL and could give confusing results when the SQL syntax is invalid.
In addition, different SQL dialects have different sets of keywords, further weakening sqllineage's capabilities when
a keyword is used as a table name or column name. To reduce the impact, users are strongly encouraged to pass the
dialect to assist the lineage analysis.

Take the example below: `analyze` is a reserved keyword in PostgreSQL. The default non-validating dialect gives an
incomplete result, while the ansi dialect gives the correct one and the postgres dialect tells you this causes a syntax error:
```
$ sqllineage -e "insert into analyze select * from foo;"
Statements(#): 1
Source Tables:
<default>.foo
Target Tables:
$ sqllineage -e "insert into analyze select * from foo;" --dialect=ansi
Statements(#): 1
Source Tables:
<default>.foo
Target Tables:
<default>.analyze
$ sqllineage -e "insert into analyze select * from foo;" --dialect=postgres
...
sqllineage.exceptions.InvalidSyntaxException: This SQL statement is unparsable, please check potential syntax error for SQL
```

Use `sqllineage --dialects` to see all available dialects.

### Column-Level Lineage
We also support column-level lineage in the command line interface: set the level option to column and all column
lineage paths will be printed.
84 changes: 84 additions & 0 deletions docs/behind_the_scene/dialect-awareness_lineage_design.rst
@@ -0,0 +1,84 @@
********************************
Dialect-Awareness Lineage Design
********************************

Problem Statement
=================
As of the v1.3.x release, table-level lineage is production-ready. Column-level lineage, given that no metadata is
available, is also as good as it can be. And yet we still have a lot of corner cases that are not yet supported.
This is due to the long tail of SQL language features and the fragmentation of SQL dialects.

Some typical issues:

* How to check whether syntax is valid or not?

* dialect specific syntax:

* MSSQL assignment operator
* Snowflake MERGE statement
* CURRENT_TIMESTAMP: keyword or function?
* identifier quote character: double quote or backtick?

* dialect specific keywords:

* reserved keyword vs non-reserved keyword list
* non-reserved keyword as table name
* non-reserved keyword as column name

* dialect specific function:

* Presto UNNEST
* Snowflake GENERATOR

Over the years, we have accumulated several monkey patches and utils on sqlparse to tweak the generated AST, either
because of incorrect parsing results (e.g. parenthesized query followed by INSERT INTO table parsed as function) or
because of token grouping that is not yet supported (e.g. window functions). Due to the non-validating nature of
sqlparse, that's the bitter pill to swallow in exchange for the tons of convenience we enjoyed.

Wishful Thinking
================
To move forward, we'd want more from the parser so that:

1. We know better which syntax or dialect-specific features we support.
2. We can easily revise parsing rules to generate the AST we want when we decide to support some new features.
3. Users can specify the dialect when they use sqllineage, so they know what to expect. And we explicitly let them know
when we don't know how to parse the SQL (InvalidSyntaxException) or how to analyze the lineage (UnsupportedStatementException).

Sample call from command line:

.. code-block:: bash

    sqllineage -f test.sql --dialect=ansi

Sample call from Python API:

.. code-block:: python

    from sqllineage.runner import LineageRunner
    sql = "select * from dual"
    result = LineageRunner(sql, dialect="ansi")

Likewise in the frontend UI, users have a dropdown select to choose the dialect they want.

Implementation Plan
===================
The `OpenMetadata`_ community contributed an implementation using the parser underneath sqlfluff. With `#326`_ merged
into master, we have a new `dialect` option. When passed a real dialect, like mysql, oracle, hive, sparksql, bigquery,
snowflake, etc., we leverage sqlfluff to analyze the query. A pseudo dialect `non-validating` is introduced to retain
backward compatibility, falling back to sqlparse as the parser.

We run dual tests using both parsers to make sure the lineage result is exactly the same for every test case
(except for a few edge cases).

From code structure perspective, we refactored the whole code base to introduce a parser interface:

* LineageAnalyzer now accepts a single-statement SQL string, split by LineageRunner, and returns StatementLineageHolder
  as before
* Each parser implementation sits in the folder **sqllineage.core.parser**. They extend the LineageAnalyzer and common
  Models, and leverage Holders at different layers.
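
The parser interface described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the class names ``StubSqlParseAnalyzer`` and ``StubSqlFluffAnalyzer``, the helpers ``get_analyzer`` and ``analyze_script``, and the simplified ``StatementLineageHolder`` are all hypothetical stand-ins, and the statement splitting is deliberately naive.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class StatementLineageHolder:
    """Simplified stand-in for the real holder: source and target table names."""

    read: Set[str] = field(default_factory=set)
    write: Set[str] = field(default_factory=set)


class LineageAnalyzer(ABC):
    """Common interface: one single-statement SQL string in, one holder out."""

    @abstractmethod
    def analyze(self, sql: str) -> StatementLineageHolder:
        ...


class StubSqlParseAnalyzer(LineageAnalyzer):
    """Hypothetical placeholder for the non-validating, sqlparse-backed parser."""

    def analyze(self, sql: str) -> StatementLineageHolder:
        return StatementLineageHolder()  # a real analyzer would walk the AST here


class StubSqlFluffAnalyzer(LineageAnalyzer):
    """Hypothetical placeholder for the dialect-aware, sqlfluff-backed parser."""

    def __init__(self, dialect: str) -> None:
        self.dialect = dialect

    def analyze(self, sql: str) -> StatementLineageHolder:
        return StatementLineageHolder()  # a real analyzer would walk the AST here


def get_analyzer(dialect: str = "non-validating") -> LineageAnalyzer:
    # The pseudo dialect falls back to the legacy parser; anything else is
    # treated as a real dialect and routed to the validating parser.
    if dialect == "non-validating":
        return StubSqlParseAnalyzer()
    return StubSqlFluffAnalyzer(dialect)


def analyze_script(sql: str, dialect: str = "non-validating") -> List[StatementLineageHolder]:
    # Naive split on ';' for illustration only; it breaks on semicolons
    # inside string literals or comments.
    statements = [s.strip() for s in sql.split(";") if s.strip()]
    analyzer = get_analyzer(dialect)
    return [analyzer.analyze(s) for s in statements]
```

The point of the sketch is the shape of the design: the caller only picks a dialect, and the choice of underlying parser stays hidden behind one ``analyze`` method, which is also what makes running the same test cases against both parsers straightforward.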

.. note::
Dialect-awareness lineage is now released with v1.4.0

.. _OpenMetadata: https://open-metadata.org/
.. _#326: https://github.com/reata/sqllineage/pull/326
33 changes: 33 additions & 0 deletions docs/first_steps/advanced_usage.rst
@@ -50,6 +50,39 @@ And if you want to see lineage result for every SQL statement, just toggle verbo
db1.table1

Dialect-Awareness Lineage
=========================
By default, sqllineage doesn't validate your SQL and could give confusing results when the SQL syntax is invalid.
In addition, different SQL dialects have different sets of keywords, further weakening sqllineage's capabilities when
a keyword is used as a table name or column name. To reduce the impact, users are strongly encouraged to pass the
dialect to assist the lineage analysis.

Take the example below: `analyze` is a reserved keyword in PostgreSQL. The default non-validating dialect gives an
incomplete result, while the ansi dialect gives the correct one and the postgres dialect tells you this causes a syntax error:

.. code-block:: bash

    $ sqllineage -e "insert into analyze select * from foo;"
    Statements(#): 1
    Source Tables:
    <default>.foo
    Target Tables:
    $ sqllineage -e "insert into analyze select * from foo;" --dialect=ansi
    Statements(#): 1
    Source Tables:
    <default>.foo
    Target Tables:
    <default>.analyze
    $ sqllineage -e "insert into analyze select * from foo;" --dialect=postgres
    ...
    sqllineage.exceptions.InvalidSyntaxException: This SQL statement is unparsable, please check potential syntax error for SQL

Use `sqllineage \-\-dialects` to see all available dialects.

Column-Level Lineage
====================
6 changes: 6 additions & 0 deletions docs/index.rst
@@ -43,6 +43,7 @@ Behind the scene
behind_the_scene/how_sqllineage_work
behind_the_scene/dos_and_donts
behind_the_scene/column-level_lineage_design
behind_the_scene/dialect-awareness_lineage_design

:doc:`behind_the_scene/why_sqllineage`
The motivation of writing SQLLineage
@@ -53,6 +54,11 @@ Behind the scene
:doc:`behind_the_scene/dos_and_donts`
Design principles for SQLLineage

:doc:`behind_the_scene/column-level_lineage_design`
Design docs for column lineage

:doc:`behind_the_scene/dialect-awareness_lineage_design`
Design docs for dialect-awareness lineage

Basic concepts
==============
34 changes: 34 additions & 0 deletions docs/release_note/changelog.rst
@@ -2,6 +2,40 @@
Changelog
*********

v1.4.1
======
:Date: April 2, 2023

Bugfix
-------------
* frontend app unable to load dialect when launched for the first time

v1.4.0
======
:Date: March 31, 2023

Many thanks to Nahuel, Mayur and Pere from the OpenMetadata community for contributing the dialect-awareness lineage feature.
Leveraging sqlfluff underneath, we're now able to give more accurate lineage results when the user specifies the SQL dialect.

Feature
-------------
* Dialect-awareness lineage (`#302 <https://github.com/reata/sqllineage/issues/302>`_)
* support MERGE statement (`#166 <https://github.com/reata/sqllineage/issues/166>`_)

Enhancement
-------------
* Use curved lines in lineage graph visualization (`#320 <https://github.com/reata/sqllineage/issues/320>`_)
* Click to lock highlighted nodes in visualization (`#318 <https://github.com/reata/sqllineage/issues/318>`_)
* Deprecate support for Python 3.6 and Python 3.7, add support for Python 3.11 (`#319 <https://github.com/reata/sqllineage/issues/319>`_)
* support t-sql assignment operator (`#205 <https://github.com/reata/sqllineage/issues/205>`_)

Bugfix
-------------
* exception when insert into qualified table followed by parenthesized query (`#249 <https://github.com/reata/sqllineage/issues/249>`_)
* missing columns when current_timestamp as reserved keyword used in select clause (`#248 <https://github.com/reata/sqllineage/issues/248>`_)
* exception when non-reserved keywords used as column name (`#183 <https://github.com/reata/sqllineage/issues/183>`_)
* exception when non-reserved keywords used as table name (`#93 <https://github.com/reata/sqllineage/issues/93>`_)

v1.3.7
======
:Date: Oct 22, 2022
4 changes: 2 additions & 2 deletions mypy.ini
@@ -8,6 +8,6 @@ warn_no_return=True
warn_redundant_casts=True
warn_unused_ignores=True
disallow_any_generics=True
[mypy-sqllineage.core.parser.sqlfluff.utils.sqlfluff]
[mypy-sqllineage.core.parser.sqlfluff.utils]
disallow_untyped_calls=False
warn_return_any = False
warn_return_any = False
5 changes: 3 additions & 2 deletions setup.py
@@ -60,6 +60,7 @@ def run(self) -> None:
"Programming Language :: Python :: 3.8",
"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: 3.11",
"Programming Language :: Python :: Implementation :: CPython",
],
python_requires=">=3.7",
@@ -71,14 +72,14 @@ def run(self) -> None:
entry_points={"console_scripts": ["sqllineage = sqllineage.cli:main"]},
extras_require={
"ci": [
"bandit==1.7.1",
"bandit",
"black",
"flake8",
"flake8-blind-except",
"flake8-builtins",
"flake8-import-order",
"flake8-logging-format",
"mypy==0.971",
"mypy",
"pytest",
"pytest-cov",
"tox",
35 changes: 32 additions & 3 deletions sqllineage/cli.py
@@ -65,10 +65,16 @@ def main(args=None) -> None:
parser.add_argument(
"-d",
"--dialect",
help="the dialect used to compute the lineage",
help="the dialect used to analyze the lineage, use --dialects to show all available dialects",
type=str,
default=DEFAULT_DIALECT,
metavar="ansi, mysql, snowflake, redshift, hive, etc. Check supported dialects by sqlfluff.",
metavar="<dialect>",
)
parser.add_argument(
"-ds",
"--dialects",
help="list all the available dialects",
action="store_true",
)
args = parser.parse_args(args)
if args.e and args.f:
@@ -79,13 +85,13 @@ def main(args=None) -> None:
sql = extract_sql_from_args(args)
runner = LineageRunner(
sql,
dialect=args.dialect,
verbose=args.verbose,
draw_options={
"host": args.host,
"port": args.port,
"f": args.f if args.f else None,
},
dialect=args.dialect,
)
if args.graph_visualization:
runner.draw(args.dialect)
@@ -95,6 +101,29 @@ def main(args=None) -> None:
runner.print_table_lineage()
elif args.graph_visualization:
return draw_lineage_graph(**{"host": args.host, "port": args.port})
elif args.dialects:
print(
"""non-validating
ansi
athena
bigquery
clickhouse
databricks
db2
exasol
hive
materialize
mysql
oracle
postgres
redshift
snowflake
soql
sparksql
sqlite
teradata
tsql"""
)
else:
parser.print_help()
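
The CLI change above adds a `--dialects` listing flag next to the existing `--dialect` option. A minimal standalone sketch of that pattern with `argparse` might look like the following. It is not the project's `cli.py`: the program name `sqllineage-sketch`, the functions `build_parser` and `run`, and the `AVAILABLE_DIALECTS` tuple are hypothetical, and the default of `non-validating` is an assumption based on the docs in this commit. The dialect names are copied from the list printed in the diff; keeping them in a tuple rather than a hardcoded multi-line string makes the list reusable for validation.

```python
import argparse
from typing import List, Optional

# Dialect names copied from the list printed by --dialects in this commit.
AVAILABLE_DIALECTS = (
    "non-validating", "ansi", "athena", "bigquery", "clickhouse",
    "databricks", "db2", "exasol", "hive", "materialize", "mysql",
    "oracle", "postgres", "redshift", "snowflake", "soql", "sparksql",
    "sqlite", "teradata", "tsql",
)
# Assumption: the default is the non-validating pseudo dialect.
DEFAULT_DIALECT = "non-validating"


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="sqllineage-sketch")
    parser.add_argument("-e", metavar="<quoted-query-string>", help="SQL from command line")
    parser.add_argument(
        "-d", "--dialect", type=str, default=DEFAULT_DIALECT, metavar="<dialect>",
        help="the dialect used to analyze the lineage",
    )
    parser.add_argument(
        "-ds", "--dialects", action="store_true",
        help="list all the available dialects",
    )
    return parser


def run(argv: Optional[List[str]] = None) -> str:
    """Return the text the CLI would print, so the behaviour is easy to test."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.e:
        # Placeholder for the real lineage analysis.
        return f"would analyze {args.e!r} with dialect {args.dialect!r}"
    if args.dialects:
        return "\n".join(AVAILABLE_DIALECTS)
    return parser.format_help()
```

Calling `run(["--dialects"])` returns the dialect names one per line, and `run(["-e", "select 1", "--dialect=ansi"])` shows how the chosen dialect reaches the analysis step.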
