Update with changes done in sqllineage v1.4.1 (#8)

* Update with changes done in sqllineage v1.4.1
* Remove walrus operator
open-metadata · Apr 3, 2023 · 87fc058
1 parent b1bcad8

Showing 42 changed files with 973 additions and 350 deletions.
diff --git a/.github/workflows/python-package.yml b/.github/workflows/python-package.yml
@@ -18,15 +18,15 @@ jobs:
         python-version: ['3.7', '3.8', '3.9', '3.10']
 
     steps:
-    - uses: actions/checkout@v2
+    - uses: actions/checkout@v3
     - name: Set up Python ${{ matrix.python-version }}
-      uses: actions/setup-python@v2
+      uses: actions/setup-python@v4
       with:
         python-version: ${{ matrix.python-version }}
     - name: Set up NodeJS
-      uses: actions/setup-node@v2
+      uses: actions/setup-node@v3
       with:
-        node-version: '14'
+        node-version: '16'
     - name: Install
       run: pip install tox codecov
     - name: Script

diff --git a/.github/workflows/python-publish.yml b/.github/workflows/python-publish.yml
@@ -13,9 +13,9 @@ jobs:
     runs-on: ubuntu-latest
 
     steps:
-    - uses: actions/checkout@v2
+    - uses: actions/checkout@v3
     - name: Set up Python
-      uses: actions/setup-python@v2
+      uses: actions/setup-python@v4
       with:
         python-version: '3.x'
     - name: Install dependencies

diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -1,6 +1,6 @@
 repos:
   - repo: https://github.com/psf/black
-    rev: 22.8.0
+    rev: 23.1.0
     hooks:
       - id: black
         language_version: python3.7
@@ -9,6 +9,6 @@ repos:
     hooks:
       - id: flake8
   - repo: https://github.com/pre-commit/mirrors-mypy
-    rev: v0.971
+    rev: v1.0.1
     hooks:
     - id: mypy
diff --git a/README.md b/README.md
@@ -16,8 +16,10 @@ This is a fork authored by the OpenMetadata community, where we are adding `sqlf
 Never get the hang of a SQL parser? SQLLineage comes to the rescue. Given a SQL command, SQLLineage will tell you its
 source and target tables, without worrying about Tokens, Keywords, Identifiers and all the jargon used by SQL parsers.
 
-Behind the scene, SQLLineage uses the fantastic [`sqlparse`](https://github.com/andialbrecht/sqlparse) library to parse 
-the SQL command, and bring you all the human-readable result with ease.
+Behind the scenes, SQLLineage leverages pluggable parser libraries ([`sqlfluff`](https://github.com/sqlfluff/sqlfluff)
+and [`sqlparse`](https://github.com/andialbrecht/sqlparse)) to parse the SQL command, analyzes the AST, stores the lineage
+information in a graph (using the graph library [`networkx`](https://github.com/networkx/networkx)), and brings you
+human-readable results with ease.
 
 ## Demo & Documentation
 Talk is cheap, show me a [demo](https://reata.github.io/sqllineage/).
@@ -95,6 +97,35 @@ Intermediate Tables:
     db1.table1
 ```
 
+### Dialect-Awareness Lineage
+By default, sqllineage doesn't validate your SQL and can give confusing results when the SQL syntax is invalid.
+In addition, each SQL dialect has its own set of keywords, which further weakens sqllineage's capabilities when a
+keyword is used as a table name or column name. To reduce the impact, users are strongly encouraged to pass the dialect to
+assist the lineage analysis.
+
+Take the example below: `analyze` is a reserved keyword in PostgreSQL. The default non-validating dialect gives an incomplete result,
+while the ansi dialect gives the correct one and the postgres dialect tells you this causes a syntax error:
+```
+$ sqllineage -e "insert into analyze select * from foo;"
+Statements(#): 1
+Source Tables:
+    <default>.foo
+Target Tables:
+    
+$ sqllineage -e "insert into analyze select * from foo;" --dialect=ansi
+Statements(#): 1
+Source Tables:
+    <default>.foo
+Target Tables:
+    <default>.analyze
+
+$ sqllineage -e "insert into analyze select * from foo;" --dialect=postgres
+...
+sqllineage.exceptions.InvalidSyntaxException: This SQL statement is unparsable, please check potential syntax error for SQL
+```
+
+Use `sqllineage --dialects` to see all available dialects.
+
 ### Column-Level Lineage
 We also support column level lineage in command line interface, set level option to column, all column lineage path will 
 be printed.

diff --git a/docs/behind_the_scene/dialect-awareness_lineage_design.rst b/docs/behind_the_scene/dialect-awareness_lineage_design.rst
@@ -0,0 +1,84 @@
+********************************
+Dialect-Awareness Lineage Design
+********************************
+
+Problem Statement
+=================
+As of v1.3.x release, table level lineage is perfectly production-ready. Column level lineage, under the no-metadata
+background, is also as good as it can be. And yet we still have a lot of corner cases that are not yet supported.
+This is really due to the long tail of SQL language features and the fragmentation of SQL dialects.
+
+Some typical issues:
+
+* How to check whether syntax is valid or not?
+
+* dialect specific syntax:
+
+  * MSSQL assignment operator
+  * Snowflake MERGE statement
+  * CURRENT_TIMESTAMP: keyword or function?
+  * identifier quote character: double quote or backtick?
+
+* dialect specific keywords:
+
+  * reserved keyword vs non-reserved keyword list
+  * non-reserved keyword as table name
+  * non-reserved keyword as column name
+
+* dialect specific function:
+
+  * Presto UNNEST
+  * Snowflake GENERATOR
+
+Over the years, we have accumulated several monkey patches and utils on sqlparse to tweak the generated AST, either because
+of incorrect parsing results (e.g. parenthesized query followed by INSERT INTO table parsed as function) or because of
+unsupported token grouping (e.g. window functions). Due to the non-validating nature of sqlparse, that's the
+bitter pill to swallow in exchange for the tons of convenience we have enjoyed.
+
+Wishful Thinking
+================
+To move forward, we'd want more from the parser so that:
+
+1. We know better what syntax, or dialect specific feature we support.
+2. We can easily revise parsing rules to generate the AST we want when we decide to support some new features.
+3. Users can specify the dialect when they use sqllineage, so they know what to expect. And we explicitly let them know
+   when we don't know how to parse the SQL (InvalidSyntaxException) or how to analyze the lineage (UnsupportedStatementException).
+
+Sample call from command line:
+
+.. code-block:: bash
+
+    sqllineage -f test.sql --dialect=ansi
+
+Sample call from Python API:
+
+.. code-block:: python
+
+    from sqllineage.runner import LineageRunner
+    sql = "select * from dual"
+    result = LineageRunner(sql, dialect="ansi")
+
+Likewise in the frontend UI, users have a dropdown to choose the dialect they want.
+
+Implementation Plan
+===================
+`OpenMetadata`_ community contributed an implementation using the parser underneath sqlfluff. With `#326`_ merged into
+master, we have a new `dialect` option. When a real dialect is passed, like mysql, oracle, hive, sparksql, bigquery,
+snowflake, etc, we'll leverage sqlfluff to analyze the query. A pseudo dialect `non-validating` is introduced to retain
+backward compatibility, falling back to sqlparse as the parser.
+
+We're running dual tests using both parsers to make sure the lineage result is exactly the same for every test case
+(except for a few edge cases).
+
+From code structure perspective, we refactored the whole code base to introduce a parser interface:
+
+* LineageAnalyzer now accepts single statement SQL string, split by LineageRunner, and returns StatementLineageHolder
+  as before
+* Each parser implementation sits in folder **sqllineage.core.parser**. They extend the LineageAnalyzer and common
+  Models, and leverage Holders at different layers.
+
+.. note::
+    Dialect-awareness lineage is now released with v1.4.0
+
+.. _OpenMetadata: https://open-metadata.org/
+.. _#326: https://github.com/reata/sqllineage/pull/326
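
Reviewer note: the parser interface described in the Implementation Plan above can be pictured with a small dispatch sketch. This is not the code in this commit. The class and function names below (`Analyzer`, `SqlparseStyleAnalyzer`, `SqlfluffStyleAnalyzer`, `pick_analyzer`, `analyze_script`) are hypothetical, and the analyzers only return placeholder strings instead of real lineage. Only the `non-validating` pseudo dialect name and the idea of "split the script first, then analyze one statement at a time" come from the design doc.

```python
from abc import ABC, abstractmethod
from typing import List

# Pseudo dialect named in the design doc; every other name here is hypothetical.
NON_VALIDATING = "non-validating"


class Analyzer(ABC):
    """Hypothetical stand-in for the shared single-statement analyzer interface."""

    @abstractmethod
    def analyze(self, statement: str) -> str:
        """Return a placeholder result for one SQL statement."""


class SqlparseStyleAnalyzer(Analyzer):
    """Placeholder for the backward-compatible, non-validating path."""

    def analyze(self, statement: str) -> str:
        return f"non-validating analysis of: {statement}"


class SqlfluffStyleAnalyzer(Analyzer):
    """Placeholder for the dialect-aware path."""

    def __init__(self, dialect: str) -> None:
        self.dialect = dialect

    def analyze(self, statement: str) -> str:
        return f"{self.dialect} analysis of: {statement}"


def pick_analyzer(dialect: str = NON_VALIDATING) -> Analyzer:
    # The pseudo dialect falls back to the non-validating analyzer;
    # any other dialect name goes to the dialect-aware one.
    if dialect == NON_VALIDATING:
        return SqlparseStyleAnalyzer()
    return SqlfluffStyleAnalyzer(dialect)


def analyze_script(sql: str, dialect: str = NON_VALIDATING) -> List[str]:
    analyzer = pick_analyzer(dialect)
    # Naive split on ";" for illustration only; a real splitter must
    # respect quoted strings and comments.
    statements = [s.strip() for s in sql.split(";") if s.strip()]
    return [analyzer.analyze(s) for s in statements]
```

Calling `analyze_script("select 1; select 2;", "ansi")` routes both statements through the dialect-aware placeholder, while omitting the dialect keeps the non-validating behavior.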
diff --git a/docs/first_steps/advanced_usage.rst b/docs/first_steps/advanced_usage.rst
@@ -50,6 +50,39 @@ And if you want to see lineage result for every SQL statement, just toggle verbo
         db1.table1
 
 
+Dialect-Awareness Lineage
+=========================
+By default, sqllineage doesn't validate your SQL and can give confusing results when the SQL syntax is invalid.
+In addition, each SQL dialect has its own set of keywords, which further weakens sqllineage's capabilities when a
+keyword is used as a table name or column name. To reduce the impact, users are strongly encouraged to pass the dialect to
+assist the lineage analysis.
+
+Take the example below: `analyze` is a reserved keyword in PostgreSQL. The default non-validating dialect gives an incomplete result,
+while the ansi dialect gives the correct one and the postgres dialect tells you this causes a syntax error:
+
+.. code-block:: bash
+
+    $ sqllineage -e "insert into analyze select * from foo;"
+    Statements(#): 1
+    Source Tables:
+        <default>.foo
+    Target Tables:
+
+    $ sqllineage -e "insert into analyze select * from foo;" --dialect=ansi
+    Statements(#): 1
+    Source Tables:
+        <default>.foo
+    Target Tables:
+        <default>.analyze
+
+    $ sqllineage -e "insert into analyze select * from foo;" --dialect=postgres
+    ...
+    sqllineage.exceptions.InvalidSyntaxException: This SQL statement is unparsable, please check potential syntax error for SQL
+
+
+Use `sqllineage \-\-dialects` to see all available dialects.
+
+
 Column-Level Lineage
 ====================
 

diff --git a/docs/index.rst b/docs/index.rst
@@ -43,6 +43,7 @@ Behind the scene
    behind_the_scene/how_sqllineage_work
    behind_the_scene/dos_and_donts
    behind_the_scene/column-level_lineage_design
+   behind_the_scene/dialect-awareness_lineage_design
 
 :doc:`behind_the_scene/why_sqllineage`
     The motivation of writing SQLLineage
@@ -53,6 +54,11 @@ Behind the scene
 :doc:`behind_the_scene/dos_and_donts`
     Design principles for SQLLineage
 
+:doc:`behind_the_scene/column-level_lineage_design`
+    Design docs for column lineage
+
+:doc:`behind_the_scene/dialect-awareness_lineage_design`
+    Design docs for dialect-awareness lineage
 
 Basic concepts
 ==============

diff --git a/docs/release_note/changelog.rst b/docs/release_note/changelog.rst
@@ -2,6 +2,40 @@
 Changelog
 *********
 
+v1.4.1
+======
+:Date: April 2, 2023
+
+Bugfix
+-------------
+* frontend app unable to load dialect when launched for the first time
+
+v1.4.0
+======
+:Date: March 31, 2023
+
+Great thanks to Nahuel, Mayur and Pere from OpenMetadata community for contributing on feature Dialect-awareness lineage.
+Leveraging sqlfluff underneath, we're now able to give more correct lineage result with user input on SQL dialect.
+
+Feature
+-------------
+* Dialect-awareness lineage (`#302 <https://github.com/reata/sqllineage/issues/302>`_)
+* support MERGE statement (`#166 <https://github.com/reata/sqllineage/issues/166>`_)
+
+Enhancement
+-------------
+* Use curved lines in lineage graph visualization (`#320 <https://github.com/reata/sqllineage/issues/320>`_)
+* Click to lock highlighted nodes in visualization (`#318 <https://github.com/reata/sqllineage/issues/318>`_)
+* Deprecate support for Python 3.6 and Python 3.7, add support for Python 3.11 (`#319 <https://github.com/reata/sqllineage/issues/319>`_)
+* support t-sql assignment operator (`#205 <https://github.com/reata/sqllineage/issues/205>`_)
+
+Bugfix
+-------------
+* exception when insert into qualified table followed by parenthesized query (`#249 <https://github.com/reata/sqllineage/issues/249>`_)
+* missing columns when current_timestamp as reserved keyword used in select clause (`#248 <https://github.com/reata/sqllineage/issues/248>`_)
+* exception when non-reserved keywords used as column name (`#183 <https://github.com/reata/sqllineage/issues/183>`_)
+* exception when non-reserved keywords used as table name (`#93 <https://github.com/reata/sqllineage/issues/93>`_)
+
 v1.3.7
 ======
 :Date: Oct 22, 2022

diff --git a/mypy.ini b/mypy.ini
@@ -8,6 +8,6 @@ warn_no_return=True
 warn_redundant_casts=True
 warn_unused_ignores=True
 disallow_any_generics=True
-[mypy-sqllineage.core.parser.sqlfluff.utils.sqlfluff]
+[mypy-sqllineage.core.parser.sqlfluff.utils]
 disallow_untyped_calls=False
-warn_return_any = False
+warn_return_any = False
diff --git a/setup.py b/setup.py
@@ -60,6 +60,7 @@ def run(self) -> None:
         "Programming Language :: Python :: 3.8",
         "Programming Language :: Python :: 3.9",
         "Programming Language :: Python :: 3.10",
+        "Programming Language :: Python :: 3.11",
         "Programming Language :: Python :: Implementation :: CPython",
     ],
     python_requires=">=3.7",
@@ -71,14 +72,14 @@ def run(self) -> None:
     entry_points={"console_scripts": ["sqllineage = sqllineage.cli:main"]},
     extras_require={
         "ci": [
-            "bandit==1.7.1",
+            "bandit",
             "black",
             "flake8",
             "flake8-blind-except",
             "flake8-builtins",
             "flake8-import-order",
             "flake8-logging-format",
-            "mypy==0.971",
+            "mypy",
             "pytest",
             "pytest-cov",
             "tox",

diff --git a/sqllineage/cli.py b/sqllineage/cli.py
@@ -65,10 +65,16 @@ def main(args=None) -> None:
     parser.add_argument(
         "-d",
         "--dialect",
-        help="the dialect used to compute the lineage",
+        help="the dialect used to analyze the lineage, use --dialects to show all available dialects",
         type=str,
         default=DEFAULT_DIALECT,
-        metavar="ansi, mysql, snowflake, redshift, hive, etc. Check supported dialects by sqlfluff.",
+        metavar="<dialect>",
+    )
+    parser.add_argument(
+        "-ds",
+        "--dialects",
+        help="list all the available dialects",
+        action="store_true",
     )
     args = parser.parse_args(args)
     if args.e and args.f:
@@ -79,13 +85,13 @@ def main(args=None) -> None:
         sql = extract_sql_from_args(args)
         runner = LineageRunner(
             sql,
+            dialect=args.dialect,
             verbose=args.verbose,
             draw_options={
                 "host": args.host,
                 "port": args.port,
                 "f": args.f if args.f else None,
             },
-            dialect=args.dialect,
         )
         if args.graph_visualization:
             runner.draw(args.dialect)
@@ -95,6 +101,29 @@ def main(args=None) -> None:
             runner.print_table_lineage()
     elif args.graph_visualization:
         return draw_lineage_graph(**{"host": args.host, "port": args.port})
+    elif args.dialects:
+        print(
+            """non-validating
+ansi
+athena
+bigquery
+clickhouse
+databricks
+db2
+exasol
+hive
+materialize
+mysql
+oracle
+postgres
+redshift
+snowflake
+soql
+sparksql
+sqlite
+teradata
+tsql"""
+        )
     else:
         parser.print_help()
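
Reviewer note: the `--dialect` / `--dialects` pair added in this hunk can be tried in isolation with the standard-library sketch below. It is a simplified illustration, not the code from `sqllineage/cli.py`: the `build_parser` and `run` helpers and the `sqllineage-sketch` program name are hypothetical, the dialect list is copied from the text printed in the hunk, and `run` returns a string instead of printing so it is easy to check. One deliberate difference: the sketch tests `--dialects` first, whereas in the hunk that branch is only reached when no SQL input or graph option is given.

```python
import argparse
from typing import List

# Copied from the list printed by the --dialects branch in the hunk above.
DIALECTS = [
    "non-validating", "ansi", "athena", "bigquery", "clickhouse",
    "databricks", "db2", "exasol", "hive", "materialize", "mysql",
    "oracle", "postgres", "redshift", "snowflake", "soql", "sparksql",
    "sqlite", "teradata", "tsql",
]


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical program name; only the two options mirror the hunk.
    parser = argparse.ArgumentParser(prog="sqllineage-sketch")
    parser.add_argument(
        "-d",
        "--dialect",
        help="the dialect used to analyze the lineage",
        type=str,
        default="non-validating",
        metavar="<dialect>",
    )
    parser.add_argument(
        "-ds",
        "--dialects",
        help="list all the available dialects",
        action="store_true",  # a flag: present means True, absent means False
    )
    return parser


def run(argv: List[str]) -> str:
    args = build_parser().parse_args(argv)
    if args.dialects:
        return "\n".join(DIALECTS)
    return f"would analyze with dialect={args.dialect}"
```

For example, `run(["--dialects"])` returns the newline-joined dialect names, and `run(["-d", "ansi"])` reports the selected dialect.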