Learn How to Write a Mini SQL Parser in Python (Step by Step)
Takashi Yamamoto
Infrastructure Engineer · Leapcell

Implementing a Simple SQL Parser in Python: From Principles to Practice
In the field of data processing, SQL (Structured Query Language) is a core tool for interacting with databases. But have you ever wondered how a program understands a statement like `SELECT id, name FROM users WHERE age > 18` when you write it? Behind this lies the power of a SQL parser: it converts human-readable SQL text into structured data (such as an Abstract Syntax Tree) that a program can execute.
SQL parsers are widely used in ORM frameworks (e.g., Django ORM), SQL linting tools (e.g., sqlfluff), and database middleware (e.g., Sharding-JDBC). Starting from parsing principles, this article walks you through implementing a simple SQL parser in Python that supports the core `SELECT ... FROM ... WHERE` statement, helping you understand the essential logic of parsers.
I. The Core of SQL Parsing: Two Key Stages
Parsing any language—including SQL—involves two essential stages: "lexical analysis" and "syntactic analysis". We can analogize this process to "reading an English sentence": first identifying individual words (lexical analysis), then understanding the grammatical structure of the sentence (syntactic analysis).
1. Lexical Analysis: Breaking Down into "Words"
The goal of lexical analysis is to split continuous SQL text into individual Tokens (lexemes) with clear meanings. Tokens are like the "words" of SQL. For example:
- SQL statement: `SELECT id, name FROM users WHERE age > 18`
- Token sequence after splitting: `[SELECT, ID(id), COMMA, ID(name), FROM, ID(users), WHERE, ID(age), GT(>), INT(18)]`
Common SQL Token types include:
- Keywords: `SELECT`, `FROM`, `WHERE` (case-insensitive)
- Identifiers: table names (e.g., `users`), column names (e.g., `id`)
- Literals: numbers (`18`), strings (`'Alice'`)
- Operators: `=` (equality), `>` (greater than), `<` (less than)
- Punctuation: `,` (comma), `*` (asterisk, representing all columns)
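These token types are straightforward to produce even without a parsing library. As a warm-up, here is a minimal, self-contained tokenizer sketch built on Python's standard `re` module; the rule names mirror the list above and are our own naming, and this is an illustration only, not the implementation we use in Part II:

```python
import re

# Minimal, library-free tokenizer sketch (illustrative only).
# Each pair is (token type, regex); order matters within the alternation.
TOKEN_SPEC = [
    ('STRING', r"'[^']*'"),
    ('INT',    r'\d+'),
    ('ID',     r'[a-zA-Z_][a-zA-Z0-9_]*'),  # keywords are reclassified below
    ('GT',     r'>'),
    ('LT',     r'<'),
    ('EQ',     r'='),
    ('COMMA',  r','),
    ('STAR',   r'\*'),
    ('SKIP',   r'\s+'),                     # whitespace is discarded
]
KEYWORDS = {'select', 'from', 'where'}
MASTER_RE = re.compile('|'.join(f'(?P<{name}>{rx})' for name, rx in TOKEN_SPEC))

def tokenize(sql):
    for m in MASTER_RE.finditer(sql):
        kind, value = m.lastgroup, m.group()
        if kind == 'SKIP':
            continue  # ignore spaces, tabs, newlines
        if kind == 'ID' and value.lower() in KEYWORDS:
            kind = value.upper()  # e.g., 'select' -> token type SELECT
        yield (kind, value)

print(list(tokenize("SELECT id, name FROM users WHERE age > 18")))
# [('SELECT', 'SELECT'), ('ID', 'id'), ('COMMA', ','), ('ID', 'name'), ...]
```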
2. Syntactic Analysis: Constructing the "Sentence Structure"
Syntactic analysis uses SQL grammar rules (e.g., "`SELECT` must be followed by column names or `*`, and `FROM` must be followed by a table name") to convert the Token sequence into an Abstract Syntax Tree (AST). An AST is a tree structure that clearly represents the logical hierarchy of the SQL statement. For example, the AST of the SQL statement above can be simplified as:

```
Query
├─ select_clause: [id, name]
├─ from_clause: users
└─ where_clause:
   ├─ column: age
   ├─ operator: >
   └─ value: 18
```
The value of an AST lies in its ability to convert unstructured text into structured data. By traversing the AST, a program can easily retrieve key information such as "which columns to query", "which table to query from", and "what filter conditions to apply".
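To make "traversing the AST" concrete, here is the same tree written as a plain Python dictionary (the shape matches what we build in Part II), along with the lookups a program would perform:

```python
# The AST above, written as a plain Python dictionary
ast = {
    'type': 'Query',
    'select': {'type': 'SelectClause', 'columns': ['id', 'name']},
    'from_clause': {'type': 'FromClause', 'table': 'users'},
    'where_clause': {'type': 'Condition', 'column': 'age', 'operator': '>', 'value': 18},
}

# Structured data turns each key question into a single lookup:
print(ast['select']['columns'])         # which columns to query  -> ['id', 'name']
print(ast['from_clause']['table'])      # which table to query    -> users
print(ast['where_clause']['operator'])  # what filter to apply    -> >
```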
II. Practice: Implementing a Simple SQL Parser in Python
We will implement the parser using Python's `ply` library (Python Lex-Yacc). `ply` is a library that reimplements the classic compiler-construction tools `lex` (for lexical analysis) and `yacc` (for syntactic analysis). It is easy to get started with and maps closely onto the core logic of parsers.
1. Environment Preparation
First, install the `ply` library:

```bash
pip install ply
```
2. Step 1: Implement the Lexical Analyzer (Lexer)
The core of a lexical analyzer is to use regular expressions to match different types of Tokens and ignore irrelevant characters such as spaces and comments.
Code Implementation (Lexer)
```python
import ply.lex as lex

# 1. Define Token types (must be declared first, or ply raises an error)
tokens = (
    'SELECT', 'FROM', 'WHERE',   # Keywords
    'ID', 'INT', 'STRING',       # Identifiers and literals
    'EQ', 'GT', 'LT',            # Operators (equality, greater than, less than)
    'COMMA', 'STAR'              # Punctuation (comma, asterisk)
)

# 2. Keyword table (checked inside t_ID, since keywords also consist of letters)
reserved = {
    'select': 'SELECT',
    'from': 'FROM',
    'where': 'WHERE'
}

# 3. Token rules (function rules are tried in the order they are defined)

# String literals: enclosed in single quotes, e.g., 'Alice'
def t_STRING(t):
    r"'[^']*'"  # Regex: match any characters (excluding single quotes) inside single quotes
    t.value = t.value[1:-1]  # Strip the quotes, keeping the actual content
    return t

# Integer literals: sequences of digits
def t_INT(t):
    r'\d+'
    t.value = int(t.value)  # Convert to an integer
    return t

# Identifiers (table/column names): a letter or underscore, then letters/digits/underscores
def t_ID(t):
    r'[a-zA-Z_][a-zA-Z0-9_]*'
    # Reclassify keywords (e.g., 'select' must become SELECT, not ID)
    t.type = reserved.get(t.value.lower(), 'ID')
    return t

# Operators
t_EQ = r'='   # Equality
t_GT = r'>'   # Greater than
t_LT = r'<'   # Less than

# Punctuation
t_COMMA = r','   # Comma
t_STAR = r'\*'   # Asterisk (escaped because * has a special meaning in regex)

# 4. Ignore irrelevant characters (spaces, tabs, newlines)
t_ignore = ' \t\n'

# 5. Error handling (triggered when an unrecognizable character is encountered)
def t_error(t):
    print(f"Illegal character: '{t.value[0]}'")
    t.lexer.skip(1)  # Skip the invalid character and continue

# 6. Create a Lexer instance
lexer = lex.lex()

# Test the Lexer: feed in SQL text and print the Token sequence
def test_lexer(sql):
    lexer.input(sql)
    print("Lexical Analysis Result (Token Sequence):")
    while True:
        tok = lexer.token()
        if not tok:
            break
        print(f"Type: {tok.type:10}, Value: {tok.value}")

# Run the test only when this file is executed directly,
# not when Step 2 imports it as a module
if __name__ == '__main__':
    test_sql = "SELECT id, name FROM users WHERE age > 18 AND name = 'Alice'"
    test_lexer(test_sql)
```
Running Result
```
Lexical Analysis Result (Token Sequence):
Type: SELECT    , Value: SELECT
Type: ID        , Value: id
Type: COMMA     , Value: ,
Type: ID        , Value: name
Type: FROM      , Value: FROM
Type: ID        , Value: users
Type: WHERE     , Value: WHERE
Type: ID        , Value: age
Type: GT        , Value: >
Type: INT       , Value: 18
Type: ID        , Value: AND    # Note: AND is not defined as a keyword yet, so it is recognized as ID
Type: ID        , Value: name
Type: EQ        , Value: =
Type: STRING    , Value: Alice
```
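As flagged in the output above, `AND` is not in our keyword table yet, so it falls through to `ID`. Teaching the lexer about it is a small change to the two definitions at the top of the file, sketched below; note that the parser in Step 2 would also need a grammar rule for multi-condition `WHERE` clauses before this token becomes useful:

```python
# Sketch: registering AND as a keyword in lexer.py
tokens = (
    'SELECT', 'FROM', 'WHERE', 'AND',  # 'AND' added to the token list
    'ID', 'INT', 'STRING',
    'EQ', 'GT', 'LT',
    'COMMA', 'STAR'
)

reserved = {
    'select': 'SELECT',
    'from': 'FROM',
    'where': 'WHERE',
    'and': 'AND',  # t_ID now reclassifies 'and'/'AND' as the AND token
}
```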
3. Step 2: Implement the Syntactic Analyzer (Parser)
The core of a syntactic analyzer is to define SQL grammar rules and use them to convert the Token sequence into an AST. We will support the most essential query syntax:

```
SELECT (column_list | *) FROM table_name [WHERE condition]
-- where a condition has the form: column operator value
```
Code Implementation (Parser)
```python
import ply.yacc as yacc
from lexer import tokens  # Import the Token types defined in Step 1 (saved as lexer.py)

# 1. Define AST nodes (represented by dictionaries for simplicity and clarity)
def create_ast(node_type, **kwargs):
    return {'type': node_type, **kwargs}

# 2. Define grammar rules (the first rule defined, 'query', is the start symbol)

# Starting rule: query = SELECT clause + FROM clause + [WHERE clause]
def p_query(p):
    '''query : select_clause from_clause where_clause_opt'''
    # p[0] is the rule's result; p[1] is select_clause, p[2] is from_clause, p[3] is where_clause_opt
    p[0] = create_ast(
        'Query',
        select=p[1],
        from_clause=p[2],
        where_clause=p[3] if p[3] else None  # Optional clause; None if absent
    )

# Optional WHERE clause: either present or absent
def p_where_clause_opt(p):
    '''where_clause_opt : WHERE condition
                        | empty'''
    if len(p) == 3:  # Matched "WHERE condition"
        p[0] = p[2]
    else:            # Matched "empty" (no WHERE clause)
        p[0] = None

# SELECT clause: SELECT + (asterisk / column list)
def p_select_clause(p):
    '''select_clause : SELECT STAR
                     | SELECT column_list'''
    if p[2] == '*':  # Matched "SELECT *"
        p[0] = create_ast('SelectClause', columns=['*'])
    else:            # Matched "SELECT column_list"
        p[0] = create_ast('SelectClause', columns=p[2])

# Column list: one or more IDs separated by commas (e.g., id, name, age)
def p_column_list(p):
    '''column_list : ID
                   | column_list COMMA ID'''
    if len(p) == 2:  # Single column (e.g., id)
        p[0] = [p[1]]
    else:            # Multiple columns (column_list COMMA ID)
        p[0] = p[1] + [p[3]]

# FROM clause: FROM + table name (e.g., FROM users)
def p_from_clause(p):
    '''from_clause : FROM ID'''
    p[0] = create_ast('FromClause', table=p[2])

# Condition: column + operator + value (e.g., age > 18 or name = 'Alice')
def p_condition(p):
    '''condition : ID EQ INT
                 | ID EQ STRING
                 | ID GT INT
                 | ID LT INT'''
    p[0] = create_ast(
        'Condition',
        column=p[1],
        operator=p[2],
        value=p[3]
    )

# Empty rule (used for optional clauses)
def p_empty(p):
    '''empty :'''
    p[0] = None

# Syntax error handling
def p_error(p):
    if p:
        print(f"Syntax error: near Token {p.type} (Value: {p.value})")
    else:
        print("Syntax error: unexpected end of input")

# Create a Parser instance
parser = yacc.yacc()

# Test the Parser: parse SQL text and print the AST
def parse_sql(sql):
    ast = parser.parse(sql)
    print("\nSyntactic Analysis Result (AST):")
    import json  # Format the output as JSON for readability
    print(json.dumps(ast, indent=2))

# Test cases (cover asterisk, multiple columns, and INT/STRING conditions)
test_sql1 = "SELECT id, name FROM users WHERE age > 18"
test_sql2 = "SELECT * FROM orders WHERE product = 'phone'"
parse_sql(test_sql1)
parse_sql(test_sql2)
```
Running Result (AST)
AST of the first test case (`SELECT id, name FROM users WHERE age > 18`):

```json
{
  "type": "Query",
  "select": {
    "type": "SelectClause",
    "columns": [
      "id",
      "name"
    ]
  },
  "from_clause": {
    "type": "FromClause",
    "table": "users"
  },
  "where_clause": {
    "type": "Condition",
    "column": "age",
    "operator": ">",
    "value": 18
  }
}
```
AST of the second test case (`SELECT * FROM orders WHERE product = 'phone'`):

```json
{
  "type": "Query",
  "select": {
    "type": "SelectClause",
    "columns": [
      "*"
    ]
  },
  "from_clause": {
    "type": "FromClause",
    "table": "orders"
  },
  "where_clause": {
    "type": "Condition",
    "column": "product",
    "operator": "=",
    "value": "phone"
  }
}
```
III. Applying the Parsing Result: The Value of AST
Once you have the AST, you can do many things with it. For example, you can write a simple "query interpreter" to convert the AST into a natural language description:
```python
def interpret_ast(ast):
    if ast['type'] != 'Query':
        return "Unsupported statement type"
    # Interpret the SELECT clause
    select_cols = ', '.join(ast['select']['columns'])
    select_desc = f"Query columns: {select_cols}"
    # Interpret the FROM clause
    from_desc = f"From table: {ast['from_clause']['table']}"
    # Interpret the WHERE clause
    where_desc = ""
    if ast['where_clause']:
        cond = ast['where_clause']
        where_desc = f", Filter condition: {cond['column']} {cond['operator']} {cond['value']}"
    return f"Execution logic: {select_desc} {from_desc}{where_desc}"

# Test the interpreter
ast1 = parser.parse(test_sql1)
print(interpret_ast(ast1))
# Output: Execution logic: Query columns: id, name From table: users, Filter condition: age > 18
```
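Another common application is the reverse direction: regenerating normalized SQL text from the AST, which is essentially what SQL formatters and rewriters do. Below is a minimal sketch against the dictionary AST from Step 2 (`ast_to_sql` is a name introduced here for illustration, not part of any library):

```python
def ast_to_sql(ast):
    """Regenerate normalized SQL text from the dictionary AST (minimal sketch)."""
    if ast['type'] != 'Query':
        raise ValueError("Unsupported statement type")
    sql = f"SELECT {', '.join(ast['select']['columns'])} FROM {ast['from_clause']['table']}"
    cond = ast['where_clause']
    if cond:
        value = cond['value']
        if isinstance(value, str):
            value = f"'{value}'"  # re-quote string literals; the lexer stripped the quotes
        sql += f" WHERE {cond['column']} {cond['operator']} {value}"
    return sql

# Messy input comes out normalized (keyword case, spacing, comma style):
print(ast_to_sql(parser.parse("select   id,name from users where age>18")))
# Output: SELECT id, name FROM users WHERE age > 18
```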
IV. Limitations and Advanced Directions
The parser implemented in this article only supports the most basic SQL syntax and has obvious limitations:
- It does not support complex syntax such as multi-table joins (`JOIN`), aggregate functions (`COUNT`, `SUM`), or grouping (`GROUP BY`);
- It does not perform semantic analysis (e.g., checking whether tables/columns exist or whether data types match);
- It does not handle edge cases such as SQL comments or escaped quotes inside string literals (mixed-case keywords like `Select` are already covered, since the lexer lowercases identifiers before the keyword lookup).
For more practical SQL parsing capabilities, you can explore the following directions:
- Use mature libraries: for industrial-grade scenarios, prefer libraries such as `sqlparse` (Python) or `antlr4` (cross-language); `sqlparse` can split and parse complex SQL statements into token trees out of the box;
- Extend the grammar rules: add rules for `JOIN`, `GROUP BY`, etc., on top of `ply`, paying attention to operator precedence (e.g., `AND` binds tighter than `OR`);
- Semantic analysis: add a "table structure check" step after syntactic analysis to verify that the columns in the `SELECT` clause actually exist in the table named by the `FROM` clause (a minimal sketch follows this list);
- Query optimization: optimize queries based on the AST (e.g., pushing down filter conditions, selecting indexes); this is a core capability of database kernels.
V. Conclusion
The essence of SQL parsing is converting text into structured data, which relies on two core steps: splitting the text into Tokens via lexical analysis and building an AST via syntactic analysis. Although the simple parser implemented in this article with `ply` is not sufficient for production environments, it helps you understand how parsers work from the inside.
Leapcell: The Best of Serverless Web Hosting
Finally, we recommend an excellent platform for deploying Python services: Leapcell
🚀 Build with Your Favorite Language
Develop effortlessly in JavaScript, Python, Go, or Rust.
🌍 Deploy Unlimited Projects for Free
Only pay for what you use—no requests, no charges.
⚡ Pay-as-You-Go, No Hidden Costs
No idle fees, just seamless scalability.
🔹 Follow us on Twitter: @LeapcellHQ