Learn How to Write a Mini SQL Parser in Python (Step by Step)
Takashi Yamamoto
Infrastructure Engineer · Leapcell

Implementing a Simple SQL Parser in Python: From Principles to Practice
In the field of data processing, SQL (Structured Query Language) is a core tool for interacting with databases. But have you ever wondered how a program understands a statement like `SELECT id, name FROM users WHERE age > 18` when you write it? Behind this lies the power of a SQL parser: it converts human-readable SQL text into structured data (such as an Abstract Syntax Tree) that a program can execute.
SQL parsers are widely used in ORM frameworks (e.g., Django ORM), SQL linting tools (e.g., sqlfluff), and database middleware (e.g., Sharding-JDBC). Starting from parsing principles, this article walks you through implementing a simple SQL parser in Python that supports the core `SELECT ... FROM ... WHERE` statement, helping you understand the essential logic of parsers.
I. The Core of SQL Parsing: Two Key Stages
Parsing any language—including SQL—involves two essential stages: "lexical analysis" and "syntactic analysis". We can analogize this process to "reading an English sentence": first identifying individual words (lexical analysis), then understanding the grammatical structure of the sentence (syntactic analysis).
1. Lexical Analysis: Breaking Down into "Words"
The goal of lexical analysis is to split continuous SQL text into individual Tokens (lexemes) with clear meanings. Tokens are like the "words" of SQL. For example:
- SQL statement: `SELECT id, name FROM users WHERE age > 18`
- Token sequence after splitting: `[SELECT, ID(id), COMMA, ID(name), FROM, ID(users), WHERE, ID(age), GT(>), INT(18)]`
Common SQL Token types include:
- Keywords: `SELECT`, `FROM`, `WHERE` (case-insensitive)
- Identifiers: table names (e.g., `users`), column names (e.g., `id`)
- Literals: numbers (`18`), strings (`'Alice'`)
- Operators: `=` (equality), `>` (greater than), `<` (less than)
- Punctuation: `,` (comma), `*` (asterisk, representing all columns)
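These token types are straightforward to produce even without a parsing library. As a warm-up, here is a minimal, self-contained tokenizer sketch built on Python's standard `re` module; the rule names mirror the list above and are our own naming, and this is an illustration only, not the implementation we use in Part II:

```python
import re

# Minimal, library-free tokenizer sketch (illustrative only).
# Each pair is (token type, regex); order matters within the alternation.
TOKEN_SPEC = [
    ('STRING', r"'[^']*'"),
    ('INT',    r'\d+'),
    ('ID',     r'[a-zA-Z_][a-zA-Z0-9_]*'),  # keywords are reclassified below
    ('GT',     r'>'),
    ('LT',     r'<'),
    ('EQ',     r'='),
    ('COMMA',  r','),
    ('STAR',   r'\*'),
    ('SKIP',   r'\s+'),                     # whitespace is discarded
]
KEYWORDS = {'select', 'from', 'where'}
MASTER_RE = re.compile('|'.join(f'(?P<{name}>{rx})' for name, rx in TOKEN_SPEC))

def tokenize(sql):
    for m in MASTER_RE.finditer(sql):
        kind, value = m.lastgroup, m.group()
        if kind == 'SKIP':
            continue  # ignore spaces, tabs, newlines
        if kind == 'ID' and value.lower() in KEYWORDS:
            kind = value.upper()  # e.g., 'select' -> token type SELECT
        yield (kind, value)

print(list(tokenize("SELECT id, name FROM users WHERE age > 18")))
# [('SELECT', 'SELECT'), ('ID', 'id'), ('COMMA', ','), ('ID', 'name'), ...]
```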
2. Syntactic Analysis: Constructing the "Sentence Structure"
Syntactic analysis uses SQL grammar rules (e.g., "`SELECT` must be followed by column names or `*`, and `FROM` must be followed by a table name") to convert the Token sequence into an Abstract Syntax Tree (AST). An AST is a tree structure that clearly represents the logical hierarchy of the SQL statement. For example, the AST of the SQL statement above can be simplified as:

```
Query
├─ select_clause: [id, name]
├─ from_clause: users
└─ where_clause:
   ├─ column: age
   ├─ operator: >
   └─ value: 18
```
The value of an AST lies in its ability to convert unstructured text into structured data. By traversing the AST, a program can easily retrieve key information such as "which columns to query", "which table to query from", and "what filter conditions to apply".
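To make "traversing the AST" concrete, here is the same tree written as a plain Python dictionary (the shape matches what we build in Part II), along with the lookups a program would perform:

```python
# The AST above, written as a plain Python dictionary
ast = {
    'type': 'Query',
    'select': {'type': 'SelectClause', 'columns': ['id', 'name']},
    'from_clause': {'type': 'FromClause', 'table': 'users'},
    'where_clause': {'type': 'Condition', 'column': 'age', 'operator': '>', 'value': 18},
}

# Structured data turns each key question into a single lookup:
print(ast['select']['columns'])         # which columns to query  -> ['id', 'name']
print(ast['from_clause']['table'])      # which table to query    -> users
print(ast['where_clause']['operator'])  # what filter to apply    -> >
```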
II. Practice: Implementing a Simple SQL Parser in Python
We will implement the parser using Python's `ply` library (Python Lex-Yacc). `ply` is a library that reimplements the classic compiler-construction tools `lex` (for lexical analysis) and `yacc` (for syntactic analysis). It is easy to get started with and maps closely onto the core logic of parsers.
1. Environment Preparation
First, install the `ply` library:

```bash
pip install ply
```
2. Step 1: Implement the Lexical Analyzer (Lexer)
The core of a lexical analyzer is to use regular expressions to match different types of Tokens and ignore irrelevant characters such as spaces and comments.
Code Implementation (Lexer)
```python
import ply.lex as lex

# 1. Define Token types (must be declared first, or ply raises an error)
tokens = (
    'SELECT', 'FROM', 'WHERE',   # Keywords
    'ID', 'INT', 'STRING',       # Identifiers and literals
    'EQ', 'GT', 'LT',            # Operators (equality, greater than, less than)
    'COMMA', 'STAR'              # Punctuation (comma, asterisk)
)

# 2. Keyword table (checked inside t_ID, since keywords also consist of letters)
reserved = {
    'select': 'SELECT',
    'from': 'FROM',
    'where': 'WHERE'
}

# 3. Token rules (function rules are tried in the order they are defined)

# String literals: enclosed in single quotes, e.g., 'Alice'
def t_STRING(t):
    r"'[^']*'"  # Regex: match any characters (excluding single quotes) inside single quotes
    t.value = t.value[1:-1]  # Strip the quotes, keeping the actual content
    return t

# Integer literals: sequences of digits
def t_INT(t):
    r'\d+'
    t.value = int(t.value)  # Convert to an integer
    return t

# Identifiers (table/column names): a letter or underscore, then letters/digits/underscores
def t_ID(t):
    r'[a-zA-Z_][a-zA-Z0-9_]*'
    # Reclassify keywords (e.g., 'select' must become SELECT, not ID)
    t.type = reserved.get(t.value.lower(), 'ID')
    return t

# Operators
t_EQ = r'='   # Equality
t_GT = r'>'   # Greater than
t_LT = r'<'   # Less than

# Punctuation
t_COMMA = r','   # Comma
t_STAR = r'\*'   # Asterisk (escaped because * has a special meaning in regex)

# 4. Ignore irrelevant characters (spaces, tabs, newlines)
t_ignore = ' \t\n'

# 5. Error handling (triggered when an unrecognizable character is encountered)
def t_error(t):
    print(f"Illegal character: '{t.value[0]}'")
    t.lexer.skip(1)  # Skip the invalid character and continue

# 6. Create a Lexer instance
lexer = lex.lex()

# Test the Lexer: feed in SQL text and print the Token sequence
def test_lexer(sql):
    lexer.input(sql)
    print("Lexical Analysis Result (Token Sequence):")
    while True:
        tok = lexer.token()
        if not tok:
            break
        print(f"Type: {tok.type:10}, Value: {tok.value}")

# Run the test only when this file is executed directly,
# not when Step 2 imports it as a module
if __name__ == '__main__':
    test_sql = "SELECT id, name FROM users WHERE age > 18 AND name = 'Alice'"
    test_lexer(test_sql)
```
Running Result
```
Lexical Analysis Result (Token Sequence):
Type: SELECT    , Value: SELECT
Type: ID        , Value: id
Type: COMMA     , Value: ,
Type: ID        , Value: name
Type: FROM      , Value: FROM
Type: ID        , Value: users
Type: WHERE     , Value: WHERE
Type: ID        , Value: age
Type: GT        , Value: >
Type: INT       , Value: 18
Type: ID        , Value: AND    # Note: AND is not defined as a keyword yet, so it is recognized as ID
Type: ID        , Value: name
Type: EQ        , Value: =
Type: STRING    , Value: Alice
```
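As flagged in the output above, `AND` is not in our keyword table yet, so it falls through to `ID`. Teaching the lexer about it is a small change to the two definitions at the top of the file, sketched below; note that the parser in Step 2 would also need a grammar rule for multi-condition `WHERE` clauses before this token becomes useful:

```python
# Sketch: registering AND as a keyword in lexer.py
tokens = (
    'SELECT', 'FROM', 'WHERE', 'AND',  # 'AND' added to the token list
    'ID', 'INT', 'STRING',
    'EQ', 'GT', 'LT',
    'COMMA', 'STAR'
)

reserved = {
    'select': 'SELECT',
    'from': 'FROM',
    'where': 'WHERE',
    'and': 'AND',  # t_ID now reclassifies 'and'/'AND' as the AND token
}
```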
3. Step 2: Implement the Syntactic Analyzer (Parser)
The core of a syntactic analyzer is to define SQL grammar rules and use them to convert the Token sequence into an AST. We will support the most essential query syntax:

```
SELECT (column_list | *) FROM table_name [WHERE condition]
-- where a condition has the form: column operator value
```
Code Implementation (Parser)
```python
import ply.yacc as yacc
from lexer import tokens  # Import the Token types defined in Step 1 (saved as lexer.py)

# 1. Define AST nodes (represented by dictionaries for simplicity and clarity)
def create_ast(node_type, **kwargs):
    return {'type': node_type, **kwargs}

# 2. Define grammar rules (the first rule defined, 'query', is the start symbol)

# Starting rule: query = SELECT clause + FROM clause + [WHERE clause]
def p_query(p):
    '''query : select_clause from_clause where_clause_opt'''
    # p[0] is the rule's result; p[1] is select_clause, p[2] is from_clause, p[3] is where_clause_opt
    p[0] = create_ast(
        'Query',
        select=p[1],
        from_clause=p[2],
        where_clause=p[3] if p[3] else None  # Optional clause; None if absent
    )

# Optional WHERE clause: either present or absent
def p_where_clause_opt(p):
    '''where_clause_opt : WHERE condition
                        | empty'''
    if len(p) == 3:  # Matched "WHERE condition"
        p[0] = p[2]
    else:            # Matched "empty" (no WHERE clause)
        p[0] = None

# SELECT clause: SELECT + (asterisk / column list)
def p_select_clause(p):
    '''select_clause : SELECT STAR
                     | SELECT column_list'''
    if p[2] == '*':  # Matched "SELECT *"
        p[0] = create_ast('SelectClause', columns=['*'])
    else:            # Matched "SELECT column_list"
        p[0] = create_ast('SelectClause', columns=p[2])

# Column list: one or more IDs separated by commas (e.g., id, name, age)
def p_column_list(p):
    '''column_list : ID
                   | column_list COMMA ID'''
    if len(p) == 2:  # Single column (e.g., id)
        p[0] = [p[1]]
    else:            # Multiple columns (column_list COMMA ID)
        p[0] = p[1] + [p[3]]

# FROM clause: FROM + table name (e.g., FROM users)
def p_from_clause(p):
    '''from_clause : FROM ID'''
    p[0] = create_ast('FromClause', table=p[2])

# Condition: column + operator + value (e.g., age > 18 or name = 'Alice')
def p_condition(p):
    '''condition : ID EQ INT
                 | ID EQ STRING
                 | ID GT INT
                 | ID LT INT'''
    p[0] = create_ast(
        'Condition',
        column=p[1],
        operator=p[2],
        value=p[3]
    )

# Empty rule (used for optional clauses)
def p_empty(p):
    '''empty :'''
    p[0] = None

# Syntax error handling
def p_error(p):
    if p:
        print(f"Syntax error: near Token {p.type} (Value: {p.value})")
    else:
        print("Syntax error: unexpected end of input")

# Create a Parser instance
parser = yacc.yacc()

# Test the Parser: parse SQL text and print the AST
def parse_sql(sql):
    ast = parser.parse(sql)
    print("\nSyntactic Analysis Result (AST):")
    import json  # Format the output as JSON for readability
    print(json.dumps(ast, indent=2))

# Test cases (cover asterisk, multiple columns, and INT/STRING conditions)
test_sql1 = "SELECT id, name FROM users WHERE age > 18"
test_sql2 = "SELECT * FROM orders WHERE product = 'phone'"
parse_sql(test_sql1)
parse_sql(test_sql2)
```
Running Result (AST)
AST of the first test case (`SELECT id, name FROM users WHERE age > 18`):

```json
{
  "type": "Query",
  "select": {
    "type": "SelectClause",
    "columns": [
      "id",
      "name"
    ]
  },
  "from_clause": {
    "type": "FromClause",
    "table": "users"
  },
  "where_clause": {
    "type": "Condition",
    "column": "age",
    "operator": ">",
    "value": 18
  }
}
```
AST of the second test case (`SELECT * FROM orders WHERE product = 'phone'`):

```json
{
  "type": "Query",
  "select": {
    "type": "SelectClause",
    "columns": [
      "*"
    ]
  },
  "from_clause": {
    "type": "FromClause",
    "table": "orders"
  },
  "where_clause": {
    "type": "Condition",
    "column": "product",
    "operator": "=",
    "value": "phone"
  }
}
```
III. Applying the Parsing Result: The Value of AST
Once you have the AST, you can do many things with it. For example, you can write a simple "query interpreter" to convert the AST into a natural language description:
```python
def interpret_ast(ast):
    if ast['type'] != 'Query':
        return "Unsupported statement type"
    # Interpret the SELECT clause
    select_cols = ', '.join(ast['select']['columns'])
    select_desc = f"Query columns: {select_cols}"
    # Interpret the FROM clause
    from_desc = f"From table: {ast['from_clause']['table']}"
    # Interpret the WHERE clause
    where_desc = ""
    if ast['where_clause']:
        cond = ast['where_clause']
        where_desc = f", Filter condition: {cond['column']} {cond['operator']} {cond['value']}"
    return f"Execution logic: {select_desc} {from_desc}{where_desc}"

# Test the interpreter
ast1 = parser.parse(test_sql1)
print(interpret_ast(ast1))
# Output: Execution logic: Query columns: id, name From table: users, Filter condition: age > 18
```
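Another common application is the reverse direction: regenerating normalized SQL text from the AST, which is essentially what SQL formatters and rewriters do. Below is a minimal sketch against the dictionary AST from Step 2 (`ast_to_sql` is a name introduced here for illustration, not part of any library):

```python
def ast_to_sql(ast):
    """Regenerate normalized SQL text from the dictionary AST (minimal sketch)."""
    if ast['type'] != 'Query':
        raise ValueError("Unsupported statement type")
    sql = f"SELECT {', '.join(ast['select']['columns'])} FROM {ast['from_clause']['table']}"
    cond = ast['where_clause']
    if cond:
        value = cond['value']
        if isinstance(value, str):
            value = f"'{value}'"  # re-quote string literals; the lexer stripped the quotes
        sql += f" WHERE {cond['column']} {cond['operator']} {value}"
    return sql

# Messy input comes out normalized (keyword case, spacing, comma style):
print(ast_to_sql(parser.parse("select   id,name from users where age>18")))
# Output: SELECT id, name FROM users WHERE age > 18
```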
IV. Limitations and Advanced Directions
The parser implemented in this article only supports the most basic SQL syntax and has obvious limitations:
- It does not support complex syntax such as multi-table joins (`JOIN`), aggregate functions (`COUNT`, `SUM`), or grouping (`GROUP BY`);
- It does not perform semantic analysis (e.g., checking whether tables/columns exist or whether data types match);
- It does not handle edge cases such as SQL comments or escaped quotes inside string literals (mixed-case keywords like `Select` are already covered, since the lexer lowercases identifiers before the keyword lookup).
For more practical SQL parsing capabilities, you can explore the following directions:
- Use mature libraries: for industrial-grade scenarios, prefer libraries such as `sqlparse` (Python) or `antlr4` (cross-language); `sqlparse` can split and parse complex SQL statements into token trees out of the box;
- Extend the grammar rules: add rules for `JOIN`, `GROUP BY`, etc., on top of `ply`, paying attention to operator precedence (e.g., `AND` binds tighter than `OR`);
- Semantic analysis: add a "table structure check" step after syntactic analysis to verify that the columns in the `SELECT` clause actually exist in the table named by the `FROM` clause (a minimal sketch follows this list);
- Query optimization: optimize queries based on the AST (e.g., pushing down filter conditions, selecting indexes); this is a core capability of database kernels.
V. Conclusion
The essence of SQL parsing is converting text into structured data, which relies on two core steps: splitting the text into Tokens via lexical analysis and building an AST via syntactic analysis. Although the simple parser implemented in this article with `ply` is not sufficient for production environments, it helps you understand how parsers work from the inside.
Leapcell: The Best of Serverless Web Hosting
Finally, we recommend an excellent platform for deploying Python services: Leapcell
🚀 Build with Your Favorite Language
Develop effortlessly in JavaScript, Python, Go, or Rust.
🌍 Deploy Unlimited Projects for Free
Only pay for what you use—no requests, no charges.
⚡ Pay-as-You-Go, No Hidden Costs
No idle fees, just seamless scalability.
🔹 Follow us on Twitter: @LeapcellHQ