Data Analyst Business English Course
    About Lesson

    SQL Databases with Pandas and Python

    The lesson is based on the tutorial by Rob Mulla.

    Vocabulary Part 1

    Intro to Databases & Manual Exports (Until 04:55)

    Listening practice 1 (Until 04:55)

    Part 1 Summary: Intro to Databases & Manual Exports (Until 04:55)

    • The Problem with Local Files: While storing data in individual files (like CSVs or Parquet) works for small projects, real organizations use centralized Relational Databases to store data efficiently and allow multiple users to query exactly what they need.

    • Local Setup with Docker: To test database interactions without a complex local installation, data analysts often use Docker images to quickly spin up a pre-configured database (like the MySQL fake employee database used in the video) on a specific local port (e.g., 3306).

    • Database Visualizers (DBeaver): Before writing code, it is best practice to connect via database management software like DBeaver to view tables, check entity-relationship (ER) diagrams, and test initial SQL queries.

    • The “Intermediate” Workflow: A common but inefficient way to get data into Python is to run an SQL query in a visualizer, export the result as a CSV file, and then load that file into Pandas using pd.read_csv().

    Vocabulary Part 2

    Direct Python Connections & Cursors (04:55 – 11:24)

    Listening practice 2 (04:55 – 11:24)

    Part 2 Summary: Direct Python Connections & Cursors (04:55 – 11:24)

    • Direct Database Connectors: To eliminate the manual CSV step, Python can use specific database connectors (like mysql-connector-python). You establish a direct connection by passing a connection string containing the host, port, username, password, and database name.

    • The Low-Level Method (Cursors):

      • Once connected, you create a cursor object to execute raw SQL queries.

      • The cursor doesn’t load all data into memory at once; it points to the results and streams them one row at a time, requiring you to manually iterate over it and append rows to a Python list.

      • Crucially, you must always explicitly close the cursor and connection when done to free up database resources.

    • The Pandas Advantage (read_sql): Tracking connectors and manually closing cursors is tedious. A more modern, efficient approach uses SQLAlchemy to create a database engine, allowing Pandas to directly execute a query and instantly format the output into a structured DataFrame using pd.read_sql().

    Vocabulary Part 3

    Advanced Writing & Appending Data (11:24 – End)

    Listening practice 3 (11:24 – End)

    Part 3 Summary: Advanced Writing & Appending Data (11:24 – End)

    • Writing Back to the DB (to_sql): Data analysts don’t just read data; they also save their engineered features or aggregated results back to the database using the Pandas .to_sql() method.

    • Handling the “Table Already Exists” Error: By default, .to_sql() expects a brand-new table. If you run the code a second time on an existing table, Python will throw an error and fail.

    • The if_exists Parameter: To control how Pandas interacts with an existing database table, you must explicitly set the if_exists argument:

      • fail (Default): Stops the script and raises an error.

      • replace: Drastically drops (deletes) the old database table and creates a new one with the current DataFrame.

      • append: Keeps the existing database rows and simply inserts the new DataFrame rows at the bottom.

    • Tracking Changes with Timestamps: When logging repeated data over time (like daily max salary calculations), adding a create_date timestamp column to your DataFrame before using append is essential for tracking when each row of data was calculated.