Skip to content

Udacity Data Engineering Nanodegree program Project 1. Data Modeling with Postgres

Notifications You must be signed in to change notification settings

parkthebus/UdacityProject_1-DataModeling-with-Postgres

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

18 Commits
 
 
 
 
 
 
 
 

Repository files navigation

DataModeling-with-Postgres

Udacity Data Engineering Nanodegree program Project 1. Data Modeling with Postgres

How to start

  1. Create Tables

python create_table.py

  1. Run ETL script

python etl.py

Description

create_tables.py

create and drop tables

sql_queries.py

contains all sql queries (CREATE, DROP, INSERT, SELECT)

etl.py

  • process_song_file(cur, filepath)
  • open song file(dataframe) and insert song, artist data into song_table and artist_table
  • the inserted data is list of dataframe's specific columns at index 0.
  • for example. select 'song_id','title','artist_id','year','duration' columns in dataframe as df and get values into list.

list(df[['song_id','title','artist_id','year','duration']].values[0])

  • process_log_file(cur, filepath)
  • open log file(dataframe) and filter by page==NextSong.
  • convert timestamp columnn to datetime
  • use dataframe.assign() and datetimelibrary.
  • divide 1000 and convert each rows of timestamp_data('ts') in dataframe(df) to datetime using datetime library.
t = df.assign(ts=[datetime.datetime.fromtimestamp(ts/1000) for ts in list(df['ts'])])
  • extract timestamp, hour, day, week of year, month, year, and weekday data from timespamp converted dataframe and insert time data records into time_table.
  • select user information from dataframe and insert into user_table.
  • get songid and artistid from song and artist table using JOIN query.
  1. get song, artist and length(duration) rows from dataframe iteration.
  2. find songs from JOIN table using song, artist, length data *
    
    SELECT s.song_id, a.artist_id
    FROM songs as s JOIN artists as a ON s.artist_id = a.artist_id
    WHERE s.title = %s AND a.name = %s AND s.duration = %s;
    
    3.create songplay_data which insert into songplay_table.
  • get index from iteration enumerated number
  • get timestamp, userid, level, sessionid, location, useragent from dataframe's column
  • and use songid, artistid we've got above.
songplay_data = (index, row.ts, row.userId, row.level, songid, artistid, row.sessionId, row.location, row.userAgent)
cur.execute(songplay_table_insert, songplay_data)
  • process_data(cur, conn, filepath, func)
  • get all files matching extension from directory
  • get total number of files found
  • iterate over files and process

About

Udacity Data Engineering Nanodegree program Project 1. Data Modeling with Postgres

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages