MS in Marketing and Retail Science
Dealing with Data & Introduction to Python Programming
Course code: SHBI-GB 7304 B20
Overview
This course is the recommended first course for students who 1) want to
work in the rapidly growing fields of data science and data analytics or
2) want to acquire the technical and data analysis skills needed in
other disciplines such as finance and marketing. The course provides an
introduction to programming (using Python) and covers the collection,
storage, organization, management, and analysis of data, both structured
(record-based) and unstructured (such as text).
Course Objectives
At a very high level, the course will teach you Python and SQL, plus a few
Unix tools that are useful for everyday data handling and processing. At
the completion of this course, you should be able to:
- Write simple programs for a variety of data handling tasks (e.g., fetching
  data from the web, processing data, etc.)
- Retrieve and manage data coming in a variety of formats and from
  different sources
- Store and query data in relational databases
- Visualize and effectively present data
Software that we will not use or cover
- We do not plan on using R. While it is a very useful open source tool
  for data analysis and visualization, we can achieve the same results
  using Python. Furthermore, Python can handle bigger datasets more easily,
  is “cleaner” as a language, and can be used for many more
  purposes than R. Therefore, to minimize the need to learn multiple tools
  to achieve the same goal, we standardize on using Python.
- We do not plan on using Tableau, or any other visualization technology
  (e.g., D3.js). There is a separate Data Visualization course.
Help and Office
Topics
- Introduction to programming using Python
- Data modeling and ER model
- Relational databases and SQL
- Basics of data analysis and visualization
Prerequisites
None
Important Information
Since this is a hands-on course, you must bring your laptop to every class
with sufficient battery charge. Make sure you can connect to NYU wi-fi.
Attendance and penalty for missing classes
Requiring attendance is necessary for several reasons. First, you
incorrectly assume you can catch up on a missed class by watching a
recording (if available). Videos do not engage your brain as much as a
live class. Second, fewer than 20% of you watch the recording (if
available). You are then lost in class, which provides the wrong signals
to me as an instructor. Third, your absence hurts class discussions.
Fourth, you miss out on feedback if you do not work through the questions
I pose in class. Fifth, I lose the feedback since there are fewer
questions.
The policy below will be in effect only after the add/drop
period.
Without mandatory attendance, attendance is often below 50%. Therefore,
though I dislike doing this, I penalize absences. If you anticipate being
absent for good reasons, please email me well in advance. Please enter
"Excused" on the attendance sheet described below to avoid the
penalty if I approve. If you miss a class due to emergencies and cannot
tell me in advance, do not panic. Take care of the emergency first, and
then email me. I will permit you to change the "Absent" to
"Excused." But if you miss a class without a valid reason, there
is a penalty, as stated below.
For sections meeting in 150-190 minute sessions, you will lose one
grade (A to A-, A- to B+, B+ to B, B to B-, and so on) for EVERY missed
session unless you were explicitly excused via email. Thus, if you miss
two class sessions, you will lose two grades, and so on.
For sections meeting in 75-80 minute sessions, you will lose one grade
(A to A-, A- to B+, B+ to B, B to B-, and so on) for EVERY TWO missed
sessions unless you were explicitly excused via email. Thus, if you miss
four class sessions, you will lose two grades, and so on.
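The penalty rule above is simple arithmetic, and a short Python sketch may make it concrete. Two parts are assumptions, not policy: the grade ladder below B- (the syllabus only says "and so on"), and the rounding down of an odd number of absences in the shorter sections. The function name `penalized_grade` is made up for illustration.

```python
# Letter grades from best to worst. The steps below B- are an assumption;
# the syllabus lists A through B- and then says "and so on".
GRADES = ["A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D+", "D", "F"]


def penalized_grade(grade, unexcused_absences, session_minutes):
    """Return the letter grade after applying the attendance penalty."""
    if 150 <= session_minutes <= 190:
        # Long sessions: one grade step for every missed session.
        steps = unexcused_absences
    elif 75 <= session_minutes <= 80:
        # Short sessions: one grade step for every two missed sessions.
        # Rounding down for an odd count is an assumption.
        steps = unexcused_absences // 2
    else:
        raise ValueError("session length not covered by the stated policy")
    # Never drop below the last grade on the ladder.
    index = min(GRADES.index(grade) + steps, len(GRADES) - 1)
    return GRADES[index]


print(penalized_grade("A", 2, 180))  # prints B+
print(penalized_grade("A", 4, 75))   # prints B+
```

Excused absences are not counted: only sessions missed without an email excuse should be passed in as `unexcused_absences`.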
Please sit in the same seat in every class and display your name tags. For
Zoom classes, you must keep your video on AT ALL TIMES. You must also have
a good working headset or mic, as it is extremely rude to be inaudible and
force me to ask you to repeat yourself. After entering the class, please
mark yourself present in the first 20 minutes on the OneDrive sheet (link
posted on Brightspace).
You will be marked absent if you are more than 20 minutes late unless it
is because of factors beyond your control (traffic, subway, or
interviews running late). You will also be marked absent if you leave
the class early unless you have my permission or get it afterward. You
will get an F in the course if you are caught cheating on the attendance
sheet.
Grading
- Five Homeworks: 5% * 5 = 25%
- Final exam: 50%
- Final project: 25%
- Attendance: Please read about the penalty for missing classes above.
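The weights above can be combined as in the sketch below. It assumes that every component is scored on a 0-100 scale, which the syllabus does not state, and it leaves out both the attendance penalty and any mapping from the weighted score to a letter grade. The function name `course_score` is hypothetical.

```python
# Weights in percent, taken from the grading list above.
HOMEWORK_WEIGHT = 5    # each of the five homeworks
FINAL_EXAM_WEIGHT = 50
FINAL_PROJECT_WEIGHT = 25


def course_score(homeworks, final_exam, final_project):
    """Return the weighted course score, assuming each input is on a 0-100 scale."""
    if len(homeworks) != 5:
        raise ValueError("expected exactly five homework scores")
    total = (
        HOMEWORK_WEIGHT * sum(homeworks)
        + FINAL_EXAM_WEIGHT * final_exam
        + FINAL_PROJECT_WEIGHT * final_project
    )
    return total / 100


# A late homework counts as zero under the late submission policy.
print(course_score([100, 80, 0, 90, 100], final_exam=80, final_project=90))  # prints 81.0
```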
Late Assignment Submission Policy
Late submissions (even by 1 minute) will get a zero score because the
answers will be posted immediately after the due date and time. No
extensions will be granted except for medical or family emergencies. If
you have any religious or personal conflicts, please submit the
assignments beforehand since the related material will be covered well in
advance of the due dates.
Materials
I will distribute Jupyter notebooks. There is no required textbook for the
course, but the following resources are useful references for some of the
material that I will be covering in class.
- Online version of the class notes on Github. This repository contains
  material for the class, mainly under the “Introduction to
  Python” and “SQL” folders.
- Python For Everybody: Exploring Data in Python 3. It is available for
  free as a PDF on the web (you can also buy a hard copy for $10 on
  Amazon, or get a Kindle version for $1).
- Learn Python 3 the Hard Way. This book is available for free on the web
  (you can also get a hard copy if you want for ~$35).
- Learning MySQL, Chapters 4, 5, 6, and 7: This book contains an extensive
  discussion of MySQL, providing more details on schema design, SQL
  queries, etc.
Course policies
Unless otherwise noted, we follow the default Stern Policies.
Classes are videotaped and a link is posted to NYU Brightspace under
the MediaSite tab.
Frequently Asked Questions
- Q: Why don't we use R/STATA/Matlab/… in this class? My friend
  says that they are very useful tools for analyzing data, and my
  internship requires the use of R. A: I agree that all these tools are
  very useful and should be in the toolkit of any professional who deals
  with data. However, within the context of a semester-long class, if we
  attempt to learn all these tools, we will cover everything
  superficially and spread ourselves too thin. Python and its
  libraries form a very mature ecosystem and will give you substantial
  ability to handle, process, and visualize pretty much anything that you
  want.
- Q: Should I know programming to take this class? A: No, we will learn
  programming in Python during the class.
- Q: I know programming and/or SQL. Is this the right class for me? A: It
  depends on your level. I expect that approximately 40% of the course
  will focus on teaching you programming and Python, then 40% on databases
  and SQL, and 20% on a variety of other topics. If you know programming
  but not Python and are not familiar with SQL, I think that you will get
  a lot out of this class. If you already know Python but not SQL, it may
  be worthwhile, but there will be repetition of things that you know. If
  you are familiar with both programming and SQL, then this is definitely
  not the class for you.
- Q: Will we learn about big data? A: While we will learn a lot about
  handling big data sets, most probably we will not cover any “big
  data” tools, such as Hadoop, Hive, Pig, etc. While it is cool to
  add these buzzwords to your CV, you will be surprised how far you can go
  with just a simple relational database and knowledge of SQL. In
  fact, in most industry settings where I have worked, SQL is the
  preferred mode of analysis, not Hadoop, Pig, and other tools like that.
  Once you add Python to the mix with SQL, your abilities become
  superpowers. Trust me, “you're going to like the way you
  look” at the end of the class, even without knowing Hadoop.
- Q: I already know Python, SQL, have used some NLP tools, and I am really
  interested in learning more deeply about the following couple of topics…. A:
  This is not the right class for you. The class is designed to be broad
  and introductory, not deep and advanced. You should consider taking the
  data mining class, or some specialized class on the topic of your
  interest. If you take this class, you are most probably going to be
  bored, and it will not be a good use of your time.
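As a small illustration of the Python-plus-SQL combination mentioned in the big-data answer above, here is a minimal sketch. It uses Python's built-in `sqlite3` module with an in-memory database purely for convenience (the course materials reference MySQL), and the `orders` table and its rows are invented for the example.

```python
import sqlite3

# An in-memory SQLite database: nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table and data, invented for this illustration.
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 30.0), ("bob", 12.5), ("alice", 20.0)],
)

# SQL does the aggregation; Python receives the result as a list of tuples.
rows = conn.execute(
    "SELECT customer, SUM(amount) AS total "
    "FROM orders GROUP BY customer ORDER BY total DESC"
).fetchall()

for customer, total in rows:
    print(f"{customer}: {total:.2f}")

conn.close()
```

The query handles the grouping and sorting, while Python takes care of loading the data and presenting the result.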
Tentative Timeline
| Module | Topic |
| --- | --- |
| 1 | Using NYU JupyterHub; Introduction to programming and Jupyter; Key components of a programming language: Variables, operators, statements |
| 2 | Key components of a programming language: Data structures such as lists, conditional branching, loops |
| 3 | Syntax versus semantics; Help, comments, and printing; Introduction to formatting output using f-strings |
| 4 | Simple data types: Logical and numeric; Sequenced data types: Strings, lists, and ranges; Mutable versus immutable data types |
| 5 | Arithmetic operators, in-place operators; Comparison operators; Logical operators; Chaining operators and operator overloading |
| 6 | |
| 7 | |
| 8 | Control Flow statements: while loops, for loops |
| 9 | Control Flow statements: while loops, for loops |
| 10 | |
| 11 | |
| 12 | |
| 13 | |
| 14 | Entity-Relationship model: Entities, keys, attributes, relations, ER examples |
| 15 | Entity-Relationship model: ER diagrams to SQL Tables |
| 16 | |
| 17 | SQL 2: LIKE, IS NULL, and Inner Join queries |
| 18 | SQL 3: Inner Join II and Outer Join |
| 19 | SQL 4: Aggregation / GROUP BY queries |
| 20 | SQL 5: Subqueries / Python and SQL |
| 21 | Database integrative class practice |
| 22 | |
| 23 | |
| 24 | Intro to Pandas and Plotting |
| 25 | Intro to Pandas and Plotting |
| 26 | Intro to Pandas and Plotting |
| 27 | Intro to Pandas and Plotting |
| 28 | |