Foundations of Data Science combines three perspectives: inferential thinking, computational thinking, and real-world relevance. Given data arising from some real-world phenomenon, how does one analyze that data so as to understand that phenomenon? The course teaches critical concepts and skills in computer programming and statistical inference, in conjunction with hands-on analysis of real-world datasets, including economic data, document collections, geographical data, and social networks. It delves into social issues surrounding data analysis such as privacy and design.

This course does not have any prerequisites beyond high-school algebra. The curriculum and format are designed specifically for students who have not previously taken statistics or computer science courses. Students with some prior experience in either statistics or computing are welcome to enroll, though some parts of the course will feel slow to them. Students who have taken both statistics and computer science courses should instead take a more advanced course.

Our primary text is an online book called Computational and Inferential Thinking: The Foundations of Data Science. This text was written for the course by the course instructors.

The computing platform for the course is hosted at datahub.berkeley.edu. Students find it convenient to use their own computer for the course. If you do not have adequate access to a personal computer, we have machines available for you; please contact the instructor.

You are not alone in this course; the staff and instructors are here to support you as you learn the material. It's expected that some aspects of the course will take time to master, and the best way to master challenging material is to ask questions. For online questions, use Piazza. We will also hold office hours for in-person discussions.

Weekly labs are a required part of the course and should be submitted during your lab session. To receive credit, you must attend lab, work on the lab assignment until you're finished or the lab period is over, and get checked off by a course staff member. Labs will be released on Monday night. If you don't want to attend lab physically, you may complete a lab assignment remotely, but you must complete it by Tuesday at 11:59pm to receive credit. Note that if you attend lab, you can still get credit even if you don't finish all of the lab problems. However, if you choose to work remotely, you must finish the entire lab to receive credit. Each person must submit each lab independently, but you are welcome to collaborate with other students in your lab room.

Small-group tutoring sections will be available to a subset of students who sign up for them during the second week of classes. For students who have not programmed before, these sections will be an excellent use of your time. Details about sign-ups will be shared in lecture and posted here. Tutoring sessions are held in BIDS unless your tutor tells you otherwise. BIDS (the Berkeley Institute for Data Science) is in Doe Library, immediately to your left if you walk in through the Memorial Glade entrance!

Data science is about analyzing real-world data sets, and so a series of projects involving real data is a required part of the course. You may work with a single partner on all projects, and we strongly recommend that you find a partner in your lab section.

Weekly homework assignments are a required part of the course. Each student must submit each homework independently, but you are allowed to discuss problems with other students.

The midterm exam will be held in class (during the lecture period) on Friday, March 10. The final exam will be held from 7 p.m. to 10 p.m. on Tuesday, May 9. Unless you have accommodations as determined by the university or permission from the instructor, you must take the midterm and the final at the dates and times provided here. Please check your course schedule and make sure that you have no conflicts with these exams.

Participation points can be earned in one of two ways: attending lecture or completing a final independent data investigation. Lecture attendance will begin to count in week 3; the first two weeks are optional. Students who have previously taken both computer science and statistics courses cannot receive participation credit for attendance; they must complete a final independent data investigation.

Details (posted 1/27/16):

- The *final independent data investigation* can be completed in pairs and is expected to require 10-20 hours of work. The purpose of the investigation is to apply what you've learned to some data set of your choosing. You will present your investigation during RRR week to the course staff. Investigations will require the following elements. Correct execution of every element is sufficient for full credit.
  - Choose and describe a data set that includes at least two tables, some quantitative variable, and some categorical variable.
  - Visualize some quantitative variable(s) of the data in a way that summarizes the data effectively and write a short observational description.
  - Visualize some categorical variable(s) of the data in a way that summarizes the data effectively and write a short observational description.
  - Summarize some aspect of the data in a table that involves grouping or pivoting and write a short observational description.
  - Summarize some aspect of the data in a table that involves joining the two tables and write a short observational description.
  - State a hypothesis related to the data and the corresponding null hypothesis.
  - Perform a statistical test for the hypothesis and write a short conclusion.
  - Describe a prediction problem related to the data.
  - Apply a prediction technique to the problem and briefly justify your choice of approach.
  - Evaluate the prediction technique quantitatively and write a short conclusion.

- The requirement for *attending lecture* is to attend at least 2 lectures in each of at least 10 weeks, starting with week 3. Not attending could potentially save you 10-20 hours, just enough time to complete your data investigation. Attendance will be taken via a Google Form. Yes, it's possible to fake attendance, but doing so is quite silly given that you have another option. Students caught faking attendance will fail the course. If you intend to earn attendance credit but can't attend due to unforeseen circumstances, you are encouraged to contact the instructor instead of subverting the attendance system.
- Students who have substantial prior experience in both statistics and computer science are eligible only for the data investigation option. That's not to punish you; it's to ensure that this course actually furthers your education. Both of the following must be true for this policy to apply to you.
  - In a prior semester, you have passed one of the following courses at Berkeley: Stat 2, 20, 21, 133, 134, 135, or any upper-division course containing the words "Probability" or "Statistics" in the course title, such as EE 126 or IEOR 172.
  - In a prior semester, you have passed one of the following courses at Berkeley: Stat 133, CS 61A, CS 61B, or Engineering 7.
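
As a rough illustration of the lecture-attendance rule above (at least 2 lectures in each of at least 10 weeks, counting from week 3), the following Python sketch checks a record of lectures attended per week. The function name, its parameters, and the dictionary format are hypothetical and chosen only for illustration; the course records attendance via a Google Form, and this is not the staff's actual tooling.

```python
def meets_attendance_requirement(lectures_attended_by_week,
                                 first_counted_week=3,
                                 min_lectures_per_week=2,
                                 min_qualifying_weeks=10):
    """Return True if the attendance record satisfies the stated rule.

    lectures_attended_by_week maps a week number to the number of
    lectures attended in that week (a hypothetical record format).
    """
    # A week qualifies only if it is week 3 or later and the student
    # attended at least 2 lectures that week.
    qualifying_weeks = sum(
        1
        for week, attended in lectures_attended_by_week.items()
        if week >= first_counted_week and attended >= min_lectures_per_week
    )
    return qualifying_weeks >= min_qualifying_weeks


# Two lectures in each of weeks 3 through 12: ten qualifying weeks.
steady_record = {week: 2 for week in range(3, 13)}
print(meets_attendance_requirement(steady_record))  # True

# Full attendance in weeks 1 through 10: weeks 1 and 2 do not count,
# so only eight weeks qualify.
early_record = {week: 3 for week in range(1, 11)}
print(meets_attendance_requirement(early_record))  # False
```

The second record shows why the starting week matters: attendance in the first two weeks is optional and earns nothing toward the ten-week count.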

Grades will be assigned using the following weighted components:

Activity | Grade |
---|---|
Participation | 10% |
Lab | 10% |
Homework | 20% |
Projects | 20% |
Midterm | 10% |
Final | 30% |

The course will not be curved, but further details of grading criteria may not be announced until the end of the course. It is certainly possible for all students to receive high grades in this course if all of you show mastery of the material on exams and complete all assignments.
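
To make the weighting concrete, here is a minimal Python sketch that combines component scores using the percentages listed above. The function name and the sample scores are hypothetical, and the sketch shows only how the weights add up; it says nothing about how a final score maps to a letter grade, since those criteria may not be announced until the end of the course.

```python
# Weights in percent, copied from the grading components above.
WEIGHTS = {
    "Participation": 10,
    "Lab": 10,
    "Homework": 20,
    "Projects": 20,
    "Midterm": 10,
    "Final": 30,
}


def overall_score(component_scores):
    """Return the weighted course score on a 0-100 scale.

    component_scores maps each activity name to a percentage (0-100).
    """
    missing = set(WEIGHTS) - set(component_scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    weighted_total = sum(
        WEIGHTS[name] * component_scores[name] for name in WEIGHTS
    )
    return weighted_total / 100


# Hypothetical scores for one student, for illustration only.
example_scores = {
    "Participation": 100,
    "Lab": 100,
    "Homework": 90,
    "Projects": 80,
    "Midterm": 70,
    "Final": 85,
}
print(overall_score(example_scores))  # 86.5
```

Because the final exam carries 30% of the total, each point on the final moves the overall score three times as much as a point on the midterm.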

With the obvious exception of exams, we encourage you to discuss all of the course activities with your friends and classmates as you are working on them. You will definitely learn more in this class if you work with others than if you do not. Ask questions, answer questions, and share ideas liberally.

Since you're working collaboratively, keep your project partner and the course staff informed. If some medical or personal emergency takes you away from the course for an extended period, or if you decide to drop the course for any reason, please don't just disappear silently! You should inform your project partner, so that nobody is depending on you to do something you can't finish.

Cooperation has a limit, however. You should not share your code or answers directly with other students. Doing so doesn't help them; it just sets them up for trouble on exams. Feel free to discuss the problems with others beforehand, but not the solutions. Please complete your own work and keep it to yourself. The exception to this rule is that you can share everything related to a project with your project partner and turn in one project between you.

Penalties for cheating are severe. They range from a zero grade for the assignment or exam up to dismissal from the University for a second offense.

Rather than copying someone else's work, ask for help. You are not alone in this course! The course staff is here to help you succeed. If you invest the time to learn the material and complete the projects, you won't need to copy any answers.

If you want to receive credit for an assignment that you will turn in after the deadline, you must ask your GSI before the deadline. Otherwise, late homework and labs will not be accepted. Late projects will be accepted for half credit. Extensions will be offered only in advance of the deadline and only for exceptional circumstances.

This page shouldn't end with a list of penalties for cheating or lateness, because penalties and grades aren't the purpose of the course. We actually just want you to learn. Please keep that goal in mind throughout the semester. Welcome to Data 8.