Housing Price Prediction
-
Due Friday, 20 Mar, 2026, 8:00 PM Eastern (accepted for full credit until 11:59 PM)
Overview
In this project, we’ll write a program to analyze data about houses to make predictions about selling prices. We will explore data processing using the Pandas library, and see how attribute selection and engineering can affect the prediction accuracy of our model. We will practice doing this without loops, instead leveraging the powerful functional programming capabilities of Pandas. Finally, we will use generative AI to help us visualize our data.
Objectives
- Practice data transformation and processing using the Pandas library
- Practice functional techniques, avoiding loops in favor of Pandas structures for greater efficiency.
- Write and use helper functions.
- See how attribute selection and data transformation affect ML models.
- Visualize data with the help of generative AI.
Starter Files
Download the starter files using this link. You’ll find:
- house_prices.py
- tests.py
- A data folder, containing:
  - train.csv
  - test.csv
  - testPrices.csv
  - data_description.txt
Keep the data folder in the same folder as your house_prices.py file, since the code in house_prices.py expects to find the data files in that location.
Suggested Timeline
We recommend approximately the timeline below:
- 9 Mar: Download the starter code and set up the VS Code project. Read the entire spec and study the provided code.
- 11 Mar: Complete step 1.
- 12 Mar: Complete step 2.
- 13 Mar: Complete step 3.
- 14 Mar: Complete step 4.
- 15 Mar: Complete step 5.
- 16 Mar: Complete step 6.
- 17 Mar: Complete step 7.
- Wednesday, 18 Mar, 2026: Last day to still get 5% extra credit for your project 4 submission!
- Friday, 20 Mar, 2026: Final due date.
Collaboration Policy and the Honor Code
All students in the class are presumed to be decent and honorable, and all students in the class are bound by the College of Engineering Honor Code. The full collaboration policy can be found in the syllabus.
Course policies, including those on academic integrity, are in place to encourage an effective learning environment for you. We want students to learn from and with each other, and we encourage you to collaborate. We also want to encourage you to reach out and get help when you need it.
Encouraged Collaboration Examples
You are encouraged to:
- Give or receive help in understanding course concepts covered in lecture or lab.
- Practice and study with other students to prepare for assessments or exams.
- Consult with other students to better understand project specifications.
- Discuss general design principles or ideas as they relate to projects.
- Help others understand compiler errors or how to debug parts of their code.
To clarify the last item, you are permitted to look at another student’s code to help them understand what is going on with their code. You are not allowed to tell them what to write for their code, and you are not allowed to copy their work to use in your own solution. If you are at all unsure whether your collaboration is allowed, please contact the course staff via the admin form before you do anything. We will help you determine if what you’re thinking of doing is in the spirit of collaboration for EECS 183.
Prohibited Collaboration Examples
The following are considered Honor Code violations:
- Submitting others’ work as your own.
- Copying or deriving portions of your code from others’ solutions.
- Collaborating to write your code so that your solutions are identifiably similar.
- Sharing your code with others to use as a resource when writing their code.
- Receiving help from others to write your code.
- Sharing test cases with others if they are turned in as part of your solution.
- Sharing your code in any way, including making it publicly available in any form (e.g. a public GitHub repository or personal website).
Autograder Cheating Detection
We run every submission against every other submission and determine similarities. All projects that are “too similar” are forwarded to the Engineering Honor Council. This happens to numerous students each semester. Also know that it takes months to get a resolution from the Honor Council. Discussing the project with other students will NOT be an issue. Sharing code between students, even if it’s just one function, will likely cause the cheating detector to identify both programs as “too similar”. We also search the web for solutions that may be posted online and add these into the mix of those checked for similarities. Searching the web, by the way, is something that we are very good at.
Any violation of the honor policies appropriate to each piece of course work will be reported to the Honor Council, and if guilt is established, penalties may be imposed by the Honor Council and Faculty Committee on Discipline. Such penalties can include, but are not limited to, letter grade deductions or expulsion from the University.
Also note that on all cases forwarded to the Engineering Honor Council the LSA Dean of Academic Affairs is also notified. Furthermore, the LSA rule is students involved in honor violations cannot withdraw from nor drop the course.
Working with a Partner
- For Projects 3 and 4, you may choose to work with one other student who is currently enrolled in EECS 183 Python.
- You may change partners between projects, e.g., you may have a different partner for project 3 than for project 4.
- You may not change partners during a project.
- Although you are welcome to work alone if you wish, we encourage you to consider partnering up for Projects 3 and 4. If you would like a partner but don’t know anyone in the class, we encourage you to use the Search for Teammates post on Piazza to find someone! Please make sure to mark your search as Done once you’ve found a partner.
- As a further reminder, a partnership is defined as two people. Outside of your partnership, you are encouraged to help each other and discuss the project in English (or in some other human language), but don’t share project code with anyone but your partner. See the course Honor Code documentation for details.
- To register a partnership on the autograder, go to the autograder link for the project and select “Send group invitation”. Then, add your partner to the group by entering their email when prompted. They will receive a confirmation after registration, and must accept the invitation before the partnership can submit. You must choose whether or not to register for a group on the autograder before you can submit. If you select the option to work alone, you will not be able to work with a partner later in the project. If a partnership needs to be changed after you register, you may submit an admin request.
- The partnership will be treated as one student for the purpose of the autograder, and you will not receive additional submits beyond the given four submits per day.
- If you decide to work with a partner, be sure to review the guidelines for working with a partner.
- If you choose to use late days and you are working in a partnership, review this section for how late days will be charged against each partner.
Provided Files
train.csv
The file train.csv (in the data folder) is the training data, which contains a unique Id, many attributes, and the SalePrice for each house. We will use the data in this CSV file to build a predictive model – that is, a function that takes in the attributes of a house and outputs a prediction for the sale price. Open train.csv in VS Code and note:
- The first row gives the name of each column. The leftmost column is named Id.
- We will treat the next 79 columns as attributes – these are characteristics of houses that we’ll use to try to predict the sale price. (Note: I’m using the word “attribute” here in the data science sense of a characteristic of a single row of data. This meaning is distinct from an “attribute” that some Python objects have, like a NumPy array has a dtype attribute.)
- The last column is named SalePrice. Values in this column are the actual sale prices for the houses in each corresponding row, according to historical data from years ago.
- Note that the last row of train.csv has Id 1060.
test.csv
The file test.csv (also in the data folder) is the testing data, which contains only the unique Id and attributes for each house (no SalePrice). We will use the model built from train.csv to predict the sale price for each house in test.csv. Open test.csv in VS Code and note:
- The Id column starts from 1061, continuing from the Id 1060 house in the last row of train.csv.
- test.csv has the same 79 attribute columns as train.csv, with the same column names and data types, but it does not have the SalePrice column, since this is what we want to predict for these houses.
data_description.txt
Skim the data_description.txt file in the data folder. Note that it contains information about each attribute. There may be times when you need to refer to this document to complete a task for this project, so keep it in mind.
house_prices.py
Open house_prices.py and find the main function. You can see that we have provided code to load train.csv and test.csv into two DataFrames. Next, demonstrate_helpers is called; this is a function we have provided to demonstrate the effects of some helper functions used in house_prices.py. Read demonstrate_helpers and run the code. You should get output like the following (with more output omitted as indicated by the ...). Read the output a bit to understand the purpose of each helper function.
----------
Attributes with missing values:
Index(['LotFrontage', 'Alley', 'MasVnrType', 'MasVnrArea', 'BsmtQual', ...],
dtype='str')
----------
Numeric attributes:
['Id', 'MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual', ...]
----------
Non-numeric attributes:
['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', ...]
----------
Values, for each non-numeric attribute:
MSZoning: ['RL', 'RM', 'C (all)', 'FV', 'RH']
Street: ['Pave', 'Grvl']
Alley: [nan, 'Grvl', 'Pave']
LotShape: ['Reg', 'IR1', 'IR2', 'IR3']
LandContour: ['Lvl', 'Bnk', 'Low', 'HLS']
...
In this project, we will apply data cleaning and transformations to both train_df and test_df. Then, we will train a model using the transformed data.
tests.py
Extensive yet not necessarily complete tests are provided for you in this file. You may run the test file at any time. The tests are numbered to correspond to the steps (and substeps) of the project as described below and labeled in the house_prices.py file. Note:
- If a test runs but indicates that a failure or exception occurred, there is a problem in your code.
- If a test runs and indicates that it passed, that means that your code is correct for the specific case(s) tested by that test. Note that passing the tests does not necessarily mean your code is correct in all cases, but it is a good start.
- You will need to complete the steps in order, since some later steps depend on earlier steps. Similarly, later tests may crash if you haven’t completed the earlier steps.
To reiterate, it is quite possible for your code to pass tests in tests.py but still be incorrect in some other cases and potentially fail the autograder. Examine your code carefully and write additional tests yourself as needed.
transform_data
The main function calls transform_data, which in turn calls several other functions. The purpose of transform_data is to take in the raw data from train_df and test_df, and apply a series of transformations. These transformations make the data more easily used by a predictive model. They are summarized below:
- handle_missing_values - Real-life data often includes missing values, for reasons such as data collection errors, non-applicable attributes, or simply unrecorded information. These missing values need to be handled in some way, since predictive models generally can’t work when some data is missing.
- add_derived_attributes - We can create new attributes by combining or transforming existing attributes. These new attributes can capture additional information that may be useful for prediction.
- get_neighborhood_to_avg_price and normalize_neighborhood_by_avg_price - The Neighborhood attribute is categorical, but it carries strong information about housing prices. By transforming it into a numerical attribute based on the average sale price in that neighborhood, we allow the model to leverage this information more effectively.
- normalize_numeric_attrs - Normalizing numeric attributes to a common scale (e.g., [0, 1]) can help improve the performance of many machine learning models, especially those that are sensitive to the scale of input features.
- apply_ordinal_encodings - Some categorical attributes represent ordered categories (e.g., quality ratings like “poor”, “fair”, “good”, “excellent”). By encoding these categories with ordinal values (e.g., 0 for “poor”, 1 for “fair”, 2 for “good”, 3 for “excellent”), we allow the model to recognize the inherent order and leverage it in making predictions.
- convert_to_one_hot - For categorical attributes that do not have an inherent order (e.g., the type of heating that a home has), we can use one-hot encoding to convert them into binary attributes. This allows the model to use this information without assuming any ordered relationship between categories.
After applying these transformations, transform_data takes the transformed data (in all_inputs) and creates DataFrames along with the Id and SalePrice (for train_io, as in “train input and output”), and just the Id (for test_input). These DataFrames are returned to main and then passed to the provided test_accuracy function. The test_accuracy function trains a linear regression model on the training data, and evaluates the accuracy of the model on the testing data. When you’ve completed the project, you should end up with an accuracy score of 0.619. We could achieve a higher accuracy by using a more sophisticated machine learning algorithm, and by doing more extensive data cleaning and transformation.
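If the accuracy score reported by test_accuracy is the usual coefficient of determination (R-squared) from linear regression – an assumption on our part, since the provided code defines the details – here is a minimal sketch of how such a score is computed, with invented numbers:

```python
import numpy as np

# Invented toy data: actual sale prices and a model's predictions for them.
y_true = np.array([100.0, 200.0, 290.0, 410.0])
y_pred = np.array([97.0, 199.0, 301.0, 403.0])

# R-squared: 1 minus (residual sum of squares / total sum of squares).
# A perfect model scores 1.0; predicting the mean every time scores 0.0.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
```

With these toy numbers the predictions are close to the actual values, so r2 lands near 1; a score of 0.619 on the real data means the model explains a substantial, but far from complete, portion of the variation in sale prices.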
Practicing Functional Techniques
In this project, we will practice functional techniques with Pandas. This means that we will avoid using standard loops (for and while) to manipulate our data. Comprehensions are fine; we’ll use them at times.
Instead of standard loops, “functional techniques” as we’re defining them here include apply and map (with lambda functions as needed), specialized method calls when applicable (e.g., mean, mode), and vectorized operations (e.g., df.loc[:, col_name] + 1). Why are we focusing on these and avoiding loops? These functional techniques are more efficient (…see lecture for details) and more concise, and this practice trains your brain in a new way of programming that is increasingly important in modern applications. In fact, many of the most powerful tools for data science and machine learning (e.g., PyTorch, TensorFlow) are designed to be used in a functional programming style.
Thus, with one exception stated in context further below, the autograder will check that you have not used any loops before awarding full credit. To summarize:
Use functional techniques instead of standard loops.
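As a quick illustration of the difference (the column names here are invented, not from the project data):

```python
import pandas as pd

df = pd.DataFrame({"sqft": [1200, 1500, 900]})

# Loop style (avoid in this project):
#     for i in range(len(df)):
#         df.loc[i, "sqft_k"] = df.loc[i, "sqft"] / 1000

# Vectorized style: one expression operates on the whole column at once.
df.loc[:, "sqft_k"] = df.loc[:, "sqft"] / 1000

# apply style: call a function on each column of the DataFrame.
col_means = df.apply(lambda col: col.mean())
```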
STEP 1: handle_missing_values
Consider the handle_missing_values function that is called from transform_data. It should replace all missing values in the data. Ordinarily it would be the job of the data scientist to decide, for each attribute, what the missing value policy should be. We have made such decisions for you in this project:
- The GarageYrBlt attribute has missing values that we replace with the corresponding values from the YearBuilt column. The reasoning is that if a house has a garage, it is likely that the garage was built in the same year as the house itself.
- replace_with_mode - For certain categorical attributes, a missing value likely indicates an error in the data.
  - For example, every house has an exterior, so a missing Exterior1st value likely indicates an error in data collection.
  - We choose the simple approach of replacing missing values in such attributes with the mode (most common value) of that attribute.
- replace_with_mean - Similarly, for certain numerical attributes, a missing value likely indicates an error in the data.
  - For example, every house is on a lot of a given area, so a missing LotArea value likely indicates an error.
  - We choose the simple approach of replacing missing values in such attributes with the mean (average) of that attribute.
- replace_with_not_app - For other attributes, a missing value likely indicates not an error in data collection but rather that the attribute is not applicable.
  - For example, if the Alley attribute is missing, it likely means that the house has no alley access – that the Alley attribute is not applicable to that house.
  - For such attributes, we will replace the missing value with the string "NotApp" to indicate that the attribute is not applicable.
The next bit of code creates the col_name_to_fill_in dictionary, mapping a column name to the value that should be used to replace missing values in that column. It is first made with a comprehension, and then added to with the update method. We’ve not previously studied that update method, but you can see that it adds more entries to the dictionary; it essentially merges the argument dictionary into the dictionary left of the dot.
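Here is a small sketch of the comprehension-then-update pattern (the attribute names mimic the dataset, but the values are invented placeholders, not the real modes):

```python
# Pretend these are the modes computed for two categorical attributes.
defaults = {"Exterior1st": "VinylSd", "MSZoning": "RL"}

# Build the dictionary with a comprehension...
col_name_to_fill_in = {attr: defaults[attr] for attr in defaults}

# ...then merge in more entries with update.
col_name_to_fill_in.update({"Alley": "NotApp", "Fence": "NotApp"})
```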
For convenience, we also create the all_replace_attrs list as the concatenation / merge (with +) of the three lists of attributes that require different types of replacement.
Having understood all of the above, your task is to complete the “TO DO Step 1” section labeled in the handle_missing_values function:
The overall structure of your solution for this step should be:
all_inputs.loc[:, all_replace_attrs] = (
# an appropriate call to apply
)
This can be done in a single statement (plus an inner def), or you may break it up into a couple statements with intermediate variables if you wish. Your inner def will need to use the fillna method from pandas, which fills in missing values in a Series with a specified value. Read the succinct pandas fillna documentation to see how to use it.
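Here is a minimal sketch of the fillna-via-apply technique on toy data (the column names mimic the dataset, but the values and fill-in choices are invented):

```python
import numpy as np
import pandas as pd

# Toy data with missing values (np.nan marks a missing entry).
all_inputs = pd.DataFrame({
    "LotArea": [8000.0, np.nan, 9500.0],
    "Alley": [np.nan, "Grvl", np.nan],
})
col_name_to_fill_in = {"LotArea": 8750.0, "Alley": "NotApp"}
all_replace_attrs = ["LotArea", "Alley"]

def fill_col(col: pd.Series) -> pd.Series:
    # col.name is the column's label; look up its replacement value.
    return col.fillna(col_name_to_fill_in[col.name])

# apply calls fill_col once per column of the selected sub-DataFrame.
all_inputs.loc[:, all_replace_attrs] = (
    all_inputs.loc[:, all_replace_attrs].apply(fill_col)
)
```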
STEP 2: add_derived_attributes
One way to improve our data is to add new attributes that reflect combinations of existing attributes. Complete the function add_derived_attributes to add the following new attributes:
- AgeWhenSold: the year sold minus the year built
- YearsSinceRemodel: the year sold minus the year remodeled
- TotalSF: the sum of the square footage of the basement, first floor, and second floor
You may recall that new columns can be added to a DataFrame with an assignment. For example:
all_inputs.loc[:, "AgeWhenSold"] = ... # expression generating a Series of the correct length
Also note that we can perform basic math operations on Series (and even DataFrames) of the same shape, similar to what we can do in NumPy.
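For example, Series arithmetic is element-wise, so a derived column can be built in one assignment (the values below are invented):

```python
import pandas as pd

all_inputs = pd.DataFrame({
    "YrSold": [2008, 2010],
    "YearBuilt": [1990, 2005],
})

# Subtracting two Series of the same length yields a new Series,
# computed element by element – no loop needed.
all_inputs.loc[:, "AgeWhenSold"] = (
    all_inputs.loc[:, "YrSold"] - all_inputs.loc[:, "YearBuilt"]
)
```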
STEP 3: get_neighborhood_to_avg_price and normalize_neighborhood_by_avg_price
Consider the attribute Neighborhood. It is originally a categorical variable represented by string abbreviations for each neighborhood. While categorical, this attribute carries strong information about housing prices, since properties in the same neighborhood tend to have similar values.
To better capture this relationship, we can transform Neighborhood into a numerical attribute. You will do this in two steps, labeled Step 3a and Step 3b in the code.
Step 3a: get_neighborhood_to_avg_price
Observe the type hints, RME, and provided code for this function. The TO DO section should be completed with a line that returns a dictionary mapping each neighborhood string to its average SalePrice in train_df. Study the provided variables and use them in a dictionary comprehension to create and return the required dictionary.
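A sketch of the technique on toy data (neighborhood names follow the dataset's style, but the prices are invented; the real function should use the variables provided in the starter code):

```python
import pandas as pd

train_df = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "OldTown"],
    "SalePrice": [150000, 170000, 120000],
})
neighborhoods = train_df.loc[:, "Neighborhood"].unique()

# Dict comprehension: for each neighborhood, select its rows with a
# boolean mask and average their SalePrice values.
neighborhood_to_avg_price = {
    n: train_df.loc[train_df.loc[:, "Neighborhood"] == n, "SalePrice"].mean()
    for n in neighborhoods
}
```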
Step 3b: normalize_neighborhood_by_avg_price
Observe the type hints, RME, and provided code for this function. The parameter neighborhood_to_avg_price is the dictionary created by get_neighborhood_to_avg_price, passed to this function within transform_data. Use an assignment statement (or two) to transform the Neighborhood column as specified.
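The core of this transformation is the Series map method, which replaces each value using a dictionary lookup (toy data below, with invented prices):

```python
import pandas as pd

all_inputs = pd.DataFrame({"Neighborhood": ["NAmes", "OldTown", "NAmes"]})
neighborhood_to_avg_price = {"NAmes": 160000.0, "OldTown": 120000.0}

# map looks up each neighborhood string and substitutes its average price.
all_inputs.loc[:, "Neighborhood"] = (
    all_inputs.loc[:, "Neighborhood"].map(neighborhood_to_avg_price)
)
```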
STEP 4: normalize_numeric_attrs
It is useful to convert all numeric types into a [0, 1] range, so that the large magnitude of certain attributes (say, square footage) does not overwhelm the small magnitude of other attributes (say, number of bathrooms) when we train our model.
Complete the normalize_numeric_attrs function to perform this transformation. Complete this step with an inner def and a single assignment statement using apply.
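One common way to do this is min-max scaling, sketched here on toy data (column names and values invented):

```python
import pandas as pd

numeric = pd.DataFrame({
    "LotArea": [0.0, 5000.0, 10000.0],
    "Baths": [1, 2, 3],
})

def normalize(col: pd.Series) -> pd.Series:
    # Min-max scaling: the smallest value maps to 0, the largest to 1.
    return (col - col.min()) / (col.max() - col.min())

# apply runs normalize once per column.
numeric = numeric.apply(normalize)
```

Note that after scaling, LotArea and Baths are on the same [0, 1] scale even though their raw magnitudes differed by thousands.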
STEP 5: apply_ordinal_encodings
In this step, we convert quality-related attributes that are currently categorical (such as Ex, Gd, TA, Fa, Po) into numerical values using ordinal encoding. These categories represent ordered quality levels, from highest to lowest, rather than independent labels.
Included in the provided code are maps from categories to integers for the standard quality attributes, as well as three special attributes (BsmtExposure, GarageFinish, and Functional) that also have ordered categories. Study this and the remainder of the provided code in this function. Use these variables to write a small amount of code to complete this function.
Unlike most of this project, one for loop will be appropriate here, since we’re changing what transformation to apply based on the attribute we’re working with.
- Use zip to loop through maps and attrs simultaneously. Thus, each time the body of the loop is executed, we’ll be dealing with one of the mappings in the maps list and the corresponding list of attributes in the attrs list.
- In the body of your for loop, create an inner def called apply_mapping. Its purpose is to convert a single column according to the current mapping in use.
  - Why do this in the body of the for loop? Because apply_mapping’s work depends on the current mapping, so we need to access it from the loop variables. (There are other ways we could handle this, but this is the simplest.)
- The remainder of the for loop body should be a single assignment statement using apply to apply the appropriate mapping (via apply_mapping) to each column in the current column list.
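The loop structure above can be sketched as follows (the mappings and column lists here are invented stand-ins for the ones provided in the starter code):

```python
import pandas as pd

df = pd.DataFrame({
    "ExterQual": ["Gd", "Ex", "TA"],
    "GarageFinish": ["Fin", "Unf", "RFn"],
})
# Invented mappings, paired with the columns each one applies to.
maps = [{"Ex": 4, "Gd": 3, "TA": 2}, {"Fin": 2, "RFn": 1, "Unf": 0}]
attrs = [["ExterQual"], ["GarageFinish"]]

for mapping, cols in zip(maps, attrs):
    def apply_mapping(col: pd.Series) -> pd.Series:
        # Uses the current mapping from the enclosing loop iteration.
        return col.map(mapping)

    # One assignment per iteration: encode every column in this group.
    df.loc[:, cols] = df.loc[:, cols].apply(apply_mapping)
```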
STEP 6: get_one_hot_cols and convert_to_one_hot
Some categorical attributes do not have an inherent order. For example, the Heating attribute has categories such as “GasA”, “GasW”, “Grav”, “Wall”, “OthW”, and “Floor”. There is no inherent ordering to these categories – one is not better or worse than another. For such attributes, we want to use one-hot encoding to convert them into binary attributes.
Step 6a: get_one_hot_cols
Study the provided code for get_one_hot_cols. Complete the TO DO section with a single dictionary comprehension that you assign to col_name_to_series.
Step 6b: convert_to_one_hot
Study the provided code for convert_to_one_hot. Complete the TO DO section with a single list comprehension that you assign to list_of_dfs. You’ll need to call get_one_hot_cols in your list comprehension.
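To make the encoding itself concrete, here is a toy sketch of one-hot encoding a single column with a dictionary comprehension (the exact naming scheme and helper structure in the starter code may differ):

```python
import pandas as pd

col = pd.Series(["GasA", "GasW", "GasA"], name="Heating")

# One binary column per category: 1 where the row has that value, else 0.
col_name_to_series = {
    "Heating_" + val: (col == val).astype(int)
    for val in col.unique()
}
one_hot_df = pd.DataFrame(col_name_to_series)
```

Each row of one_hot_df has exactly one 1 across the new columns, so no ordering among the categories is implied.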
Step 7: Using GenAI to Generate Visualization Code
As you are increasingly aware, Python has a huge number of libraries, each of which usually contains a huge number of functions. For example, matplotlib is a very powerful library for data visualization. Take a quick look (like for 30 seconds) at the full documentation for matplotlib.
Some people have studied matplotlib extensively; that is worthwhile for those who will need it frequently! But many other people (like me, and probably you) just need it now and then; they just want to generate some kind of visualization of their data and move on.
Prior to genAI, people in that situation would check a site like stackoverflow.com and find a question like How to plot a histogram using Matplotlib in Python with a list of data?. That, perhaps in combination with the full documentation, would allow them to write the code they needed. But it would take some time to find the right question and adapt the code in the answer.
With genAI, there’s an easier way. And since our goal is not to master matplotlib, but rather just to generate some visualizations of our data, we’re not losing much if anything in needed expertise by asking genAI for help with matplotlib. To put it more generally:
If your goal is just to get a task done, and not to get better at the task itself, then genAI is a particularly helpful tool.
(Of course, GenAI can also be used in certain ways as a tutor to increase one’s expertise! I’ve certainly done this. But I think we also all know from experience that it’s easy to inadvertently use genAI in ways that do not increase expertise.)
How does genAI know so much about how to use matplotlib? Because it has been trained on a huge number of examples, including those on stackoverflow.com. Honestly, the makers of genAI systems owe a great debt (at least ethically…) to the huge number of people who have contributed to such sites. Reddit and various help forums are other very significant examples of this “free” labor the world has provided.
Alright, let’s use genAI to generate the matplotlib code for us. We will use Gemini, Google’s genAI model. You can access Gemini Pro (ordinarily a paid product) when signed in through your umich email at https://gemini.google.com. Another benefit of signing in is that U-M has an agreement with Google to protect your privacy in ways that are not otherwise available.
Please don’t use other AI tools for this project, to keep things consistent for everyone and to protect your privacy. Gemini Pro will be quite sufficient for this task, and is far superior to free tools.
For the deepest learning experience, and to avoid issues of academic dishonesty, please only use AI for this project / this course / any course when specifically allowed.
Observe the make_visualizations function. This is where you will put Gemini-generated code to create visualizations of the data.
Note that we are not asking Gemini to actually make images for us – that would usually not work out well, with it hallucinating various details of the data. Rather, we’re asking Gemini to generate Python code that we can run to make the visualizations.
Ask Gemini to help you to do the following. Each time some code is generated, read it to see what you can understand from it, but note that you will not be required to write code like this yourself from scratch.
- Create a new folder called data_vis if it doesn’t already exist, to store the visualizations that you will create.
- Make a histogram to plot the frequency of price ranges of the SalePrice column from train_df. Save the image as sales_price_histogram.png in the data_vis folder. The result should look like a bar graph with sale price on the x-axis and frequency (number of houses with a given sale price) on the y-axis.
- Make a scatter plot to visualize the relationship between GrLivArea (above ground living area) and SalePrice. Save the image as living_area_vs_price.png in the data_vis folder. The result should be a scatter plot with GrLivArea on the x-axis and SalePrice on the y-axis, with each point representing a house. As expected, you’ll find that houses with larger above ground living area generally tend to have higher sale prices, though there will be some variation.
- Make a box plot to visualize the relationship between Neighborhood and SalePrice. Save the image as neighborhood_vs_price.png in the data_vis folder. The result should be a box plot with Neighborhood on the x-axis and SalePrice on the y-axis, with each box representing a neighborhood. The box plot will show the distribution of sale prices within each neighborhood, allowing you to compare the central tendency and variability of sale prices across different neighborhoods.
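For orientation, here is a sketch of the kind of code Gemini might generate for the first task (folder and file names come from the spec; the data here is invented, and Gemini's actual output will differ):

```python
import os

import matplotlib
matplotlib.use("Agg")  # render to files without needing a display
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for the real train_df loaded in house_prices.py.
train_df = pd.DataFrame({"SalePrice": [120000, 150000, 150000, 300000]})

# Create the output folder if it doesn't already exist.
os.makedirs("data_vis", exist_ok=True)

plt.figure()
plt.hist(train_df["SalePrice"], bins=10)
plt.xlabel("SalePrice")
plt.ylabel("Frequency")
plt.savefig("data_vis/sales_price_histogram.png")
plt.close()
```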
Style Checklist
Review EECS 183 Python Style Guide and the style rubric below. All sections are relevant to this project.
Readability Violations
-1 for each of the following categories:
- Top comment problems.
- Missing RME for any function.
- Other comment problems.
- Indentation problems.
- Whitespace problems.
- Variable problems.
- Line length problems.
- Boolean Expression problems.
Coding Quality Violations
-2 for each of the following categories:
- Global constants not following naming conventions, or global variables used.
- Magic numbers used.
- Using logic that is clearly too involved or incorrect.
How to Submit
- Go to the autograder, where you will submit only house_prices.py.
IMPORTANT:
- Differences in whitespace of your output can fail the autograder.
- Ensure that you have included your (and your partner’s) name, your (and your partner’s) uniqname, and a short description of the program in the header comments of all files submitted.
- You have four submissions to the autograder per day with feedback, and one additional “wildcard” submission to use once during the project.
- We will grade your best submission for style. If multiple submissions are tied in score, we take the last of those.
Copyright and Academic Integrity
© 2026 Steven Bogaerts.
Materials for this assignment were developed with assistance from course staff, including Xinyun Cao and Leanne Cheng.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
All materials provided for this course, including but not limited to labs, projects, notes, and starter code, are the copyrighted intellectual property of the author(s) listed in the copyright notice above. While these materials are licensed for public non-commercial use, this license does not grant you permission to post or republish your solutions to these assignments.
It is strictly prohibited to post, share, or otherwise distribute solution code (in part or in full) in any manner or on any platform, public or private, where it may be accessed by anyone other than the course staff. This includes, but is not limited to:
- Public-facing websites (like a personal blog or public GitHub repo).
- Solution-sharing websites (like Chegg or Course Hero).
- Private collections, archives, or repositories (such as student group “test banks,” club wikis, or shared Google Drives).
- Group messaging platforms (like Discord or Slack).
To do so is a violation of the university’s academic integrity policy and will be treated as such.
Asking questions by posting small code snippets to our private course discussion forum is not a violation of this policy.