morad98

asked on

What more cleaning does my data set need to fix the standard deviation of the target (prices)?

sacramento_real_estate_transactions.csv

Hi,
I have a data set and have done some cleaning on it. I got acceptable results, but I cannot use it to predict my values because:

Price, which is the target, has a high standard deviation (it is in the thousands).

The features, which are the number of beds and baths along with square feet, have an acceptable standard deviation (less than one).

I understand there are outlier values. My question is: what is the best procedure to fix the standard deviation of the price after the cleaning I have already done?

Note: for cleaning I removed the negative values, the zeros, and the NaNs. Please, I need an expert's help with this :(
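For reference, the cleaning steps described above (drop NaNs, zeros, and negatives) can be sketched in pandas. The column names here are assumptions about the attached CSV, and the numbers are made up:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the attached CSV; column names are assumed.
df = pd.DataFrame({
    "beds":  [2, 3, 0, 4],
    "baths": [1, 2, 1, 2],
    "sq_ft": [836.0, 1167.0, np.nan, 2280.0],
    "price": [59222, 68212, 4897, -210944],
})

# Drop rows with NaNs, then keep only rows where every value is positive
# (this removes both zeros and negatives in one step).
clean = df.dropna()
clean = clean[(clean > 0).all(axis=1)]
```

On this toy frame, the NaN row and the negative-price row are dropped, leaving the first two rows.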
d-glitch

What data do you have, and how was it collected?
How do you know that this data is valid?
What do you plan to do with this data? What do you hope to find out?

You mention beds and baths?  Are you talking about actual beds and bath tubs?  Or bathrooms and bedrooms?

Are you talking about houses or apartments?  Monthly rents or purchase prices?
Are you considering location, location, location?
morad98

ASKER

Actually, the data is part of a lab. Yes, I am talking about the number of bedrooms and bathrooms.

The locations are there too. In the last hour I tried to remove the outliers using the IQR and the Z-score, but the standard deviation of the target is still not normal.
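For reference, the Z-score filter mentioned above is usually written as below. This is a sketch with toy numbers and the common |z| < 3 cutoff, not the attached data:

```python
import pandas as pd

# Toy prices: twenty ordinary values plus one extreme outlier.
prices = pd.Series(list(range(60000, 80000, 1000)) + [2_000_000])

# Z-score: how many standard deviations each value sits from the mean.
z = (prices - prices.mean()) / prices.std()
kept = prices[z.abs() < 3]  # the 2,000,000 outlier is dropped
```

Note that on very small samples the Z-score can fail to flag anything, because a single extreme point inflates the standard deviation it is measured against.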
morad98

ASKER

I attached the data set. Here is the code I wrote for the IQR:

import numpy as np

def detect_iqr(data_2):
    # Standard IQR rule: keep values inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
    # (Keeping only values strictly between Q1 and Q3 would discard
    # roughly half the data, not just the outliers.)
    q1, q3 = np.percentile(data_2, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    mask = (data_2 >= lower) & (data_2 <= upper)
    return Sacramento_housing[mask]
d-glitch

You still haven't said what you are hoping to achieve with this analysis.

How do you decide what constitutes an outlier?
What percentage of the Sacramento housing stock does this database represent?
Are these really sale dates?  If so, May 21, 2008 was a very busy day.
All this data is over 10 years old.  Is that a problem?
How are you handling the longitude and latitude data?
morad98

ASKER

Longitude and latitude are object types, so I will not use them. I also changed the type of the Zip column. So my plan is to make beds, baths, and sq_ft my features. I am going to use the IQR to decide on the outliers.
You cannot hope to succeed by ignoring the most valuable data.

Precise physical location (not ZIP code) is the most important single factor in housing prices. That's why the First Law of Real Estate uses that word (and no others) three times.
morad98

ASKER

Well, this is a good point to start with (you are right), so the location will be in my features. Great help.
But I still see that using the IQR will remove a lot of my data, and I want to end up with a good standard deviation.
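One reason an IQR filter can remove a lot of data is how the fences are chosen: keeping only values strictly between Q1 and Q3 discards roughly half the rows, while the conventional fences at Q1 − 1.5·IQR and Q3 + 1.5·IQR drop far fewer. A sketch on synthetic, price-like (skewed) data, not the attached file:

```python
import numpy as np

rng = np.random.default_rng(0)
prices = rng.lognormal(mean=12, sigma=0.5, size=1000)  # skewed toy "prices"

q1, q3 = np.percentile(prices, [25, 75])
iqr = q3 - q1

middle_half = prices[(prices > q1) & (prices < q3)]  # keeps ~50% of rows
fenced = prices[(prices >= q1 - 1.5 * iqr) & (prices <= q3 + 1.5 * iqr)]
# fenced keeps far more rows while still trimming the extreme tail
```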
ASKER CERTIFIED SOLUTION
d-glitch

This solution is only available to members of Experts Exchange.
morad98

ASKER

Hi,
The lab states this: "In this lab you will hone your exploratory data analysis (EDA) skills and practice constructing simple linear regressions using a data set on Sacramento real estate sales. The data set contains information on qualities of the property, location of the property, and time of sale."

Great. I moved forward before reading your reply, but as a data expert, what is the best procedure to follow for cleaning data, and what, at a high level, is the procedure for SLR?
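For reference, SLR (simple linear regression) fits price = slope·feature + intercept by least squares. A minimal numpy sketch with made-up numbers that are deliberately exactly linear:

```python
import numpy as np

# Made-up, deliberately linear numbers: price = 150 * sq_ft + 30000.
sq_ft = np.array([800.0, 1000.0, 1200.0, 1500.0, 2000.0])
price = np.array([150000.0, 180000.0, 210000.0, 255000.0, 330000.0])

# Least-squares fit of a degree-1 polynomial, i.e. simple linear regression.
slope, intercept = np.polyfit(sq_ft, price, 1)
# slope ≈ 150.0, intercept ≈ 30000.0
predicted = slope * sq_ft + intercept
```

On real data the fit will not be exact; the residuals (price − predicted) are what EDA tools like residual plots examine.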
I have no idea what IQR, Z Score, and SLR mean.  I wouldn't know EDA either, except that you defined it.

What is the goal of your analysis, in English?
You can't start analyzing data until you have defined the problem.
And you should never discard any data without a good reason and a better explanation.
morad98

ASKER

Thanks a lot for your help and the advice.