
Python Challenging Code Question

Member_2_7966113
392 Views
Last Modified: 2019-09-22
OK, EE members, I appreciate that EE doesn't have as many programmers as Stack Overflow, but I have joined because I'm hoping that the few programmers on EE will assist me.


I was presented with the following Apache Spark python question:

Sample and Randomly split data to create training and test datasets and persist the training dataset to disk.

In order to successfully complete the question I need to write a function that achieves the following:

  1. Uses the file CC1_TrainTestSpit.csv
  2. Samples 20 percent of the DataFrame without replacement
  3. Randomly splits the sampled data to create train and test datasets with weights .8 and .3 respectively
  4. Persists the training dataset using DISK_ONLY
  5. Uses the summary function on the training dataset to return a DataFrame of the statistics mean, min, max
  6. Returns, in order, the training dataset, the test dataset, and the summary DataFrame

The function should take the form of:

def trainTestSplit(df: DataFrame):
    ...
    return (trainDF, testDF, statsDF)

The dataset is attached.

Can someone assist me?

Author

Commented:
Hello EE

I thought I would let you know that the above task / question came directly from the following Databricks exam

CRT020: Databricks Certified Associate Developer for Apache Spark 2.4 with Python 3 – Assessment
aikimark
CERTIFIED EXPERT
Top Expert 2014

Commented:
In order to successfully complete the question I need to write a function

The key part of this is that YOU need to write a function.

Author

Commented:
aikimark

There isn't any need for the smug response. If you don't want to help, then simply don't help.

I have attempted to write the code but I don't think I'm doing it correctly.

Be professional!
aikimark
CERTIFIED EXPERT
Top Expert 2014

Commented:
I'm not being smug.

I'm willing to help you.  I'm not willing to take your tests for you.

What code have YOU written and what problems are you having with it?
CERTIFIED EXPERT

Commented:
Hi Just saw the question now.

I think it would be easier to help if we walk through this step by step and if you show us what you wrote so far.

At which step exactly do you fail? 2.) ? 3.) ? 4.) ? ...

What is the exact problem?

Does the code crash, do you get an unexpected result?

I guess step 1 just means, that you have an existing csv file as data source, right?
noci, Software Engineer
CERTIFIED EXPERT
Distinguished Expert 2019

Commented:
Please present what you have done so far. (a code block of the function would suffice).
and where do you get stuck.

Author

Commented:
Hi all,

I will update you with my attempted code when I get back to my desk.

Author

Commented:
OK, Experts,

I have just got  back to my desk.

As you requested, the following is what I have done so far:

data = pd.read_csv('CC1_TrainTestSpit.csv') 

#get 20% data

df=data.sample(frac=0.2)

# split data

def trainTestSplit(df):
    train=df.sample(frac=0.8) 
    test=df.drop(train.index)
    
    summary=train.describe(include=[np.number]).loc[['mean','min','max'],:]
    
    return train,test,summary




I hope you now feel more in a position to help me...
aikimark
CERTIFIED EXPERT
Top Expert 2014

Commented:
Is there an import statement associated with the pd object?

Author

Commented:
Hi aikimark,

I'm not sure what you mean?

Author

Commented:
Do you mean

import pandas as pd 
import numpy as np


Author

Commented:
I'm not sure if the above code is correct; however, I get the following error when I run the following line of code:

df=data.sample(frac=0.2)


sample() got an unexpected keyword argument 'frac'
aikimark
CERTIFIED EXPERT
Top Expert 2014

Commented:
Yes.  That is what I meant.  When you share code, please share all of your code, not just snippets.  The question I asked might have led you to an immediate solution if you had failed to include those import statements.
aikimark
CERTIFIED EXPERT
Top Expert 2014

Commented:
Check your data
What do you see when you invoke this statement?
data.head()


Author

Commented:
I get the following:

Out[22]: Row(_c0='discountId', _c1='price', _c2='active', _c3='createdAt')
aikimark
CERTIFIED EXPERT
Top Expert 2014

Commented:
Did you put that statement right after the read_csv() statement?

Author

Commented:
Hi aikimark,

I haven't got a clue what you're suggesting...
noci, Software Engineer
CERTIFIED EXPERT
Distinguished Expert 2019

Commented:
First, what version of Python are you using? Preferably it would be 3.6, at least Python 3 [Python 2.7 works a bit differently]
(although this code should also run on Python 2.7...)

Your code with a few extras, save as a script...

#!/usr/bin/env python3
#! previous line added to enforce python 3... when run as an executable.

import pandas as pd
import numpy as np 

data = pd.read_csv('CC1_TrainTestSpit.csv')    # load frame

#get 20% data

df=data.sample(frac=0.2)   # sample data
print( df.head())          # added to show first few elements.
print(len(data))           # some statistics on the data... (original size)
print(len(df))            # sampled size

# split data, still no changes...

def trainTestSplit(df):
    train=df.sample(frac=0.8)
    test=df.drop(train.index)

    summary=train.describe(include=[np.number]).loc[['mean','min','max'],:]

    return train,test,summary

(trf, tsf, sum) = trainTestSplit(df)      # run the split function

print(trf)     # print result-sets   - this will show only part of the data as it is quite a huge array (160 or so elements, 40- ish shown).
print(tsf)
print(sum)



So if there are errors, then it is because there is more going on that is not shown in your code fragment.
It is still your code, with some print statements added.

If it complains that frac is an unknown argument, then possibly the file could not be read or is not valid CSV (although it does work on my system);
at the very least, data is not a pandas DataFrame in that case.
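
A quick sanity check for that case:

print(type(data))   # expect <class 'pandas.core.frame.DataFrame'>; anything else explains why sample(frac=...) is rejected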

Author

Commented:
NOCI

Thanks for reaching out.

Going to test now ... will let you know

Author

Commented:
Hi NOCI

I'm getting the following error:

TypeError: sample() got an unexpected keyword argument 'frac'
noci, Software Engineer
CERTIFIED EXPERT
Distinguished Expert 2019

Commented:
#!/usr/bin/env python3
#! previous line added to enforce python 3... when run as an executable.

import pandas as pd
import numpy as np 

data = pd.read_csv('CC1_TrainTestSpit.csv')    # load frame
print(len(data))           # some statistics on the data... (original size)
print(data)

#get 20% data

df=data.sample(frac=0.2)   # sample data
print( df.head())          # added to show first few elements.
print(len(df))            # sampled size

# split data, still no changes...

def trainTestSplit(df):
    train=df.sample(frac=0.8)
    test=df.drop(train.index)

    summary=train.describe(include=[np.number]).loc[['mean','min','max'],:]

    return train,test,summary

(trf, tsf, sum) = trainTestSplit(df)      # run the split function

print(trf)     # print result-sets   - this will show only part of the data as it is quite a huge array (160 or so elements, 40- ish shown).
print(tsf)
print(sum)



Subtle change: I added a print statement right after the read_csv(). Those should give some clue...

Author

Commented:
Hi NOCI

I'm getting an error with the following line

print(len(data))           # some statistics on the data... (original size)

The error is:

TypeError: object of type 'DataFrame' has no len()
aikimark
CERTIFIED EXPERT
Top Expert 2014

Commented:
Where did you put the
data.head()


statement?

Author

Commented:
aikimark,

You asked me that before, and I kindly suggested that I didn't understand what you meant.

Can you please elaborate?

Author

Commented:
aikimark,

To answer your question directly .... I didn't put it anywhere! Because I clearly don't know where to put it. I have never even come across that statement until now...
CERTIFIED EXPERT

Commented:
Just a tiny suggestion for debugging:

Add following lines in the beginning of your script (of noci's last version)

import sys
print(sys.version)
import pandas as pd
print(pd.__version__)
import numpy as np
print(np.__version__)



And send us the output

Author

Commented:
OK, I decided not to read the data using Pandas; instead I used standard Python:

file_location = "/FileStore/tables/CC1_TrainTestSpit.csv"
file_type = "csv"

# CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","

data = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)


I then ran your script as follows:

print(data)

print(len(data))           # some statistics on the data... (original size)

#get 20% data

df=data.sample(frac=0.2)   # sample data
print( df.head())          # added to show first few elements.
print(len(df))            # sampled size

# split data, still no changes...

def trainTestSplit(df):
    train=df.sample(frac=0.8)
    test=df.drop(train.index)

    summary=train.describe(include=[np.number]).loc[['mean','min','max'],:]

    return train,test,summary

(trf, tsf, sum) = trainTestSplit(df)      # run the split function

print(trf)     # print result-sets   - this will show only part of the data as it is quite a huge array (160 or so elements, 40- ish shown).
print(tsf)
print(sum)




However, I still get the error:

TypeError: object of type 'DataFrame' has no len()

The full error is as follows:

TypeError: object of type 'DataFrame' has no len()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<command-4057405164164362> in <module>()
----> 1 print(len(data))           # some statistics on the data... (original size)
      2 print(data)
      3 
      4 #get 20% data
      5 

TypeError: object of type 'DataFrame' has no len()

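As a side note on that error: a Spark DataFrame has no len(); its row count comes from the count() method (assuming data is a Spark DataFrame here, which the Databricks reader returns):

print(data.count())   # Spark's equivalent of len(data) for counting rows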

Author

Commented:
gelonida,

As per my recent comment, I decided to not bother with Pandas for now...

I'm running Python version 3
aikimark
CERTIFIED EXPERT
Top Expert 2014

Commented:
In my earlier comment, I said to put the data.head() statement after the read_csv() statement.  Do you understand that you HAD a statement that invoked the pandas read_csv() method?

Do you understand that "after" means below or following (statement) in your code?

Do you have any coding experience?  If so, how much?

What is your Python experience?  My assumption is that you know Python well enough to tackle data science problems.
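
To make "after" concrete, here is a minimal sketch of the intended placement (file name taken from your own code):

import pandas as pd

data = pd.read_csv('CC1_TrainTestSpit.csv')   # your read_csv() statement
print(data.head())                            # data.head() placed immediately after it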

Author

Commented:
Hi guys

I recoded without Pandas just to show the issue isn’t with Pandas, but more to do with the code

Author

Commented:
aikimark,

Forgive me if I'm misinterpreting your comments, but I can't make out whether you're helping me or just putting me down.

Yes, I do have some coding experience.

I would just appreciate your help please.
noci, Software Engineer
CERTIFIED EXPERT
Distinguished Expert 2019

Commented:
Spark is NOT standard Python; it is just another external import (Apache Spark). You may have conflicting imports.
Not all objects have an implementation of the len() function; pandas DataFrame objects do have one, though.
Instead of editing this inside a bigger thing, please try to run the provided scripts as such...

IMHO aikimark was not putting you down... adding some statements to print or show data after an action is a standard debugging method (I did the same).
The difference is that I made a copy/paste example, while aikimark requested that you add a statement after a certain action... apparently this is not clear to you,
so "some coding experience" seems to exclude basic debugging skills, or English may not be your native language (I am not a native English speaker either).
Again, not putting you down, just an observation.


The csv module is part of the standard library.

A native method for reading csv files would be:
#!/usr/bin/env python3
import csv
import random
data=[];
with open('CC1_TrainTestSpit.csv') as csvfile:
    csvdata = csv.reader(csvfile, dialect='excel', delimiter=',')
    for row in csvdata:
        data.append(row)
    #get 20% data
print(data)
da=random.sample(data, int(len(data)/5))
print( len(da) )

# split data
...


Now this does not produce a DataFrame; it produces a list of rows... and the header is still the first line in the data.
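
Separating the header from the rows is then one extra line (the column names shown are the ones from your data.head() output):

header, rows = data[0], data[1:]   # the first row holds the column names
print(header)                      # e.g. ['discountId', 'price', 'active', 'createdAt']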
CERTIFIED EXPERT

Commented:
@Member_2_7966113

The point of my post is that, in order to reproduce the issue and not fall into some subtle traps, it is very helpful to know the exact version of Python (not just "I'm using Python 3") and of the involved libraries like pandas (or not), numpy, or Spark.

The next thing that is very important is that you execute exactly the same code as the one you posted.

This will allow us to reproduce your errors, and it will allow you to validate our suggestions.
Sometimes a tiny typo, or a line not inserted at the exact same place, can be the reason why things work for the experts and not for you.

Author

Commented:
IMHO aikimark was not putting you down

ok.

Anyway, I entered your code as follows:

#!/usr/bin/env python3
import csv
import random
data=[];
with open('/FileStore/tables/CC1_TrainTestSpit.csv') as csvfile:
    csvdata = csv.reader(csvfile, dialect='excel', delimiter=',')
    for row in csvdata:
        data.append(row)
    #get 20% data
print(data)
da=random.sample(data, int(len(data)/5))
print( len(da) )

# split data
...



But I get the error

FileNotFoundError: [Errno 2] No such file or directory: '/FileStore/tables/CC1_TrainTestSpit.csv'

There must be an error in the way I have entered '/FileStore/tables/CC1_TrainTestSpit.csv', as the file is in that location.
noci, Software Engineer
CERTIFIED EXPERT
Distinguished Expert 2019

Commented:
Case sensitive?...
/Filestore != /filestore on any Unix system. On Windows it doesn't matter.
The same problem may also be in the original:

CC1_TrainTestSplit.csv vs. CC1_TrainTestSpit.csv... (lowercase l missing)?
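
A quick way to check the exact path from within Python (a debugging sketch):

import os
print(os.path.exists('/FileStore/tables/CC1_TrainTestSpit.csv'))   # False means the local filesystem cannot see this path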

Author

Commented:
NOCI,

I don't think that is the problem because the following works fine:

# File location and type
file_location = "/FileStore/tables/CC1_TrainTestSpit.csv"
file_type = "csv"

# CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","


data = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)


Author

Commented:
The full error is as follows:

FileNotFoundError: [Errno 2] No such file or directory: '/FileStore/tables/CC1_TrainTestSpit.csv'
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<command-4363701911678434> in <module>()
      3 import random
      4 data=[];
----> 5 with open('/FileStore/tables/CC1_TrainTestSpit.csv') as csvfile:
      6     csvdata = csv.reader(csvfile, dialect='excel', delimiter=',')
      7     for row in csvdata:

FileNotFoundError: [Errno 2] No such file or directory: '/FileStore/tables/CC1_TrainTestSpit.csv'


aikimark
CERTIFIED EXPERT
Top Expert 2014

Commented:
You've wandered into Linux land, so I'm unable to fully help.  I'll still monitor the question and comment if I can.

I think my 'spidey sense' went off early in this question thread.

Author

Commented:
OK, but I'm not entirely convinced the issue is with Linux - but I could be wrong
noci, Software Engineer
CERTIFIED EXPERT
Distinguished Expert 2019

Commented:
That would be strange, as open() is the only way to open a file; deep down, both pandas and Spark will use the same open() to access a file.

Can you post the COMPLETE script you are using? Even in the Spark version, the imports are (in)conveniently missing.
Did you try to run the script as I put it here, copied/pasted into a file on your system?
(The assumption in my script is that the CC1... csv file is in the current directory.)
I have no Apache Spark on my system and no simple package to install it.

Author

Commented:
OK, here is everything:

# Databricks notebook source
# MAGIC %md
# MAGIC 
# MAGIC ## Overview
# MAGIC 
# MAGIC This notebook will show you how to create and query a table or DataFrame that you uploaded to DBFS. [DBFS](https://docs.databricks.com/user-guide/dbfs-databricks-file-system.html) is a Databricks File System that allows you to store data for querying inside of Databricks. This notebook assumes that you have a file already inside of DBFS that you would like to read from.
# MAGIC 
# MAGIC This notebook is written in **Python** so the default cell type is Python. However, you can use different languages by using the `%LANGUAGE` syntax. Python, Scala, SQL, and R are all supported.

# COMMAND ----------

import pandas as pd 
import numpy as np

# COMMAND ----------

# File location and type
file_location = "/FileStore/tables/CC1_TrainTestSpit.csv"
file_type = "csv"

# CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","


data = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)

# COMMAND ----------

#!/usr/bin/env python3
import csv
import random
data=[];
with open('/FileStore/tables/CC1_TrainTestSpit.csv') as csvfile:
    csvdata = csv.reader(csvfile, dialect='excel', delimiter=',')
    for row in csvdata:
        data.append(row)
    #get 20% data
print(data)
da=random.sample(data, int(len(data)/5))
print( len(da) )

# split data
...

# COMMAND ----------

# MAGIC %sql
# MAGIC 
# MAGIC /* Query the created temp table in a SQL cell */
# MAGIC 
# MAGIC select * from `CC1_TrainTestSpit_csv`

# COMMAND ----------

print(len(data))           # some statistics on the data... (original size)
print(data)

#get 20% data

df=data.sample(frac=0.2)   # sample data
print( df.head())          # added to show first few elements.
print(len(df))            # sampled size

# split data, still no changes...

def trainTestSplit(df):
    train=df.sample(frac=0.8)
    test=df.drop(train.index)

    summary=train.describe(include=[np.number]).loc[['mean','min','max'],:]

    return train,test,summary

(trf, tsf, sum) = trainTestSplit(df)      # run the split function

print(trf)     # print result-sets   - this will show only part of the data as it is quite a huge array (160 or so elements, 40- ish shown).
print(tsf)
print(sum)

# COMMAND ----------

#!/usr/bin/env python3
#! previous line added to enforce python 3... when run as an executable.

import pandas as pd
import numpy as np 

print(len(data))           # some statistics on the data... (original size)
print(data)

#get 20% data

df=data.sample(frac=0.2)   # sample data
print( df.head())          # added to show first few elements.
print(len(df))            # sampled size

# split data, still no changes...

def trainTestSplit(df):
    train=df.sample(frac=0.8)
    test=df.drop(train.index)

    summary=train.describe(include=[np.number]).loc[['mean','min','max'],:]

    return train,test,summary

(trf, tsf, sum) = trainTestSplit(df)      # run the split function

print(trf)     # print result-sets   - this will show only part of the data as it is quite a huge array (160 or so elements, 40- ish shown).
print(tsf)
print(sum)

# COMMAND ----------


CERTIFIED EXPERT

Commented:
Perhaps Spark is doing some weird magic with the case of the filenames (though it would surprise me).

The case of filenames is important.

Just asking some (perhaps stupid) questions:

Did you execute both scripts as the same user on the same machine?
Can you start the script from within the same terminal?

Can you type in a terminal (ideally the same one from which you launched the working and failing Python scripts) the following four commands and share the output (copy/paste):
id
ls -l /FileStore/tables/CC1_TrainTestSpit.csv
ls -ld /FileStore/tables/
ls -ld /FileStore/


CERTIFIED EXPERT

Commented:
Just one more question:
reading the comment at the top of the file
Databricks notebook source



Do you run one of the scripts in a Jupyter notebook (on a notebook server)?

Is this notebook server perhaps located on a different machine?

Adding the following line will show you whether the code of both scripts is really executed on the same machine with the same privileges:
import os ; os.system("uname -a ; id") 


I know that one should normally use subprocess.Popen, but this is just for debugging

Author

Commented:
gelonida,

Thanks for getting in touch. I can confidently say the issue isn't with filenames in this case.

If the problem were with filenames, the following would fail:

file_location = "/FileStore/tables/CC1_TrainTestSpit.csv"
file_type = "csv"

# CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","

# The applied options are for CSV files. For other file types, these will be ignored.
data = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)


Author

Commented:
Hi gelonida, trust me, the problem isn't as you suggested...
noci, Software Engineer
CERTIFIED EXPERT
Distinguished Expert 2019

Commented:
OK, this explains a lot... (I needed to google some terms...) Apparently this is in a Databricks Apache Spark environment,
meaning some things are arranged FOR you by the framework. I have no such stuff set up.
(I know pip install sparkmagic will get me some of the tools, alas not all.)

So this is more a sparkmagic/Databricks question than a Python question.
This seems to run in a Databricks environment, which may be on another system... explaining missing files.
Also, if a file were missing I would get errors and stack dumps from Python; the sparkmagic framework may very well suppress those.

Now an answer to your question:

Get the CC1... .csv file into your current directory (you may need to install pandas on your system as well),
paste my script from https://www.experts-exchange.com/questions/29158512/Python-Challenging-Code-Question.html#a42945427 into the file read_csv.py,
and run the script with:  python read_csv.py
Then you will have your answer on PYTHON.
Then report on the results please.


It won't answer the implied question of why the Databricks environment doesn't run as expected.

Some remarks:
#! (shebang) lines ONLY work when running on Unix/Linux, and only if they are the first line of a file;
that is how the image loader knows what to do with the #!... otherwise it is just a comment.

The scripts I presented are to be run as standalone scripts, ONLY requiring Python, not a complete distributed data analysis system.
So if you run Python and paste my script into it, it will run and read a data file in the current directory.

Apparently there is some DataFrame class in the Spark environment that has no len() method.
You can remove the print(len(data)) line, as it only prints the number of elements (rows) a pandas DataFrame has.

Bare Python is different from an environment that embeds Python as a method... the latter most probably also includes all kinds of extras and makes all kinds of assumptions that we can't know.
The Databricks framework will provide some means to translate filenames to some different location where a "virtual" store can be, whereas the bare Python open() will not.
The spark object will know about those; pandas may or may not.
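
For what it's worth: the path that spark.read resolves is a DBFS path, not a local filesystem path. On a standard Databricks workspace (an assumption here), DBFS is also exposed to local file APIs under the /dbfs mount, so a sketch like the following may work where the bare path does not:

# spark.read resolves DBFS paths directly:
data = spark.read.csv('/FileStore/tables/CC1_TrainTestSpit.csv', header=True, inferSchema=True)

# plain Python open() sees the driver's local filesystem, where DBFS is usually mounted under /dbfs:
with open('/dbfs/FileStore/tables/CC1_TrainTestSpit.csv') as csvfile:
    print(csvfile.readline())   # first line of the raw file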

Author

Commented:
OK, NOCI, what platform are you running the code on?

Also, can you give me a printout of the results when you run the code please?
CERTIFIED EXPERT

Commented:
@Member_2_7966113

I don't trust anybody, not even myself.

I learnt that discarding a cause because it "can't be" the cause is not necessarily a good idea.

With one script Python finds a file with a given name and can open it;
in another script it can't open it.

Either the file name is not identical, or the user opening the file is not identical, or the machine the code is executed on is not identical.
It is essential not to trust, but to have output from the script confirming whether or not this is the cause.

So please add the following line to both scripts, the working one and the failing one:

import os ; os.system("uname -a ; id ; ls -l /FileStore/tables/CC1_TrainTestSpit.csv ; ls -ld /FileStore/tables/ ; ls -ld /FileStore/") 


and show us the output.

I don't want to be right. I just want to be 100% sure that this is not the cause.
noci, Software Engineer
CERTIFIED EXPERT
Distinguished Expert 2019

Commented:
I have a plain Linux system (I tried on a Gentoo system; Debian, CentOS, etc. should all work).
Installed is Python 3.6 (and Python 2.7 for some older stuff),
plus the pandas environment (and all dependencies).


The print out also is in: https://www.experts-exchange.com/questions/29158512/Python-Challenging-Code-Question.html#a42945427
bottom half.

Author

Commented:
Hi NOCI,

I think your solution actually works.

I just need to figure out why I'm getting the following error on my platform

df=data.sample(frac=0.2)   # sample data
AttributeError: 'list' object has no attribute 'sample'


Author

Commented:
The full error is:

AttributeError: 'list' object has no attribute 'sample'
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<command-4057405164164362> in <module>()
      4 #get 20% data
      5 
----> 6 df=data.sample(frac=0.2)   # sample data
      7 #print( df.head())          # added to show first few elements.
      8 #print(len(df))            # sampled size

AttributeError: 'list' object has no attribute 'sample'


noci, Software Engineer
CERTIFIED EXPERT
Distinguished Expert 2019

Commented:
A list has no sample method/attribute.

That is why I used random.sample(list, n),
which takes a list and pulls n random samples from it:
...
print(data)
da=random.sample(data, int(len(data)/5))
print( len(da) )
...


Also, there is no frac argument... 0.2 = 1/5; the second parameter needs to be an int, so the division result is truncated.

As it is a list, you cannot drop values based on content... etc., so the train/test split code needs adjustment as well.

You may be able to install pandas by running pip install pandas; that would help with using the original code.
(If you have no sudo/root rights it may need to be: pip install --user pandas.)
You may also need a virtual environment (python3 -m venv) to use the pip-installed packages.
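
For completeness, one way the split could be adjusted for plain lists (a sketch; the 80/20 cut mirrors the question's ratio):

import random

def train_test_split_list(rows, train_frac=0.8):
    shuffled = rows[:]                        # copy, so the original order is kept
    random.shuffle(shuffled)                  # shuffle the copy in place
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]     # (train, test)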

Author

Commented:
gelonida,

The code provided by NOCI appears to be working; however, I entered the code you suggested and I get the following result:

Out[3]: 512

Author

Commented:
Hi NOCI,

Your solution worked.
Just so you know, I now understand why I got the error 'AttributeError: 'list' object has no attribute 'sample''.

The equivalent in Spark for sample is randomSplit. When I used that instead of sample, it worked.

Going to mark your solution as resolved.

Thanks sooooooooooooooooo much man
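
For the record, a minimal PySpark sketch of the whole exercise (the 0.8/0.2 weights and the seed are illustrative assumptions; randomSplit normalizes whatever weights it is given, so the question's .8/.3 would also be accepted):

from pyspark import StorageLevel

def trainTestSplit(df):
    sampled = df.sample(withReplacement=False, fraction=0.2)    # 20% sample without replacement
    trainDF, testDF = sampled.randomSplit([0.8, 0.2], seed=42)  # weighted random split
    trainDF.persist(StorageLevel.DISK_ONLY)                     # persist the training set to disk only
    statsDF = trainDF.summary("mean", "min", "max")             # DataFrame of mean/min/max statistics
    return (trainDF, testDF, statsDF)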
noci, Software Engineer
CERTIFIED EXPERT
Distinguished Expert 2019

Commented:
Ok,

In the future, when posting a question, please ALSO mention the runtime environment; yours is not "plain sailing" Python.
Spark provides a LOT more than just a Python runtime.
CERTIFIED EXPERT

Commented:
@Member_2_7966113

It's not the 512, but the lines above it that I was interested in.
They should have looked something like:
Linux mymachine 4.4.0-161-generic #189-Ubuntu SMP Tue xxxx ...  x86_64 x86_64 GNU/Linux
uid=1000(username) gid=1000(groupname) groups=1000(groupname),4(adm)
ls: cannot access '/FileStore/tables/CC1_TrainTestSpit.csv': No such file or directory
ls: cannot access '/FileStore/tables/': No such file or directory
ls: cannot access '/FileStore/': No such file or directory



If you did not see some lines that look like my example above, then perhaps your Databricks/Spark framework behaves differently from Python (capturing stdout/stderr instead of displaying it).

Also, I asked you to add those lines to both scripts and execute them in both the working and the failing script.
You seem to have copied and pasted the lines into ipython (or a notebook), which does not help to find the difference between the working and failing script.

In any case I'm glad that things are working now for you.

However, in order to get a faster response next time, I strongly suggest making sure that some information is already in your first post:
your Python version and the versions of the involved libraries.

This can easily be done by adding a few lines at the beginning of your script, like for example
import sys
print(sys.version)
import pandas as pd
print(pd.__version__)
import numpy as np
print(np.__version__)




Also make sure that the code is complete enough that others on EE can execute it and try to reproduce your issue,
and post the exact error message / output of your script.


Happy coding and enjoy python / Spark

Author

Commented:
Hello noci,

Can you let me know what the following line of code attempts to achieve:

test=df.drop(train.index)
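
In pandas, df.drop(train.index) returns the rows of df whose index labels were not drawn into train, i.e. the complement that becomes the test set. A minimal sketch:

import pandas as pd

df = pd.DataFrame({'x': range(10)})
train = df.sample(frac=0.8)    # 80% of the rows, keeping their original index labels
test = df.drop(train.index)    # drop those labels; the remaining 20% become the test set
assert len(train) + len(test) == len(df)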
