Python for analysing data over 300,000,000 rows and 20 columns

Anthony Matovu (Business Analyst, MTN Uganda) asked:
I would like to analyse/describe a large dataset of over 300,000,000 rows and 20 columns. Can Python help me do this, and how efficiently can I do it?



Dr. Klahn (Principal Software Engineer) commented:
Python is a "less than optimal" choice for an application like this.

The reason is that Python is an interpreted language, not a compiled one. Given the nature of interpreted languages, Python will run slower than an equivalent compiled program doing the same job; at a guess, at best a third of the speed.

If the intended application needs to handle a 300-million-row table, you probably want to get all the performance possible out of it. C or C++ are excellent possibilities if the data involves multiple data types, and if it is purely numeric crunching it is hard to beat good old FORTRAN.
David Favor (Linux/LXD/WordPress/Hosting Savant) commented:
For a data set this large, start with PHP if you prefer an interpreted language.

Be sure you have OPcache set up correctly, which will pre-compile your PHP into cached opcodes.

That said... you're likely still going to have problems.

You might also check out PERL, which can be pre-compiled into an ELF binary executable. There are also PERL precompilers that convert PERL to C or C++, which can then be compiled with GCC, Clang, or LLVM using options that aggressively optimize loops.

Keep in mind that with data of this size you must write highly optimized loops; do some searching on how to optimize loops.

Even if you start with an interpreted language, you'll likely have to switch over to C or C++ eventually.

Trick: PERL has a well-documented, easy facility for linking in binary libraries at runtime. In the past, when I've hit situations like yours, I've used PERL to do all the work except the brute-force looping through the data, which I push off to a custom C library, sometimes with only one function: it loops through the data and does the transforms, and then PERL writes any data back to the database using DBI + transactions.

I'm unsure exactly how this works with MongoDB or how you'd arrange to rewrite your data, but if rewriting is required it will be far slower (disk-speed I/O) than looping through the data (memory-speed I/O).
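For what it's worth, the same split works from Python. A minimal, untested sketch, assuming a hypothetical one-function C library compiled as libsum.so (Python orchestrates, C does the brute-force loop):

# sum_column.c -- compile with: gcc -O3 -shared -fPIC -o libsum.so sum_column.c
#
#   double sum_column(const double *values, long n) {
#       double total = 0.0;
#       for (long i = 0; i < n; i++)
#           total += values[i];
#       return total;
#   }

import ctypes

lib = ctypes.CDLL("./libsum.so")                    # load the compiled C library
lib.sum_column.argtypes = [ctypes.POINTER(ctypes.c_double), ctypes.c_long]
lib.sum_column.restype = ctypes.c_double

values = (ctypes.c_double * 5)(1.0, 2.0, 3.0, 4.0, 5.0)   # toy stand-in for one column
print(lib.sum_column(values, len(values)))                 # 15.0 -- the loop ran in C

With real data you would pass the actual column buffer rather than a toy array; the point is only that the interpreted language coordinates while the compiled code does the tight loop.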

Anthony Matovu (Author; Business Analyst, MTN Uganda) commented:
Thank you very much. What if I summarise and reduce the number of rows to not more than 12,000,000 using SQL Server or any other DBMS before extracting to where I will need Python? I want to use Python for the analysis. Do you think this will help?
David Favor (Linux/LXD/WordPress/Hosting Savant) commented:
12M is still a lot of records, whether you're using Python or C or C++.

The primary overhead in this operation is database I/O, which is constant and independent of the language you use.

Just to read 12M rows will take a lot of time.

To rewrite 12M... a very long time...

Tip: You may find reading all rows (no index) will actually be quicker. You'll just have to test + determine what works best.
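
To illustrate the "summarise in the DBMS, then pull only the result into Python" route the author asks about, here is a rough, untested sketch (the connection string, the table name sales, and the columns region/amount are placeholders, not from this thread):

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical SQL Server connection; adjust driver and credentials to your setup.
engine = create_engine(
    "mssql+pyodbc://user:password@server/database"
    "?driver=ODBC+Driver+17+for+SQL+Server"
)

# Let the database do the heavy GROUP BY so only the reduced summary
# crosses the wire into Python.
query = """
    SELECT region,
           COUNT(*)    AS n_rows,
           SUM(amount) AS total_amount,
           AVG(amount) AS avg_amount
    FROM sales
    GROUP BY region
"""

summary = pd.read_sql(query, engine)   # only the aggregated rows are transferred
print(summary.head())

The aggregation itself still costs database time, but the amount of data Python then has to hold and describe shrinks dramatically.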
noci (Software Engineer) commented:
Some math: 300M rows * 20 columns * 8 bytes (floats) = ~48 GB of raw data, with no overhead yet for keeping it in rows/columns, which matters especially with interpreted data.
This adds up: the control structures for 20 columns x 300M rows also keep pointers to the cells themselves.
During loading you may need twice this size to allow for resizes.
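
For reference, that raw-size arithmetic in plain Python:

# Size of the raw values alone, ignoring Python object and container overhead.
rows, cols, bytes_per_float = 300_000_000, 20, 8
raw_bytes = rows * cols * bytes_per_float
print(raw_bytes / 10**9)   # 48.0 -> about 48 GB before any overhead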

R (the statistical system: https://www.r-project.org/ ) might be better equipped to handle this. It is also more or less interpreted, and you will still need massive amounts of memory.
But it has optimized primitives and is designed to handle big data sets.

Now, if the data can be processed row by row, you only need 20-ish data cells (say, to sum it all up);
then it is only the time to process the 300M or 12M rows that takes a while. So it REALLY DEPENDS on what you want to do.
(By the way, 12M is only one order of magnitude lower.)
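
To make the row-by-row idea concrete, here is a minimal sketch that streams the rows and keeps only 20 running sums, so memory stays constant regardless of row count. It assumes, purely for illustration, that the table has been exported to a file called data.csv with a header row and 20 numeric columns:

import csv

with open("data.csv", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)              # column names
    sums = [0.0] * len(header)         # one accumulator per column
    count = 0
    for row in reader:                 # one row in memory at a time
        for i, value in enumerate(row):
            sums[i] += float(value)
        count += 1

means = [s / count for s in sums]      # per-column means from the running sums
print(dict(zip(header, means)))

Whether the rows come from a CSV export, a database cursor, or a MongoDB cursor, the pattern is the same: never hold more than one row (or one small chunk) in memory at a time.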
