
Python for analysing data over 300,000,000 rows and 20 columns

I would like to analyse/describe data that is big, over 300,000,000 rows. Do you think Python can help me do this? How efficiently can I do this?

Dr. Klahn, Principal Software Engineer

Commented:
Python is a "less than optimal" choice for an application like this.

The reason is that Python is an interpreted language, not a compiled language.  Given the nature of interpreted languages, Python must run slower than an equivalent compiled program that does the same job.  At a guess it would run at best 1/3 the speed of a compiled program.

If the intended application needs to handle a 300 million row table, you probably want to get all the performance possible out of the application.  C or C++ are excellent possibilities if the data involves multiple data types, and if it is purely numeric number crunching it is hard to beat good old FORTRAN.
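
For a rough sense of the gap being described, here is a scaled-down Python sketch that compares a pure interpreter loop with the same column sums done in compiled code via NumPy. It assumes purely numeric data, and the exact ratio will depend on hardware and workload.

import time
import numpy as np

rows, cols = 200_000, 20            # scaled-down stand-in for 300M x 20
data = np.random.rand(rows, cols)

# Pure-Python loop: every addition goes through the interpreter.
start = time.perf_counter()
totals = [0.0] * cols
for row in data.tolist():
    for j, value in enumerate(row):
        totals[j] += value
print("interpreted loop:", time.perf_counter() - start, "s")

# The same column sums done in compiled code (NumPy's C internals).
start = time.perf_counter()
totals_np = data.sum(axis=0)
print("compiled sum:   ", time.perf_counter() - start, "s")

On typical hardware the compiled sum finishes far faster than the interpreted loop, which is the kind of difference being discussed.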
David Favor, Fractional CTO
Distinguished Expert 2019
Commented:
For this large data set, start with PHP, if you prefer an interpreted language.

Be sure you have OPcache set up correctly, so your PHP is pre-compiled into cached bytecode.

That said... you're likely still going to have problems.

You might also check out Perl, which can be pre-compiled into an ELF binary executable. There are also Perl precompilers which convert Perl to C or C++, which can then be run through GCC or Clang/LLVM with options that heavily optimize loops.

Keep in mind that with data of this size you must write highly optimized loops. Do some searching on how to optimize loops.

Even if you start with an interpreted language, you'll likely have to switch over to C or C++ eventually.

Trick: Perl has a well-documented + easy facility for linking in binary libraries at runtime. In the past, when I've hit situations like yours, I use Perl to do all the work except the brute-force looping through data, which I push off to a custom C library, sometimes with only one function... which loops through the data and does the transforms, then I have Perl rewrite any data back to the database using DBI + transactions.

Unsure exactly how this works with MongoDB or how you'd arrange to rewrite your data; if rewriting is required, it will be far slower (disk-speed I/O) than looping through the data (memory-speed I/O).
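
A rough Python analogue of that pattern, since the question is about Python: keep the orchestration in the scripting language and push the brute-force loop into compiled code, with NumPy standing in for the custom C library. The table, column names, and the SQLite stand-in below are hypothetical; substitute your real driver and schema.

import sqlite3
import numpy as np

# Demo setup: a tiny in-memory table standing in for the real data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (c1 REAL, c2 REAL, c3 REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?, ?)",
                 np.random.rand(10_000, 3).tolist())

def transform(chunk: np.ndarray) -> np.ndarray:
    # The "inner loop": executed by NumPy's compiled routines, not the interpreter.
    return (chunk - chunk.mean(axis=0)) / (chunk.std(axis=0) + 1e-12)

cur = conn.execute("SELECT c1, c2, c3 FROM readings")
while True:
    rows = cur.fetchmany(1_000)       # stream in chunks instead of loading every row
    if not rows:
        break
    result = transform(np.asarray(rows, dtype=np.float64))
    # ...here you would write `result` back inside a transaction (cf. DBI + transactions)...

conn.close()

The per-chunk fetchmany keeps memory bounded, and the same structure works with any DB-API driver.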
Anthony Matovu, Business Analyst, MTN Uganda

Author

Commented:
Thank you very much. What if I summarise and reduce the number of rows to not more than 12,000,000 using SQL Server or any DBMS before I do the extraction to where I will need Python? I want to use Python for analysis. Do you think this will help?
David Favor, Fractional CTO
Distinguished Expert 2019
Commented:
12M is still a lot of records, whether you're using Python or C or C++.

The primary overhead in this operation is database I/O, which is constant/independent of the language you use.

Just to read 12M rows will take a lot of time.

To rewrite 12M... a very long time...

Tip: You may find reading all rows (no index) will actually be quicker. You'll just have to test + determine what works best.
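
One way to test this from the Python side is to stream the pre-aggregated table in chunks so only one chunk is in memory at a time. The sketch below assumes the table sits in SQL Server and is reachable through SQLAlchemy/pyodbc; the connection string, table name, and chunk size are placeholders to tune.

import pandas as pd
from sqlalchemy import create_engine

# Placeholders: adjust the connection string, table name, and chunk size.
engine = create_engine(
    "mssql+pyodbc://user:password@server/db?driver=ODBC+Driver+17+for+SQL+Server")

rows_read = 0
for chunk in pd.read_sql_query("SELECT * FROM summary_table", engine,
                               chunksize=500_000):
    rows_read += len(chunk)           # replace with the real per-chunk analysis,
    # e.g. accumulate sums/counts here and combine them after the loop
print("rows read:", rows_read)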
noci, Software Engineer
Distinguished Expert 2019
Commented:
Some math: 300M rows * 20 columns * 8 bytes (floats) = ~48 GB of raw data, with no overhead counted for keeping it in rows/columns, especially in an interpreted language. That overhead adds up: the control structures for 20 columns x 300M rows also keep pointers to the cells themselves. During loading you may need twice this size to allow for resizes.

R (the statistical system: https://www.r-project.org/ ) might be better equipped to handle this. It is more or less interpreted and you will still need massive amounts of memory, but it has optimized primitives and is designed to handle large data sets.

Now if the data can be processed row by row... you only need 20-ish data cells (say, to sum it all up). Then it is only the time to process 300M or 12M rows that takes some time. So IT REALLY DEPENDS on what you want to do.
(BTW, 12M is only about one order of magnitude lower than 300M.)
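
To illustrate the row-by-row case: a minimal sketch that streams a CSV export once and keeps only ~20 running accumulators, so memory use stays constant no matter how many rows pass through. The file name and column layout are assumptions.

import csv

# Streams the file once, keeping only one row plus ~20 accumulators in memory.
sums = None
count = 0
with open("export.csv", newline="") as f:     # hypothetical CSV export of the table
    reader = csv.reader(f)
    next(reader)                              # skip the header row
    for row in reader:
        values = [float(x) for x in row]
        if sums is None:
            sums = [0.0] * len(values)
        for j, v in enumerate(values):
            sums[j] += v
        count += 1

means = [s / count for s in sums]
print(count, "rows; column means:", means)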