Big Data

84

Solutions

4

Articles & Videos

214

Contributors

Big data describes data sets that are so large or complex that traditional data processing applications are inadequate. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, querying and information privacy. The term often refers simply to the use of predictive analytics or certain other advanced methods to extract value from data, and seldom to a particular size of data set.

I have a large number of PDF documents from which I need to extract text for further processing. I did this for a small subset of documents using the Tesseract API, processing them one after another, and I get the required output. However, this takes a very long time when I have a large number of documents.

I tried to use the Hadoop environment's processing capabilities (MapReduce) and storage (HDFS) to solve this issue. However, I am having trouble integrating the Tesseract API into the Hadoop MapReduce approach. As Tesseract converts the files into intermediate image files, I am confused as to how the intermediate image files produced by the Tesseract API can be handled inside HDFS.

I have searched and unsuccessfully tried a few options, such as:

    I extracted text from PDFs with Hadoop MapReduce by extending the FileInputFormat class into my own PdfInputFormat class, using Apache PDFBox for the extraction. For scanned PDFs that contain images, however, this solution does not give me the required results.

    I found a few answers on the same topic suggesting either to use FUSE, or to generate the image files locally and then upload them into HDFS for further processing. I am not sure whether this is the correct approach.

I would like to know what approaches exist for this.
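
One approach that may fit, sketched here under assumptions rather than taken from a tested setup: run the job as a Hadoop Streaming mapper that receives one HDFS PDF path per input line, copies the PDF to a local scratch directory on the task node, rasterizes and OCRs it there, and emits only the text. The intermediate images then never need to live in HDFS. The sketch assumes the `hdfs`, `pdftoppm` (Poppler) and `tesseract` command-line tools are installed on every worker node. The function names, the 300 DPI setting and the tab-separated output are illustrative choices.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def fetch_cmd(hdfs_path, local_pdf):
    # Copy one PDF out of HDFS onto the task node's local disk.
    return ["hdfs", "dfs", "-get", hdfs_path, str(local_pdf)]


def rasterize_cmd(local_pdf, prefix, dpi=300):
    # pdftoppm writes one PNG per page, named <prefix>-<page number>.png.
    return ["pdftoppm", "-r", str(dpi), "-png", str(local_pdf), str(prefix)]


def ocr_cmd(image):
    # "stdout" as the output base makes tesseract print the recognized text.
    return ["tesseract", str(image), "stdout"]


def page_number(image):
    # "page-07.png" -> 7, so pages are ordered numerically.
    return int(image.stem.rsplit("-", 1)[1])


def ocr_pdf(hdfs_path, run=subprocess.run):
    """Return the OCR text of one PDF; images exist only in a temp directory."""
    with tempfile.TemporaryDirectory() as scratch:
        local_pdf = Path(scratch) / "input.pdf"
        prefix = Path(scratch) / "page"
        run(fetch_cmd(hdfs_path, local_pdf), check=True)
        run(rasterize_cmd(local_pdf, prefix), check=True)
        pages = []
        for image in sorted(Path(scratch).glob("page-*.png"), key=page_number):
            result = run(ocr_cmd(image), check=True, capture_output=True, text=True)
            pages.append(result.stdout)
        return "\n".join(pages)


def main(stream=sys.stdin, run=subprocess.run):
    # Hadoop Streaming feeds input lines on stdin; here each line is an HDFS path.
    for line in stream:
        hdfs_path = line.strip()
        if not hdfs_path:
            continue
        text = ocr_pdf(hdfs_path, run=run)
        # Emit key<TAB>value; collapse whitespace so each record stays on one line.
        print(hdfs_path + "\t" + " ".join(text.split()))
```

Calling `main()` at the bottom of the script would turn it into the mapper. The input would then be a text file listing the PDF paths, and the job could run with zero reducers if the text only needs to be collected.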
0
Hey,

I have an audio file, many actually, each an interview between an interviewer and an interviewee. The same person asks the questions in each file, while the people answering are different.

I need to separate out the answers by generating silence over the interview questions. I'm currently doing this by hand with Audacity, but it is extremely time-consuming.

Any help would be greatly appreciated. I am a software developer, but audio is not my area, so code is an option if there isn't a program available.

Thanks
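
If code is an option, the silencing step itself is simple. The sketch below only covers that step: it assumes uncompressed PCM WAV input and assumes the time spans where the interviewer speaks are already known (from manual labels or from a separate speaker-detection tool, which is not shown). The function name `silence_intervals` and the interval format are illustrative.

```python
import wave


def silence_intervals(src_path, dst_path, intervals):
    """Copy a PCM WAV file, replacing each (start_s, end_s) span with silence."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frames = bytearray(src.readframes(params.nframes))

    frame_size = params.sampwidth * params.nchannels
    # 8-bit WAV samples are unsigned, so silence is 0x80; wider samples use 0.
    quiet = b"\x80" if params.sampwidth == 1 else b"\x00"

    for start_s, end_s in intervals:
        first = max(0, int(start_s * params.framerate))
        last = min(params.nframes, int(end_s * params.framerate))
        if last > first:
            span = (last - first) * frame_size
            frames[first * frame_size:last * frame_size] = quiet * span

    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)
        dst.writeframes(bytes(frames))
```

For example, `silence_intervals("in.wav", "out.wav", [(0.0, 12.5), (40.0, 55.2)])` would mute the first 12.5 seconds and the span from 40.0 to 55.2 seconds while leaving the rest untouched.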
0
https://www.google.com/search?biw=1918&bih=974&tbs=dur%3Al&tbm=vid&q=taxi+mafia+-android+-walkthrough+-gameplay+-game+-%22video+game%22+-playstation+-xbox&oq=taxi+mafia+-android+-walkthrough+-gameplay+-game+-%22video+game%22+-playstation+-xbox&gs_l=serp.3...14103.18629.0.19563.19.18.0.0.0.0.113.1362.16j2.18.0....0...1.1.64.serp..1.0.0.KTXCbd4TVOY


Copy that search into a browser.
It is a Google video search restricted to videos longer than 20 minutes.

I am looking for a mob documentary about taxi drivers (mob as in Tony Soprano from the HBO TV show "The Sopranos").
Please don't completely change the search to

Tony Soprano taxi cab

to find Tony Soprano riding in a taxi cab, because this is an algorithm question.

Tony Soprano riding in a taxi cab is an example of a correct answer, but it isn't the only correct answer to the question, so please don't edit the search too much.


All the results are video game playthroughs:
people filming themselves playing Xbox, PlayStation or Nintendo.

Seems like all the specific searches are something else.
So this is more of a big data question.

I usually just watch netflix.com because it is easier than something specific.
Most people just watch regular tv because tv is easy to turn on

This question is a puzzle; I am not looking for a link explaining the meaning of the + and - symbols.
So a better overview of Google custom search terms is not the answer.
google-puzzle
Note: this question is not homework and is not a puzzle for a job interview -YET.…
0
Dublin Tech Summit
One event, two days, a great line-up of speakers, and 48% female presence. Still have no idea what I’m talking about?
1
hi,

I am reading the introduction to Oracle GoldenGate:

http://www.oracle.com/technetwork/middleware/goldengate/overview/index.html

"

Oracle GoldenGate is a comprehensive software package for real-time data integration and replication in heterogeneous IT environments. The product set enables high availability solutions, real-time data integration, transactional change data capture, data replication, transformations, and verification between operational and analytical enterprise systems. Oracle GoldenGate 12c brings extreme performance with simplified configuration and management, tighter integration with Oracle Database, support for cloud environments, expanded heterogeneity, and enhanced security.

In addition to the Oracle GoldenGate core platform for real-time data movement, Oracle provides the Management Pack for Oracle GoldenGate—a visual management and monitoring solution for Oracle GoldenGate deployments—as well as Oracle GoldenGate Veridata, which allows high-speed, high-volume comparison between two in-use databases.
"

So it is for ETL and replication, but what is Oracle GoldenGate for Big Data? GoldenGate is not for big data, right?

Please share your ideas.
0
Hi Wizards, I think everyone hears about Bitcoin every day now. So how has your experience with it been so far? We have 4-5 free servers; can we use them to mine a few cents? ;-)

Any recommendations for procedures and setup are appreciated. Many thanks as always.
0
Regular Gmail, not G Suite. One label.

Gmail label

I only want emails from one sender to reach the inbox:
admin@ee.com

All the other emails are not important.

Is there a Gmail filter that uses the word NOT?
0
Hi All,

What performance suggestions and tweaks would you recommend for a very large VM deployment?

I've got one VM running the Tableau application, which processes data from multiple SQL Server databases and crunches the numbers before presenting them to the executive management team.

The specs:

16x vCPU
112 GB vRAM
1 TB D:\ as Thin Provisioned VMDK on VMFS 5

Somehow it is running slower every month. What are the best-practice recommendations for deploying such a large VM?

Any tips and suggestions would be greatly appreciated.
0
Could I see the do-not-call list?


How can individuals know which numbers not to call if they can't see the list?


I am not sure which zone this question should be in so please add zones.
0
We have a table that stores dates as numbers (double), e.g. 20170417.
We would like to store these values in a date field, preferably in the format YYYY-MM-DD.
What's the most efficient way to accomplish this?
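
The post does not say which database or tool is involved, so the following is only a language-neutral illustration written in Python: turn the double into an 8-digit string, then parse it with a YYYYMMDD format. The helper name `yyyymmdd_to_date` is made up for this sketch. Most SQL dialects can do the same in one expression by casting the number to an integer, then to a string, and parsing that string as a date.

```python
from datetime import datetime


def yyyymmdd_to_date(value):
    """Convert a number such as 20170417.0 into a datetime.date."""
    # Round first so a double like 20170416.9999999 cannot truncate to the wrong day.
    digits = str(int(round(value)))
    if len(digits) != 8:
        raise ValueError("expected an 8-digit YYYYMMDD number, got %r" % (value,))
    # strptime validates the calendar date and raises ValueError on e.g. 20170231.
    return datetime.strptime(digits, "%Y%m%d").date()


print(yyyymmdd_to_date(20170417.0).isoformat())  # prints 2017-04-17
```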
0
Hi,

I am trying to find a sample dataset of (cloud) storage server file access logs for my research project. Can anyone please suggest ideas or places to find this type of sample file? I think something like an FTP server log dataset would work, because my project focuses on file access, not web page access.

Thanks in advance.
0
Hi,
A couple of years ago, our client developed a "Document Management" system for their own use (it has specific business rules).
Currently, they have 10 million documents and 8 TB of information.

They currently have the system running on 2 platforms (both perform slowly):
1. Windows Environment (Windows Server 2012 R2, MS SQL Server 2012 R2 and IIS)
2. Linux (Red Hat Linux 6, mySQL and Apache)

As you can guess, managing this system has become terribly difficult, for 2 main reasons:

1. Displaying 'search results' or 'document reports' (lists of documents and their properties) takes more than 30 minutes (on employees' computers).
2. Backups have to be done in several steps, and the night is not long enough to make a full backup (on employees' computers).

So they have asked us, as developers, to improve their system.
They have also asked us to propose a new platform for the improved Document Management system.

We have done our research on Google, but we are not satisfied with what seems to be the new platform, so I would like to receive your recommendations or suggestions about it.

Our initial thinking is that the following should do the job just fine:
- Amazon Elastic (filesystem)
- Amazon DynamoDB (database)
- Apache Hadoop (web server)
- php/laravel (programming)

Your comments are very welcome.
Thanks a lot.
0
Hoping to get some opinions.

Plain and simple. I would say 90% of our data stored on the network are PST files created from Outlook. Now when I say 90%, I am talking about hundreds and hundreds of GB, maybe even a TB of data that consists of purely PST files.

What do other companies do to combat people "needing" to save 10 years of email history? I know one of my options is buy more storage, but want to know what other options are out there, or what other people are doing.
0
I've loaded several months of data to Hive using SAS. I have confirmed that everything loaded successfully and can query the data with no issues.
However, when I move to using an Impala editor (local here Hue/Hadoop) and refresh/update the tables, I get an error when running the following query: SELECT * from data_table LIMIT 1000.

The error is:
Your query has the following error(s):
IllegalStateException: null

It seems that it can't see the table.
Any ideas?
0
How do I use the rand() function to divide a data set into 3 parts? Randomness for the purpose of statistical data analysis.
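
The usual idea behind a rand()-based split is to give every row a uniform random number, order the rows by it, and cut the ordered list into thirds. The post does not name the tool, so here is a minimal Python sketch of the same idea, where shuffling stands in for "sort by rand()". The function name `split_three_ways` and the optional `seed` parameter are illustrative, and the seed is only there to make a split reproducible.

```python
import random


def split_three_ways(rows, seed=None):
    """Shuffle rows and cut them into three nearly equal random parts."""
    shuffled = list(rows)
    # Shuffling is equivalent to sorting the rows by an independent random key.
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    first_cut, second_cut = n // 3, 2 * n // 3
    return shuffled[:first_cut], shuffled[first_cut:second_cut], shuffled[second_cut:]


part_a, part_b, part_c = split_three_ways(range(10), seed=42)
print(len(part_a), len(part_b), len(part_c))  # prints 3 3 4
```

Every row lands in exactly one part, and when the row count is not divisible by 3 the part sizes differ by at most one.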
1
Hi

I am putting together a presentation to the business on the pros of creating a 360 customer view using our data. Does anyone have any information I can include that may help please?

Thank you
0
I've been searching for non-JavaScript charts/graphs to display MySQL data. We currently use NVD3, but that is becoming a problem when we try to integrate our software with other products.

I need at least 5 or 6 leads showing the possibilities for creating nice-looking charts/graphs to represent MySQL data without needing any JS behind them at all. Obviously, we'll have to build the intermediate layer between the charts/graphs and MySQL, but first I'm trying to find out whether any such solutions exist.

I've come across a few HTML5 things, but nothing that is really definitive and truly usable today. I'm looking for any and all alternatives to JS that can show nice charts/graphs.
0
Choosing an appropriate provider for your company’s email security can be difficult as email security is a key element in the overall security of a business. A company’s email is an open door for malicious hackers who can potentially drive your business into the ground with one rogue email. There are a few options with regards to a solution but it all depends on picking the right fit for your business. This blog post is dedicated to the options businesses face when choosing their email security.


Appliance or Software Solutions - On Premise


Generally, the most popular choice for a company is an on-premise appliance or software solution. These appliances are great for focusing on certain aspects of email security such as data privacy and spam and virus protection. This option is also easy to install and not very expensive. However, these solutions require a lot of hands-on updates, can be quite slow to update, and do not come with the benefits of real-time threat intelligence or the big data analysis that a cloud provider can perform across its entire network. These appliances operate by themselves and require occasional attention from the owner of the software.


Hosted Email Security - the OEM way


Another option for businesses is a hosted email security service, where the provider simply hosts an appliance or application to take some management out of the customer’s hands. This option would be considered appealing as the customer …
0
Hi, I am trying to use something similar to a VLOOKUP in DAX and am not able to get it working.

I've attached an example workbook.

In table "Sheet1" I have a column named "BucketID", which is generated by a formula (from dates being completed or not, giving me a string of 1s and 0s). I am trying to take that string of numbers and look it up in table "BucketID" by matching the tableID and returning the corresponding text. (This output/formula will be in column "Bucket".)

Can anyone help me?

Attached is the example
Example.xlsx
0
I am trying to migrate SQL Server data to Google BigQuery. The table is 285 GB in size. How can I migrate it?
0
I have a couple of tables in a database that are using the LONG data type. It's a horrible data type and we need to get rid of it. The question is what do I replace it with? I've narrowed it down to LOB, BLOB, or CLOB but the differences between those seem subtle and I can't quite figure out which is the best choice and why. Some of the LONG fields are storing bitmap image data. Others are storing HTML markup text. I'm OK with using a different data type for each of those. I just need to get rid of the LONG.
0
Big data transfers via information superhighways require special attention and protection. Learn more about the IT-regulations of the country where your server is located. Analyze cloud providers and their encryption systems for safe data transit. Set in-house rules for preventing data loss.
0
Is it actually possible for a transportation freight broker, or even a manufacturer, to create a predictive analytics tool accurate enough to predict what truckload capacity will cost in any given lane for freight that moves in, say, 1 to 6 months?

For example:

Atlanta, GA to Chicago, IL
Everything is average as far as weight, product and equipment. The current cost for a truckload carrier to haul is $1000. What will this same move cost in 3 months?


My point is that in freight hauling there are too many variables involved, so each time I read that some transportation freight company advertises a rate prediction tool, I ask myself whether this is possible to predict.
0
I have a 3-node DataStax Cassandra (Community) cluster with a huge amount of data. A few tables contain 3-5 billion records each. I want to delete data older than 90 days from those tables.

The problem is how to run a select query that does not time out. I am currently running the query below:

NOW=$(date -d "-3 month" +"%Y-%m-%d")
select day_ts from table_name where minute_ts < '$NOW' LIMIT 100000 ALLOW FILTERING;


Even if I limit the select query result, it still parses all 3-5 billion records and then filters the data.

Please suggest an efficient way to do this.
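
One possible direction, sketched under a strong assumption: if `day_ts` is the partition key and holds one value per day in YYYY-MM-DD form (the post does not show the schema, so this may not hold), the expired days can be enumerated on the client and deleted one partition at a time, which avoids the ALLOW FILTERING scan entirely. The helper names and the `oldest` start date are hypothetical.

```python
from datetime import date, timedelta


def expired_days(today, oldest, keep_days=90):
    """Yield every day from `oldest` up to, but excluding, the retention cutoff."""
    cutoff = today - timedelta(days=keep_days)
    day = oldest
    while day < cutoff:
        yield day
        day += timedelta(days=1)


def delete_statements(table, today, oldest, keep_days=90):
    # One DELETE per partition key value (assumes day_ts is the partition key),
    # so each statement touches a single partition instead of scanning the table.
    return [
        "DELETE FROM {} WHERE day_ts = '{}';".format(table, day.isoformat())
        for day in expired_days(today, oldest, keep_days)
    ]


for statement in delete_statements("table_name", date(2017, 4, 17), date(2017, 1, 15)):
    print(statement)
```

The generated statements could then be fed to cqlsh or a driver in small batches. For data written from now on, setting the table's `default_time_to_live` option may remove the need for a periodic delete job altogether.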
0
Hi Experts,

I want to enroll in a big data course. While searching for a course I found many, such as:
1- Big data schools (BDSCP).
2- EMC Data Science Associate (EMCDSA).
3- Cloudera Certified Associate (CCA).
4- MCSE: Business Intelligence.
Others...

Which do you recommend I start with, considering these factors:
1- I'm a beginner in the big data field.
2- I have basic programming experience (in Microsoft technologies only).
3- I need the certificate most in demand in the market.
4- It can be studied online, without needing to attend a training center.


Thanks a lot in advance.
Harreni
0
