Big Data

101 Solutions, 256 Contributors

Big data describes data sets that are so large or complex that traditional data processing applications are inadequate. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, querying and information privacy. The term often refers simply to the use of predictive analytics or certain other advanced methods to extract value from data, and seldom to a particular size of data set.


How does big data affect operations in manufacturing?

Hi,

What support and features does DB2 offer for data scientists and big data work?
Hi,

Has anyone used PolyBase on MS SQL Server with Hadoop? Does the PolyBase scale-out feature work reliably, and does its load balancing work well?
Hi,

For a big data and data science solution on MS SQL Server, how many pieces of the puzzle do we need? I know SQL Server includes R Services/R Server in its installation, but I doubt a single server can do the whole job. What does the full picture look like?

Does each component carry a separate cost?
Hi,

Does Oracle have a graph data model, so that expensive operations like inner joins can be expressed as graph traversals and the join itself bypassed? I believe MS SQL Server sometimes uses this approach for big data queries.

Also, is Oracle trying to apply any additional technology, such as GPUs, to speed up big data queries?
Hi,

Which Oracle components/applications are used for big data analysis?

What is the Oracle product for AI and data science research?

When I look at this page:

https://www.oracle.com/artificial-intelligence/platform.html

it says:

Additional libraries and tools include: Jupyter, pandas, scikit-learn, Pillow, OpenCV, and NumPy.
Deep learning frameworks include: TensorFlow, Keras and Caffe.
Elastic AI and Machine Learning Infrastructures include NVIDIA, Flash Storage, and Ethernet.

Which of these components are free and work with Oracle DB without issues?
Hi,

Any good references (URLs or books) on designing, debugging, and architecting a data lake for big data and data science?
I’m trying to wrap my head around another key challenge at the moment: increasing my AI’s accuracy in identifying demographic gender groups from 70% to 80%. Currently the accuracy is 70%.

I’m running a Data Management Platform (DMP) that collects demographic tag information, and this information is saved in Treasure Data.

I’m thinking of setting some rules for the AI to better differentiate gender. For example, e-commerce data parameters could tell the AI that visitors who have viewed cosmetics have a higher chance of being female, while gadget viewers have a higher chance of being male.

Do you think these are good ideas/methods? If yes, why? If no, any better suggestions?

The data I have at the moment is not very large, a database of about 100,000 users, and I’m not able to grow it any further right now.

It would be great if anyone could advise on how best to approach this challenge.

Many thanks in advance!
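One way to make the rule idea above concrete is a simple scoring function over viewed categories. This is only a sketch: the category names, weights, and thresholds below are hypothetical placeholders, not Treasure Data fields, and real accuracy gains would need to be validated against labeled data.

```python
# Minimal sketch of the rule-based gender scoring idea described above.
# Category names and weights are hypothetical, not actual DMP fields.
FEMALE_HINTS = {"cosmetics": 0.8, "fashion": 0.6}
MALE_HINTS = {"gadgets": 0.7, "power_tools": 0.6}

def score_gender(viewed_categories):
    """Return 'female', 'male', or 'unknown' from a visitor's viewed categories."""
    female = sum(FEMALE_HINTS.get(c, 0.0) for c in viewed_categories)
    male = sum(MALE_HINTS.get(c, 0.0) for c in viewed_categories)
    if female > male:
        return "female"
    if male > female:
        return "male"
    return "unknown"
```

Rules like this are easy to audit, but they encode assumptions; measuring each rule's precision on a held-out labeled sample before trusting it would be prudent.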
I’m trying to wrap my head around a key challenge at the moment: increasing my AI’s accuracy in identifying demographic age groups to within plus or minus 5 years.

The current AI settings only allow categorization into age bands of 20-24, 25-29, ... 55-59, and 60 and above, not plus or minus an age.

I’m running a Data Management Platform (DMP) that collects demographic tag information, and this information is saved in Treasure Data.

I’m thinking of collapsing the bands into 10-year groups (for example, combining 20-24 and 25-29 into one group) and feeding that to the AI to try to increase accuracy, instead of feeding in the full raw CSV for the AI to compute.

Do you think these are good ideas/methods? If yes, why? If no, any better suggestions?

The data I have at the moment is not very large, a database of about 100,000 users, and I’m not able to grow it any further right now.

It would be great if anyone could advise on how best to approach this challenge.

Many thanks in advance!
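The band-collapsing step described above can be sketched as a small mapping function. The band labels ("20-24", "60+", etc.) are assumptions based on the bands listed in the question; adjust to match the actual labels in the export.

```python
# Sketch: collapse the existing 5-year bands into 10-year groups before
# feeding them to the model. Band label format is an assumption.
def to_ten_year_group(band):
    """Map a 5-year band label like '25-29' to a 10-year group like '20-29'."""
    if band == "60+":
        return "60+"  # open-ended top band stays as-is
    low = int(band.split("-")[0])
    decade = (low // 10) * 10
    return f"{decade}-{decade + 9}"
```

Coarser groups trade resolution for more examples per class, which can help with only ~100,000 users; whether it actually improves accuracy is an empirical question best answered with a held-out test set.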
Hi, I need to know whether it's possible to create a live data warehouse, one where data is transferred immediately whenever new records are added or existing records change.

There is a partitioned Hive table (Table 1) with 3 columns and one partition column, batch_date.

I'm trying to execute:

INSERT INTO TABLE PARTITION(batch_date='2018-02-22') select column 1, column 2, column 3 from Table 1 where column 1 = "ABC";

It returns zero records, and in HDFS it creates 3 empty files.

Can you please suggest a solution for preventing these small empty files from being created in HDFS?

Note: before running the INSERT statement I set the Hive properties below.

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
I have some data I am trying to normalize/weight. I have 4 regions, the number of people with missing training certificates, and the number of people in each region. Originally I was going to divide the number of missing certificates by the number of people in each region to normalize the data. However, the resulting numbers look really small, like 100/40000. I don't really want to graph such small numbers, but I need some way to account for headcount. Should I multiply by 100 and report the data as "per 100 employees"? Would that make sense?

Any other suggestions?
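The "per 100 employees" rate suggested above is a standard normalization and can be sketched in a few lines. The region names and counts below are illustrative, not from the actual data.

```python
# Sketch of the 'per 100 employees' rate discussed above.
# Region names and counts are illustrative placeholders.
def per_100(missing, headcount):
    """Missing training certificates per 100 employees in a region."""
    return 100.0 * missing / headcount

regions = {"North": (100, 40000), "South": (55, 12500)}
rates = {name: per_100(m, n) for name, (m, n) in regions.items()}
# 100 missing out of 40,000 employees becomes 0.25 per 100 employees,
# which is directly comparable across regions of different sizes.
```

If even the per-100 rates are awkwardly small, per 1,000 or per 10,000 employees works the same way; the key is stating the denominator on the chart.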
Hello Experts,

I have created the following Hadoop Hive script.

The script attempts to store the results of a query at the following location:

LOCATION 'wasb://adfgetstarted@geogstoreacct.blob.core.windows.net/sampleout/';

However, I keep getting the following error:

FAILED: ParseException line 9:0 Failed to recognize predicate 'ROW'. Failed rule: 'identifier' in table or column identifier
18/01/30 16:08:06 [main]: ERROR ql.Driver: FAILED: ParseException line 9:0 Failed to recognize predicate 'ROW'. Failed rule: 'identifier' in table or column identifier
org.apache.hadoop.hive.ql.parse.ParseException: line 9:0 Failed to recognize predicate 'ROW'. Failed rule: 'identifier' in table or column identifier

The Hive script is as follows:

DROP TABLE IF EXISTS geography;
CREATE EXTERNAL TABLE geography
(
 anonid INT,
 eprofileclass INT,
 fueltypes STRING,
 acorn_category INT,
 acorn_group STRING,
 acorn_type INT,
 nuts4 STRING,
 lacode STRING,
 nuts1 STRING,
 gspgroup STRING,
 ldz STRING,
 gas_elec STRING,
 gas_tout STRING
)
ROW FORMAT DELIMITED
 FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'wasb://adfgetstarted@geogstoreacct.blob.core.windows.net/sampleout/'
TBLPROPERTIES ("skip.header.line.count" = "1");

Create table acorn_category_frequency
 as
select acorn_category,
 count(*) as acorn_categorycount
from geography
group by acorn_category,
ROW FORMAT DELIMITED
 FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'wasb://adfgetstarted@geogstoreacct.blob.core.windows.net/sampleout/';

Can someone please help figure out where I'm going wrong in the script?

Thanks
Hello Experts,

I would like to run a query on the attached file, but I don't know what type of information is in the file, so I can't write a query against it.

Can someone let me know how to determine what information the file contains?

Regards

Carlton
VANQ_TRIAD_COLLS_20180118
Hello Community,

I have created my first HQL script (see below), but I can't get any data to appear. I recently installed the Sandbox; the installation comes with a few sample databases, and I'm using the one called sample_07 as a guide for my own .hql code.

My HQL code is as follows:

CREATE EXTERNAL TABLE mysample
(
 code STRING,
 description STRING,
 total_emp INT,
 salary INT
)
ROW FORMAT DELIMITED
 FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/root/music'
TBLPROPERTIES ("skip.header.line.count" = "1");

However, when I query it from a Zeppelin notebook with the following code, I can see the table but no data appears:

%jdbc(hive)
select * from mysample limit 14

When I run the same query against the sample database sample_07, both the table and its data appear.

I'm sure there is something very simple that I'm missing. Can someone please let me know where I'm going wrong?
Hello Experts,

I have run an HQL script called samplehive.hql (see attached). However, the script fails with the following error:

FAILED: ParseException line 1:2 cannot recognize input near 'D' 'R' 'O'
18/01/17 20:46:46 [main]: ERROR ql.Driver: FAILED: ParseException line 1:2 cannot recognize input near 'D' 'R' 'O'
org.apache.hadoop.hive.ql.parse.ParseException: line 1:2 cannot recognize input near 'D' 'R' 'O'

I'm very new to Hadoop Hive. Can someone take a look at the script and let me know where I'm going wrong?

Thanks
samplehive.txt
I have been asked to move data dated before 2004. Is there an easy way of doing this, without going through each folder, sorting by created date, and then moving files manually?
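A scripted approach can avoid the folder-by-folder sorting described above. This is a sketch only: the paths are placeholders, it uses the last-modified timestamp (true creation time is OS-dependent), and it should be tried on a copy of the data first.

```python
# Sketch: move files last modified before 2004 into an archive folder,
# preserving the directory structure. Paths are placeholders; note this
# checks modification time, not creation time.
import os
import shutil
import time

CUTOFF = time.mktime((2004, 1, 1, 0, 0, 0, 0, 0, -1))  # midnight, 1 Jan 2004

def move_old_files(src_root, archive_root):
    for dirpath, _dirnames, filenames in os.walk(src_root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) < CUTOFF:
                rel = os.path.relpath(path, src_root)
                dest = os.path.join(archive_root, rel)
                os.makedirs(os.path.dirname(dest), exist_ok=True)
                shutil.move(path, dest)
```

On Windows, `robocopy` with a date filter is a common no-code alternative for the same job.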
I'm interested in using Visual Studio in the field of big data and artificial intelligence.

At the moment the latest version of Visual Studio is 2017. When is the next version due out?

What machine spec is needed for it to run smoothly in terms of processor, RAM, and disk space (and anything else that is relevant)?

I found that with the Express edition I could not use StreamWriter. Is this expected?
I have a dataset with an account number and a "days past due" status on every observation. For each account number, as soon as the "days past due" column hits a code like "DLQ3", I want to remove the rest of the rows for that account (even if DLQ3 is the first observation for that account).

My dataset looks like:

Observation date   Account num   Days past due
2016-09            200056        DLQ1
2016-09            200048        DLQ2
2016-09            389490        NORM
2016-09            383984        DLQ3

So for account 383984, I want to remove all rows after the date 2016-09, as the account is now in default.

In short: I want to find when each account first hits DLQ3 and remove all rows after that first DLQ3 observation.
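The filtering rule described above amounts to a single pass over the data once it is sorted by date within each account. This sketch uses plain tuples rather than any specific tool, since the question doesn't say which one is in use; the row layout is an assumption.

```python
# Sketch of the rule above: for each account, keep observations up to and
# including its first DLQ3, and drop everything after. Rows are assumed to be
# (observation_date, account_num, days_past_due) tuples, already sorted by
# observation date within each account.
def drop_after_first_dlq3(rows):
    defaulted = set()  # accounts that have already hit DLQ3
    kept = []
    for date, account, status in rows:
        if account in defaulted:
            continue  # account defaulted earlier; drop this row
        kept.append((date, account, status))
        if status == "DLQ3":
            defaulted.add(account)
    return kept
```

In SAS or pandas the same logic is usually expressed as a per-account cumulative flag that marks every row at or after the first DLQ3, then filtering on that flag.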

Big Data Projects
The pressure to deliver "more for less" is increasing day by day across all industries and business sectors. But with a solid understanding of technologies like big data, your company can extract even more value.
My business is exploring the option of recoding our item codes, as the current scheme is inconsistent. Ideally, going forward, we would like only one serial number generated per item, with the serial number the same as the item number.

Is this possible, and what impact would it have on the business?

Thanks
Hello,

I am new to Hadoop. I have a question regarding YARN memory allocation. If we have 16 GB of memory in the cluster, we can have at least three 4 GB containers and keep 4 GB for other uses. If a job needs 10 GB of RAM, would it use three containers, or would it use one container and start consuming the rest of the RAM?
Hello guys,

We would like to keep our Hadoop prod, dev, and QA environments in sync, with standard settings and configurations. What is the best practice for keeping them the same, given that we have 100+ data nodes in prod but only 8 nodes in dev and 8 nodes in QA?
Hi,

What is the difference between MariaDB ColumnStore 1.0 and MS SQL Server with SSIS + SSAS?
Dear all,

I have video and audio files that I need to segment based on their text. I need to segment all the files; for example, a single word corresponds to n audio frames and n visual frames (images).

Can anyone help or advise on how I can do this?

Thanks
