Big Data

113 Solutions, 278 Contributors

Big data describes data sets that are so large or complex that traditional data processing applications are inadequate. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, querying and information privacy. The term often refers simply to the use of predictive analytics or certain other advanced methods to extract value from data, and seldom to a particular size of data set.

Please advise: what is the best site for a paid, self-paced online big data course?
0

HIVE: I need to create a report that has a pipe-delimited header row containing the column names, while the data rows are comma-delimited. I'm new to the Hive/HDFS space and trying to determine the best way to do this.

Maybe I just need a way to merge files; I could not get that to work either. If you have questions on this, please reply. Thanks. (A sketch of one possible merge approach appears at the end of this post.)

Some details:

hdfs dfs -cat /tmp/i777/CAHM_NEW/000000_0
MONTH_END_DATE|RESOURCE_ID|RESOURCE_NAME|RESOURCE_TYPE|COST_CENTER|CHARGE_CODE|VENDOR_NAME|PROJECT_ID|Z_CODE|PROJECT_HOURS|COST|RATE|IS_ACCRUAL

The header line above was created with the following:
INSERT OVERWRITE DIRECTORY '/tmp/i777/CAHM_NEW/'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE
SELECT
'MONTH_END_DATE' AS  MONTH_END_DATE
, 'RESOURCE_ID'   AS    RESOURCE_ID
, 'RESOURCE_NAME' AS    RESOURCE_NAME
, 'RESOURCE_TYPE' AS    RESOURCE_TYPE
, 'COST_CENTER'   AS    COST_CENTER
,'CHARGE_CODE'    AS   CHARGE_CODE
,'VENDOR_NAME'    AS   VENDOR_NAME
,'PROJECT_ID'     AS   PROJECT_ID
,'Z_CODE'         AS   Z_CODE
,'PROJECT_HOURS' AS   PROJECT_HOURS
, 'COST'          AS   COST
, 'RATE'          AS   RATE
, 'IS_ACCRUAL'    AS  IS_ACCRUAL
FROM  someTable  LIMIT 1;

hdfs dfs -cat /tmp/i777/CAHM_NEW/000000_0

\N,George,Washington, David,E,\N,EXPENSE,\N,P000000046,\N,5,375,75,1
\N,George,Washington, David,E,\N,EXPENSE,\N,P000000046,\N,75,0,375,5
\N,George,Washington, David,E,\N,EXPENSE,\N,P000000046,\N,75,0,600,8
\N,George,Washington, David,E,\N,EXPENSE,\N,P000000046,\N,8,600,75,1
\N,George,Washington, David,E,\N,EXPENSE,\N,P00000014,\N,4,300,75,1
\N,George,Washington, David,E,\N,EXPENSE,\N,P00000014,\N,75,0,300,4
\N,George,Washington, David,E,\N,EXPENSE,\N,P00000014,\N,75,0,600,8
\N,George,Washington, David,E,\N,EXPENSE,\N,P00000014,\N,8,600,75,1
\N,George,Washington, Robert,E,\N,EXPENSE,\N,P00000014,\N,4,300,75,1
\N,George,Washington, Robert,E,\N,EXPENSE,\N,P00000014,\N,75,0,300,4
\N,George,Washington, David,E,\N,EXPENSE,\N,P000000046,\N,5,375,75,1
\N,George,Washington, David,E,\N,EXPENSE,\N,P000000046,\N,75,0,375,5
\N,markjames@abcdcom, Katz, Greg,NA,\N,EXPENSE,\N,P00000014,\N,10,900,90,1
\N,markjames@abcd.com,Katz, Greg,NA,\N,EXPENSE,\N,P00000014,\N,11,990,90

The data rows above were created with the following:
create table csv_dump_data ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
LOCATION '/tmp/i777/CAHM_NEW/' as
SELECT
CAST(MO_END_DT AS STRING) AS  MONTH_END_DATE
, RSRC_ID       AS    RESOURCE_ID
, RSRC_NM       AS    RESOURCE_NAME
, RSRC_TYP_CD   AS    RESOURCE_TYPE
, CSTCTR_CD     AS    COST_CENTER
, CHRG_CD       AS   CHARGE_CODE
, VEND_NM       AS   VENDOR_NAME
, PROJ_ID       AS   PROJECT_ID
, Z_CD          AS   Z_CODE
, CAST(PROJ_HR_QTY AS STRING) PROJECT_HOURS
, CAST(RSRC_COST_AMT AS STRING) COST
, CAST(RSRC_RT_AMT AS STRING ) RATE
, ACCRUL_IND    AS   IS_ACCRUAL
FROM  someTable;
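
A minimal sketch of one way to build the final report, assuming the pipe-delimited header and the comma-delimited data rows are first written to two separate HDFS directories (the directory and file names below are hypothetical; only hdfs dfs -getmerge, hdfs dfs -put, and standard shell tools are used):

import subprocess

# Hypothetical paths: the header and the data rows are assumed to be written
# to separate HDFS directories rather than the same one.
HEADER_DIR = "/tmp/i777/CAHM_HEADER"
DATA_DIR = "/tmp/i777/CAHM_DATA"
LOCAL_REPORT = "/tmp/cahm_report.txt"

def run(cmd):
    # Run a shell command and fail loudly on a non-zero exit code.
    subprocess.run(cmd, shell=True, check=True)

# Pull each directory down as a single local file...
run(f"hdfs dfs -getmerge {HEADER_DIR} {LOCAL_REPORT}.header")
run(f"hdfs dfs -getmerge {DATA_DIR} {LOCAL_REPORT}.data")

# ...concatenate them, header line first...
run(f"cat {LOCAL_REPORT}.header {LOCAL_REPORT}.data > {LOCAL_REPORT}")

# ...and push the merged report back to HDFS (destination is hypothetical).
run(f"hdfs dfs -put -f {LOCAL_REPORT} /tmp/i777/CAHM_REPORT/")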

0
I would like to understand whether pandas (Python) can be used to convert Oracle PL/SQL code to Python (accessing Hive), to be ported later into a Spark (big data) environment for development.

Should this be done using PySpark for execution in the Spark environment?
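
A minimal sketch of the Spark side, assuming a SparkSession with Hive support is available; the table and column names in the query are hypothetical stand-ins for logic that currently lives in a PL/SQL procedure:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("plsql-to-spark-sketch")
         .enableHiveSupport()   # lets spark.sql() see the Hive metastore
         .getOrCreate())

# Set-based PL/SQL logic usually maps directly onto Spark SQL.
df = spark.sql("""
    SELECT dept_id, SUM(salary) AS total_salary
    FROM   hr.employees
    GROUP  BY dept_id
""")

# pandas is still handy, but only for small result sets pulled to the driver.
summary = df.toPandas()
print(summary.head())

On the pandas side, a connector such as pyhive can pull Hive query results into a DataFrame for prototyping, but pandas runs on a single machine, so keeping the heavy, set-based logic in Spark SQL/PySpark avoids a second rewrite when the code moves to the cluster.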
0
Hi all, is there any software out there that operates on all platforms, is reasonably priced, and lets me store all the data, photos, job history, and passwords for my aging clients who can't remember anything?

Merry Christmas and Happy New Year, and thanks for all the help throughout the years.
Damian.
0
I have a relatively new company in the UK; it's a technology business that offers custom software, big data, and analytics services. I am quite new to marketing. Where should I list the company to connect it with the UK and US markets?

The answer may include classifieds or other places to list the company. It may also be a process (for example, LinkedIn prospecting) for reaching out to individual companies.
0
Why do companies purchase a product known as a Complex Event Processor (TIBCO StreamBase CEP, IBM InfoSphere CEP) or download an open-source one (Siddhi, Esper)?

I understand why companies use real-time analytics in general to make sense of real-time data streams, but I don't understand why CEP specifically.

"Complex event processing (CEP) uses patterns to detect composite events in streams of tuples."
CEP also joins many streams and finds patterns across them as a whole.

But I don't get why one would use CEP and not Spark. Is there a use case that explains this?
0
I have a public API endpoint from which I pull a JSON file every 30 minutes. Right now I use a Python pandas DataFrame to pull the data and upload the file to a Cloud Storage bucket, then send it to Pub/Sub to process and load into BigQuery. The problem is that the file name stays the same, and even though I have the GCS text stream to Pub/Sub set up, once it reads the file it never reads it again, even when the file contents have changed. My question: can anyone help me with code that will pull from an API web link and stream the data directly to Pub/Sub? (A sketch of one option follows the sample code below.)

Sample code below:
import json
import pandas as pd
from sodapy import Socrata
from io import StringIO
import datalab.storage as gcs
from google.oauth2 import service_account

client = Socrata("sample.org", None)
results = client.get("xxx")

# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results, columns =['segmentid','street','_direction','_fromst','_tost','_length','_strheading','_comments','start_lon','_lif_lat','lit_lon','_lit_lat','_traffic','_last_updt'])
# send results to GCP
gcs.Bucket('test-temp').item('data.json').write_to(results_df.to_json(orient='records', lines=True),'text/json')
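
A minimal sketch of publishing straight to Pub/Sub instead of staging a file in Cloud Storage, assuming the google-cloud-pubsub client library is installed and the topic already exists; the project and topic names are hypothetical, and the Socrata endpoint and dataset id are the placeholders from the code above:

import json

from google.cloud import pubsub_v1
from sodapy import Socrata

PROJECT_ID = "my-project"        # hypothetical
TOPIC_ID = "api-records"         # hypothetical

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)

client = Socrata("sample.org", None)
results = client.get("xxx")      # list of dicts, one per record

# Publish each record directly; with no intermediate file in GCS there is no
# fixed file name for the text-stream job to skip on later runs.
for record in results:
    data = json.dumps(record).encode("utf-8")
    future = publisher.publish(topic_path, data=data)
    future.result()              # raises if the publish failed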
0
Hi, I would like to compare any two text files. These files may be somewhat big, and I would like to do this using Spark DataFrames. A sample requirement is shared below.

Ideally, take the first record from file1 and search the entire file2; it should bring back all matched occurrences and export them to an output file with the proper flag (update/delete/insert/same), using PySpark, as in the sketch after the sample data below. All other records from file1 should follow the same approach.

DataSet1 - (file1.txt)
NO  DEPT NAME   SAL
1   IT  RAM     1000    
2   IT  SRI     600
3   HR  GOPI    1500    
5   HW  MAHI    700

DataSet2 - (file2.txt)
NO  DEPT NAME   SAL
1   IT   RAM    1000    
2   IT   SRI    900
4   MT   SUMP   1200    
5   HW   MAHI   700

Output Dataset - (outputfile.txt)
NO  DEPT NAME   SAL FLAG
1   IT  RAM     1000    S
2   IT  SRI     900     U
4   MT  SUMP    1200    I
5   HW  MAHI    700     S
3   HR  GOPI    1500    D

Thanks
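
A minimal PySpark sketch of the comparison, assuming both files are tab-delimited with a header row, NO is the key, and (for brevity) only SAL is checked for changes; a real job would compare every non-key column:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("file-compare-sketch").getOrCreate()

df1 = spark.read.csv("file1.txt", header=True, sep="\t").alias("a")   # old
df2 = spark.read.csv("file2.txt", header=True, sep="\t").alias("b")   # new

# A full outer join keeps keys that exist in only one of the two files.
joined = df1.join(df2, F.col("a.NO") == F.col("b.NO"), "full_outer")

result = joined.select(
    F.coalesce("b.NO", "a.NO").alias("NO"),
    F.coalesce("b.DEPT", "a.DEPT").alias("DEPT"),
    F.coalesce("b.NAME", "a.NAME").alias("NAME"),
    F.coalesce("b.SAL", "a.SAL").alias("SAL"),
    F.when(F.col("a.NO").isNull(), "I")             # only in file2 -> insert
     .when(F.col("b.NO").isNull(), "D")             # only in file1 -> delete
     .when(F.col("a.SAL") == F.col("b.SAL"), "S")   # in both, unchanged
     .otherwise("U")                                # in both, changed -> update
     .alias("FLAG"),
)

result.coalesce(1).write.mode("overwrite").csv("outputfile", header=True, sep="\t")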
0
Hi dears,
I installed the BDE plugin in vSphere 6.5 with the default settings. When I try to create a cluster, I get an error:
Failed to query IP address
In our lab environment, I installed and configured a DNS server and a DHCP server on the same host, and they work perfectly.
I expect the nodes to take their IPs and have their DNS entries set automatically according to our configuration, but that doesn't happen.
I suspect the nodes send their IP requests to DHCP while they don't yet have a name (the node's host name), so DHCP can't update the DNS entry.
How can I create a BD cluster with hundreds of nodes automatically using DHCP and DNS?
Thanks in advance.
error-bde1.jpg
0
Hello, what type of storage architecture would be most useful for Kafka and ZooKeeper running in a Kubernetes, containerized environment? Would GlusterFS, which manages storage persistence, replication, resiliency, and HA, be overkill when the distributed big data services do this on their own anyway? We're debating internally why GlusterFS would be needed for distributed big data systems like Cassandra, Mongo, Kafka, ZooKeeper, etc., and are trying to keep the architecture simpler. Just wondering if anyone has thoughts or experience to share that would help us.
0

Hi, I'm looking for information on data center growth in Australia, in particular publications on industry trends covering value and growth in Australia.
Anything that discusses the isolation of Perth would also be beneficial. Thanks.
0
Hi, I'm building a big data streaming pipeline that takes streams from a camera through Kinesis to trigger a Lambda function. The Lambda function then uses AWS machine learning to detect objects; the images are stored in S3 and their metadata in DynamoDB. My problem is that the first frame of the video is being stored in S3 and DynamoDB repeatedly (the same image is stored over and over). Here is the Lambda code (the main function), with a sketch of a possible fix after it:

def process_image(event, context):

    #Initialize clients
    rekog_client = boto3.client('rekognition')
    s3_client = boto3.client('s3')
    dynamodb = boto3.resource('dynamodb')

    s3_bucket = ...
    s3_key_frames_root = ...

    ddb_table = dynamodb.Table(...)
    rekog_max_labels = ...
    rekog_min_conf = float(...)
    label_watch_list = ...
    label_watch_min_conf = ...

    #Iterate on frames fetched from Kinesis
    for record in event['Records']:
        
        frame_package_b64 = record['kinesis']['data']
        frame_package = cPickle.loads(base64.b64decode(frame_package_b64))
        img_bytes.append(frame_package["ImageBytes"])
        frame_count = frame_package["FrameCount"]

        rekog_response = rekog_client.detect_labels(
            Image={
                'Bytes': img_bytes
            },
            MaxLabels=rekog_max_labels,
            MinConfidence=rekog_min_conf
        )

        #Iterate on rekognition labels. Enrich and prep them for storage in DynamoDB
        labels_on_watch_list = []
        

0
How does big data affect operations in manufacturing?
0
I'm trying to wrap my head around another key challenge at the moment: increasing the AI's accuracy at identifying demographic gender groups from 70% to 80%. Currently the accuracy of this information is at 70%.

I'm running a Data Management Platform (DMP) which collects demographic tag information, and this information is saved in Treasure Data.

I'm thinking of setting some rules for the AI so it can better differentiate gender; for example, e-commerce data parameters that tell the AI that visitors who have viewed cosmetics have a higher chance of being female, while gadget viewers have a higher chance of being male (see the sketch below).

Do you think these are good ideas/methods? If yes, why? If no, any better suggestions please?

The data I have at the moment is not very large: an average database of 100,000 users. I'm not able to grow this number any further at the moment.

It would be great if anyone could also advise on how best to approach the challenges above and help shed some light on them.

Many thanks in advance!
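
A minimal sketch of turning such rules into extra model features rather than hard overrides, assuming per-user page-view events with a category field can be exported from the DMP (all names here are hypothetical):

import pandas as pd

# Toy page-view data standing in for the Treasure Data export.
views = pd.DataFrame({
    "user_id":  [1, 1, 2, 3],
    "category": ["cosmetics", "skincare", "gadgets", "cosmetics"],
})

FEMALE_HINTS = {"cosmetics", "skincare"}
MALE_HINTS = {"gadgets", "electronics"}

# Per-user counts of category views that hint at each gender.
features = views.assign(
    female_hint=views["category"].isin(FEMALE_HINTS).astype(int),
    male_hint=views["category"].isin(MALE_HINTS).astype(int),
).groupby("user_id")[["female_hint", "male_hint"]].sum()

print(features)

Feeding these counts to the existing classifier as additional features lets the model learn how much weight each hint deserves on your own 100,000 users, rather than hard-coding the rule as an override.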
0
I'm trying to wrap my head around a key challenge at the moment: increasing the AI's accuracy at identifying demographic age groups to within plus or minus 5 years.

The current AI settings only allow categorization into age groups of 20-24, 25-29, ... 55-59, and 60 and above, not plus or minus.

I'm running a Data Management Platform (DMP) which collects demographic tag information, and this information is saved in Treasure Data.

I'm thinking of grouping the data into 10-year bands, for example 20-24 and 25-29 as one group, and then feeding that to the AI to try to increase accuracy, instead of feeding in the full raw CSV for the AI to compute (see the sketch below).

Do you think these are good ideas/methods? If yes, why? If no, any better suggestions please?

The data I have at the moment is not very large: an average database of 100,000 users. I'm not able to grow this number any further at the moment.

It would be great if anyone could also advise on how best to approach the challenges above and help shed some light on them.

Many thanks in advance!
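
A minimal pandas sketch of the coarser age bands, assuming the raw export has a numeric age column (the frame and column names here are hypothetical):

import pandas as pd

# Toy frame standing in for the DMP export.
df = pd.DataFrame({"user_id": [1, 2, 3, 4], "age": [23, 31, 47, 62]})

bins = [20, 30, 40, 50, 60, 120]
labels = ["20-29", "30-39", "40-49", "50-59", "60+"]
df["age_band"] = pd.cut(df["age"], bins=bins, labels=labels, right=False)

# Training on age_band instead of the raw 5-year buckets gives the model
# fewer, better-populated classes for a 100,000-user dataset.
print(df["age_band"].value_counts())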
0
Hi, I need to know whether it's possible to create a live data warehouse, i.e., one where data is transferred immediately when new records are added or existing ones change.
0
There is a partitioned Hive table (Table 1) with 3 columns and 1 partition column (batch_date).

I'm trying to execute:  INSERT INTO TABLE PARTITION(batch_date='2018-02-22')  select column 1, column 2, column 3 from  Table 1 where column 1 = "ABC";

It returns zero records, and in HDFS it creates 3 empty files.

Can you please suggest a solution for preventing these small files from being created in HDFS?


Note: before running the INSERT statements I set the Hive properties below.

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
0
Hello Experts,

I have created the following Hadoop Hive Script.

The script is attempting to store the results of a query into the following location:

LOCATION 'wasb://adfgetstarted@geogstoreacct.blob.core.windows.net/sampleout/';

However, I keep on getting the following error:

FAILED: ParseException line 9:0 Failed to recognize predicate 'ROW'. Failed rule: 'identifier' in table or column identifier
18/01/30 16:08:06 [main]: ERROR ql.Driver: FAILED: ParseException line 9:0 Failed to recognize predicate 'ROW'. Failed rule: 'identifier' in table or column identifier
org.apache.hadoop.hive.ql.parse.ParseException: line 9:0 Failed to recognize predicate 'ROW'. Failed rule: 'identifier' in table or column identifier

The Hive script is as follows:

DROP TABLE IF EXISTS geography;
CREATE EXTERNAL TABLE geography
(
 anonid INT,
 eprofileclass INT,
 fueltypes STRING,
 acorn_category INT,
 acorn_group STRING,
 acorn_type INT,
 nuts4 STRING,
 lacode STRING,
 nuts1 STRING,
 gspgroup STRING,
 ldz STRING,
 gas_elec STRING,
 gas_tout STRING
)
ROW FORMAT DELIMITED
 FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'wasb://adfgetstarted@geogstoreacct.blob.core.windows.net/sampleout/'
TBLPROPERTIES ("skip.header.line.count" = "1");

Create table acorn_category_frequency
 as
select acorn_category,
 count(*) as acorn_categorycount
from geography
group by acorn_category,
ROW FORMAT DELIMITED
 FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'wasb://adfgetstarted@geogstoreacct.blob.core.windows.net/sampleout/';


Can someone please help me figure out where I'm going wrong in the script?

Thanks
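
Two things stand out in the script: in Hive, the ROW FORMAT / STORED AS / LOCATION clauses of a CREATE TABLE ... AS SELECT have to appear before AS SELECT rather than after the query, and the trailing comma after GROUP BY acorn_category is also a parse problem. As an alternative, here is a minimal PySpark sketch that runs the same aggregate against the geography table and writes the result out as delimited text (the output directory is hypothetical, and wasb:// access assumes the Azure/HDInsight Hadoop connector is configured):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("acorn-frequency-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Same aggregate as the failing CTAS, expressed as a query only.
freq = spark.sql("""
    SELECT acorn_category, COUNT(*) AS acorn_categorycount
    FROM   geography
    GROUP  BY acorn_category
""")

# Write to a directory of its own; writing back into the source table's
# LOCATION would clobber the input files.
freq.write.mode("overwrite").csv(
    "wasb://adfgetstarted@geogstoreacct.blob.core.windows.net/acorn_frequency/",
    header=False,
)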
0
Hello Experts,

I have run an HQL script called samplehive.hql (see attached). However, the script fails with the following error:

FAILED: ParseException line 1:2 cannot recognize input near 'D' 'R' 'O'
18/01/17 20:46:46 [main]: ERROR ql.Driver: FAILED: ParseException line 1:2 cannot recognize input near 'D' 'R' 'O'
org.apache.hadoop.hive.ql.parse.ParseException: line 1:2 cannot recognize input near 'D' 'R' 'O'

I'm very new to Hadoop Hive. Can someone take a look at the script and let me know where I'm going wrong?

Thanks
samplehive.txt
0

I have a dataset with an account number and a "days past due" code for every observation. For every account number, as soon as the "days past due" column hits a code like "DLQ3", I want to remove the rest of the rows for that account (even if DLQ3 is the first observation for that account).

My dataset looks like this:

Observation date   Account num   Days past due
2016-09            200056        DLQ1
2016-09            200048        DLQ2
2016-09            389490        NORM
2016-09            383984        DLQ3 ...

So for account 383984, I want to remove all the rows after 2016-09, since the account is now in default.

In short, I want to find when an account first hits DLQ3 and, when it does, remove all rows after that first DLQ3 observation; a sketch follows below.
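
A minimal pandas sketch, assuming the rows are in time order within each account (column names follow the question's layout):

import pandas as pd

# Toy data; replace with the real dataset.
df = pd.DataFrame({
    "obs_date":      ["2016-08", "2016-09", "2016-10", "2016-09", "2016-10"],
    "account_num":   [383984, 383984, 383984, 200056, 200056],
    "days_past_due": ["DLQ2", "DLQ3", "NORM", "DLQ1", "NORM"],
})

df = df.sort_values(["account_num", "obs_date"])

is_dlq3 = df["days_past_due"].eq("DLQ3").astype(int)
# Number of DLQ3 observations seen strictly before the current row, per account.
dlq3_before = is_dlq3.groupby(df["account_num"]).cumsum() - is_dlq3

# Keep everything up to and including the first DLQ3 row for each account.
result = df[dlq3_before == 0]
print(result)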
0
My business is exploring the option of recoding our item codes, as they are currently all over the place. Ideally, going forward, we would like only one serial number to be generated for an item, with that serial number being the same as the item number.

Is this possible, and what impact would it have on the business?

Thanks
0
Hello,

I am new to Hadoop. I have a question regarding YARN memory allocation. If we have 16 GB of memory in the cluster, we can have at least three 4 GB containers and keep 4 GB for other uses. If a job needs 10 GB of RAM, would it use 3 containers, or would it use one container and start using the rest of the RAM?
0
Hello guys,

We would like to keep our Hadoop prod, dev, and QA environments on standard settings, with their configurations kept in sync. We have 100+ data nodes in prod and only 8 nodes each in dev and QA.

We need to make sure all of them stay in sync. What is the best practice for keeping them the same?
0
Hi,

I am curious whether someone knows the best way to set alerts based on certain keywords in financial filings such as 8-K, 10-K, etc. For example, I want to create an alert so that when the following filing appears on the website and contains a keyword like "PSU", I get notified: https://www.sec.gov/Archives/edgar/data/1115128/000156459017019148/quot-8k_20170928.htm

Thanks
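
A minimal sketch of a keyword check, assuming it runs on a schedule (cron or similar) and uses the filing URL from the question; a production version should respect SEC rate limits and would more likely watch EDGAR's RSS or full-text search feeds than a single page:

import requests

FILING_URL = ("https://www.sec.gov/Archives/edgar/data/1115128/"
              "000156459017019148/quot-8k_20170928.htm")
KEYWORDS = {"PSU", "performance stock unit"}

# SEC.gov expects a descriptive User-Agent; the address here is a placeholder.
resp = requests.get(FILING_URL, headers={"User-Agent": "alert-bot example@example.com"})
text = resp.text.lower()

hits = [kw for kw in KEYWORDS if kw.lower() in text]
if hits:
    # Replace print with an email / Slack / SNS notification in a real job.
    print(f"Keywords found in filing: {hits}")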
0
Hello,

When we create data nodes, should the disks be local disks or SAN disks? Most recommendations are for local disks. Why do we need to use local disks?
0
