Hadoop

Apache™ Hadoop® is an open-source framework that allows large data sets to be processed in a distributed way across clusters of commodity computers.


We use SQL Server 2017 as our main RDBMS data store.

We have a vehicle-tracking app with one massive log table (all positions for all vehicles) and a lot of other small tables (car number plates, users, etc.); pretty much every query involves a join to this main log table.
We are just getting going, and the main table is already over a billion rows and over 1 TB, with a relatively small number of cars compared to the growth planned for the next year.
I know this is small, but we estimate we will grow between 10x and 100x in the next year, which has brought up some long-term architecture questions.

Currently we use SQL Server, which runs nicely; we will also be adding Redis to reduce the load on the DB.
Currently the DB grows by 10 GB every day.
Considering we may be growing by at least 100 GB a day, over time this puts us into the big-data category, and I need to start looking at some options for scaling.
Every record inserted into the log table gets an ID, and we use this ID for logging events, alerts, etc. for fast performance (this works well at the current size).

Currently my biggest headache is servers with enough disk storage: I can find server options up to 10 TB quite easily, but beyond that the options are limited and prices skyrocket.
Pricing is a massive issue; we are not Google/Microsoft and cannot afford huge server costs.
Performance is also a huge issue; for example, people expect to run reports and expect very short loading times.
We were also planning to move …
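
For illustration, one standard lever for a log table of this shape is SQL Server table partitioning by date, which lets individual partitions sit on separate filegroups/disks and keeps maintenance manageable. A minimal sketch; the table, column, filegroup, and boundary values below are made up:

-- Minimal sketch: monthly partitioning of the position log (all names/values illustrative).
CREATE PARTITION FUNCTION pf_PositionsByMonth (datetime2)
AS RANGE RIGHT FOR VALUES ('2019-01-01', '2019-02-01', '2019-03-01');

CREATE PARTITION SCHEME ps_PositionsByMonth
AS PARTITION pf_PositionsByMonth ALL TO ([PRIMARY]);   -- map to separate filegroups/disks in practice

CREATE TABLE dbo.VehiclePositionLog
(
    LogId      bigint IDENTITY(1,1) NOT NULL,
    VehicleId  int           NOT NULL,
    PositionTs datetime2     NOT NULL,
    Latitude   decimal(9,6)  NOT NULL,
    Longitude  decimal(9,6)  NOT NULL,
    CONSTRAINT PK_VehiclePositionLog PRIMARY KEY CLUSTERED (PositionTs, LogId)
)
ON ps_PositionsByMonth (PositionTs);

Queries that filter on PositionTs then touch only the relevant partitions, and older partitions can be switched out or moved to cheaper storage.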

Hi,

Could someone please help me write a recursive query in Hive (or in PySpark)?



CREATE TABLE VSOLD_TESTCALC AS (
     WITH
        RECURSIVE TEST
        (
          PURCHASENUMBER
          , EffDt
          , NetSOLD
          , NetPurchase --Nate
          , PostSeqNum
          , TransCode
          , Amt
          , TransDescription
          , rnum
          , t_flg
          , SpecPeriodFlg
          , MaxSOLDLimit
          )

         AS
        ( SELECT
           PURCHASENUMBER
         , StartDt                                 AS EffDt
         , CAST(SOLD - sell AS DECIMAL(18,4))      AS NetSOLD
         , CAST(Purchase - sell AS DECIMAL(18,4))  AS NetPurchase
         , renewal_seq                             AS PostSeqNum
         , CAST(0 AS INTEGER)                      AS TransCode
         , CAST(0.0 AS DECIMAL(18,2))              AS Amt
         , CAST('INIT_POSTITION' AS VARCHAR(400))  AS TransDescription
         , rnum
         , t_flg
         , SpecPeriodFlg
         , MaxSOLDLimit
        FROM
           TESTDB.SOLD_CurveInitPoint

        UNION ALL

        SELECT
           b.PURCHASENUMBER
         , b.EffDt
         , b.NetSOLD
         , b.NetPurchase
         , b.PostSeqNum
         , b.TransCode
         , b.Amt
         , b.TransDescription
         , b.rnum
         , b.t_flg
         , b.SpecPeriodFlg
         , b.MaxSOLDLimit
        FROM
           TEST a
         , TESTDB.SOLD_Transactions_TESTCALC b
        WHERE
           b.rnum = a.rnum + 1
        )
     SELECT * FROM TEST
);
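
Since Hive has no RECURSIVE CTE, one way to get the same result is to emulate the recursion in PySpark with a loop that keeps joining on rnum = rnum + 1 until nothing new comes back. A rough sketch along those lines (it assumes the transactions table exposes the same column names as the CTE and that the Spark session has Hive support enabled):

# Rough sketch: emulate the recursive CTE by iterating until no new rows appear.
# Assumes TESTDB.SOLD_Transactions_TESTCALC carries the same columns as the CTE.
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("recursive-cte-emulation")
         .enableHiveSupport()
         .getOrCreate())

# Anchor member: same projection as the seed SELECT above.
seed = spark.table("TESTDB.SOLD_CurveInitPoint").select(
    F.col("PURCHASENUMBER"),
    F.col("StartDt").alias("EffDt"),
    (F.col("SOLD") - F.col("sell")).cast("decimal(18,4)").alias("NetSOLD"),
    (F.col("Purchase") - F.col("sell")).cast("decimal(18,4)").alias("NetPurchase"),
    F.col("renewal_seq").alias("PostSeqNum"),
    F.lit(0).cast("int").alias("TransCode"),
    F.lit(0.0).cast("decimal(18,2)").alias("Amt"),
    F.lit("INIT_POSTITION").alias("TransDescription"),
    F.col("rnum"), F.col("t_flg"), F.col("SpecPeriodFlg"), F.col("MaxSOLDLimit"),
)

trans = spark.table("TESTDB.SOLD_Transactions_TESTCALC")
cols = seed.columns

result = seed
frontier = seed
while True:
    # Recursive member: rows whose rnum is one greater than a row found so far.
    nxt = (trans.alias("b")
                .join(frontier.alias("a"), F.col("b.rnum") == F.col("a.rnum") + 1, "inner")
                .select([F.col("b." + c) for c in cols]))
    if nxt.rdd.isEmpty():                 # no new rows -> the "recursion" is finished
        break
    nxt = nxt.localCheckpoint()           # keep the lineage from growing without bound
    result = result.unionByName(nxt)
    frontier = nxt

result.write.mode("overwrite").saveAsTable("VSOLD_TESTCALC")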
      

Hi,

Looking to convert the SQL (INSERT and UPDATE) statements below into Hive Query Language.

INSERT INTO HL_TEST.T_Bck
    SELECT a.PURCHASE_NUMBER
         , a.POST_DATE
         , a.EFF_DATE
         , a.SEQ
         , a.TR_CODE
         , a.TR_SIGN
         , a.IND
         , a.TRANSACTION_AMOUNT
         , a.isValidTransaction
         , CASE WHEN la.FINAL_DRAWDOWN_IND = 'Y' THEN 'Y' ELSE 'N' end AS isBDated 
      FROM backD_trans a   
      JOIN HL_TEST.LOANA la
        ON la.LoanNumber = a.PURCHASE_NUMBER
       AND a.EFF_DATE BETWEEN la.src_strt_dt AND la.src_end_dt
	 GROUP 
		BY a.PURCHASE_NUMBER
         , a.POST_DATE
         , a.EFF_DATE
         , a.SEQ
         , a.TR_CODE
         , a.TR_SIGN
		 , a.IND
         , a.TRANSACTION_AMOUNT
         , a.isValidTransaction
         , CASE WHEN la.FINAL_DRAWDOWN_IND = 'Y' THEN 'Y' ELSE 'N' end;

    UPDATE tgt
      FROM HL_TEST.Transactions tgt, HL_TEST.T_Bck t2
       SET EFF_DATE = t2.POST_DATE
     WHERE tgt.PURCHASE_NUMBER = t2.PURCHASE_NUMBER 
      AND tgt.POST_DATE = t2.POST_DATE
      AND tgt.EFF_DATE = t2.EFF_DATE
      AND tgt.TR_CODE = t2.TR_CODE
      AND tgt.SEQ = t2.SEQ
      AND tgt.TR_SIGN = t2.TR_SIGN 
      AND tgt.TRANSACTION_AMOUNT = t2.TRANSACTION_AMOUNT 
      AND tgt.IND = t2.IND
      AND t2.isBDated = 'Y';


Would appreciate it if someone could please help me run the above statements in Hive.

Thanks
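
A rough sketch of what this could look like in HiveQL: the INSERT ... SELECT carries over almost unchanged, while Teradata's UPDATE ... FROM has no direct HiveQL equivalent, so on a transactional (ACID, ORC) table it can be written as a MERGE (Hive 2.2+). Table and column names are taken from the statements above; the ACID/ORC requirement on HL_TEST.Transactions is an assumption.

-- INSERT ... SELECT is valid HiveQL almost as written.
-- (On Hive versions older than 2.2, move the BETWEEN predicate to a WHERE clause.)
INSERT INTO TABLE HL_TEST.T_Bck
SELECT a.PURCHASE_NUMBER
     , a.POST_DATE
     , a.EFF_DATE
     , a.SEQ
     , a.TR_CODE
     , a.TR_SIGN
     , a.IND
     , a.TRANSACTION_AMOUNT
     , a.isValidTransaction
     , CASE WHEN la.FINAL_DRAWDOWN_IND = 'Y' THEN 'Y' ELSE 'N' END AS isBDated
  FROM backD_trans a
  JOIN HL_TEST.LOANA la
    ON la.LoanNumber = a.PURCHASE_NUMBER
   AND a.EFF_DATE BETWEEN la.src_strt_dt AND la.src_end_dt
 GROUP BY a.PURCHASE_NUMBER, a.POST_DATE, a.EFF_DATE, a.SEQ, a.TR_CODE, a.TR_SIGN,
          a.IND, a.TRANSACTION_AMOUNT, a.isValidTransaction,
          CASE WHEN la.FINAL_DRAWDOWN_IND = 'Y' THEN 'Y' ELSE 'N' END;

-- The UPDATE becomes a MERGE on an ACID table; MERGE allows at most one
-- matching source row per target row.
MERGE INTO HL_TEST.Transactions AS tgt
USING HL_TEST.T_Bck AS t2
   ON tgt.PURCHASE_NUMBER    = t2.PURCHASE_NUMBER
  AND tgt.POST_DATE          = t2.POST_DATE
  AND tgt.EFF_DATE           = t2.EFF_DATE
  AND tgt.TR_CODE            = t2.TR_CODE
  AND tgt.SEQ                = t2.SEQ
  AND tgt.TR_SIGN            = t2.TR_SIGN
  AND tgt.TRANSACTION_AMOUNT = t2.TRANSACTION_AMOUNT
  AND tgt.IND                = t2.IND
WHEN MATCHED AND t2.isBDated = 'Y' THEN UPDATE SET EFF_DATE = t2.POST_DATE;
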
Hello,

I need to develop Python code using cm_client. We have 15 clusters on different Cloudera Managers, and I have to develop one script that collects information from all of the different Cloudera Managers. How is this possible? Please let me know.

At least give me the syntax for how to connect to one Cloudera Manager after another.

Thanks
Shaiukh
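
A minimal sketch of that loop with cm_client is below; the host names, credentials, port, and API version are placeholders, not values from your environment.

# Minimal sketch: collect cluster info from several Cloudera Manager instances.
# Hosts, credentials, port and API version below are placeholders.
import cm_client

CM_HOSTS = ["cm1.example.com", "cm2.example.com"]   # one entry per Cloudera Manager
API_VERSION = "v19"

def read_clusters(host):
    cm_client.configuration.username = "admin"      # better: read from a config file or vault
    cm_client.configuration.password = "admin"
    api_url = "http://{}:7180/api/{}".format(host, API_VERSION)
    api_client = cm_client.ApiClient(api_url)
    clusters_api = cm_client.ClustersResourceApi(api_client)
    return clusters_api.read_clusters(view="SUMMARY").items

for host in CM_HOSTS:
    for cluster in read_clusters(host):
        print(host, cluster.name, cluster.full_version)
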
Hadoop Sqoop export to MS-SQL database; driver placement and configuration

Research:
https://community.hortonworks.com/questions/1941/sqoop-connector-for-microsoft-sql-server.html

You will need to copy it into /usr/hdp/current/sqoop-client/lib/

I am new to HDFS. Where and how can I navigate to the above directory?
How do I log in to have access to the above directory?
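
For what it's worth, that path is on the local Linux filesystem of the node where the Sqoop client is installed, not inside HDFS, so it is reached over SSH rather than through hadoop fs. A rough sketch (the host name and jar file name are placeholders):

# SSH to the edge/client node where the Sqoop client is installed (ask your admin which host).
ssh your_user@edge-node.example.com

# This is a normal local directory, so plain cd/ls work; no hadoop fs commands needed.
cd /usr/hdp/current/sqoop-client/lib/
ls -l

# Copy the SQL Server JDBC driver jar into it (root/sudo is usually required).
sudo cp /tmp/mssql-jdbc-7.2.2.jre8.jar /usr/hdp/current/sqoop-client/lib/
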
Why do companies purchase a product known as a Complex Event Processor (TIBCO StreamBase CEP, IBM InfoSphere CEP) or download an open-source one (Siddhi, Esper)?

I understand why companies use real-time analytics in general to make sense of real-time data streams, but I don't understand why CEP.

"Complex event processing (CEP) uses patterns to detect composite events in streams of tuples."
CEP also joins many streams and finds patterns among the whole.

But I don't get why one would use CEP and not Spark. Is there a use case you can explain this with?
Hi Expert,

Could anybody please guide me on how to load data from an Oracle DB into Hadoop HDFS, and then load the results back into the Oracle DB?

Thanks in Advance!
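
Apache Sqoop is the usual tool for moving data between a relational database and HDFS. A sketch of both directions is below; the connection string, credentials, table names, and paths are placeholders, and the Oracle JDBC driver jar has to be on Sqoop's classpath.

# Pull an Oracle table into HDFS.
sqoop import \
  --connect jdbc:oracle:thin:@//oracle-host:1521/ORCLPDB1 \
  --username scott --password-file /user/scott/.oracle_pwd \
  --table SOURCE_TABLE \
  --target-dir /data/source_table \
  --num-mappers 4

# ... process the data in Hive/Spark, writing the result to /data/result_table ...

# Push the result back into an existing Oracle table.
sqoop export \
  --connect jdbc:oracle:thin:@//oracle-host:1521/ORCLPDB1 \
  --username scott --password-file /user/scott/.oracle_pwd \
  --table RESULT_TABLE \
  --export-dir /data/result_table \
  --input-fields-terminated-by ','
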
Issue with a high number of TCP CLOSE_WAIT socket connections on Hortonworks (HDP 2.6.4) NameNodes and the Metastore server.
We frequently see a very high number of CLOSE_WAIT socket connections on the Hadoop servers, and as a result Hadoop services become unavailable on the NameNode servers. This happens after heavy ingestion of data into the cluster, and I end up having to restart the cluster after rebooting the affected servers.
I tried resetting the values of several TCP attributes on the servers, but this has not solved the problem.
Using lsof | grep CLOSE_WAIT I can identify the processes that hold CLOSE_WAIT socket connections; I killed those processes and tried to restart the Hadoop services, but this also did not solve the problem.
I have been monitoring the servers for the number of CLOSE_WAIT socket connections, and whenever that number keeps rising it is a symptom that the Hadoop services on the NameNode are going to go down within a couple of minutes.
Any ideas to solve this issue are welcome.
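
For reference, a quick way to watch the counts and see which processes are holding the sockets (assumes ss and lsof are installed):

# Total CLOSE_WAIT sockets on the host (header line stripped).
ss -tan state close-wait | tail -n +2 | wc -l

# CLOSE_WAIT sockets grouped by owning process (command name and PID).
lsof -nP -iTCP -sTCP:CLOSE_WAIT | awk 'NR > 1 {print $1, $2}' | sort | uniq -c | sort -rn | head
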
So I have a user schema like this:

var user_schema = new Schema({
   username:{type:String,required: true, unique : true, trim: true},
   college:{type:String,required: true},
   password:{type:String,required: true, trim: true},
   email:{type:String,required: true, unique : true, trim: true},
   phone:{type:Number,required: true, unique : true, trim: true},
   dp:{type:String},
   tag:{type:String},
   description:{type:String},
   friends:[{type:String}],
   pending:[{type:String}],
   skills:{type:String},
   bucket:[{type:String}]
  });


and my objective is to search all the documents in the collection to find people based on the following conditions:

1. They should not be in the users' "friends" array.
2. They should not be in the users' "pending" array.
3. They should have the same "tag" (a string value) as the user.

So basically I have to compare the user's fields ("friends", "pending" and "tag") with the fields of all documents in the whole collection.

How do I do it using Mongoose (the Node.js MongoDB library)?
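
A minimal sketch of one way to write that query with Mongoose, assuming friends/pending store usernames, `User` is the model compiled from the schema above, and `me` is the current user's already-loaded document:

// Sketch: people with the same tag who are not the user, not a friend, and not pending.
// Assumes: var User = mongoose.model('User', user_schema);
function findSuggestions(me, callback) {
  User.find({
    tag: me.tag,                                                   // 3. same tag as the user
    username: { $nin: me.friends.concat(me.pending, me.username) } // 1 & 2. not a friend, not pending (and not me)
  }, callback);
}

// Usage:
// findSuggestions(me, function (err, people) { console.log(err || people); });
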
My requirement is to store multiple data types in the same column of a Hive table, and also to be able to read that data back, e.g. if one record has an array value for that column and another record has a struct or string value, I should be able to fetch the value accordingly.

I could store the JSON data as Avro in a Hive table with a data type of string for that particular column (since spark/sqlContext inferred the data type as string for the multiple data types in the same column), but I am not able to operate on that data; I can only read it with a simple select columnname from table.

I have tried to use uniontype to load that column's data (string) into another table with uniontype as the data type for that column, but it errored out saying there is a mismatch:

Error while compiling statement: FAILED: SemanticException [Error 10044]: line 18:36
Cannot insert into target table because column number/types are different 'test_uniontyp':
Cannot convert column 1 from string to uniontype<array<string>, string,struct<abd:array<struct<ax:string,bx:string>>>>
.


Any suggestions?
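
As an illustration, Hive will not implicitly convert a string into a uniontype; the union value has to be built explicitly, for example with the create_union(tag, val0, val1, ...) UDF. A simplified sketch with a two-branch union and made-up table/column names (note that uniontype support in Hive is still fairly limited beyond reading the values back):

-- Simplified sketch: payload can hold either an ARRAY<STRING> (tag 0) or a STRING (tag 1).
CREATE TABLE test_uniontyp (
  id      STRING,
  payload UNIONTYPE<ARRAY<STRING>, STRING>
);

-- Rows whose source value is a plain string use the STRING branch (tag 1) ...
INSERT INTO TABLE test_uniontyp
SELECT id, create_union(1, array(json_col), json_col)
FROM   source_table
WHERE  json_col NOT LIKE '[%';

-- ... and rows whose source value looks like an array use the ARRAY<STRING> branch (tag 0).
INSERT INTO TABLE test_uniontyp
SELECT id, create_union(0, split(json_col, ','), json_col)
FROM   source_table
WHERE  json_col LIKE '[%';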

Hello Experts,

I have created the following Hadoop Hive Script.

The script is attempting to store the results of a query into the following location:

LOCATION 'wasb://adfgetstarted@geogstoreacct.blob.core.windows.net/sampleout/';

However, I keep on getting the following error:

FAILED: ParseException line 9:0 Failed to recognize predicate 'ROW'. Failed rule: 'identifier' in table or column identifier
18/01/30 16:08:06 [main]: ERROR ql.Driver: FAILED: ParseException line 9:0 Failed to recognize predicate 'ROW'. Failed rule: 'identifier' in table or column identifier
org.apache.hadoop.hive.ql.parse.ParseException: line 9:0 Failed to recognize predicate 'ROW'. Failed rule: 'identifier' in table or column identifier



The Hive script is as follows:

DROP TABLE IF EXISTS geography;
CREATE EXTERNAL TABLE geography
(
 anonid INT,
 eprofileclass INT,
 fueltypes STRING,
 acorn_category INT,
 acorn_group STRING,
 acorn_type INT,
 nuts4 STRING,
 lacode STRING,
 nuts1 STRING,
 gspgroup STRING,
 ldz STRING,
 gas_elec STRING,
 gas_tout STRING
)
ROW FORMAT DELIMITED
 FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'wasb://adfgetstarted@geogstoreacct.blob.core.windows.net/sampleout/'
TBLPROPERTIES ("skip.header.line.count" = "1");

Create table acorn_category_frequency
 as
select acorn_category,
 count(*) as acorn_categorycount
from geography
group by acorn_category,
ROW FORMAT DELIMITED
 FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'wasb://adfgetstarted@geogstoreacct.blob.core.windows.net/sampleout/';



Can someone please help figure out where I'm going wrong in the script?

Thanks
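
For what it's worth, the first CREATE EXTERNAL TABLE parses fine; the ParseException comes from the second statement. In a CREATE TABLE ... AS SELECT the ROW FORMAT/STORED AS clauses have to come before AS (they cannot follow the query), and there is a trailing comma after GROUP BY acorn_category. A possible rewrite of that statement, keeping the names from the script above (LOCATION is dropped because CTAS builds a managed table):

CREATE TABLE acorn_category_frequency
ROW FORMAT DELIMITED
 FIELDS TERMINATED BY ','
STORED AS TEXTFILE
AS
SELECT acorn_category,
       count(*) AS acorn_categorycount
FROM   geography
GROUP BY acorn_category;
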
Hello Experts,

I have created the following Hadoop Hive HQL script, however, I keep on getting the following error

FAILED: ParseException line 21:44 missing EOF at ',' near ')'
18/01/29 21:37:15 [main]: ERROR ql.Driver: FAILED: ParseException line 21:44 missing EOF at ',' near ')'



The script is as follows:
DROP TABLE IF EXISTS HiveSampleIn;
CREATE EXTERNAL TABLE HiveSampleIn
(
 anonid INT,
 eprofileclass INT,
 fueltypes STRING,
 acorn_category INT,
 acorn_group STRING,
 acorn_type INT,
 nuts4 STRING,
 lacode STRING,
 nuts1 STRING,
 gspgroup STRING,
 ldz STRING,
 gas_elec STRING,
 gas_tout STRING
)
ROW FORMAT DELIMITED
 FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'wasb://adfgetstarted@geogstoreacct.blob.core.windows.net/samplein/';
tblproperties ("skip.header.line.count"="2");

DROP TABLE IF EXISTS HiveSampleOut; 
CREATE EXTERNAL TABLE HiveSampleOut 
(    
    acorn_category int
    
) 
ROW FORMAT DELIMITED
 FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'wasb://adfgetstarted@geogstoreacct.blob.core.windows.net/sampleout/';

INSERT OVERWRITE TABLE HiveSampleOut
Select 
   acorn_category
   
FROM HiveSampleIn Group by acorn_category;



Any help in fixing this problem will be greatly appreciated.
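
For what it's worth, the likely culprit is the semicolon after the LOCATION clause in the first statement: it terminates the CREATE TABLE, leaving tblproperties (...) dangling as a statement of its own. Keeping TBLPROPERTIES attached to the CREATE should clear the ParseException:

CREATE EXTERNAL TABLE HiveSampleIn
(
 anonid INT,
 eprofileclass INT,
 fueltypes STRING,
 acorn_category INT,
 acorn_group STRING,
 acorn_type INT,
 nuts4 STRING,
 lacode STRING,
 nuts1 STRING,
 gspgroup STRING,
 ldz STRING,
 gas_elec STRING,
 gas_tout STRING
)
ROW FORMAT DELIMITED
 FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'wasb://adfgetstarted@geogstoreacct.blob.core.windows.net/samplein/'
TBLPROPERTIES ("skip.header.line.count" = "2");
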
Hello Community,

The Hive script I have created keeps throwing the following error:

Time taken: 2.634 seconds
FAILED: ParseException line 17:2 missing EOF at 'COLUMN' near ')'
18/01/29 10:29:53 [main]: ERROR ql.Driver: FAILED: ParseException line 17:2 missing EOF at 'COLUMN' near ')'
org.apache.hadoop.hive.ql.parse.ParseException: line 17:2 missing EOF at 'COLUMN' near ')'



Can someone please take a look at the Hive script and let me know where I might be going wrong?

DROP TABLE IF EXISTS HiveSampleIn; 
CREATE EXTERNAL TABLE HiveSampleIn 
(
 anonid int,
 eprofileclass int,
 fueltypes STRING,
 acorn_category int,
 acorn_group STRING,
 acorn_type int,
 nuts4 STRING,
 lacode STRING,
 nuts1 STRING,
 gspgroup STRING,
 ldz STRING,
 gas_elec STRING,
 gas_tout STRING
) COLUMN FORMAT DELIMITED FIELDS TERMINATED BY (',') LINES TERMINATED BY '10' STORED AS TEXTFILE LOCATION 'wasb://adfgetstarted@geogstoreacct.blob.core.windows.net/samplein/'; 

DROP TABLE IF EXISTS HiveSampleOut; 
CREATE EXTERNAL TABLE HiveSampleOut 
(    
    acorn_category int
    
) COLUMN FORMAT DELIMITED FIELDS TERMINATED BY (',') LINES TERMINATED BY '10' STORED AS TEXTFILE LOCATION 'wasb://adfgetstarted@geogstoreacct.blob.core.windows.net/sampleout/';

INSERT OVERWRITE TABLE HiveSampleOut
Select 
   acorn_category,
   count(*) as acorn_categorycount 

FROM HiveSampleIn Group by acorn_category


Cheers

Carlton
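
For what it's worth, 'COLUMN FORMAT' is not Hive syntax (it has to be ROW FORMAT DELIMITED), the field delimiter is a plain quoted string with no parentheses, and LINES TERMINATED BY currently only accepts '\n', so it is simplest to drop that clause. A corrected sketch of the output table is below; the input table needs the same change, and note that the final INSERT selects two columns into this one-column table.

CREATE EXTERNAL TABLE HiveSampleOut
(
    acorn_category int
)
ROW FORMAT DELIMITED
 FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'wasb://adfgetstarted@geogstoreacct.blob.core.windows.net/sampleout/';
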
Hello Experts,

I have run an HQL script called samplehive.hql (see attached). However, the script fails with the following error:

FAILED: ParseException line 1:2 cannot recognize input near 'D' 'R' 'O'
18/01/17 20:46:46 [main]: ERROR ql.Driver: FAILED: ParseException line 1:2 cannot recognize input near 'D' 'R' 'O'
org.apache.hadoop.hive.ql.parse.ParseException: line 1:2 cannot recognize input near 'D' 'R' 'O'

I'm very new to Hadoop Hive; can someone take a look at the script and let me know where I'm going wrong?

Thanks
samplehive.txt
Hello,

I am new to Hadoop. I have a question regarding YARN memory allocation. If we have 16 GB of memory in the cluster, we can have at least three 4 GB containers and keep 4 GB for other uses. If a job needs 10 GB of RAM, would it use three containers, or would it use one container and start using the rest of the RAM?
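
For reference, YARN will never grant a single container larger than the configured per-container maximum, so a 10 GB request either has to fit in one container under that ceiling or the application itself has to ask for several smaller containers; YARN does not silently split one request across containers. The relevant knobs live in yarn-site.xml; an illustrative fragment (the values are examples, not recommendations):

<!-- Illustrative yarn-site.xml fragment -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>12288</value>  <!-- memory YARN may hand out on this node: 16 GB host minus ~4 GB reserved -->
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1024</value>   <!-- smallest container the scheduler will allocate -->
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>4096</value>   <!-- largest single container; a 10 GB request cannot be satisfied by one container here -->
</property>
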
Hello,
I am new to Hadoop. When you configure Hive Server and YARN, can we pick any node, or do we need a special node for them? Or can we use the NameNode?
Hello Guys,

We would like to keep the Hadoop prod, dev and QA environments on standard settings, with their configurations kept in sync. Since we have 100+ data nodes in prod and only 8 nodes each in dev and QA, we need to make sure all of them stay in sync. What is the best practice to keep them the same?
Hi,

To process a mainframe file whose data is all string except one field which is 2-byte binary, where the 2 bytes need to be ignored, what input format should be used?
I tried the Text input format, but the 2 bytes sometimes appear as a single character, two characters, or nothing in Windows and HDFS.
Any suggestions in this regard, please.
Hello,

When we create datanodes, do we need to use local disks or SAN disks? Most recommendations are for local disks. Why do we need to have local disks?
I have a large number of PDF documents from which I need to extract text; the extracted text is used for further processing. I did this for a small subset of documents using the Tesseract API in a linear approach, and I get the required output. However, this takes a very long time when I have a large number of documents.

I tried to use the Hadoop environment's processing capabilities (MapReduce) and storage (HDFS) to solve this issue. However, I am facing a problem implementing the Tesseract API in the Hadoop (MapReduce) approach. As Tesseract converts the files into intermediate image files, I am confused as to how the intermediate image files of the Tesseract API process can be handled inside HDFS.

I have searched for and unsuccessfully tried a few options earlier, such as:

    I extracted text from PDFs by extending the FileInputFormat class into my own PdfInputFormat class using Hadoop MapReduce; for this I used Apache PDFBox to extract text from the PDFs, but when it comes to scanned PDFs, which contain images, this solution does not give me the required results.

    I found a few answers on the same topic suggesting to use Fuse, or that one should generate the image files locally and then upload them into HDFS for further processing. I am not sure if this is the correct approach.

I would like to know approaches around this.
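
One approach that avoids writing Tesseract's intermediate images into HDFS at all is to do the OCR per file inside a Spark job, keeping the page images in memory on each worker. A rough sketch, assuming tesseract and poppler plus the pdf2image and pytesseract packages are installed on every node (paths are placeholders):

# Rough sketch: distributed OCR of PDFs stored in HDFS, keeping page images in memory.
from pyspark.sql import SparkSession
from pdf2image import convert_from_bytes
import pytesseract

spark = SparkSession.builder.appName("pdf-ocr").getOrCreate()
sc = spark.sparkContext

def ocr_pdf(record):
    path, pdf_bytes = record
    pages = convert_from_bytes(pdf_bytes)                      # render pages as PIL images in memory
    text = "\n".join(pytesseract.image_to_string(page) for page in pages)
    return (path, text)

pdfs = sc.binaryFiles("hdfs:///data/pdfs")                     # (path, raw bytes) per PDF file
texts = pdfs.map(ocr_pdf)
texts.saveAsTextFile("hdfs:///data/pdf_text")                  # or convert to a DataFrame / Hive table
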
Hi, I am using Apache Nutch 2.3.1 and Apache Tika 1.14. I want to integrate Apache Tika with Apache Nutch and save the results into Solr 4.1.0.


Thanks in advance.