
Big Data

105 Solutions, 270 Contributors

Big data describes data sets that are so large or complex that traditional data processing applications are inadequate. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, querying and information privacy. The term often refers simply to the use of predictive analytics or certain other advanced methods to extract value from data, and seldom to a particular size of data set.


Kylin not starting

Here is the error I'm getting:


kafka dependency is /opt/apache-kylin-2.2.0-bin/lib/kafka-clients-1.0.0.jar
Retrieving Spark dependency...
Error: Could not find or load main class exists
ERROR: Unknown error. Please check full log.

I have a public API endpoint from which I pull a JSON file every 30 minutes. Right now I use a Python pandas DataFrame to pull the file and upload it to a Cloud Storage bucket, then send it to Pub/Sub to be processed and loaded into BigQuery. The problem is that the file name stays the same, so even though I have the GCS text stream feeding Pub/Sub, once it has read the file it never reads it again, even though the file's attributes have changed. My question: can anyone help me with code that will pull from an API endpoint and stream the data directly to Pub/Sub?

Sample code below:
import json
import pandas as pd
from sodapy import Socrata
from io import StringIO
import datalab.storage as gcs
from google.oauth2 import service_account

client = Socrata("sample.org", None)
results = client.get("xxx")

# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results, columns =['segmentid','street','_direction','_fromst','_tost','_length','_strheading','_comments','start_lon','_lif_lat','lit_lon','_lit_lat','_traffic','_last_updt'])
# send results to GCP
gcs.Bucket('test-temp').item('data.json').write_to(results_df.to_json(orient='records', lines=True),'text/json')
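
A minimal sketch of the direct-to-Pub/Sub approach, assuming the google-cloud-pubsub client library is installed and using hypothetical project and topic names (my-project, api-stream); the Socrata call mirrors the sample above:

import json
from google.cloud import pubsub_v1
from sodapy import Socrata

# Hypothetical names -- replace with your real project and topic
PROJECT_ID = "my-project"
TOPIC_ID = "api-stream"

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)

# Pull the latest records straight from the API (same endpoint as the sample above)
client = Socrata("sample.org", None)
results = client.get("xxx")

# Publish each record as its own message; Pub/Sub payloads must be bytes
for record in results:
    data = json.dumps(record).encode("utf-8")
    future = publisher.publish(topic_path, data=data)
    future.result()  # block until the message is accepted

Publishing record by record takes the GCS file (and its unchanging name) out of the pipeline entirely; the existing Pub/Sub subscription can keep writing to BigQuery unchanged, and the script can run on the same 30-minute schedule as the current job.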

Hi, I would like to know which open-source database would be best for blockchain.

Hi, I would like to compare any two text files, which may be somewhat large in size, using Spark DataFrames. A sample requirement is shared below.

Ideally, take the first record from file1, search the whole of file2, bring back all matched occurrences, and export them to an output file with the proper flag (update/delete/insert/same) using PySpark; all other records from file1 should then follow the same approach. A PySpark sketch follows the sample data below.

DataSet1 - (file1.txt)
NO  DEPT NAME   SAL
1   IT  RAM     1000    
2   IT  SRI     600
3   HR  GOPI    1500    
5   HW  MAHI    700

DataSet2 - (file2.txt)
NO  DEPT NAME   SAL
1   IT   RAM    1000    
2   IT   SRI    900
4   MT   SUMP   1200    
5   HW   MAHI   700

Output Dataset - (outputfile.txt)
NO  DEPT NAME   SAL FLAG
1   IT  RAM     1000    S
2   IT  SRI     900     U
4   MT  SUMP    1200    I
5   HW  MAHI    700     S
3   HR  GOPI    1500    D
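
A minimal PySpark sketch of this compare, assuming the two datasets are available as comma-separated files with a header row and that NO is the record key (adjust paths and columns for the real data):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("file-compare").getOrCreate()

# Assumed CSV versions of the two datasets with header NO,DEPT,NAME,SAL
df1 = spark.read.csv("file1.csv", header=True, inferSchema=True)
df2 = spark.read.csv("file2.csv", header=True, inferSchema=True)

key = "NO"
cols = [c for c in df1.columns if c != key]

# Suffix the non-key columns so both sides survive the join
old = df1.select(key, *[F.col(c).alias(c + "_old") for c in cols])
new = df2.select(key, *[F.col(c).alias(c + "_new") for c in cols])
joined = old.join(new, on=key, how="full_outer")

# Any non-key column differing between the two sides means an update
changed = F.lit(False)
for c in cols:
    changed = changed | (F.col(c + "_old") != F.col(c + "_new"))

# I = only in file2, D = only in file1, U = changed, S = same
flagged = joined.withColumn(
    "FLAG",
    F.when(F.col(cols[0] + "_old").isNull(), "I")
     .when(F.col(cols[0] + "_new").isNull(), "D")
     .when(changed, "U")
     .otherwise("S"),
)

# Keep the newer values where they exist, otherwise fall back to the old ones
result = flagged.select(
    key,
    *[F.coalesce(F.col(c + "_new"), F.col(c + "_old")).alias(c) for c in cols],
    "FLAG",
)
result.coalesce(1).write.csv("outputfile", header=True, mode="overwrite")

The full outer join on NO returns every record from both files exactly once, which matches the sample output above (including the deleted record 3 and the inserted record 4).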

Thanks

Hive: how can I search all tables in a database for those that contain a given column name?
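
A minimal sketch of one way to do this from Python, assuming HiveServer2 is reachable and the PyHive client is installed; the host, database, and column name below are placeholders:

from pyhive import hive

TARGET_COLUMN = "customer_id"  # hypothetical column name to look for

conn = hive.connect(host="localhost", port=10000, database="default")
cursor = conn.cursor()

cursor.execute("SHOW TABLES")
tables = [row[0] for row in cursor.fetchall()]

matches = []
for table in tables:
    cursor.execute("DESCRIBE {}".format(table))
    # DESCRIBE returns (col_name, data_type, comment) rows
    columns = [row[0].strip().lower() for row in cursor.fetchall()]
    if TARGET_COLUMN.lower() in columns:
        matches.append(table)

print(matches)

If you have access to the metastore's backing database instead, joining its TBLS and COLUMNS_V2 tables answers the same question in a single SQL query.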

Hi all,
I installed the BDE plugin in vSphere 6.5 with default settings. When I try to create a cluster, I get an error:
Failed to query IP address
In our lab environment, I installed and configured a DNS server and a DHCP server on the same host, and they work perfectly.
I expect the nodes to take their IPs and have DNS entries set automatically according to our configuration, but that doesn't happen.
My guess is that the nodes send their IP requests to DHCP before they have a hostname, so DHCP can't update the DNS entries.
How can I create a BDE cluster with hundreds of nodes automatically using DHCP and DNS?
Thanks in advance.
error-bde1.jpg

Hello, what type of storage architecture would be most useful for Kafka and ZooKeeper running in a containerized Kubernetes environment? Would GlusterFS, which manages storage persistence, replication, resiliency, and HA, be overkill when the distributed big data services already do this on their own? We're debating internally whether GlusterFS is needed for distributed big data systems like Cassandra, Mongo, Kafka, ZooKeeper, etc., and we're trying to keep the architecture simpler. Just wondering if anyone has thoughts or experience to share that would help us.

Big data encompasses the collection, storage, and analysis of massive stores of information. It’s helping users conduct research, create technological innovations, improve operational efficiency and drive organizations of all types toward their objectives. In conjunction with big data systems, enterprises are leveraging new technologies, such as advanced analytics, artificial intelligence (AI), the Internet of Things (IoT) and machine learning, to extract the value hidden in information. In the marketplace, these technologies are revolutionizing the way that enterprises conduct business. 


As big data implementation expands, a relatively new management technique called cross-functional integration allows firms to make operational adjustments in an information-driven market where time-sensitive revelations emerge quickly. The new management paradigm allows various business units to collaborate efficiently and make better decisions. As big data implementations grow increasingly prominent among the world’s enterprises, more business leaders are implementing cross-functional organizational structures to make the most of the insights provided by big data reporting. 


Big data will not, however, replace humans as strategic business advisors. Although there are astonishing new technologies that enterprise leaders can leverage to make informed decisions, there will always be a need for specialists who can interpret big data reports and determine what that information means for proprietors and organizations.


A Transformative and Disruptive Technology


Enterprises use big data systems to run informed marketing operations and better understand their clients and consumers. It's a powerful resource for improving the productivity and prosperity of businesses. As a result, enterprises are aggressively investing in digital marketing and new technology infrastructures.


Big data systems have allowed enterprises to make meaningful use of data that they’ve collected and stored for years, and as firms gain experience in deriving value from data, they will undoubtedly invest in more advanced architectures to extract increased value from their proprietary information.


Technology is changing how business leaders view the marketplace, and they are adjusting accordingly, rethinking management and organizational structures. The fast and powerful impact that big data has made across nearly all disciplines has left many leaders unprepared to develop effective strategies for implementing the technology. Even so, it is at the forefront of the thinking of nearly every executive seeking new ways to improve the performance of their organization.


New Ways to Steer the Ship With Data


Because of big data systems, there have been enormous improvements in important fields such as education, healthcare, finance, and marketing. Depending on the organizational mission and market share, enterprise leaders are investing in different technologies. For the most part, big data analysis is currently the primary driver of change, and most big data implementations have taken place in marketing and operational capacities.


However, as the big data market matures, enterprise leaders will discover new and powerful ways to leverage the technology. For instance, sentiment analysis is an emerging discipline within many fields and industries. Additionally, business leaders are making extensive use of predictive and structured data analysis to discover opportunities to improve operations and expand their clientele.


Automated video indexing is another relatively new and promising application for big data analytics systems. As more businesses and consumers create video content, big data will prove an invaluable resource for extracting meaningful information from video archives. Audio analytics and metadata content could further increase the value of this promising application.


In the retail sector, enterprises are using camera footage to analyze in-store foot traffic. The proprietors use the technology to analyze characteristics such as consumer movement patterns, checkout flow, and traffic volume. The businesses use this information to make improvements in areas such as product placement, promotional campaigns and store layouts. 


More advanced big data retail research involves the video analysis of group buying behavior. This kind of study allows retailers to gather information about shopper buying patterns that go unnoticed at the cash register. Using video analytics, retail enterprises uncover missed opportunities by making a detailed analysis of this kind of group activity.


Big data analysts are also making great strides in the analysis of data generated by social media users. This work brings together fields that appear unrelated at first glance, such as anthropology, computer science, economics, physics, psychology, and sociology.


Big data is transforming the landscape of nearly all industries - and entire economies. Enterprise leaders will use the technology and other advanced analysis resources, as well as the IoT, to transform their brands, develop powerfully effective strategies and gain a competitive advantage in the marketplace. As time goes on, the early adopters of these technologies will reap rewards that will directly, and positively, impact their bottom lines.



Hi, I'm looking for information on data center growth in Australia, in particular publications on industry trends covering value and growth in Australia.
Anything that discusses the isolation of Perth would also be helpful. Thanks

Hi, I'm building a big data streaming pipeline that takes streams from a camera through Kinesis to trigger a Lambda function. The Lambda function then uses AWS machine learning (Rekognition) to detect objects; the images are stored in S3 and their metadata in DynamoDB. My problem is that the first frame of the video is being stored in S3 and DynamoDB repeatedly (the same image is stored over and over). Here is the Lambda code (the main function):

def process_image(event, context):

    #Initialize clients
    rekog_client = boto3.client('rekognition')
    s3_client = boto3.client('s3')
    dynamodb = boto3.resource('dynamodb')

    s3_bucket = ...
    s3_key_frames_root = ...

    ddb_table = dynamodb.Table(...)
    rekog_max_labels = ...
    rekog_min_conf = float(...)
    label_watch_list = ...
    label_watch_min_conf = ...

    #Iterate on frames fetched from Kinesis
    for record in event['Records']:
        
        frame_package_b64 = record['kinesis']['data']
        frame_package = cPickle.loads(base64.b64decode(frame_package_b64))
        img_bytes.append(frame_package["ImageBytes"])
        frame_count = frame_package["FrameCount"]

        rekog_response = rekog_client.detect_labels(
            Image={
                'Bytes': img_bytes
            },
            MaxLabels=rekog_max_labels,
            MinConfidence=rekog_min_conf
        )

        #Iterate on rekognition labels. Enrich and prep them for storage in DynamoDB
        labels_on_watch_list = []
        

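One likely cause, judging from the snippet above, is that img_bytes is a list defined outside the handler that keeps growing across records and invocations, so the same first frame keeps being sent to Rekognition and stored. A minimal sketch of per-record handling, assuming each Kinesis record carries exactly one pickled frame package and using hypothetical default thresholds:

import base64
import pickle  # the original uses cPickle; pickle is the Python 3 equivalent

import boto3

rekog_client = boto3.client("rekognition")

def handle_records(event, max_labels=10, min_conf=50.0):
    for record in event["Records"]:
        frame_package_b64 = record["kinesis"]["data"]
        frame_package = pickle.loads(base64.b64decode(frame_package_b64))

        # Use only this record's bytes -- do not append frames to a shared list,
        # or every call keeps re-sending the first frame it ever saw.
        img_bytes = frame_package["ImageBytes"]
        frame_count = frame_package["FrameCount"]

        rekog_response = rekog_client.detect_labels(
            Image={"Bytes": img_bytes},  # Rekognition expects raw bytes, not a list
            MaxLabels=max_labels,
            MinConfidence=min_conf,
        )
        # ... enrich the labels and write this frame's S3 key and metadata here ...

If the repetition persists after that change, it is worth checking the producer side as well: if the camera script pushes the same pickled frame to the Kinesis stream every time, each invocation will naturally store the same image.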

How does big data affect operations in manufacturing?

Hi,

What support and features does DB2 offer for data scientists and big data?

Hi,

Has anyone used PolyBase on MS SQL for Hadoop? Is the scale-out feature of PolyBase working fine? Does load balancing work well?

Hi,

For a big data and data science solution on MS SQL, how many pieces of the puzzle do we need? I know MS SQL has R Services/Server included in the MS SQL installation, but I don't think one server alone can do the whole job. What is the full picture?

Does each piece have a separate cost?

Hi,

Does Oracle have a graph data model, so that an expensive operation like an inner join can be bypassed and handled through graph processing instead? MS SQL sometimes uses this approach for big data queries.

Also, is Oracle adding any extra technology, such as GPUs, to speed up big data queries?

Hi,

What is the Oracle component/application for big data analysis?

What is the Oracle product for AI and data science research?

When I look at this:

https://www.oracle.com/artificial-intelligence/platform.html

it says:
Additional libraries and tools include: Jupyter, pandas, scikit-learn, Pillow, OpenCV, and NumPy.
Deep learning frameworks include: TensorFlow, Keras and Caffe.
Elastic AI and Machine Learning Infrastructures include NVIDIA, Flash Storage, and Ethernet.

Which of these components are free and have no issues with Oracle DB?

Hi,

Are there any good references (e.g., URLs or books) on designing, debugging, and architecting a data lake for big data and data science?

I'm trying to crack my head around another key challenge at the moment: increasing the AI's accuracy at identifying demographic gender groups from 70% to 80%. Currently the accuracy of the information is at 70%.

I'm running a Data Management Platform (DMP) which collects demographic tag information, and this information is saved in Treasure Data.

I'm thinking of setting some rules in the AI to better differentiate gender, for example e-commerce data parameters that tell the AI that visitors who have viewed cosmetics have a higher chance of being female, while gadget viewers have a higher chance of being male.

Do you think these are good ideas/methods? If yes, why? If not, any better suggestions?

The data I have at the moment is not very large: an average database of 100,000 users. I'm not able to increase this number at the moment.

It would be great if anyone could also advise or share how best to approach the challenges above and help shed some light on them.

Many thanks in advance!
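
Rule-based signals like the cosmetics/gadgets example are usually more useful as extra input features for the model than as hard overrides, because the model can then learn how strong each signal really is on the labelled users. A minimal pandas sketch, assuming the Treasure Data export can be pulled into a DataFrame with hypothetical viewed_categories and gender columns:

import pandas as pd

# Hypothetical slice of the DMP export: one row per user
df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "viewed_categories": [["cosmetics", "shoes"], ["gadgets"], ["gadgets", "cosmetics"]],
    "gender": ["F", "M", None],  # labelled rows train the model; None rows get predicted
})

# Encode each rule as a binary feature instead of assigning gender directly
df["viewed_cosmetics"] = df["viewed_categories"].apply(lambda cats: int("cosmetics" in cats))
df["viewed_gadgets"] = df["viewed_categories"].apply(lambda cats: int("gadgets" in cats))

train = df[df["gender"].notna()]
print(train[["viewed_cosmetics", "viewed_gadgets", "gender"]])

Each candidate rule can then be judged on a held-out slice of the 100,000 users: keep the features that actually improve validation accuracy toward the 80% target and drop the ones that don't.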

I'm trying to crack my head around a key challenge at the moment: increasing the AI's accuracy at identifying demographic age groups to within plus or minus 5 years.

The current AI settings only allow categorization into age groups of 20-24, 25-29 ... 55-59, and 60 and above; not plus or minus.

I'm running a Data Management Platform (DMP) which collects demographic tag information, and this information is saved in Treasure Data.

I'm thinking of grouping ages into 10-year bands, for example 20-24 and 25-29 as one group, then feeding that to the AI to try to increase accuracy, instead of inserting the full raw data CSV for the AI to compute.

Do you think these are good ideas/methods? If yes, why? If not, any better suggestions?

The data I have at the moment is not very large: an average database of 100,000 users. I'm not able to increase this number at the moment.

It would be great if anyone could also advise or share how best to approach the challenges above and help shed some light on them.

Many thanks in advance!
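
A minimal pandas sketch of the coarser-bucket idea, assuming the raw export has an age column; the 10-year bins below are only an illustration:

import pandas as pd

# Hypothetical slice of the raw DMP export
df = pd.DataFrame({"user_id": [1, 2, 3, 4], "age": [23, 37, 58, 61]})

# Collapse the 5-year buckets into 10-year bins before feeding the AI
bins = [20, 30, 40, 50, 60, 120]
labels = ["20-29", "30-39", "40-49", "50-59", "60+"]
df["age_group"] = pd.cut(df["age"], bins=bins, labels=labels, right=False)

print(df)

Coarser bins tend to raise reported accuracy simply because there are fewer classes to confuse, so it is worth evaluating against the plus-or-minus-5-years goal (for example, mean absolute error on predicted age) rather than against the old per-bucket accuracy.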

Hi, I need to know whether it's possible to create a live data warehouse, so that data is transferred immediately whenever new records are added or existing data changes.

There is a partitioned Hive table (Table 1) with 3 columns and 1 partition; batch_date is the partition column.

I'm trying to execute: INSERT INTO TABLE PARTITION(batch_date='2018-02-22') select column 1, column 2, column 3 from Table 1 where column 1 = "ABC";

It returns zero records, and in HDFS it creates 3 empty files.

Can you please suggest a solution for preventing these small empty files from being created in HDFS?


Note: Before running the INSERT statements I have set the below hive properties.

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

I have some data I am trying to normalize/weight. I have 4 regions, the number of people in each region who are missing training certificates, and the number of people in each region. Originally I was going to divide the number of missing training certificates by the number of people for each region to normalize the data. However, the numbers look really small when I do that, like 100/40000. I don't really want to graph such small numbers, but I need some way to account for the number of people. Should I multiply by 100 and then just say this data is per 100 employees? Would that make sense?

Any other suggestions?
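
Multiplying the rate by 100 (or 1,000) is a standard way to present it; it only rescales the ratio, so the regions stay directly comparable. A quick sketch with made-up region counts:

# Hypothetical counts per region
regions = {
    "North": {"missing": 100, "people": 40000},
    "South": {"missing": 60, "people": 12000},
    "East": {"missing": 25, "people": 3000},
    "West": {"missing": 80, "people": 22000},
}

for name, counts in regions.items():
    rate = counts["missing"] / counts["people"]
    per_100 = rate * 100  # e.g. 100/40000 -> 0.25 missing certificates per 100 employees
    print(f"{name}: {per_100:.2f} per 100 employees")

If "per 100 employees" still produces awkwardly small numbers, "per 1,000 employees" keeps the same interpretation with a more readable scale.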

Hello Experts,

I have created the following Hadoop Hive Script.

The script is attempting to store the results of a query into the following location:

LOCATION 'wasb://adfgetstarted@geogstoreacct.blob.core.windows.net/sampleout/';

However, I keep on getting the following error:

FAILED: ParseException line 9:0 Failed to recognize predicate 'ROW'. Failed rule: 'identifier' in table or column identifier
18/01/30 16:08:06 [main]: ERROR ql.Driver: FAILED: ParseException line 9:0 Failed to recognize predicate 'ROW'. Failed rule: 'identifier' in table or column identifier
org.apache.hadoop.hive.ql.parse.ParseException: line 9:0 Failed to recognize predicate 'ROW'. Failed rule: 'identifier' in table or column identifier



The Hive script is as follows:

[code]DROP TABLE IF EXISTS geography;
CREATE EXTERNAL TABLE geography
(
 anonid INT,
 eprofileclass INT,
 fueltypes STRING,
 acorn_category INT,
 acorn_group STRING,
 acorn_type INT,
 nuts4 STRING,
 lacode STRING,
 nuts1 STRING,
 gspgroup STRING,
 ldz STRING,
 gas_elec STRING,
 gas_tout STRING
)
ROW FORMAT DELIMITED
 FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'wasb://adfgetstarted@geogstoreacct.blob.core.windows.net/sampleout/'
TBLPROPERTIES ("skip.header.line.count" = "1");

Create table acorn_category_frequency
 as
select acorn_category,
 count(*) as acorn_categorycount
from geography
group by acorn_category,
ROW FORMAT DELIMITED
 FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'wasb://adfgetstarted@geogstoreacct.blob.core.windows.net/sampleout/';


[/code]

Can someone please help figure out where I'm going wrong in the script?

Thanks

Hello Experts,

I would like to run a query on the attached file, but I don't know what type of information it contains, so I can't write a query against it.

Can someone let me know how to determine the information included in the file?

Regards

Carlton
VANQ_TRIAD_COLLS_20180118
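
A quick way to find out what a mystery file contains is to peek at its first bytes and lines before choosing a query tool. A minimal sketch, assuming the attachment has been saved locally under its original name:

# Local copy of the attached file
path = "VANQ_TRIAD_COLLS_20180118"

# Text formats (CSV, JSON, fixed-width) start with readable characters, while
# binary formats have recognisable magic numbers: zip starts with "PK",
# gzip with 0x1f 0x8b, Parquet with "PAR1".
with open(path, "rb") as f:
    print(f.read(64))

# If it looks like text, print a few lines to spot delimiters and headers
with open(path, "r", errors="replace") as f:
    for _ in range(5):
        print(f.readline().rstrip())

Once the format and delimiter are clear, the file can be loaded into pandas (or an external table) and queried normally.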