Link to home
Create AccountLog in
Avatar of Bri
BriFlag for United States of America

asked on

Help transforming text string data in order to split into six parts, calculate average of each section and put averages in new columns. Sample data provided

I'm having trouble figuring out how to create a view in SQL SSMS v18.11.1 that does what I need.

Here is my sample data:
DB Fiddle - SQL Database Playground (db-fiddle.com)


What I'm trying to accomplish:

-Get SampleData out of text string

-Divide each number in the SampleData by 10,000

-Split the SampleData into six sections
-Get average of each section and put these averages in columns labeled Section1AVG,

   Section2AVG, etc.

-Only grab the data with a Source = 'Thickness[0,0]

Any help is appreciated. Thanks. 

Avatar of slightwv (䄆 Netminder)
slightwv (䄆 Netminder)

You mention SSMS which I normally associate with SQL Server.

The Fiddle is MySQL.

I'd also like to see your expected results.  For the first one "Get SampleData out of text string" I'm guessing you want the CSV turned into individual rows.

How to do that varies greatly between database products and versions.

What database product and version does this need to work on?
Are you talking about MS SQL or MySQL?

"Split the SampleData into six sections"
What rules should be used for the grouping?
T-SQL version could look like this SQL script:
CREATE TABLE dbo.Thickness (
  PKId int NOT NULL,
  Source varchar(50) NOT NULL,
  DateCreated datetime NOT NULL,
  SampleSize int NOT NULL,
  SampleData varchar(max) NOT NULL
)

INSERT INTO dbo.Thickness (PKId, Source, DateCreated, SampleSize, SampleData) VALUES
  (1, 'Thickness[0,0]', '2023-09-01 06:08:40.050', '11', '121558, 121624, 121587, 121678, 122576, 123454, 123476, 121456, 128742, 129889, 123654'),
  (2, 'Thickness[0,1]', '2023-09-01 06:08:40.050', '11', '121558, 121624, 121587, 121678, 122576, 123454, 123476, 121456, 128742, 129889, 123654'),
  (3, 'Thickness[0,0]', '2023-09-01 06:07:36.050', '12', '125555, 121558, 121624, 121587, 121678, 122576, 123454, 123476, 121456, 128742, 129889, 123654'),
  (4, 'Thickness[0,1]', '2023-09-01 06:07:36.050', '12', '125555, 121558, 121624, 121587, 121678, 122576, 123454, 123476, 121456, 128742, 129889, 123654'),
  (5, 'Thickness[0,0]', '2023-09-01 06:06:25.050', '11', '121558, 121624, 121587, 121678, 122576, 123454, 123476, 121456, 128742, 129889, 123654'),
  (6, 'Thickness[0,1]', '2023-09-01 06:06:25.050', '11', '121558, 121624, 121587, 121678, 122576, 123454, 123476, 121456, 128742, 129889, 123654'),
  (7, 'Thickness[0,0]', '2023-09-01 06:05:55.050', '15', '123456, 123213, 122111, 123458, 121558, 121624, 121587, 121678, 122576, 123454, 123476, 121456, 128742, 129889, 123654'),
  (8, 'Thickness[0,1]', '2023-09-01 06:05:55.050', '15', '123456, 123213, 122111, 123458, 121558, 121624, 121587, 121678, 122576, 123454, 123476, 121456, 128742, 129889, 123654');

SELECT PKId, [1] Section1AVG, [2] Section2AVG, [3] Section3AVG, [4] Section4AVG, [5] Section5AVG, [6] Section6AVG FROM (
SELECT PKId, CAST(value AS INTEGER)/10000. newVal, CAST(NTILE(6) OVER (PARTITION BY PKId ORDER BY value) AS CHAR(1)) grp
  FROM dbo.Thickness t
 CROSS APPLY STRING_SPLIT(t.SampleData, ',') 
 WHERE Source = 'Thickness[0,0]'
) s
PIVOT
(
AVG(newVal)
FOR grp IN ([1],[2],[3],[4],[5],[6])
) AS pt

Open in new window

>>T-SQL version could look like this SQL script

Only if running SQL Server 2016 or above.

That is why I asked for Product and Version.
Avatar of Bri

ASKER

Whoops! Here's a link to one that is using SQL Server.

SQL Fiddle 


Here is an example of my expected results:


SQL Fiddle



My version of SQL is:

Microsoft SQL Server 2017 (RTM) - 14.0.1000.169 but I am creating the view in SSMS v18.11.1 



For the grouping, I just need them to be as equal as possible. There are normally around 500 data points in the SampleData so it won't make too big of a difference if some groups have a couple more data points than others. 


I would double check your expected results.  Not sure how you take numbers in the hundred thousands, divide by 10,000 and get an average in single digits.

Assuming a typo in there somewhere, take a look at the SQL Pavel posted.  It should be close.
OK, the SQL version was just a bet...

More important - The grouping is still unclear... NTILE needs some order and it seems your sample groups are simply taken as they appear in the string. Unfortunately SQL 2017 STRING_SPLIT does not support such sorting and it would need SQL 2022...
Avatar of Bri

ASKER

I need them to be grouped as they appear in the string. 


So if my SampleData has 120 data points in it:

Section1AVG needs the first 20 data points.

Section2Average needs points 21-40

And so on.. 


I would need SQL 2022 to do this?

SOLUTION
Avatar of Scott Pletcher
Scott Pletcher
Flag of United States of America image

Link to home
membership
Create a free account to see this answer
Signing up is free and takes 30 seconds. No credit card required.
See answer
ASKER CERTIFIED SOLUTION
Link to home
membership
Create a free account to see this answer
Signing up is free and takes 30 seconds. No credit card required.
AVG should also work:
;WITH CteThickness AS (
    SELECT *, NTILE(6) OVER(PARTITION BY PKId ORDER BY ItemNumber) AS Ntile
    FROM dbo.Thickness
    CROSS APPLY (
        SELECT CAST(SampleData AS varchar(8000)) AS SampleDataChar
    ) AS ca1
    CROSS APPLY dbo.DelimitedSplit8K(SampleDataChar, ',') AS ds
    WHERE Source = 'Thickness[0,0]' AND
        LEN(CAST(SampleData AS varchar(max))) <= 8000
)
SELECT PkId,
    AVG(CASE WHEN Ntile = 1 THEN CAST(CAST(Item AS decimal(9, 2)) / 10000.0 AS decimal(9, 2)) ELSE NULL END) AS Avg1,
    AVG(CASE WHEN Ntile = 2 THEN CAST(CAST(Item AS decimal(9, 2)) / 10000.0 AS decimal(9, 2)) ELSE NULL END) AS Avg2,
    AVG(CASE WHEN Ntile = 3 THEN CAST(CAST(Item AS decimal(9, 2)) / 10000.0 AS decimal(9, 2)) ELSE NULL END) AS Avg3,
    AVG(CASE WHEN Ntile = 4 THEN CAST(CAST(Item AS decimal(9, 2)) / 10000.0 AS decimal(9, 2)) ELSE NULL END) AS Avg4,
    AVG(CASE WHEN Ntile = 5 THEN CAST(CAST(Item AS decimal(9, 2)) / 10000.0 AS decimal(9, 2)) ELSE NULL END) AS Avg5,
    AVG(CASE WHEN Ntile = 6 THEN CAST(CAST(Item AS decimal(9, 2)) / 10000.0 AS decimal(9, 2)) ELSE NULL END) AS Avg6
FROM CteThickness
GROUP BY PKId

Open in new window

I think only if you use NULL, which can cause warning messages to be issued, I believe.


An avg is nothing more than the SUM / COUNT, as people tend to forget since they are used to using avg.

@Bri: If you need to deal with data lens > 8000, let me know.  If it's only up to 16000, you could just split it into 2 8ks and still use delimitedsplit8k.  If it goes beyond that, it gets trickier.

BTW, does the NTILE work correctly?
The following query does not order the groups correctly (data are the same as in my first post):
SELECT PKId, CAST(value AS INTEGER)/10000. newVal, NTILE(6) OVER (PARTITION BY PKId ORDER BY value) grp
  FROM dbo.Thickness t
 CROSS APPLY STRING_SPLIT(t.SampleData, ',') 
 WHERE Source = 'Thickness[0,0]'

Open in new window

I would say the group 6 should always contain the highest value or two from the string split but the results are showing something else:
User generated image
OK, explained - the value may contain the leading space...
Corrected query:
SELECT PKId, CAST(value AS INTEGER)/10000. newVal, NTILE(6) OVER (PARTITION BY PKId ORDER BY LTRIM(value)) grp
  FROM dbo.Thickness t
 CROSS APPLY STRING_SPLIT(t.SampleData, ',') 
 WHERE Source = 'Thickness[0,0]'

Open in new window

Avatar of Bri

ASKER

Sorry for the late response. I have been sick and am just getting back to working on this.


Thanks for all the help. The solution is working as expected.