La démystification du Big Data

 

Voici un petit article regroupant le WebCast, les Slides et les quelques scripts HiveQL, commandes PowerShell, … présentés lors de notre session La démystification du Big Data aux TechDays 2014.

 

Les Slides

 

Les Retours

Nous avons eu de bons retours suite à notre présentation, les différents commentaires vont nous permettre d’améliorer certains points.

Au total 48 participants ont donné un avis sur la session, la note moyenne est de 3,9/5 ce qui nous place dans le milieu du classement 🙂

 

Les Scripts

 
HiveQL : Script d’agrégation, de transformation et de chargement des Tweets + Analyse des sentiments + Word Cloud

-----------------------------------------------------------------------------------------------------------
-- Author : Romain Casteres
-- Blog : https://pulsweb.azurewebsites.net/
-- Date : 12/02/2014
-- This Script PowerShell is used to create a new HDInsight Services Cluster 
-- TechDays 2014
-----------------------------------------------------------------------------------------------------------

-----------------------------------------------------------------------------------------------------------
-- HIVE - SQL-like language called HiveQL
-----------------------------------------------------------------------------------------------------------
SET hive.exec.dynamic.partition=true;				-- Allow dynamic partition
SET hive.exec.dynamic.partition.mode=nonstrict;	 		-- Allow dynamic partition 

-- Schema on Read / Schema on Write
CREATE EXTERNAL TABLE IF NOT EXISTS KeywordList (PKID int, keywords string, SoftwareID int, PlatformID int, SocialNetwork string, Target string) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' location 'wasb://keywords@apvxdcube.blob.core.windows.net/';

CREATE VIEW Twitterkeyword AS SELECT PKID as IdKeywords, Keywords, SoftwareID, PlatformID, Target FROM Keywordlist WHERE SocialNetwork ='TWITTER';

-- Create Sentiment Dictionary (GZip File)
CREATE EXTERNAL TABLE dictionary (word string, polarity float) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE LOCATION 'wasb://hivescript@apvxdcube.blob.core.windows.net/dictionary';

-- Create Time Zone Map to aggregate the data by countries
CREATE EXTERNAL TABLE time_zone_map (time_zone string,country string,notes string) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE LOCATION 'wasb://hivescript@apvxdcube.blob.core.windows.net/time_zone_map';

--Create the tweets_raw table containing the records from Twitter : 3 October 2013
CREATE EXTERNAL TABLE IF NOT EXISTS tweets_raw ( json_response string ) partitioned by (dt string) stored AS textfile;
ALTER TABLE tweets_raw add if NOT EXISTS PARTITION (dt='2013-10-03') location 'wasb://twitterxbox@apvxdcube.blob.core.windows.net/2013-10-03';

CREATE VIEW tweets_simple AS 
SELECT 
	get_json_object(json_response, '$.id_str') AS id,
	dt,
	substring(get_json_object(json_response, '$.created_at'),12,2) AS hour,
	regexp_replace(regexp_replace(regexp_replace(get_json_object(json_response, '$.text'),'\\t',''),'\\n',''),'\\r','') AS text, --regular expression
	cast (get_json_object(json_response, '$.retweet_count') AS int) AS retweet_count,
	get_json_object(json_response, '$.user.screen_name') AS screen_name,
	cast (get_json_object(json_response, '$.user.followers_count') AS int) AS followers_count,
	get_json_object(json_response, '$.user.verified') AS verified,
	get_json_object(json_response, '$.user.profile_image_url') AS profile_image_url,
	get_json_object(json_response, '$.user.time_zone') AS time_zone
FROM tweets_raw 
WHERE (LENGTH(get_json_object(json_response, '$.text'))>1);

-- Create Clean view and add country
CREATE VIEW tweets_clean AS SELECT t.id, t.dt, t.hour, t.text, t.retweet_count, t.screen_name, t.followers_count, t.verified, t.profile_image_url, CASE WHEN LENGTH(m.country)5 THEN 'positive' WHEN AVG(polarity) tw.word) AND (LENGTH(tw.word) > 3) AND (word RLIKE '[a-zA-Z]') AND NOT(word RLIKE '[0-9]') --Word without numeric character
HAVING (COUNT(tw.id) > 5);

-- Output : External Table ext_tweets
CREATE EXTERNAL TABLE IF NOT EXISTS ext_tweets (Keywords string, SoftwareID int, PlatformID int, Target string, Hour string, dt string, country string, verified boolean, sentiment string, AvgPolarity float, TweetCount bigint)
	ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' 
	STORED AS TEXTFILE LOCATION 'wasb://outputdata@apvxdcube.blob.core.windows.net/Twitter/ext_tweets';
INSERT OVERWRITE TABLE ext_tweets
SELECT 
	k.Keywords, 
	k.SoftwareID, 
	k.PlatformID, 
	k.Target, 
	td.Hour,
	td.dt, 
	td.country,
	td.verified,
	td.sentiment,
	AVG(td.AvgPolarity) AS AvgPolarity,
	COUNT(td.id) AS TweetCount
FROM Twitterkeyword k
	INNER JOIN ext_tweets_details td ON lower(k.Keywords) = lower(td.Keywords)
GROUP BY
	k.Keywords, 
	k.SoftwareID, 
	k.PlatformID, 
	k.Target, 
	td.Hour,
	td.dt, 
	td.country,
	td.verified,
	td.sentiment;

-- UDF : Add Java Functions
-- Execute on cluster : hive -i C:\apps\dist\_MyScript\Twitter_tech2014.hql

-----------------------------------------------------------------------------------------------------------
-- SQOOP : SQL To Hadoop
-----------------------------------------------------------------------------------------------------------
sqoop export 
	-connect "jdbc:sqlserver://dcubesql2.cloudapp.net:1433;username=###;password=###;database=APVX" 
	-table stg_words 
	-export-dir "wasb://outputdata@apvxdcube.blob.core.windows.net/Twitter/ext_keyword_word" 
	-fields-terminated-by "\t" 

sqoop export 
	-connect "jdbc:sqlserver://dcubesql2.cloudapp.net:1433;username=###;password=###;database=APVX" 
	-table stg_tweets 
	-export-dir "wasb://outputdata@apvxdcube.blob.core.windows.net/Twitter/ext_tweets" 
	-fields-terminated-by "\t"

 
PowerShell : Copie du script HiveQL dans un Blob Storage Azure

Set-AzureSubscription -SubscriptionName '###' -CurrentStorageAccount '###'
$context = New-AzureStorageContext -StorageAccountName '###' -StorageAccountKey '###'
Set-AzureStorageBlobContent -Container '###' -File 'C:\Users\Administrateur\Desktop\test.txt' -Context $context -Force 

 
PowerShell : Execution du script HiveQL :

Set-AzureSubscription -SubscriptionName '###'
Write-Host "Executing HiveQL Jobs" -f yellow
$clustername = "###"
$clusterSubscriptionId = (Get-AzureSubscription -Current).SubscriptionId
$clusterAdmin = '###'
$clusterPassword = '###'
$passwd = ConvertTo-SecureString $clusterPassword -AsPlainText -Force
$clusterCredentials = New-Object System.Management.Automation.PSCredential ($clusterAdmin, $passwd)
$ScriptFolder = "wasb://###/test.hql" 
$hiveJobVT = New-AzureHDInsightHiveJobDefinition -JobName "MyJobHQL" -File $ScriptFolder
$startedHiveJob = $hiveJobVT | Start-AzureHDInsightJob -Credential $clusterCredentials -Cluster $clusterName
$startedHiveJob | Wait-AzureHDInsightJob -Credential $clusterCredentials

 
PowerShell : Execution des commandes SQOOP

Set-AzureSubscription -SubscriptionName '###'
Write-Host "Executing Sqoop Jobs" -f yellow
$clustername = "###"
$clusterSubscriptionId = (Get-AzureSubscription -Current).SubscriptionId
$clusterAdmin = '###'
$clusterPassword = '###'
$passwd = ConvertTo-SecureString $clusterPassword -AsPlainText -Force
$clusterCredentials = New-Object System.Management.Automation.PSCredential ($clusterAdmin, $passwd)
$sqoop = New-AzureHDInsightSqoopJobDefinition -Command 'sqoop export -connect "jdbc:sqlserver://###.cloudapp.net:1433;username=###;password=###;database=###" -table test -export-dir "wasb://outputdata@###.blob.core.windows.net/Twitter/ext_keyword_trend" -fields-terminated-by "\t" ' 
$sqoopJob = $sqoop | Start-AzureHDInsightJob -Credential $clusterCredentials -Cluster $clusterName
$sqoopJob | Wait-AzureHDInsightJob -Credential $clusterCredentials
$sqoop = New-AzureHDInsightSqoopJobDefinition -Command 'sqoop export -connect "jdbc:sqlserver://###.cloudapp.net:1433;username=###;password=###;database=###" -table stg_tweets -export-dir "wasb://outputdata@###.blob.core.windows.net/Twitter/ext_tweets" -fields-terminated-by "\t"'
$sqoopJob = $sqoop | Start-AzureHDInsightJob -Credential $clusterCredentials -Cluster $clusterName
$sqoopJob | Wait-AzureHDInsightJob -Credential $clusterCredentials

 
SSIS : Architecture du package

SSIS_HDInsight

SSIS_HDInsight

 

Un article sur l’analyse des Tweets effectuer durant les TechDays 2014 est en cour de rédaction …

 

Comments are closed.

En savoir plus sur Pulsweb - Romain Casteres

Abonnez-vous pour poursuivre la lecture et avoir accès à l’ensemble des archives.

Continue reading